Although such information is abundant in both print and electronic form, it is mostly buried deep in voluminous books or in long threaded conversations. I think it is appropriate to “cluster” all such useful packages, as used in the two popular data mining languages R and Python, in a single thread.
- Hierarchical Clustering Methods
- For hierarchical clustering methods, use the cluster package in R. An example implementation is posted on this thread. Related algorithms include clara and diana (both in the cluster package), clues and ClustOfVar (separate R packages of the same names), and CLARANS.
- BIRCH methods: the R package has been removed from the CRAN repository. You can either use the earlier versions found here or modify the code yourself. For Python, you can use scikit-learn.
- Agglomerative clustering: the R function is agnes, found in the cluster package.
- Expectation-Maximization algorithm: the R package is EMCluster.
- K-modes and k-means: R packages for classical k-means and its variants include kernlab and flexclust.
- Clustering and cluster validation in R: the fpc package; RANN for k-nearest neighbors.
- For clustering mixed-type datasets, the R package is Cluster Ensembles.
- In Python, text processing tasks can be handled by the Natural Language Toolkit (NLTK), a mature, well-documented package for NLP. TextBlob is a simpler alternative, and spaCy is a newer alternative focused on performance. The R package for text processing is tm.
- CRAN Task View: contains a list of packages that can be used for finding groups in data and modeling unobserved cross-sectional heterogeneity. This is one place where you can find both the function name and its description.

Is data cleaning your objective? If your focus is on data cleaning, also known as data munging, then Python is more powerful in my experience because it is backed by regular expressions.
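To illustrate the point about regular expressions, here is a minimal sketch of a cleaning step that uses only Python's built-in `re` module. The function name `clean_record` and the sample strings are hypothetical, chosen only for illustration; real cleaning rules depend on your data.

```python
import re


def clean_record(raw: str) -> str:
    """Normalize one messy free-text field (illustrative rules only)."""
    text = raw.strip().lower()
    text = re.sub(r"<[^>]+>", " ", text)      # drop HTML-like tags
    text = re.sub(r"[^a-z0-9\s]", " ", text)  # replace punctuation with spaces
    text = re.sub(r"\s+", " ", text)          # collapse runs of whitespace
    return text.strip()


# Hypothetical messy inputs
raw_rows = ["  Hello,   <b>World</b>!! ", "R  &  Python\t(2 languages)"]
cleaned = [clean_record(row) for row in raw_rows]
print(cleaned)
```

Each rule is one `re.sub` call, so rules can be added, removed, or reordered independently as the data demands.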
Is data exploration your objective? The pandas package in Python is very powerful and extremely flexible, but it is equally challenging to learn. Similarly, the dplyr package in R can be used for the same purpose.
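As a rough idea of what exploration with pandas looks like, the sketch below filters a small table and then groups and summarizes it, much like a filter, group_by, and summarise chain in dplyr. The column names and values are made up for illustration.

```python
import pandas as pd

# Toy data, invented purely for illustration
df = pd.DataFrame({
    "group": ["a", "b", "a", "b", "a"],
    "value": [12, 30, 8, 50, 40],
})

# Keep rows above a threshold, then compute per-group summaries
summary = (
    df[df["value"] > 10]
    .groupby("group", as_index=False)
    .agg(n_rows=("value", "count"), mean_value=("value", "mean"))
    .sort_values("mean_value", ascending=False)
)
print(summary)
```

The chained style keeps each step on its own line, which makes it easier to comment out one step while exploring.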
Is data visualization your objective? If so, ggplot2 is an excellent package for data visualization in R. Similarly, you can use ggplot for Python for graphics.
And finally, just as the CRAN project is a single repository for R packages, the Anaconda distribution for Python has a similar package management system.
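Returning to the clustering list at the top, the scikit-learn route mentioned for BIRCH can be sketched as follows, alongside agglomerative clustering from the same library. This is only a minimal sketch: the synthetic two-blob data, the `threshold` value, and the choice of Ward linkage are assumptions made for illustration, not recommendations.

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering, Birch

# Synthetic data: two well-separated 2-D blobs (illustrative only)
rng = np.random.default_rng(0)
blob_a = rng.normal(loc=(0.0, 0.0), scale=0.3, size=(50, 2))
blob_b = rng.normal(loc=(5.0, 5.0), scale=0.3, size=(50, 2))
X = np.vstack([blob_a, blob_b])

# BIRCH: builds a tree of subclusters, then groups them into n_clusters
birch_labels = Birch(n_clusters=2, threshold=0.5).fit_predict(X)

# Agglomerative (bottom-up hierarchical) clustering with Ward linkage
agglo_labels = AgglomerativeClustering(n_clusters=2, linkage="ward").fit_predict(X)

print("BIRCH cluster sizes:", np.bincount(birch_labels))
print("Agglomerative cluster sizes:", np.bincount(agglo_labels))
```

On real data you would usually compare several settings and validate the result, which is where packages such as fpc in R come in.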