A so-called “Clustermap” chart serves several purposes. This article describes how to create one and what it is useful for, and takes a detailed look at the chart itself. The chart includes a hierarchical clustering, which we will investigate as well.
Over the last few years, the programming language Python has become popular among people who want to gain insights from data. For some time now, Seaborn has been my package of choice for creating charts (Waskom, 2018).
How it all started:
The way I came across the Clustermap started with a data-related problem. From a current private project, I received a large amount of real estate data which looked like this:
The aim was to display the location and the construction phase of the objects in order to draw conclusions about the number of houses per location and construction phase.
Let us have a look at the steps to create the chart in Python.
- Bring it into a matrix format:
Please note that the Clustermap does not allow NA values.
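The matrix step can be sketched as follows. This is only a minimal example under assumptions: the raw data is not shown here, so the DataFrame `offers` and its column names `state` and `construction_phase` are hypothetical stand-ins for the real estate data.

```python
import pandas as pd

# Hypothetical raw data: one row per house offer.
# The column names "state" and "construction_phase" are assumptions.
offers = pd.DataFrame({
    "state": ["Bayern", "Bayern", "Sachsen", "Sachsen", "Sachsen", "Berlin"],
    "construction_phase": ["projected", "completed", "projected",
                           "projected", "under construction", "completed"],
})

# Count the offers per state (rows) and construction phase (columns).
# crosstab fills combinations that never occur with 0 instead of NA.
grouped = pd.crosstab(offers["state"], offers["construction_phase"])

# Safety check: the Clustermap cannot handle missing values.
assert not grouped.isna().any().any()
print(grouped)
```

If the matrix is built in another way (for example with `pivot_table`), missing combinations may show up as NA and would have to be replaced first, for instance with `fillna(0)`.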
- Create the basic chart
This is easily done with:
import seaborn as sns
h=sns.clustermap(grouped, cmap="Blues", fmt="d", linewidth=.5, method="average", annot=True)
This looks very interesting. It already shows us some information, for example that most of the house offers are located in Nordrhein-Westfalen.
If we want to point out the differences in construction phase by state, we have to apply an intermediate step.
- Apply MinMaxScaler to get differences by row/column level
Let’s say we want to find the state with the highest share of houses in the “projected” stage, regardless of the absolute numbers. We do that by applying min-max scaling, the same transformation a MinMaxScaler performs. The code looks as follows for our chart:
h=sns.clustermap(grouped, cmap="Blues", linewidth=.5, standard_scale=0, method="average", annot=True)
The argument “standard_scale=0” applies the scaler by row; “standard_scale=1” applies it by column. The scaler works with the following formula:
$$x_{\text{scaled}} = \frac{x - x_{\min}}{x_{\max} - x_{\min}}$$
It rescales the values of each row (or column) to the range from 0 to 1 (Keen, 2020). We have done that in our chart and here it is:
We can now say that Saxony has the highest ratio of projected houses.
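To make the row-wise scaling more tangible, here is a small sketch that performs it by hand with pandas. The numbers are made up for illustration, and the code is only meant to mirror what standard_scale=0 does as described above, not to reproduce Seaborn's internals.

```python
import pandas as pd

# Hypothetical counts: rows are states, columns are construction phases.
grouped = pd.DataFrame(
    {
        "projected": [10, 40, 5],
        "under construction": [30, 50, 5],
        "completed": [60, 10, 15],
    },
    index=["Bayern", "Sachsen", "Berlin"],
)

# Min-max scaling per row: subtract the row minimum and divide by the
# row range, so every row ends up between 0 and 1.
# Note: a row with identical values would have a range of 0 and yield NaN.
row_min = grouped.min(axis=1)
row_max = grouped.max(axis=1)
scaled = grouped.sub(row_min, axis=0).div(row_max - row_min, axis=0)

print(scaled)
# The state with the highest scaled value in the "projected" column:
print(scaled["projected"].idxmax())
```

After scaling, a large state no longer dominates the colour scale, because each row is compared only with its own minimum and maximum.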
- What about the clusters?
In the margins you might have noticed several lines. These are part of a so-called “dendrogram” and display the hierarchical clustering (Bock, 2013).
The interesting thing about the dendrogram is that it shows how far apart the clusters are. In the example we see that A and B are much closer to each other than to the other clusters C, D, E and F.
It becomes much clearer if we put a line between the clusters:
There are two types of hierarchical clustering: agglomerative and divisive. In our example we focus on the agglomerative technique, which starts with each point as its own cluster and merges the two closest clusters in each iteration until only one cluster is left. This picture sums it up (Dey, 2020):
Transferring that to our example, we see that some states are very close to others, for example Brandenburg and Sachsen. Why is that? To answer this, we need to understand the underlying clustering methods.
- How is the clustering created? Let’s have a detailed look:
From our last code we recall that the current clustermap was created by:
h=sns.clustermap(grouped, cmap="Blues", fmt="d", linewidth=.5, method="?", annot=True)
The method argument sets the hierarchical clustering method. We can choose from (The SciPy community, 2019):
- Single linkage method
- Complete – Farthest Point Algorithm
- Group Average
- Ward’s method
There are some more, but we will focus on these four methods.
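The agglomerative procedure and the effect of the method can be sketched directly with scipy.cluster.hierarchy.linkage, the function whose documentation is cited above. The five points below are hypothetical and only serve to show the merge order.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage

# Hypothetical data: five observations with two features each.
points = np.array([
    [0.0, 0.0],
    [0.0, 1.0],
    [5.0, 0.0],
    [5.0, 1.5],
    [10.0, 10.0],
])

for method in ("single", "complete", "average", "ward"):
    # Each row of the result describes one merge:
    # [index of cluster a, index of cluster b, merge distance, new cluster size]
    merges = linkage(points, method=method, metric="euclidean")
    print(method)
    print(merges.round(2))
```

With n observations there are always n - 1 merges. The first merge is the same for all four methods here (the two closest points); the methods differ in the distances they assign to the later merges, which is what changes the shape of the dendrogram.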
Single linkage method
Also called the min method and defined by:
$$d(u, v) = \min_{i,j} \operatorname{dist}(u[i], v[j])$$
where u[i] and v[j] are the points of the clusters u and v. This means that the algorithm takes the two closest points of two clusters and uses their distance as the measure of similarity between the clusters. See the picture on the right side (Alvez, 2011). This method works well for non-elliptical shapes as long as there are no outliers or noise. You can see that it separates the clusters in the first and second case but fails to do so in the case with noisy data (Al-Fuqaha, 2014).
Complete – Farthest Point Algorithm
As the name suggests, this is the opposite method, also called max and defined by:
$$d(u, v) = \max_{i,j} \operatorname{dist}(u[i], v[j])$$
It means that the algorithm takes the two farthest points of two clusters and uses their distance as the measure of similarity between the clusters. It does very well with noisy data but risks breaking up large clusters. You can see that in the third panel of the picture on the right. For the other clusters this method does not perform so well (Al-Fuqaha, 2014).
Group Average
The next method is called group average, UPGMA or average linkage and is defined by:
$$d(u, v) = \frac{1}{|u|\,|v|} \sum_{i,j} \operatorname{dist}(u[i], v[j])$$
It takes the distances between all pairs of points from the two clusters and calculates their average. This method is less influenced by outliers and performs well with noisy data. You can see that in the third tile of the picture on the right. The downside is that this method is biased towards globular clusters (Al-Fuqaha, 2014).
Ward’s method
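This method uses the increase in the sum of squared errors (SSE) that results from merging two clusters and is defined by:
$$\Delta(u, v) = \frac{|u|\,|v|}{|u| + |v|} \, \lVert \bar{u} - \bar{v} \rVert^2$$
where \bar{u} and \bar{v} are the centroids of the clusters u and v. It is similar to group average if the distance between points is squared. It is less susceptible to outliers and noise but biased towards globular clusters (similar to group average). The method is the hierarchical counterpart of K-Means clustering (Al-Fuqaha, 2014).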
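To compare the four methods on concrete numbers, the following sketch computes each distance between two tiny clusters with NumPy. The points are hypothetical, and the Ward value is the increase in SSE caused by merging the two clusters, computed from the centroids.

```python
import numpy as np

# Two hypothetical clusters of 2-D points.
cluster_a = np.array([[0.0, 0.0], [0.0, 2.0]])
cluster_b = np.array([[3.0, 0.0], [7.0, 0.0]])

# Euclidean distance between every point of A and every point of B.
diffs = cluster_a[:, None, :] - cluster_b[None, :, :]
pairwise = np.sqrt((diffs ** 2).sum(axis=-1))

single = pairwise.min()     # closest pair of points
complete = pairwise.max()   # farthest pair of points
average = pairwise.mean()   # mean over all pairs

# Ward: increase in the sum of squared errors if A and B are merged.
n_a, n_b = len(cluster_a), len(cluster_b)
centroid_gap = cluster_a.mean(axis=0) - cluster_b.mean(axis=0)
ward_increase = n_a * n_b / (n_a + n_b) * (centroid_gap ** 2).sum()

print(f"single:   {single:.3f}")
print(f"complete: {complete:.3f}")
print(f"average:  {average:.3f}")
print(f"ward SSE increase: {ward_increase:.3f}")
```

The same two clusters are "3 apart" for single linkage but more than "7 apart" for complete linkage, which is why the methods can merge clusters in a different order.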
Let’s return to the example. To better understand the data, we will use the raw values, not the scaled ones. To avoid confusion, we will also use the total values per state. The code is as follows:
h=sns.clustermap(grouped, cmap="Blues", fmt="d", linewidth=.5, method="single", annot=True, col_cluster=False, figsize=(11,9))
In the example we used the single linkage method, which means that the closest points form a cluster. You can see on the chart that this has already happened. For example, Bayern and Niedersachsen form one cluster because they lie close to each other, data-wise of course 😉.
The question is now: Is this the right method to cluster the data? To judge that, we visualize the data with multidimensional scaling (“Multidimensional scaling is a visual representation of distances or dissimilarities between sets of objects” (Statistics How To, 2015)). To achieve this, we first need to put the data into a distance matrix, which looks like this:
Afterwards we are able to create the mentioned visual which looks as follows:
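Both steps can be sketched in a few lines. The article does not state which implementation was used, so this example is an assumption: it builds the distance matrix with SciPy and then applies classical (Torgerson) multidimensional scaling with NumPy. The state values are made up.

```python
import numpy as np
import pandas as pd
from scipy.spatial.distance import pdist, squareform

# Hypothetical house counts per state (values are made up).
data = pd.DataFrame(
    {"projected": [60.0, 50.0, 10.0, 15.0, 200.0],
     "completed": [60.0, 60.0, 20.0, 20.0, 200.0]},
    index=["Bayern", "Niedersachsen", "Brandenburg", "Sachsen",
           "Nordrhein-Westfalen"],
)

# Step 1: square matrix of pairwise Euclidean distances between states.
distances = pd.DataFrame(
    squareform(pdist(data.values, metric="euclidean")),
    index=data.index, columns=data.index,
)

# Step 2: classical MDS. Double-centre the squared distances and use
# the two leading eigenvectors, scaled by the root of their eigenvalues,
# as 2-D coordinates.
n = len(distances)
centering = np.eye(n) - np.ones((n, n)) / n
gram = -0.5 * centering @ (distances.values ** 2) @ centering
eigenvalues, eigenvectors = np.linalg.eigh(gram)
order = np.argsort(eigenvalues)[::-1][:2]
coords = eigenvectors[:, order] * np.sqrt(np.clip(eigenvalues[order], 0, None))
embedding = pd.DataFrame(coords, index=distances.index, columns=["x", "y"])

print(distances.round(1))
print(embedding.round(1))
```

The coordinates in `embedding` can then be drawn as a scatter plot; states that are close in the distance matrix end up close together in the picture.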
Looking at the picture, I would identify three large clusters as follows:
We recall that the first Clustermap with the single method does that quite well. What about the other ones?
Complete:
Average:
Ward:
Overall, the single and average methods perform well here since they separate the clusters pretty well. However, if you want to achieve a different goal, you should investigate the other methods as well.
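If you prefer to compare the methods numerically instead of by eye, one option is to cut each dendrogram into three clusters and look at the resulting labels. The sketch below does that with SciPy's fcluster on hypothetical values; it is not the data from the article.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage

# Hypothetical totals with three visually separated groups.
values = np.array([[10.0], [12.0], [15.0], [50.0], [55.0], [120.0]])

labels = {}
for method in ("single", "complete", "average", "ward"):
    merges = linkage(values, method=method)
    # Cut the tree so that at most three flat clusters remain.
    labels[method] = fcluster(merges, t=3, criterion="maxclust")
    print(f"{method:>8}: {labels[method]}")
```

On clearly separated data like this, all four methods produce the same three groups. Differences only show up when the groups overlap or contain outliers, which is where the properties described above matter.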
In conclusion, this article looked at the Clustermap and how to create it. We also investigated the hierarchical clustering methods that are part of this chart.
Please also have a look at the Jupyter Notebook on nbviewer or on GitHub.
Thank you for reading,
Armin
Bibliography
Al-Fuqaha, A. (2014). Clustering Analysis – Lecture Slides. Kalamazoo: Western Michigan University.
Alvez, P. B. (2011). Inference of a human brain fiber bundle atlas from high angular resolution diffusion imaging. PhD thesis, University of Paris-Sud 11, Graduate School of Sciences and Information Technologies, Telecommunications and Systems, Paris. Retrieved February 1, 2020
Bock, T. (2013, September 20). What is a Dendrogram? Retrieved January 19, 2020, from https://www.displayr.com/what-is-dendrogram/
Dey, D. (2020). ML | Hierarchical clustering (Agglomerative and Divisive clustering). Retrieved February 1, 2020, from https://www.geeksforgeeks.org/ml-hierarchical-clustering-agglomerative-and-divisive-clustering/
Keen, B. A. (2020). Feature Scaling with scikit-learn. Retrieved January 18, 2020, from http://benalexkeen.com/feature-scaling-with-scikit-learn/
Michener, C., & Sokal, R. (1957). A quantitative approach to a problem of classification. Evolution, pp. 11:490–499. Retrieved from https://www.sequentix.de/gelquest/help/upgma_method.htm
Statistics How To. (2015, June 17). Multidimensional Scaling: Definition, Overview, Examples. Retrieved February 6, 2020, from https://www.statisticshowto.datasciencecentral.com/multidimensional-scaling/
The SciPy community. (2019, December 19). scipy.cluster.hierarchy.linkage. Retrieved January 19, 2020, from https://docs.scipy.org/doc/scipy/reference/generated/scipy.cluster.hierarchy.linkage.html
Waskom, M. (2018). Seaborn – Example gallery. Retrieved January 08, 2020, from https://seaborn.pydata.org/examples/index.html