In part 1 of this introductory series on working with high-dimensional data we determined that cluster analysis is a commonly used method to perform more reliable and scalable classification of large data sets. In the case of high-dimensional data such as geochemical assays or SEM-EDS mineralogy and texture analyses, dimensionality reduction enables visualisation of the raw data and provides a framework to assess the quality of the cluster analysis results. In this article we take a closer look at the K-means clustering method to form a better understanding of its strengths and weaknesses.
Continuing with the example from part 1 (chemical assays for four elements from a mine planning feasibility study), figure 1 shows the reduced 2D representation of the original data obtained with multi-dimensional scaling (MDS). Together the x and y dimensions, in arbitrary units, describe the relative positioning of the high-dimensional data while retaining assay information from the original four elements. In this example MDS highlights two or three main clusters to the right in the x direction, with the remaining data points scattered almost randomly.
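As a rough illustration of how such a 2D representation can be produced, here is a minimal sketch using scikit-learn's `MDS`. The assay values are synthetic stand-ins, and the standardisation step is an assumption on my part rather than a description of how figure 1 was prepared.

```python
import numpy as np
from sklearn.manifold import MDS
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for assays of four elements (rows = samples).
rng = np.random.default_rng(0)
assays = np.vstack([
    rng.normal(loc=[1.0, 5.0, 0.2, 10.0], scale=0.3, size=(40, 4)),
    rng.normal(loc=[3.0, 2.0, 0.8, 4.0], scale=0.3, size=(40, 4)),
])

# Put the four elements on a comparable scale before measuring distances
# (an assumption; other preparations may suit real assay data better).
scaled = StandardScaler().fit_transform(assays)

# Embed in 2D so that pairwise distances are preserved as well as possible.
embedding = MDS(n_components=2, random_state=0).fit_transform(scaled)

# Each row of `embedding` is one sample's (x, y) position in arbitrary units.
print(embedding.shape)
```

The two columns of `embedding` can be plotted directly as a scatter plot; only the relative positions of the points carry meaning.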
The next step is to develop a classification using cluster analysis. Arguably the most popular clustering approach is the K-means method, which involves six basic steps:
1) Assign a user-defined number of random initial centroids to the data set (‘K’ number of centroids/clusters),
2) For each centroid, identify the set of closest data points and assign them as belonging to that cluster (distances can be computed using a range of metrics, but Euclidean distance is probably the most popular),
3) Compute the mean values for each cluster,
4) Assign the newly computed mean as an updated centroid,
5) For each updated centroid, identify the set of closest data points and re-assign them as belonging to that cluster, and
6) Repeat steps 2-5 until convergence, i.e. the cluster centroids no longer change.
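The steps above can be sketched in a few lines of NumPy. This is a bare-bones illustration rather than a production implementation: the function name `kmeans`, its parameters, and the choice to draw the initial centroids from the data points themselves are my own assumptions, and the data are synthetic.

```python
import numpy as np

def nearest_centroid(data, centroids):
    """Index of the closest centroid (Euclidean distance) for every sample."""
    diffs = data[:, None, :] - centroids[None, :, :]
    return np.linalg.norm(diffs, axis=2).argmin(axis=1)

def kmeans(data, k, max_iter=100, seed=0):
    """Plain K-means on an (n_samples, n_features) array."""
    rng = np.random.default_rng(seed)
    # Step 1: pick k distinct data points as the initial centroids.
    centroids = data[rng.choice(len(data), size=k, replace=False)]
    for _ in range(max_iter):
        # Steps 2 and 5: assign each sample to its nearest centroid.
        labels = nearest_centroid(data, centroids)
        # Steps 3 and 4: the mean of each cluster becomes its new centroid.
        # A cluster that ends up empty keeps its previous centroid.
        new_centroids = np.array([
            data[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        # Step 6: stop once the centroids no longer change.
        converged = np.allclose(new_centroids, centroids)
        centroids = new_centroids
        if converged:
            break
    # Final assignment so labels match the returned centroids.
    return nearest_centroid(data, centroids), centroids

# Synthetic stand-in for four-element assays: two well-separated groups.
rng = np.random.default_rng(42)
assays = np.vstack([
    rng.normal(loc=0.0, scale=0.5, size=(50, 4)),
    rng.normal(loc=5.0, scale=0.5, size=(50, 4)),
])
labels, centroids = kmeans(assays, k=2)
```

Note that the function always returns exactly `k` groups, whatever the structure of the data, which is the behaviour discussed next.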
K-means is a fast and effective method and is available in most analytical software packages (e.g. ioGAS for geochemistry). However, its main drawback is that it does not identify true clusters; rather, it groups data points into the number of clusters the user specifies, which makes it more of an unsupervised partitioning method. In many cases this is not a problem, but K-means can force samples into groups where they do not belong or where the user does not want them, which is at least something to be aware of. This interactive graph is a good demonstration of how the method works.
Figure 2 demonstrates some of the drawbacks I mentioned using our working example. Here I requested seven clusters. Zone A indicates three samples that are grouped into cluster 0, while their chemical profiles indicate they should be left separate from cluster 0 and probably from each other. Zones B and C show similar occurrences where samples are incorrectly included in clusters. The main challenge with K-means is choosing the number of clusters and deciding whether or not divisions between samples, or the inclusion of samples in clusters, are correct. The process requires background knowledge of the data, the project objectives, and a bit of test work. In general it is recommended to go through several iterations of clustering, each time using a different number of requested clusters.
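One way to run such iterations is to loop over a range of cluster counts and record a summary for each run. The sketch below assumes scikit-learn's `KMeans` and synthetic data; the range of 2 to 10 clusters is arbitrary and only for illustration.

```python
import numpy as np
from sklearn.cluster import KMeans

# Synthetic stand-in for the assay data: three groups in four dimensions.
rng = np.random.default_rng(1)
data = np.vstack([
    rng.normal(loc=center, scale=0.4, size=(60, 4))
    for center in (0.0, 3.0, 6.0)
])

# Run K-means once per requested number of clusters and keep the inertia
# (sum of squared distances from each sample to its assigned centroid).
inertia_by_k = {}
for k in range(2, 11):
    model = KMeans(n_clusters=k, n_init=10, random_state=0).fit(data)
    inertia_by_k[k] = model.inertia_

for k, inertia in inertia_by_k.items():
    print(f"k={k:2d}  inertia={inertia:.1f}")
```

Inertia tends to keep falling as more clusters are requested, so it cannot pick the number on its own; it is only a starting point for the judgement based on the data and the project objectives described above.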
In this example we developed a two-step clustering procedure to produce the desired result. The first step is a standard K-means analysis requesting 24 clusters (we arrived at the number 24 after several test runs). The large number of clusters places key divisions between samples, but it also divides the data into far too many groups. The second step recombines “sub-clusters” (guided by their chemistry) into their appropriate clusters (figure 3). The resulting seven clusters correctly group samples, and leave the “outlier” samples to the far left on the x-axis as individuals.
I am not familiar with the details and flexibility of clustering algorithms in commercial analytical packages, but the open-source Python libraries allow easy implementation of this two-step procedure. Specifically, they give us greater control over the K-means method and the results it produces.
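A minimal sketch of what a two-step procedure of this kind could look like with scikit-learn follows. It is not the script used for figure 3: the data are synthetic, and hierarchical (Ward) clustering of the sub-cluster centroids stands in for the chemistry-guided recombination, which in practice was a judgement call.

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering, KMeans

# Synthetic stand-in for the assay data (300 samples, four elements).
rng = np.random.default_rng(2)
data = np.vstack([
    rng.normal(loc=center, scale=0.5, size=(50, 4))
    for center in (0.0, 2.5, 5.0, 7.5, 10.0, 12.5)
])

# Step 1: deliberately over-segment the data into 24 sub-clusters.
kmeans = KMeans(n_clusters=24, n_init=10, random_state=0).fit(data)
sub_labels = kmeans.labels_

# Step 2: recombine the 24 sub-clusters into 7 final clusters.
# ASSUMPTION: Ward clustering of the centroids replaces the manual,
# chemistry-guided grouping; sub_to_final[i] is the final cluster
# of sub-cluster i.
merger = AgglomerativeClustering(n_clusters=7, linkage="ward")
sub_to_final = merger.fit_predict(kmeans.cluster_centers_)

# Carry each sample from its sub-cluster to its final cluster.
final_labels = sub_to_final[sub_labels]
```

Because `sub_to_final` is just an array with one entry per sub-cluster, it can be edited by hand after inspecting the chemistry of each sub-cluster, which is where the extra control comes from.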
In part 1 I explained that the chemical assays in this example were performed on screened fractions and therefore partially represent the mineral processing behaviour of the feed material. It follows that the output from the cluster analysis serves as a framework for a geometallurgical domain model. In the next article in this introductory series (part 3) we will explore how the geospatial relationships between the clusters define a behaviour profile and how it assists the operator with critical mine planning decisions.
Feel free to provide some feedback on your experiences with K-means clustering and its implementation in commercial analytical packages.