Mineral exploration, mining, ore processing and, more generally, earth science research involve the collection of large and complex data sets in which a single sampling point often consists of multiple dimensions. For example, the analysis results for a single field specimen collected during exploration, or for a sample from a mineral processing stream, contain the proportions of multiple elements and minerals. A common data analysis goal is to group samples according to their chemical and mineralogical characteristics, or according to how they behave during metallurgical testing. In this series of articles we briefly look at techniques that enable the grouping and assessment of high-dimensional data.
Examples of high-dimensional data include the results of geochemical and mineral analyses of specimens such as surface soil samples, drill chips from an ore body, and mineral processing stream samples. The results typically contain the proportions of multiple elements and minerals, along with mineral textures and associations. When the goal is to develop a grouping or classification, one option is to investigate the data manually and design a customised grouping protocol. Although this is a reliable method, it is often time consuming, not transferable to other projects, and does not scale to high sample numbers. Mathematical cluster analysis, although not fully automated, provides an alternative to manual methods. Clustering techniques should not be treated as black boxes: they require a basic knowledge of how they work, and their results need to be verified. Visualising clusters and their properties is a good way to verify results; however, data sets containing hundreds of samples with more than three dimensions are difficult to represent graphically. Mathematical data reduction transforms high-dimensional data to one, two, or three dimensions, which can be visualised more easily, thereby simplifying the assessment of cluster analysis results. Methods of dimensionality reduction include multi-dimensional scaling (MDS) and principal component analysis (PCA). Florian Wickelmaier at Aalborg University wrote an introductory paper on MDS and Lindsay Smith at the University of Otago published an excellent tutorial on PCA.
Given a set of pairwise dissimilarities between data points, MDS aims to reconstruct the data in a lower-dimensional space while preserving the dissimilarity information as closely as possible. A commonly used dissimilarity measure is the Euclidean distance between data points (vectors). In the example used in this article each sample is represented by a vector of geochemical data. Figure 1 shows the results of MDS applied to a data set containing four dimensions (chemical assay results for four elements) reduced to two dimensions for visualisation. Chemical assays were performed on screened fractions and therefore partially represent the mineral processing behaviour of the feed material. Together the x and y dimensions, which have arbitrary units, describe the relative positioning of the high-dimensional data. In this example we see that MDS highlights two or three main clusters to the right in the x direction, with the remaining data points scattered almost randomly.
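As a rough illustration of the idea, the following is a minimal sketch of classical (Torgerson) MDS using Euclidean distances. It is not necessarily the MDS variant used to produce Figure 1, and the data are synthetic stand-ins for four-element assay results. The names `classical_mds` and `assays` are hypothetical and chosen only for this example.

```python
import numpy as np

def classical_mds(X, n_components=2):
    """Classical (Torgerson) MDS based on Euclidean distances between rows of X."""
    # Pairwise squared Euclidean distances between samples
    diff = X[:, None, :] - X[None, :, :]
    d2 = np.sum(diff ** 2, axis=-1)
    n = d2.shape[0]
    # Double-centre the squared distances to obtain a Gram (inner product) matrix
    J = np.eye(n) - np.ones((n, n)) / n
    B = -0.5 * J @ d2 @ J
    # Eigen-decomposition; keep the directions with the largest eigenvalues
    eigvals, eigvecs = np.linalg.eigh(B)
    order = np.argsort(eigvals)[::-1][:n_components]
    scale = np.sqrt(np.clip(eigvals[order], 0.0, None))
    return eigvecs[:, order] * scale

# Synthetic data (an assumption, not the article's data set):
# two groups of 15 samples, each with assays for four elements
rng = np.random.default_rng(0)
assays = np.vstack([
    rng.normal(loc=[1, 2, 3, 4], scale=0.3, size=(15, 4)),
    rng.normal(loc=[4, 3, 2, 1], scale=0.3, size=(15, 4)),
])

# Two-dimensional coordinates (arbitrary units) suitable for a scatter plot
coords = classical_mds(assays, n_components=2)
```

Each row of `coords` is one sample positioned so that distances between rows approximate the distances between the original four-dimensional assay vectors. The axes themselves carry no physical meaning.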
Instead of using a set of dissimilarities between individual data points, PCA relies on the covariance matrix of the data cloud to find the directions of maximum variance, which define the so-called “principal components”. The intuitive expectation for PCA is that the results tell you which single dimension or variable is the most important, but this is not the case. There are as many principal components (orthogonal to one another) as there are dimensions in the original data. The original data are projected onto the principal components, so that each principal component is a linear combination of the original variables. The dimensionality reduction step happens when we choose the number of principal components to retain, which, for visualisation, would be two or three. Figure 2 shows the results of PCA applied to the same four-dimensional data set as in Figure 1 (chemical assay results for four elements), reduced to two dimensions for visualisation. Depending on how we choose to interpret the graph, PCA highlights two or three clusters (possibly more) to the left in the x direction.
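The steps described above can be sketched as follows: centre the data, compute the covariance matrix, find its eigenvectors, and project onto the leading ones. This is a minimal illustration under assumptions, not the procedure used to produce Figure 2. The data are synthetic, no variable scaling is applied, and the names `pca` and `assays` are hypothetical.

```python
import numpy as np

def pca(X, n_components=2):
    """PCA via eigen-decomposition of the covariance matrix of X (samples in rows)."""
    # Centre each variable on its mean
    Xc = X - X.mean(axis=0)
    # Covariance matrix of the variables (columns)
    cov = np.cov(Xc, rowvar=False)
    # Eigenvectors are the principal components; eigenvalues are their variances
    eigvals, eigvecs = np.linalg.eigh(cov)
    order = np.argsort(eigvals)[::-1]
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    # Project the centred data onto the retained components
    scores = Xc @ eigvecs[:, :n_components]
    # Fraction of the total variance carried by each component
    explained = eigvals / eigvals.sum()
    return scores, eigvecs, explained

# Synthetic data (an assumption, not the article's data set):
# two groups of 15 samples, each with assays for four elements
rng = np.random.default_rng(0)
assays = np.vstack([
    rng.normal(loc=[1, 2, 3, 4], scale=0.3, size=(15, 4)),
    rng.normal(loc=[4, 3, 2, 1], scale=0.3, size=(15, 4)),
])

scores, components, explained = pca(assays, n_components=2)
```

Each column of `components` holds the weights of one linear combination of the four original variables, which is why a principal component does not correspond to a single "most important" element. The `explained` values indicate how much of the total variance is kept when only the first two or three components are retained.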
MDS and PCA are used with the same objective in mind, which is to reduce high-dimensional data to fewer dimensions for visualisation. The next step is to apply a cluster analysis method and investigate the results to make sure it produces a reasonable and useful classification. In the next article (part 2) we will explore the K-means cluster analysis method and what to look out for when considering its application.
Some Wikipedia References:
MDS: https://en.wikipedia.org/wiki/Multidimensional_scaling
PCA: https://en.wikipedia.org/wiki/Principal_component_analysis