Issues:

  1. discriminant analysis (supervised classification) with MCLUST
  2. MCLUST applied to large data sets
  3. MCLUST applied to high-dimensional data

This posting addresses several recent inquiries concerning the use of MCLUST for large or high-dimensional data sets. The MCLUST documentation (README and technical report) has been updated to reflect these issues.

------------------------------------------
Discriminant analysis and large data sets
------------------------------------------

First, note that the function estep() in MCLUST can be used for discriminant analysis (supervised classification). It accepts as input the parameters of a Gaussian mixture (means, covariances, and mixing proportions) and a model specification, and returns conditional probabilities, which can be converted to a classification if desired.

Large data sets can be classified by first clustering a subset of the data and then classifying the remaining observations by discriminant analysis (as was done, e.g., with the MRI brain scan image in Banfield and Raftery, Biometrics 49, 1993). Within MCLUST, the function emclust() can be run on a subset of the data to find the clusters, and the optimal conditional probabilities obtained via summary(). The function mstep() can then be invoked to obtain the associated maximum likelihood parameters. New observations are then classified using estep() with these parameters as input.

emclust() and emclust1() also include a provision for using a subsample of size k of the data in the hierarchical clustering phase before applying EM to the full data set. This strategy is often adequate for data sets that are large, but not extremely so.

---------------------
High dimensional data
---------------------

Models in which the orientation is allowed to vary between clusters (EEV, VEV, and VVV in the current version of MCLUST) have O(d^2) parameters per cluster, where d is the dimension of the data.
For this reason, MCLUST may perform poorly, or at least inefficiently, when these models are applied to high-dimensional data. It may still be possible to analyze such data with MCLUST by restricting attention to models with fewer parameters (the spherical models EI and VI, or the constant-variance model EEE), or by first applying a dimension-reduction technique such as principal components. Note that none of the methods currently in MCLUST can handle data sets in which the number of observations is smaller than the data dimension.
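To make the parameter counts and the dimension-reduction option concrete, here is a minimal NumPy sketch. It is not part of MCLUST; the helper names n_cov_params() and principal_components() are hypothetical illustrations. The counts follow the standard parameterizations: EI has a single spherical variance, VI one variance per cluster, EEE one common d x d covariance, and VVV a full covariance per cluster (hence the O(d^2)-per-cluster growth).

```python
import numpy as np

def n_cov_params(model, d, k):
    """Covariance parameter counts for a few MCLUST model families.

    d is the data dimension, k the number of clusters.
    (Illustrative helper; not an MCLUST function.)
    """
    return {
        "EI": 1,                      # one common spherical variance
        "VI": k,                      # one spherical variance per cluster
        "EEE": d * (d + 1) // 2,      # one common full covariance
        "VVV": k * d * (d + 1) // 2,  # a full covariance per cluster
    }[model]

def principal_components(x, n_comp):
    """Project the rows of x onto their first n_comp principal components.

    (Illustrative helper; not an MCLUST function.)
    """
    centered = x - x.mean(axis=0)
    # SVD of the centered data; the rows of vt are the principal directions,
    # ordered by decreasing singular value (i.e. decreasing variance).
    u, s, vt = np.linalg.svd(centered, full_matrices=False)
    return centered @ vt[:n_comp].T
```

Reducing the data to a few principal components before clustering keeps the VVV-style models tractable, since the per-cluster cost then depends on the reduced dimension rather than the original one.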
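The cluster-then-classify recipe described in the first section (maximum likelihood parameters from mstep(), conditional probabilities for new observations from estep()) can also be sketched outside MCLUST. The following is a minimal NumPy illustration for the unconstrained (VVV-like) model; the helpers mvn_logpdf(), e_step(), and m_step() are hypothetical stand-ins for the corresponding MCLUST computations, not MCLUST code.

```python
import numpy as np

def mvn_logpdf(x, mean, cov):
    """Log density of a multivariate normal at each row of x."""
    d = mean.shape[0]
    L = np.linalg.cholesky(cov)
    diff = x - mean
    sol = np.linalg.solve(L, diff.T)          # L^{-1} (x - mean)^T
    return (-0.5 * d * np.log(2 * np.pi)
            - np.sum(np.log(np.diag(L)))      # -0.5 * log|cov|
            - 0.5 * np.sum(sol ** 2, axis=0)) # -0.5 * Mahalanobis distance

def e_step(x, means, covs, props):
    """Conditional probability that each row of x belongs to each component."""
    logp = np.stack([np.log(p) + mvn_logpdf(x, m, c)
                     for m, c, p in zip(means, covs, props)], axis=1)
    logp -= logp.max(axis=1, keepdims=True)   # stabilize before exponentiating
    z = np.exp(logp)
    return z / z.sum(axis=1, keepdims=True)

def m_step(x, labels, k):
    """ML mixture parameters from a hard classification (unconstrained model)."""
    means, covs, props = [], [], []
    for j in range(k):
        xj = x[labels == j]
        means.append(xj.mean(axis=0))
        covs.append(np.cov(xj, rowvar=False, bias=True))  # ML (divide by n)
        props.append(len(xj) / len(x))
    return np.array(means), np.array(covs), np.array(props)
```

Used this way, m_step() plays the role of mstep() on the clustered subset, and e_step() plays the role of estep() on the remaining observations; taking the argmax of each row of conditional probabilities yields the classification.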
