Estimating accuracy this way will almost always give you very poor results. The reason is that unsupervised clustering draws its own boundaries, which are very unlikely to match your own categories.
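For reference, the evaluation in question (label each cluster with its majority class, then score every instance against that label) can be sketched in a few lines of plain Python. This is only an illustration: `majority_label_accuracy`, `cluster_ids`, and `labels` are hypothetical names, and it assumes you have already extracted one cluster assignment per instance by some other means.

```python
from collections import Counter, defaultdict

def majority_label_accuracy(cluster_ids, labels):
    """Label each cluster with its majority class, then score all points."""
    by_cluster = defaultdict(Counter)
    for c, y in zip(cluster_ids, labels):
        by_cluster[c][y] += 1
    # Instances that agree with their cluster's majority label count as correct.
    correct = sum(counts.most_common(1)[0][1] for counts in by_cluster.values())
    return correct / len(labels)

# Hypothetical toy data: one cluster id and one manual label per instance.
cluster_ids = [0, 0, 0, 1, 1, 1, 2, 2]
labels = ["a", "a", "b", "b", "b", "a", "a", "b"]
accuracy = majority_label_accuracy(cluster_ids, labels)
```

When cluster boundaries cut across the labelled categories, as in the toy data here, this number stays low even though the clustering itself may be perfectly reasonable.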
If you want to make this work, you can do a few different things:

a) Semi-supervised clustering. Include your target variables during training and leave them out during testing. This helps force the cluster boundaries to align with your definitions.

b) Use lots of clusters as features for a classifier. Consider using 3-20x more clusters than you have categories. Then use proximity or distance to those clusters as features for the classifier.

A quick diagnostic for any technique like this is to see whether a table containing one row per cluster and one column per target category has higher than expected mutual information. If so, then the clusters are encoding your categories in some way.

On Tue, Feb 5, 2013 at 9:57 AM, Aysu Ezen <[email protected]> wrote:

> Hello,
>
> To my understanding from the book, ClusterDumper tool can be used to get
> the top features of each cluster and the centroid vector. However, I have a
> dataset with manual labels on it. I would like to evaluate the clusters
> based on the manual labels to calculate accuracy of clustering (set the
> majority label of each cluster as its class and calculate TP/FP rates). Is
> there a way to understand which instance belongs to which cluster so that I
> can compute the accuracy?
>
> Thanks
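The mutual-information diagnostic mentioned in the reply can be sketched as follows. This is a minimal plain-Python illustration, not anything Mahout provides: the function name `mutual_information` and the inputs `cluster_ids` and `labels` are assumptions, and the cluster-by-category table is built implicitly as a counter of (cluster, label) pairs.

```python
import math
from collections import Counter

def mutual_information(cluster_ids, labels):
    """Mutual information (in bits) between cluster assignment and label."""
    n = len(labels)
    joint = Counter(zip(cluster_ids, labels))  # cells of the cluster-by-category table
    cluster_totals = Counter(cluster_ids)      # row sums
    label_totals = Counter(labels)             # column sums
    mi = 0.0
    for (c, y), count in joint.items():
        p_joint = count / n
        # Probability of this cell if clusters and labels were independent.
        p_indep = (cluster_totals[c] / n) * (label_totals[y] / n)
        mi += p_joint * math.log2(p_joint / p_indep)
    return mi

# Hypothetical toy cases: clusters that match the labels, and clusters that ignore them.
aligned = mutual_information([0, 0, 1, 1], ["a", "a", "b", "b"])
unrelated = mutual_information([0, 0, 1, 1], ["a", "b", "a", "b"])
```

On finite samples the estimate is usually somewhat above zero even for unrelated variables, so one reasonable way to judge "higher than expected" is to compare it against the values obtained after shuffling the labels a number of times.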
