Hi,

I managed to get the Manning Chapter 9 example NewsKMeansClustering working with my own documents. However, I thought the main point of it was to cluster the news articles to get groups of similar content.
The example lets you get the cluster membership in terms of WeightedVectorWritables. But most of us want to know which actual news articles are in each cluster, not which numeric results are in it (though these are useful, albeit indirectly, for getting the most significant terms in the vector).

It seems to me that the only way of achieving this most useful result would be to use NamedVectors from the very outset and assign a document identifier to the name label of each. Then presumably these would survive the pipeline through the various calls such as DictionaryVectorizer.createTermFrequencyVectors, TFIDFConverter.processTfIdf, etc. However, I have not seen a way of doing this. Has anyone got any ideas?

The other thing I explored was whether there was a way of correlating the output WeightedVectorWritables with the original documents. However, WeightedVectorWritable does not even have an equals() method to allow it (though that would be a bad solution anyhow).

I'm new to Mahout and have to admit I've been struggling even to get this far. Any help would be gratefully received.

R
