NewsKMeansClustering - the result most people want seems to be missing

Rob Podolski Wed, 09 Nov 2011 03:18:18 -0800

Hi

I managed to get the Manning Chapter 9 example NewsKMeansClustering working 
with my own documents.  However, I thought the main point of the example was to 
cluster the news articles into groups of similar content.



The example allows you to get the cluster membership in terms of 
WeightedVectorWritables.  But most of us want to know which actual news 
articles are in the cluster - not which numeric results are in a cluster 
(though this is useful for getting the most significant terms in the vector 
albeit indirectly).


It seems to me that the only way of achieving this most useful result would be 
to use NamedVectors from the very outset and assign a document identifier to the 
name label in each.  Then presumably these would survive the pipeline through 
the various calls like


DictionaryVectorizer.createTermFrequencyVectors;
TFIDFConverter.processTfIdf;
etc

However, I have not seen a way of doing this.  Anyone got any ideas?
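
To make concrete what I am after, here is a rough sketch in plain Java.  It 
does not use Mahout at all: LabeledVector, groupByCluster, the document names 
and the fixed centroids are all made up for illustration, with LabeledVector 
standing in for the role I imagine a NamedVector playing.  The point is only 
that if each vector carries its document identifier, the cluster membership 
can be reported by document rather than by numbers.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class NamedVectorSketch {

    // Hypothetical stand-in for a vector that carries a document identifier.
    // This is NOT Mahout's NamedVector class, just an illustration of the idea.
    static final class LabeledVector {
        final String docId;
        final double[] values;

        LabeledVector(String docId, double[] values) {
            this.docId = docId;
            this.values = values;
        }
    }

    // Squared Euclidean distance between two vectors of equal length.
    static double squaredDistance(double[] a, double[] b) {
        double sum = 0.0;
        for (int i = 0; i < a.length; i++) {
            double d = a[i] - b[i];
            sum += d * d;
        }
        return sum;
    }

    // Assigns each labeled vector to its nearest centroid and returns
    // cluster index -> document identifiers, so the result is readable
    // in terms of documents rather than raw numbers.
    static Map<Integer, List<String>> groupByCluster(List<LabeledVector> docs,
                                                     double[][] centroids) {
        Map<Integer, List<String>> clusters = new TreeMap<>();
        for (LabeledVector doc : docs) {
            int best = 0;
            for (int c = 1; c < centroids.length; c++) {
                if (squaredDistance(doc.values, centroids[c])
                        < squaredDistance(doc.values, centroids[best])) {
                    best = c;
                }
            }
            clusters.computeIfAbsent(best, k -> new ArrayList<>()).add(doc.docId);
        }
        return clusters;
    }

    public static void main(String[] args) {
        // Made-up documents and weights, assumed to come out of vectorization.
        List<LabeledVector> docs = Arrays.asList(
            new LabeledVector("article-1.txt", new double[] {1.0, 0.0}),
            new LabeledVector("article-2.txt", new double[] {0.9, 0.1}),
            new LabeledVector("article-3.txt", new double[] {0.0, 1.0}));

        // Fixed centroids standing in for whatever k-means would produce.
        double[][] centroids = { {1.0, 0.0}, {0.0, 1.0} };

        System.out.println(groupByCluster(docs, centroids));
    }
}
```

What I cannot see is how to get the equivalent of the docId label carried 
through the real vectorization and clustering steps.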


The other thing I explored was whether there was a way of correlating the 
output WeightedVectorWritables with the original documents.  However, there is 
not even an equals() method on the WeightedVectorWritables to allow it (though 
that would be a bad solution anyhow).

I'm new to Mahout and have to admit I've been struggling even to get this far.  
Any help would be gratefully received.


R
