On Nov 9, 2011, at 3:17 AM, Rob Podolski wrote:

> Hi
>
> Managed to get the Manning Chap 09 example NewsKMeansClustering working with
> my own documents. However, I thought the main point of this was to cluster
> the news articles together to get groups of similar content.
>
> The example allows you to get the cluster membership in terms of
> WeightedVectorWritables. But most of us want to know which actual news
> articles are in the cluster - not which numeric results are in a cluster
> (though this is useful for getting the most significant terms in the vector
> albeit indirectly).
>
> It seems to me that the only way of achieving this most useful result would
> be to use NamedVectors from the very outset and assign document identifiers
> to the name-label in each. Then presumably these would survive the pipeline
> through the various calls like
>
> DictionaryVectorizer.createTermFrequencyVectors;
> TFIDFConverter.processTfIdf;
> etc
>
> However, I have not seen a way of doing this. Anyone got any ideas?

You should be able to pass --namedVector to the seq2sparse command, and the named vectors should then be preserved throughout the process. From build-asf-email.sh in trunk:

$MAHOUT seq2sparse --input $MAIL_OUT --output $SEQ2SP --norm 2 --weight TFIDF --namedVector --maxDFPercent 90 --minSupport 2 --analyzerName org.apache.mahout.text.MailArchivesClusteringAnalyzer

> The other thing I explored was whether there was a way of correlating the
> output WeightedVectorWritables with the original documents. However, there
> is not even an equals() method on the WeightedVectorWritables to allow it
> (though that would be a bad solution anyhow).

See the ClusterDumper code.

> I'm new to Mahout and have to admit I've been struggling even to get this
> far. Any help would be gratefully received.
>
> R

--------------------------------------------
Grant Ingersoll
http://www.lucidimagination.com
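
Once the vectors carry names, listing the articles in each cluster comes down to grouping those names by cluster ID. Below is a minimal sketch of that grouping step in plain Java, with no Mahout dependency. The `Assignment` class, the `groupByCluster` helper, and the sample document names are all hypothetical stand-ins for the (cluster ID, vector name) pairs one would read back from the clustering output; this is an illustration of the idea, not Mahout's own code.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class ClusterMembership {

    /** Hypothetical stand-in for one clustered point: its cluster ID and the name its vector carries. */
    static final class Assignment {
        final int clusterId;
        final String documentName;

        Assignment(int clusterId, String documentName) {
            this.clusterId = clusterId;
            this.documentName = documentName;
        }
    }

    /** Groups document names by cluster ID, keeping the input order within each cluster. */
    static Map<Integer, List<String>> groupByCluster(List<Assignment> assignments) {
        // TreeMap so that clusters are listed in ascending ID order.
        Map<Integer, List<String>> clusters = new TreeMap<>();
        for (Assignment a : assignments) {
            clusters.computeIfAbsent(a.clusterId, id -> new ArrayList<>()).add(a.documentName);
        }
        return clusters;
    }

    public static void main(String[] args) {
        // Made-up sample data; real pairs would come from the clustering output.
        List<Assignment> sample = new ArrayList<>();
        sample.add(new Assignment(0, "/news/doc-001"));
        sample.add(new Assignment(1, "/news/doc-002"));
        sample.add(new Assignment(0, "/news/doc-003"));

        groupByCluster(sample).forEach(
                (id, docs) -> System.out.println("cluster " + id + ": " + docs));
    }
}
```

The design relies on the document identifier travelling with the vector as its name, so no lookup against the original documents (and no equals() on the vectors) is needed.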
