On Nov 9, 2011, at 3:17 AM, Rob Podolski wrote:

> Hi
> 
> Managed to get the Manning Chap 09 example NewsKMeansClustering  working with 
> my own documents.  However, I thought the main point of this was to cluster 
> the news articles together to get groups of similar content.  
> 
> 
> The example allows you to get the cluster membership in terms of 
> WeightedVectorWritables.  But most of us want to know which actual news 
> articles are in the cluster - not which numeric results are in a cluster 
> (though this is useful for getting the most significant terms in the vector 
> albeit indirectly).
> 
> 
> It seems to me that the only way of achieving this most useful result would 
> be to use NamedVectors from the very outset and assign document identifiers 
> to the name label in each.  Then presumably these would survive the pipeline 
> through the various calls like
> 
> 
> DictionaryVectorizer.createTermFrequencyVectors;
> TFIDFConverter.processTfIdf;
> etc
> 
> However, I have not seen a way of doing this.  Anyone got any ideas?

You should be able to pass in --namedVector to the seq2sparse command, and 
those named vectors should be preserved throughout the process.  From 
build-asf-email.sh in trunk:
$MAHOUT seq2sparse --input $MAIL_OUT --output $SEQ2SP --norm 2 --weight TFIDF \
  --namedVector --maxDFPercent 90 --minSupport 2 \
  --analyzerName org.apache.mahout.text.MailArchivesClusteringAnalyzer



> 
> 
> The other thing I explored was whether there was a way of correlating the 
> output WeightedVectorWritables with the original documents.  However, there 
> is not even an equals() method on the WeightedVectorWritables to allow it 
> (though that would be a bad solution anyhow).

See the ClusterDumper code.
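
The lookup itself is simple once each vector carries its document id as a 
name: walk the clustered points and collect the names per cluster id.  The 
sketch below shows only that grouping step.  NamedVec and ClusteredPoint are 
hypothetical stand-ins so that it runs without Mahout on the classpath; in 
Mahout the name would presumably come from NamedVector.getName() on the vector 
wrapped by each WeightedVectorWritable, and the document ids here are made up.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class ClusterMembership {

    // Hypothetical stand-in for a named vector: a document id plus its term weights.
    record NamedVec(String name, double[] values) {}

    // Hypothetical stand-in for one clustered point: the cluster id it was
    // assigned to, plus the named vector itself.
    record ClusteredPoint(int clusterId, NamedVec vector) {}

    // Group document names by cluster id. TreeMap keeps cluster ids sorted;
    // names stay in the order the points were read.
    static Map<Integer, List<String>> documentsByCluster(List<ClusteredPoint> points) {
        Map<Integer, List<String>> byCluster = new TreeMap<>();
        for (ClusteredPoint p : points) {
            byCluster.computeIfAbsent(p.clusterId(), k -> new ArrayList<>())
                     .add(p.vector().name());
        }
        return byCluster;
    }

    public static void main(String[] args) {
        // Made-up sample data: three documents assigned to two clusters.
        List<ClusteredPoint> points = List.of(
            new ClusteredPoint(0, new NamedVec("doc-a", new double[] {0.9, 0.1})),
            new ClusteredPoint(1, new NamedVec("doc-b", new double[] {0.2, 0.8})),
            new ClusteredPoint(0, new NamedVec("doc-c", new double[] {0.7, 0.3})));

        // Print each cluster id followed by the documents assigned to it.
        documentsByCluster(points).forEach(
            (clusterId, docs) -> System.out.println(clusterId + " -> " + docs));
    }
}
```

The point of the sketch is that no equals() on the vectors is needed: the 
name travels with the vector, so membership is read off directly.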

> 
> I'm new to Mahout and have to admit I've been struggling even to get this 
> far.  Any help would be gratefully received.
> 
> 
> R

--------------------------------------------
Grant Ingersoll
http://www.lucidimagination.com


