Many thanks.  Actually I delved into the source code and found out that if you 
set the (undocumented) namedVector boolean to true in...

        DictionaryVectorizer.createTermFrequencyVectors(
            tokenizedPath,
            new Path(OUTPUT_HFS_FOLDER), 
            conf, 
            minFrequencyToSupport, // minimum frequency to allow (1 mostly)
            maxNGramSize, // Maximum size of n-gram to allow
            minLLRValue, // Minimum log likelihood ratio
            -1f, 
            true, 
            reduceTasks,
            chunkSize, 
            sequentialAccessOutput, 
            true); // Modified so that named vectors are used - the document id is apparently used as the name


and...

        TFIDFConverter.processTfIdf(
          new Path(OUTPUT_HFS_FOLDER, DictionaryVectorizer.DOCUMENT_VECTOR_OUTPUT_FOLDER),
          new Path(OUTPUT_HFS_FOLDER), 
          conf, 
          chunkSize, 
          minDf,
          maxDFPercent, 
          2, 
          true, 
          sequentialAccessOutput, 
          true, // Modified so named vectors are used 
          reduceTasks);

then the code uses NamedVectors with your document ids as the names.  When 
printing the output at the end you can cast the vectors to NamedVectors and 
retrieve the name (the document id).  Hence you can get the document ids 
against the IntWritable cluster numbers.
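
Roughly, the printing loop at the end looks something like this (a sketch only - 
the paths and the part-file name are placeholders, and it assumes the 0.5-style 
clusteredPoints output of IntWritable keys and WeightedVectorWritable values):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.SequenceFile;
    import org.apache.mahout.clustering.WeightedVectorWritable;
    import org.apache.mahout.math.NamedVector;
    import org.apache.mahout.math.Vector;

    public class PrintClusterMembers {
      public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // placeholder path - the clustered points written by the k-means run
        Path points = new Path("output/clusteredPoints/part-m-00000");
        SequenceFile.Reader reader =
            new SequenceFile.Reader(FileSystem.get(conf), points, conf);

        IntWritable clusterId = new IntWritable();
        WeightedVectorWritable value = new WeightedVectorWritable();
        while (reader.next(clusterId, value)) {
          Vector v = value.getVector();
          if (v instanceof NamedVector) {
            // the name carries the document id assigned when the vectors were built
            System.out.println(clusterId.get() + "\t" + ((NamedVector) v).getName());
          }
        }
        reader.close();
      }
    }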


Many thanks though - I will certainly try out what you suggested too.

R


________________________________
From: Grant Ingersoll <[email protected]>
To: [email protected]; Rob Podolski <[email protected]>
Sent: Thursday, 10 November 2011, 7:20
Subject: Re: NewsKMeansClustering  - the result most people want seems to be 
missing


On Nov 9, 2011, at 3:17 AM, Rob Podolski wrote:

> Hi
> 
> Managed to get the Manning Chap 09 example NewsKMeansClustering  working with 
> my own documents.  However, I thought the main point of this was to cluster 
> the news articles together to get groups of similar content.  
> 
> 
> The example allows you to get the cluster membership in terms of 
> WeightedVectorWritables.  But most of us want to know which actual news 
> articles are in the cluster - not which numeric results are in a cluster 
> (though this is useful for getting the most significant terms in the vector 
> albeit indirectly).
> 
> 
> It seems to me that the only way of achieving this most useful result would 
> be to use NamedVectors from the very outset and assign document identifiers 
> to the name-label in each.  Then presumably these would survive the pipeline 
> through the various calls like
> 
> 
> DictionaryVectorizer.createTermFrequencyVectors;
> TFIDFConverter.processTfIdf;
> etc
> 
> However, I have not seen a way of doing this.  Anyone got any ideas?

You should be able to pass in --namedVector to the seq2sparse command, and 
those named vectors should be preserved throughout the process.  From 
build-asf-email.sh in trunk:
$MAHOUT seq2sparse --input $MAIL_OUT --output $SEQ2SP --norm 2 --weight TFIDF \
  --namedVector --maxDFPercent 90 --minSupport 2 \
  --analyzerName org.apache.mahout.text.MailArchivesClusteringAnalyzer



> 
> 
> The other thing I explored was whether there was a way of correlating the 
> output WeightedVectorWritables with the original documents.  However, there 
> is not even an equals() method on the WeightedVectorWritables to allow it 
> (though that would be a bad solution anyhow).

See the ClusterDumper code.
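
Something along these lines, if you want to do it programmatically (a sketch 
only, assuming the 0.5-style ClusterDumper API and placeholder paths - check 
the constructor and method signatures in your version):

    import org.apache.hadoop.fs.Path;
    import org.apache.mahout.utils.clustering.ClusterDumper;

    public class DumpClusters {
      public static void main(String[] args) throws Exception {
        // placeholder paths: the final k-means clusters dir and the clustered points dir
        Path finalClusters = new Path("output/clusters-10");
        Path clusteredPoints = new Path("output/clusteredPoints");
        ClusterDumper clusterDumper = new ClusterDumper(finalClusters, clusteredPoints);
        // pass a String[] dictionary instead of null to resolve term indices to words
        clusterDumper.printClusters(null);
      }
    }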

> 
> I'm new to Mahout and have to admit I've been struggling even to get this 
> far.  Any help would be gratefully received.
> 
> 
> R

--------------------------------------------
Grant Ingersoll
http://www.lucidimagination.com
