Re: Making Mahout cluster results more like Cluto's?

Nivlem Trahm Tue, 20 Sep 2011 16:10:37 -0700


Hi Jeff, thanks for the very informative reply.


Some comments:


[jeff] What distance measure are you using? The default squared Euclidean won't 
do text justice. Try cosine?


Cosine seems to work much better.
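
For anyone following along, here is a rough sketch of why that might be the case. It is plain Python, not Mahout code, and the three term-count vectors are made up for illustration. Two documents with the same term proportions but different lengths end up far apart under squared Euclidean distance, yet essentially identical under cosine distance.

```python
import math

def squared_euclidean(a, b):
    # Sum of squared per-term differences; sensitive to document length.
    return sum((x - y) ** 2 for x, y in zip(a, b))

def cosine_distance(a, b):
    # 1 - cos(angle between a and b); depends only on term proportions.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)

# Hypothetical term-count vectors over a three-term vocabulary.
short_doc = [1, 2, 0]
long_doc = [10, 20, 0]   # same term mix as short_doc, ten times longer
other_doc = [0, 1, 2]    # similar length to short_doc, different term mix

print(squared_euclidean(short_doc, long_doc))   # large: dominated by length
print(squared_euclidean(short_doc, other_doc))  # small, despite different topic
print(cosine_distance(short_doc, long_doc))     # close to 0: same direction
print(cosine_distance(short_doc, other_doc))    # clearly larger
```

Under squared Euclidean distance, short_doc looks closer to other_doc than to long_doc, which is the opposite of what you would want for text.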

[jeff] add a -cp argument to your kmeans invocation and look for clustered 
documents in <output>/clusteredPoints.

The -cp argument doesn't seem to be supported by the kmeans command, at least 
on the version of Mahout that I'm using (v0.5). However, just adding 
--namedVector to the seq2.............. command seems to do the trick.

[jeff] seq2sparse has --maxDFPercent which can be used to remove really high 
frequency terms. No explicit stop word lists though.


I feared as much... I guess for the time being I'll write a quick Perl script 
to pre-process the text before clustering.
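
Something along these lines is what I have in mind, sketched here in Python rather than Perl. The stop word list and the function name are placeholders of my own, not anything that ships with Mahout, and the tokenization is deliberately naive.

```python
import re

# Hypothetical, deliberately tiny stop word list; a real one would be longer.
STOP_WORDS = {"a", "an", "and", "the", "of", "to", "in", "is", "it", "for"}

def strip_stop_words(text, stop_words=STOP_WORDS):
    # Lowercase, split into simple word tokens, and drop the stop words.
    tokens = re.findall(r"[a-z0-9']+", text.lower())
    return " ".join(t for t in tokens if t not in stop_words)

print(strip_stop_words("The quick fox is in the Garden"))
```

The cleaned text would then be written back out and fed to the usual vectorization step before clustering.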

Thanks,

Nivlem
