Hi Jeff, Thanks for the very informative reply.

Some comments:


[jeff] What distance measure are you using? The default squared Euclidean won't 
do text justice. Try cosine?


Cosine seems to work much better.
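
For anyone following the thread, here is a toy sketch of why I think cosine suits text better. It is plain Python, not Mahout code, and the three term-count vectors are made up for illustration. Two documents with the same term proportions but different lengths look far apart under squared Euclidean, yet identical under cosine.

```python
import math

def squared_euclidean(a, b):
    # Sum of squared per-term differences; sensitive to document length.
    return sum((x - y) ** 2 for x, y in zip(a, b))

def cosine_distance(a, b):
    # 1 - cos(angle between a and b); depends on direction only, not magnitude.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)

# Hypothetical term-count vectors over a three-word vocabulary.
short_doc = [1, 2, 0]
long_doc = [10, 20, 0]   # same term proportions as short_doc, ten times longer
other_doc = [0, 1, 2]    # different topic, similar length to short_doc

print(squared_euclidean(short_doc, long_doc))   # large, although the topic is the same
print(squared_euclidean(short_doc, other_doc))  # small, although the topic differs
print(cosine_distance(short_doc, long_doc))     # approximately 0
print(cosine_distance(short_doc, other_doc))    # clearly greater than 0
```

With squared Euclidean the short document ends up closer to the unrelated one than to its longer twin, which would explain the poor clusters I was getting.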

[jeff] add a -cp argument to your kmeans invocation and look for clustered 
documents in <output>/clusteredPoints.

The -cp argument doesn't seem to be supported by the kmeans command, at least 
on the version of Mahout that I'm using (v0.5). However, just adding 
--namedVector to the seq2.............. command seems to do the trick.

[jeff] seq2sparse has --maxDFPercent which can be used to remove really high 
frequency terms. No explicit stop word lists though.


I feared as much... I guess for the time being I'll write a quick Perl script 
to pre-process the text before clustering.
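
A minimal sketch of what I have in mind, written here in Python rather than Perl. The stop word list and the function name are placeholders of my own, not anything Mahout provides; a real list would be much longer.

```python
import re

# Hypothetical, deliberately tiny stop word list.
STOP_WORDS = {"a", "an", "and", "the", "of", "to", "in", "is", "it", "for"}

def remove_stop_words(text, stop_words=STOP_WORDS):
    # Lowercase, split into simple word tokens, and drop the stop words.
    tokens = re.findall(r"[a-z0-9']+", text.lower())
    return " ".join(t for t in tokens if t not in stop_words)

print(remove_stop_words("The quick brown fox is in the barn"))
```

The idea would be to run each document through this before creating the sequence files, so the vectorizer never sees the stop words.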

Thanks,

Nivlem
