Hi Jeff,

Thanks for the very informative reply.
Some comments:

[jeff] What distance measure are you using? The default squared Euclidean won't do text justice. Try cosine?

Cosine seems to work much better.

[jeff] add a -cp argument to your kmeans invocation and look for clustered documents in <output>/clusteredPoints.

The -cp argument doesn't seem to be supported by the kmeans command, at least on the version of Mahout that I'm using (v0.5). However, just adding --namedVector to the seq2.............. command seems to do the trick.

[jeff] seq2sparse has --maxDFPercent which can be used to remove really high frequency terms. No explicit stop word lists though.

I feared as much... I guess for the time being I'll write a quick Perl script to pre-process the text before clustering.

Thanks,
Nivlem
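
P.S. For anyone following along, here is a rough sketch of the kind of stop word pre-processing I have in mind, written in Python rather than Perl purely for illustration. The stop word list and the function name strip_stop_words are made up for this example, and a real list would be much longer. The idea is just to lowercase the text, split it into word tokens, and drop anything on the list before the documents go into the vectorization step.

```python
import re

# Hypothetical stop word list, for illustration only; a real one would be much longer.
STOP_WORDS = {"a", "an", "and", "the", "of", "to", "in", "is", "it"}


def strip_stop_words(text, stop_words=STOP_WORDS):
    """Lowercase the text, split it into word tokens, and drop stop words."""
    tokens = re.findall(r"[a-z0-9']+", text.lower())
    return " ".join(t for t in tokens if t not in stop_words)


if __name__ == "__main__":
    sample = "The quick brown fox jumps over the lazy dog in a hurry."
    print(strip_stop_words(sample))
```

Note that this also strips punctuation and case, which may or may not be what you want, depending on how the documents are tokenized later.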
