RE: Making Mahout cluster results more like Cluto's?

Jeff Eastman Fri, 16 Sep 2011 12:52:16 -0700

See inline comments below, marked [jeff].

-----Original Message-----
From: Nivlem Trahm [mailto:[email protected]] 
Sent: Friday, September 16, 2011 12:32 PM
To: [email protected]
Subject: Making Mahout cluster results more like Cluto's?

Hi,

I evaluated Cluto some time ago, and the results I was getting from it were 
"reasonable". I'd like to make Mahout give similar results.

I'm testing Mahout on a 2,000-document dataset, using K-means to cluster it 
into 30 clusters. The problems I'm seeing are:

1) cluster quality -- right now when I run kmeans on a test dataset, it always 
generates 1 or 2 huge clusters, and the rest contain 3 documents at most. 
Cluto generated clusters of comparable size.

[jeff] What distance measure are you using? The default, squared Euclidean, 
won't do text justice. Try cosine 
(-dm org.apache.mahout.common.distance.CosineDistanceMeasure)?
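
To illustrate why the distance measure matters for text, here is a minimal 
sketch in plain Python (not Mahout code). The three term-count vectors are 
made up for illustration. It shows that squared Euclidean distance is 
dominated by document length, while cosine distance compares only the 
direction of the vectors.

```python
import math

def squared_euclidean(a, b):
    # Sum of squared per-term differences; grows with vector magnitude.
    return sum((x - y) ** 2 for x, y in zip(a, b))

def cosine_distance(a, b):
    # 1 - cosine similarity; ignores vector magnitude (document length).
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)

# Hypothetical term counts over a three-word vocabulary.
short_doc = [1, 2, 0]
long_doc = [10, 20, 0]   # same term proportions, ten times longer
other_doc = [0, 1, 2]    # different terms, similar length

print(squared_euclidean(short_doc, long_doc))   # large: length dominates
print(squared_euclidean(short_doc, other_doc))  # small despite different terms
print(cosine_distance(short_doc, long_doc))     # about 0: same direction
print(cosine_distance(short_doc, other_doc))    # clearly above 0
```

With squared Euclidean, the short document looks closer to the unrelated 
document than to the longer one with the same term proportions; with cosine 
the ordering is reversed.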

2) cannot list which documents are clustered in which cluster -- how do I get  
clusterdump to do this?

[jeff] Add the -cl (--clustering) argument to your kmeans invocation and look 
for the clustered documents in <output>/clusteredPoints. Pass that directory 
to clusterdump (-p / --pointsDir) and you will see which documents are in each 
cluster. If not, did you add -nv to seq2sparse to output NamedVectors with 
your document names?

3) cannot specify a list of stopwords to remove -- is there a way to do this, 
or do I have to write a program to pre-process the documents before running 
seqdirectory?

[jeff] seq2sparse has --maxDFPercent, which sets the maximum percentage of 
documents a term may appear in and so can be used to remove really 
high-frequency terms. There is no explicit stop-word list option, though.
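
If an explicit stop-word list is still wanted, one option is to pre-process 
the text before running seqdirectory. The sketch below is plain Python, not 
part of Mahout; the function names, the sample documents, and the stop-word 
set are all hypothetical. It also includes a rough illustration of the idea 
behind a maximum document-frequency cutoff. Whether Mahout treats the 
threshold as strict or inclusive is not something this sketch tries to 
reproduce.

```python
import re
from collections import Counter

def tokenize(text):
    # Lowercase and split into simple word tokens.
    return re.findall(r"[a-z0-9']+", text.lower())

def remove_stopwords(tokens, stopwords):
    # Keep only tokens that are not in the stop-word set.
    return [t for t in tokens if t not in stopwords]

def high_df_terms(tokenized_docs, max_df_percent):
    # Terms that occur in more than max_df_percent of the documents.
    df = Counter()
    for tokens in tokenized_docs:
        df.update(set(tokens))  # count each term once per document
    limit = len(tokenized_docs) * max_df_percent / 100.0
    return {term for term, count in df.items() if count > limit}

# Made-up sample documents and stop-word list.
docs = [
    "The cat sat on the mat",
    "The dog chased the cat",
    "A bird sang in the tree",
]
stopwords = {"the", "a", "on", "in"}

tokenized = [tokenize(d) for d in docs]
cleaned = [remove_stopwords(t, stopwords) for t in tokenized]
frequent = high_df_terms(tokenized, 70)

print(cleaned)
print(frequent)
```

In practice the cleaned tokens would be joined back into text and written to 
a new directory, which is then given to seqdirectory in place of the original 
documents.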


BTW, I'm running Mahout from the command line at the moment. My Java chops are 
a little rusty, so if I can just keep using the command-line functionality, 
that would be preferred. But, I can install Eclipse if needed.

Thanks in advance!

Nivlem
