Making Mahout cluster results more like Cluto's?

Nivlem Trahm Fri, 16 Sep 2011 12:32:52 -0700

Hi,


I evaluated Cluto some time ago, and the results I was getting from it were 
"reasonable". I'd like to make Mahout give similar results.

I'm testing Mahout on a 2,000-document dataset, using K-means to cluster it 
into 30 clusters. The problems I'm seeing are:

1) poor cluster quality -- right now when I run kmeans on a test dataset, it 
always generates 1 or 2 huge clusters, and the rest contain 3 documents at 
most. Cluto generated clusters that were all of comparable size.

2) cannot list which documents are clustered in which cluster -- how do I get  
clusterdump to do this?

3) cannot specify a list of stopwords to remove -- is there a way to do this, 
or do I have to write a program to pre-process the documents before running 
seqdirectory?
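
To make (3) concrete, here is a minimal sketch of the kind of pre-processing 
program meant above, written in Python rather than Java. Everything in it is an 
assumption for illustration: the stopword list is a tiny hypothetical one, the 
function names are made up, and it assumes the documents are plain UTF-8 .txt 
files in a single directory. It writes filtered copies to a second directory, 
which could then be handed to seqdirectory in place of the originals.

```python
import re
from pathlib import Path

# Hypothetical stopword list; a real one would be loaded from a file.
STOPWORDS = {"the", "a", "an", "and", "of", "to", "in", "is"}

# Treat runs of letters, digits and apostrophes as tokens.
TOKEN = re.compile(r"[A-Za-z0-9']+")


def strip_stopwords(text, stopwords=STOPWORDS):
    """Return the tokens of text, minus stopwords, joined by single spaces."""
    kept = [t for t in TOKEN.findall(text) if t.lower() not in stopwords]
    return " ".join(kept)


def preprocess_dir(src, dst, stopwords=STOPWORDS):
    """Write a stopword-filtered copy of every *.txt file in src into dst."""
    src, dst = Path(src), Path(dst)
    dst.mkdir(parents=True, exist_ok=True)
    count = 0
    for path in sorted(src.glob("*.txt")):
        cleaned = strip_stopwords(path.read_text(encoding="utf-8"), stopwords)
        (dst / path.name).write_text(cleaned, encoding="utf-8")
        count += 1
    return count  # number of files processed
```

Note that this also drops punctuation and collapses whitespace, which should 
not matter for bag-of-words clustering but does change the text.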

BTW, I'm running Mahout from the command line at the moment. My Java chops are 
a little rusty, so if I can just keep using the command-line functionality, 
that would be preferred. But, I can install Eclipse if needed.

Thanks in advance!

Nivlem
