Hi,
I evaluated Cluto some time ago, and the results I got from it were "reasonable". I'd like to get similar results from Mahout. I'm testing Mahout on a 2,000-document dataset, using k-means to cluster it into 30 clusters.

The problems I'm seeing are:

1) Cluster quality. Right now, when I run kmeans on a test dataset, it always generates 1 or 2 huge clusters, and the rest contain 3 documents at most. Cluto generated clusters of comparable size.

2) I cannot list which documents are assigned to which cluster. How do I get clusterdump to do this?

3) I cannot specify a list of stopwords to remove. Is there a way to do this, or do I have to write a program to pre-process the documents before running seqdirectory?

BTW, I'm running Mahout from the command line at the moment. My Java chops are a little rusty, so I'd prefer to keep using the command-line functionality if I can. But I can install Eclipse if needed.

Thanks in advance!

Nivlem
