See inline comments

-----Original Message-----
From: Nivlem Trahm [mailto:[email protected]]
Sent: Friday, September 16, 2011 12:32 PM
To: [email protected]
Subject: Making Mahout cluster results more like Cluto's?

Hi,

I evaluated Cluto some time ago, and the results I was getting from it were "reasonable". I'd like to make Mahout give similar results. I'm testing Mahout on a 2,000-document dataset, using K-means to cluster it into 30 clusters. The problems I'm seeing are:

1) better quality clusters -- right now when I run kmeans on a test dataset, it always generates 1 or 2 huge clusters, and the rest are 3 documents in size tops. Cluto used to generate all clusters of comparable size.

[jeff] What distance measure are you using? The default squared Euclidean won't do text justice. Try cosine?

2) cannot list which documents are clustered in which cluster -- how do I get clusterdump to do this?

[jeff] Add a -cp argument to your kmeans invocation and look for clustered documents in <output>/clusteredPoints. Give that directory to clusterdump and you will see clustered documents. If not, did you add -nv to seq2sparse to output NamedVectors with your document names?

3) cannot specify a list of stopwords to remove -- is there a way to do this, or do I have to write a program to pre-process the script before running seqdirectory?

[jeff] seq2sparse has --maxDFPercent which can be used to remove really high frequency terms. No explicit stop word lists though.

BTW, I'm running Mahout from the command line at the moment. My Java chops are a little rusty, so if I can just keep using the command-line functionality, that would be preferred. But, I can install Eclipse if needed.

Thanks in advance!

Nivlem
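
On the distance measure point in 1), here is a small sketch of why squared Euclidean tends to behave badly on raw term counts while cosine does not. It is plain Python, not Mahout code, and the three toy vectors and their four-word vocabulary are made up for illustration. With squared Euclidean, a long document ends up far from a short document on the same topic simply because its counts are larger; cosine only looks at the angle between the vectors, so document length drops out.

```python
import math

def squared_euclidean(a, b):
    """Sum of squared per-term differences; sensitive to vector length."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def cosine_distance(a, b):
    """1 - cosine similarity; depends only on the angle between vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)

# Hypothetical term counts over the vocabulary [cluster, mahout, recipe, oven]
short_doc = [1, 1, 0, 0]    # short document about clustering
long_doc = [10, 10, 0, 0]   # same topic, ten times as long
other_doc = [0, 0, 1, 1]    # different topic, same length as short_doc

for name, doc in [("long_doc", long_doc), ("other_doc", other_doc)]:
    print(name,
          "squared Euclidean:", squared_euclidean(short_doc, doc),
          "cosine:", round(cosine_distance(short_doc, doc), 6))
```

Under squared Euclidean the short document is closer to the unrelated document than to the long one on its own topic, which is the kind of effect that can produce a few huge clusters and many tiny ones. Under cosine the ordering is reversed. In Mahout the measure is chosen on the kmeans command line with the -dm option, for example org.apache.mahout.common.distance.CosineDistanceMeasure; check the help output of your version to confirm the exact option name.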
