I'm currently using KMeans with canopy and cosine as the distance measure. The 
data I'm using has been somewhat curated into categories, so I expected the 
documents to cluster alongside the other documents in their respective 
categories. Some of them fall nicely into the clusters I'd expect, but others 
are like the examples I gave in the first mail. I suspect some of the oddities 
are due to noise in the data, of which there is a considerable amount (e.g. 
documents with only 2 words).
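
To make the "why did these end up together" question concrete, here is a 
minimal sketch of a per-term breakdown of cosine similarity. It is not a 
Mahout API: `tf_vector` and `cosine_contributions` are hypothetical helpers, 
and the whitespace tokenizer and raw term-frequency weights are assumptions 
(the real vectors may well be TF-IDF weighted and tokenized differently). 

```python
import math
from collections import Counter


def tf_vector(text):
    """Hypothetical tokenizer: lowercase, split on whitespace, count terms."""
    return Counter(text.lower().split())


def cosine_contributions(a, b):
    """Return (cosine similarity, [(term, contribution), ...] sorted descending).

    Each shared term contributes a[t] * b[t] / (|a| * |b|); the contributions
    sum to the cosine similarity of the two vectors.
    """
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0, []
    shared = a.keys() & b.keys()
    contributions = {t: a[t] * b[t] / (norm_a * norm_b) for t in shared}
    ranked = sorted(contributions.items(), key=lambda kv: kv[1], reverse=True)
    return math.fsum(contributions.values()), ranked


if __name__ == "__main__":
    # The two texts from the first mail (the second one truncated here).
    text1 = tf_vector("Iron Butterfly Bassist Lee Dorman Dies at 70")
    text2 = tf_vector("The BEST Memes Of 2012 2012 was a landmark year for memes")
    similarity, terms = cosine_contributions(text1, text2)
    print(similarity, terms)
```

The same function could be pointed at a document vector and the centroid of 
the cluster it landed in (for instance, values copied from ClusterDumper 
output) to see which terms dominate the match. A very short document has only 
a couple of non-zero terms, so a single shared term can account for most of 
its similarity to a centroid.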

 
On 4 Feb 2013, at 22:28, Jeff Eastman wrote:

> That's a really good question. Mahout does not have an "explain" feature; 
> however, you can use the ClusterDumper to print out the cluster centers and 
> vectors clustered within each cluster. Output is pretty verbose and, with 
> large text vectors being truncated, might not be that useful. You might need 
> to write something to do this. Look at the cluster evaluator tests for some 
> hints.
> 
> Which algorithm were you using?
> 
> On 2/4/13 1:57 PM, Chris Harrington wrote:
>> I was wondering if there was an explain feature in Mahout, something that 
>> gives the reason why it did what it did, shows the values of the various 
>> features it used to evaluate and choose the result, etc.
>> 
>> I ask because I have some wildly different text data being clustered 
>> together. For example, it clustered these 2 together, and I'd like to be 
>> able to figure out why:
>> 
>> Text 1: "Iron Butterfly Bassist Lee Dorman Dies at 70"
>> 
>> Text 2: "The BEST Memes Of 2012 2012 was a landmark year for memes -- and we 
>> could say that due to the Ikea Monkey alone -- but it's not always easy…"
>> 
> 
