I'm currently using KMeans with canopy, and cosine as the distance measure. The data I'm using has been somewhat curated into categories, so I expected documents to cluster alongside the other documents in their respective categories. Some of them fall nicely into the clusters I'd expect, but others are like the examples I gave in my first mail. I suspect some of the oddities are due to noise in the data, of which there is a considerable amount (e.g. documents with only 2 words).
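
Since Mahout has no built-in "explain", one rough way to see why two documents look close under a cosine measure is to break the similarity down by shared term. The snippet below is only a sketch of that idea in plain Python, not Mahout code: the tokenizer, the raw term-frequency weighting and the names `term_counts` and `explain_cosine` are all assumptions made for illustration, and the real vectors (e.g. TF-IDF weighted, built from the full documents) will give different numbers.

```python
import math
import re
from collections import Counter


def term_counts(text):
    """Lowercase the text and count word tokens (raw term frequency, no IDF)."""
    return Counter(re.findall(r"[a-z0-9']+", text.lower()))


def explain_cosine(text_a, text_b):
    """Return (cosine similarity, per-term contributions) for two texts.

    Each shared term contributes tf_a * tf_b / (|a| * |b|); the contributions
    sum to the cosine similarity. Cosine distance is 1 minus that value.
    """
    a, b = term_counts(text_a), term_counts(text_b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        # An empty document has no direction, so treat it as sharing nothing.
        return 0.0, {}
    contributions = {
        term: a[term] * b[term] / (norm_a * norm_b)
        for term in a.keys() & b.keys()
    }
    return sum(contributions.values()), contributions


if __name__ == "__main__":
    # The two snippets quoted in this thread; the second is truncated,
    # so this only reflects the text that is visible here.
    text_1 = "Iron Butterfly Bassist Lee Dorman Dies at 70"
    text_2 = ("The BEST Memes Of 2012 2012 was a landmark year for memes -- and we "
              "could say that due to the Ikea Monkey alone -- but it's not always easy")
    similarity, contributions = explain_cosine(text_1, text_2)
    print("cosine similarity:", similarity)
    # Largest contributions first: these are the terms pulling the pair together.
    for term, value in sorted(contributions.items(), key=lambda kv: -kv[1]):
        print(term, round(value, 4))
```

With very short documents a single shared token can dominate the result, which is one way the 2-word documents could end up in unexpected clusters.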

On 4 Feb 2013, at 22:28, Jeff Eastman wrote:

> That's a really good question. Mahout does not have an "explain" feature;
> however, you can use the ClusterDumper to print out the cluster centers and
> vectors clustered within each cluster. Output is pretty verbose and, with
> large text vectors being truncated, might not be that useful. You might need
> to write something to do this. Look at the cluster evaluator tests for some
> hints.
>
> Which algorithm were you using?
>
> On 2/4/13 1:57 PM, Chris Harrington wrote:
>> I was wondering if there was an explain feature in Mahout, something that
>> gives the reason why it did what it did, shows the values of the various
>> features it used to evaluate and choose the result, etc.
>>
>> Because I have some wildly different text data being clustered together, for
>> example it clustered these 2 together and I'd like to be able to figure out
>> why
>>
>> Text 1: "Iron Butterfly Bassist Lee Dorman Dies at 70"
>>
>> Text 2: "The BEST Memes Of 2012 2012 was a landmark year for memes -- and we
>> could say that due to the Ikea Monkey alone -- but it's not always easy…"
