I spoke off-line to Dan and he confirmed your inference. Color was just there for visual esthetics.
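
For the single-plot comparison discussed below (Ted's plot.ecdf suggestion), here is a minimal sketch of what an empirical CDF computes, written in plain Python rather than R. The function name `ecdf` and the distance values are hypothetical placeholders for illustration; they are not taken from Dan's CSV.

```python
from bisect import bisect_right


def ecdf(values):
    """Return a function F where F(x) is the fraction of values <= x."""
    ordered = sorted(values)
    n = len(ordered)
    if n == 0:
        raise ValueError("ecdf needs at least one value")

    def F(x):
        # bisect_right counts how many sorted values are <= x.
        return bisect_right(ordered, x) / n

    return F


# Hypothetical placeholder distances, NOT real results from the CSV.
distances = {
    "km": [0.9, 1.1, 1.4, 1.6, 2.0],
    "bskm": [0.7, 0.8, 1.0, 1.3, 1.5],
}

# One curve per algorithm, all evaluated on the same grid so they
# can be drawn on a single plot.
curves = {name: ecdf(vals) for name, vals in distances.items()}
for name, F in curves.items():
    print(name, [F(x) for x in (0.5, 1.0, 1.5, 2.0)])
```

A curve that rises further to the left means more of the distances are small, which is what makes the curves easy to compare on one set of axes.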

On Sun, Feb 24, 2013 at 6:18 AM, David Murgatroyd <[email protected]> wrote:

> > What does color mean here? What about width of the box?
>
> FWIW, I infer color is solely for visual distinction -- rotating through
> orange, red, yellow, pink from left to right. I infer width is proportional
> to count of items in each cluster, though apparently not linearly.
>
> I agree that a single plot comparing the algorithms is important since the
> purpose of the plot is to compare the algorithms rather than better
> understand the data on which they've been run. I haven't thought of a good
> way to do that while still having a cluster-by-cluster visual element.
>
> On Fri, Feb 22, 2013 at 12:47 PM, Ted Dunning <[email protected]> wrote:
>
> > What does color mean here?
> >
> > What about width of the box?
> >
> > When you say median or mean of all cluster distances, do you mean across
> > that single run?
> >
> > I think that this plot is fine as it is except that it needs a legend
> > that explains all of these issues. My general rule of thumb is that most
> > figures should have what I call a "Kipling caption". See the caption of
> > the first image here: http://www.boop.org/jan/justso/butter.htm to see
> > what I mean by this. Imagine that there is a very mathematically inclined
> > 4 year old who is looking at your diagram and quizzing you about every
> > part. Answer all their questions in the caption and you have a Kipling
> > caption.
> >
> > For comparing different runs of the clustering or different algorithms, I
> > think that a cumulative distribution plot (using plot.ecdf) with all of
> > the different algorithms on one plot would be the best comparison tool.
> >
> > On Fri, Feb 22, 2013 at 8:33 AM, Dan Filimon <[email protected]> wrote:
> >
> > > As most of the regulars know, I'm working with Ted Dunning on a new
> > > clustering framework for Mahout that should land in 0.8.
> > >
> > > Part of my work is comparing the clustering quality of the new code
> > > with the existing Mahout implementation.
> > >
> > > I compiled a CSV of the quality data [1]. I ran 5 runs of the
> > > clustering on the 20 newsgroups data set comparing Mahout KMeans (km),
> > > Ball KMeans (bkm), Streaming KMeans (skm) and Streaming KMeans
> > > followed by Ball KMeans (bskm).
> > >
> > > I'm looking at now making some appealing plots for the data. For
> > > instance, I think want to make box plots of individual clustering
> > > runs. Here's an example [2] of what a clustering looks like for one
> > > run of Mahout's standard k-means.
> > >
> > > There's a box for each cluster, the mean distance is the thick line,
> > > the limits are the 1st and 3rd quartiles and the whiskers are the min
> > > and max distances.
> > > The blue horizontal line is the mean of all average cluster distances.
> > > The green horizontal line is the median of all average cluster
> > > distances.
> > >
> > > I intend on making similar plots for the other runs and then
> > > aggregating the means of the runs into box plots for the different
> > > classes of k-means.
> > > The main result being that streaming k-means + ball k-means (as done
> > > in the MR) gives a high quality clustering.
> > >
> > > How do you feel about this plot? Is it too dense? Too colorful? Should
> > > I not draw the median any more?
> > > What are some other good ways of plotting the quality given the data
> > > set?
> > >
> > > Thanks!
> > >
> > > [1]
> > > https://github.com/dfilimon/mahout/blob/skm/examples/src/main/resources/kmeans-comparison-nospace.csv
> > > [2]
> > > http://swarm.cs.pub.ro/~dfilimon/skm-mahout/Mahout%20KMeans%20Run%201.pdf
