Re: Plotting cluster quality

Ted Dunning Sun, 24 Feb 2013 09:12:23 -0800

I spoke off-line to Dan and he confirmed your inference.  Color was just
there for visual esthetics.


On Sun, Feb 24, 2013 at 6:18 AM, David Murgatroyd <[email protected]> wrote:

> >What does color mean here? What about width of the box?
> FWIW, I infer color is solely for visual distinction -- rotating through
> orange, red, yellow, pink from left to right. I infer width is proportional
> to count of items in each cluster, though apparently not linearly.
>
> I agree that a single plot comparing the algorithms is important since the
> purpose of the plot is to compare the algorithms rather than better
> understand the data on which they've been run. I haven't thought of a good
> way to do that while still having a cluster-by-cluster visual element.
>
> On Fri, Feb 22, 2013 at 12:47 PM, Ted Dunning <[email protected]> wrote:
>
> > What does color mean here?
> >
> > What about width of the box?
> >
> > When you say median or mean of all cluster distances, do you mean across
> > that single run?
> >
> > I think that this plot is fine as it is except that it needs a legend
> > that explains all of these issues.  My general rule of thumb is that
> > most figures should have what I call a "Kipling caption".  See the
> > caption of the first image here:
> > http://www.boop.org/jan/justso/butter.htm to see what I mean by this.
> > Imagine that there is a very mathematically inclined 4 year old who is
> > looking at your diagram and quizzing you about every part.  Answer all
> > their questions in the caption and you have a Kipling caption.
> >
> > For comparing different runs of the clustering or different algorithms, I
> > think that a cumulative distribution plot (using plot.ecdf) with all of
> > the different algorithms on one plot would be the best comparison tool.
> >
> > On Fri, Feb 22, 2013 at 8:33 AM, Dan Filimon <[email protected]> wrote:
> >
> > > As most of the regulars know, I'm working with Ted Dunning on a new
> > > clustering framework for Mahout that should land in 0.8.
> > >
> > > Part of my work is comparing the clustering quality of the new code
> > > with the existing Mahout implementation.
> > >
> > > I compiled a CSV of the quality data [1]. I ran 5 runs of the
> > > clustering on the 20 newsgroups data set comparing Mahout KMeans (km),
> > > Ball KMeans (bkm), Streaming KMeans (skm) and Streaming KMeans
> > > followed by Ball KMeans (bskm).
> > >
> > > I'm now looking at making some appealing plots for the data. For
> > > instance, I think I want to make box plots of individual clustering
> > > runs. Here's an example [2] of what a clustering looks like for one
> > > run of Mahout's standard k-means.
> > >
> > > There's a box for each cluster, the mean distance is the thick line,
> > > the limits are the 1st and 3rd quartiles and the whiskers are the min
> > > and max distances.
> > > The blue horizontal line is the mean of all average cluster distances.
> > > The green horizontal line is the median of all average cluster
> > > distances.
> > >
> > > I intend on making similar plots for the other runs and then
> > > aggregating the means of the runs into box plots for the different
> > > classes of k-means.
> > > The main result is that streaming k-means + ball k-means (as done
> > > in the MR) gives a high quality clustering.
> > >
> > > How do you feel about this plot? Is it too dense? Too colorful? Should
> > > I not draw the median any more?
> > > What are some other good ways of plotting the quality given the
> > > data set?
> > >
> > > Thanks!
> > >
> > > [1] https://github.com/dfilimon/mahout/blob/skm/examples/src/main/resources/kmeans-comparison-nospace.csv
> > > [2] http://swarm.cs.pub.ro/~dfilimon/skm-mahout/Mahout%20KMeans%20Run%201.pdf
> > >
> >
>
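
A note on the cumulative distribution plot suggested earlier in the thread:
in R, plot.ecdf draws the curve directly. The sketch below shows what such a
curve contains, using only the Python standard library. The function names
(ecdf, ecdf_steps) and the distance values are hypothetical placeholders for
illustration; they are not taken from the CSV in [1], and only the algorithm
labels (km, bskm) come from the thread.

```python
from bisect import bisect_right


def ecdf(values):
    """Return a function F where F(x) is the fraction of values <= x."""
    ordered = sorted(values)
    n = len(ordered)
    if n == 0:
        raise ValueError("need at least one value")

    def F(x):
        # bisect_right counts how many sorted values are <= x
        return bisect_right(ordered, x) / n

    return F


def ecdf_steps(values):
    """Return (x, F(x)) pairs, one per distinct value, for a step plot."""
    ordered = sorted(values)
    n = len(ordered)
    steps = []
    for i, x in enumerate(ordered, start=1):
        if steps and steps[-1][0] == x:
            # Repeated value: keep a single point at the highest fraction
            steps[-1] = (x, i / n)
        else:
            steps.append((x, i / n))
    return steps


# Placeholder distances, invented for illustration only (NOT real results).
runs = {
    "km": [0.9, 1.1, 1.4, 1.6],
    "bskm": [0.5, 0.7, 0.8, 1.2],
}

# One curve per algorithm; drawing them on shared axes gives the comparison.
curves = {name: ecdf_steps(distances) for name, distances in runs.items()}
for name, points in curves.items():
    print(name, points)
```

Overlaying one such curve per algorithm on the same axes gives the single
comparison plot: a curve that rises further to the left has more of its
distances at small values. If matplotlib is available, each curve could be
drawn with plt.step(xs, ys, where="post", label=name).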
