-----Original Message-----
From: Pere Ferrera [mailto:[email protected]] 
Sent: Wednesday, January 05, 2011 4:37 AM
To: [email protected]
Subject: two-tiered clustering

Hello all,

I have a clustering problem that involves clustering a huge data set.
[jeff] I'm having some similar problems with my own data so let's collaborate 
to see if we can improve things.

My knowledge of clustering techniques is very limited, so I just tried a few
algorithms. Because I can regression-test them against a known-good clustered
set, I found a configuration that gives the best accuracy with no false
positives (MeanShift Canopy, Tanimoto distance). However, because the number
of output clusters is roughly linear in the number of documents to cluster,
I had problems scaling out with MeanShift - because of the single Reducer
step. I have read on the mailing list that MeanShift is a good fit when the
number of clusters grows logarithmically, not linearly, with input size.
[jeff] I keep being surprised that MeanShift Canopy performs as well as it 
does. I'm pretty sure there is no other implementation like it anywhere. I 
think one can increase the number of reducers early in the iterations to 
overcome some of the scalability limitations you cite, but the canopies gather 
a list of all their contained pointIds, and that will run into another scalability 
issue. There are ways, using logging, to eliminate this memory growth, but I feel 
it is likely just whack-a-mole.
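
For reference, here is a small stand-alone Java sketch of the Tanimoto
(extended Jaccard) distance mentioned above, on dense vectors. It is only
illustrative of the usual definition, not a copy of Mahout's
TanimotoDistanceMeasure:

// Tanimoto (extended Jaccard) distance on dense, non-negative vectors.
// For such vectors the result stays in [0, 1]: 0 for identical vectors.
public class TanimotoDistanceSketch {

  // distance = 1 - (a.b) / (|a|^2 + |b|^2 - a.b)
  public static double distance(double[] a, double[] b) {
    double dot = 0.0;
    double normA = 0.0;
    double normB = 0.0;
    for (int i = 0; i < a.length; i++) {
      dot += a[i] * b[i];
      normA += a[i] * a[i];
      normB += b[i] * b[i];
    }
    double denominator = normA + normB - dot;
    if (denominator == 0.0) {
      return 0.0; // both vectors are all zeros
    }
    return 1.0 - dot / denominator;
  }
}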

Then I considered K-Means because it seems to scale well regardless of the
number of clusters. With a good choice of K, I got some false positives but very
good coverage. To avoid the false positives, I have read that I would need to
choose a good initial configuration, but I can't apply Canopy beforehand because
it won't scale with a large number of clusters, and K-means++ is not yet
implemented.
[jeff] K-Means works well too, but we lack a good way to prime it with k initial 
clusters. Canopy has scalability problems, as you note, and the random sampling 
works only so-so. We have a JIRA for K-means++.
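
For the record, the seeding step that JIRA covers looks roughly like the
following single-machine Java sketch. The point representation (plain
double[]) and the squared Euclidean distance are my own assumptions here;
this is not Mahout code:

// k-means++ seeding: the first seed is uniform at random, each further seed
// is drawn with probability proportional to its squared distance from the
// nearest seed already chosen. Assumes k <= points.size().
import java.util.ArrayList;
import java.util.List;
import java.util.Random;

public class KMeansPlusPlusSeeding {

  public static List<double[]> chooseSeeds(List<double[]> points, int k, Random rnd) {
    List<double[]> seeds = new ArrayList<double[]>();
    seeds.add(points.get(rnd.nextInt(points.size())));
    while (seeds.size() < k) {
      // Squared distance of every point to its nearest chosen seed.
      double[] d2 = new double[points.size()];
      double total = 0.0;
      for (int i = 0; i < points.size(); i++) {
        d2[i] = nearestSquaredDistance(points.get(i), seeds);
        total += d2[i];
      }
      // Sample the next seed proportionally to d2.
      double target = rnd.nextDouble() * total;
      int chosen = 0;
      for (double cumulative = 0.0; chosen < points.size() - 1; chosen++) {
        cumulative += d2[chosen];
        if (cumulative >= target) {
          break;
        }
      }
      seeds.add(points.get(chosen));
    }
    return seeds;
  }

  private static double nearestSquaredDistance(double[] p, List<double[]> seeds) {
    double best = Double.MAX_VALUE;
    for (double[] s : seeds) {
      double sum = 0.0;
      for (int i = 0; i < p.length; i++) {
        double d = p[i] - s[i];
        sum += d * d;
      }
      best = Math.min(best, sum);
    }
    return best;
  }
}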

Then I thought that I could apply a multi-level technique. I could run
K-Means with an approximate, pessimistic (low) number of clusters and then
run MeanShift inside each resulting cluster to refine the results and avoid
false positives. I didn't know whether that was a good idea, but I found a
reference in "Mahout in Action", on page 215, to something similar
(two-tiered clustering).

Even though I have this solution working fine, I have the feeling that it
has become too complex. To try it out, I wrote some custom code. In my current
solution I have a job that takes the output of a first clustering pass and
launches as many map tasks as clusters were generated, performing a second
clustering inside each of them and accumulating the output of each mapper
(so, no Reducer step).
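
To make the shape of that job concrete, here is a rough map-only sketch along
those lines. It assumes the input is already split so that each map task sees
exactly one first-tier cluster, and the inner clustering is a naive threshold
grouping used only as a placeholder where MeanShift (or any in-memory
clusterer) would really go; the key/value encodings are made up for
illustration and are not Mahout API:

// Second tier of the two-tiered clustering, as a map-only Hadoop job.
// Input:  key = first-tier clusterId, value = comma-separated point vector.
// Output: key = "clusterId/subClusterId", value = the original point.
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class SecondTierMapper extends Mapper<Text, Text, Text, Text> {

  // Buffer the whole first-tier cluster; it must fit in the mapper's memory.
  private final List<double[]> points = new ArrayList<double[]>();
  private final List<Text> raw = new ArrayList<Text>();
  private Text clusterId;

  @Override
  protected void map(Text key, Text value, Context context) {
    // Hadoop reuses key/value objects, so copy them before buffering.
    clusterId = new Text(key);
    raw.add(new Text(value));
    points.add(parse(value.toString()));
  }

  @Override
  protected void cleanup(Context context) throws IOException, InterruptedException {
    // Placeholder inner clustering: assign each point to the first existing
    // sub-cluster centre within DELTA, otherwise open a new sub-cluster.
    final double DELTA = 0.5;
    List<double[]> centres = new ArrayList<double[]>();
    for (int i = 0; i < points.size(); i++) {
      double[] p = points.get(i);
      int assigned = -1;
      for (int c = 0; c < centres.size(); c++) {
        if (distance(p, centres.get(c)) < DELTA) {
          assigned = c;
          break;
        }
      }
      if (assigned < 0) {
        centres.add(p);
        assigned = centres.size() - 1;
      }
      context.write(new Text(clusterId + "/" + assigned), raw.get(i));
    }
  }

  private static double[] parse(String csv) {
    String[] parts = csv.split(",");
    double[] v = new double[parts.length];
    for (int i = 0; i < parts.length; i++) {
      v[i] = Double.parseDouble(parts[i].trim());
    }
    return v;
  }

  private static double distance(double[] a, double[] b) {
    double sum = 0.0;
    for (int i = 0; i < a.length; i++) {
      double d = a[i] - b[i];
      sum += d * d;
    }
    return Math.sqrt(sum);
  }
}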

I was wondering if there is a better way to accomplish this. Is there
something already implemented to execute a two-tiered (or n-tiered)
clustering?
I also thought that perhaps you could suggest a better approach to my
problem.
[jeff] Nothing is implemented out of the box for n-tiered clustering, though I 
believe that the results of any clustering can be used as the prior for most of 
our iterative algorithms. Another way to approach the problem might be to try 
Dirichlet clustering to prime k-means. It is non-parametric and could give you 
an objective opinion on the best value of k for your data. Dirichlet can use 
DistanceMeasureModels, e.g. Tanimoto, Cosine, Manhattan, et al.
 
Thanks,

Pere.
