Hello all,

I have a clustering problem that involves clustering a huge data set.

My knowledge of clustering techniques is very limited, so I just tried a few
algorithms. Because I can regression-test them against a known-good clustered
set, I found a configuration that gives the best accuracy with no false
positives (MeanShift with Canopy, Tanimoto distance). However, because the
number of output clusters is roughly linear in the number of documents to
cluster, I had problems scaling out with MeanShift because of the single
Reducer step. I have read on the mailing list that MeanShift is a good fit
when the number of clusters grows logarithmically, not linearly, with input
size.
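(For anyone unfamiliar with the distance measure I mention: Tanimoto distance
is the extended-Jaccard dissimilarity. This is just a plain-Python sketch of
the formula, not Mahout's implementation:)

```python
def tanimoto_distance(a, b):
    """Tanimoto distance: 1 - (a.b) / (|a|^2 + |b|^2 - a.b).
    For binary vectors this reduces to the Jaccard distance."""
    dot = sum(x * y for x, y in zip(a, b))
    denom = sum(x * x for x in a) + sum(y * y for y in b) - dot
    return 1.0 - (dot / denom if denom else 0.0)

# dot = 2, |a|^2 = 2, |b|^2 = 3 -> 1 - 2/3
print(tanimoto_distance([1, 0, 1], [1, 1, 1]))  # 0.333...
```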

Then I considered K-Means because it seems to scale well regardless of the
number of clusters. With a good choice of K, I got very good coverage but
some false positives. To avoid the false positives, I have read that I would
need to choose a good initial configuration, but I can't run Canopy first
because it won't scale with a large number of clusters, and there is no
K-means++ implemented yet.
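(For reference, the K-means++ seeding I mention is simple to describe: pick
the first center at random, then pick each next center with probability
proportional to its squared distance from the nearest center chosen so far.
A minimal single-machine sketch, assuming a `dist` function is supplied by
the caller:)

```python
import random

def kmeans_pp_seeds(points, k, dist, rng=random.Random(42)):
    """Pick k initial centers using D^2 weighting (K-means++ seeding)."""
    centers = [rng.choice(points)]
    while len(centers) < k:
        # Squared distance of each point to its closest chosen center.
        d2 = [min(dist(p, c) ** 2 for c in centers) for p in points]
        r = rng.uniform(0, sum(d2))
        acc = 0.0
        for p, w in zip(points, d2):
            acc += w
            if acc >= r:
                centers.append(p)
                break
    return centers
```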

Then I thought I could apply a multi-level technique: use K-Means with an
approximate, pessimistic (low) number of clusters, and then run MeanShift
inside each cluster to refine the results and avoid false positives. I didn't
know if that was a good idea, but I found a reference to something similar
(two-tiered clustering) on page 215 of "Mahout in Action".
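(To make the idea concrete, here is a toy single-machine sketch of the
two-tier driver: a coarse pass buckets the points, a refine pass is run
independently inside each bucket, and the refined clusters are simply
concatenated with no global reduce. The `coarse` and `refine` callables
stand in for K-Means and MeanShift; they are hypothetical placeholders,
not Mahout APIs:)

```python
from collections import defaultdict

def two_tier_cluster(points, coarse, refine):
    """coarse/refine: callables mapping a list of points to a list of
    cluster labels (one per point). Returns a list of final clusters,
    each a list of points."""
    buckets = defaultdict(list)
    for p, label in zip(points, coarse(points)):
        buckets[label].append(p)
    result = []
    for members in buckets.values():
        # Second-level clustering runs independently per coarse cluster,
        # mirroring one map task per cluster with no Reducer step.
        subs = defaultdict(list)
        for p, sub_label in zip(members, refine(members)):
            subs[sub_label].append(p)
        result.extend(subs.values())
    return result
```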

Even though this solution works fine, I have the feeling it has become too
complex. To try it out, I wrote some custom code: in my current solution, a
job takes the output of one clustering and launches as many map tasks as
clusters were generated, performing a second clustering inside each of them
and accumulating the output of each mapper (so there is no Reducer step).

I was wondering if there is a better way to accomplish this. Is there
something already implemented to execute a two-tiered (or n-tiered)
clustering? Or maybe someone can suggest a better approach to my problem.

Thanks,

Pere.
