Re: Canopy estimator

Jeff Eastman Fri, 11 May 2012 07:58:54 -0700

The reason I use T1==T2 is that T2 is the only threshold that determines the number of clusters. T1 affects how many adjacent points are considered in the centroid calculations. So you could simplify your histogram analysis to 2-d without affecting #clusters.
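
To illustrate the T1/T2 distinction, here is a minimal Python sketch of a single-pass canopy run. It is not Mahout's code: the function name `canopy_sketch`, the sample points, the Euclidean distance, and the simplification of using each canopy's founding point as its center are all assumptions made for the example.

```python
import math

def canopy_sketch(points, t1, t2):
    """Simplified single-pass canopy run; returns (center, members) pairs."""
    if t1 < t2:
        raise ValueError("expected t1 >= t2")
    canopies = []  # each entry: (founding point used as center, member list)
    for p in points:
        strongly_bound = False
        for center, members in canopies:
            d = math.dist(p, center)
            if d < t1:
                members.append(p)      # T1: p feeds this canopy's centroid
            if d < t2:
                strongly_bound = True  # T2: p may not found a new canopy
        if not strongly_bound:
            canopies.append((p, [p]))
    return canopies

# Hypothetical sample data, chosen only to make the effect visible.
points = [(0.0, 0.0), (0.5, 0.0), (1.0, 0.0), (5.0, 5.0), (5.5, 5.0), (9.0, 0.0)]
tight = canopy_sketch(points, t1=2.0, t2=2.0)
loose = canopy_sketch(points, t1=8.0, t2=2.0)
print(len(tight), len(loose))
print([len(m) for _, m in tight], [len(m) for _, m in loose])
```

In this sketch, raising T1 from 2.0 to 8.0 leaves the number of canopies unchanged and only enlarges the member lists that would go into each centroid.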

Hierarchical clustering is one way to think about the clustering of information that we have just recently added to Mahout. Any experiences you can share with its application would be valuable.


On 5/10/12 12:20 PM, Pat Ferrel wrote:

Naively I imagine giving a range, dividing it up into equal increments, and calculating all relevant cluster numbers. It would take on the order of (# of increments)**2 time to do, but it seems to me that for a given corpus you wouldn't need to do this very often (actually you only need 1/2 this data). You would get a 3-d surface/histogram with magnitude = # of clusters, x and y = t1 and t2. Then search this data for local maxes, mins and inflection points. I'm not sure what this data would look like -- hence the "naively" disclaimer at the start. It is certainly a large landscape to search by hand.
Your method only looks at the diagonal (t1==t2), and maybe that is the most interesting part, in which case the calculations are much quicker.
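
The sweep described above could be sketched roughly as follows in Python. This is only an illustration under stated assumptions: `canopy_count` is a simplified stand-in for a real canopy run, and `sweep`, the 1-d sample points, and the grid bounds are hypothetical. Only the half of the grid with t1 >= t2 is evaluated.

```python
import math

def canopy_count(points, t1, t2):
    """Number of canopies from a simplified single-pass canopy run."""
    canopies = []  # each entry: (founding point used as center, member list)
    for p in points:
        bound = False
        for center, members in canopies:
            d = math.dist(p, center)
            if d < t1:
                members.append(p)
            if d < t2:
                bound = True
        if not bound:
            canopies.append((p, [p]))
    return len(canopies)

def sweep(points, lo, hi, steps):
    """Map (t1, t2) -> cluster count over an even grid, keeping t1 >= t2."""
    ts = [lo + i * (hi - lo) / (steps - 1) for i in range(steps)]
    return {(t1, t2): canopy_count(points, t1, t2)
            for t2 in ts for t1 in ts if t1 >= t2}

# Hypothetical 1-d data with three visible groups.
points = [(0.0,), (0.4,), (0.8,), (5.0,), (5.3,), (10.0,)]
surface = sweep(points, lo=0.5, hi=6.5, steps=4)
diagonal = {t2: n for (t1, t2), n in surface.items() if t1 == t2}
for t, n in sorted(diagonal.items()):
    print(t, n)
```

The `diagonal` dictionary is the t1==t2 slice; plateaus and drops in its counts are the features one would scan for.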
Ultimately I'm interested in finding a better way to do hierarchical clustering. Information very often has a natural hierarchy but the usual methods produce spotty results. If we had a reasonable canopy estimator we could employ it at each level on the subset of the corpus being clustered. Doing this by hand quickly becomes prohibitive given that the number of times you have to estimate canopy values increases exponentially with each level of hierarchy.
Even a mediocre estimator would likely be better than picking k out of the air. And the times it would fail to produce would also tell you something about your data.
On 5/10/12 6:12 AM, Jeff Eastman wrote:
No, the issue was discussed but never reached critical mass. I typically do a binary search to find the best value setting T1==T2 and then tweak T1 up a bit. For feeding k-means, this latter step is not so important.
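
A binary search of this kind might look like the Python sketch below. It assumes that "best" means hitting a target cluster count, and that the count does not increase as the threshold grows, which a greedy canopy pass does not strictly guarantee. `canopy_count`, `search_t`, and the sample points are hypothetical names and data, not Mahout APIs.

```python
import math

def canopy_count(points, t):
    """Canopy count with T1 == T2 == t; centers are the founding points."""
    centers = []
    for p in points:
        if all(math.dist(p, c) >= t for c in centers):
            centers.append(p)
    return len(centers)

def search_t(points, target_k, lo, hi, iterations=30):
    """Bisect for roughly the smallest t in [lo, hi] with count <= target_k.

    Assumes canopy_count(points, hi) <= target_k at the start.
    """
    for _ in range(iterations):
        mid = (lo + hi) / 2
        if canopy_count(points, mid) > target_k:
            lo = mid  # too many clusters: threshold is too small
        else:
            hi = mid  # at or below target: try a smaller threshold
    return hi

# Hypothetical 1-d data with three visible groups.
points = [(0.0,), (0.4,), (0.8,), (5.0,), (5.3,), (10.0,)]
t = search_t(points, target_k=3, lo=0.1, hi=10.0)
print(t, canopy_count(points, t))
```

The returned value would then serve as T2, with T1 nudged upward by hand as described.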
If you could figure out a way to automate this we would be interested. Conceptually, using the RandomSeedGenerator to sample a few vectors and comparing them with your chosen DistanceMeasure would give you a hint at the T-value to begin the search. A utility to do that would be a useful contribution.
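
The sampling idea could be prototyped along these lines in Python. This is a sketch only: `random.Random.sample` stands in for RandomSeedGenerator, `math.dist` stands in for the chosen DistanceMeasure, and taking the median pairwise distance as the hint is an assumption, not something the thread specifies. The name `t_hint` is hypothetical.

```python
import itertools
import math
import random
import statistics

def t_hint(vectors, sample_size=5, distance=math.dist, seed=0):
    """Median pairwise distance among a small random sample of vectors."""
    if len(vectors) < 2:
        raise ValueError("need at least two vectors")
    rng = random.Random(seed)  # fixed seed so repeated runs agree
    sample = rng.sample(vectors, max(2, min(sample_size, len(vectors))))
    dists = [distance(a, b) for a, b in itertools.combinations(sample, 2)]
    return statistics.median(dists)

# Hypothetical data; with sample_size equal to the data size, all pairs are used.
vectors = [(0.0, 0.0), (3.0, 4.0), (6.0, 8.0)]
hint = t_hint(vectors, sample_size=3)
print(hint)
```

The hint is meant only as a starting point for the threshold search, not as a final T value.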
On 5/9/12 8:36 PM, Pat Ferrel wrote:
Some thoughts on https://issues.apache.org/jira/browse/MAHOUT-563
Did anything ever get done with this? Ted mentions limited usefulness. This may be true but the cases he mentions as counterexamples are also not very good for using canopy ahead of kmeans, no? That info would be a useful result. To use canopies I find myself running it over and over trying to see some inflection in the number of clusters. Why not automate this? Even if the data shows nothing, that is itself an answer of value and it would save a lot of hand work to find out the same thing.
