Re: Canopy estimator

Pat Ferrel Sat, 12 May 2012 09:05:48 -0700

Wrote a shell script to do t1==t2 over a range and it does give useful information. Picking a few points outside of t1==t2 doesn't seem to affect things by much, number-of-clusters-wise. Since there is really no way to talk about canopy quality AFAIK, the number is how I make a decision.

One problem I have is that virtually any value for T gives me a very large number of canopies--on the order of 2-5 docs per cluster. Whether I create clusters using random seeds or canopies, they are of poor quality to my eye. A few are good but many are silly. I've tried a wide range of vectorizing knobs including L2 norm, n-grams with a high ml, and writing a custom Lucene filter to filter out numbers and do stemming, to little avail. Using your method of t1==t2 I get 2 docs per cluster with t=0.3 (tanimoto or cosine) and 5 docs per cluster with t=0.95. This is telling me that the docs are not really clusterable, contrary to intuition.


Next stop SVD? Maybe a larger data set from fewer sources will help?

As to hierarchical clustering, in my case it makes little sense when canopies give 2-5 docs per cluster. My experimental data set is web-crawled news since it has a clear hierarchy; you can easily see it in categories like root:sports:baseball, soccer, basketball, etc.

As to hierarchical clustering using another tool set, where we had a proprietary patented algorithm for picking k, it worked pretty well. It was for email though, so it was not very noisy data. What I was hoping to do is use canopy or another method to estimate cluster numbers automatically for each level, and if I can get a crude canopy estimator working I'll report back.


On 5/11/12 7:58 AM, Jeff Eastman wrote:

The reason I use T1==T2 is that T2 is the only threshold that determines the number of clusters. T1 affects how many adjacent points are considered in the centroid calculations. So you could simplify your histogram analysis to 2-d without affecting #clusters.
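The point about T1 and T2 can be illustrated with a minimal single-pass canopy sketch. This is not Mahout's implementation; the function names, the Euclidean distance, and the toy points are assumptions made only for illustration. A point seeds a new canopy only if it is not within T2 of any existing center, while T1 only decides which points contribute to a canopy's centroid.

```python
import math

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def canopy(points, t1, t2, distance=euclidean):
    """Simplified single-pass canopy clustering (illustrative sketch).

    Returns a list of (center, members) pairs.
    """
    if t1 < t2:
        raise ValueError("t1 must be >= t2")
    canopies = []
    for p in points:
        strongly_bound = False
        for center, members in canopies:
            d = distance(p, center)
            if d < t1:
                members.append(p)      # T1: p contributes to this centroid
            if d < t2:
                strongly_bound = True  # T2: p may not seed a new canopy
        if not strongly_bound:
            canopies.append((p, [p]))
    return canopies

def centroid(members):
    n = len(members)
    return tuple(sum(col) / n for col in zip(*members))

# Toy data, made up for this example.
points = [(0.0, 0.0), (0.5, 0.0), (3.0, 0.0), (3.4, 0.0), (6.0, 0.0)]
tight = canopy(points, t1=1.0, t2=1.0)
loose = canopy(points, t1=3.2, t2=1.0)
print(len(tight), len(loose))
print([centroid(m) for _, m in tight])
print([centroid(m) for _, m in loose])
```

On this toy data both runs produce the same number of canopies because T2 is the same; raising T1 only pulls more neighbors into each centroid and so shifts the centroids.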
Hierarchical clustering is one way to think about the clustering of information that we have just recently added to Mahout. Any experiences you can share with its application would be valuable.
On 5/10/12 12:20 PM, Pat Ferrel wrote:
Naively I imagine giving a range, dividing it up into equal increments and calculating all relevant cluster numbers. It would take on the order of (# of increments)**2 time to do, but it seems to me that for a given corpus you wouldn't need to do this very often (actually you only need 1/2 this data). You would get a 3-d surface/histogram with magnitude = # of clusters, x and y = t1 and t2. Then search this data for local maxes, mins and inflection points. I'm not sure what this data would look like -- hence the "naively" disclaimer at the start. It is certainly a large landscape to search by hand.
Your method only looks at the diagonal (t1==t2) and maybe that is the most interesting part, in which case the calculations are much quicker.
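A small sketch of the search spaces being compared, under the assumption that "1/2 this data" refers to the usual canopy constraint t1 >= t2. All names here are hypothetical, and `count_clusters` stands for whatever job actually runs canopy and reports the number of clusters.

```python
def thresholds(lo, hi, steps):
    """Return `steps` equally spaced threshold values from lo to hi inclusive."""
    if steps < 2:
        raise ValueError("steps must be at least 2")
    width = (hi - lo) / (steps - 1)
    return [lo + i * width for i in range(steps)]

def full_grid(ts):
    """Every (t1, t2) combination: len(ts) ** 2 runs."""
    return [(t1, t2) for t1 in ts for t2 in ts]

def half_grid(ts):
    """Only pairs with t1 >= t2 (assumed meaning of '1/2 this data')."""
    return [(t1, t2) for t1 in ts for t2 in ts if t1 >= t2]

def diagonal(ts):
    """Only t1 == t2: one run per increment."""
    return [(t, t) for t in ts]

def sweep(pairs, count_clusters):
    """Run a caller-supplied count_clusters(t1, t2) for each pair."""
    return {(t1, t2): count_clusters(t1, t2) for t1, t2 in pairs}

ts = thresholds(0.1, 1.0, 10)
print(len(full_grid(ts)), len(half_grid(ts)), len(diagonal(ts)))
```

With 10 increments the full grid needs 100 runs, the t1 >= t2 half needs 55, and the diagonal needs 10.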
Ultimately I'm interested in finding a better way to do hierarchical clustering. Information very often has a natural hierarchy but the usual methods produce spotty results. If we had a reasonable canopy estimator we could employ it at each level on the subset of the corpus being clustered. Doing this by hand quickly becomes prohibitive given that the number of times you have to estimate canopy values increases exponentially with each level of hierarchy.
Even a mediocre estimator would likely be better than picking k out of the air. And the times it would fail to produce would also tell you something about your data.
On 5/10/12 6:12 AM, Jeff Eastman wrote:
No, the issue was discussed but never reached critical mass. I typically do a binary search to find the best value setting T1==T2 and then tweak T1 up a bit. For feeding k-means, this latter step is not so important.
If you could figure out a way to automate this we would be interested. Conceptually, using the RandomSeedGenerator to sample a few vectors and comparing them with your chosen DistanceMeasure would give you a hint at the T-value to begin the search. A utility to do that would be a useful contribution.
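The sampling idea can be sketched in plain Python as follows. This does not use Mahout's RandomSeedGenerator or DistanceMeasure classes; `t_hint`, `cosine_distance`, the sample size, and the toy vectors are all assumptions for illustration. It samples a few vectors, computes their pairwise distances, and reports summary values that bracket where a T search might start.

```python
import math
import random
import statistics
from itertools import combinations

def cosine_distance(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    if na == 0.0 or nb == 0.0:
        return 1.0  # assumption: treat an all-zero vector as maximally distant
    return 1.0 - dot / (na * nb)

def t_hint(vectors, sample_size=50, distance=cosine_distance, seed=None):
    """Summarize pairwise distances among a random sample of vectors."""
    rng = random.Random(seed)
    sample = rng.sample(vectors, min(sample_size, len(vectors)))
    dists = sorted(distance(a, b) for a, b in combinations(sample, 2))
    if not dists:
        raise ValueError("need at least two vectors")
    return {
        "min": dists[0],
        "median": statistics.median(dists),
        "max": dists[-1],
    }

# Toy vectors, made up for this example.
vectors = [
    (1.0, 0.0, 0.0),
    (0.9, 0.1, 0.0),
    (0.0, 1.0, 0.0),
    (0.0, 0.8, 0.2),
    (0.0, 0.0, 1.0),
]
hint = t_hint(vectors, sample_size=5, seed=0)
print(hint)
```

A search over T could then be restricted to the interval between the smallest and largest sampled distance, since thresholds outside it would tend to put every sampled vector in its own canopy or all of them in one.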
On 5/9/12 8:36 PM, Pat Ferrel wrote:
Some thoughts on https://issues.apache.org/jira/browse/MAHOUT-563
Did anything ever get done with this? Ted mentions limited usefulness. This may be true but the cases he mentions as counterexamples are also not very good for using canopy ahead of kmeans, no? That info would be a useful result. To use canopies I find myself running it over and over trying to see some inflection in the number of clusters. Why not automate this? Even if the data shows nothing, that is itself an answer of value and it would save a lot of hand work to find out the same thing.
