Re: Canopy estimator

Pat Ferrel Sat, 12 May 2012 11:20:00 -0700

If I understand your comment correctly, this is why I hope that applying levels of specificity will help. On a particular subject L1 will give good quality and on another L2 will be better. I may be able to use an estimate of quality here to prune out bad clusters, not sure. The nature of my problem gives me no control over the input data in production, so I have to come up with methods that are adaptive.

If you are asking about using your post 0.7 clustering, no I haven't yet. Will it help with varying scale? I assume by scale you mean the density of docs in certain areas of the vector space? One thing I am trying now is limiting the subject matter crawled and getting a much larger sample, which should get me a denser distribution.

If you think it might help, do I build it inside the 0.7 snapshot? Is it a drop-in replacement for kmeans?


On 5/12/12 10:33 AM, Ted Dunning wrote:

One thing that may be happening here is that the scale of your data varies
from place to place.

Have you tried the upcoming k-means stuff?

On Sat, May 12, 2012 at 8:53 AM, Pat Ferrel <[email protected]> wrote:

One problem I have is that virtually any value for T gives me a very large
number of canopies--on the order of 2-5 docs per cluster. Whether I create
clusters using random seeds or canopies they are of poor quality to my eye.
A few are good but many are silly. I've tried a wide range of vectorizing
knobs including L2 norm, n-grams with a high ml, and doing a custom Lucene
filter to filter out numbers and do stemming, to little avail. Using your
method of t1==t2 I get 2 docs per cluster with t=0.3 (tanimoto or cosine)
and 5 docs per cluster with t = 0.95. This is telling me that the docs are
not really clusterable contrary to
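
As a rough illustration of the t1==t2 experiment described above, here is a minimal Python sketch of a greedy canopy pass with a single threshold and cosine distance. It is not Mahout's implementation; the function names (`cosine_distance`, `canopies`, `docs_per_canopy`) and the dense-list vectors are assumptions made for the example. Sweeping `t` and watching the average number of docs per canopy is one way to see how sensitive the grouping is to the threshold.

```python
import math


def cosine_distance(a, b):
    """Return 1 - cosine similarity for two equal-length dense vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 1.0  # treat an all-zero vector as maximally distant
    return 1.0 - dot / (norm_a * norm_b)


def canopies(points, t, distance=cosine_distance):
    """Greedy canopy pass with t1 == t2 == t (hypothetical helper).

    Because both thresholds are equal, every point that joins a canopy
    is also removed from the candidate list, so each point lands in
    exactly one canopy. Returns a list of lists of point indices.
    """
    remaining = list(range(len(points)))
    result = []
    while remaining:
        center = remaining[0]
        # The center always belongs to its own canopy.
        members = [center] + [
            i for i in remaining[1:]
            if distance(points[center], points[i]) <= t
        ]
        result.append(members)
        taken = set(members)
        remaining = [i for i in remaining if i not in taken]
    return result


def docs_per_canopy(points, t):
    """Average number of points per canopy for a given threshold."""
    return len(points) / len(canopies(points, t))
```

Calling `docs_per_canopy(vectors, t)` for several values of `t` on a sample of document vectors gives a quick picture of the trend. If the average stays small across a wide range of thresholds, that would be consistent with the observation that most docs have few close neighbors.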
