If I understand your comment correctly, this is why I hope that applying
levels of specificity will help. On one subject L1 will give good
quality, and on another L2 will be better. I may be able to use an
estimate of quality here to prune out bad clusters, but I'm not sure. The
nature of my problem gives me no control over the input data in
production, so I have to come up with methods that are adaptive.
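
To make the pruning idea concrete, here is a minimal sketch in plain Python. It assumes that "quality" is approximated by the mean pairwise cosine similarity inside a cluster, which is only one possible estimate. The names cohesion and prune, and the min_cohesion and min_size thresholds, are hypothetical and are not part of Mahout.

```python
import math

def cosine(a, b):
    """Cosine similarity of two dense vectors; 0.0 if either has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def cohesion(cluster):
    """Mean pairwise cosine similarity of the vectors in one cluster."""
    n = len(cluster)
    pairs = [(i, j) for i in range(n) for j in range(i + 1, n)]
    if not pairs:
        return 0.0  # a singleton gives no evidence of cohesion
    return sum(cosine(cluster[i], cluster[j]) for i, j in pairs) / len(pairs)

def prune(clusters, min_cohesion=0.5, min_size=3):
    """Keep clusters that are large enough and cohesive enough.

    Both thresholds are assumed values and would need tuning per corpus.
    """
    return [c for c in clusters
            if len(c) >= min_size and cohesion(c) >= min_cohesion]
```

Since the input data cannot be controlled, fixed thresholds may be too rigid; deriving min_cohesion from the distribution of cohesion scores in each run (a percentile, say) would be one way to keep this adaptive.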
If you are asking about using your post-0.7 clustering, no, I haven't
yet. Will it help with varying scale? I assume by scale you mean the
density of docs in certain areas of the vector space? One thing I am
trying now is limiting the subject matter crawled and getting a much
larger sample, which should get me a denser distribution.
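
If scale does mean local density, one rough way to check whether it varies across the vector space is to look at each doc's distance to its k-th nearest neighbor. The sketch below is plain Python and brute force (O(n^2)), so it only suits a sample. The function names and the choice of k are assumptions for illustration, not anything from Mahout.

```python
import math

def cosine_distance(a, b):
    """1 - cosine similarity; 1.0 if either vector has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (na * nb) if na and nb else 1.0

def kth_neighbor_distance(docs, k=2):
    """Distance from each doc to its k-th nearest neighbor.

    Small values mean a dense neighborhood, large values a sparse one.
    A wide spread suggests a single distance threshold will not suit
    every region of the space.
    """
    if len(docs) < 2:
        raise ValueError("need at least two docs")
    result = []
    for i, d in enumerate(docs):
        dists = sorted(cosine_distance(d, other)
                       for j, other in enumerate(docs) if j != i)
        # Fall back to the farthest neighbor if fewer than k exist.
        result.append(dists[min(k, len(dists)) - 1])
    return result
```

Comparing these distances before and after narrowing the crawl would show whether the larger, more focused sample really does give a denser distribution.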
If you think it might help, do I build it inside the 0.7 snapshot? Is it
a drop-in replacement for kmeans?
On 5/12/12 10:33 AM, Ted Dunning wrote:
One thing that may be happening here is that the scale of your data varies
from place to place.
Have you tried the upcoming k-means stuff?
On Sat, May 12, 2012 at 8:53 AM, Pat Ferrel <[email protected]> wrote:
One problem I have is that virtually any value for T gives me a very large
number of canopies--on the order of 2-5 docs per cluster. Whether I create
clusters using random seeds or canopies they are of poor quality to my eye.
A few are good but many are silly. I've tried a wide range of vectorizing
knobs including L2 norm, n-grams with a high ml, and a custom Lucene
filter to filter out numbers and do stemming, to little avail. Using your
method of t1==t2, I get 2 docs per cluster with t = 0.3 (Tanimoto or cosine)
and 5 docs per cluster with t = 0.95. This is telling me that the docs are
not really clusterable contrary to