As I said in another reply my data doesn't seem to cluster well contrary to my intuition. I may have a data problem or need to do some dimensional reduction. I've skimmed your docs and would love to give it a try. With 10x or 100x faster results alone it would be a big help. It seems to promise doing away with the canopy step all together, no? But I need to read your docs more carefully.

Ultimately I'm trying to find a good way to automatically generate several levels of clustering with varying "specificity". Each level might be completely independent of the other but the clusters should be more and more specific. They could be hierarchical but it might be better to just find the nearest clusters from less specific to more specific. From a high number of doc members to a low number.

This why canopy has been frustrating because by varying t I would have hoped to generate these levels of specificity, then replace hierarchical clustering with a similarity measure. In other words L1 has 1000 docs per cluster, L2 has 100 docs per cluster. I'd find the 100 docs closest to L1 clusters (that's all the user wants to see in my case) and reference the 10 L2 clusters nearest by centroid similarity using rowsimilarity to calculate. I'm hoping that this is a useful way to browse the information space.

Naively speaking your streaming k seems to have elements of this built in.

On 5/11/12 8:26 AM, Ted Dunning wrote:
The streaming k-means stuff might be an interesting alternative to setting
parameters manually.  In that work, the algorithm adaptively sets a
parameter that has similar function to T1 and T2.

More importantly, the output of the main pass is a large number of weighted
centroids that can be used as a small surrogate for the entire data set in
subsequent clustering.  Since these centroids should fit in memory, you
could do an adaptive search for propitious values of T1 and T2.

My github repo has a description of this algorithm with an analysis of
scaling properties.  See

https://github.com/tdunning/knn

As soon as we finish the cleanup release, I will start folding in this code.

On Fri, May 11, 2012 at 7:58 AM, Jeff Eastman<[email protected]>wrote:

The reason I use T1==T2 is that T2 is the only threshold that determines
the number of clusters. T1 affects how many adjacent points are considered
in the centroid calculations. So you could simplify your histogram analysis
to 2-d without affecting #clusters.

Hierarchical clustering is one way to think about the clustering of
information that we have just recently added to Mahout. Any experiences you
can share with its application would be valuable.

On 5/10/12 12:20 PM, Pat Ferrel wrote:

Naively I imagine giving a range, divide up into equal increments and
calculate all relevant cluster numbers. It would take the order of (# of
increments)**2  time to do but it seems to me that for a given corpus you
wouldn't need to do this very often (actually you only need 1/2 this data).
You would get a 3-d surface/histogram with magnitude = # of clusters, x and
y = t1 and t2. Then search this data for local maxes, mins and inflection
points. I'm not sure what this data would look like -- hence the "naively"
disclaimer at the start. It is certainly a large landscape to search by
hand.

Your method only looks at the diagonal (t1==t2)and maybe that is the most
interesting part, in which case the calculations are much quicker.

Ultimately I'm interested in finding a better way to do hierarchical
clustering. Information very often has a natural hierarchy but the usual
methods produce spotty results. If we had a reasonable canopy estimator we
could employ it at each level on the subset of the corpus being clustered.
Doing this by hand quickly becomes prohibitive given that the number of
times you have to estimate canopy values increases exponentially with each
level of hierarchy

Even a mediocre estimator would likely be better that picking k out of
the air. And the times it would fail to produce would also tell you
something about your data.

On 5/10/12 6:12 AM, Jeff Eastman wrote:

No, the issue was discussed but never reached critical mass. I typically
do a binary search to find the best value setting T1==T2 and then tweak T1
up a bit. For feeding k-means, this latter step is not so important.

If you could figure out a way to automate this we would be interested.
Conceptually, using the RandomSeedGenerator to sample a few vectors and
comparing them with your chosen DistanceMeasure would give you a hint at
the T-value to begin the search. A utility to do that would be a useful
contribution.

On 5/9/12 8:36 PM, Pat Ferrel wrote:

Some thoughts on 
https://issues.apache.org/**jira/browse/MAHOUT-563<https://issues.apache.org/jira/browse/MAHOUT-563>

Did anything ever get done with this? Ted mentions limited usefulness.
This may be true but the cases he mentions as counter examples are also not
very good for using canopy ahead of kmeans, no? That info would be a useful
result. To use canopies I find myself running it over and over trying to
see some inflection in the number of clusters. Why not automate this? Even
if the data shows nothing, that is itself an answer of value and it would
save a lot of hand work to find out the same thing.




Reply via email to