Re: Canopy Clustering not scaling

2010-05-02 Thread Jeff Eastman
These sorts of optimizations could delay the growth of canopy clusters in situations where the clustering thresholds are set too low for the dataset. At some point the mapper would still OOM with enough points if they all become clusters. That decision rests with the T2 threshold which determines if

Re: Canopy Clustering not scaling

2010-05-02 Thread Ted Dunning
How about making the threshold adapt over time? Another option is to keep a count of all of the canopies so far and evict any which have too few points with too large average distance. The points emitted so far would still reference these canopies, but we wouldn't be able to add new points to the
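
A minimal sketch of the eviction idea above, assuming each canopy tracks a running point count and a running sum of point-to-center distances. The names `CanopyStats`, `evict_sparse`, `min_points` and `max_mean_distance` are hypothetical and are not taken from the Mahout code base.

```python
from dataclasses import dataclass


@dataclass
class CanopyStats:
    """Running statistics for one canopy (hypothetical helper)."""
    center: tuple
    count: int = 0
    dist_sum: float = 0.0

    def add(self, distance):
        # Record one more point assigned to this canopy.
        self.count += 1
        self.dist_sum += distance

    @property
    def mean_distance(self):
        return self.dist_sum / self.count if self.count else 0.0


def evict_sparse(canopies, min_points, max_mean_distance):
    """Drop canopies that have too few points AND too large an average distance."""
    return [
        c for c in canopies
        if not (c.count < min_points and c.mean_distance > max_mean_distance)
    ]


dense = CanopyStats(center=(0.0, 0.0))
for d in (0.1, 0.2, 0.3, 0.2):
    dense.add(d)

sparse = CanopyStats(center=(9.0, 9.0))
sparse.add(4.0)

kept = evict_sparse([dense, sparse], min_points=3, max_mean_distance=1.0)
```

Points already emitted against an evicted canopy keep their reference to it; the eviction only stops that canopy from accepting new points, which bounds the mapper's in-memory list.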

Re: Canopy Clustering not scaling

2010-05-02 Thread Jeff Eastman
You could try using more, smaller input splits, but large datasets and too-small distance thresholds will choke up the mappers with the number of canopies approaching the number of points seen by the mapper. Also the single reducer will choke unless the thresholds allow condensing the mapper canopi
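
To illustrate how a too-small threshold makes the canopy count approach the point count, here is a simplified single-pass sketch. It assumes Euclidean distance and uses only a T2-style threshold (the looser T1 threshold is left out); `canopy_centers` is a hypothetical helper, not the Mahout implementation.

```python
import math


def canopy_centers(points, t2):
    """A point starts a new canopy unless it lies within t2 of an existing center."""
    centers = []
    for p in points:
        if not any(math.dist(p, c) < t2 for c in centers):
            centers.append(p)
    return centers


# 100 points on a line, spaced 1.0 apart.
points = [(float(i), 0.0) for i in range(100)]

too_small = canopy_centers(points, t2=0.5)    # threshold below the point spacing
reasonable = canopy_centers(points, t2=10.0)  # threshold spans several points

print(len(too_small), len(reasonable))
```

With the threshold below the point spacing, no point ever falls inside an existing canopy, so the list held in memory grows linearly with the input.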

Re: Canopy Clustering not scaling

2010-05-02 Thread Sean Owen
As I said, "you can imagine how the rest goes" -- this is a taste of how you might distribute the key piece of the computation you asked about, and certainly does that correctly. It is not the whole algorithm of course -- up to you. On Sun, May 2, 2010 at 1:52 PM, Robin Anil wrote: > I don't think

Re: Canopy Clustering not scaling

2010-05-02 Thread Robin Anil
I don't think you got the algorithm correct. The canopy list is empty at the start and is automatically populated using the distance threshold. This may work, but I don't have a clue how to get to this point. On Sun, May 2, 2010 at 6:15 PM, Sean Owen wrote: > How about this for the first phase? I think you can

Re: Canopy Clustering not scaling

2010-05-02 Thread Sean Owen
How about this for the first phase? I think you can imagine how the rest goes, more later... Mapper 1A. map() input: one canopy. map() output: canopy ID -> canopy. Mapper 1B. (Has in memory all canopy IDs, read at startup.) map() input: one point. map() output: for each canopy ID, canopy ID -> point
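
The two mappers described above could be simulated in plain Python roughly as follows. This is only a sketch of the first phase as far as the message states it: the function names, the `"canopy"`/`"point"` tags and the in-memory `shuffle` stand-in are assumptions for illustration, not Hadoop or Mahout APIs.

```python
from collections import defaultdict


def map_canopy(canopy_id, canopy):
    # Mapper 1A: one canopy in, (canopy ID -> canopy) out.
    yield canopy_id, ("canopy", canopy)


def make_point_mapper(canopy_ids):
    # Mapper 1B: holds all canopy IDs in memory, loaded once at startup.
    def map_point(point):
        # Emit the point once per canopy ID.
        for cid in canopy_ids:
            yield cid, ("point", point)
    return map_point


def shuffle(pairs):
    # Stand-in for the framework's group-by-key step.
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return dict(grouped)


canopies = {"c1": (0.0, 0.0), "c2": (5.0, 5.0)}
points = [(0.1, 0.2), (4.9, 5.1)]

map_point = make_point_mapper(list(canopies))
pairs = [kv for cid, c in canopies.items() for kv in map_canopy(cid, c)]
pairs += [kv for p in points for kv in map_point(p)]
grouped = shuffle(pairs)
```

After grouping, each canopy ID carries its canopy plus every point, so only canopy IDs need to sit in mapper memory; the cost is that the number of emitted records is points times canopies.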

Re: Canopy Clustering not scaling

2010-05-02 Thread Robin Anil
On Sun, May 2, 2010 at 5:45 PM, Sean Owen wrote: > Not surprising indeed, that won't scale at some point. > What is the stage that needs everything in memory? Maybe describing > that helps imagine solutions. The algorithm is simple: for each point read into the mapper, find the canopy it

Re: Canopy Clustering not scaling

2010-05-02 Thread Sean Owen
Not surprising indeed, that won't scale at some point. What is the stage that needs everything in memory? Maybe describing that helps imagine solutions. The typical reason for this, in my experience back in the day, was needing to look up data infrequently in a key-value way. "Side-loading" off HD