These sorts of optimizations could delay the growth of canopy clusters
in situations where the clustering thresholds are set too low for the
dataset. At some point the mapper would still OOM with enough points if
they all become clusters. That decision rests with the T2 threshold,
which determines whether a point becomes a new canopy.
How about making the threshold adapt over time?
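As a rough sketch of that idea (the class and parameter names below are
invented for illustration, not anything in Mahout), T2 could simply be
widened once the in-memory canopy list passes a soft cap:

// Sketch only: widen T2 as the canopy list grows, so later points are
// less likely to spawn new canopies.
public class AdaptiveT2 {
  private double t2;                  // "new canopy" threshold, grows over time
  private final int softCap;          // canopy count at which widening starts
  private final double growthFactor;  // e.g. 1.05 per new canopy past the cap

  public AdaptiveT2(double initialT2, int softCap, double growthFactor) {
    this.t2 = initialT2;
    this.softCap = softCap;
    this.growthFactor = growthFactor;
  }

  // Call each time the mapper creates a new canopy.
  public void onNewCanopy(int currentCanopyCount) {
    if (currentCanopyCount > softCap) {
      t2 *= growthFactor;             // gently raise the bar for new canopies
    }
  }

  public double currentT2() {
    return t2;
  }
}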
Another option is to keep a count of all of the canopies so far and evict
any that have too few points with too large an average distance. The points
emitted so far would still reference these canopies, but we wouldn't be able
to add new points to them.
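A sketch of the bookkeeping that eviction would need (names here are made
up for illustration, not Mahout code):

import java.util.HashMap;
import java.util.Iterator;
import java.util.Map;

// Tracks per-canopy point counts and average distance so weak canopies can
// be evicted from the in-memory list.
public class CanopyEviction {
  static class CanopyStats {
    long pointCount;
    double distanceSum;    // sum of distances from points to the canopy center
    double averageDistance() {
      return pointCount == 0 ? 0.0 : distanceSum / pointCount;
    }
  }

  private final Map<Integer, CanopyStats> canopies =
      new HashMap<Integer, CanopyStats>();

  // Record that a point was assigned to canopy 'id' at distance 'dist'.
  public void recordPoint(int id, double dist) {
    CanopyStats stats = canopies.get(id);
    if (stats == null) {
      stats = new CanopyStats();
      canopies.put(id, stats);
    }
    stats.pointCount++;
    stats.distanceSum += dist;
  }

  // Drop canopies with too few points at too large an average distance.
  // Points already emitted keep their canopy IDs; we simply stop assigning
  // new points to the evicted canopies.
  public void evictWeak(long minPoints, double maxAvgDistance) {
    Iterator<Map.Entry<Integer, CanopyStats>> it = canopies.entrySet().iterator();
    while (it.hasNext()) {
      CanopyStats stats = it.next().getValue();
      if (stats.pointCount < minPoints && stats.averageDistance() > maxAvgDistance) {
        it.remove();
      }
    }
  }
}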
You could try using more, smaller input splits, but large datasets and
too-small distance thresholds will still choke up the mappers, with the
number of canopies approaching the number of points seen by each mapper.
Also, the single reducer will choke unless the thresholds allow condensing
the mapper canopies.
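If you do try smaller splits, capping the split size is a one-liner on the
job. The 64 MB figure below is arbitrary, and the job setup around it is
only a placeholder:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class SmallSplitsJob {
  public static void main(String[] args) throws Exception {
    Job job = new Job(new Configuration(), "canopy-phase-1");
    job.setJarByClass(SmallSplitsJob.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    // Cap split size so each mapper sees fewer points and accumulates
    // fewer canopies in memory.
    FileInputFormat.setMaxInputSplitSize(job, 64L * 1024 * 1024);
    // ... set the canopy mapper/reducer and key/value classes as usual ...
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}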
As I said, "you can imagine how the rest goes" -- this is a taste of
how you might distribute the key piece of the computation you asked
about, and certainly does that correctly. It is not the whole
algorithm of course -- up to you.
On Sun, May 2, 2010 at 1:52 PM, Robin Anil wrote:
> I don't think you got the algorithm correct.
I don't think you got the algorithm correct. The canopy list is empty at
the start and is populated automatically using the distance threshold. This
may work, but I don't have a clue how you get to this point.
On Sun, May 2, 2010 at 6:15 PM, Sean Owen wrote:
> How about this for the first phase? I think you can imagine how the
> rest goes, more later...
How about this for the first phase? I think you can imagine how the
rest goes, more later...
Mapper 1A.
map() input: one canopy
map() output: canopy ID -> canopy

Mapper 1B.
(Has in memory all canopy IDs, read at startup.)
map() input: one point
map() output: for each canopy ID, canopy ID -> point
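In Java against the newer Hadoop mapreduce API, those two mappers could
look roughly like this. Everything is simplified to Text keys and values,
and the "canopy.id.path" config key is just a placeholder:

import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.util.ArrayList;
import java.util.List;

import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Mapper 1A: one canopy in (keyed by its ID), canopy ID -> canopy out.
class CanopyIdMapper extends Mapper<Text, Text, Text, Text> {
  @Override
  protected void map(Text canopyId, Text canopy, Context context)
      throws IOException, InterruptedException {
    context.write(canopyId, canopy);
  }
}

// Mapper 1B: loads all canopy IDs at startup, then fans each point out to
// every canopy ID.
class PointFanoutMapper extends Mapper<LongWritable, Text, Text, Text> {
  private final List<Text> canopyIds = new ArrayList<Text>();

  @Override
  protected void setup(Context context) throws IOException, InterruptedException {
    FileSystem fs = FileSystem.get(context.getConfiguration());
    Path idFile = new Path(context.getConfiguration().get("canopy.id.path"));
    BufferedReader reader =
        new BufferedReader(new InputStreamReader(fs.open(idFile)));
    try {
      String line;
      while ((line = reader.readLine()) != null) {
        canopyIds.add(new Text(line.trim()));
      }
    } finally {
      reader.close();
    }
  }

  @Override
  protected void map(LongWritable offset, Text point, Context context)
      throws IOException, InterruptedException {
    for (Text canopyId : canopyIds) {
      context.write(canopyId, point);   // every canopy ID sees every point
    }
  }
}

Note that Mapper 1B emits each point once per canopy ID, so the intermediate
data volume multiplies by the number of canopies.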
On Sun, May 2, 2010 at 5:45 PM, Sean Owen wrote:
> Not surprising indeed; that won't scale at some point.
> What is the stage that needs everything in memory? Maybe describing
> that helps imagine solutions.
>
The algorithm is simple:
For each point read into the mapper, find the canopies it is close to
(within the distance threshold), or add it as a new canopy if none is
close enough.
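In memory, that loop looks something like the sketch below (points as
double[], plain Euclidean distance, T1/T2 in the usual canopy sense). The
last branch is also why the canopy list grows with the data:

import java.util.ArrayList;
import java.util.List;

public class InMemoryCanopyAssigner {
  private final List<double[]> canopies = new ArrayList<double[]>();
  private final double t1;   // "belongs to this canopy" threshold
  private final double t2;   // "not a new canopy" threshold (t2 < t1)

  public InMemoryCanopyAssigner(double t1, double t2) {
    this.t1 = t1;
    this.t2 = t2;
  }

  // For each point the mapper reads: assign it to every canopy within T1,
  // and create a new canopy if no existing one is within T2.
  public List<Integer> assign(double[] point) {
    List<Integer> assignedTo = new ArrayList<Integer>();
    boolean stronglyCovered = false;
    for (int i = 0; i < canopies.size(); i++) {
      double d = distance(canopies.get(i), point);
      if (d < t1) {
        assignedTo.add(i);
      }
      if (d < t2) {
        stronglyCovered = true;
      }
    }
    if (!stronglyCovered) {
      canopies.add(point);               // the list grows without bound
      assignedTo.add(canopies.size() - 1);
    }
    return assignedTo;
  }

  private static double distance(double[] a, double[] b) {
    double sum = 0.0;
    for (int i = 0; i < a.length; i++) {
      double diff = a[i] - b[i];
      sum += diff * diff;
    }
    return Math.sqrt(sum);
  }
}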
Not surprising indeed; that won't scale at some point.
What is the stage that needs everything in memory? Maybe describing
that helps imagine solutions.
The typical reason for this, in my experience back in the day, was
needing to look up data infrequently in a key-value way. "Side-loading"
off HDFS can cover that case.