On Mon, Aug 8, 2011 at 10:46 PM, Lance Norskog <[email protected]> wrote:
> Do the parallel sampler mappers need to be deterministic? That is, do
> they all start with the same random seed?

No. Just the opposite. They need to be independent.

> Can the mapper generate a high-quality hash of each vector, and throw
> away a part of the output space?

No. Each sample is a vector which must be accepted or rejected. If
accepted, it is kept until the end of the split and then sent in a group
to the reducer.

> This would serve as a first cut in the mapper. Using the hash (or part
> of the hash) as the key for the remaining values allows tuning the
> number of keys vs. how many samples a reducer receives.

Sort of. To be fair, each mapper has to retain as many samples as are
desired in the end. Then the reducer has to take a fair sample of all of
the groups that it receives, accounting for the fact that each group is
from a (potentially) different-sized input stream.
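In case it helps, here is a rough sketch of the two stages in plain Java
(not actual Mahout/Hadoop code; the class and method names are made up).
Each mapper keeps a reservoir of k samples plus a count of how many
vectors it saw, and the reducer draws the final k samples by picking a
reservoir in proportion to that count and then a random sample from it.

import java.util.ArrayList;
import java.util.List;
import java.util.Random;

// Sketch of the two-stage sampling described above.
// Stage 1: each mapper keeps a full-sized reservoir of k samples from its split.
// Stage 2: the reducer merges the reservoirs, weighting each one by the size of
// the input stream it came from, so each final draw is uniform over all inputs.
public class TwoStageSampler {

  // Classic reservoir sampling: after n inputs, each is kept with probability k/n.
  static class Reservoir {
    final int k;
    final List<double[]> samples = new ArrayList<>();
    long seen = 0;
    final Random rand;

    Reservoir(int k, Random rand) {
      this.k = k;
      this.rand = rand;
    }

    void add(double[] vector) {
      seen++;
      if (samples.size() < k) {
        samples.add(vector);
      } else {
        // Replace an existing sample with probability k/seen.
        long j = (long) (rand.nextDouble() * seen);
        if (j < k) {
          samples.set((int) j, vector);
        }
      }
    }
  }

  // Reducer side: draw k final samples, choosing each reservoir with probability
  // proportional to the number of input vectors it represents, then a uniformly
  // random member of that reservoir.  P(any given input vector) works out to 1/N.
  static List<double[]> merge(List<Reservoir> reservoirs, int k, Random rand) {
    long total = 0;
    for (Reservoir r : reservoirs) {
      total += r.seen;
    }
    List<double[]> result = new ArrayList<>();
    for (int i = 0; i < k; i++) {
      double u = rand.nextDouble() * total;
      for (Reservoir r : reservoirs) {
        if (u < r.seen) {
          // Sampling with replacement here for brevity; a real implementation
          // would remove chosen items to sample without replacement.
          result.add(r.samples.get(rand.nextInt(r.samples.size())));
          break;
        }
        u -= r.seen;
      }
    }
    return result;
  }
}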
