Hi Ted,

Can you clarify your point about "each mapper needs to retain as many
samples as are desired in the end"? Does this mean I'm restricted to sample
sizes bounded by the maximum number of key/value pairs in a split? From what
I've read in the Hadoop docs, the number of map tasks for a job is determined
by the number of splits, with mapred.map.tasks being only a hint to Hadoop ...

Tim


On Tue, Aug 9, 2011 at 12:49 AM, Ted Dunning <[email protected]> wrote:

> On Mon, Aug 8, 2011 at 10:46 PM, Lance Norskog <[email protected]> wrote:
>
> > Do the parallel sampler mappers need to be deterministic? That is, do
> > they all start with the same random seed?
> >
>
> No.  Just the opposite.  They need to be independent.
>
>
> > Can the mapper generate a high-quality hash of each vector, and throw
> > away a part of the output space?
>
>
> No.  Each sample is a vector that must be accepted or rejected.  If
> accepted, it is kept until the end of the split and then sent in a
> group to the reducer.
>
>
> > This would serve as a first cut in
> > the mapper. Using the hash (or part of the hash) as the key for the
> > remaining values allows tuning the number of keys vs. how many
> > samples a reducer receives.
> >
>
> Sort of.  To be fair, each mapper has to retain as many samples as are
> desired in the end.  Then the reducer has to take a fair sample of all
> of the groups that it receives, accounting for the fact that each group
> is from a (potentially) differently sized input stream.
>
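For concreteness, a rough sketch of the scheme Ted describes, in plain Java
and outside the Hadoop API for brevity (the class and method names below are
made up for illustration, not from any existing patch). Each mapper keeps a
size-k reservoir over its split and remembers how many items it saw; the
reducer then fills each output slot by picking a group with probability
proportional to how much of that group's stream is still unaccounted for:

import java.util.ArrayList;
import java.util.List;
import java.util.Random;

class ReservoirSampler<T> {
  private final List<T> samples = new ArrayList<T>();
  private final int k;          // number of samples to retain
  private long seen = 0;        // items observed so far in this split
  private final Random rand;    // seeded independently in each mapper

  ReservoirSampler(int k, Random rand) {
    this.k = k;
    this.rand = rand;
  }

  // Accept or reject one vector; an accepted vector displaces a current
  // sample, so the reservoir stays a uniform sample of everything seen.
  void add(T item) {
    seen++;
    if (samples.size() < k) {
      samples.add(item);                     // still filling the reservoir
    } else if (rand.nextDouble() * seen < k) {
      samples.set(rand.nextInt(k), item);    // keep with probability k/seen
    }
  }

  long seen() { return seen; }
  List<T> samples() { return samples; }

  // Reducer side: draw k samples from the union of the groups, choosing a
  // group for each slot with probability proportional to the number of
  // items still unaccounted for in its stream.  This is sampling without
  // replacement from the (virtual) concatenation of all input streams.
  static <T> List<T> merge(List<ReservoirSampler<T>> groups, int k, Random rand) {
    long total = 0;
    long[] remaining = new long[groups.size()];
    List<List<T>> pools = new ArrayList<List<T>>();
    for (int i = 0; i < groups.size(); i++) {
      remaining[i] = groups.get(i).seen();
      total += remaining[i];
      pools.add(new ArrayList<T>(groups.get(i).samples()));
    }
    List<T> out = new ArrayList<T>(k);
    for (int slot = 0; slot < k && total > 0; slot++) {
      long r = (long) (rand.nextDouble() * total);
      for (int i = 0; i < pools.size(); i++) {
        if (r < remaining[i]) {
          List<T> pool = pools.get(i);
          out.add(pool.remove(rand.nextInt(pool.size())));
          remaining[i]--;                    // one fewer item in stream i
          total--;
          break;
        }
        r -= remaining[i];
      }
    }
    return out;
  }
}

The seen() count carried with each group is exactly the weighting Ted
mentions: a reservoir drawn from a 10M-record split has to win slots ten
times as often as one drawn from a 1M-record split.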
