On Mon, Aug 8, 2011 at 10:46 PM, Lance Norskog <[email protected]> wrote:
> Do the parallel sampler mappers need to be deterministic? That is, do
> they all start with the same random seed?

No. Just the opposite. They need to be independent.

> Can the mapper generate a high-quality hash of each vector, and throw
> away a part of the output space?

No. Each sample is a vector which must be accepted or rejected. If
accepted, it is kept until the end of the split and then sent in a group
to the reducer.

> This would serve as a first cut in the mapper. Using the hash (or part
> of the hash) as the key for the remaining values allows tuning the
> number of keys vs. how many samples a reducer receives.

Sort of. To be fair, each mapper has to retain as many samples as are
desired in the end. Then the reducer has to take a fair sample of all of
the groups that it receives, accounting for the fact that each group is
from a (potentially) different-sized input stream.
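In case it helps, here is a rough sketch of the two stages in plain Java
(not actual Mahout/Hadoop code; the class and method names are made up).
Each mapper keeps a reservoir of k samples plus a count of how many
vectors it saw, and the reducer draws the final k samples by picking a
reservoir in proportion to that count and then a random sample from it.

import java.util.ArrayList;
import java.util.List;
import java.util.Random;

// Sketch of the two-stage sampling described above.
// Stage 1: each mapper keeps a full-sized reservoir of k samples from its split.
// Stage 2: the reducer merges the reservoirs, weighting each one by the size of
// the input stream it came from, so each final draw is uniform over all inputs.
public class TwoStageSampler {

  // Classic reservoir sampling: after n inputs, each is kept with probability k/n.
  static class Reservoir {
    final int k;
    final List<double[]> samples = new ArrayList<>();
    long seen = 0;
    final Random rand;

    Reservoir(int k, Random rand) {
      this.k = k;
      this.rand = rand;
    }

    void add(double[] vector) {
      seen++;
      if (samples.size() < k) {
        samples.add(vector);
      } else {
        // Replace an existing sample with probability k/seen.
        long j = (long) (rand.nextDouble() * seen);
        if (j < k) {
          samples.set((int) j, vector);
        }
      }
    }
  }

  // Reducer side: draw k final samples, choosing each reservoir with probability
  // proportional to the number of input vectors it represents, then a uniformly
  // random member of that reservoir.  P(any given input vector) works out to 1/N.
  static List<double[]> merge(List<Reservoir> reservoirs, int k, Random rand) {
    long total = 0;
    for (Reservoir r : reservoirs) {
      total += r.seen;
    }
    List<double[]> result = new ArrayList<>();
    for (int i = 0; i < k; i++) {
      double u = rand.nextDouble() * total;
      for (Reservoir r : reservoirs) {
        if (u < r.seen) {
          // Sampling with replacement here for brevity; a real implementation
          // would remove chosen items to sample without replacement.
          result.add(r.samples.get(rand.nextInt(r.samples.size())));
          break;
        }
        u -= r.seen;
      }
    }
    return result;
  }
}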
