Do the parallel sampler mappers need to be deterministic? That is, do they all start with the same random seed?
Can the mapper generate a high-quality hash of each vector, and throw away a
part of the output space? This would serve as a first cut in the mapper. Using
the hash (or part of the hash) as the key for the remaining values allows
tuning the number of keys vs. how many samples a reducer receives.

Lance

On Mon, Aug 8, 2011 at 10:18 PM, Ted Dunning <[email protected]> wrote:

> The trick is to create a reservoir sampling system with different virtual
> pools.
>
> On Mon, Aug 8, 2011 at 9:20 PM, Timothy Potter <[email protected]> wrote:
>
>> Hi Ted,
>>
>> Thanks for the response. I'll implement, open a ticket, and post a patch
>> after I'm satisfied with the outcome.
>>
>> Cheers,
>> Tim
>>
>> On Mon, Aug 8, 2011 at 1:34 PM, Ted Dunning <[email protected]> wrote:
>>
>>> There is no such thing now. It should be relatively easy to build. The
>>> simplest method is to have each mapper produce a full-sized sample which
>>> is sent to a single reducer which produces another sample. The output of
>>> the mappers needs to have a count of items retained and items considered
>>> in order for this to work correctly.
>>>
>>> This cuts down on the amount of data that the reducer has to handle but
>>> is similar in many respects.
>>>
>>> On Mon, Aug 8, 2011 at 11:47 AM, Timothy Potter <[email protected]> wrote:
>>>
>>>> Is there a distributed Mahout job to produce a random sample for a
>>>> large collection of vectors stored in HDFS? For example, if I wanted
>>>> only 2M vectors randomly selected from the ASF mail archive vectors
>>>> (~6M total), is there a Mahout job to do this (I'm using trunk
>>>> 0.6-SNAPSHOT)? If not, can this be done in a distributed manner using
>>>> multiple reducers or would I have to send all vectors to 1 reducer and
>>>> then use RandomSampler in the single reducer?
>>>>
>>>> Cheers,
>>>> Tim

--
Lance Norskog
[email protected]
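A minimal sketch (not an existing Mahout class) of the hash-based first cut
Lance describes at the top of the thread, assuming the vectors arrive as
VectorWritable values in a SequenceFile; the class name and the configuration
keys are made up for illustration. Because the keep/drop decision is a pure
function of the vector's bytes, it is deterministic and needs no coordination
of random seeds across mappers, though the surviving fraction is only
approximately the requested one.

import java.io.IOException;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

import org.apache.hadoop.io.DataOutputBuffer;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Writable;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.mahout.math.VectorWritable;

public class HashCutMapper
    extends Mapper<Writable, VectorWritable, IntWritable, VectorWritable> {

  // Hypothetical configuration keys, not part of Mahout.
  public static final String SAMPLE_FRACTION = "hashcut.sample.fraction";
  public static final String NUM_KEYS = "hashcut.num.keys";

  private double fraction;
  private int numKeys;
  private MessageDigest md5;
  private final DataOutputBuffer buffer = new DataOutputBuffer();
  private final IntWritable outKey = new IntWritable();

  @Override
  protected void setup(Context context) {
    fraction = context.getConfiguration().getFloat(SAMPLE_FRACTION, 0.33f);
    numKeys = context.getConfiguration().getInt(NUM_KEYS, 1000);
    try {
      md5 = MessageDigest.getInstance("MD5");
    } catch (NoSuchAlgorithmException e) {
      throw new IllegalStateException(e);
    }
  }

  @Override
  protected void map(Writable key, VectorWritable value, Context context)
      throws IOException, InterruptedException {
    // Hash the serialized vector so the keep/drop decision depends only on
    // the data, not on which mapper saw it or in what order.
    buffer.reset();
    value.write(buffer);
    md5.reset();
    md5.update(buffer.getData(), 0, buffer.getLength());
    byte[] digest = md5.digest();

    // Interpret the first 8 bytes of the digest as a non-negative long.
    long h = 0L;
    for (int i = 0; i < 8; i++) {
      h = (h << 8) | (digest[i] & 0xFFL);
    }
    h &= Long.MAX_VALUE;

    // First cut: keep the vector only if its hash lands in the target
    // fraction of the hash space.
    if (h < (long) (fraction * Long.MAX_VALUE)) {
      // Key survivors by a slice of the hash so the number of reduce keys,
      // and hence how many samples each reducer receives, is tunable.
      outKey.set((int) (h % numKeys));
      context.write(outKey, value);
    }
  }
}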

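For the two-stage scheme Ted outlines deeper in the thread, the single reducer
has to combine the per-mapper samples using the retained/considered counts.
Below is a minimal, non-Hadoop sketch of one way that combine step could work,
assuming each mapper emits a simple random sample of min(k, n_i) of the n_i
vectors it considered, together with n_i. The class and method names are
hypothetical, and this is one possible merge rule rather than existing Mahout
code or necessarily the exact scheme Ted had in mind.

import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Random;

public final class SampleMerger {

  /** One mapper's output: its reservoir sample plus how many items it considered. */
  public static final class PartialSample<T> {
    final List<T> sample;   // size is min(k, considered)
    final long considered;  // items this mapper actually saw

    public PartialSample(List<T> sample, long considered) {
      this.sample = sample;
      this.considered = considered;
    }
  }

  /** Fold all per-mapper samples into a single sample of (at most) size k. */
  public static <T> List<T> mergeAll(Iterable<PartialSample<T>> parts, int k, Random rng) {
    PartialSample<T> acc = new PartialSample<T>(new ArrayList<T>(), 0L);
    for (PartialSample<T> part : parts) {
      acc = merge(acc, part, k, rng);
    }
    return acc.sample;
  }

  /**
   * Merge two simple random samples into a simple random sample of size k over
   * the union of their underlying pools. Each output slot is filled from pool A
   * with probability proportional to how many of A's items are still undrawn,
   * which reproduces the correct hypergeometric split between the two pools.
   */
  static <T> PartialSample<T> merge(PartialSample<T> a, PartialSample<T> b,
                                    int k, Random rng) {
    List<T> fromA = new ArrayList<T>(a.sample);
    List<T> fromB = new ArrayList<T>(b.sample);
    Collections.shuffle(fromA, rng);  // randomize within-sample order
    Collections.shuffle(fromB, rng);

    List<T> merged = new ArrayList<T>(k);
    long remainingA = a.considered;
    long remainingB = b.considered;
    int aTaken = 0;
    int bTaken = 0;
    while (merged.size() < k && remainingA + remainingB > 0) {
      if (rng.nextDouble() * (remainingA + remainingB) < remainingA) {
        merged.add(fromA.get(aTaken++));
        remainingA--;
      } else {
        merged.add(fromB.get(bTaken++));
        remainingB--;
      }
    }
    return new PartialSample<T>(merged, a.considered + b.considered);
  }
}

Wired into a job, each mapper would keep a fixed-size reservoir plus a counter
of items seen, emit both under a single key, and the lone reducer would feed
the resulting PartialSample objects to mergeAll, which is where the
retained/considered counts Ted mentions come into play.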