The trick is to create a reservoir sampling system with different virtual pools.
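
Roughly, each mapper keeps a reservoir of size k and counts how many items it has seen. Something like this (just a sketch, not anything that exists in Mahout; the class and method names are made up):

import java.util.ArrayList;
import java.util.List;
import java.util.Random;

/** Classic reservoir sampling: keeps a uniform sample of size k
 *  from a stream of unknown length. */
public class Reservoir<T> {
  private final int k;
  private final Random rand;
  private final List<T> sample = new ArrayList<T>();
  private long seen = 0;   // items considered so far

  public Reservoir(int k, Random rand) {
    this.k = k;
    this.rand = rand;
  }

  public void add(T item) {
    seen++;
    if (sample.size() < k) {
      sample.add(item);                            // fill the reservoir first
    } else {
      long j = (long) (rand.nextDouble() * seen);  // uniform in [0, seen)
      if (j < k) {
        sample.set((int) j, item);                 // keep with probability k / seen
      }
    }
  }

  public List<T> getSample() { return sample; }
  public long getSeen() { return seen; }
}

The seen count is exactly the "items considered" number mentioned below; it has to travel with the sample so the reducer can weight each mapper's contribution correctly.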
On Mon, Aug 8, 2011 at 9:20 PM, Timothy Potter <[email protected]> wrote:

> Hi Ted,
>
> Thanks for the response. I'll implement, open a ticket, and post a patch
> after I'm satisfied with the outcome.
>
> Cheers,
> Tim
>
> On Mon, Aug 8, 2011 at 1:34 PM, Ted Dunning <[email protected]> wrote:
>
> > There is no such thing now. It should be relatively easy to build. The
> > simplest method is to have each mapper produce a full-sized sample which
> > is sent to a single reducer which produces another sample. The output of
> > the mappers needs to have a count of items retained and items considered
> > in order for this to work correctly.
> >
> > This cuts down on the amount of data that the reducer has to handle but
> > is similar in many respects.
> >
> > On Mon, Aug 8, 2011 at 11:47 AM, Timothy Potter <[email protected]> wrote:
> >
> > > Is there a distributed Mahout job to produce a random sample for a
> > > large collection of vectors stored in HDFS? For example, if I wanted
> > > only 2M vectors randomly selected from the ASF mail archive vectors
> > > (~6M total), is there a Mahout job to do this (I'm using trunk
> > > 0.6-SNAPSHOT)? If not, can this be done in a distributed manner using
> > > multiple reducers or would I have to send all vectors to 1 reducer and
> > > then use RandomSampler in the single reducer?
> > >
> > > Cheers,
> > > Tim
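
For the second stage suggested above, the single reducer can treat each incoming sample as a stand-in for the items its mapper actually saw. One way to do that merge (again just a sketch with invented names, assuming every mapper emits a sample of size min(k, itemsSeen) together with its seen count):

import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Random;

/** Merges per-mapper reservoir samples into one uniform sample of size k.
 *  samples.get(i) must be a uniform sample of a split that contained
 *  seenCounts[i] items, with samples.get(i).size() == min(k, seenCounts[i]). */
public class SampleMerger {

  public static <T> List<T> merge(List<List<T>> samples, long[] seenCounts,
                                  int k, Random rand) {
    // Shuffle each sample so taking "the next unused item" is a uniform draw.
    for (List<T> s : samples) {
      Collections.shuffle(s, rand);
    }
    long[] remaining = seenCounts.clone();
    int[] next = new int[samples.size()];   // next unused index per sample
    long total = 0;
    for (long n : remaining) {
      total += n;
    }
    List<T> out = new ArrayList<T>(k);
    while (out.size() < k && total > 0) {
      // Pick a source with probability proportional to how many of the
      // original items it still represents.
      long r = (long) (rand.nextDouble() * total);
      int i = 0;
      while (r >= remaining[i]) {
        r -= remaining[i];
        i++;
      }
      out.add(samples.get(i).get(next[i]++));
      remaining[i]--;
      total--;
    }
    return out;
  }
}

Since the reducer draws at most k items in total and each mapper ships a full-sized sample, no source can run dry, and the result is a uniform without-replacement sample of the whole input.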
