The trick is to create a reservoir sampling system with different virtual pools.
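
Here is a minimal sketch of what I mean (hypothetical code, not an existing
Mahout class): one pass over the input maintains several independent
fixed-size reservoirs, one per virtual pool, and each pool also tracks how
many items it has considered so the samples can be combined correctly later.

    import java.util.ArrayList;
    import java.util.List;
    import java.util.Random;

    // Sketch only: standard reservoir sampling (Algorithm R) run over
    // several independent "virtual pools" in a single pass.
    public class VirtualPoolSampler<T> {
      private final List<List<T>> pools;  // one reservoir per virtual pool
      private final long[] seen;          // items considered per pool
      private final int capacity;         // sample size per pool
      private final Random rng = new Random();

      public VirtualPoolSampler(int numPools, int capacity) {
        this.capacity = capacity;
        this.seen = new long[numPools];
        this.pools = new ArrayList<List<T>>(numPools);
        for (int i = 0; i < numPools; i++) {
          pools.add(new ArrayList<T>(capacity));
        }
      }

      // Offer an item to one pool; it survives with probability k/n.
      public void offer(int pool, T item) {
        seen[pool]++;
        List<T> reservoir = pools.get(pool);
        if (reservoir.size() < capacity) {
          reservoir.add(item);                    // still filling
        } else {
          long j = (long) (rng.nextDouble() * seen[pool]);
          if (j < capacity) {
            reservoir.set((int) j, item);         // replace a random slot
          }
        }
      }

      public List<T> sample(int pool)  { return pools.get(pool); }
      public long considered(int pool) { return seen[pool]; }
    }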

On Mon, Aug 8, 2011 at 9:20 PM, Timothy Potter <[email protected]> wrote:

> Hi Ted,
>
> Thanks for the response. I'll implement, open a ticket, and post a patch
> after I'm satisfied with the outcome.
>
> Cheers,
> Tim
>
> On Mon, Aug 8, 2011 at 1:34 PM, Ted Dunning <[email protected]> wrote:
>
> > There is no such thing now.  It should be relatively easy to build.  The
> > simplest method is to have each mapper produce a full-sized sample that
> > is sent to a single reducer, which produces another sample.  The output
> > of the mappers needs to include a count of items retained and items
> > considered in order for this to work correctly.
> >
> > This cuts down on the amount of data that the reducer has to handle but
> > is otherwise similar in most respects.
> >
> > On Mon, Aug 8, 2011 at 11:47 AM, Timothy Potter <[email protected]> wrote:
> >
> > > Is there a distributed Mahout job to produce a random sample from a
> > > large collection of vectors stored in HDFS? For example, if I wanted
> > > only 2M vectors randomly selected from the ASF mail archive vectors
> > > (~6M total), is there a Mahout job to do this (I'm using trunk
> > > 0.6-SNAPSHOT)? If not, can this be done in a distributed manner using
> > > multiple reducers, or would I have to send all vectors to a single
> > > reducer and then use RandomSampler there?
> > >
> > > Cheers,
> > > Tim
> > >
> >
>
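
To make the two-stage idea concrete, here is a rough sketch of the
reducer-side merge (again hypothetical, not something that exists in Mahout
today). It assumes each mapper emits its reservoir together with the count
of items it considered, so each retained item can be weighted as standing
for considered/sampleSize of the original inputs:

    import java.util.ArrayList;
    import java.util.Collections;
    import java.util.List;
    import java.util.Random;

    // Sketch only: merge per-mapper samples into one sample of size k,
    // drawing from each mapper in proportion to how much input it saw.
    public final class SampleMerger {
      public static <T> List<T> merge(List<List<T>> samples,
                                      long[] considered,
                                      int k, Random rng) {
        int n = samples.size();
        double[] perItem = new double[n]; // inputs each sample entry represents
        double[] mass = new double[n];    // remaining input mass behind each sample
        double total = 0.0;
        for (int i = 0; i < n; i++) {
          List<T> s = samples.get(i);
          if (s.isEmpty()) continue;           // mapper saw no data
          Collections.shuffle(s, rng);         // removing from the end = uniform draw
          perItem[i] = considered[i] / (double) s.size();
          mass[i] = considered[i];
          total += mass[i];
        }
        List<T> out = new ArrayList<T>(k);
        while (out.size() < k && total > 0) {
          double r = rng.nextDouble() * total; // pick a source proportional to mass
          for (int i = 0; i < n; i++) {
            r -= mass[i];
            if (r < 0) {
              List<T> s = samples.get(i);
              out.add(s.remove(s.size() - 1)); // draw without replacement
              mass[i] = s.isEmpty() ? 0.0 : mass[i] - perItem[i];
              total = 0.0;                     // recompute to avoid FP drift
              for (double m : mass) total += m;
              break;
            }
          }
        }
        return out;
      }
    }

This keeps the counts doing the real work: a mapper that considered ten
times as many items contributes items to the final sample ten times as
often, which is what makes the merged sample approximately uniform over the
whole input.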
