There is no such thing now.  It should be relatively easy to build.  The
simplest method is to have each mapper produce a full-sized sample, all of
which are sent to a single reducer that produces another sample from them.
For this to work correctly, each mapper's output needs to include a count of
the items it retained and the items it considered.

This cuts down on the amount of data that the single reducer has to handle,
but in other respects it is similar to sending all of the vectors to one
reducer and sampling there.  A sketch of the reducer-side merge is below.
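Something along these lines (names are hypothetical; this is not an existing
Mahout class) could do the reducer-side merge.  Each mapper's sample arrives
with its "considered" count, and the reducer weights every retained item by
considered / retained so the final sample is roughly uniform over the original
input.  This sketch uses an approximate weighted reservoir (A-Chao style):

import java.util.ArrayList;
import java.util.List;
import java.util.Random;

/**
 * Hypothetical sketch of merging per-mapper samples into one sample of size k.
 * Each mapper emits (retained items, count of items considered); the weights
 * derived from those counts keep the final draw close to uniform.
 */
public class SampleMerge {

  /** One mapper's output: the items it kept plus how many items it saw. */
  public static class MapperSample<T> {
    final List<T> retained;
    final long considered;
    public MapperSample(List<T> retained, long considered) {
      this.retained = retained;
      this.considered = considered;
    }
  }

  /** Approximate weighted reservoir sampling over all mapper outputs. */
  public static <T> List<T> merge(List<MapperSample<T>> inputs, int k, Random rng) {
    List<T> reservoir = new ArrayList<>(k);
    double totalWeight = 0;
    for (MapperSample<T> input : inputs) {
      if (input.retained.isEmpty()) {
        continue;
      }
      // Each retained item stands in for considered / retained original items.
      double weight = (double) input.considered / input.retained.size();
      for (T item : input.retained) {
        totalWeight += weight;
        if (reservoir.size() < k) {
          reservoir.add(item);
        } else if (rng.nextDouble() < k * weight / totalWeight) {
          // Replace a uniformly chosen reservoir slot.
          reservoir.set(rng.nextInt(k), item);
        }
      }
    }
    return reservoir;
  }
}

In the ASF mail archive case, each mapper would keep a 2M-item sample of the
vectors it sees, and the single reduce call would run the merge above to get
the final 2M out of ~6M.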

On Mon, Aug 8, 2011 at 11:47 AM, Timothy Potter <[email protected]> wrote:

> Is there a distributed Mahout job to produce a random sample for a large
> collection of vectors stored in HDFS? For example, if I wanted only 2M
> vectors randomly selected from the ASF mail archive vectors (~6M total), is
> there a Mahout job to do this (I'm using trunk 0.6-SNAPSHOT)? If not, can
> this be done in a distributed manner using multiple reducers or would I
> have
> to send all vectors to 1 reducer and then use RandomSampler in the single
> reducer?
>
> Cheers,
> Tim
>
