There is no such thing right now, but it should be relatively easy to build. The simplest method is to have each mapper produce a full-sized sample of its own split and send those samples to a single reducer, which then samples again to get the final result. For this to be correct, each mapper also needs to output how many items it retained and how many it considered, so the reducer can weight the partial samples properly.
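Roughly like this (just a sketch, not tested against trunk; the class names DistributedSampling, SamplingMapper and SamplingReducer are made up, and I fold the two counts into a single per-item weight of considered/retained rather than emitting them separately). The mappers do ordinary reservoir sampling; the single reducer re-samples with a weighted (A-Res style) reservoir so that a vector standing in for more originals is proportionally more likely to be kept:

import java.io.IOException;
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.PriorityQueue;
import java.util.Random;

import org.apache.hadoop.io.DoubleWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Writable;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.mahout.math.VectorWritable;

public class DistributedSampling {

  static final int SAMPLE_SIZE = 2000000;   // desired final sample size (2M in your case)

  // Each mapper keeps a plain uniform reservoir of SAMPLE_SIZE vectors and counts how
  // many it looked at.  The two counts (considered, retained) are folded into one
  // per-item weight = considered / retained, emitted as the map output key.
  public static class SamplingMapper
      extends Mapper<Writable, VectorWritable, DoubleWritable, VectorWritable> {

    private final Random rng = new Random();
    private final List<VectorWritable> reservoir = new ArrayList<VectorWritable>();
    private long seen = 0;

    @Override
    protected void map(Writable key, VectorWritable value, Context ctx) {
      seen++;
      if (reservoir.size() < SAMPLE_SIZE) {
        reservoir.add(new VectorWritable(value.get().clone()));
      } else {
        long j = (long) (rng.nextDouble() * seen);   // replace a random slot with prob SAMPLE_SIZE/seen
        if (j < SAMPLE_SIZE) {
          reservoir.set((int) j, new VectorWritable(value.get().clone()));
        }
      }
    }

    @Override
    protected void cleanup(Context ctx) throws IOException, InterruptedException {
      if (reservoir.isEmpty()) {
        return;
      }
      // each retained vector stands in for (seen / retained) original vectors
      DoubleWritable weight = new DoubleWritable((double) seen / reservoir.size());
      for (VectorWritable v : reservoir) {
        ctx.write(weight, v);
      }
    }
  }

  // Single reducer: weighted reservoir sampling (A-Res style) over the mapper samples,
  // so a vector that represents more originals is proportionally more likely to survive.
  public static class SamplingReducer
      extends Reducer<DoubleWritable, VectorWritable, NullWritable, VectorWritable> {

    private static final class Entry {
      final double key;
      final VectorWritable vector;
      Entry(double key, VectorWritable vector) { this.key = key; this.vector = vector; }
    }

    private final Random rng = new Random();
    // smallest A-Res key at the head, so the head is always the eviction candidate
    private final PriorityQueue<Entry> best = new PriorityQueue<Entry>(11,
        new Comparator<Entry>() {
          public int compare(Entry a, Entry b) { return Double.compare(a.key, b.key); }
        });

    @Override
    protected void reduce(DoubleWritable weight, Iterable<VectorWritable> vectors, Context ctx) {
      for (VectorWritable v : vectors) {
        double k = Math.pow(rng.nextDouble(), 1.0 / weight.get());   // A-Res key
        if (best.size() < SAMPLE_SIZE) {
          best.add(new Entry(k, new VectorWritable(v.get().clone())));
        } else if (k > best.peek().key) {
          best.poll();
          best.add(new Entry(k, new VectorWritable(v.get().clone())));
        }
      }
    }

    @Override
    protected void cleanup(Context ctx) throws IOException, InterruptedException {
      for (Entry e : best) {
        ctx.write(NullWritable.get(), e.vector);
      }
    }
  }
}

The driver would have to set job.setNumReduceTasks(1) and DoubleWritable/VectorWritable as the map output classes. Merging by weight like this is approximate rather than exactly uniform over the whole collection, but it should be close enough for most purposes.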
Having each mapper pre-sample cuts down on the amount of data that the single reducer has to handle, but it is otherwise similar in spirit to the one-reducer approach you describe.

On Mon, Aug 8, 2011 at 11:47 AM, Timothy Potter <[email protected]> wrote:

> Is there a distributed Mahout job to produce a random sample for a large
> collection of vectors stored in HDFS? For example, if I wanted only 2M
> vectors randomly selected from the ASF mail archive vectors (~6M total), is
> there a Mahout job to do this (I'm using trunk 0.6-SNAPSHOT)? If not, can
> this be done in a distributed manner using multiple reducers or would I have
> to send all vectors to 1 reducer and then use RandomSampler in the single
> reducer?
>
> Cheers,
> Tim
>
