Is there a distributed Mahout job to produce a random sample from a large
collection of vectors stored in HDFS? For example, if I wanted only 2M
vectors randomly selected from the ASF mail archive vectors (~6M total), is
there a Mahout job to do this (I'm using trunk 0.6-SNAPSHOT)? If not, can
this be done in a distributed manner using multiple reducers, or would I
have to send all vectors to a single reducer and then use RandomSampler
there?
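
If nothing like this exists, the kind of thing I had in mind is a map-only
Bernoulli sampler: each mapper keeps a vector with probability 2M/6M, so the
output is roughly 2M vectors without funnelling everything through one
reducer (the count is approximate, not exactly 2M; if I need an exact count
I could oversample slightly and trim, or fall back to the single-reducer
approach). Below is a rough, untested sketch assuming the vectors are stored
as SequenceFile<Text, VectorWritable>; the class name BernoulliSampleJob and
the "sample.fraction" configuration key are just placeholders, not an
existing Mahout API.

import java.io.IOException;
import java.util.Random;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.SequenceFileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.SequenceFileOutputFormat;
import org.apache.mahout.math.VectorWritable;

public class BernoulliSampleJob {

  public static class BernoulliSampleMapper
      extends Mapper<Text, VectorWritable, Text, VectorWritable> {

    private double fraction;
    private final Random random = new Random();

    @Override
    protected void setup(Context context) {
      // Placeholder configuration key; defaults to keeping ~1/3 of the vectors.
      fraction = Double.parseDouble(
          context.getConfiguration().get("sample.fraction", "0.33"));
    }

    @Override
    protected void map(Text key, VectorWritable value, Context context)
        throws IOException, InterruptedException {
      // Keep each vector independently with probability `fraction`.
      if (random.nextDouble() < fraction) {
        context.write(key, value);
      }
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    conf.set("sample.fraction", String.valueOf(2000000.0 / 6000000.0));

    Job job = Job.getInstance(conf, "bernoulli-vector-sample");
    job.setJarByClass(BernoulliSampleJob.class);
    job.setMapperClass(BernoulliSampleMapper.class);
    job.setNumReduceTasks(0);  // map-only: no single-reducer bottleneck
    job.setInputFormatClass(SequenceFileInputFormat.class);
    job.setOutputFormatClass(SequenceFileOutputFormat.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(VectorWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}

Does something along these lines already exist in Mahout, or is there a
better way?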

Cheers,
Tim
