Is there a distributed Mahout job that produces a random sample from a large collection of vectors stored in HDFS? For example, if I wanted only 2M vectors randomly selected from the ASF mail archive vectors (~6M total), is there a Mahout job to do this (I'm on trunk, 0.6-SNAPSHOT)? If not, can it be done in a distributed manner with multiple reducers, or would I have to send all vectors to a single reducer and then use RandomSampler there?
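
For reference, the distributed alternative I have in mind is a map-only Bernoulli sample: each mapper independently keeps a vector with probability p = sampleSize / totalSize, so nothing has to flow through a single reducer. This is just a plain-Java sketch of the per-record decision (the class name and the 2M/6M numbers are from my use case, not any Mahout API); in a real job the loop body would live in the Mapper's map() method:

```java
import java.util.Random;

// Sketch of map-side Bernoulli sampling: keep each record with
// probability p = wanted / total. Each mapper can apply this test
// independently, so no single reducer is required.
public class BernoulliSampleSketch {
    public static void main(String[] args) {
        final long total = 6000000L;     // ~6M vectors in the collection
        final long wanted = 2000000L;    // target sample size
        final double p = (double) wanted / total;

        Random rng = new Random(42L);    // fixed seed for reproducibility
        long kept = 0;
        for (long i = 0; i < total; i++) {
            if (rng.nextDouble() < p) {
                kept++;                  // in a mapper: context.write(key, vector)
            }
        }
        // Actual count fluctuates around p * total (~2M here).
        System.out.println(kept);
    }
}
```

The caveat is that this yields a sample of *expected* size 2M rather than exactly 2M, which is why I'm wondering whether an exact-size sampler (reservoir sampling per split plus a final trim, say) already exists in Mahout.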
Cheers, Tim
