Do the parallel sampler mappers need to be deterministic? That is, do
they all start with the same random seed?

Can the mapper generate a high-quality hash of each vector and throw
away the vectors whose hash lands in some part of the output space?
This would serve as a first cut in the mapper. Using the hash (or part
of the hash) as the key for the remaining vectors allows tuning the
number of keys vs. how many samples each reducer receives.
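
A minimal sketch of the hash idea, in Python rather than as a Mahout
mapper. The names map_vector, keep_fraction and num_keys are
hypothetical, SHA-256 is only an assumed stand-in for "a high-quality
hash", and the vector is assumed to be available as serialized bytes:

```python
import hashlib

def map_vector(vector_bytes, keep_fraction, num_keys):
    """Return (key, vector_bytes) if the vector survives the cut, else None.

    The first 8 bytes of the digest decide whether the vector is kept;
    the next 8 bytes pick the reducer key, so the key does not depend on
    the bits already used for the cut.
    """
    digest = hashlib.sha256(vector_bytes).digest()
    cut_bits = int.from_bytes(digest[:8], "big")
    # Integer threshold avoids float rounding near keep_fraction == 1.0.
    threshold = int(keep_fraction * 2**64)
    if cut_bits >= threshold:
        return None  # falls in the discarded part of the output space
    key = int.from_bytes(digest[8:16], "big") % num_keys
    return key, vector_bytes

# Usage: keep roughly 1/3 of the vectors, spread over 4 reducer keys.
survivors = [map_vector(str(i).encode(), 1 / 3, 4) for i in range(9)]
survivors = [kv for kv in survivors if kv is not None]
```

A cut like this is deterministic with no seed at all: the same vector
always gets the same decision in every mapper. It also means exact
duplicate vectors are kept or dropped together, and the number of
survivors is only approximately keep_fraction of the input, so a final
exact-size sample would still have to happen downstream.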

Lance

On Mon, Aug 8, 2011 at 10:18 PM, Ted Dunning <[email protected]> wrote:
> The trick is to create a reservoir sampling system with different virtual
> pools
>
> On Mon, Aug 8, 2011 at 9:20 PM, Timothy Potter <[email protected]> wrote:
>
>> Hi Ted,
>>
>> Thanks for the response. I'll implement, open a ticket, and post a patch
>> after I'm satisfied with the outcome.
>>
>> Cheers,
>> Tim
>>
>> On Mon, Aug 8, 2011 at 1:34 PM, Ted Dunning <[email protected]> wrote:
>>
>> > There is not such a thing now.  It should be relatively easy to build.
>> > The simplest method is to have each mapper produce a full-sized sample
>> > which is sent to a single reducer which produces another sample.  The
>> > output of the mappers needs to have a count of items retained and items
>> > considered in order for this to work correctly.
>> >
>> > This cuts down on the amount of data that the reducer has to handle but
>> > is similar in many respects.
>> >
>> > On Mon, Aug 8, 2011 at 11:47 AM, Timothy Potter <[email protected]> wrote:
>> >
>> > > Is there a distributed Mahout job to produce a random sample for a
>> > > large collection of vectors stored in HDFS? For example, if I wanted
>> > > only 2M vectors randomly selected from the ASF mail archive vectors
>> > > (~6M total), is there a Mahout job to do this (I'm using trunk
>> > > 0.6-SNAPSHOT)? If not, can this be done in a distributed manner using
>> > > multiple reducers or would I have to send all vectors to 1 reducer and
>> > > then use RandomSampler in the single reducer?
>> > >
>> > > Cheers,
>> > > Tim
>> > >
>> >
>>
>
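
For reference, the two-stage scheme described in the quoted thread (each
mapper emits a full-sized sample plus counts, and one reducer resamples)
could look roughly like the sketch below. It is plain Python, not Mahout
code. The function names are hypothetical, and weighting each draw by the
number of items a mapper considered is one assumed way to use those
counts:

```python
import random

def reservoir_sample(items, k, rng):
    """Mapper side: Algorithm R. Returns (sample, items_considered)."""
    sample = []
    considered = 0
    for item in items:
        considered += 1
        if len(sample) < k:
            sample.append(item)
        else:
            j = rng.randrange(considered)
            if j < k:
                sample[j] = item
    return sample, considered

def merge_samples(partials, k, rng):
    """Reducer side: combine (sample, considered) pairs into one sample.

    Each draw picks a mapper with probability proportional to how many of
    its considered items have not been drawn yet, then takes a random
    element of that mapper's sample.
    """
    samples = [list(s) for s, n in partials if n > 0]
    remaining = [n for _, n in partials if n > 0]
    total = sum(remaining)
    result = []
    while len(result) < k and total > 0:
        r = rng.randrange(total)
        i = 0
        while r >= remaining[i]:
            r -= remaining[i]
            i += 1
        result.append(samples[i].pop(rng.randrange(len(samples[i]))))
        remaining[i] -= 1
        total -= 1
    return result

# Usage: three input splits of very different sizes, final sample of 10.
rng = random.Random(42)
splits = [range(0, 1000), range(1000, 1500), range(1500, 1503)]
partials = [reservoir_sample(s, 10, rng) for s in splits]
final = merge_samples(partials, 10, rng)
```

Without the "considered" counts the reducer would have to treat every
mapper's sample as equally important, which over-represents vectors from
small splits.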



-- 
Lance Norskog
[email protected]
