RE: How to sample an inner bag?

william.dowling Wed, 28 May 2014 13:07:55 -0700

Thanks Mehmet! I tried that and it seems to work on a small test case. I'm also 
experimenting now with your other suggestion, a UDF. 
I will probably use something like this, which seems less tricky and does not 
rely on a sort:


#!/usr/bin/python
import random
@outputSchema('id_bag: {items: (item: chararray)}')
def random_subset(bag, n):
    # Return the bag unchanged if n == -1 or it has <= n elements;
    # otherwise return n elements chosen at random, without replacement.
    if n == -1 or len(bag) <= n:
        return bag
    else:
        return random.sample(bag, n)
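
For checking the function outside Pig, here is a minimal sketch. It assumes the bag reaches the UDF as a Python list of tuples, and it defines a stand-in outputSchema decorator (hypothetical, only so the file runs in plain Python; under Pig the real decorator is supplied by the runtime). The sample data is borrowed from the original question.

```python
import random

def outputSchema(schema):
    # Stand-in for the decorator Pig provides; assumption for local testing only.
    def wrap(func):
        func.outputSchema = schema
        return func
    return wrap

@outputSchema('id_bag: {items: (item: chararray)}')
def random_subset(bag, n):
    # Return the bag unchanged if n == -1 or it has <= n elements;
    # otherwise return n elements chosen at random, without replacement.
    if n == -1 or len(bag) <= n:
        return bag
    return random.sample(bag, n)

# Assumed shape of an inner bag: a list of (word, value) tuples.
bag = [('a', 0.01), ('b', 0.02), ('x', 0.999749968742)]
small = random_subset(bag, 1)   # one randomly chosen tuple, in a list
whole = random_subset(bag, -1)  # n == -1 means "keep everything"
```

Note that a negative n other than -1 would reach random.sample and raise ValueError, so callers should stick to -1 or a non-negative count.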


Thanks again,

Will


William F Dowling
Senior Technologist
Thomson Reuters


-----Original Message-----
From: Mehmet Tepedelenlioglu [mailto:[email protected]] 
Sent: Tuesday, May 27, 2014 5:09 PM
To: [email protected] [email protected]
Subject: Re: How to sample an inner bag?

If you know exactly how many items you want from each inner bag, you can hack 
it like this:

x = foreach x {
    y = foreach x generate RANDOM() as rnd, *;
    y = order y by rnd;
    y = limit y $SAMPLE_NUM;
    y = foreach y generate $1 ..;
    generate group, y;
}

Basically, attach a random number to each tuple in the inner bag, sort the bag 
by that number, and limit it to the sample size you want. No reducers needed.
If the inner bags are huge, ordering will obviously be expensive. If you don’t 
like this, you might have to write your own UDF.
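
To make the idea concrete outside Pig, here is a rough Python sketch of the same random-key, sort, limit sequence. The function name sample_by_random_sort and the list-of-tuples bag are assumptions for illustration only, not part of the Pig script above.

```python
import random

def sample_by_random_sort(bag, n):
    # Step 1: pair every item with a random key (like RANDOM() as rnd).
    keyed = [(random.random(), item) for item in bag]
    # Step 2: sort on the random key only (like "order y by rnd").
    keyed.sort(key=lambda pair: pair[0])
    # Step 3: keep the first n and drop the key (like limit, then $1 ..).
    return [item for _, item in keyed[:n]]

# Hypothetical inner bag, using values from the original question.
bag = [('a', 0.01), ('b', 0.02), ('x', 0.999749968742)]
picked = sample_by_random_sort(bag, 2)
```

The sort makes this O(m log m) per bag of m items, which is the cost mentioned above for huge inner bags.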

Mehmet

On May 27, 2014, at 10:03 AM, <[email protected]> 
<[email protected]> wrote:

> Hi Pig users,
> 
> Is there an easy/efficient way to sample an inner bag? For example, with 
> input in a relation like
> 
> (id1,att1,{(a,0.01),(b,0.02),(x,0.999749968742)})
> (id1,att2,{(a,0.03),(b,0.04),(x,0.998749217772)})
> (id2,att1,{(b,0.05),(c,0.06),(x,0.996945334509)})
> 
> I’d like to sample 1/3 the elements of the bags, and get something like 
> (ignoring the non-determinism)
> (id1,att1,{(x,0.999749968742)})
> (id1,att2,{(b,0.04)})
> (id2,att1,{(b,0.05)})
> 
> I have a circumlocution that seems to work using flatten + group, but that 
> looks ugly to me:
> 
> tfidf1 = load '$tfidf' as (id: chararray,
>                          att: chararray,
>                          pairs: {pair: (word: chararray, value: double)});
> 
> flat_tfidf = foreach tfidf1 generate id, att, FLATTEN(pairs);
> sample_flat_tfidf = sample flat_tfidf 0.33;
> tfidf2 = group sample_flat_tfidf by (id, att);
> 
> tfidf = foreach tfidf2 {
>   pairs = foreach sample_flat_tfidf generate pairs::word, pairs::value;
>   generate group.id, group.att, pairs;
> };
> 
> Can someone suggest a better way to do this?  Many thanks!
> 
> William F Dowling
> Senior Technologist
> 
> Thomson Reuters
> 
> 
>
