I have no experience with Python UDFs (I use Java), but I doubt the example 
you supplied would work. First, I am not sure a bag is a subclass of 
sequence, which I believe is what random.sample requires as its argument. 
Second, at least in Java, if I remember correctly, you can iterate over a bag 
only once, so unless you know how the sample method works internally, I would 
caution against passing a bag to it. You could read the input bag into a 
sequence and pass that, or you could iterate over the bag, accept each element 
with a certain probability, and spill the accepted elements to an output bag.
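
To make the single-pass idea concrete, here is a rough sketch in plain Python. 
It uses reservoir sampling, which is a swapped-in technique rather than the 
fixed-probability acceptance described above: it still reads the input exactly 
once, but it returns exactly n elements when the bag has at least n. It 
assumes only that the bag is iterable, so it never calls len() on it. The 
optional rng parameter is a hypothetical addition for testing, and the 
@outputSchema decorator from the quoted UDF is left out so the function runs 
outside Pig. I have not tried it inside a Pig job.

```python
import random


def random_subset(bag, n, rng=random):
    """Return at most n items drawn uniformly from `bag` in a single pass.

    `bag` is treated as a plain iterable: no len(), no second iteration.
    n == -1 means "keep everything", as in the quoted UDF.
    """
    if n == -1:
        return list(bag)

    reservoir = []
    for i, item in enumerate(bag):
        if i < n:
            # Fill the reservoir with the first n items.
            reservoir.append(item)
        else:
            # Item i replaces a random slot with probability n / (i + 1).
            j = rng.randint(0, i)  # randint is inclusive at both ends
            if j < n:
                reservoir[j] = item
    return reservoir


# Works on a one-shot iterator, which is the worst case for a bag.
print(random_subset(iter(["a", "b", "c", "d", "e"]), 2))
```

If the bag has n or fewer elements, the loop simply returns all of them, so 
the separate length check in the quoted version is not needed.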


On May 28, 2014, at 1:06 PM, <[email protected]> 
<[email protected]> wrote:

> Thanks Mehmet! I tried that and it seems to work on a small test case. I'm 
> also experimenting now with your other suggestion, a UDF. 
> I will probably use something like this, which seems less tricky and does not 
> rely on a sort:
> 
> #!/usr/bin/python
> import random
> @outputSchema('id_bag: {items: (item: chararray)}')
> def random_subset(bag, n):
>    # return bag if it has <= n elements or n=-1, else return n random elements from it
>    if n == -1 or len(bag) <= n:
>        return bag
>    else:
>        return random.sample(bag, n)
> 
> 
> Thanks again,
> 
> Will
> 
> 
> William F Dowling
> Senior Technologist
> Thomson Reuters
> 
> 
> -----Original Message-----
> From: Mehmet Tepedelenlioglu [mailto:[email protected]] 
> Sent: Tuesday, May 27, 2014 5:09 PM
> To: [email protected] [email protected]
> Subject: Re: How to sample an inner bag?
> 
> If you know how many items you want from each inner bag exactly, you can hack 
> it like this:
> 
> x = foreach x {
>    y = foreach x generate RANDOM() as rnd, *;
>    y = order y by rnd;
>    y = limit y $SAMPLE_NUM;
>    y = foreach y generate $1 ..;
>    generate group, y;
> }
> 
> Basically randomize the inner bag, sort it wrt the random number and limit it 
> to the sample size you want. No reducers needed.
> If the inner bags are huge, ordering will obviously be expensive. If you 
> don’t like this, you might have to write your own udf.
> 
> Mehmet
> 
> On May 27, 2014, at 10:03 AM, <[email protected]> 
> <[email protected]> wrote:
> 
>> Hi Pig users,
>> 
>> Is there an easy/efficient way to sample an inner bag? For example, with 
>> input in a relation like
>> 
>> (id1,att1,{(a,0.01),(b,0.02),(x,0.999749968742)})
>> (id1,att2,{(a,0.03),(b,0.04),(x,0.998749217772)})
>> (id2,att1,{(b,0.05),(c,0.06),(x,0.996945334509)})
>> 
>> I’d like to sample 1/3 the elements of the bags, and get something like 
>> (ignoring the non-determinism)
>> (id1,att1,{(x,0.999749968742)})
>> (id1,att2,{(b,0.04)})
>> (id2,att1,{(b,0.05)})
>> 
>> I have a circumlocution that seems to work using flatten + group, but it 
>> looks ugly to me:
>> 
>> tfidf1 = load '$tfidf' as (id: chararray,
>>                         att: chararray,
>>                         pairs: {pair: (word: chararray, value: double)});
>> 
>> flat_tfidf = foreach tfidf1 generate id, att, FLATTEN(pairs);
>> sample_flat_tfidf = sample flat_tfidf 0.33;
>> tfidf2 = group sample_flat_tfidf by (id, att);
>> 
>> tfidf = foreach tfidf2 {
>>  pairs = foreach sample_flat_tfidf generate pairs::word, pairs::value;
>>  generate group.id, group.att, pairs;
>> };
>> 
>> Can someone suggest a better way to do this?  Many thanks!
>> 
>> William F Dowling
>> Senior Technologist
>> 
>> Thomson Reuters
>> 
>> 
>> 
> 
