I have no experience with Python UDFs (I use Java), but I doubt the example you supplied would work. First, I am not sure that a bag is a subclass of sequence, which is, I believe, what you need to pass to the sample method. Second, at least in Java, if I remember correctly, you can iterate over a bag only once, so unless you know how the sample method works internally, I would caution against passing a bag to it. You could read the input bag into a sequence and pass that instead, or you could iterate over the bag, accept each element with a certain probability, and spill the accepted elements to an output bag.
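To make the two alternatives concrete, here is a rough sketch in plain Python. It assumes only that the bag can be iterated once; I have not checked how Pig actually hands a bag to a Python UDF, so the function names (sample_by_copy, sample_single_pass) are hypothetical and the Pig-specific parts (the outputSchema decorator, tuple handling) are left out. The second function uses reservoir sampling, which is one way to "accept elements with a certain probability" while still ending up with exactly n items.

```python
import random


def sample_by_copy(bag, n):
    """Copy the bag into a list (one pass), then sample from the list.

    Assumes the whole bag fits in memory. n == -1 means "keep everything".
    """
    items = list(bag)  # the only iteration over the bag
    if n == -1 or len(items) <= n:
        return items
    return random.sample(items, n)


def sample_single_pass(bag, n):
    """Reservoir sampling: one pass, at most n items held at a time.

    n == -1 means "keep everything".
    """
    if n == -1:
        return list(bag)
    reservoir = []
    for i, item in enumerate(bag):
        if i < n:
            # Fill the reservoir with the first n items.
            reservoir.append(item)
        else:
            # Keep the new item with probability n / (i + 1),
            # replacing a uniformly chosen earlier pick.
            j = random.randint(0, i)  # inclusive on both ends
            if j < n:
                reservoir[j] = item
    return reservoir
```

Neither version calls len() on the bag or iterates over it more than once, so both sidestep the two concerns above. The first is simpler; the second is the one to prefer if the inner bags can be very large.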
On May 28, 2014, at 1:06 PM, <[email protected]> <[email protected]> wrote:

> Thanks Mehmet! I tried that and it seems to work on a small test case. I'm
> also experimenting now with your other suggestion, a UDF.
> I will probably use something like this, which seems less tricky and does not
> rely on a sort:
>
> #!/usr/bin/python
> import random
> @outputSchema('id_bag: {items: (item: chararray)}')
> def random_subset(bag, n):
>     # return bag if it has <= n elements or n=-1, else return n random elements from it
>     if n == -1 or len(bag) <= n:
>         return bag
>     else:
>         return random.sample(bag, n)
>
>
> Thanks again,
>
> Will
>
>
> William F Dowling
> Senior Technologist
> Thomson Reuters
>
>
> -----Original Message-----
> From: Mehmet Tepedelenlioglu [mailto:[email protected]]
> Sent: Tuesday, May 27, 2014 5:09 PM
> To: [email protected] [email protected]
> Subject: Re: How to sample an inner bag?
>
> If you know how many items you want from each inner bag exactly, you can hack
> it like this:
>
> x = foreach x {
>     y = foreach x generate RANDOM() as rnd, *;
>     y = order y by rnd;
>     y = limit y $SAMPLE_NUM;
>     y = foreach y generate $1 ..;
>     generate group, y;
> }
>
> Basically randomize the inner bag, sort it wrt the random number and limit it
> to the sample size you want. No reducers needed.
> If the inner bags are huge, ordering will obviously be expensive. If you
> don’t like this, you might have to write your own udf.
>
> Mehmet
>
> On May 27, 2014, at 10:03 AM, <[email protected]>
> <[email protected]> wrote:
>
>> Hi Pig users,
>>
>> Is there an easy/efficient way to sample an inner bag? For example, with
>> input in a relation like
>>
>> (id1,att1,{(a,0.01),(b,0.02),(x,0.999749968742)})
>> (id1,att2,{(a,0.03),(b,0.04),(x,0.998749217772)})
>> (id2,att1,{(b,0.05),(c,0.06),(x,0.996945334509)})
>>
>> I’d like to sample 1/3 the elements of the bags, and get something like
>> (ignoring the non-determinism)
>> (id1,att1,{(x,0.999749968742)})
>> (id1,att2,{(b,0.04)})
>> (id2,att1,{(b,0.05)})
>>
>> I have a circumlocution that seems to work using flatten+ group but that
>> looks ugly to me:
>>
>> tfidf1 = load '$tfidf' as (id: chararray,
>>                            att: chararray,
>>                            pairs: {pair: (word: chararray, value: double)});
>>
>> flat_tfidf = foreach tfidf1 generate id, att, FLATTEN(pairs);
>> sample_flat_tfidf = sample flat_tfidf 0.33;
>> tfidf2 = group sample_flat_tfidf by (id, att);
>>
>> tfidf = foreach tfidf2 {
>>     pairs = foreach sample_flat_tfidf generate pairs::word, pairs::value;
>>     generate group.id, group.att, pairs;
>> };
>>
>> Can someone suggest a better way to do this? Many thanks!
>>
>> William F Dowling
>> Senior Technologist
>>
>> Thomson Reuters
