Shawn, This looks indeed pretty good except for one thing.
I already do a GROUP on my table in order to group my records by "item". If I run your code, I end up filtering against the entire data set instead of filtering each group individually. I tried to put your code inside a foreach statement without much luck. Any idea? -- Pierre On 2011-09-13, at 5:49 PM, Xiaomeng Wan wrote: > try this: > > a = group allrecords all; > b = foreach a generate COUNT(allrecords) as total; //or COUNT_STAR > > c = order allrecords by thevalue; > d = limit c b.total*0.05; > > e = order allrecords by thevalue desc; > f = limit e b.total*0.05; > > g = union d, f; > > Shawn > > On Tue, Sep 13, 2011 at 2:39 PM, Pierre-Luc Brunet <[email protected]> wrote: >> Question. >> >> What would be the best way in Pig to grab a set of data, find the record >> that matches the 5th percentile, find the record that matches the 95th >> percentile and throw away what's before and after that? >> >> Obviously, my math doesn't work for this. >> >> In hope that it helps clarifying what I'm trying to do, here's how I >> currently do it in Javascript: >> >> result.bid_array.sort(function(a,b) { return a - b; }); >> var bid_p5 = Math.round(5/100 * result.bid_array.length); >> var bid_p95 = Math.round(95/100 * result.bid_array.length); >> >> result.bid_array.splice(bid_p95, result.bid_array.length - ( >> result.bid_array.length - bid_p95)); >> result.bid_array.splice(0, bid_p5); >> >> -- >> Pierre-Luc Brunet >> ZeStuff - http://www.zestuff.com >> >> Phone: (877) 5ZESTUFF >> Mobile: (514) 600-0234 >> Email: [email protected] >> >> 9320 Saint-Laurent, #502 >> Montreal, QC, Canada, H2N 1N7 >> >> On 2011-09-08, at 10:49 PM, Dmitriy Ryaboy wrote: >> >>> If you look at the data for #25 you posted below, you will find that there >>> is no row such that the price is between 5 and 95%! >>> khadgar is such an extreme outlier, it moves the 5% line above everyone >>> else, and of course it itself sets the 100% line. >>> >>> D >>> >>> On Thu, Sep 8, 2011 at 7:03 PM, Pierre-Luc Brunet <[email protected]>wrote: >>> >>>> That worked except that for some reason, there's a lot of data that is >>>> missing in the final output (compared to what it should return). >>>> >>>> For example, the file I load has these lines: >>>> >>>> 7 25 us darkspear a Redacted 4750 >>>> 5000 1 >>>> 8 25 us emerald-dream a Lornadoome 9500 >>>> 10000 1 >>>> 21 25 eu khadgar a Haiibanklol 769499 809999 1 >>>> 7 25 us queldorei a Worfgt 27862 34827 1 >>>> 3 25 us antonidas a Oldcrafter 19000 >>>> 20000 1 >>>> >>>> However, when I load up the script http://pastebin.com/Bk8RBAHt (now >>>> grouped on only one column), I don't have any records with 25 as the key. >>>> The first 5 rows in my tsv files are >>>> >>>> 35 3.19973415E7 >>>> 36 122914.0 >>>> 37 50000.0 >>>> 38 416099.9 >>>> 39 901333.8571428572 >>>> 43 191496.5 >>>> 44 236454.0 >>>> >>>> >>>> I really have no idea where the missing rows went :\ >>>> >>>> -- >>>> Pierre-Luc Brunet >>>> ZeStuff - http://www.zestuff.com >>>> >>>> Phone: (877) 5ZESTUFF >>>> Mobile: (514) 600-0234 >>>> Email: [email protected] >>>> >>>> 9320 Saint-Laurent, #502 >>>> Montreal, QC, Canada, H2N 1N7 >>>> >>>> On 2011-09-08, at 8:45 PM, Xiaomeng Wan wrote: >>>> >>>>> you can change >>>>> >>>>> GENERATE group, auctionsPrice.price AS price:tuple, p5 AS p5, p95 AS p95; >>>>> to >>>>> GENERATE FLATTEN(group) as (item, region, realm, faction), >>>>> FLATTEN(auctionsPrice.price) AS price, p5 AS p5, p95 AS p95; >>>>> >>>>> then regroup after the foreach block >>>>> >>>>> p2 = FILTER p1 BY (price >= p5 AND price <= p95); >>>>> p2a = group p2 by (item, region, realm, faction); >>>>> p3 = FOREACH p2a GENERATE group, AVG(p2.price) AS price; >>>>> >>>>> or write you own UDF to get the average within the foreach block. It >>>>> would be ideal if we can move p2 statement into the foreach block like >>>>> this: p2 = filter autionsPrice by price >= p5 and price <= p95, but i >>>>> donot think it is supported right now. >>>>> >>>>> Shawn >>>>> >>>>> >>>>> On Thu, Sep 8, 2011 at 5:54 PM, Pierre-Luc Brunet <[email protected]> >>>> wrote: >>>>>> Heya! >>>>>> >>>>>> I've been trying to do something with Pig for about 4 days now and I >>>> have nothing but failure to show for it. I was wondering if anybody could >>>> look at my queries and slap some sense into me? I've uploaded the queries >>>> to >>>> pastebin: http://pastebin.com/kzMxYwrY >>>>>> >>>>>> In short, I want to take my data, group it by 4 fields, then for each >>>> group, I want to: >>>>>> - Find out the 5th and the 95th percentile for the 'price' >>>>>> - Filter each group to remove the records that are < 5th percentile and >>>>> 95 percentile. >>>>>> >>>>>> Then for each group, I want to grab the AVG() of what's left. >>>>>> >>>>>> I tried many variations of the same code and always ended up with either >>>> "incompatible types in GreaterThanEqual Operator" or "Scalar has more than >>>> one row in the output." >>>>>> >>>>>> Any help would be greatly appreciated. Thanks! :) >>>>>> -- >>>>>> Pierre-Luc Brunet >>>>>> >>>> >>>> >>>> >> >> >>
