Re: Filter grouped data with two percentile

Xiaomeng Wan Tue, 13 Sep 2011 14:50:04 -0700

try this:

a = group allrecords all;
b = foreach a generate COUNT(allrecords) as total; //or COUNT_STAR


c = order allrecords by thevalue;
d = limit c b.total*0.05;

e = order allrecords by thevalue desc;
f = limit e b.total*0.05;

g = union d, f;

Shawn

On Tue, Sep 13, 2011 at 2:39 PM, Pierre-Luc Brunet <[email protected]> wrote:
> Question.
>
> What would be the best way in Pig to grab a set of data, find the record that 
> matches the 5th percentile, find the record that matches the 95th percentile 
> and throw away what's before and after that?
>
> Obviously, my math doesn't work for this.
>
> In hope that it helps clarifying what I'm trying to do, here's how I 
> currently do it in Javascript:
>
> result.bid_array.sort(function(a,b) { return a - b; });
> var bid_p5 = Math.round(5/100 * result.bid_array.length);
> var bid_p95 = Math.round(95/100 * result.bid_array.length);
>
> result.bid_array.splice(bid_p95, result.bid_array.length - ( 
> result.bid_array.length - bid_p95));
> result.bid_array.splice(0, bid_p5);
>
> --
> Pierre-Luc Brunet
> ZeStuff - http://www.zestuff.com
>
> Phone: (877) 5ZESTUFF
> Mobile: (514) 600-0234
> Email: [email protected]
>
> 9320 Saint-Laurent, #502
> Montreal, QC, Canada, H2N 1N7
>
> On 2011-09-08, at 10:49 PM, Dmitriy Ryaboy wrote:
>
>> If you look at the data for #25 you posted below, you will find that there
>> is no row such that the price is between 5 and 95%!
>> khadgar is such an extreme outlier, it moves the 5% line above everyone
>> else, and of course it itself sets the 100% line.
>>
>> D
>>
>> On Thu, Sep 8, 2011 at 7:03 PM, Pierre-Luc Brunet <[email protected]>wrote:
>>
>>> That worked except that for some reason, there's a lot of data that is
>>> missing in the final output (compared to what it should return).
>>>
>>> For example, the file I load has these lines:
>>>
>>> 7       25      us      darkspear       a       Redacted        4750
>>> 5000    1
>>> 8       25      us      emerald-dream   a       Lornadoome      9500
>>> 10000   1
>>> 21      25      eu      khadgar a       Haiibanklol     769499  809999  1
>>> 7       25      us      queldorei       a       Worfgt  27862   34827   1
>>> 3       25      us      antonidas       a       Oldcrafter      19000
>>> 20000   1
>>>
>>> However, when I load up the script http://pastebin.com/Bk8RBAHt (now
>>> grouped on only one column), I don't have any records with 25 as the key.
>>> The first 5 rows in my tsv files are
>>>
>>> 35      3.19973415E7
>>> 36      122914.0
>>> 37      50000.0
>>> 38      416099.9
>>> 39      901333.8571428572
>>> 43      191496.5
>>> 44      236454.0
>>>
>>>
>>> I really have no idea where the missing rows went :\
>>>
>>> --
>>> Pierre-Luc Brunet
>>> ZeStuff - http://www.zestuff.com
>>>
>>> Phone: (877) 5ZESTUFF
>>> Mobile: (514) 600-0234
>>> Email: [email protected]
>>>
>>> 9320 Saint-Laurent, #502
>>> Montreal, QC, Canada, H2N 1N7
>>>
>>> On 2011-09-08, at 8:45 PM, Xiaomeng Wan wrote:
>>>
>>>> you can change
>>>>
>>>> GENERATE group, auctionsPrice.price AS price:tuple, p5 AS p5, p95 AS p95;
>>>> to
>>>> GENERATE FLATTEN(group) as (item, region, realm, faction),
>>>> FLATTEN(auctionsPrice.price) AS price, p5 AS p5, p95 AS p95;
>>>>
>>>> then regroup after the foreach block
>>>>
>>>> p2 = FILTER p1 BY (price >= p5 AND price <= p95);
>>>> p2a = group p2 by (item, region, realm, faction);
>>>> p3 = FOREACH p2a GENERATE group, AVG(p2.price) AS price;
>>>>
>>>> or write you own UDF to get the average within the foreach block. It
>>>> would be ideal if we can move p2 statement into the foreach block like
>>>> this: p2 = filter autionsPrice by price >= p5 and price <= p95, but i
>>>> donot think it is supported right now.
>>>>
>>>> Shawn
>>>>
>>>>
>>>> On Thu, Sep 8, 2011 at 5:54 PM, Pierre-Luc Brunet <[email protected]>
>>> wrote:
>>>>> Heya!
>>>>>
>>>>> I've been trying to do something with Pig for about 4 days now and I
>>> have nothing but failure to show for it. I was wondering if anybody could
>>> look at my queries and slap some sense into me? I've uploaded the queries to
>>> pastebin: http://pastebin.com/kzMxYwrY
>>>>>
>>>>> In short, I want to take my data, group it by 4 fields, then for each
>>> group, I want to:
>>>>> - Find out the 5th and the 95th percentile for the 'price'
>>>>> - Filter each group to remove the records that are < 5th percentile and
>>>> 95 percentile.
>>>>>
>>>>> Then for each group, I want to grab the AVG() of what's left.
>>>>>
>>>>> I tried many variations of the same code and always ended up with either
>>> "incompatible types in GreaterThanEqual Operator" or "Scalar has more than
>>> one row in the output."
>>>>>
>>>>> Any help would be greatly appreciated. Thanks! :)
>>>>> --
>>>>> Pierre-Luc Brunet
>>>>>
>>>
>>>
>>>
>
>
>

Re: Filter grouped data with two percentile

Reply via email to