Re: Filter grouped data with two percentile

Pierre-Luc Brunet Wed, 14 Sep 2011 09:10:21 -0700

Shawn,

This looks indeed pretty good except for one thing.


I already do a GROUP on my table in order to group my records by "item". If I 
run your code, I end up filtering against the entire data set instead of 
filtering each group individually.  I tried to put your code inside a foreach 
statement without much luck.

Any idea?

--
Pierre


On 2011-09-13, at 5:49 PM, Xiaomeng Wan wrote:

> try this:
> 
> a = group allrecords all;
> b = foreach a generate COUNT(allrecords) as total; //or COUNT_STAR
> 
> c = order allrecords by thevalue;
> d = limit c b.total*0.05;
> 
> e = order allrecords by thevalue desc;
> f = limit e b.total*0.05;
> 
> g = union d, f;
> 
> Shawn
> 
> On Tue, Sep 13, 2011 at 2:39 PM, Pierre-Luc Brunet <[email protected]> wrote:
>> Question.
>> 
>> What would be the best way in Pig to grab a set of data, find the record 
>> that matches the 5th percentile, find the record that matches the 95th 
>> percentile and throw away what's before and after that?
>> 
>> Obviously, my math doesn't work for this.
>> 
>> In hope that it helps clarifying what I'm trying to do, here's how I 
>> currently do it in Javascript:
>> 
>> result.bid_array.sort(function(a,b) { return a - b; });
>> var bid_p5 = Math.round(5/100 * result.bid_array.length);
>> var bid_p95 = Math.round(95/100 * result.bid_array.length);
>> 
>> result.bid_array.splice(bid_p95, result.bid_array.length - ( 
>> result.bid_array.length - bid_p95));
>> result.bid_array.splice(0, bid_p5);
>> 
>> --
>> Pierre-Luc Brunet
>> ZeStuff - http://www.zestuff.com
>> 
>> Phone: (877) 5ZESTUFF
>> Mobile: (514) 600-0234
>> Email: [email protected]
>> 
>> 9320 Saint-Laurent, #502
>> Montreal, QC, Canada, H2N 1N7
>> 
>> On 2011-09-08, at 10:49 PM, Dmitriy Ryaboy wrote:
>> 
>>> If you look at the data for #25 you posted below, you will find that there
>>> is no row such that the price is between 5 and 95%!
>>> khadgar is such an extreme outlier, it moves the 5% line above everyone
>>> else, and of course it itself sets the 100% line.
>>> 
>>> D
>>> 
>>> On Thu, Sep 8, 2011 at 7:03 PM, Pierre-Luc Brunet <[email protected]>wrote:
>>> 
>>>> That worked except that for some reason, there's a lot of data that is
>>>> missing in the final output (compared to what it should return).
>>>> 
>>>> For example, the file I load has these lines:
>>>> 
>>>> 7       25      us      darkspear       a       Redacted        4750
>>>> 5000    1
>>>> 8       25      us      emerald-dream   a       Lornadoome      9500
>>>> 10000   1
>>>> 21      25      eu      khadgar a       Haiibanklol     769499  809999  1
>>>> 7       25      us      queldorei       a       Worfgt  27862   34827   1
>>>> 3       25      us      antonidas       a       Oldcrafter      19000
>>>> 20000   1
>>>> 
>>>> However, when I load up the script http://pastebin.com/Bk8RBAHt (now
>>>> grouped on only one column), I don't have any records with 25 as the key.
>>>> The first 5 rows in my tsv files are
>>>> 
>>>> 35      3.19973415E7
>>>> 36      122914.0
>>>> 37      50000.0
>>>> 38      416099.9
>>>> 39      901333.8571428572
>>>> 43      191496.5
>>>> 44      236454.0
>>>> 
>>>> 
>>>> I really have no idea where the missing rows went :\
>>>> 
>>>> --
>>>> Pierre-Luc Brunet
>>>> ZeStuff - http://www.zestuff.com
>>>> 
>>>> Phone: (877) 5ZESTUFF
>>>> Mobile: (514) 600-0234
>>>> Email: [email protected]
>>>> 
>>>> 9320 Saint-Laurent, #502
>>>> Montreal, QC, Canada, H2N 1N7
>>>> 
>>>> On 2011-09-08, at 8:45 PM, Xiaomeng Wan wrote:
>>>> 
>>>>> you can change
>>>>> 
>>>>> GENERATE group, auctionsPrice.price AS price:tuple, p5 AS p5, p95 AS p95;
>>>>> to
>>>>> GENERATE FLATTEN(group) as (item, region, realm, faction),
>>>>> FLATTEN(auctionsPrice.price) AS price, p5 AS p5, p95 AS p95;
>>>>> 
>>>>> then regroup after the foreach block
>>>>> 
>>>>> p2 = FILTER p1 BY (price >= p5 AND price <= p95);
>>>>> p2a = group p2 by (item, region, realm, faction);
>>>>> p3 = FOREACH p2a GENERATE group, AVG(p2.price) AS price;
>>>>> 
>>>>> or write you own UDF to get the average within the foreach block. It
>>>>> would be ideal if we can move p2 statement into the foreach block like
>>>>> this: p2 = filter autionsPrice by price >= p5 and price <= p95, but i
>>>>> donot think it is supported right now.
>>>>> 
>>>>> Shawn
>>>>> 
>>>>> 
>>>>> On Thu, Sep 8, 2011 at 5:54 PM, Pierre-Luc Brunet <[email protected]>
>>>> wrote:
>>>>>> Heya!
>>>>>> 
>>>>>> I've been trying to do something with Pig for about 4 days now and I
>>>> have nothing but failure to show for it. I was wondering if anybody could
>>>> look at my queries and slap some sense into me? I've uploaded the queries 
>>>> to
>>>> pastebin: http://pastebin.com/kzMxYwrY
>>>>>> 
>>>>>> In short, I want to take my data, group it by 4 fields, then for each
>>>> group, I want to:
>>>>>> - Find out the 5th and the 95th percentile for the 'price'
>>>>>> - Filter each group to remove the records that are < 5th percentile and
>>>>> 95 percentile.
>>>>>> 
>>>>>> Then for each group, I want to grab the AVG() of what's left.
>>>>>> 
>>>>>> I tried many variations of the same code and always ended up with either
>>>> "incompatible types in GreaterThanEqual Operator" or "Scalar has more than
>>>> one row in the output."
>>>>>> 
>>>>>> Any help would be greatly appreciated. Thanks! :)
>>>>>> --
>>>>>> Pierre-Luc Brunet
>>>>>> 
>>>> 
>>>> 
>>>> 
>> 
>> 
>>

Re: Filter grouped data with two percentile

Reply via email to