Re: Filter grouped data with two percentile

Pierre-Luc Brunet Tue, 13 Sep 2011 13:51:04 -0700

Question.

What would be the best way in Pig to grab a set of data, find the record that 
matches the 5th percentile, find the record that matches the 95th percentile 
and throw away what's before and after that?


Obviously, my math doesn't work for this.

In hope that it helps clarifying what I'm trying to do, here's how I currently 
do it in Javascript:

result.bid_array.sort(function(a,b) { return a - b; });
var bid_p5 = Math.round(5/100 * result.bid_array.length);
var bid_p95 = Math.round(95/100 * result.bid_array.length);

result.bid_array.splice(bid_p95, result.bid_array.length - ( 
result.bid_array.length - bid_p95));
result.bid_array.splice(0, bid_p5);

--
Pierre-Luc Brunet
ZeStuff - http://www.zestuff.com
 
Phone: (877) 5ZESTUFF
Mobile: (514) 600-0234
Email: [email protected]
 
9320 Saint-Laurent, #502
Montreal, QC, Canada, H2N 1N7

On 2011-09-08, at 10:49 PM, Dmitriy Ryaboy wrote:

> If you look at the data for #25 you posted below, you will find that there
> is no row such that the price is between 5 and 95%!
> khadgar is such an extreme outlier, it moves the 5% line above everyone
> else, and of course it itself sets the 100% line.
> 
> D
> 
> On Thu, Sep 8, 2011 at 7:03 PM, Pierre-Luc Brunet <[email protected]>wrote:
> 
>> That worked except that for some reason, there's a lot of data that is
>> missing in the final output (compared to what it should return).
>> 
>> For example, the file I load has these lines:
>> 
>> 7       25      us      darkspear       a       Redacted        4750
>> 5000    1
>> 8       25      us      emerald-dream   a       Lornadoome      9500
>> 10000   1
>> 21      25      eu      khadgar a       Haiibanklol     769499  809999  1
>> 7       25      us      queldorei       a       Worfgt  27862   34827   1
>> 3       25      us      antonidas       a       Oldcrafter      19000
>> 20000   1
>> 
>> However, when I load up the script http://pastebin.com/Bk8RBAHt (now
>> grouped on only one column), I don't have any records with 25 as the key.
>> The first 5 rows in my tsv files are
>> 
>> 35      3.19973415E7
>> 36      122914.0
>> 37      50000.0
>> 38      416099.9
>> 39      901333.8571428572
>> 43      191496.5
>> 44      236454.0
>> 
>> 
>> I really have no idea where the missing rows went :\
>> 
>> --
>> Pierre-Luc Brunet
>> ZeStuff - http://www.zestuff.com
>> 
>> Phone: (877) 5ZESTUFF
>> Mobile: (514) 600-0234
>> Email: [email protected]
>> 
>> 9320 Saint-Laurent, #502
>> Montreal, QC, Canada, H2N 1N7
>> 
>> On 2011-09-08, at 8:45 PM, Xiaomeng Wan wrote:
>> 
>>> you can change
>>> 
>>> GENERATE group, auctionsPrice.price AS price:tuple, p5 AS p5, p95 AS p95;
>>> to
>>> GENERATE FLATTEN(group) as (item, region, realm, faction),
>>> FLATTEN(auctionsPrice.price) AS price, p5 AS p5, p95 AS p95;
>>> 
>>> then regroup after the foreach block
>>> 
>>> p2 = FILTER p1 BY (price >= p5 AND price <= p95);
>>> p2a = group p2 by (item, region, realm, faction);
>>> p3 = FOREACH p2a GENERATE group, AVG(p2.price) AS price;
>>> 
>>> or write you own UDF to get the average within the foreach block. It
>>> would be ideal if we can move p2 statement into the foreach block like
>>> this: p2 = filter autionsPrice by price >= p5 and price <= p95, but i
>>> donot think it is supported right now.
>>> 
>>> Shawn
>>> 
>>> 
>>> On Thu, Sep 8, 2011 at 5:54 PM, Pierre-Luc Brunet <[email protected]>
>> wrote:
>>>> Heya!
>>>> 
>>>> I've been trying to do something with Pig for about 4 days now and I
>> have nothing but failure to show for it. I was wondering if anybody could
>> look at my queries and slap some sense into me? I've uploaded the queries to
>> pastebin: http://pastebin.com/kzMxYwrY
>>>> 
>>>> In short, I want to take my data, group it by 4 fields, then for each
>> group, I want to:
>>>> - Find out the 5th and the 95th percentile for the 'price'
>>>> - Filter each group to remove the records that are < 5th percentile and
>>> 95 percentile.
>>>> 
>>>> Then for each group, I want to grab the AVG() of what's left.
>>>> 
>>>> I tried many variations of the same code and always ended up with either
>> "incompatible types in GreaterThanEqual Operator" or "Scalar has more than
>> one row in the output."
>>>> 
>>>> Any help would be greatly appreciated. Thanks! :)
>>>> --
>>>> Pierre-Luc Brunet
>>>> 
>> 
>> 
>>

Re: Filter grouped data with two percentile

Reply via email to