Re: Filter grouped data with two percentile

Pierre-Luc Brunet Wed, 14 Sep 2011 10:00:10 -0700

Under pig  0.9.1-SNAPSHOT, I get:

2011-09-14 12:54:10,075 [main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR 1200: <line 7, column 12>  Syntax error, unexpected symbol at or near 'y'


Under pig 0.8.1-cdh3u1, I get:

2011-09-14 12:55:35,765 [main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR 1000: Error during parsing. Encountered " <IDENTIFIER> "x "" at line 4, column 15.

Any idea what's the problem? 

--
Pierre

On 2011-09-14, at 12:52 PM, Xiaomeng Wan wrote:

> wrong button or what? not sure, anyway, try this:
> 
> a = group records by id;
> b = foreach a {
>     x = COUNT(records);
>     y = order records by thevalue; z = limit y x*0.95;
>     z1 = order records by thevalue desc; z2 = limit z1 x*0.9;
>     generate group, z2;
> }
> 
> I have never tried this before; if it does not work, you will need to find another workaround.
> 
> Shawn
> 
> On Wed, Sep 14, 2011 at 10:26 AM, Xiaomeng Wan <[email protected]> wrote:
>> Pierre,
>> 
>> Union is not allowed within foreach. Fortunately, you do not need it. I
>> just realized the code I gave you does not generate what you want;
>> it actually generates the complement of what you want. Try something
>> like this:
>> 
>> a = group records by id;
>> b = foreach a {
>> 
>> 
>> 
>> 
>> On Wed, Sep 14, 2011 at 10:09 AM, Pierre-Luc Brunet
>> <[email protected]> wrote:
>>> Shawn,
>>> 
>>> This looks indeed pretty good except for one thing.
>>> 
>>> I already do a GROUP on my table in order to group my records by "item". If 
>>> I run your code, I end up filtering against the entire data set instead of 
>>> filtering each group individually.  I tried to put your code inside a 
>>> foreach statement without much luck.
>>> 
>>> Any idea?
>>> 
>>> --
>>> Pierre
>>> 
>>> 
>>> On 2011-09-13, at 5:49 PM, Xiaomeng Wan wrote:
>>> 
>>>> try this:
>>>> 
>>>> a = group allrecords all;
>>>> b = foreach a generate COUNT(allrecords) as total; -- or COUNT_STAR
>>>> 
>>>> c = order allrecords by thevalue;
>>>> d = limit c b.total*0.05;
>>>> 
>>>> e = order allrecords by thevalue desc;
>>>> f = limit e b.total*0.05;
>>>> 
>>>> g = union d, f;
>>>> 
>>>> Shawn
>>>> 
>>>> On Tue, Sep 13, 2011 at 2:39 PM, Pierre-Luc Brunet <[email protected]> 
>>>> wrote:
>>>>> Question.
>>>>> 
>>>>> What would be the best way in Pig to grab a set of data, find the record 
>>>>> that matches the 5th percentile, find the record that matches the 95th 
>>>>> percentile and throw away what's before and after that?
>>>>> 
>>>>> Obviously, my math doesn't work for this.
>>>>> 
>>>>> In the hope that it helps clarify what I'm trying to do, here's how I 
>>>>> currently do it in JavaScript:
>>>>> 
>>>>> result.bid_array.sort(function(a,b) { return a - b; });
>>>>> var bid_p5 = Math.round(5/100 * result.bid_array.length);
>>>>> var bid_p95 = Math.round(95/100 * result.bid_array.length);
>>>>> 
>>>>> result.bid_array.splice(bid_p95, result.bid_array.length - bid_p95);
>>>>> result.bid_array.splice(0, bid_p5);
>>>>> 
>>>>> --
>>>>> Pierre-Luc Brunet
>>>>> ZeStuff - http://www.zestuff.com
>>>>> 
>>>>> Phone: (877) 5ZESTUFF
>>>>> Mobile: (514) 600-0234
>>>>> Email: [email protected]
>>>>> 
>>>>> 9320 Saint-Laurent, #502
>>>>> Montreal, QC, Canada, H2N 1N7
>>>>> 
>>>>> On 2011-09-08, at 10:49 PM, Dmitriy Ryaboy wrote:
>>>>> 
>>>>>> If you look at the data for #25 you posted below, you will find that
>>>>>> there is no row such that the price is between 5 and 95%!
>>>>>> khadgar is such an extreme outlier, it moves the 5% line above everyone
>>>>>> else, and of course it itself sets the 100% line.
>>>>>> 
>>>>>> D
>>>>>> 
>>>>>> On Thu, Sep 8, 2011 at 7:03 PM, Pierre-Luc Brunet 
>>>>>> <[email protected]> wrote:
>>>>>> 
>>>>>>> That worked except that for some reason, there's a lot of data that is
>>>>>>> missing in the final output (compared to what it should return).
>>>>>>> 
>>>>>>> For example, the file I load has these lines:
>>>>>>> 
>>>>>>> 7       25      us      darkspear       a       Redacted        4750    5000    1
>>>>>>> 8       25      us      emerald-dream   a       Lornadoome      9500    10000   1
>>>>>>> 21      25      eu      khadgar         a       Haiibanklol     769499  809999  1
>>>>>>> 7       25      us      queldorei       a       Worfgt          27862   34827   1
>>>>>>> 3       25      us      antonidas       a       Oldcrafter      19000   20000   1
>>>>>>> 
>>>>>>> However, when I load up the script http://pastebin.com/Bk8RBAHt (now
>>>>>>> grouped on only one column), I don't have any records with 25 as the key.
>>>>>>> The first few rows in my tsv file are
>>>>>>> 
>>>>>>> 35      3.19973415E7
>>>>>>> 36      122914.0
>>>>>>> 37      50000.0
>>>>>>> 38      416099.9
>>>>>>> 39      901333.8571428572
>>>>>>> 43      191496.5
>>>>>>> 44      236454.0
>>>>>>> 
>>>>>>> 
>>>>>>> I really have no idea where the missing rows went :\
>>>>>>> 
>>>>>>> --
>>>>>>> Pierre-Luc Brunet
>>>>>>> 
>>>>>>> On 2011-09-08, at 8:45 PM, Xiaomeng Wan wrote:
>>>>>>> 
>>>>>>>> you can change
>>>>>>>> 
>>>>>>>> GENERATE group, auctionsPrice.price AS price:tuple, p5 AS p5, p95 AS p95;
>>>>>>>> to
>>>>>>>> GENERATE FLATTEN(group) as (item, region, realm, faction),
>>>>>>>> FLATTEN(auctionsPrice.price) AS price, p5 AS p5, p95 AS p95;
>>>>>>>> 
>>>>>>>> then regroup after the foreach block
>>>>>>>> 
>>>>>>>> p2 = FILTER p1 BY (price >= p5 AND price <= p95);
>>>>>>>> p2a = group p2 by (item, region, realm, faction);
>>>>>>>> p3 = FOREACH p2a GENERATE group, AVG(p2.price) AS price;
>>>>>>>> 
>>>>>>>> or write your own UDF to get the average within the foreach block. It
>>>>>>>> would be ideal if we could move the p2 statement into the foreach block like
>>>>>>>> this: p2 = filter auctionsPrice by price >= p5 and price <= p95, but I
>>>>>>>> do not think it is supported right now.
>>>>>>>> 
>>>>>>>> Shawn
>>>>>>>> 
>>>>>>>> 
>>>>>>>> On Thu, Sep 8, 2011 at 5:54 PM, Pierre-Luc Brunet <[email protected]> wrote:
>>>>>>>>> Heya!
>>>>>>>>> 
>>>>>>>>> I've been trying to do something with Pig for about 4 days now and I
>>>>>>>>> have nothing but failure to show for it. I was wondering if anybody could
>>>>>>>>> look at my queries and slap some sense into me? I've uploaded the queries to
>>>>>>>>> pastebin: http://pastebin.com/kzMxYwrY
>>>>>>>>> 
>>>>>>>>> In short, I want to take my data, group it by 4 fields, then for each
>>>>>>>>> group, I want to:
>>>>>>>>> - Find out the 5th and the 95th percentile for the 'price'
>>>>>>>>> - Filter each group to remove the records that are < 5th percentile
>>>>>>>>> and > 95th percentile.
>>>>>>>>> 
>>>>>>>>> Then for each group, I want to grab the AVG() of what's left.
>>>>>>>>> 
>>>>>>>>> I tried many variations of the same code and always ended up with either
>>>>>>>>> "incompatible types in GreaterThanEqual Operator" or "Scalar has more than
>>>>>>>>> one row in the output."
>>>>>>>>> 
>>>>>>>>> Any help would be greatly appreciated. Thanks! :)
>>>>>>>>> --
>>>>>>>>> Pierre-Luc Brunet
>>>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>> 
>>>>> 
>>>>> 
>>> 
>>> 
>>> 
>>
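
For reference, the per-group logic discussed in this thread can be sketched in JavaScript, the language of Pierre's own snippet. This is not Pig code and not anyone's actual script: the function names and the {item, price} row shape are hypothetical, and treating the first numeric column of the sample rows as the price is an assumption. The last helper illustrates one possible reading of Dmitriy's remark, namely that the bounds were computed as 5% and 95% of the maximum price rather than by rank. The pastebin script is not reproduced here, so that reading is also an assumption.

```javascript
// Sketch only. Rank-based trimming per group, mirroring the JavaScript
// snippet earlier in the thread. All names here are hypothetical.

// Keep the values whose sorted position lies in [round(5% n), round(95% n)).
function trimByRank(values, lower = 0.05, upper = 0.95) {
  const sorted = [...values].sort((a, b) => a - b); // numeric sort on a copy
  const lo = Math.round(lower * sorted.length);
  const hi = Math.round(upper * sorted.length);
  return sorted.slice(lo, hi);
}

// Arithmetic mean, or null for an empty list.
function mean(values) {
  if (values.length === 0) return null;
  return values.reduce((sum, v) => sum + v, 0) / values.length;
}

// Group rows by key, trim each group by rank, then average what is left.
function trimmedMeanByGroup(rows, keyOf, valueOf) {
  const groups = new Map();
  for (const row of rows) {
    const key = keyOf(row);
    if (!groups.has(key)) groups.set(key, []);
    groups.get(key).push(valueOf(row));
  }
  const result = new Map();
  for (const [key, values] of groups) {
    result.set(key, mean(trimByRank(values)));
  }
  return result;
}

// ASSUMPTION: one reading of the "missing rows" problem, where the bounds
// are fractions of the maximum value instead of positions in sorted order.
function keepWithinFractionOfMax(values, lower = 0.05, upper = 0.95) {
  const max = Math.max(...values);
  return values.filter((v) => v >= lower * max && v <= upper * max);
}

// First numeric column of the five item-25 rows quoted in the thread
// (assumed to be the price).
const item25 = [4750, 9500, 769499, 27862, 19000];
const byRank = trimByRank(item25);
const byValue = keepWithinFractionOfMax(item25);
```

With these five rows, byRank keeps all five values, because round(0.05 * 5) is 0 and round(0.95 * 5) is 5. byValue comes back empty: 5% of 769499 is about 38475, which is above every other price, and 769499 itself is above the 95% bound. That matches Dmitriy's observation that the khadgar row pushes the 5% line above everyone else.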
