Re: Filter grouped data with two percentile

Xiaomeng Wan Wed, 14 Sep 2011 12:03:36 -0700

it is just not supported, similar to the case I mentioned before. I
would suggest you write your own UDF to do that.


Shawn

On Wed, Sep 14, 2011 at 10:59 AM, Pierre-Luc Brunet
<[email protected]> wrote:
> Under pig  0.9.1-SNAPSHOT, I get:
>
> 2011-09-14 12:54:10,075 [main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR 
> 1200: <line 7, column 12>  Syntax error, unexpected symbol at or near 'y'
>
> Under pig 0.8.1-cdh3u1, I get:
>
> 2011-09-14 12:55:35,765 [main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR 
> 1000: Error during parsing. Encountered " <IDENTIFIER> "x "" at line 4, 
> column 15.
>
> Any idea what's the problem?
>
> --
> Pierre
>
> On 2011-09-14, at 12:52 PM, Xiaomeng Wan wrote:
>
>> wrong button or what? not sure, anyway, try this:
>>
>> a = group records by id;
>> b = foreach a { x = COUNT(records); y = order records by thevalue; z =
>> limit y x*0.95; z1 = order records by thevalue desc; z2 = limit z1
>> x*0.9; generate group, z2; }
>>
>> never try this before, if no luck, you need to find other workaround.
>>
>> Shawn
>>
>> On Wed, Sep 14, 2011 at 10:26 AM, Xiaomeng Wan <[email protected]> wrote:
>>> Pierre,
>>>
>>> Union is not allowed within foreach. Fortunately, you donot need it. I
>>> just realize the code I give you doesnot generate what you want,
>>> actually it generates the complement of what you want. Try something
>>> like this:
>>>
>>> a = group records by id;
>>> b = foreach a {
>>>
>>>
>>>
>>>
>>> On Wed, Sep 14, 2011 at 10:09 AM, Pierre-Luc Brunet
>>> <[email protected]> wrote:
>>>> Shawn,
>>>>
>>>> This looks indeed pretty good except for one thing.
>>>>
>>>> I already do a GROUP on my table in order to group my records by "item". 
>>>> If I run your code, I end up filtering against the entire data set instead 
>>>> of filtering each group individually.  I tried to put your code inside a 
>>>> foreach statement without much luck.
>>>>
>>>> Any idea?
>>>>
>>>> --
>>>> Pierre
>>>>
>>>>
>>>> On 2011-09-13, at 5:49 PM, Xiaomeng Wan wrote:
>>>>
>>>>> try this:
>>>>>
>>>>> a = group allrecords all;
>>>>> b = foreach a generate COUNT(allrecords) as total; //or COUNT_STAR
>>>>>
>>>>> c = order allrecords by thevalue;
>>>>> d = limit c b.total*0.05;
>>>>>
>>>>> e = order allrecords by thevalue desc;
>>>>> f = limit e b.total*0.05;
>>>>>
>>>>> g = union d, f;
>>>>>
>>>>> Shawn
>>>>>
>>>>> On Tue, Sep 13, 2011 at 2:39 PM, Pierre-Luc Brunet <[email protected]> 
>>>>> wrote:
>>>>>> Question.
>>>>>>
>>>>>> What would be the best way in Pig to grab a set of data, find the record 
>>>>>> that matches the 5th percentile, find the record that matches the 95th 
>>>>>> percentile and throw away what's before and after that?
>>>>>>
>>>>>> Obviously, my math doesn't work for this.
>>>>>>
>>>>>> In hope that it helps clarifying what I'm trying to do, here's how I 
>>>>>> currently do it in Javascript:
>>>>>>
>>>>>> result.bid_array.sort(function(a,b) { return a - b; });
>>>>>> var bid_p5 = Math.round(5/100 * result.bid_array.length);
>>>>>> var bid_p95 = Math.round(95/100 * result.bid_array.length);
>>>>>>
>>>>>> result.bid_array.splice(bid_p95, result.bid_array.length - ( 
>>>>>> result.bid_array.length - bid_p95));
>>>>>> result.bid_array.splice(0, bid_p5);
>>>>>>
>>>>>> --
>>>>>> Pierre-Luc Brunet
>>>>>> ZeStuff - http://www.zestuff.com
>>>>>>
>>>>>> Phone: (877) 5ZESTUFF
>>>>>> Mobile: (514) 600-0234
>>>>>> Email: [email protected]
>>>>>>
>>>>>> 9320 Saint-Laurent, #502
>>>>>> Montreal, QC, Canada, H2N 1N7
>>>>>>
>>>>>> On 2011-09-08, at 10:49 PM, Dmitriy Ryaboy wrote:
>>>>>>
>>>>>>> If you look at the data for #25 you posted below, you will find that 
>>>>>>> there
>>>>>>> is no row such that the price is between 5 and 95%!
>>>>>>> khadgar is such an extreme outlier, it moves the 5% line above everyone
>>>>>>> else, and of course it itself sets the 100% line.
>>>>>>>
>>>>>>> D
>>>>>>>
>>>>>>> On Thu, Sep 8, 2011 at 7:03 PM, Pierre-Luc Brunet 
>>>>>>> <[email protected]>wrote:
>>>>>>>
>>>>>>>> That worked except that for some reason, there's a lot of data that is
>>>>>>>> missing in the final output (compared to what it should return).
>>>>>>>>
>>>>>>>> For example, the file I load has these lines:
>>>>>>>>
>>>>>>>> 7       25      us      darkspear       a       Redacted        4750
>>>>>>>> 5000    1
>>>>>>>> 8       25      us      emerald-dream   a       Lornadoome      9500
>>>>>>>> 10000   1
>>>>>>>> 21      25      eu      khadgar a       Haiibanklol     769499  809999 
>>>>>>>>  1
>>>>>>>> 7       25      us      queldorei       a       Worfgt  27862   34827  
>>>>>>>>  1
>>>>>>>> 3       25      us      antonidas       a       Oldcrafter      19000
>>>>>>>> 20000   1
>>>>>>>>
>>>>>>>> However, when I load up the script http://pastebin.com/Bk8RBAHt (now
>>>>>>>> grouped on only one column), I don't have any records with 25 as the 
>>>>>>>> key.
>>>>>>>> The first 5 rows in my tsv files are
>>>>>>>>
>>>>>>>> 35      3.19973415E7
>>>>>>>> 36      122914.0
>>>>>>>> 37      50000.0
>>>>>>>> 38      416099.9
>>>>>>>> 39      901333.8571428572
>>>>>>>> 43      191496.5
>>>>>>>> 44      236454.0
>>>>>>>>
>>>>>>>>
>>>>>>>> I really have no idea where the missing rows went :\
>>>>>>>>
>>>>>>>> --
>>>>>>>> Pierre-Luc Brunet
>>>>>>>> ZeStuff - http://www.zestuff.com
>>>>>>>>
>>>>>>>> Phone: (877) 5ZESTUFF
>>>>>>>> Mobile: (514) 600-0234
>>>>>>>> Email: [email protected]
>>>>>>>>
>>>>>>>> 9320 Saint-Laurent, #502
>>>>>>>> Montreal, QC, Canada, H2N 1N7
>>>>>>>>
>>>>>>>> On 2011-09-08, at 8:45 PM, Xiaomeng Wan wrote:
>>>>>>>>
>>>>>>>>> you can change
>>>>>>>>>
>>>>>>>>> GENERATE group, auctionsPrice.price AS price:tuple, p5 AS p5, p95 AS 
>>>>>>>>> p95;
>>>>>>>>> to
>>>>>>>>> GENERATE FLATTEN(group) as (item, region, realm, faction),
>>>>>>>>> FLATTEN(auctionsPrice.price) AS price, p5 AS p5, p95 AS p95;
>>>>>>>>>
>>>>>>>>> then regroup after the foreach block
>>>>>>>>>
>>>>>>>>> p2 = FILTER p1 BY (price >= p5 AND price <= p95);
>>>>>>>>> p2a = group p2 by (item, region, realm, faction);
>>>>>>>>> p3 = FOREACH p2a GENERATE group, AVG(p2.price) AS price;
>>>>>>>>>
>>>>>>>>> or write you own UDF to get the average within the foreach block. It
>>>>>>>>> would be ideal if we can move p2 statement into the foreach block like
>>>>>>>>> this: p2 = filter autionsPrice by price >= p5 and price <= p95, but i
>>>>>>>>> donot think it is supported right now.
>>>>>>>>>
>>>>>>>>> Shawn
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On Thu, Sep 8, 2011 at 5:54 PM, Pierre-Luc Brunet <[email protected]>
>>>>>>>> wrote:
>>>>>>>>>> Heya!
>>>>>>>>>>
>>>>>>>>>> I've been trying to do something with Pig for about 4 days now and I
>>>>>>>> have nothing but failure to show for it. I was wondering if anybody 
>>>>>>>> could
>>>>>>>> look at my queries and slap some sense into me? I've uploaded the 
>>>>>>>> queries to
>>>>>>>> pastebin: http://pastebin.com/kzMxYwrY
>>>>>>>>>>
>>>>>>>>>> In short, I want to take my data, group it by 4 fields, then for each
>>>>>>>> group, I want to:
>>>>>>>>>> - Find out the 5th and the 95th percentile for the 'price'
>>>>>>>>>> - Filter each group to remove the records that are < 5th percentile 
>>>>>>>>>> and
>>>>>>>>> 95 percentile.
>>>>>>>>>>
>>>>>>>>>> Then for each group, I want to grab the AVG() of what's left.
>>>>>>>>>>
>>>>>>>>>> I tried many variations of the same code and always ended up with 
>>>>>>>>>> either
>>>>>>>> "incompatible types in GreaterThanEqual Operator" or "Scalar has more 
>>>>>>>> than
>>>>>>>> one row in the output."
>>>>>>>>>>
>>>>>>>>>> Any help would be greatly appreciated. Thanks! :)
>>>>>>>>>> --
>>>>>>>>>> Pierre-Luc Brunet
>>>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>
>>>>
>>>>
>>>
>
>
>

Re: Filter grouped data with two percentile

Reply via email to