wrong button or what? not sure, anyway, try this:
a = group records by id;
b = foreach a { x = COUNT(records); y = order records by thevalue; z =
limit y x*0.95; z1 = order records by thevalue desc; z2 = limit z1
x*0.9; generate group, z2; }
never try this before, if no luck, you need to find other workaround.
Shawn
On Wed, Sep 14, 2011 at 10:26 AM, Xiaomeng Wan <[email protected]> wrote:
> Pierre,
>
> Union is not allowed within foreach. Fortunately, you donot need it. I
> just realize the code I give you doesnot generate what you want,
> actually it generates the complement of what you want. Try something
> like this:
>
> a = group records by id;
> b = foreach a {
>
>
>
>
> On Wed, Sep 14, 2011 at 10:09 AM, Pierre-Luc Brunet
> <[email protected]> wrote:
>> Shawn,
>>
>> This looks indeed pretty good except for one thing.
>>
>> I already do a GROUP on my table in order to group my records by "item". If
>> I run your code, I end up filtering against the entire data set instead of
>> filtering each group individually. I tried to put your code inside a
>> foreach statement without much luck.
>>
>> Any idea?
>>
>> --
>> Pierre
>>
>>
>> On 2011-09-13, at 5:49 PM, Xiaomeng Wan wrote:
>>
>>> try this:
>>>
>>> a = group allrecords all;
>>> b = foreach a generate COUNT(allrecords) as total; //or COUNT_STAR
>>>
>>> c = order allrecords by thevalue;
>>> d = limit c b.total*0.05;
>>>
>>> e = order allrecords by thevalue desc;
>>> f = limit e b.total*0.05;
>>>
>>> g = union d, f;
>>>
>>> Shawn
>>>
>>> On Tue, Sep 13, 2011 at 2:39 PM, Pierre-Luc Brunet <[email protected]>
>>> wrote:
>>>> Question.
>>>>
>>>> What would be the best way in Pig to grab a set of data, find the record
>>>> that matches the 5th percentile, find the record that matches the 95th
>>>> percentile and throw away what's before and after that?
>>>>
>>>> Obviously, my math doesn't work for this.
>>>>
>>>> In hope that it helps clarifying what I'm trying to do, here's how I
>>>> currently do it in Javascript:
>>>>
>>>> result.bid_array.sort(function(a,b) { return a - b; });
>>>> var bid_p5 = Math.round(5/100 * result.bid_array.length);
>>>> var bid_p95 = Math.round(95/100 * result.bid_array.length);
>>>>
>>>> result.bid_array.splice(bid_p95, result.bid_array.length - (
>>>> result.bid_array.length - bid_p95));
>>>> result.bid_array.splice(0, bid_p5);
>>>>
>>>> --
>>>> Pierre-Luc Brunet
>>>> ZeStuff - http://www.zestuff.com
>>>>
>>>> Phone: (877) 5ZESTUFF
>>>> Mobile: (514) 600-0234
>>>> Email: [email protected]
>>>>
>>>> 9320 Saint-Laurent, #502
>>>> Montreal, QC, Canada, H2N 1N7
>>>>
>>>> On 2011-09-08, at 10:49 PM, Dmitriy Ryaboy wrote:
>>>>
>>>>> If you look at the data for #25 you posted below, you will find that there
>>>>> is no row such that the price is between 5 and 95%!
>>>>> khadgar is such an extreme outlier, it moves the 5% line above everyone
>>>>> else, and of course it itself sets the 100% line.
>>>>>
>>>>> D
>>>>>
>>>>> On Thu, Sep 8, 2011 at 7:03 PM, Pierre-Luc Brunet
>>>>> <[email protected]>wrote:
>>>>>
>>>>>> That worked except that for some reason, there's a lot of data that is
>>>>>> missing in the final output (compared to what it should return).
>>>>>>
>>>>>> For example, the file I load has these lines:
>>>>>>
>>>>>> 7 25 us darkspear a Redacted 4750
>>>>>> 5000 1
>>>>>> 8 25 us emerald-dream a Lornadoome 9500
>>>>>> 10000 1
>>>>>> 21 25 eu khadgar a Haiibanklol 769499 809999 1
>>>>>> 7 25 us queldorei a Worfgt 27862 34827 1
>>>>>> 3 25 us antonidas a Oldcrafter 19000
>>>>>> 20000 1
>>>>>>
>>>>>> However, when I load up the script http://pastebin.com/Bk8RBAHt (now
>>>>>> grouped on only one column), I don't have any records with 25 as the key.
>>>>>> The first 5 rows in my tsv files are
>>>>>>
>>>>>> 35 3.19973415E7
>>>>>> 36 122914.0
>>>>>> 37 50000.0
>>>>>> 38 416099.9
>>>>>> 39 901333.8571428572
>>>>>> 43 191496.5
>>>>>> 44 236454.0
>>>>>>
>>>>>>
>>>>>> I really have no idea where the missing rows went :\
>>>>>>
>>>>>> --
>>>>>> Pierre-Luc Brunet
>>>>>> ZeStuff - http://www.zestuff.com
>>>>>>
>>>>>> Phone: (877) 5ZESTUFF
>>>>>> Mobile: (514) 600-0234
>>>>>> Email: [email protected]
>>>>>>
>>>>>> 9320 Saint-Laurent, #502
>>>>>> Montreal, QC, Canada, H2N 1N7
>>>>>>
>>>>>> On 2011-09-08, at 8:45 PM, Xiaomeng Wan wrote:
>>>>>>
>>>>>>> you can change
>>>>>>>
>>>>>>> GENERATE group, auctionsPrice.price AS price:tuple, p5 AS p5, p95 AS
>>>>>>> p95;
>>>>>>> to
>>>>>>> GENERATE FLATTEN(group) as (item, region, realm, faction),
>>>>>>> FLATTEN(auctionsPrice.price) AS price, p5 AS p5, p95 AS p95;
>>>>>>>
>>>>>>> then regroup after the foreach block
>>>>>>>
>>>>>>> p2 = FILTER p1 BY (price >= p5 AND price <= p95);
>>>>>>> p2a = group p2 by (item, region, realm, faction);
>>>>>>> p3 = FOREACH p2a GENERATE group, AVG(p2.price) AS price;
>>>>>>>
>>>>>>> or write you own UDF to get the average within the foreach block. It
>>>>>>> would be ideal if we can move p2 statement into the foreach block like
>>>>>>> this: p2 = filter autionsPrice by price >= p5 and price <= p95, but i
>>>>>>> donot think it is supported right now.
>>>>>>>
>>>>>>> Shawn
>>>>>>>
>>>>>>>
>>>>>>> On Thu, Sep 8, 2011 at 5:54 PM, Pierre-Luc Brunet <[email protected]>
>>>>>> wrote:
>>>>>>>> Heya!
>>>>>>>>
>>>>>>>> I've been trying to do something with Pig for about 4 days now and I
>>>>>> have nothing but failure to show for it. I was wondering if anybody could
>>>>>> look at my queries and slap some sense into me? I've uploaded the
>>>>>> queries to
>>>>>> pastebin: http://pastebin.com/kzMxYwrY
>>>>>>>>
>>>>>>>> In short, I want to take my data, group it by 4 fields, then for each
>>>>>> group, I want to:
>>>>>>>> - Find out the 5th and the 95th percentile for the 'price'
>>>>>>>> - Filter each group to remove the records that are < 5th percentile and
>>>>>>> 95 percentile.
>>>>>>>>
>>>>>>>> Then for each group, I want to grab the AVG() of what's left.
>>>>>>>>
>>>>>>>> I tried many variations of the same code and always ended up with
>>>>>>>> either
>>>>>> "incompatible types in GreaterThanEqual Operator" or "Scalar has more
>>>>>> than
>>>>>> one row in the output."
>>>>>>>>
>>>>>>>> Any help would be greatly appreciated. Thanks! :)
>>>>>>>> --
>>>>>>>> Pierre-Luc Brunet
>>>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>
>>>>
>>>>
>>
>>
>>
>