Under pig 0.9.1-SNAPSHOT, I get:
2011-09-14 12:54:10,075 [main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR
1200: <line 7, column 12> Syntax error, unexpected symbol at or near 'y'
Under pig 0.8.1-cdh3u1, I get:
2011-09-14 12:55:35,765 [main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR
1000: Error during parsing. Encountered " <IDENTIFIER> "x "" at line 4, column
15.
Any idea what's the problem?
--
Pierre
On 2011-09-14, at 12:52 PM, Xiaomeng Wan wrote:
> wrong button or what? not sure, anyway, try this:
>
> a = group records by id;
> b = foreach a { x = COUNT(records); y = order records by thevalue; z =
> limit y x*0.95; z1 = order records by thevalue desc; z2 = limit z1
> x*0.9; generate group, z2; }
>
> never try this before, if no luck, you need to find other workaround.
>
> Shawn
>
> On Wed, Sep 14, 2011 at 10:26 AM, Xiaomeng Wan <[email protected]> wrote:
>> Pierre,
>>
>> Union is not allowed within foreach. Fortunately, you donot need it. I
>> just realize the code I give you doesnot generate what you want,
>> actually it generates the complement of what you want. Try something
>> like this:
>>
>> a = group records by id;
>> b = foreach a {
>>
>>
>>
>>
>> On Wed, Sep 14, 2011 at 10:09 AM, Pierre-Luc Brunet
>> <[email protected]> wrote:
>>> Shawn,
>>>
>>> This looks indeed pretty good except for one thing.
>>>
>>> I already do a GROUP on my table in order to group my records by "item". If
>>> I run your code, I end up filtering against the entire data set instead of
>>> filtering each group individually. I tried to put your code inside a
>>> foreach statement without much luck.
>>>
>>> Any idea?
>>>
>>> --
>>> Pierre
>>>
>>>
>>> On 2011-09-13, at 5:49 PM, Xiaomeng Wan wrote:
>>>
>>>> try this:
>>>>
>>>> a = group allrecords all;
>>>> b = foreach a generate COUNT(allrecords) as total; //or COUNT_STAR
>>>>
>>>> c = order allrecords by thevalue;
>>>> d = limit c b.total*0.05;
>>>>
>>>> e = order allrecords by thevalue desc;
>>>> f = limit e b.total*0.05;
>>>>
>>>> g = union d, f;
>>>>
>>>> Shawn
>>>>
>>>> On Tue, Sep 13, 2011 at 2:39 PM, Pierre-Luc Brunet <[email protected]>
>>>> wrote:
>>>>> Question.
>>>>>
>>>>> What would be the best way in Pig to grab a set of data, find the record
>>>>> that matches the 5th percentile, find the record that matches the 95th
>>>>> percentile and throw away what's before and after that?
>>>>>
>>>>> Obviously, my math doesn't work for this.
>>>>>
>>>>> In hope that it helps clarifying what I'm trying to do, here's how I
>>>>> currently do it in Javascript:
>>>>>
>>>>> result.bid_array.sort(function(a,b) { return a - b; });
>>>>> var bid_p5 = Math.round(5/100 * result.bid_array.length);
>>>>> var bid_p95 = Math.round(95/100 * result.bid_array.length);
>>>>>
>>>>> result.bid_array.splice(bid_p95, result.bid_array.length - (
>>>>> result.bid_array.length - bid_p95));
>>>>> result.bid_array.splice(0, bid_p5);
>>>>>
>>>>> --
>>>>> Pierre-Luc Brunet
>>>>> ZeStuff - http://www.zestuff.com
>>>>>
>>>>> Phone: (877) 5ZESTUFF
>>>>> Mobile: (514) 600-0234
>>>>> Email: [email protected]
>>>>>
>>>>> 9320 Saint-Laurent, #502
>>>>> Montreal, QC, Canada, H2N 1N7
>>>>>
>>>>> On 2011-09-08, at 10:49 PM, Dmitriy Ryaboy wrote:
>>>>>
>>>>>> If you look at the data for #25 you posted below, you will find that
>>>>>> there
>>>>>> is no row such that the price is between 5 and 95%!
>>>>>> khadgar is such an extreme outlier, it moves the 5% line above everyone
>>>>>> else, and of course it itself sets the 100% line.
>>>>>>
>>>>>> D
>>>>>>
>>>>>> On Thu, Sep 8, 2011 at 7:03 PM, Pierre-Luc Brunet
>>>>>> <[email protected]>wrote:
>>>>>>
>>>>>>> That worked except that for some reason, there's a lot of data that is
>>>>>>> missing in the final output (compared to what it should return).
>>>>>>>
>>>>>>> For example, the file I load has these lines:
>>>>>>>
>>>>>>> 7 25 us darkspear a Redacted 4750
>>>>>>> 5000 1
>>>>>>> 8 25 us emerald-dream a Lornadoome 9500
>>>>>>> 10000 1
>>>>>>> 21 25 eu khadgar a Haiibanklol 769499 809999
>>>>>>> 1
>>>>>>> 7 25 us queldorei a Worfgt 27862 34827
>>>>>>> 1
>>>>>>> 3 25 us antonidas a Oldcrafter 19000
>>>>>>> 20000 1
>>>>>>>
>>>>>>> However, when I load up the script http://pastebin.com/Bk8RBAHt (now
>>>>>>> grouped on only one column), I don't have any records with 25 as the
>>>>>>> key.
>>>>>>> The first 5 rows in my tsv files are
>>>>>>>
>>>>>>> 35 3.19973415E7
>>>>>>> 36 122914.0
>>>>>>> 37 50000.0
>>>>>>> 38 416099.9
>>>>>>> 39 901333.8571428572
>>>>>>> 43 191496.5
>>>>>>> 44 236454.0
>>>>>>>
>>>>>>>
>>>>>>> I really have no idea where the missing rows went :\
>>>>>>>
>>>>>>> --
>>>>>>> Pierre-Luc Brunet
>>>>>>> ZeStuff - http://www.zestuff.com
>>>>>>>
>>>>>>> Phone: (877) 5ZESTUFF
>>>>>>> Mobile: (514) 600-0234
>>>>>>> Email: [email protected]
>>>>>>>
>>>>>>> 9320 Saint-Laurent, #502
>>>>>>> Montreal, QC, Canada, H2N 1N7
>>>>>>>
>>>>>>> On 2011-09-08, at 8:45 PM, Xiaomeng Wan wrote:
>>>>>>>
>>>>>>>> you can change
>>>>>>>>
>>>>>>>> GENERATE group, auctionsPrice.price AS price:tuple, p5 AS p5, p95 AS
>>>>>>>> p95;
>>>>>>>> to
>>>>>>>> GENERATE FLATTEN(group) as (item, region, realm, faction),
>>>>>>>> FLATTEN(auctionsPrice.price) AS price, p5 AS p5, p95 AS p95;
>>>>>>>>
>>>>>>>> then regroup after the foreach block
>>>>>>>>
>>>>>>>> p2 = FILTER p1 BY (price >= p5 AND price <= p95);
>>>>>>>> p2a = group p2 by (item, region, realm, faction);
>>>>>>>> p3 = FOREACH p2a GENERATE group, AVG(p2.price) AS price;
>>>>>>>>
>>>>>>>> or write you own UDF to get the average within the foreach block. It
>>>>>>>> would be ideal if we can move p2 statement into the foreach block like
>>>>>>>> this: p2 = filter autionsPrice by price >= p5 and price <= p95, but i
>>>>>>>> donot think it is supported right now.
>>>>>>>>
>>>>>>>> Shawn
>>>>>>>>
>>>>>>>>
>>>>>>>> On Thu, Sep 8, 2011 at 5:54 PM, Pierre-Luc Brunet <[email protected]>
>>>>>>> wrote:
>>>>>>>>> Heya!
>>>>>>>>>
>>>>>>>>> I've been trying to do something with Pig for about 4 days now and I
>>>>>>> have nothing but failure to show for it. I was wondering if anybody
>>>>>>> could
>>>>>>> look at my queries and slap some sense into me? I've uploaded the
>>>>>>> queries to
>>>>>>> pastebin: http://pastebin.com/kzMxYwrY
>>>>>>>>>
>>>>>>>>> In short, I want to take my data, group it by 4 fields, then for each
>>>>>>> group, I want to:
>>>>>>>>> - Find out the 5th and the 95th percentile for the 'price'
>>>>>>>>> - Filter each group to remove the records that are < 5th percentile
>>>>>>>>> and
>>>>>>>> 95 percentile.
>>>>>>>>>
>>>>>>>>> Then for each group, I want to grab the AVG() of what's left.
>>>>>>>>>
>>>>>>>>> I tried many variations of the same code and always ended up with
>>>>>>>>> either
>>>>>>> "incompatible types in GreaterThanEqual Operator" or "Scalar has more
>>>>>>> than
>>>>>>> one row in the output."
>>>>>>>>>
>>>>>>>>> Any help would be greatly appreciated. Thanks! :)
>>>>>>>>> --
>>>>>>>>> Pierre-Luc Brunet
>>>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>
>>>>>
>>>>>
>>>
>>>
>>>
>>