That worked except that for some reason, there's a lot of data that is missing 
in the final output (compared to what it should return).

For example, the file I load has these lines:

7       25      us      darkspear       a       Redacted        4750    5000    
1
8       25      us      emerald-dream   a       Lornadoome      9500    10000   
1
21      25      eu      khadgar a       Haiibanklol     769499  809999  1
7       25      us      queldorei       a       Worfgt  27862   34827   1
3       25      us      antonidas       a       Oldcrafter      19000   20000   
1

However, when I load up the script http://pastebin.com/Bk8RBAHt (now grouped on 
only one column), I don't have any records with 25 as the key. The first 5 rows 
in my tsv files are

35      3.19973415E7
36      122914.0
37      50000.0
38      416099.9
39      901333.8571428572
43      191496.5
44      236454.0


I really have no idea where the missing rows went :\

--
Pierre-Luc Brunet
ZeStuff - http://www.zestuff.com
 
Phone: (877) 5ZESTUFF
Mobile: (514) 600-0234
Email: [email protected]
 
9320 Saint-Laurent, #502
Montreal, QC, Canada, H2N 1N7

On 2011-09-08, at 8:45 PM, Xiaomeng Wan wrote:

> you can change
> 
> GENERATE group, auctionsPrice.price AS price:tuple, p5 AS p5, p95 AS p95;
> to
> GENERATE FLATTEN(group) as (item, region, realm, faction),
> FLATTEN(auctionsPrice.price) AS price, p5 AS p5, p95 AS p95;
> 
> then regroup after the foreach block
> 
> p2 = FILTER p1 BY (price >= p5 AND price <= p95);
> p2a = group p2 by (item, region, realm, faction);
> p3 = FOREACH p2a GENERATE group, AVG(p2.price) AS price;
> 
> or write you own UDF to get the average within the foreach block. It
> would be ideal if we can move p2 statement into the foreach block like
> this: p2 = filter autionsPrice by price >= p5 and price <= p95, but i
> donot think it is supported right now.
> 
> Shawn
> 
> 
> On Thu, Sep 8, 2011 at 5:54 PM, Pierre-Luc Brunet <[email protected]> wrote:
>> Heya!
>> 
>> I've been trying to do something with Pig for about 4 days now and I have 
>> nothing but failure to show for it. I was wondering if anybody could look at 
>> my queries and slap some sense into me? I've uploaded the queries to  
>> pastebin: http://pastebin.com/kzMxYwrY
>> 
>> In short, I want to take my data, group it by 4 fields, then for each group, 
>> I want to:
>>  - Find out the 5th and the 95th percentile for the 'price'
>>  - Filter each group to remove the records that are < 5th percentile and > 
>> 95 percentile.
>> 
>> Then for each group, I want to grab the AVG() of what's left.
>> 
>> I tried many variations of the same code and always ended up with either 
>> "incompatible types in GreaterThanEqual Operator" or "Scalar has more than 
>> one row in the output."
>> 
>> Any help would be greatly appreciated. Thanks! :)
>> --
>> Pierre-Luc Brunet
>> 


Reply via email to