Re: Filter grouped data with two percentile

Dmitriy Ryaboy Thu, 08 Sep 2011 19:49:49 -0700

If you look at the data for #25 you posted below, you will find that there
is no row whose price falls between the 5% and 95% lines!
khadgar is such an extreme outlier that it moves the 5% line above everyone
else, and of course it sets the 100% line itself.
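
To make the arithmetic concrete, here is a minimal Python sketch using the five item-25 rows quoted below. It rests on two assumptions that this thread does not confirm: that the script places its 5% and 95% cut-offs at fractions of the min-max price range, and that the first numeric column after the seller name is the price. `range_bounds` is a hypothetical helper for illustration, not something taken from the pastebin script.

```python
# Rows for item 25 from the quoted mail, as (realm, price).
# Assumption: the first numeric column after the seller name is the price.
rows = [
    ("darkspear", 4750),
    ("emerald-dream", 9500),
    ("khadgar", 769499),
    ("queldorei", 27862),
    ("antonidas", 19000),
]


def range_bounds(prices, lo=0.05, hi=0.95):
    """Return cut-offs placed lo and hi of the way from min to max.

    Assumption: this mimics how the script derives p5 and p95; the
    actual method in the pastebin script is not shown in the thread.
    """
    low, high = min(prices), max(prices)
    span = high - low
    return low + lo * span, low + hi * span


prices = [price for _, price in rows]
p5, p95 = range_bounds(prices)

# Keep only rows whose price lies inside [p5, p95], like the FILTER step.
kept = [realm for realm, price in rows if p5 <= price <= p95]

print("p5 =", p5, "p95 =", p95)
print("rows kept:", kept)
```

Under those assumptions the lower cut-off lands above the four ordinary prices and the upper cut-off lands below khadgar, so the filter keeps nothing and key 25 drops out of the final average.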


D

On Thu, Sep 8, 2011 at 7:03 PM, Pierre-Luc Brunet <[email protected]> wrote:

> That worked except that for some reason, there's a lot of data that is
> missing in the final output (compared to what it should return).
>
> For example, the file I load has these lines:
>
> 7       25      us      darkspear       a       Redacted        4750    5000    1
> 8       25      us      emerald-dream   a       Lornadoome      9500    10000   1
> 21      25      eu      khadgar         a       Haiibanklol     769499  809999  1
> 7       25      us      queldorei       a       Worfgt          27862   34827   1
> 3       25      us      antonidas       a       Oldcrafter      19000   20000   1
>
> However, when I run the script http://pastebin.com/Bk8RBAHt (now
> grouped on only one column), I don't get any records with 25 as the key.
> The first few rows in my TSV files are:
>
> 35      3.19973415E7
> 36      122914.0
> 37      50000.0
> 38      416099.9
> 39      901333.8571428572
> 43      191496.5
> 44      236454.0
>
>
> I really have no idea where the missing rows went :\
>
> --
> Pierre-Luc Brunet
> ZeStuff - http://www.zestuff.com
>
> Phone: (877) 5ZESTUFF
> Mobile: (514) 600-0234
> Email: [email protected]
>
> 9320 Saint-Laurent, #502
> Montreal, QC, Canada, H2N 1N7
>
> On 2011-09-08, at 8:45 PM, Xiaomeng Wan wrote:
>
> > you can change
> >
> > GENERATE group, auctionsPrice.price AS price:tuple, p5 AS p5, p95 AS p95;
> > to
> > GENERATE FLATTEN(group) as (item, region, realm, faction),
> > FLATTEN(auctionsPrice.price) AS price, p5 AS p5, p95 AS p95;
> >
> > then regroup after the foreach block
> >
> > p2 = FILTER p1 BY (price >= p5 AND price <= p95);
> > p2a = group p2 by (item, region, realm, faction);
> > p3 = FOREACH p2a GENERATE group, AVG(p2.price) AS price;
> >
> > or write your own UDF to get the average within the foreach block. It
> > would be ideal if we could move the p2 statement into the foreach block like
> > this: p2 = filter auctionsPrice by price >= p5 and price <= p95, but I
> > do not think that is supported right now.
> >
> > Shawn
> >
> >
> > On Thu, Sep 8, 2011 at 5:54 PM, Pierre-Luc Brunet <[email protected]> wrote:
> >> Heya!
> >>
> >> I've been trying to do something with Pig for about 4 days now and I
> >> have nothing but failure to show for it. I was wondering if anybody could
> >> look at my queries and slap some sense into me? I've uploaded the queries to
> >> pastebin: http://pastebin.com/kzMxYwrY
> >>
> >> In short, I want to take my data, group it by 4 fields, then for each
> >> group, I want to:
> >>  - Find out the 5th and the 95th percentile for the 'price'
> >>  - Filter each group to remove the records that are below the 5th
> >>    percentile or above the 95th percentile.
> >>
> >> Then for each group, I want to grab the AVG() of what's left.
> >>
> >> I tried many variations of the same code and always ended up with either
> >> "incompatible types in GreaterThanEqual Operator" or "Scalar has more than
> >> one row in the output."
> >>
> >> Any help would be greatly appreciated. Thanks! :)
> >> --
> >> Pierre-Luc Brunet
> >>
>
>
>
