Re: Filter grouped data with two percentile

Dmitriy Ryaboy Thu, 08 Sep 2011 20:00:16 -0700

You are right, Shawn, it gives the following error:

grunt> p1 = FOREACH grouped {
>>   min = MIN(auctionsPrice.price);
>>   max = MAX(auctionsPrice.price);
>>   p5 = min + (max-min) * 0.05;
>>   p95 = min + (max-min) * 0.95;
>>   p2 = filter auctionsPrice BY (price >= p5 AND price <= p95);
>>   generate FLATTEN(group) as item, flatten(p2.price) as price, p5 AS p5, p95 AS p95;
>> }
2011-09-08 19:57:35,142 [main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR 1000:
<line 34, column 24> Invalid field reference. Referenced field [price] does not exist in schema: .


Looks like we are losing the schema of auctionsPrice somewhere along the
line, even though a describe on the result of grouping auctionsPrice shows
that it is there:
group:int,auctionsPrice:bag{:tuple(item:int,price:double)}.
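
To pin down what the nested FOREACH is meant to compute, here is a minimal
sketch of the per-group logic outside Pig. It is Python, not Pig Latin, and
it is only an illustration of the intended semantics, not a workaround for
the schema error. The function name trimmed_group_means and the (item, price)
row layout are assumptions made for the example. Note that
min + (max-min) * 0.05 is a point 5% of the way along the price range, which
is only a rough stand-in for a true 5th percentile.

```python
from statistics import mean


def trimmed_group_means(rows, lower=0.05, upper=0.95):
    """Average price per item after dropping prices outside the range bounds.

    rows: iterable of (item, price) pairs (hypothetical layout for this sketch).
    Returns a dict mapping item -> mean of the kept prices, or None when the
    filter leaves nothing for that item.
    """
    # Equivalent of GROUP ... BY item.
    groups = {}
    for item, price in rows:
        groups.setdefault(item, []).append(price)

    result = {}
    for item, prices in groups.items():
        lo, hi = min(prices), max(prices)
        # Same arithmetic as p5 and p95 in the script.
        p5 = lo + (hi - lo) * lower
        p95 = lo + (hi - lo) * upper
        # Equivalent of FILTER ... BY (price >= p5 AND price <= p95).
        kept = [p for p in prices if p5 <= p <= p95]
        # A group with only two distinct prices keeps nothing, so guard AVG.
        result[item] = mean(kept) if kept else None
    return result


if __name__ == "__main__":
    sample = [(1, 0.0), (1, 50.0), (1, 60.0), (1, 100.0), (2, 10.0), (2, 10.0)]
    print(trimmed_group_means(sample))
```

One thing the sketch makes visible: a group whose prices are just a minimum
and a maximum ends up empty after the filter, so whatever form the Pig script
takes, the final AVG has to cope with empty bags.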

D

On Thu, Sep 8, 2011 at 6:34 PM, Xiaomeng Wan <[email protected]> wrote:

> I am talking about this part in Pierre's code:
>
> p1 = FOREACH grouped {
>  min = MIN(auctionsPrice.price);
>  max = MAX(auctionsPrice.price);
>  p5 = min + (max-min) * 0.05;
>  p95 = min + (max-min) * 0.95;
>
>  GENERATE group, auctionsPrice.price AS price:tuple, p5 AS p5, p95 AS p95;
> }
>
> p2 = FILTER p1 BY (price >= p5 AND price <= p95);
>
> What he really wants is to move p2 into the foreach block, after p95,
> as: p2 = filter auctionsPrice BY (price >= p5 AND price <= p95); It would
> be great to know if this is already handled by the scalar feature.
>
> Shawn
>
>
> On Thu, Sep 8, 2011 at 7:08 PM, Dmitriy Ryaboy <[email protected]> wrote:
> > Not sure what you mean.. can you write out the script you are thinking of
> > that is currently not supported, and we'll see if there's a method for
> > getting it to work?
> > I suspect a judicious use of the Pig scalar feature might be in order.
> >
> > D
> >
> > On Thu, Sep 8, 2011 at 5:45 PM, Xiaomeng Wan <[email protected]> wrote:
> >
> >> you can change
> >>
> >> GENERATE group, auctionsPrice.price AS price:tuple, p5 AS p5, p95 AS p95;
> >> to
> >> GENERATE FLATTEN(group) as (item, region, realm, faction),
> >> FLATTEN(auctionsPrice.price) AS price, p5 AS p5, p95 AS p95;
> >>
> >> then regroup after the foreach block
> >>
> >> p2 = FILTER p1 BY (price >= p5 AND price <= p95);
> >> p2a = group p2 by (item, region, realm, faction);
> >> p3 = FOREACH p2a GENERATE group, AVG(p2.price) AS price;
> >>
> >> or write your own UDF to get the average within the foreach block. It
> >> would be ideal if we could move the p2 statement into the foreach block
> >> like this: p2 = filter auctionsPrice by price >= p5 and price <= p95,
> >> but I do not think it is supported right now.
> >>
> >> Shawn
> >>
> >>
> >> On Thu, Sep 8, 2011 at 5:54 PM, Pierre-Luc Brunet <[email protected]>
> >> wrote:
> >> > Heya!
> >> >
> >> > I've been trying to do something with Pig for about 4 days now and I
> >> > have nothing but failure to show for it. I was wondering if anybody
> >> > could look at my queries and slap some sense into me? I've uploaded
> >> > the queries to pastebin: http://pastebin.com/kzMxYwrY
> >> >
> >> > In short, I want to take my data, group it by 4 fields, then for each
> >> > group, I want to:
> >> >  - Find out the 5th and the 95th percentile for the 'price'
> >> >  - Filter each group to remove the records that are < 5th percentile
> >> >    and > 95th percentile.
> >> >
> >> > Then for each group, I want to grab the AVG() of what's left.
> >> >
> >> > I tried many variations of the same code and always ended up with
> >> > either "incompatible types in GreaterThanEqual Operator" or "Scalar
> >> > has more than one row in the output."
> >> >
> >> > Any help would be greatly appreciated. Thanks! :)
> >> > --
> >> > Pierre-Luc Brunet
> >> >
> >>
> >
>
