Vincent, Dmitriy, I appreciate the explanations. They helped a lot. I think
the last point of your explanation, Dmitriy, was what I was getting at...
functions as arguments and all that. I guess I'm spoiled from functional
programming stuff, but it's exciting to think that such advanced
functionality might come to Pig.

Thanks again
Jon

2010/11/30 Dmitriy Ryaboy <[email protected]>

> Jonathan,
>
> At the higher level:
>
> When you group a relation, let's call it A, you get a new relation, let's
> call it B, with two fields -- "group" (the grouping key) and "A" (a bag of
> tuples from A with a matching key).  When you iterate over B with "foreach",
> you are iterating over these two-field rows.
>
> When, while iterating, you say A.foo, you are referring to a new bag that is
> the projection of the field foo from the bag A. So if we had
> B: (1, { (1, a), (1, b), (1, c) })
> and you said
> C = foreach B generate A.foo;
> you would get
> C: ( { (a), (b), (c) } )
>
> So now we get to SUM. SUM takes a bag of 1-element tuples (which is what you
> get when you create a projection above), and returns the sum of their
> contents.  IsEmpty takes a bag and returns a boolean. (IsEmpty(A.foo) ? 0 :
> 1) returns an int.  Can't call SUM on an int.
>
> What you actually mean when you say SUM( A.foo is null ? 0 : A.foo ) is
> "apply the function `if  val is null return 0 else return val` to every
> element in A.foo, and run SUM over that".  Which is functional programming,
> and that's cool, but we are pretty far from tackling functions as arguments
> in Pig.
>
> So the answer is -- do the null check before you group and use the result
> later, as Vincent suggests.
>
> D
>
> On Tue, Nov 30, 2010 at 8:56 AM, Jonathan Coveney <[email protected]> wrote:
>
> > Vincent, I really appreciate you taking a look. That worked (I tried the
> > second fix... I am still curious whether it'd be possible to do something
> > internal to SUM, but I think in its current implementation you cannot).
> >
> > At a higher level, why would things like that fail? Is it a feature or a
> > limitation of pig that things like that happen? I only ask as someone who
> > hopes to one day be intelligent enough to contribute to the project...
> >
> > 2010/11/30 Vincent <[email protected]>
> >
> > > Looking at your code I found the following mistakes:
> > >
> > > SUM((IsEmpty(pared.n2) ? 0:1)) will try to do SUM(0) or SUM(1), while SUM
> > > expects a bag of tuples.
> > >
> > > COUNT(pared.n2) can return 0, in which case you are dividing by 0. It may
> > > be better to filter out NULL values, or to test for them, first. That would
> > > avoid an internal exception that gives you a NULL result.
> > >
> > > For the second version, give this a try; I hope it does the trick:
> > >
> > > pared = foreach beacon_fact generate n1, n2, (IsEmpty(n2) ? 0 : 1) as
> > > ooz:int;
> > > grouped = group pared by n1;
> > > counted  = foreach grouped generate group, (IsEmpty(pared.n2) ?
> > > 0:(double)SUM(pared.ooz)/(double)COUNT(pared.n2)) as ratio:double;
> > >
> > > Regards
> > >
> > > -Vincent
> > >
> > > On Tue, Nov 30, 2010 at 7:17 PM, Jonathan Coveney <[email protected]> wrote:
> > >
> > > > (not sure if this double posted or not... I accidentally sent it to the
> > > > Hadoop mailing list and not the pig mailing list)
> > > >
> > > > I appreciate any help you can give. I've searched around and haven't found
> > > > anything directly related... I've gone through documentation but can't find
> > > > a real reason why this doesn't work.
> > > >
> > > > Here is the gist of my code (n1 is arbitrary, just to group by, n2 is
> > > > either null or a large integer):
> > > >
> > > > table = LOAD stuff AS (n1:chararray, n2:chararray, other irrelevant stuff);
> > > > pared = foreach table generate n1, n2;
> > > > grouped = group pared by n1;
> > > > counted  = foreach grouped generate group,
> > > > (double)SUM((IsEmpty(pared.n2) ? 0:1))/(double)COUNT(pared.n2) as ratio:double;
> > > > ordered = order counted by ratio desc;
> > > > limited = limit ordered 200;
> > > > dump limited;
> > > >
> > > > This gets this error:
> > > >
> > > > ERROR 1045: Could not infer the matching function for
> > > > org.apache.pig.builtin.SUM as multiple or none of them fit. Please use an
> > > > explicit cast.
> > > >
> > > > If I take out the double parentheses in the counted sum, I get:
> > > >
> > > > ERROR 1000: Error during parsing. Invalid alias: SUM in {group:
> > > > chararray,pared: {n1: chararray,n2: chararray}}
> > > >
> > > > I THINK the error is that sum wants the column of a bag as an input, not
> > > > actual integers... so I thought I'd try and make that happen by making the
> > > > input take the form I want.
> > > >
> > > > So in order to try and get around this, I thought this might work (changing
> > > > only these lines)
> > > >
> > > > pared = foreach beacon_fact generate n1, (IsEmpty(n2) ? 0 : 1) as ooz:int;
> > > > grouped = group pared by n1;
> > > > counted  = foreach grouped generate group,
> > > > (double)SUM(pared.n1)/(double)COUNT(pared.n2) as ratio:double;
> > > >
> > > > But this gives this error:
> > > > ERROR 1000: Error during parsing. Invalid alias: n2 in {n1: chararray,ooz: int}
> > > >
> > > > I have no real clue why this fails... I tried breaking it up into two steps
> > > > and it doesn't matter.
> > > >
> > > > I'd ideally like to do this without making a UDF, as I feel the base
> > > > functionality should support it. Not sure.
> > > >
> > > > Either way, I'd appreciate any help or pointers, as well as any rationale as
> > > > to why it does or doesn't work within the pig framework. The whole bag
> > > > system is still somewhat counterintuitive.
> > > >
> > > > Thank you for your time
> > > >
> > >
> >
>
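
For reference, the group / projection / SUM behaviour Dmitriy describes in the
quoted message can be modeled in a few lines. This is only a rough Python
sketch of the semantics, not Pig code and not how Pig is implemented. The
helper names (group_by, project, bag_sum, non_null_ratio) and the sample rows
are made up for illustration. Two simplifications are assumed: bag_sum returns
0 for a bag containing only nulls, and the ratio divides by the number of rows
in each group rather than by COUNT(pared.n2).

```python
from collections import defaultdict


def group_by(relation, key_index):
    """Model GROUP: one (group, bag) row per distinct key."""
    bags = defaultdict(list)
    for row in relation:
        bags[row[key_index]].append(row)
    return list(bags.items())


def project(bag, field_index):
    """Model A.foo: a new bag of 1-element tuples."""
    return [(row[field_index],) for row in bag]


def bag_sum(bag):
    """Model SUM: needs a bag of 1-element tuples; nulls (None) are skipped."""
    if not isinstance(bag, list):
        raise TypeError("SUM expects a bag, got " + type(bag).__name__)
    return sum(t[0] for t in bag if t[0] is not None)


def non_null_ratio(rows):
    """Fraction of rows per n1 whose n2 is not null."""
    # Null check BEFORE grouping: each row gets a 0/1 flag (the "ooz" field).
    pared = [(n1, 0 if n2 is None else 1) for n1, n2 in rows]
    return {
        key: bag_sum(project(bag, 1)) / len(bag)
        for key, bag in group_by(pared, 0)
    }


# The example from the explanation: A grouped on its first field.
A = [(1, "a"), (1, "b"), (1, "c")]
B = group_by(A, 0)                          # [(1, [(1, "a"), (1, "b"), (1, "c")])]
C = [(project(bag, 1),) for _, bag in B]    # [([("a",), ("b",), ("c",)],)]

# Made-up sample rows (n1, n2); n2 is either None or a numeric string.
rows = [("x", "10"), ("x", None), ("y", None), ("x", "7")]
ratios = non_null_ratio(rows)               # x -> 2/3, y -> 0.0
```

In this model, calling bag_sum(1) raises a TypeError, which mirrors the
original problem: the ternary yields a single int, not a bag. Computing the
0/1 flag per row first and summing its projection afterwards is the
"null check before you group" approach suggested in the thread.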
