Jonathan,
At a higher level:
When you group a relation, let's call it A, you get a new relation, let's
call it B, with two fields -- "group" (the grouping key) and "A" (a bag of
tuples from A with a matching key). When you iterate over B with "foreach",
you are iterating over these two-field rows.
When, while iterating, you say A.foo, you are referring to a new bag that is
the projection of the field foo from the bag A. So if we had
B: (1, { (1, a), (1, b), (1, c) })
and you said
C = foreach B generate A.foo;
you would get
C: ( { (a), (b), (c) } )
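
If it helps, here is a rough Python model of the grouping and projection
described above. It is not Pig code, just a sketch of the data shapes: the
relation contents and the helper names group_by_key and project_foo are
made up for illustration, and bags are modelled as plain lists.

```python
from collections import defaultdict

# Toy relation A with two fields (key, foo); the values are made up.
A = [(1, "a"), (1, "b"), (1, "c"), (2, "d")]

def group_by_key(rows):
    """Model of `B = group A by key`: one (group, bag) row per key."""
    bags = defaultdict(list)
    for row in rows:
        bags[row[0]].append(row)
    return [(key, bag) for key, bag in bags.items()]

def project_foo(bag):
    """Model of `A.foo` inside a foreach: a bag of 1-element tuples."""
    return [(row[1],) for row in bag]

B = group_by_key(A)                         # rows of (group, bag of A tuples)
C = [(project_foo(bag),) for _, bag in B]   # rows holding only the projected bag
```

Each row of B has exactly two fields, and each row of C has one field, which
is itself a bag.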
So now we get to SUM. SUM takes a bag of 1-element tuples (which is what you
get from a projection like the one above) and returns the sum of their
contents. IsEmpty takes a bag and returns a boolean, so (IsEmpty(A.foo) ? 0 :
1) returns a single int for the whole bag. You can't call SUM on an int.
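
The same type mismatch, sketched in Python under the assumption that a bag
is a list of 1-element tuples. The functions pig_sum and is_empty are
hypothetical stand-ins for SUM and IsEmpty, not Pig's real implementation.

```python
def pig_sum(bag):
    """Rough model of SUM: add up the contents of a bag of 1-element tuples."""
    if not isinstance(bag, list):
        raise TypeError("SUM expects a bag, got %s" % type(bag).__name__)
    return sum(t[0] for t in bag)

def is_empty(bag):
    """Rough model of IsEmpty: True when the bag holds no tuples."""
    return len(bag) == 0

foo_bag = [(3,), (4,)]                  # what a projection like A.foo looks like
flag = 0 if is_empty(foo_bag) else 1    # one int for the whole bag, not a bag
total = pig_sum(foo_bag)                # fine: the argument is a bag

try:
    pig_sum(flag)                       # the argument is an int, so this fails
    error = None
except TypeError as exc:
    error = str(exc)
```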
What you actually mean when you say SUM( A.foo is null ? 0 : A.foo ) is
"apply the function `if val is null return 0 else return val` to every
element in A.foo, and run SUM over that". Which is functional programming,
and that's cool, but we are pretty far from tackling functions as arguments
in Pig.
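
For comparison, this is what "apply a function to every element, then sum"
looks like in a language that does have functions as arguments. It is a
Python sketch only; null_to_zero is a made-up helper, None stands in for a
Pig null, and the bag contents are invented.

```python
def null_to_zero(val):
    """The per-element function: return 0 for a null, else the value itself."""
    return 0 if val is None else val

foo_bag = [(5,), (None,), (7,)]         # bag of 1-element tuples, one null

# Map the function over every element of the bag, then sum the results.
total = sum(null_to_zero(t[0]) for t in foo_bag)
```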
So the answer is -- do the null check before you group and use the result
later, as Vincent suggests.
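
A minimal Python sketch of that ordering, assuming rows of (n1, n2) where
None stands in for a null n2. The sample rows are invented, and this only
models the shape of the computation, not how Pig executes it.

```python
from collections import defaultdict

# Made-up rows of (n1, n2); None stands in for a Pig null.
rows = [("x", "10"), ("x", None), ("y", None), ("x", "30")]

# Step 1, before grouping: compute one 0/1 flag per row.
flagged = [(n1, 0 if n2 is None else 1) for n1, n2 in rows]

# Step 2: group by n1, so each key maps to a bag of flags.
bags = defaultdict(list)
for n1, flag in flagged:
    bags[n1].append(flag)

# Step 3: per group, sum the flags and divide by the number of rows.
# Every group has at least one row, so the divisor is never zero.
ratios = {n1: sum(flags) / len(flags) for n1, flags in bags.items()}
```

Because the flag is already a plain field when you group, the sum in step 3
runs over a bag of values, which is the kind of input SUM wants.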
D
On Tue, Nov 30, 2010 at 8:56 AM, Jonathan Coveney <[email protected]> wrote:
> Vincent, I really appreciate you taking a look. That worked (I tried the
> second fix...I still am curious if it'd be possible to do something
> internal to sum, but I think in its current implementation you cannot).
>
> At a higher level, why would things like that fail? Is it a feature or a
> limitation of pig that things like that happen? I only ask as someone who
> hopes to one day be intelligent enough to contribute to the project...
>
> 2010/11/30 Vincent <[email protected]>
>
> > Looking at your code I found the following mistakes:
> >
> > SUM((IsEmpty(pared.n2) ? 0:1)) will try to do SUM(0) or SUM(1) while SUM
> > expects a tuple.
> >
> > COUNT(pared.n2) can return 0, and you are making a division by 0, maybe
> > it would be better to filter non-null or to test NULL values. It would
> > avoid an internal exception giving you a NULL result.
> >
> > In the second code give a try to this, I hope it would do the trick:
> >
> > pared = foreach beacon_fact generate n1, n2, (IsEmpty(n2) ? 0 : 1) as
> > ooz:int;
> > grouped = group pared by n1;
> > counted = foreach grouped generate group, (IsEmpty(pared.n2) ?
> > 0:(double)SUM(pared.n1)/(double)COUNT(pared.n2)) as ratio:double;
> >
> > Regards
> >
> > -Vincent
> >
> > On Tue, Nov 30, 2010 at 7:17 PM, Jonathan Coveney <[email protected]> wrote:
> >
> > > (not sure if this double posted or not... I accidentally sent it to the
> > > Hadoop mailing list and not the pig mailing list)
> > >
> > > I appreciate any help you can give. I've searched around and haven't
> > > found anything directly related... I've gone through documentation but
> > > can't find a real reason why this doesn't work.
> > >
> > > Here is the gist of my code (n1 is arbitrary, just to group by, n2 is
> > > either null or a large integer):
> > >
> > > table = LOAD stuff AS (n1:chararray, n2:chararray, other irrelevant stuff);
> > > pared = foreach table generate n1, n2;
> > > grouped = group pared by n1;
> > > counted = foreach grouped generate group,
> > > (double)SUM((IsEmpty(pared.n2) ? 0:1))/(double)COUNT(pared.n2) as ratio:double;
> > > ordered = order counted by ratio desc;
> > > limited = limit ordered 200;
> > > dump limited;
> > >
> > > This gets this error:
> > >
> > > ERROR 1045: Could not infer the matching function for
> > > org.apache.pig.builtin.SUM as multiple or none of them fit. Please use
> > > an explicit cast.
> > >
> > > If I take out the double parenthesis in the counted sum
> > >
> > > ERROR 1000: Error during parsing. Invalid alias: SUM in {group:
> > > chararray,pared: {n1: chararray,n2: chararray}}
> > >
> > > I THINK the error is that sum wants the column of a bag as an input,
> > > not actual integers...so I thought I'd try and make that happen by
> > > making the input take the form I want.
> > >
> > > So in order to try and get around this, I thought this might work
> > > (changing only these lines)
> > >
> > > pared = foreach beacon_fact generate n1, (IsEmpty(n2) ? 0 : 1) as ooz:int;
> > > grouped = group pared by n1;
> > > counted = foreach grouped generate group,
> > > (double)SUM(pared.n1)/(double)COUNT(pared.n2) as ratio:double;
> > >
> > > But this gives this error:
> > > ERROR 1000: Error during parsing. Invalid alias: n2 in {n1: chararray,ooz: int}
> > >
> > > I have no real clue why this fails... I tried breaking it up into two
> > > steps and it doesn't matter.
> > >
> > > I'd ideally like to do this without making a UDF, as I feel the base
> > > functionality should support it. Not sure.
> > >
> > > Either way, I'd appreciate any help or pointers, as well as any
> > > rationale as to why it does or doesn't work within the pig framework.
> > > The whole bag system is still somewhat counterintuitive.
> > >
> > > Thank you for your time
> > >
> >
>