I suppose it's possible to modify SUM to allow stuff like SUM(0), but I can
tell you in advance it will give 0 :-).
At a higher level, Pig is an interpreted language, and like other interpreted
languages such as Perl or Python, it sometimes fails to understand us, mainly
because of wrong or undefined syntax.
Here Pig found SUM(int), whereas it only knows SUM applied to a bag such as
pared.n2. Judging by the error "Invalid alias: SUM in {group:
chararray,pared: {n1: chararray,n2: chararray}}", it then tried to resolve
SUM as a field name.
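
To make the difference concrete, here is a minimal sketch. The relation
names A and G, the fields k and v, and the path 'data' are made up for
illustration and are not from your script:

```pig
-- Hypothetical input: one grouping key and one integer value per row.
A = LOAD 'data' AS (k:chararray, v:int);
G = GROUP A BY k;

-- A.v projects the v field out of each group's bag, so SUM receives a bag.
ok = FOREACH G GENERATE group, SUM(A.v) AS total;

-- By contrast, SUM(1) would hand SUM a plain int and is rejected
-- before any job runs:
-- bad = FOREACH G GENERATE group, SUM(1);
```

The point is that the bincond has to be evaluated per row before the GROUP,
so that the result exists as a field in the bag that SUM can aggregate.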
Otherwise, in the answer I gave you, you can remove ooz:
pared = foreach beacon_fact generate n1, n2;
grouped = group pared by n1;
counted = foreach grouped generate group, (IsEmpty(pared.n2) ?
0:(double)SUM(pared.n1)/(double)COUNT(pared.n2)) as ratio:double;
or you can remove n2:
pared = foreach beacon_fact generate n1, (IsEmpty(n2) ? 0 : 1) as ooz:int;
grouped = group pared by n1;
counted = foreach grouped generate group,
(double)SUM(pared.n1)/(double)SUM(pared.ooz) as ratio:double;
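
If what you are after is the share of rows per n1 that have a non-null n2,
a variant along these lines might be closer. This is only a sketch: the
names flagged and has_n2 are mine, and I am assuming n2 is a plain
chararray that is null when missing:

```pig
-- Flag each row before grouping: 1 if n2 is present, 0 if it is null.
flagged = FOREACH beacon_fact GENERATE n1, (n2 is null ? 0 : 1) AS has_n2:int;
grouped = GROUP flagged BY n1;

-- SUM aggregates the flag column of each group's bag;
-- COUNT counts the tuples in that same bag.
counted = FOREACH grouped GENERATE group,
    (double)SUM(flagged.has_n2) / (double)COUNT(flagged) AS ratio:double;
```

Here the denominator counts rows of the group instead of non-null n2
values, so an all-null group gives a ratio of 0 instead of a division by 0.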
Regards
-Vincent
On Tue, Nov 30, 2010 at 7:56 PM, Jonathan Coveney <[email protected]> wrote:
> Vincent, I really appreciate you taking a look. That worked (I tried the
> second fix...I still am curious if it'd be possible to do something internal
> to sum, but I think in its current implementation you cannot).
>
> At a higher level, why would things like that fail? Is it a feature or a
> limitation of pig that things like that happen? I only ask as someone who
> hopes to one day be intelligent enough to contribute to the project...
>
> 2010/11/30 Vincent <[email protected]>
>
> > Looking at your code I found the following mistakes:
> >
> > SUM((IsEmpty(pared.n2) ? 0:1)) will try to do SUM(0) or SUM(1) while SUM
> > expects a tuple.
> >
> > COUNT(pared.n2) can return 0, and you are making a division by 0, maybe it
> > would be better to filter non-null or to test NULL values. It would avoid
> > an internal exception giving you a NULL result.
> >
> > In the second code give a try to this, I hope it would do the trick:
> >
> > pared = foreach beacon_fact generate n1, n2, (IsEmpty(n2) ? 0 : 1) as
> > ooz:int;
> > grouped = group pared by n1;
> > counted = foreach grouped generate group, (IsEmpty(pared.n2) ?
> > 0:(double)SUM(pared.n1)/(double)COUNT(pared.n2)) as ratio:double;
> >
> > Regards
> >
> > -Vincent
> >
> > On Tue, Nov 30, 2010 at 7:17 PM, Jonathan Coveney <[email protected]> wrote:
> >
> > > (not sure if this double posted or not... I accidentally sent it to the
> > > Hadoop mailing list and not the pig mailing list)
> > >
> > > I appreciate any help you can give. I've searched around and haven't
> > > found anything directly related... I've gone through documentation but
> > > can't find a real reason why this doesn't work.
> > >
> > > Here is the gist of my code (n1 is arbitrary, just to group by, n2 is
> > > either null or a large integer):
> > >
> > > table = LOAD stuff AS (n1:chararray, n2:chararray, other irrelevant stuff);
> > > pared = foreach table generate n1, n2;
> > > grouped = group pared by n1;
> > > counted = foreach grouped generate group,
> > > (double)SUM((IsEmpty(pared.n2) ? 0:1))/(double)COUNT(pared.n2) as ratio:double;
> > > ordered = order counted by ratio desc;
> > > limited = limit ordered 200;
> > > dump limited;
> > >
> > > This gets this error:
> > >
> > > ERROR 1045: Could not infer the matching function for
> > > org.apache.pig.builtin.SUM as multiple or none of them fit. Please use
> > > an explicit cast.
> > >
> > > If I take out the double parenthesis in the counted sum
> > >
> > > ERROR 1000: Error during parsing. Invalid alias: SUM in {group:
> > > chararray,pared: {n1: chararray,n2: chararray}}
> > >
> > > I THINK the error is that sum wants the column of a bag as an input,
> > > not actual integers...so I thought I'd try and make that happen by
> > > making the input take the form I want.
> > >
> > > So in order to try and get around this, I thought this might work
> > > (changing only these lines)
> > >
> > > pared = foreach beacon_fact generate n1, (IsEmpty(n2) ? 0 : 1) as ooz:int;
> > > grouped = group pared by n1;
> > > counted = foreach grouped generate group,
> > > (double)SUM(pared.n1)/(double)COUNT(pared.n2) as ratio:double;
> > >
> > > But this gives this error:
> > > ERROR 1000: Error during parsing. Invalid alias: n2 in {n1:
> > > chararray,ooz: int}
> > >
> > > I have no real clue why this fails... I tried breaking it up into two
> > > steps and it doesn't matter.
> > >
> > > I'd ideally like to do this without making a UDF, as I feel the base
> > > functionality should support it. Not sure.
> > >
> > > Either way, I'd appreciate any help or pointers, as well as any
> > > rationale as to why it does or doesn't work within the pig framework.
> > > The whole bag system is still somewhat counterintuitive.
> > >
> > > Thank you for your time
> > >
> >
>