Vincent, I really appreciate you taking a look. That worked (I tried the
second fix). I am still curious whether it would be possible to do something
inside SUM itself, but I think in its current implementation you cannot.

At a higher level, why would things like that fail? Is it a feature or a
limitation of Pig that things like that happen? I only ask as someone who
hopes to one day be intelligent enough to contribute to the project...

2010/11/30 Vincent <[email protected]>

> Looking at your code I found the following mistakes:
>
> SUM((IsEmpty(pared.n2) ? 0:1)) will try to do SUM(0) or SUM(1), while SUM
> expects a bag.
>
> COUNT(pared.n2) can return 0, in which case you are dividing by 0. It may be
> better to filter out the null values first, or to test for NULL before
> dividing. That would avoid an internal exception that gives you a NULL
> result.
>
> For the second version, give this a try; I hope it does the trick:
>
> pared = foreach beacon_fact generate n1, n2, (IsEmpty(n2) ? 0 : 1) as ooz:int;
> grouped = group pared by n1;
> counted  = foreach grouped generate group, (IsEmpty(pared.n2) ?
>     0:(double)SUM(pared.ooz)/(double)COUNT(pared.n2)) as ratio:double;
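>
> As a variant of the same idea, here is a minimal, untested sketch that
> tests n2 for null directly and divides by the number of rows in each group.
> It assumes the ratio you want is "rows with n2 set / all rows"; the path
> 'stuff' and the two-column schema are placeholders:
>
> beacon_fact = LOAD 'stuff' AS (n1:chararray, n2:chararray);
> -- flag each row before grouping: 1 when n2 is present, 0 when it is null
> pared = foreach beacon_fact generate n1, ((n2 is null) ? 0 : 1) as ooz:int;
> grouped = group pared by n1;
> -- SUM gets a one-column bag (pared.ooz); COUNT(pared) counts the group's rows
> counted = foreach grouped generate group,
>     (double)SUM(pared.ooz) / (double)COUNT(pared) as ratio:double;
>
> Note that COUNT skips tuples whose first field is null, so rows with a null
> n1 may need to be filtered out before grouping.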
>
> Regards
>
> -Vincent
>
> On Tue, Nov 30, 2010 at 7:17 PM, Jonathan Coveney <[email protected]> wrote:
>
> > (not sure if this double posted or not... I accidentally sent it to the
> > Hadoop mailing list and not the pig mailing list)
> >
> > I appreciate any help you can give. I've searched around and haven't found
> > anything directly related... I've gone through documentation but can't find
> > a real reason why this doesn't work.
> >
> > Here is the gist of my code (n1 is arbitrary, just to group by; n2 is
> > either null or a large integer):
> >
> > table = LOAD stuff AS (n1:chararray, n2:chararray, other irrelevant stuff);
> > pared = foreach table generate n1, n2;
> > grouped = group pared by n1;
> > counted  = foreach grouped generate group,
> >     (double)SUM((IsEmpty(pared.n2) ? 0:1))/(double)COUNT(pared.n2) as ratio:double;
> > ordered = order counted by ratio desc;
> > limited = limit ordered 200;
> > dump limited;
> >
> > This gets this error:
> >
> > ERROR 1045: Could not infer the matching function for
> > org.apache.pig.builtin.SUM as multiple or none of them fit. Please use an
> > explicit cast.
> >
> > If I take out the double parentheses in the counted SUM, I get:
> >
> > ERROR 1000: Error during parsing. Invalid alias: SUM in {group:
> > chararray,pared: {n1: chararray,n2: chararray}}
> >
> > I THINK the error is that sum wants the column of a bag as an input, not
> > actual integers...so I thought I'd try and make that happen by making the
> > input take the form I want.
> >
> > So in order to try and get around this, I thought this might work
> > (changing only these lines):
> >
> > pared = foreach beacon_fact generate n1, (IsEmpty(n2) ? 0 : 1) as ooz:int;
> > grouped = group pared by n1;
> > counted  = foreach grouped generate group,
> > (double)SUM(pared.n1)/(double)COUNT(pared.n2) as ratio:double;
> >
> > But this gives this error:
> > ERROR 1000: Error during parsing. Invalid alias: n2 in {n1: chararray,ooz:
> > int}
> >
> > I have no real clue why this fails... I tried breaking it up into two steps
> > and it doesn't matter.
> >
> > I'd ideally like to do this without making a UDF, as I feel the base
> > functionality should support it. Not sure.
> >
> > Either way, I'd appreciate any help or pointers, as well as any rationale
> > as to why it does or doesn't work within the pig framework. The whole bag
> > system is still somewhat counterintuitive.
> >
> > Thank you for your time
> >
>
