Vincent, Dmitriy, I appreciate the explanations. They helped a lot. I think the last point of your explanation, Dmitriy, was what I was getting at... functions as arguments and all that. I guess I'm spoiled from functional programming stuff, but it's exciting to think that such advanced functionality might come to Pig.
Thanks again
Jon

2010/11/30 Dmitriy Ryaboy <[email protected]>

> Jonathan,
>
> At the higher level:
>
> When you group a relation, let's call it A, you get a new relation, let's
> call it B, with two fields -- "group" (the grouping key) and "A" (a bag of
> tuples from A with a matching key). When you iterate over B with "foreach",
> you are iterating over these two-field rows.
>
> When, while iterating, you say A.foo, you are referring to a new bag that is
> the projection of the field foo from the bag A. So if we had
> B: (1, { (1, a) , (1, b), (1, c) } )
> and you said
> C = foreach B generate A.foo;
> you would get
> C: ( { (a), (b), (c) } )
>
> So now we get to SUM. SUM takes a bag of 1-element tuples (which is what you
> get when you create a projection above), and returns the sum of their
> contents. IsEmpty takes a bag and returns a boolean. (IsEmpty(A.foo) ? 0 : 1)
> returns an int. Can't call SUM on an int.
>
> What you actually mean when you say SUM( A.foo is null ? 0 : A.foo ) is
> "apply the function `if val is null return 0 else return val` to every
> element in A.foo, and run SUM over that". Which is functional programming,
> and that's cool, but we are pretty far from tackling functions as arguments
> in Pig.
>
> So the answer is -- do the null check before you group and use the result
> later, as Vincent suggests.
>
> D
>
> On Tue, Nov 30, 2010 at 8:56 AM, Jonathan Coveney <[email protected]> wrote:
>
> > Vincent, I really appreciate you taking a look. That worked (I tried the
> > second fix...I still am curious if it'd be possible to do something
> > internal to sum, but I think in its current implementation you cannot).
> >
> > At a higher level, why would things like that fail? Is it a feature or a
> > limitation of pig that things like that happen? I only ask as someone who
> > hopes to one day be intelligent enough to contribute to the project...
> >
> > 2010/11/30 Vincent <[email protected]>
> >
> > > Looking at your code I found the following mistakes:
> > >
> > > SUM((IsEmpty(pared.n2) ? 0:1)) will try to do SUM(0) or SUM(1) while SUM
> > > expects a tuple.
> > >
> > > COUNT(pared.n2) can return 0, and you are making a division by 0, maybe it
> > > would be better to filter non-null or to test NULL values. It would avoid
> > > an internal exception giving you a NULL result.
> > >
> > > In the second code give a try to this, I hope it would do the trick:
> > >
> > > pared = foreach beacon_fact generate n1, n2, (IsEmpty(n2) ? 0 : 1) as ooz:int;
> > > grouped = group pared by n1;
> > > counted = foreach grouped generate group, (IsEmpty(pared.n2) ? 0:(double)SUM(pared.n1)/(double)COUNT(pared.n2)) as ratio:double;
> > >
> > > Regards
> > >
> > > -Vincent
> > >
> > > On Tue, Nov 30, 2010 at 7:17 PM, Jonathan Coveney <[email protected]> wrote:
> > >
> > > > (not sure if this double posted or not... I accidentally sent it to the
> > > > Hadoop mailing list and not the pig mailing list)
> > > >
> > > > I appreciate any help you can give. I've searched around and haven't found
> > > > anything directly related... I've gone through documentation but can't find
> > > > a real reason why this doesn't work.
> > > >
> > > > Here is the gist of my code (n1 is arbitrary, just to group by, n2 is
> > > > either null or a large integer):
> > > >
> > > > table = LOAD stuff AS (n1:chararray, n2:chararray, other irrelevant stuff);
> > > > pared = foreach table generate n1, n2;
> > > > grouped = group pared by n1;
> > > > counted = foreach grouped generate group, (double)SUM((IsEmpty(pared.n2) ? 0:1))/(double)COUNT(pared.n2) as ratio:double;
> > > > ordered = order counted by ratio desc;
> > > > limited = limit ordered 200;
> > > > dump limited;
> > > >
> > > > This gets this error:
> > > >
> > > > ERROR 1045: Could not infer the matching function for
> > > > org.apache.pig.builtin.SUM as multiple or none of them fit. Please use an
> > > > explicit cast.
> > > >
> > > > If I take out the double parenthesis in the counted sum
> > > >
> > > > ERROR 1000: Error during parsing. Invalid alias: SUM in {group:
> > > > chararray,pared: {n1: chararray,n2: chararray}}
> > > >
> > > > I THINK the error is that sum wants the column of a bag as an input, not
> > > > actual integers...so I thought I'd try and make that happen by making the
> > > > input take the form I want.
> > > >
> > > > So in order to try and get around this, I thought this might work (changing
> > > > only these lines)
> > > >
> > > > pared = foreach beacon_fact generate n1, (IsEmpty(n2) ? 0 : 1) as ooz:int;
> > > > grouped = group pared by n1;
> > > > counted = foreach grouped generate group, (double)SUM(pared.n1)/(double)COUNT(pared.n2) as ratio:double;
> > > >
> > > > But this gives this error:
> > > > ERROR 1000: Error during parsing. Invalid alias: n2 in {n1:
> > > > chararray,ooz: int}
> > > >
> > > > I have no real clue why this fails... I tried breaking it up into two steps
> > > > and it doesn't matter.
> > > >
> > > > I'd ideally like to do this without making a UDF, as I feel the base
> > > > functionality should support it. Not sure.
> > > >
> > > > Either way, I'd appreciate any help or pointers, as well as any rationale
> > > > as to why it does or doesn't work within the pig framework. The whole bag
> > > > system is still somewhat counterintuitive.
> > > >
> > > > Thank you for your time
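
For reference, a minimal sketch of the approach the thread settles on: compute the null flag per row before grouping, then aggregate the flag inside the foreach. The input path 'input.tsv', the alias raw, and the two-column schema are assumptions for illustration, not taken from the original script. The ratio is assumed to mean the fraction of rows per n1 that have a non-null n2, which is a guess at the intent. The flag is computed with "is null" rather than IsEmpty, since IsEmpty is meant for bags and maps rather than scalar fields.

```pig
-- Hypothetical input: tab-separated rows of (n1, n2); n2 may be null.
raw     = LOAD 'input.tsv' AS (n1:chararray, n2:chararray);

-- Do the null check per row, before grouping: 1 if n2 is present, else 0.
pared   = FOREACH raw GENERATE n1, (n2 IS NULL ? 0 : 1) AS ooz:int;

-- Each row of grouped is (group, pared), where pared is a bag of (n1, ooz).
grouped = GROUP pared BY n1;

-- pared.ooz is a bag of 1-element tuples, which is what SUM expects.
-- COUNT(pared) counts the tuples in the group's bag.
counted = FOREACH grouped GENERATE group,
          (double)SUM(pared.ooz) / (double)COUNT(pared) AS ratio:double;

ordered = ORDER counted BY ratio DESC;
limited = LIMIT ordered 200;
DUMP limited;
```

Because the flag already exists as a column, SUM only ever sees a projected bag, so no function needs to be applied to elements inside the aggregate.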
