CUBE? On Feb 6, 2013 8:33 AM, "Adair Kovac" <[email protected]> wrote:
> Thanks Bill! I didn't see that documentation--only the javadoc, which for > Pig 9.1 definitely doesn't explain that. I realize now I really shouldn't > be trusting a Google search for pig-related documentation because the > search engine seems to get confused by all the versions. > > I also eventually realized what this (from the javadoc) means: "Generates > the count of the number of values in a bag. This count does not include > null values, and thus matches SQL semantics for COUNT(a) (where a is field) > but not for COUNT(*) (where * in SQL indicates all)."--COUNT is supposed to > run over a single field. I think it's bad for the documentation to conflate > the two meanings of "null value". For what it's worth, in regards to the > jira proposal, I think if it retains its current behavior, COUNT should > error if the tuple size is greater than 1. > > Okay, next question: > > Say I'm doing counts on votes cast for student government at a large > university. Every position (president, vice president, down to dozens of > senators) is optional: a student can vote for them or not vote for them. > Many students don't vote, but the data I have reads that as a tuple of > nulls. I want to group on president and vice president and count all the > records where not every field is null (i.e., I want my counts to include > one for the number of students who voted for some office, but neither > president or vice president). I can't use COUNT_STAR, because it would > count the non-voters (I think). I can't use COUNT, because all students who > didn't vote for whatever column I passed in (or whatever column was first) > would get dropped. Filtering on $0 is not null or $1 is not null etc would > get really cumbersome, since I have dozens of fields. Is there a better > solution? > > Thanks! > > Adair > > On Tue, Feb 5, 2013 at 9:33 PM, Bill Graham <[email protected]> wrote: > > > This behavior is discussed in the count docs: > > > > http://pig.apache.org/docs/r0.10.0/func.html#count > > > > The COUNT function follows syntax semantics and ignores nulls. What this > > means is that a tuple in the bag will not be counted if the FIRST FIELD > in > > this tuple is NULL. If you want to include NULL values in the count > > computation, use > > COUNT_STAR<http://pig.apache.org/docs/r0.10.0/func.html#COUNT-STAR> > > . > > > > There is a proposal to change this though, which provides more context: > > https://issues.apache.org/jira/browse/PIG-1014 > > > > On Tue, Feb 5, 2013 at 2:30 PM, Adair Kovac <[email protected]> > wrote: > > > > > Sorry, correcting an imprecision here--field1 is the first field of the > > > records that have been grouped; that it's the first field in the key is > > > nonessential. So basically *any* group/count that I have done in the > past > > > could have been dropping records because the first field happened to be > > > something I didn't care about at the time that could be null. I am > > > distressed by this realization. > > > > > > Thanks again, > > > > > > Adair > > > > > > On Tue, Feb 5, 2013 at 3:14 PM, Adair Kovac <[email protected]> > > wrote: > > > > > > > Hi, guys, was wondering what's going on with this. > > > > > > > > In pig 0.9 if I do something like this: > > > > > > > > grouped = group data by (field1, field2); > > > > count = foreach grouped generate COUNT(data); > > > > > > > > That count is 0 wherever field1 is null regardless of what comes > after. > > > > > > > > I can use COUNT_STAR() instead (data fresh from a group won't have > any > > > > null records, right?), but it seems like that should be the expected > > > > behavior of COUNT(). > > > > > > > > This was obviously intended behavior, since it's right there in the > > > > function: > > > > > > > > if (t != null && t.size() > 0 && t.get(0) != null ) > > > > cnt++; > > > > > > > > but it just seems bizarre and inconvenient to me. Nor is it mentioned > > in > > > > the documentation, unless the bit written for people who are good at > > SQL > > > > implies it. Now I'm wondering which of my past scripts might be buggy > > > > because I didn't expect this behavior. > > > > > > > > Anyone have an explanation? > > > > > > > > Thanks, > > > > > > > > Adair > > > > > > > > > > > > > > > -- > > *Note that I'm no longer using my Yahoo! email address. Please email me > at > > [email protected] going forward.* > > >
