I think FILTER will do the trick? E.g.

rows = LOAD 'cassandra://MyKeyspace/MyColumnFamily' USING CassandraStorage()
    AS (key, columns: bag {T: tuple(name, value)});
filter_rows = FILTER rows BY columns is not null;
counts = FOREACH filter_rows GENERATE COUNT(columns);
counts_in_bag = GROUP counts ALL;
sum_of_bag = FOREACH counts_in_bag GENERATE SUM($1);
dump sum_of_bag;

On Tue, Jun 7, 2011 at 4:33 PM, William Oberman <ober...@civicscience.com> wrote:

> I tried this same script on closer to production data, and I'm getting
> errors. I'm 50% sure it's this:
> https://issues.apache.org/jira/browse/PIG-1283
>
> One of my rows in cassandra has no columns (maybe?), which maybe causes a
> null bag, which causes COUNT to blow up (at least, that's my theory). As a
> workaround, can I have COUNT ignore/skip rows with null columns? I'll start
> digging through the docs as well.
>
> will
>
> On Fri, Jun 3, 2011 at 4:09 PM, William Oberman <ober...@civicscience.com> wrote:
>
>> That is exactly what I wanted, thanks for the confirm!
>>
>> On Fri, Jun 3, 2011 at 4:06 PM, Dmitriy Ryaboy <dvrya...@gmail.com> wrote:
>>
>>> I am not sure what you mean by "count all columns". The code you have
>>> counts all *cells*.
>>> So:
>>> id1: col1, col2
>>> id2: col1, col2, col3
>>>
>>> has 3 columns in a conventional sense, but your code will return 5. Is
>>> that what you want? If so, your code seems correct.
>>>
>>> D
>>>
>>> On Fri, Jun 3, 2011 at 12:53 PM, William Oberman <ober...@civicscience.com> wrote:
>>> > Howdy,
>>> >
>>> > I'm coming from cassandra, and I'm actually trying to count all columns
>>> > in a column family. I believe that is similar to counting the number
>>> > tuples in a bag in the lingo in the pig manual. It was harder than I
>>> > expected, but I think this works:
>>> > rows = LOAD 'cassandra://MyKeyspace/MyColumnFamily' USING CassandraStorage()
>>> >     AS (key, columns: bag {T: tuple(name, value)});
>>> > counts = FOREACH rows GENERATE COUNT(columns);
>>> > counts_in_bag = GROUP counts ALL;
>>> > sum_of_bag = FOREACH counts_in_bag GENERATE SUM($1);
>>> > dump sum_of_bag;
>>> >
>>> > My question is: am I right that it works? I started with 3 keys having a
>>> > total of 5 columns and got (5). Then I added a new key/column, and another
>>> > column on an existing key and got (7). So, it seems like it's working.
>>> > But, was there a better way to write it?
>>> >
>>> > Thanks!
>>> >
>>> > will
>>
>> --
>> Will Oberman
>> Civic Science, Inc.
>> 3030 Penn Avenue., First Floor
>> Pittsburgh, PA 15201
>> (M) 412-480-7835
>> (E) ober...@civicscience.com