Whoops, fat-fingered it. Part two:

grunt> d = foreach c generate SUM($0);

Wait a second...this doesn't make much sense. Foreaches work on columns in
rows, not on relations (nothing works on relations). So how do we count
things? We need to put everything in one row.

grunt> d = group c all;

grunt> describe d;
d: {group: chararray,c: {(long)}}

What is going on here? Grouping takes all of the unique keys you're
grouping on, then gives you all of the rows associated with each key. In
the case of group all, there is a single key, "all," so you'll have a new
relation, d, which has one row, which in turn holds a bag of every row in
the relation you grouped on. This is kind of a weird thing to have to do,
but the point is to make all of those rows available at the foreach level.
Now you do:

grunt> e = foreach d generate SUM($1);

Why does this work? Well, SUM is just a function that takes a bag, and
returns a value. COUNT works the same way: it accepts a bag (NOT a
relation), and returns a value. So by grouping, we collect all of the rows
into a Bag (Pig's only spillable datatype). This lets us use these
functions on that bag of rows, and return the value we want.
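
Putting the pieces together, a minimal sketch of the whole script might look
like the following. It assumes the dummy input 'thing' from the quoted
example, in Pig's default tab-delimited format. The field name sq and the
output names total and n are illustrative additions, not part of the
original commands.

```pig
-- Sketch only: 'thing' is the dummy input from the quoted example.
a = load 'thing' as (x:int, y:long);

-- Filter works row by row: keep rows where x > 1000.
b = filter a by x > 1000;

-- Foreach works on the columns of each row: one row in, one row out.
-- "sq" is a hypothetical name added here so the column can be referenced later.
c = foreach b generate (x*x + y*y) as sq;

-- Group all collapses c into a single row: the key 'all' plus a bag
-- holding every row of c.
d = group c all;

-- SUM and COUNT each take a bag and return one value, so they can now
-- be applied inside a foreach over d.
e = foreach d generate SUM(c.sq) as total, COUNT(c) as n;

dump e;
```

Here SUM(c.sq) is the named equivalent of SUM($1) above, and COUNT(c) is the
same pattern applied to the original counting question.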

2012/3/22 Jonathan Coveney <[email protected]>

> The reason can be a little hard to grok at first, but it's core to
> Pig...perhaps we need a tutorial explaining the model a bit more clearly.
>
> The foundation of Pig is a relation, i.e., scans. What does this mean? It
> means that you have a bunch of rows, and these rows have things. I'm going
> to diverge from your case and just do a dummy example.
>
> grunt> a = load 'thing' as (x:int, y:long);
> grunt> describe a;
> a: {x: int,y: long}
>
> What is this telling us? It's saying "we have a relation named a, and that
> relation consists of a bunch of rows that are an int and then a long." All
> relations work like this. You have rows of stuff, and that stuff is columns
> of Pig datatypes.
>
> So next you have a filter, so you do:
>
> grunt> b = filter a by x > 1000;
>
> Now, this makes sense. What does a filter do? It goes row by row, and
> throws some out if they don't match the criteria. Now let's say we want to
> get the total sum of the squares. So we need to get x^2+y^2, so what do we
> do?
>
> grunt> c = foreach b generate x*x+y*y;
>
> This makes sense, right? For every row in the relation b, we want to do
> some manipulation on the columns. So generally, the pattern is that we go
> row by row and do stuff on the things that exist in that row. Now you
> want a SUM (which is equivalent to your count).
>
>
>
> 2012/3/22 Jason Alexander <[email protected]>
>
>> Very nice, worked like a champ, Prashant.
>>
>> Any chance you could explain why? I'd love to be taught to fish, not just
>> given the fish to eat. ;-)
>>
>> GROUP ALL, as I read it, pulls the tuples into a single group. But,
>> FOREACH'ing on each group, and counting against productscans is where my
>> brain starts to hurt.
>>
>>
>> Thanks again for your help!
>> -Jason
>>
>>
>> On Mar 22, 2012, at 3:33 PM, Prashant Kommireddi wrote:
>>
>> > Hi Jason,
>> >
>> > Are you trying to count the number of records in the relation
>> > 'productscans'? In which case you would have to use GROUP
>> > http://pig.apache.org/docs/r0.9.1/basic.html#GROUP
>> >
>> > grpd = GROUP productscans ALL;
>> > scancount = FOREACH grpd GENERATE COUNT(productscans);
>> > DUMP scancount;
>> >
>> > Thanks,
>> > Prashant
>> >
>> > On Thu, Mar 22, 2012 at 1:28 PM, Jason Alexander <[email protected]> wrote:
>> >
>> >> Hey all,
>> >>
>> >>
>> >> I'm trying to write a script to pull the count of a dataset that I've
>> >> filtered.
>> >>
>> >> Here's the script so far:
>> >>
>> >> /* scans by title */
>> >>
>> >> scans = LOAD '/hive/scans/*' USING PigStorage(',') AS
>> >> (thetime:long,product_id:long,lat:double,lon:double,user:chararray,category:chararray,title:chararray);
>> >> productscans = FILTER scans BY (title MATCHES 'proactiv');
>> >> scancount = FOREACH productscans GENERATE COUNT($0);
>> >> DUMP scancount;
>> >>
>> >> For some reason, I get the error:
>> >>
>> >> Could not infer the matching function for org.apache.pig.builtin.COUNT as
>> >> multiple or none of them fit. Please use an explicit cast.
>> >>
>> >> What am I doing wrong here? I'm assuming it has something to do with the
>> >> type of the field I'm passing in, but I can't seem to resolve this.
>> >>
>> >>
>> >> TIA,
>> >> -Jason
>> >>
>> >>
>> >>
>> >>
>>
>>
>
