The reason can be a little hard to grok at first, but it's core to
Pig...perhaps we need a tutorial explaining the model a bit more clearly.
The foundation of Pig is the relation, e.g. your scans. What does this mean? It
means that you have a bunch of rows, and these rows have things. I'm going
to diverge from your case and just do a dummy example.
grunt> a = load 'thing' as (x:int, y:long);
grunt> describe a;
a: {x: int,y: long}
What is this telling us? It's saying "we have a relation named a, and that
relation consists of a bunch of rows that are an int and then a long." All
relations work like this. You have rows of stuff, and that stuff is columns
of Pig datatypes.
So next you have a filter, so you do:
grunt> b = filter a by x > 1000;
Now, this makes sense. What does a filter do? It goes row by row, and
throws some out if they don't match the criteria. Now let's say we want to
get the total sum of the squares. So we need to get x^2+y^2, so what do we
do?
grunt> c = foreach b generate x*x+y*y;
This makes sense, right? For every row in the relation b, we want to do
some manipulation on its columns. So generally, the pattern is that we go
row by row and do stuff with the things that exist in that row. Now you
want a SUM (which is analogous to your COUNT).
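Here is the catch: SUM and COUNT don't work on a single value in a row, they
work on a bag, i.e. a whole collection of rows. Inside a plain foreach, each
row only holds scalar columns, so there is nothing to aggregate. I believe
that is what the "Could not infer the matching function" error on COUNT($0)
is complaining about, since $0 there is a long and not a bag. GROUP ... ALL
is what gives you the bag: it produces a relation with exactly one row, and
that row contains a bag holding every row of the input. A rough sketch
continuing the dummy example (the alias sq and the relation names d and e are
just names I made up for illustration):
grunt> c = foreach b generate x*x+y*y as sq;
grunt> d = group c all;
grunt> e = foreach d generate SUM(c.sq);
grunt> dump e;
If you describe d, you should see a column called group plus a bag column
called c. The foreach over d visits that single row and hands the whole bag
to SUM. Prashant's script does the same thing with COUNT(productscans).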
2012/3/22 Jason Alexander <[email protected]>
> Very nice, worked like a champ, Prashant.
>
> Any chance you could explain why? I'd love to be taught to fish, not just
> given the fish to eat. ;-)
>
> GROUP ALL, as I read it, pulls the tuples into a single group. But,
> FOREACH'ing on each group, and counting against productscans is where my
> brain starts to hurt.
>
>
> Thanks again for your help!
> -Jason
>
>
> On Mar 22, 2012, at 3:33 PM, Prashant Kommireddi wrote:
>
> > Hi Jason,
> >
> > Are you trying to count the number of records in the relation
> > 'productscans'? In which case you would have to use GROUP
> > http://pig.apache.org/docs/r0.9.1/basic.html#GROUP
> >
> > grpd = GROUP productscans ALL;
> > scancount = FOREACH grpd GENERATE COUNT(productscans);
> > DUMP scancount;
> >
> > Thanks,
> > Prashant
> >
> > On Thu, Mar 22, 2012 at 1:28 PM, Jason Alexander <[email protected]> wrote:
> >
> >> Hey all,
> >>
> >>
> >> I'm trying to write a script to pull the count of a dataset that I've
> >> filtered.
> >>
> >> Here's the script so far:
> >>
> >> /* scans by title */
> >>
> >> scans = LOAD '/hive/scans/*' USING PigStorage(',') AS
> >> (thetime:long,product_id:long,lat:double,lon:double,user:chararray,category:chararray,title:chararray);
> >> productscans = FILTER scans BY (title MATCHES 'proactiv');
> >> scancount = FOREACH productscans GENERATE COUNT($0);
> >> DUMP scancount;
> >>
> >> For some reason, I get the error:
> >>
> >> Could not infer the matching function for org.apache.pig.builtin.COUNT as
> >> multiple or none of them fit. Please use an explicit cast.
> >>
> >> What am I doing wrong here? I'm assuming it has something to do with the
> >> type of the field I'm passing in, but I can't seem to resolve this.
> >>
> >>
> >> TIA,
> >> -Jason
> >>
> >>
> >>
> >>
>
>