Jonathan, Prashant, you guys are awesome! Thanks for the explanation! It's much 
clearer now!
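
For anyone finding this thread later, the explanation quoted below can be sketched in plain Python. This is only an analogy, not how Pig is implemented: relations are modelled as lists of tuples, `group_all` and `count` are hypothetical helper names standing in for GROUP ... ALL and COUNT, and the sample rows are invented.

```python
# Rough Python model of the Pig semantics discussed in this thread.
# Relations are lists of tuples; a bag is a list of tuples nested in a row.

def group_all(relation, alias):
    """Model of `GROUP relation ALL`: a single row holding every input tuple."""
    # 'group' is the implicit first field ($0); the second field ($1) is the
    # bag, named after the relation that was grouped.
    return [{"group": "all", alias: list(relation)}]

def count(bag):
    """Model of COUNT: accepts a bag only, never a scalar field."""
    if not isinstance(bag, list):
        # Loosely mirrors the "Could not infer the matching function" error
        # raised for COUNT($0) when $0 is a long rather than a bag.
        raise TypeError("COUNT expects a bag, got a scalar")
    # COUNT skips tuples whose first field is null.
    return sum(1 for t in bag if t[0] is not None)

# Invented sample rows: (thetime, product_id, title)
productscans = [
    (1332448000, 101, "proactiv"),
    (1332448060, 101, "proactiv"),
    (1332448120, 102, "proactiv"),
]

grpd = group_all(productscans, "productscans")             # GROUP productscans ALL
scancount = [(count(row["productscans"]),) for row in grpd]  # FOREACH grpd GENERATE COUNT(productscans)
print(scancount)
```

Calling `count(productscans[0][0])` raises a TypeError in this model, which is the Python counterpart of running COUNT($0) on the ungrouped relation.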


On Mar 22, 2012, at 4:40 PM, Prashant Kommireddi wrote:

> Aggregate functions (COUNT, SUM, AVG, ...) work on bags. Since you are
> counting over the entire relation, you use GROUP ALL, which, as you
> said, forms a single bag out of all the tuples.
> 
> grunt> A = load 'data' as (a:int, b:int);
> grunt> describe A;
> A: {a: int,b: int}
> 
> Now, once the GROUP operator is applied, Pig implicitly names the first
> field of the resulting relation 'group'. The second field is named after
> the relation you grouped (in this example 'A', and in your case
> 'productscans') and holds the bag of tuples.
> 
> grunt> B = group A ALL;
> grunt> describe B;
> B: {group: chararray,A: {(a: int,b: int)}}
> 
> So you can now reference the bag either by its alias 'A' or by the
> positional notation '$1'. Note that $0 refers to the implicit field
> generated by Pig, called 'group'.
> 
> grunt> C = foreach B generate group, COUNT(A);
> OR
> grunt> C = foreach B generate group, COUNT($1);
> 
> Thanks,
> Prashant
> 
> 
> 
> You are counting against 'productscans' because that is the original
> relation you grouped.
> 
> On Thu, Mar 22, 2012 at 1:46 PM, Jason Alexander <[email protected]> wrote:
> 
>> Very nice, worked like a champ, Prashant.
>> 
>> Any chance you could explain why? I'd love to be taught to fish, not just
>> given the fish to eat. ;-)
>> 
>> GROUP ALL, as I read it, pulls the tuples into a single group. But,
>> FOREACH'ing on each group, and counting against productscans is where my
>> brain starts to hurt.
>> 
>> 
>> Thanks again for your help!
>> -Jason
>> 
>> 
>> On Mar 22, 2012, at 3:33 PM, Prashant Kommireddi wrote:
>> 
>>> Hi Jason,
>>> 
>>> Are you trying to count the number of records in the relation
>>> 'productscans'? If so, you have to GROUP it first:
>>> http://pig.apache.org/docs/r0.9.1/basic.html#GROUP
>>> 
>>> grpd = GROUP productscans ALL;
>>> scancount = FOREACH grpd GENERATE COUNT(productscans);
>>> DUMP scancount;
>>> 
>>> Thanks,
>>> Prashant
>>> 
>>> On Thu, Mar 22, 2012 at 1:28 PM, Jason Alexander <[email protected]> wrote:
>>> 
>>>> Hey all,
>>>> 
>>>> 
>>>> I'm trying to write a script to pull the count of a dataset that I've
>>>> filtered.
>>>> 
>>>> Here's the script so far:
>>>> 
>>>> /* scans by title */
>>>> 
>>>> scans = LOAD '/hive/scans/*' USING PigStorage(',') AS
>>>> (thetime:long,product_id:long,lat:double,lon:double,user:chararray,category:chararray,title:chararray);
>>>> productscans = FILTER scans BY (title MATCHES 'proactiv');
>>>> scancount = FOREACH productscans GENERATE COUNT($0);
>>>> DUMP scancount;
>>>> 
>>>> For some reason, I get the error:
>>>> 
>>>> Could not infer the matching function for org.apache.pig.builtin.COUNT as
>>>> multiple or none of them fit. Please use an explicit cast.
>>>> 
>>>> What am I doing wrong here? I'm assuming it has something to do with the
>>>> type of the field I'm passing in, but I can't seem to resolve this.
>>>> 
>>>> 
>>>> TIA,
>>>> -Jason
>>>> 
>>>> 
>>>> 
>>>> 
>> 
>> 
