Jonathan, Prashant, you guys are awesome! Thanks for the explanation! It's much
clearer now!
On Mar 22, 2012, at 4:40 PM, Prashant Kommireddi wrote:
> Aggregation functions (COUNT, SUM, AVG, ...) work on bags. Since you are
> counting over the entire relation in this case, you use GROUP ALL, which,
> as you said, forms a single bag out of all the tuples.
>
> grunt> A = load 'data' as (a:int, b:int);
> grunt> describe A;
> A: {a: int,b: int}
>
> Now, once the GROUP operator is applied, Pig implicitly names the first
> field of the resulting relation 'group'. The second field is named after
> the relation you grouped (in this example 'A', in your case
> 'productscans') and holds the bag of tuples.
>
> grunt> B = group A ALL;
> grunt> describe B;
> B: {group: chararray,A: {(a: int,b: int)}}
>
> So you can now reference the bag either by its alias 'A' or by the
> positional notation '$1'. Note that $0 refers to the implicit field
> 'group' generated by Pig.
>
> grunt> C = foreach B generate group, COUNT(A);
> OR
> grunt> C = foreach B generate group, COUNT($1);
>
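> To make the steps above concrete, here is a minimal sketch run end to end.
> The contents of 'data' are an assumption made up for illustration (three
> tab-separated rows, tab being the default PigStorage delimiter):
>
> ```pig
> -- Assumed sample input: 'data' holds the rows (1,2), (3,4) and (5,6),
> -- one per line, with the two fields separated by a tab.
> A = load 'data' as (a:int, b:int);
>
> -- GROUP ALL yields a single tuple: ('all', {bag of every tuple in A}).
> B = group A ALL;
>
> -- COUNT receives the bag named A; COUNT($1) refers to the same bag.
> C = foreach B generate group, COUNT(A);
>
> dump C;   -- (all,3) for the three assumed rows
> ```
>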
> Thanks,
> Prashant
>
>
>
> You are counting against 'productscans' as that is the original relation
> you group'ed on.
>
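> Tying this back to the original script: in
> FOREACH productscans GENERATE COUNT($0), $0 is the scalar field 'thetime'
> (a long), not a bag, so Pig has no COUNT variant that fits it, which is
> what the "Could not infer the matching function" error is reporting. A
> sketch of the whole script with the grouping step in place, reusing the
> aliases and path from the thread:
>
> ```pig
> scans = LOAD '/hive/scans/*' USING PigStorage(',') AS
>     (thetime:long, product_id:long, lat:double, lon:double,
>      user:chararray, category:chararray, title:chararray);
> productscans = FILTER scans BY (title MATCHES 'proactiv');
>
> -- Collect every filtered tuple into one bag named 'productscans'.
> grpd = GROUP productscans ALL;
>
> -- COUNT now receives a bag instead of a scalar field.
> scancount = FOREACH grpd GENERATE COUNT(productscans);
> DUMP scancount;
> ```
>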
> On Thu, Mar 22, 2012 at 1:46 PM, Jason Alexander <[email protected]> wrote:
>
>> Very nice, worked like a champ, Prashant.
>>
>> Any chance you could explain why? I'd love to be taught to fish, not just
>> given the fish to eat. ;-)
>>
>> GROUP ALL, as I read it, pulls the tuples into a single group. But,
>> FOREACH'ing on each group, and counting against productscans is where my
>> brain starts to hurt.
>>
>>
>> Thanks again for your help!
>> -Jason
>>
>>
>> On Mar 22, 2012, at 3:33 PM, Prashant Kommireddi wrote:
>>
>>> Hi Jason,
>>>
>>> Are you trying to count the number of records in the relation
>>> 'productscans'? If so, you would have to GROUP the relation first:
>>> http://pig.apache.org/docs/r0.9.1/basic.html#GROUP
>>>
>>> grpd = GROUP productscans ALL;
>>> scancount = FOREACH grpd GENERATE COUNT(productscans);
>>> DUMP scancount;
>>>
>>> Thanks,
>>> Prashant
>>>
>>> On Thu, Mar 22, 2012 at 1:28 PM, Jason Alexander <[email protected]>
>>> wrote:
>>>
>>>> Hey all,
>>>>
>>>>
>>>> I'm trying to write a script to pull the count of a dataset that I've
>>>> filtered.
>>>>
>>>> Here's the script so far:
>>>>
>>>> /* scans by title */
>>>>
>>>> scans = LOAD '/hive/scans/*' USING PigStorage(',') AS
>>>> (thetime:long,product_id:long,lat:double,lon:double,user:chararray,category:chararray,title:chararray);
>>>> productscans = FILTER scans BY (title MATCHES 'proactiv');
>>>> scancount = FOREACH productscans GENERATE COUNT($0);
>>>> DUMP scancount;
>>>>
>>>> For some reason, I get the error:
>>>>
>>>> Could not infer the matching function for org.apache.pig.builtin.COUNT as
>>>> multiple or none of them fit. Please use an explicit cast.
>>>>
>>>> What am I doing wrong here? I'm assuming it has something to do with the
>>>> type of the field I'm passing in, but I can't seem to resolve this.
>>>>
>>>>
>>>> TIA,
>>>> -Jason