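A GROUP ALL should be fast as long as you apply an algebraic function
to the result (and COUNT is algebraic). Pig will automatically translate
that into "count on the mappers, then sum up the counts on the single
reducer", so the reducer only sees one partial count per combiner output
rather than every record.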

Some versions of Pig had problems applying this optimization when the
same foreach also generates a constant. Try computing the count first
and adding the constant in a separate foreach:

A = load ...;
B = group A all;
C = foreach B generate COUNT(A) as cnt;
D = foreach C generate
  'phone_records' AS description,
  cnt;
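
If you want to check whether your version actually applies the
optimization, you can ask Pig for the plan instead of running the job.
A rough sketch, assuming the same aliases as above (the load is left
elided as before):

A = load ...;
B = group A all;
C = foreach B generate COUNT(A) as cnt;
-- print the logical, physical and MapReduce plans without running the job
explain C;

In the versions I have seen, the MapReduce plan lists a "Combine Plan"
section; if that section is non-empty, the partial counts are being
computed before the data reaches the reducer.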

On Wed, Dec 7, 2011 at 12:35 PM, Austin Stickney wrote:
> I am trying to find the most efficient way to count the total number of 
> records in a relation. The simplest way would be to do a GROUP ALL, and then 
> do a COUNT, but doing a GROUP ALL seems to always use just one reducer, which 
> can take a really long time when counting several hundred million records. 
> Here is an example:
>
> A = LOAD 'phones.tsv' USING PigStorage AS (
>  fname:chararray,
>  lname:chararray,
>  phone:chararray
> );
>
> B = GROUP A ALL;
>
> C = FOREACH A GENERATE
>  'phone_records' AS description,
>  COUNT(A) AS count
> ;
>
> Is there a quicker way to get the total count without having to group all?  
> Would it be any quicker to do it the following way, since it avoids a GROUP 
> ALL, or would the grouping job still be limited to one reducer because it is 
> essentially doing the same thing?
>
>
> D = FOREACH A GENERATE 1 AS group_key;
>
> E = GROUP D BY group_key;
>
> F = FOREACH E GENERATE
>  'phone_records' AS description,
>  COUNT(D) AS count
> ;
>
> Thanks,
> Austin
