I am trying to find the most efficient way to count the total number of records 
in a relation. The simplest way is to do a GROUP ALL followed by a COUNT, but 
GROUP ALL always seems to use just one reducer, which can take a very long time 
when counting several hundred million records. Here is an example:

A = LOAD 'phones.tsv' USING PigStorage AS (
  fname:chararray,
  lname:chararray,
  phone:chararray
);

B = GROUP A ALL;

C = FOREACH B GENERATE
  'phone_records' AS description,
  COUNT(A) AS count
;

Is there a quicker way to get the total count without a GROUP ALL? Would it be 
any quicker to do it the following way, which groups on a constant key instead, 
or would the grouping job still be limited to one reducer because it is 
essentially doing the same thing?


D = FOREACH A GENERATE 1 AS group_key;

E = GROUP D BY group_key;

F = FOREACH E GENERATE
  'phone_records' AS description,
  COUNT(D) AS count
;

Thanks,
Austin
