Try using the PARALLEL clause, or group by some key first, count within
each group, and then do a GROUP ALL on that much smaller result to sum the
per-group counts?
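
Something like this rough, untested sketch is what I have in mind. It
assumes lname has enough distinct values to spread the rows across
reducers, and the PARALLEL value of 20 is just a placeholder to tune for
your cluster. The aliases G, P, Q, R and the field names partial and total
are made up for illustration. I used COUNT_STAR rather than COUNT so that
rows with a null first field are still counted.

```pig
A = LOAD 'phones.tsv' USING PigStorage() AS (
  fname:chararray,
  lname:chararray,
  phone:chararray
);

-- Stage 1: group by a key with many distinct values so the work is
-- spread over several reducers (lname and 20 are assumptions).
G = GROUP A BY lname PARALLEL 20;

-- One partial count per group; COUNT_STAR also counts rows whose
-- first field is null, which COUNT would skip.
P = FOREACH G GENERATE COUNT_STAR(A) AS partial;

-- Stage 2: GROUP ALL now only sees one small row per distinct key.
Q = GROUP P ALL;

R = FOREACH Q GENERATE
  'phone_records' AS description,
  SUM(P.partial) AS total
;

DUMP R;
```

If the key is heavily skewed, one reducer will still get most of the rows
in the first stage, so pick a key whose values are fairly evenly spread.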

On Dec 7, 2011, at 12:36 PM, Austin Stickney <[email protected]> wrote:

> I am trying to find the most efficient way to count the total number of 
> records in a relation. The simplest way would be to do a GROUP ALL, and then 
> do a COUNT, but doing a GROUP ALL seems to always use just one reducer, which 
> can take a really long time when counting several hundred million records. 
> Here is an example:
>
> A = LOAD 'phones.tsv' USING PigStorage AS (
>  fname:chararray,
>  lname:chararray,
>  phone:chararray
> );
>
> B = GROUP A ALL;
>
> C = FOREACH B GENERATE
>  'phone_records' AS description,
>  COUNT(A) AS count
> ;
>
> Is there a quicker way to get the total count without having to group all?  
> Would it be any quicker to do it the following way, since it avoids a GROUP 
> ALL, or would the grouping job still be limited to one reducer because it is 
> essentially doing the same thing?
>
>
> D = FOREACH A GENERATE 1 AS group_key;
>
> E = GROUP D BY group_key;
>
> F = FOREACH E GENERATE
>  'phone_records' AS description,
>  COUNT(D) AS count
> ;
>
> Thanks,
> Austin
