Try using the PARALLEL clause, or perform a GROUP BY on some key, count within each group, and then do a GROUP ALL on the result to sum the per-group counts?
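
A rough, untested sketch of the second suggestion, reusing the relation from your example. Grouping on lname and the PARALLEL 20 value are only assumptions for illustration; any key that spreads the records fairly evenly should do, and the alias names are made up:

```pig
A = LOAD 'phones.tsv' USING PigStorage() AS (
    fname:chararray,
    lname:chararray,
    phone:chararray
);

-- Stage 1: group by a key so the counting is spread over several reducers.
-- Assumption: lname distributes records reasonably evenly; 20 is arbitrary.
B = GROUP A BY lname PARALLEL 20;

-- One partial count per group. Note that COUNT skips tuples whose
-- first field is null.
C = FOREACH B GENERATE COUNT(A) AS partial_count;

-- Stage 2: GROUP ALL now only sees one small row per key, so the single
-- reducer just has to add up the partial counts.
D = GROUP C ALL;

E = FOREACH D GENERATE
    'phone_records' AS description,
    SUM(C.partial_count) AS total_count
;

DUMP E;
```

The final GROUP ALL still runs on a single reducer, but it should only have to sum one number per distinct key rather than handle every record.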

Sent from my iPhone

On Dec 7, 2011, at 12:36 PM, Austin Stickney <[email protected]> wrote:

> I am trying to find the most efficient way to count the total number of
> records in a relation. The simplest way would be to do a GROUP ALL, and then
> do a COUNT, but doing a GROUP ALL seems to always use just one reducer, which
> can take a really long time when counting several hundred million records.
> Here is an example:
>
> A = LOAD 'phones.tsv' USING PigStorage AS (
>     fname:chararray,
>     lname:chararray,
>     phone:chararray
> );
>
> B = GROUP A ALL;
>
> C = FOREACH B GENERATE
>     'phone_records' AS description,
>     COUNT(A) AS count
> ;
>
> Is there a quicker way to get the total count without having to group all?
> Would it be any quicker to do it the following way, since it avoids a GROUP
> ALL, or would the grouping job still be limited to one reducer because it is
> essentially doing the same thing?
>
>
> D = FOREACH A GENERATE 1 AS group_key;
>
> E = GROUP D BY group_key;
>
> F = FOREACH E GENERATE
>     'phone_records' AS description,
>     COUNT(D) AS count
> ;
>
> Thanks,
> Austin
