I am trying to find the most efficient way to count the total number of records in a relation. The simplest way would be to do a GROUP ALL, and then do a COUNT, but doing a GROUP ALL seems to always use just one reducer, which can take a really long time when counting several hundred million records. Here is an example:

    A = LOAD 'phones.tsv' USING PigStorage AS ( fname:chararray, lname:chararray, phone:chararray );
    B = GROUP A ALL;
    C = FOREACH B GENERATE 'phone_records' AS description, COUNT(A) AS count;

Is there a quicker way to get the total count without having to group all? Would it be any quicker to do it the following way, since it avoids a GROUP ALL, or would the grouping job still be limited to one reducer because it is essentially doing the same thing?

    D = FOREACH A GENERATE 1 AS group_key;
    E = GROUP D BY group_key;
    F = FOREACH E GENERATE 'phone_records' AS description, COUNT(D) AS count;

Thanks,
Austin
