Hi,

Let's say you have a file with the columns userid, username, location, amount.

To count the total number of users: 
A = LOAD 'myfile' as (userid:long, username:chararray, location:chararray,
amount:long);
G = GROUP A ALL PARALLEL 40;
R = FOREACH G GENERATE COUNT($1);

dump R;

To count the number of users by location:

A = LOAD 'myfile' as (userid:long, username:chararray, location:chararray,
amount:long);
G = GROUP A BY location PARALLEL 40;
R = FOREACH G GENERATE FLATTEN(group), COUNT($1);

dump R;

To get the record count and the sum of amount per (location, userid):

A = LOAD 'myfile' as (userid:long, username:chararray, location:chararray,
amount:long);
G = GROUP A BY (location, userid) PARALLEL 40;
R = FOREACH G GENERATE FLATTEN(group), COUNT($1) as usercount,
SUM($1.amount) as useramount;

dump R;


NOTE: PARALLEL is set to 40 only as an example. Choose the value yourself;
the right number depends on your cluster setup, data size, etc.
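
As a rough sketch of the PARALLEL note above: the reducer count can be given
per statement, as in the examples, or once for the whole script in Pig
releases that support the default_parallel property. Whether your release
supports it is an assumption to check, and the value 10 is an arbitrary
placeholder, not a recommendation.

```pig
-- Assumption: this Pig release supports the default_parallel property.
-- 10 is a placeholder; pick a value that suits your cluster and data.
SET default_parallel 10;

A = LOAD 'myfile' AS (userid:long, username:chararray, location:chararray,
                      amount:long);

-- No PARALLEL clause here, so the script-wide default applies.
G = GROUP A BY location;
R = FOREACH G GENERATE group, COUNT(A);

DUMP R;
```

A PARALLEL clause on an individual statement still takes precedence over the
script-wide default for that statement.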

To count, the pattern is always the same: GROUP, either ALL or BY <column name>,
then FOREACH ... GENERATE COUNT($1), where $1 is the bag of grouped records.
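
A minimal sketch of that pattern, using DESCRIBE to show where $1 comes from.
It reuses the same hypothetical 'myfile' and schema as above; the output
aliases (location, usercount) are just illustrative names.

```pig
A = LOAD 'myfile' AS (userid:long, username:chararray, location:chararray,
                      amount:long);
G = GROUP A BY location;

-- DESCRIBE prints the schema of G. The first field ($0) is named "group"
-- and holds the grouping key; the second field ($1) is a bag named A that
-- holds all records sharing that key.
DESCRIBE G;

-- Because the bag is named after the grouped relation, COUNT(A) refers to
-- the same field as COUNT($1).
R = FOREACH G GENERATE group AS location, COUNT(A) AS usercount;

DUMP R;
```

One caveat: COUNT skips records whose first field is null. If that matters
for your data, newer Pig releases provide COUNT_STAR, which includes them.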

Hope this helps,


-----Original Message-----
From: Anze [mailto:[email protected]] 
Sent: Friday, October 29, 2010 12:01 PM
To: [email protected]
Subject: relations count

Hi!

I hope this is not too newbie question, but it's driving me crazy... How do 
you count the records in a relation? Like DUMP, but instead of list of 
records, I would like their count.

Thanks,

Anze
