Hi,

Let's say you have a file with the columns userid, username, location, and amount.

To count the total number of users:

    A = LOAD 'myfile' as (userid:long, username:chararray,
                          location:chararray, amount:long);
    G = GROUP A ALL PARALLEL 40;
    R = FOREACH G GENERATE COUNT($1);
    dump R;

To count the number of users by location:

    A = LOAD 'myfile' as (userid:long, username:chararray,
                          location:chararray, amount:long);
    G = GROUP A BY location PARALLEL 40;
    R = FOREACH G GENERATE FLATTEN(group), COUNT($1);
    dump R;

To get the record count and the sum of amount per (location, userid):

    A = LOAD 'myfile' as (userid:long, username:chararray,
                          location:chararray, amount:long);
    G = GROUP A BY (location, userid) PARALLEL 40;
    R = FOREACH G GENERATE FLATTEN(group), COUNT($1) as usercount,
                           SUM($1.amount) as useramount;
    dump R;

NOTE: PARALLEL is set to 40 only as an example. You should set it yourself;
the right value depends on your cluster setup, data, etc.

To count, it's always GROUP (either ALL or BY <column name>), then FOREACH
and GENERATE COUNT($1). Here $1 is the second field of the grouped relation,
i.e. the bag holding the grouped records of A.

Hope this helps,

-----Original Message-----
From: Anze [mailto:[email protected]]
Sent: Friday, October 29, 2010 12:01 PM
To: [email protected]
Subject: relations count

Hi!

I hope this is not too newbie a question, but it's driving me crazy...

How do you count the records in a relation? Like DUMP, but instead of a
list of records, I would like their count.

Thanks,

Anze
