Hi,

You can do the following:

groupedByDept = group deptInfo by dept;
groupedByAll = group groupedByDept all;
uniqcnt = foreach groupedByAll generate COUNT(groupedByDept);

The "group groupedByDept all" turns every row of "groupedByDept" into a
bag. Since "groupedByDept" has one row per department, counting the elements
in this bag will return the number of unique departments.

Please be aware that group all may take a long time if your data is big,
since it forces every mapper to send its output to a single reducer.

Thanks,
Cheolsoo

On Tue, Sep 25, 2012 at 6:55 PM, Hadoop Learner <[email protected]> wrote:

> Hello,
>
> Need help with finding the distinct count. Would appreciate if you
> could please help.
>
> Here's my data file:
>
> id , dept, budget
>
> 1, Marketing, 9000
> 2, Marketing, 1000
> 3, Finance, 9000
> 4, Sales, 2000
>
>
> I am trying to get the unique count of the departments in the company
> so I expect 3 - since there are 3 departments.
>
> Here's my PIG program:
>
>
> deptInfo = load 'dept.txt' using PigStorage(',') as (id, dept, budget );
>
> -- get a distinct count of departments
>
> groupedByDept = group deptInfo by dept;
>
> uniqcnt = foreach groupedByDept {
>     dept = deptInfo.dept;
>     uniq_dept = distinct dept ;
>     generate group, COUNT(uniq_dept);
>
> }
>
> dump uniqcnt;
>
>
> What this gives me is this:
>
> ( Sales,1)
> ( Finance,1)
> ( Marketing,1)
>
>
> What I want is : 3.
>
> How could I get just the raw count of departments instead of a listing
> of each department.
>
> Thanks!
>
