Re: Distinct Count

Cheolsoo Park Tue, 25 Sep 2012 19:20:16 -0700

Hi,

You can do the following:


groupedByDept = group deptInfo by dept;
groupedByAll = group groupedByDept all;
uniqcnt = foreach groupedByAll generate COUNT(groupedByDept);

The "group groupedByDept all" statement collects every row of
"groupedByDept" into a single bag. Since "groupedByDept" has one row per
department, counting the elements in this bag returns the number of unique
departments.
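For reference, here is a sketch of the whole script put together, assuming the same 'dept.txt' file and schema as in the original question. The comments describe the shape I would expect each relation to have.

```pig
-- Load the comma-delimited sample file from the question.
deptInfo = load 'dept.txt' using PigStorage(',') as (id, dept, budget);

-- One row per department: (group, {bag of that department's rows}).
groupedByDept = group deptInfo by dept;

-- A single row: (all, {bag holding every row of groupedByDept}).
groupedByAll = group groupedByDept all;

-- Counting the elements of that bag gives the number of departments.
uniqcnt = foreach groupedByAll generate COUNT(groupedByDept);

dump uniqcnt;
```

With the four sample rows in the question, which fall into three groups, this should dump a single tuple containing 3.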

Please be aware that "group all" may take a long time if your data is big,
since it forces every mapper to send its output to a single reducer.
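If that becomes a problem, one variation that may help is to project only the dept column and run distinct on it before the "group all", so that the single reducer receives just the unique department values rather than whole rows. This is only a sketch: the alias names depts, uniqDepts and groupedUniq are made up for illustration, and I have not measured how much it saves on real data.

```pig
deptInfo = load 'dept.txt' using PigStorage(',') as (id, dept, budget);

-- Keep only the column we need to count.
depts = foreach deptInfo generate dept;

-- Remove duplicate department values.
uniqDepts = distinct depts;

-- Collect the remaining (small) relation into one bag and count it.
groupedUniq = group uniqDepts all;
uniqcnt = foreach groupedUniq generate COUNT(uniqDepts);

dump uniqcnt;
```

Note that COUNT skips tuples whose first field is null, so rows with a missing dept would not be counted here.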

Thanks,
Cheolsoo

On Tue, Sep 25, 2012 at 6:55 PM, Hadoop Learner <[email protected]> wrote:

> Hello,
>
> Need help with finding the distinct count. Would appreciate if you
> could please help.
>
> Here's my data file:
>
> id , dept, budget
>
> 1, Marketing, 9000
> 2, Marketing, 1000
> 3, Finance, 9000
> 4, Sales, 2000
>
>
> I am trying to get the unique count of the departments in the company
> so I expect 3 - since there are 3 departments.
>
> Here's my PIG program:
>
>
> deptInfo = load 'dept.txt'  using PigStorage(',') as (id, dept, budget );
>
> -- get a distinct count of departments
>
> groupedByDept = group  deptInfo by dept;
>
> uniqcnt  = foreach groupedByDept  {
>            dept      = deptInfo.dept;
>            uniq_dept  = distinct dept ;
>            generate group, COUNT(uniq_dept);
>
>            }
>
> dump uniqcnt;
>
>
> What this gives me is this:
>
> ( Sales,1)
> ( Finance,1)
> ( Marketing,1)
>
>
> What I want is : 3.
>
> How could I get just the raw count of departments instead of a listing
> of each department?
>
> Thanks!
>
