Kim,
This is something Pig handles exceedingly well:
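-- The default loader expects tab-delimited fields; for comma-separated
-- input, add USING PigStorage(',') to the LOAD.
A = LOAD 'data' AS (item:chararray, user:chararray);
B = GROUP A BY item;
C = FOREACH B {
    -- De-duplicate the users within each item group.
    distinct_users = DISTINCT A.user;
    GENERATE
        group AS item,
        COUNT(distinct_users) AS num_distinct_users
    ;
};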
should work. Haven't tested it though.
--jacob
@thedatachef
On Fri, 2011-05-06 at 11:08 -0700, Kim Vogt wrote:
> Hi,
>
> I'm stuck on a query for counting distinct users. Say I have data that looks
> like this:
>
> book, user1
> book, user2
> book, user1
> movie, user1
> movie, user2
> movie, user3
> music, user4
>
> I want to group by the first column and count the number of distinct users
> for that product. The result would just be:
>
> book, 2
> movie, 3
> music, 1
>
> Is this piggable?
>
> Happy Friday!
>
> -Kim