Sorry, that's what I get for trying to do things quickly :)
A = LOAD 'foo.tsv' AS (item:chararray, user:chararray);
B = GROUP A BY item;
C = FOREACH B {
    distinct_users = DISTINCT A.user;
    GENERATE
        group AS item,
        COUNT(distinct_users) AS num_distinct_users
    ;
};
And I just tested it in local mode with Pig 0.8, works great.
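
For anyone who wants to check this against Kim's sample rows, here is a minimal self-contained sketch. The file name 'sample.csv' and the comma delimiter are assumptions for illustration (the script above reads tab-separated 'foo.tsv'), and the expected tuples in the comments are worked by hand from her example, not pasted from a run.

```pig
-- Hypothetical input file: Kim's sample rows, comma-separated with no
-- space after the comma (e.g. "book,user1").
A = LOAD 'sample.csv' USING PigStorage(',') AS (item:chararray, user:chararray);

-- One group per item; each group carries a bag of its (item, user) rows.
B = GROUP A BY item;

-- Deduplicate the users inside each group, then count what is left.
C = FOREACH B {
    distinct_users = DISTINCT A.user;
    GENERATE
        group AS item,
        COUNT(distinct_users) AS num_distinct_users
    ;
};

DESCRIBE C;  -- prints the schema of C
DUMP C;      -- runs the job and prints the result tuples

-- Worked by hand from the sample rows (output order is not guaranteed):
-- (book,2)
-- (movie,3)
-- (music,1)
```

If the file keeps the space after the comma, as in the sample as typed, each user value would carry a leading space; the distinct counts should still come out the same as long as the spacing is consistent across rows.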
--jacob
@thedatachef
On Fri, 2011-05-06 at 11:30 -0700, Kim Vogt wrote:
> I think you're missing a SUM and/or COUNT and that's the part I'm stuck on.
>
> -Kim
>
> On Fri, May 6, 2011 at 11:24 AM, jacob <[email protected]> wrote:
>
> > Kim,
> >
> > This is something Pig addresses exceedingly well:
> >
> > A = LOAD 'data' AS (item:chararray, user:chararray);
> > B = GROUP A BY item;
> > C = FOREACH B {
> >     distinct_users = DISTINCT A.user;
> >     GENERATE
> >         group AS item,
> >         distinct_users AS distinct_users
> >     ;
> > };
> >
> > should work. Haven't tested it though.
> >
> > --jacob
> > @thedatachef
> >
> >
> > On Fri, 2011-05-06 at 11:08 -0700, Kim Vogt wrote:
> > > Hi,
> > >
> > > I'm stuck on a query for counting distinct users. Say I have data that looks
> > > like this:
> > >
> > > book, user1
> > > book, user2
> > > book, user1
> > > movie, user1
> > > movie, user2
> > > movie, user3
> > > music, user4
> > >
> > > I want to group by the first column and count the number of distinct users
> > > for that product. The result would just be:
> > >
> > > book, 2
> > > movie, 3
> > > music, 1
> > >
> > > Is this piggable?
> > >
> > > Happy Friday!
> > >
> > > -Kim