You can group on a group, like this:

A = LOAD '/some/dir' USING PigStorage() AS (date, directive);
B = GROUP A BY (date, directive);
C = FOREACH B GENERATE FLATTEN(group) AS (date, directive), COUNT(A) AS cnt;
D = GROUP C BY date;
E = FOREACH D GENERATE group AS date, C.(directive, cnt) AS cnts;

Shawn

On Fri, May 6, 2011 at 3:14 PM, Christian <engr...@gmail.com> wrote:
> I am sorry if this has been asked in the past. I can't seem to find
> information on it.
>
> I have two questions, but they are somewhat related.
>
> #1) Let's say you are tracking messages and extracting the hash tags from
> the message and storing them as one field (#hash1#hash2#hash3). This means
> you might have a line that looks something like the following:
> 2343 2011-05-06T03:04:00.000Z username
> some+message+goes+here#with+#hash+#tags #with#hash#tags some other
> info
>
> How can I get the # of tweets per hash tag? Also, how can I get the # of
> tweets per user per hash tag?
> I know I can use the STRSPLIT function to split on '#'. That will give me a
> bag of hash tags. How can I then group by these such that each hash tag has
> a set of tweets?
>
>
> #2) Let's say you have a field that has a fairly small, but still unknown
> number of unique values (say between 20-5).
> I know I can group by these
> fields to get a count by doing something like so:
>
> A = LOAD '/some/dir' Using PigStorage (date, directive);
>
> B = GROUP A by (date, directive);
>
> C = FOREACH B GENERATE FLATTEN(group), COUNT(A.date);
>
> But now I want to end up with something like the following:
>
> 2011-05-01 DIRECTIVE1 32423 DIRECTIVE2 3433 DIRECTIVE3
> 1983
>
> If I knew the directives ahead of time, I know I can do something like the
> following:
>
> D = GROUP C BY date;
>
> E = FOREACH D {
>   DIRECTIVE1 = FILTER type_count by directive == 'DIRECTIVE1';
>   DIRECTIVE2 = FILTER type_count by directive == 'DIRECTIVE2';
>   DIRECTIVE3 = FILTER type_count by directive == 'DIRECTIVE3';
>   GENERATE group, 'DIRECTIVE1', COUNT(DIRECTIVE1.date), 'DIRECTIVE2',
>     COUNT(DIRECTIVE2.date), 'DIRECTIVE3', COUNT(DIRECTIVE3.date);
> }
>
> But how do I do this w/o having to hardcode the filters? Am I thinking about
> this all wrong?
>
> Thanks very much for your help,
> Christian
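
For anyone who wants to check the shape of the result without a cluster, the group-on-group idea from the reply can be mirrored in plain Python. This is only a rough sketch, not Pig: the sample rows and the function name counts_by_date are made up for illustration, and the list of (directive, cnt) pairs stands in for the bag that the second grouping would produce.

```python
from collections import Counter, defaultdict

# Hypothetical sample rows standing in for relation A: (date, directive).
rows = [
    ("2011-05-01", "DIRECTIVE1"),
    ("2011-05-01", "DIRECTIVE1"),
    ("2011-05-01", "DIRECTIVE2"),
    ("2011-05-02", "DIRECTIVE3"),
]


def counts_by_date(rows):
    """Return {date: [(directive, cnt), ...]} without hardcoding directives."""
    # First grouping: count rows per (date, directive) pair.
    pair_counts = Counter(rows)
    # Second grouping: collect the (directive, cnt) pairs under each date.
    by_date = defaultdict(list)
    for (date, directive), cnt in sorted(pair_counts.items()):
        by_date[date].append((directive, cnt))
    return dict(by_date)


result = counts_by_date(rows)
for date, cnts in result.items():
    print(date, cnts)
```

The point is the same as in the Pig version: no directive name appears in the code, so new directive values show up in the per-date list automatically.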