I am sorry if this has been asked in the past. I can't seem to find information on it.
I have two questions, but they are somewhat related.

#1) Let's say you are tracking messages, extracting the hash tags from each message, and storing them as one field (#hash1#hash2#hash3). This means you might have a line that looks something like the following:

    2343 2011-05-06T03:04:00.000Z username some+message+goes+here#with+#hash+#tags #with#hash#tags some other info

How can I get the number of tweets per hash tag? Also, how can I get the number of tweets per user per hash tag? I know I can use the STRSPLIT function to split on '#'. That will give me a bag of hash tags. How can I then group by these such that each hash tag has a set of tweets?

#2) Let's say you have a field that has a fairly small, but still unknown, number of unique values (say between 5 and 20). I know I can group by these fields to get a count by doing something like so:

    A = LOAD '/some/dir' USING PigStorage() AS (date, directive);
    B = GROUP A BY (date, directive);
    -- one row per (date, directive) with the number of records in that group
    C = FOREACH B GENERATE FLATTEN(group) AS (date, directive), COUNT(A.date) AS cnt;

But now I want to end up with something like the following:

    2011-05-01 DIRECTIVE1 32423 DIRECTIVE2 3433 DIRECTIVE3 1983

If I knew the directives ahead of time, I know I can do something like the following:

    D = GROUP C BY date;
    E = FOREACH D {
        -- C already holds one pre-counted row per (date, directive)
        DIRECTIVE1 = FILTER C BY directive == 'DIRECTIVE1';
        DIRECTIVE2 = FILTER C BY directive == 'DIRECTIVE2';
        DIRECTIVE3 = FILTER C BY directive == 'DIRECTIVE3';
        -- SUM the stored counts; COUNT here would only count rows of C
        GENERATE group, 'DIRECTIVE1', SUM(DIRECTIVE1.cnt),
                        'DIRECTIVE2', SUM(DIRECTIVE2.cnt),
                        'DIRECTIVE3', SUM(DIRECTIVE3.cnt);
    }

But how do I do this without having to hardcode the filters? Am I thinking about this all wrong?

Thanks very much for your help,
Christian
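
P.S. To make both questions concrete, here is a minimal, untested sketch of one possible direction. It rests on several assumptions: the input is tab-delimited, the field names (id, ts, username, message, hashtags, other) and the path '/some/tweets' are made up for illustration, and the Pig version in use supports the optional delimiter argument to TOKENIZE. As far as I can tell, STRSPLIT returns a tuple rather than a bag, so the sketch uses TOKENIZE with FLATTEN to get one row per hash tag. For #2 it keeps the per-directive counts as a bag per date instead of as separate columns, which may or may not be acceptable.

```pig
-- Assumed schema and delimiter; adjust to the real data.
tweets = LOAD '/some/tweets' USING PigStorage('\t')
         AS (id:chararray, ts:chararray, username:chararray,
             message:chararray, hashtags:chararray, other:chararray);

-- #1: TOKENIZE with '#' as the delimiter yields a bag of tags;
-- FLATTEN emits one (username, tag) row per tag in the tweet.
tagged = FOREACH tweets GENERATE username,
                                 FLATTEN(TOKENIZE(hashtags, '#')) AS tag;

-- Number of tweets per hash tag.
by_tag     = GROUP tagged BY tag;
tag_counts = FOREACH by_tag GENERATE group AS tag, COUNT(tagged) AS num_tweets;

-- Number of tweets per user per hash tag.
by_user_tag     = GROUP tagged BY (username, tag);
user_tag_counts = FOREACH by_user_tag
                  GENERATE FLATTEN(group) AS (username, tag),
                           COUNT(tagged) AS num_tweets;

-- #2: count per (date, directive), then regroup by date only.
events      = LOAD '/some/dir' USING PigStorage('\t')
              AS (date:chararray, directive:chararray);
by_date_dir = GROUP events BY (date, directive);
counts      = FOREACH by_date_dir
              GENERATE FLATTEN(group) AS (date, directive),
                       COUNT(events) AS cnt;

-- Each date carries a bag of (directive, cnt) pairs, so no directive
-- name has to be hardcoded in a FILTER.
by_date  = GROUP counts BY date;
per_date = FOREACH by_date
           GENERATE group AS date,
                    counts.(directive, cnt) AS directive_counts;
```

Note that a tweet repeating the same hash tag would be counted once per occurrence in this sketch; a DISTINCT on (id, tag) before grouping would be one way to avoid that, if it matters.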