> > #1) Let's say you are tracking messages and extracting the hash tags from
> > the message and storing them as one field (#hash1#hash2#hash3). This means
> > you might have a line that looks something like the following:
> >
> > 2343 2011-05-06T03:04:00.000Z username some+message+goes+here#with+#hash+#tags #with#hash#tags some other info
> >
> > How can I get the # of tweets per hash tag? Also, how can I get the # of
> > tweets per user per hash tag?
> > I know I can use the STRSPLIT function to split on '#'. That will give me a
> > bag of hash tags. How can I then group by these such that each hash tag has
> > a set of tweets?
>
> You will need to 'FLATTEN' the bag of hashtags then do a 'GROUP BY' on
> the hashtag itself.
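
For reference, the FLATTEN-then-GROUP advice quoted above could be sketched roughly as follows. This is only a sketch under assumptions: the file name, the tab-separated layout, and the field names are made up for illustration, and it assumes a Pig version whose TOKENIZE accepts a delimiter argument. TOKENIZE is used here instead of STRSPLIT because, as far as I know, STRSPLIT returns a tuple rather than a bag, and FLATTEN on a bag produces one row per element (not extra columns).

```pig
-- Hypothetical input path and schema; the thread does not give either.
tweets = LOAD 'tweets.tsv' USING PigStorage('\t')
         AS (id:long, ts:chararray, user:chararray,
             message:chararray, hashtags:chararray);

-- TOKENIZE yields a bag of hashtags; FLATTEN turns that bag into
-- one output row per hashtag, always with the same three fields.
tags = FOREACH tweets GENERATE
           id, user, FLATTEN(TOKENIZE(hashtags, '#')) AS hashtag;

-- Number of tweets per hashtag.
by_tag     = GROUP tags BY hashtag;
tag_counts = FOREACH by_tag GENERATE
                 group AS hashtag, COUNT(tags) AS num_tweets;

-- Number of tweets per user per hashtag.
by_user_tag     = GROUP tags BY (user, hashtag);
user_tag_counts = FOREACH by_user_tag GENERATE
                      FLATTEN(group) AS (user, hashtag),
                      COUNT(tags) AS num_tweets;
```

Because every row of `tags` carries exactly one hashtag, there is only one field to group by, no matter how many hashtags a message had.
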
If each message has an unknown number of hashtags, will a 'FLATTEN' give me
an unknown # of fields? If so, how do I know which field to group by? I don't
want to group messages that have exactly the same hash tags. I want all
messages that have one of the hash tags.

> > But now I want to end up with something like the following:
> >
> > 2011-05-01 DIRECTIVE1 32423 DIRECTIVE2 3433 DIRECTIVE3 1983
> >
> > If I knew the directives ahead of time, I know I can do something like the
> > following:
> >
> > D = GROUP C BY date;
> >
> > E = FOREACH D {
> >     DIRECTIVE1 = FILTER type_count by directive == 'DIRECTIVE1';
> >     DIRECTIVE2 = FILTER type_count by directive == 'DIRECTIVE2';
> >     DIRECTIVE3 = FILTER type_count by directive == 'DIRECTIVE3';
> >     GENERATE group, 'DIRECTIVE1', COUNT(DIRECTIVE1.date), 'DIRECTIVE2',
> >         COUNT(DIRECTIVE2.date), 'DIRECTIVE3', COUNT(DIRECTIVE3.date);
> > }
> >
> > But how do I do this w/o having to hardcode the filters? Am I thinking about
> > this all wrong?
>
> It's really a matter of how you structure your data ahead of time.
> Imagine the data looking like this instead (call it X):
>
> 201101,directive1
> 201101,directive1
> 201101,directive2
> 201101,directive2
> 201101,directive2
> 201101,directive3
> 201102,directive2
> 201102,directive4
> 201103,directive1

This is how my data looks (row and column wise).

> then, a simple:
>
> Y = GROUP X BY (date,directive);
> Z = FOREACH Y GENERATE FLATTEN(group) AS (date,directive), COUNT(X) AS num_occurrences;
>
> would result in:
>
> 201101,directive1,2
> 201101,directive2,3
> 201101,directive3,1
> 201102,directive2,1
> 201102,directive4,1
> 201103,directive1,1
>
> At least, that's what it _seems_ like you're asking for.

I've gotten that far. What I'm actually asking for is a way to put those
counts into columns rather than rows.

> --jacob
> @thedatachef

Thanks Jacob!
-Christian

> > Thanks very much for your help,
> > Christian
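
On the columns-versus-rows question, one possible approach (a sketch, not something proposed in the thread) is to regroup the counted rows by date and carry the per-directive counts along as a bag, so that each date ends up on a single row without any directive being hardcoded. The input path and the comma-separated layout below are assumptions; X, Y and Z follow the quoted reply.

```pig
-- Hypothetical input path; the layout matches the sample data for X above.
X = LOAD 'directives.csv' USING PigStorage(',')
    AS (date:chararray, directive:chararray);

-- Same counting step as in the quoted reply.
Y = GROUP X BY (date, directive);
Z = FOREACH Y GENERATE
        FLATTEN(group) AS (date, directive),
        COUNT(X) AS num_occurrences;

-- Regroup by date alone: one output row per date.
by_date = GROUP Z BY date;

-- Keep the counts as a bag of (directive, count) tuples instead of
-- fixed columns, so the set of directives can vary from date to date.
per_date = FOREACH by_date GENERATE
               group AS date,
               Z.(directive, num_occurrences) AS directive_counts;

DUMP per_date;
```

For the sample data, the row for 201101 would be shaped roughly like `(201101,{(directive1,2),(directive2,3),(directive3,1)})`, with no guaranteed order inside the bag. True fixed columns need the directive names to be known when the script is written (as in the FILTER version), since a Pig relation has a single schema for all rows; a bag sidesteps that at the cost of a nested field.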