On Fri, 2011-05-06 at 16:06 -0600, Christian wrote: > Thank you for taking the time to explain this to me Jacob! > > Am I stuck with hard-coding for my other question? > > Instead of: > 2011-05-01 DIRECTIVE1 32423 DIRECTIVE2 3433 DIRECTIVE3 > 1983 > -- > 2011-05-01 32423 3433 1983 > > would also do as long as I could count on the column order. > The best you can do here is to do a GROUP BY on the date (like Xiaomeng Wan indicated) after you get the counts. Then, I think theres a UDF in contrib that allows you to go from a bag to a tuple.
Best of luck, --jacob @thedatachef > > Thanks, > Christian > On Fri, May 6, 2011 at 3:59 PM, jacob <jacob.a.perk...@gmail.com> wrote: > > > On Fri, 2011-05-06 at 15:38 -0600, Christian wrote: > > > > > > > > > #1) Let's say you are tracking messages and extracting the hash tags > > from > > > > > the message and storing them as one field (#hash1#hash2#hash3). This > > > > means > > > > > you might have a line that looks something like the following: > > > > > 2343 2011-05-06T03:04:00.000Z username > > > > > some+message+goes+here#with+#hash+#tags #with#hash#tags some > > > > other > > > > > info > > > > > > > > > > How can I get the # of tweets per hash tag? Also, how can I get the # > > of > > > > > tweets per user per hash tag? > > > > > I know I can use the STRSPLIT function to split on '#'. That will > > give me > > > > a > > > > > bag of hash tags. How can I then group by these such that each hash > > tag > > > > has > > > > > a set of tweets? > > > > You will need to 'FLATTEN' the bag of hashtags then do a 'GROUP BY' on > > > > the hashtag itself. > > > > > > > > > > If each message has an unknown number of hashtags, will a 'FLATTEN' given > > me > > > an unknown # of fields? If so, how do I know which field to group by? I > > > don't want to group by messages that have the exact hash tags. I want all > > > messages that have one of the hash tags. > > > > Oh, that's right, STRSPLIT (rather uselessly) yields a nested tuple and > > NOT a bag. If you could get a bag then you could do the following (I'm > > throwing out some fields for now): > > > > A = LOAD 'tweets_and_meta' AS (text:chararray, hashtags:chararray); > > B = FOREACH A GENERATE text, FLATTEN(MySplittingUDF(hashtags)) AS > > hashtag; > > C = GROUP B BY hastag; > > > > Then C will contain a key (the hashtag) and a bag containing all the > > tweets with that hashtag. You'll have to write 'MySplittingUDF' yourself > > to do the same as STRSPLIT but that returns a bag instead. > > > > ie. > > > > #foobar tweet text,#foobar > > this tweet has #two #hashtags,#two#hashtags > > another #foobar tweet,#foobar > > > > will yield: > > > > #foobar, {(#foobar tweet text, #foobar),(another #foobar tweet, > > #foobar)} > > #two, {(this tweet has #two #hashtags, #two)} > > #hashtags, {(this tweet has #two #hashtags, #hashtags)} > > > > > > > > > > > > > > > But now I want to end up something like the following: > > > > > > > > > > > > > 2011-05-01 DIRECTIVE1 32423 DIRECTIVE2 3433 DIRECTIVE3 > > > > > 1983 > > > > > > > > > > If I knew the directives ahead of time, I know I can do something > > like > > > > the > > > > > following: > > > > > > > > > > D = GROUP C BY date; > > > > > > > > > > E = FOREACH D { > > > > > DIRECTIVE1 = FILTER type_count by directive == 'DIRECTIVE1'; > > > > > DIRECTIVE2 = FILTER type_count by directive == 'DIRECTIVE2'; > > > > > DIRECTIVE3 = FILTER type_count by directive == 'DIRECTIVE3'; > > > > > GENERATE group, 'DIRECTIVE1', COUNT(DIRECTIVE1.date), > > > > 'DIRECTIVE2', > > > > > COUNT(DIRECTIVE2.date), 'DIRECTIVE3', COUNT(DIRECTIVE3.date); > > > > > } > > > > > > > > > > But how do I do this w/o having to hardcode the filters? Am I > > thinking > > > > about > > > > > this all wrong? > > > > > > > > > It's really a matter of how you structure your data ahead of time. > > > > Imagine the data looking like this instead (call it X): > > > > > > > > 201101,directive1 > > > > 201101,directive1 > > > > 201101,directive2 > > > > 201101,directive2 > > > > 201101,directive2 > > > > 201101,directive3 > > > > 201102,directive2 > > > > 201102,directive4 > > > > 201103,directive1 > > > > > > > > This is how my data looks (row and column wise) > > > > > > > > > > > then, a simple: > > > > > > > > Y = GROUP X BY (date,directive); > > > > Z = FOREACH Y GENERATE FLATTEN(group) AS (date,directive), COUNT(X) AS > > > > num_occurrences; > > > > > > > > would result in: > > > > > > > > 201101,directive1,2 > > > > 201101,directive2,3 > > > > 201101,directive3,1 > > > > 201102,directive2,1 > > > > 201102,directive4,1 > > > > 201103,directive1,1 > > > > > > > > At least, that's what it _seems_ like you're asking for. > > > > > > > > I've gotten that far. I'm actually asking for the being able to put > > those > > > into columns and not rows. > > > > > > > > > > > --jacob > > > > @thedatachef > > > > > > > > Thanks Jacob! > > > > > > -Christian > > > > > > > > > > > > Thanks very much for you help, > > > > > Christian > > > > > > > > > > > > > > > > > >