Working with an unknown number of values

Christian Fri, 06 May 2011 14:14:55 -0700

I am sorry if this has been asked in the past. I can't seem to find
information on it.


I have two questions, but they are somewhat related.

#1) Let's say you are tracking messages and extracting the hash tags from
the message and storing them as one field (#hash1#hash2#hash3). This means
you might have a line that looks something like the following:
      2343    2011-05-06T03:04:00.000Z    username    some+message+goes+here#with+#hash+#tags    #with#hash#tags    some    other    info

How can I get the # of tweets per hash tag? Also, how can I get the # of
tweets per user per hash tag?
I know I can use the STRSPLIT function to split on '#'. That will give me a
bag of hash tags. How can I then group by these such that each hash tag has
a set of tweets?
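
One possible approach, sketched under assumptions rather than offered as a definitive answer: STRSPLIT returns a tuple, not a bag, so flattening its result yields extra columns instead of extra rows. Replacing '#' with a space and passing the result to TOKENIZE (which does return a bag) gives one row per message/tag pair, and those rows can then be grouped. The relation names, field names, and the tab delimiter below are illustrative assumptions, not taken from the original data.

```pig
-- Assumed layout: tab-separated; only the first five fields are named here.
msgs = LOAD '/some/dir' USING PigStorage('\t')
       AS (id:long, ts:chararray, user:chararray, msg:chararray, tags:chararray);

-- '#with#hash#tags' becomes ' with hash tags'; TOKENIZE splits on whitespace
-- and returns a bag, so FLATTEN produces one row per (message, tag).
per_tag = FOREACH msgs GENERATE id, user,
          FLATTEN(TOKENIZE(REPLACE(tags, '#', ' '))) AS tag;

-- Number of tweets per hash tag.
by_tag     = GROUP per_tag BY tag;
tag_counts = FOREACH by_tag GENERATE group AS tag, COUNT(per_tag) AS n;

-- Number of tweets per user per hash tag.
by_user_tag     = GROUP per_tag BY (user, tag);
user_tag_counts = FOREACH by_user_tag
                  GENERATE FLATTEN(group) AS (user, tag), COUNT(per_tag) AS n;
```

Note that a message repeating the same tag would be counted once per repetition in this sketch, and TOKENIZE may also split on a few other characters such as commas.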


#2) Let's say you have a field that has a fairly small, but still unknown
number of unique values (say between 5 and 20). I know I can group by these
fields to get a count by doing something like so:

A = LOAD '/some/dir' USING PigStorage() AS (date, directive);

B = GROUP A by (date, directive);

C = FOREACH B GENERATE FLATTEN(group), COUNT(A.date);

But now I want to end up with something like the following:

2011-05-01    DIRECTIVE1    32423    DIRECTIVE2    3433    DIRECTIVE3    1983

If I knew the directives ahead of time, I know I can do something like the
following:

-- Group the raw rows (A) by date so that COUNT below counts rows per directive.
D = GROUP A BY date;

E = FOREACH D {
     DIRECTIVE1 = FILTER A BY directive == 'DIRECTIVE1';
     DIRECTIVE2 = FILTER A BY directive == 'DIRECTIVE2';
     DIRECTIVE3 = FILTER A BY directive == 'DIRECTIVE3';
     GENERATE group, 'DIRECTIVE1', COUNT(DIRECTIVE1.date),
              'DIRECTIVE2', COUNT(DIRECTIVE2.date),
              'DIRECTIVE3', COUNT(DIRECTIVE3.date);
}

But how do I do this w/o having to hardcode the filters? Am I thinking about
this all wrong?
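
A minimal sketch of one way to avoid the hardcoded filters, assuming it is acceptable for each date's directives to come back as a bag of (directive, count) pairs rather than as a fixed set of columns. The relation names, the `cnt` field name, and the tab delimiter are illustrative assumptions.

```pig
rows = LOAD '/some/dir' USING PigStorage('\t')
       AS (date:chararray, directive:chararray);

-- One count per (date, directive) pair, whatever directives happen to occur.
grouped = GROUP rows BY (date, directive);
counts  = FOREACH grouped
          GENERATE FLATTEN(group) AS (date, directive), COUNT(rows) AS cnt;

-- Regroup by date; each output row carries a bag of (directive, cnt) tuples.
by_date = GROUP counts BY date;
result  = FOREACH by_date
          GENERATE group AS date, counts.(directive, cnt) AS directive_counts;
```

Because the number of tuples in the bag varies per date, this handles an unknown set of directives without naming any of them. If a strictly flat, tab-separated row is required, that reshaping would likely need a UDF or a post-processing step.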

Thanks very much for your help,
Christian
