Christian,

I've answered inline:

On Fri, 2011-05-06 at 15:14 -0600, Christian wrote:
> I am sorry if this has been asked in the past. I can't seem to find
> information on it.
> 
> I have two questions, but they are somewhat related.
> 
> #1) Let's say you are tracking messages and extracting the hash tags from
> the message and storing them as one field (#hash1#hash2#hash3). This means
> you might have a line that looks something like the following:
>       2343    2011-05-06T03:04:00.000Z    username
> some+message+goes+here#with+#hash+#tags    #with#hash#tags   some    other
>  info
> 
> How can I get the # of tweets per hash tag? Also, how can I get the # of
> tweets per user per hash tag?
> I know I can use the STRSPLIT function to split on '#'. That will give me a
> bag of hash tags. How can I then group by these such that each hash tag has
> a set of tweets?
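You will need to FLATTEN the bag of hashtags, then GROUP BY the hashtag
itself. FLATTEN gives you one row per (tweet, hashtag) pair, so a COUNT
on each group is the number of tweets for that hashtag. For the per-user
numbers, group by (username, hashtag) instead.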
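
A minimal sketch of that, under a few assumptions: the path, relation
names and field names are made up for illustration, and only the first
five fields of your sample line are named. One caveat is that STRSPLIT
returns a tuple rather than a bag, so this sketch uses TOKENIZE with a
delimiter argument, which assumes a Pig version whose TOKENIZE accepts
one.

```pig
-- Hypothetical path and field names, modelled on the sample line.
tweets = LOAD '/some/tweets' USING PigStorage('\t')
         AS (id:long, ts:chararray, username:chararray,
             message:chararray, hashtags:chararray);

-- One row per (tweet, hashtag). Assumes TOKENIZE takes a delimiter
-- as its second argument and returns a bag of tokens.
tagged = FOREACH tweets GENERATE id, username,
         FLATTEN(TOKENIZE(hashtags, '#')) AS hashtag;

-- Number of tweets per hashtag.
by_tag     = GROUP tagged BY hashtag;
tag_counts = FOREACH by_tag GENERATE group AS hashtag,
             COUNT(tagged) AS num_tweets;

-- Number of tweets per user per hashtag.
by_user_tag     = GROUP tagged BY (username, hashtag);
user_tag_counts = FOREACH by_user_tag GENERATE
                  FLATTEN(group) AS (username, hashtag),
                  COUNT(tagged) AS num_tweets;
```

A tweet that repeats the same hashtag would be counted once per
repetition here; a DISTINCT on tagged before grouping would avoid that
if it matters.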

> 
> 
> #2) Let's say you have a field that has a fairly small, but still unknown
> number of unique values (say between 20-5). I know I can group by these
> fields to get a count by doing something like so:
> 
> A = LOAD '/some/dir' Using PigStorage (date, directive);
> 
> B = GROUP A by (date, directive);
> 
> C = FOREACH B GENERATE FLATTEN(group), COUNT(A.date);
> 
>     But now I want to end up something like the following:
> 
> 2011-05-01    DIRECTIVE1    32423    DIRECTIVE2    3433    DIRECTIVE3
>  1983
> 
> If I knew the directives ahead of time, I know I can do something like the
> following:
> 
> D = GROUP C BY date;
> 
> E = FOREACH D {
>      DIRECTIVE1 = FILTER type_count by directive == 'DIRECTIVE1';
>      DIRECTIVE2 = FILTER type_count by directive == 'DIRECTIVE2';
>      DIRECTIVE3 = FILTER type_count by directive == 'DIRECTIVE3';
>         GENERATE group, 'DIRECTIVE1', COUNT(DIRECTIVE1.date), 'DIRECTIVE2',
> COUNT(DIRECTIVE2.date), 'DIRECTIVE3', COUNT(DIRECTIVE3.date);
> }
> 
> But how do I do this w/o having to hardcode the filters? Am I thinking about
> this all wrong?
> 
It's really a matter of how you structure your data ahead of time.
Imagine the data looking like this instead (call it X, with fields date
and directive):

201101,directive1
201101,directive1
201101,directive2
201101,directive2
201101,directive2
201101,directive3
201102,directive2
201102,directive4
201103,directive1


then, a simple:

Y = GROUP X BY (date,directive);
Z = FOREACH Y GENERATE FLATTEN(group) AS (date,directive),
                       COUNT(X) AS num_occurrences;

would result in:

201101,directive1,2
201101,directive2,3
201101,directive3,1
201102,directive2,1
201102,directive4,1
201103,directive1,1

At least, that's what it _seems_ like you're asking for.
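
If you do want a single row per date without hardcoding the directives,
one possible sketch is to group the counts by date and keep them as a
nested bag of (directive, count) pairs. The path and the relation names
by_date and per_date are assumptions for illustration, not anything from
your script.

```pig
-- Hypothetical path; X holds (date, directive) rows as shown above.
X = LOAD '/some/dir' USING PigStorage(',')
    AS (date:chararray, directive:chararray);

-- Count occurrences of each (date, directive) pair.
Y = GROUP X BY (date, directive);
Z = FOREACH Y GENERATE FLATTEN(group) AS (date, directive),
                       COUNT(X) AS num_occurrences;

-- Collapse to one row per date, carrying every directive and its
-- count in a bag, so no directive name has to be known in advance.
by_date  = GROUP Z BY date;
per_date = FOREACH by_date GENERATE group AS date,
           Z.(directive, num_occurrences) AS counts;
```

Each row of per_date is a date followed by a bag of (directive, count)
tuples. The order of tuples inside the bag is not guaranteed.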


--jacob
@thedatachef


> Thanks very much for you help,
> Christian

