Re: Working with an unknown number of values

Christian Fri, 06 May 2011 14:39:28 -0700

>
> > #1) Let's say you are tracking messages and extracting the hash tags from
> > the message and storing them as one field (#hash1#hash2#hash3). This means
> > you might have a line that looks something like the following:
> >       2343    2011-05-06T03:04:00.000Z    username    some+message+goes+here#with+#hash+#tags    #with#hash#tags    some other info
> >
> > How can I get the # of tweets per hash tag? Also, how can I get the # of
> > tweets per user per hash tag?
> > I know I can use the STRSPLIT function to split on '#'. That will give me a
> > bag of hash tags. How can I then group by these such that each hash tag has
> > a set of tweets?
> You will need to 'FLATTEN' the bag of hashtags then do a 'GROUP BY' on
> the hashtag itself.
>


If each message has an unknown number of hashtags, will a 'FLATTEN' give me
an unknown # of fields? If so, how do I know which field to group by? I
don't want to group messages by their exact set of hash tags. I want all
messages that have any one of the hash tags.
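
For what it's worth, flattening a bag adds rows, not fields: each (message, bag of tags) record becomes one record per tag, so there is always exactly one tag column to group on. The Python sketch below only models that dataflow; it is not Pig code. The sample records and the helper names (`flatten_tags`, `count_by`) are made up for illustration, and it assumes the tags field looks like `#with#hash#tags`.

```python
from collections import Counter

# Hypothetical sample records: (message_id, username, tags_field).
messages = [
    (2343, "alice", "#with#hash#tags"),
    (2344, "bob", "#hash"),
    (2345, "alice", "#tags#hash"),
]

def flatten_tags(records):
    """Model FLATTEN on a bag: emit one (id, user, tag) row per hashtag."""
    for message_id, user, tags_field in records:
        # Splitting on '#' leaves an empty first piece; drop empty pieces.
        for tag in filter(None, tags_field.split("#")):
            yield (message_id, user, tag)

def count_by(rows, key):
    """Model GROUP BY key followed by COUNT."""
    return Counter(key(row) for row in rows)

rows = list(flatten_tags(messages))

# Tweets per hash tag: group on the single tag column.
per_tag = count_by(rows, key=lambda r: r[2])

# Tweets per user per hash tag: group on (user, tag).
per_user_tag = count_by(rows, key=lambda r: (r[1], r[2]))
```

Note that a message repeating the same tag would be counted once per repetition in this sketch; deduplicating the tags per message first would avoid that.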


> >     But now I want to end up with something like the following:
>
>
> > 2011-05-01    DIRECTIVE1    32423    DIRECTIVE2    3433    DIRECTIVE3    1983
> >
> > If I knew the directives ahead of time, I know I can do something like the
> > following:
> >
> > D = GROUP C BY date;
> >
> > E = FOREACH D {
> >      DIRECTIVE1 = FILTER type_count by directive == 'DIRECTIVE1';
> >      DIRECTIVE2 = FILTER type_count by directive == 'DIRECTIVE2';
> >      DIRECTIVE3 = FILTER type_count by directive == 'DIRECTIVE3';
> >      GENERATE group, 'DIRECTIVE1', COUNT(DIRECTIVE1.date),
> >               'DIRECTIVE2', COUNT(DIRECTIVE2.date),
> >               'DIRECTIVE3', COUNT(DIRECTIVE3.date);
> > }
> >
> > But how do I do this w/o having to hardcode the filters? Am I thinking about
> > this all wrong?
> >
> It's really a matter of how you structure your data ahead of time.
> Imagine the data looking like this instead (call it X):
>
> 201101,directive1
> 201101,directive1
> 201101,directive2
> 201101,directive2
> 201101,directive2
> 201101,directive3
> 201102,directive2
> 201102,directive4
> 201103,directive1
>
This is how my data looks (row and column wise).

>
> then, a simple:
>
> Y = GROUP X BY (date,directive);
> Z = FOREACH Y GENERATE FLATTEN(group) AS (date,directive),
>                        COUNT(X) AS num_occurrences;
>
> would result in:
>
> 201101,directive1,2
> 201101,directive2,3
> 201101,directive3,1
> 201102,directive2,1
> 201102,directive4,1
> 201103,directive1,1
>
> At least, that's what it _seems_ like you're asking for.
>
I've gotten that far. What I'm actually asking for is a way to put those
counts into columns rather than rows.
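
Putting the counts into columns is a pivot: group the (date, directive, count) rows by date and collect each date's (directive, count) pairs into one record. The Python sketch below models that step on the sample output above; it is not Pig code, and `pivot_by_date` is a hypothetical helper. Because the set of directives is not known ahead of time, each output row simply carries as many directive/count pairs as that date has.

```python
from collections import defaultdict

# (date, directive, num_occurrences) rows, as in the sample output above.
counts = [
    ("201101", "directive1", 2),
    ("201101", "directive2", 3),
    ("201101", "directive3", 1),
    ("201102", "directive2", 1),
    ("201102", "directive4", 1),
    ("201103", "directive1", 1),
]

def pivot_by_date(rows):
    """Collect each date's (directive, count) pairs into one wide row."""
    by_date = defaultdict(list)
    for date, directive, n in rows:
        by_date[date].append((directive, n))
    wide = []
    for date in sorted(by_date):
        row = [date]
        # Sort the pairs so the column order is stable within each date.
        for directive, n in sorted(by_date[date]):
            row.extend([directive, n])
        wide.append(row)
    return wide

wide_rows = pivot_by_date(counts)
for row in wide_rows:
    # One tab-separated line per date: date, directive, count, directive, count, ...
    print("\t".join(str(field) for field in row))
```

In Pig terms the analogous shape would presumably come from a second GROUP on date over the counted rows, which leaves a bag of (directive, count) tuples per date rather than a fixed set of named columns.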

>
> --jacob
> @thedatachef
>
Thanks Jacob!

-Christian

>
> > Thanks very much for your help,
> > Christian
>
>
>
