Re: Working with an unknown number of values

jacob Fri, 06 May 2011 15:14:41 -0700

On Fri, 2011-05-06 at 16:06 -0600, Christian wrote:
> Thank you for taking the time to explain this to me Jacob!
> 
> Am I stuck with hard-coding for my other question?
> 
> Instead of:
> 2011-05-01    DIRECTIVE1    32423    DIRECTIVE2    3433    DIRECTIVE3
>  1983
> --
> 2011-05-01    32423    3433    1983
> 
> would also do as long as I could count on the column order.
> 
The best you can do here is to do a GROUP BY on the date (like Xiaomeng
Wan indicated) after you get the counts. Then, I think theres a UDF in
contrib that allows you to go from a bag to a tuple.


Best of luck,

--jacob
@thedatachef

> 
> Thanks,
> Christian
> On Fri, May 6, 2011 at 3:59 PM, jacob <jacob.a.perk...@gmail.com> wrote:
> 
> > On Fri, 2011-05-06 at 15:38 -0600, Christian wrote:
> > > >
> > > > > #1) Let's say you are tracking messages and extracting the hash tags
> > from
> > > > > the message and storing them as one field (#hash1#hash2#hash3). This
> > > > means
> > > > > you might have a line that looks something like the following:
> > > > >       2343    2011-05-06T03:04:00.000Z    username
> > > > > some+message+goes+here#with+#hash+#tags    #with#hash#tags   some
> > > >  other
> > > > >  info
> > > > >
> > > > > How can I get the # of tweets per hash tag? Also, how can I get the #
> > of
> > > > > tweets per user per hash tag?
> > > > > I know I can use the STRSPLIT function to split on '#'. That will
> > give me
> > > > a
> > > > > bag of hash tags. How can I then group by these such that each hash
> > tag
> > > > has
> > > > > a set of tweets?
> > > > You will need to 'FLATTEN' the bag of hashtags then do a 'GROUP BY' on
> > > > the hashtag itself.
> > > >
> > >
> > > If each message has an unknown number of hashtags, will a 'FLATTEN' given
> > me
> > > an unknown # of fields? If so, how do I know which field to group by? I
> > > don't want to group by messages that have the exact hash tags. I want all
> > > messages that have one of the hash tags.
> >
> > Oh, that's right, STRSPLIT (rather uselessly) yields a nested tuple and
> > NOT a bag. If you could get a bag then you could do the following (I'm
> > throwing out some fields for now):
> >
> > A = LOAD 'tweets_and_meta' AS (text:chararray, hashtags:chararray);
> > B = FOREACH A GENERATE text, FLATTEN(MySplittingUDF(hashtags)) AS
> > hashtag;
> > C = GROUP B BY hastag;
> >
> > Then C will contain a key (the hashtag) and a bag containing all the
> > tweets with that hashtag. You'll have to write 'MySplittingUDF' yourself
> > to do the same as STRSPLIT but that returns a bag instead.
> >
> > ie.
> >
> > #foobar tweet text,#foobar
> > this tweet has #two #hashtags,#two#hashtags
> > another #foobar tweet,#foobar
> >
> > will yield:
> >
> > #foobar,   {(#foobar tweet text, #foobar),(another #foobar tweet,
> > #foobar)}
> > #two,      {(this tweet has #two #hashtags, #two)}
> > #hashtags, {(this tweet has #two #hashtags, #hashtags)}
> >
> >
> > >
> > >
> > > > >     But now I want to end up something like the following:
> > > >
> > > >
> > > > > 2011-05-01    DIRECTIVE1    32423    DIRECTIVE2    3433    DIRECTIVE3
> > > > >  1983
> > > > >
> > > > > If I knew the directives ahead of time, I know I can do something
> > like
> > > > the
> > > > > following:
> > > > >
> > > > > D = GROUP C BY date;
> > > > >
> > > > > E = FOREACH D {
> > > > >      DIRECTIVE1 = FILTER type_count by directive == 'DIRECTIVE1';
> > > > >      DIRECTIVE2 = FILTER type_count by directive == 'DIRECTIVE2';
> > > > >      DIRECTIVE3 = FILTER type_count by directive == 'DIRECTIVE3';
> > > > >         GENERATE group, 'DIRECTIVE1', COUNT(DIRECTIVE1.date),
> > > > 'DIRECTIVE2',
> > > > > COUNT(DIRECTIVE2.date), 'DIRECTIVE3', COUNT(DIRECTIVE3.date);
> > > > > }
> > > > >
> > > > > But how do I do this w/o having to hardcode the filters? Am I
> > thinking
> > > > about
> > > > > this all wrong?
> > > > >
> > > > It's really a matter of how you structure your data ahead of time.
> > > > Imagine the data looking like this instead (call it X):
> > > >
> > > > 201101,directive1
> > > > 201101,directive1
> > > > 201101,directive2
> > > > 201101,directive2
> > > > 201101,directive2
> > > > 201101,directive3
> > > > 201102,directive2
> > > > 201102,directive4
> > > > 201103,directive1
> > > >
> > > > This is how my data looks (row and column wise)
> > >
> > > >
> > > > then, a simple:
> > > >
> > > > Y = GROUP X BY (date,directive);
> > > > Z = FOREACH Y GENERATE FLATTEN(group) AS (date,directive), COUNT(X) AS
> > > > num_occurrences;
> > > >
> > > > would result in:
> > > >
> > > > 201101,directive1,2
> > > > 201101,directive2,3
> > > > 201101,directive3,1
> > > > 201102,directive2,1
> > > > 201102,directive4,1
> > > > 201103,directive1,1
> > > >
> > > > At least, that's what it _seems_ like you're asking for.
> > > >
> > > > I've gotten that far. I'm actually asking for the being able to put
> > those
> > > into columns and not rows.
> > >
> > > >
> > > > --jacob
> > > > @thedatachef
> > > >
> > > > Thanks Jacob!
> > >
> > > -Christian
> > >
> > > >
> > > > > Thanks very much for you help,
> > > > > Christian
> > > >
> > > >
> > > >
> >
> >
> >

Re: Working with an unknown number of values

Reply via email to