Yep, I just did and it worked, thanks.

I do still find it odd that the JOIN output below is not printing
correctly, though, no?
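
For reference, here is a minimal sketch of the result the JOIN and the final
GENERATE were expected to produce. It is plain Python rather than Pig, only
meant to pin down the expected rows; the sample tuples are copied from the
DUMP output of C and D quoted below, and the names inner_join, c_rows, d_rows,
e_rows and f_rows are hypothetical.

```python
# Sample rows copied from the DUMP output of C and D in the quoted message.
c_rows = [("key1", "http://www.google.com"), ("key2", "http://www.google.com")]
d_rows = [("key1", "hit"), ("key2", "miss")]


def inner_join(left, right):
    """Inner-join two lists of (key, value) tuples on key (hypothetical helper)."""
    right_by_key = {}
    for key, value in right:
        right_by_key.setdefault(key, []).append(value)
    # Emit one (key, left_value, right_value) row per matching pair.
    return [
        (key, left_value, right_value)
        for key, left_value in left
        for right_value in right_by_key.get(key, [])
    ]


# Equivalent in intent to: E = join C by key, D by key
e_rows = inner_join(c_rows, d_rows)

# Equivalent in intent to projecting C's value and D's value from E.
f_rows = [(url, cache_hit) for _key, url, cache_hit in e_rows]
print(f_rows)
```

Pairing the two value columns by key gives (url, hit) and (url, miss), which
is what F was expected to contain instead of the url repeated twice.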

On Fri, Nov 4, 2011 at 10:57 AM, Jacob Perkins <[email protected]> wrote:

> Have you taken a look at Pygmalion
> (http://github.com/jeromatron/pygmalion) which makes it MUCH easier to
> work with tabular data from Cassandra like you're trying to do?
>
> For example:
>
> what_cassandrastorage_should_really_produce = FOREACH rows GENERATE key
> AS key, FromCassandraBag('url,cache_hit', columns) AS (url:chararray,
> cache_hit:chararray);
>
> DUMP what_cassandrastorage_should_really_produce;
>
> (key1, http://www.google.com, hit)
> (key2, http://www.google.com, hit)
>
> Does that work for your use case?
>
> --jacob
> @thedatachef
>
>
> On Fri, 2011-11-04 at 08:51 -0400, AD wrote:
> > Hello,
> >
> > I am pulling data from Cassandra into Pig, which means it ends up like
> > key, bag { (name,value),(name,value) }. The data is log files, so each
> > column is a field in a server log file (like Apache). I have the
> > following Pig script to combine 2 fields and count them, but the GENERATE
> > on the JOIN output is not printing the right field. Is there an easier
> > way to solve this, and does anyone know why the join output is broken?
> >
> > rows = LOAD 'cassandra://Keyspace1/Logs' USING CassandraStorage() AS
> > (key, columns: bag {T: tuple(name, value)});
> >
> >  A = FOREACH rows GENERATE $0, flatten($1) ; -- FLATTEN
> > *(key1,url,http://www.google.com)*
> > *(key1,cache_hit,hit)*
> > *(key2,url,http://www.google.com)*
> > *(key2,cache_hit,miss)*
> >
> >  B = group A by key ; -- Combine url and cache_hit into one record
> > *(key1,{(key1,url,http://www.google.com),(key1,cache_hit,hit)})*
> > *(key2,{(key2,url,http://www.google.com),(key2,cache_hit,miss)})*
> >
> >  -- Create 2 lists and then JOIN them
> >
> >  C = FOREACH B {
> >  u = FILTER A by name == 'url';
> >  GENERATE FLATTEN(u.(key,value)) ;
> >  };
> > * (key1,http://www.google.com)*
> > * (key2,http://www.google.com)*
> >
> >  D = FOREACH B {
> >  u2 = FILTER A by name == 'cache_hit';
> >  GENERATE FLATTEN(u2.(key,value));
> >  };
> >  *(key1,hit)*
> > * (key2,miss)*
> >
> >  E = join C by key, D by key ;
> > *(key1,http://www.google.com,key1,hit)*
> > *(key2,http://www.google.com,key2,miss)*
> >
> > describe E ;
> > E: {C::u::key: chararray,C::u::value: chararray,D::u2::key:
> > chararray,D::u2::value: chararray}
> >
> > F = FOREACH E GENERATE C::u::value, D::u2::value ;
> >
> > *dump F ;*
> > *(http://www.google.com,http://www.google.com)  ?? Why not
> > www.google.com, hit ????*
> > *(http://www.google.com,http://www.google.com)*
> >
> > Any help appreciated.
> > AD
>
>
>
