Yep, I just did and it worked, thanks. I do still find it odd that the JOIN output below is not printing correctly, though, no?

On Fri, Nov 4, 2011 at 10:57 AM, Jacob Perkins <[email protected]> wrote:

> Have you taken a look at Pygmalion
> (http://github.com/jeromatron/pygmalion) which makes it MUCH easier to
> work with tabular data from Cassandra like you're trying to do?
>
> For example:
>
> what_cassandrastorage_should_really_produce = FOREACH rows GENERATE key
> AS key, FromCassandraBag('url,cache_hit', columns) AS (url:chararray,
> cache_hit:chararray);
>
> DUMP what_cassandrastorage_should_really_produce;
>
> (key1, http://www.google.com, hit)
> (key2, http://www.google.com, hit)
>
> Does that work for your use case?
>
> --jacob
> @thedatachef
>
>
> On Fri, 2011-11-04 at 08:51 -0400, AD wrote:
> > Hello,
> >
> > I am pulling data from cassandra into pig which means it ends up like key,
> > bag { (name,value),(name,value) }. The info is logfiles so each column is
> > a field in server logfile (like apache). I have the following pig to
> > combine 2 fields and count them but the GENERATE of the JOIN is not
> > printing the right field. Is there an easier way to solve this, and does
> > anyone know why the join output is broken ?
> >
> > rows = LOAD 'cassandra://Keyspace1/Logs' USING CassandraStorage() AS (key,
> > columns: bag {T: tuple(name, value)});
> >
> > A = FOREACH rows GENERATE $0, flatten($1) ; //FLATTEN
> > *(key1,url,http://www.google.com)*
> > *(key1,cache_hit,hit)*
> > *(key2,url,http://www.google.com)*
> > *(key2,cache_hit,miss)*
> >
> > B = group r2 by key ; // Combine url and cache_hit into one record
> > *(key1,{(key1,url,http://www.google.com),(key1,cache_hit,hit)})*
> > *(key2,{(key2,url,http://www.google.com),(key2,cache_hit,miss)})*
> >
> > // Create 2 lists and then JOIN them
> >
> > C = FOREACH B {
> >     u = FILTER A by name == 'url';
> >     GENERATE FLATTEN(u.(key,value)) ;
> > }
> > *(key1,http://www.google.com)*
> > *(key2,http://www.google.com)*
> >
> > D = FOREACH B {
> >     u2 = FILTER A by name == 'cache_hit';
> >     GENERATE FLATTEN(u2.(key,value));
> > }
> > *(key1,hit)*
> > *(key2,miss)*
> >
> > E = join C by key, D by key ;
> > *(key1,http://www.google.com,key1,hit)*
> > *(key2,http://www.google.com,key2,miss)*
> >
> > describe E ;
> > E: {C::u::key: chararray,C::u::value: chararray,D::u2::key:
> > chararray,D::u2::value: chararray}
> >
> > F = FOREACH E GENERATE C::u::value, D::u2::value ;
> >
> > *dump F ;*
> > *(http://www.google.com,http://www.google.com) ?? Why not www.google.com, hit ????*
> > *(http://www.google.com,http://www.google.com)*
> >
> > Any help appreciated.
> > AD
> >
>
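
On the "is there an easier way" question in the quoted message: a minimal sketch, in plain Pig without Pygmalion, that pulls both values out of each group in one nested FOREACH so no JOIN (and no C::u::value style projection) is needed. It assumes every row key carries exactly one url column and one cache_hit column. The aliases ch and pairs and the AS names are illustrative only, not from the thread, and the script has not been run against the data above.

```pig
-- Same LOAD as in the quoted message.
rows = LOAD 'cassandra://Keyspace1/Logs' USING CassandraStorage()
       AS (key, columns: bag {T: tuple(name, value)});

-- One (key, name, value) tuple per Cassandra column.
A = FOREACH rows GENERATE key, FLATTEN(columns) AS (name, value);

-- Collect all columns of a row back under its key.
B = GROUP A BY key;

-- Pick both fields out of each group in a single pass.
-- Assumption: exactly one 'url' and one 'cache_hit' per key.
pairs = FOREACH B {
    u  = FILTER A BY name == 'url';
    ch = FILTER A BY name == 'cache_hit';
    GENERATE group AS key,
             FLATTEN(u.value)  AS url,
             FLATTEN(ch.value) AS cache_hit;
}

DUMP pairs;
```

Two caveats on this sketch: FLATTEN of an empty bag emits nothing, so a key missing either column drops out of pairs entirely, and a key with several matching columns yields the cross product of the two bags. It sidesteps the odd JOIN projection rather than explaining it.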
