Hello,

 I am pulling data from Cassandra into Pig, so each row arrives as a key
plus a bag of columns: key, bag { (name,value),(name,value) }.  The data
is log files, so each column is one field of a server log line (like
Apache).  I have the following Pig script to combine two fields and count
them, but the GENERATE after the JOIN does not print the right field.  Is
there an easier way to solve this, and does anyone know why the join
output is broken?

rows = LOAD 'cassandra://Keyspace1/Logs' USING CassandraStorage() AS (key,
columns: bag {T: tuple(name, value)});

 A = FOREACH rows GENERATE $0, FLATTEN($1) ; -- FLATTEN
*(key1,url,http://www.google.com)*
*(key1,cache_hit,hit)*
*(key2,url,http://www.google.com)*
*(key2,cache_hit,miss)*

 B = GROUP A BY key ; -- Combine url and cache_hit into one record
*(key1,{(key1,url,http://www.google.com),(key1,cache_hit,hit)})*
*(key2,{(key2,url,http://www.google.com),(key2,cache_hit,miss)})*

 -- Create 2 lists and then JOIN them

 C = FOREACH B {
 u = FILTER A by name == 'url';
 GENERATE FLATTEN(u.(key,value)) ;
 }
*(key1,http://www.google.com)*
*(key2,http://www.google.com)*

 D = FOREACH B {
 u2 = FILTER A by name == 'cache_hit';
 GENERATE FLATTEN(u2.(key,value));
 }
*(key1,hit)*
*(key2,miss)*

 E = join C by key, D by key ;
*(key1,http://www.google.com,key1,hit)*
*(key2,http://www.google.com,key2,miss)*

describe E ;
E: {C::u::key: chararray,C::u::value: chararray,D::u2::key:
chararray,D::u2::value: chararray}

F = FOREACH E GENERATE C::u::value, D::u2::value ;

dump F ;
*(http://www.google.com,http://www.google.com)*
*(http://www.google.com,http://www.google.com)*

Why is this not (http://www.google.com,hit) and
(http://www.google.com,miss)?
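
For comparison, here is a rough, untested sketch of what I imagine a
simpler version could look like.  It skips the GROUP and the nested
FOREACH, filters the flattened relation once per field, joins the two
results on the row key, and then counts each (url, cache_hit) pair.  The
relation names urls, hits, J, pairs, G and counts and the field name n are
made up for illustration, and it assumes the same LOAD statement as above.
I do not know whether it would avoid the problem, since both join inputs
still come from the same relation.

```pig
-- Untested sketch; reuses `rows` from the LOAD statement above.
-- Name the flattened fields explicitly so they can be referenced later.
A = FOREACH rows GENERATE key, FLATTEN(columns) AS (name, value);

-- One relation per log field, then join them back together on the row key.
urls = FILTER A BY name == 'url';
hits = FILTER A BY name == 'cache_hit';
J = JOIN urls BY key, hits BY key;

-- Keep only the two values, disambiguated with the relation prefix.
pairs = FOREACH J GENERATE urls::value AS url, hits::value AS cache_hit;

-- Count how often each (url, cache_hit) combination occurs.
G = GROUP pairs BY (url, cache_hit);
counts = FOREACH G GENERATE FLATTEN(group) AS (url, cache_hit),
                            COUNT(pairs) AS n;
DUMP counts;
```

If that is a reasonable way to write it, I would still like to understand
why the original version projects the wrong field.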
Any help appreciated.
AD
