Re: help flattening data from cassandra loader

Jeremy Hanna Wed, 06 Apr 2011 16:20:11 -0700

On Apr 6, 2011, at 6:16 PM, bob wrote:

> Honestly, I'd rather have a keyed bag of maps on the initial load, but that'd 
> work too. Is it really that hard to get cassandra data out that you need a 
> UDF to do anything besides an initial dump?


That's what we're doing because it just makes it easier to deal with 
tabular-like data - we don't have to munge through it quite as much.  I'm still 
pretty low on my pig-fu but others on the list might have better answers on how 
to deal with that data structure.
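
For anyone skimming the thread, here is a rough sketch of the two shapes being discussed: the keyed map asked for above, and the tabular (key, first, last) form described in the quoted message below. It is plain Python rather than Pig, purely to model the data. The function names to_map and to_tabular are made up for illustration and are not the pygmalion UDF names or API.

```python
# Rows shaped like the CassandraStorage dump in the quoted message:
# (key, bag of (column name, value) tuples). Sample values copied from the thread.
rows = [
    ("faaaaaaaaazzzzzzeaaa", [("first", "faaaaaaaaa"), ("last", "zzzzzzeaaa")]),
    ("jaaaaaaaaazzzlaaaaaa", [("first", "jaaaaaaaaa"), ("last", "zzzlaaaaaa")]),
]


def to_map(cols):
    """Collapse a bag of (name, value) pairs into a name -> value map (hypothetical helper)."""
    return dict(cols)


def to_tabular(key, cols, names):
    """Return (key, value per requested column name); a missing column becomes None."""
    by_name = to_map(cols)
    return (key,) + tuple(by_name.get(name) for name in names)


# "Keyed bag of maps" shape: one (key, map) pair per row.
keyed_maps = [(key, to_map(cols)) for key, cols in rows]

# Tabular shape: one flat (key, first, last) tuple per row.
table = [to_tabular(key, cols, ["first", "last"]) for key, cols in rows]

print(table[0])  # ('faaaaaaaaazzzzzzeaaa', 'faaaaaaaaa', 'zzzzzzeaaa')
```

The tabular form fixes the column order up front, which is what makes it easy to name the fields afterwards; the map form keeps every column but leaves lookups by name to later steps.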

> 
> On Apr 6, 2011, at 3:51 PM, Jeremy Hanna wrote:
> 
>> I'm going to put a UDF up on the pygmalion project hopefully today that will 
>> convert that into something more usable.  Props to Jacob from infochimps - 
>> he and I have been creating UDFs like that lately for use with Cassandra.  
>> There's an associated UDF for getting it back into the key, cols form to 
>> output to cassandra as well.  I'll try to get that pushed tonight but take a 
>> look at:
>> https://github.com/jeromatron/pygmalion/
>> That's where I'll push the code - hopefully that will help.
>> 
>> What it does is take the data structure returned from cassandra and let 
>> you say, "give me the key and the values for these column names" in a bag, so 
>> for your example it would return:
>> {(faaaaaaaaazzzzzzeaaa,faaaaaaaaa,zzzzzzeaaa)}
>> and you could assign var names for each like key, first, last within pig.
>> 
>> Anyway, if that helps, look for that soon.  It's helping us use the output 
>> as tabular data.
>> 
>> On Apr 6, 2011, at 5:40 PM, bob wrote:
>> 
>>> No matter what I try, I end up losing the tuples after the initial flatten. 
>>> I'm using some auto-generated test data with first, last and a 
>>> concatenation of the two for the key. The script and outputs follow.
>>> 
>>> rows = LOAD 'cassandra://Keyspace2/Standard1' USING CassandraStorage() as 
>>> (key:chararray, cols:bag{T:tuple(name:chararray, value:chararray) } );
>>> dump rows;
>>> 
>>> (faaaaaaaaazzzzzzeaaa,{(first,faaaaaaaaa),(last,zzzzzzeaaa)})
>>> (jaaaaaaaaazzzlaaaaaa,{(first,jaaaaaaaaa),(last,zzzlaaaaaa)})
>>> (naaaaaaaaazzzzzpaaaa,{(first,naaaaaaaaa),(last,zzzzzpaaaa)})
>>> (uaaaaaaaaazzzzzsaaaa,{(first,uaaaaaaaaa),(last,zzzzzsaaaa)})
>>> (vaaaaaaaaafaaaaaaaaa,{(first,vaaaaaaaaa),(last,faaaaaaaaa)})
>>> (zuaaaaaaaazpaaaaaaaa,{(first,zuaaaaaaaa),(last,zpaaaaaaaa)})
>>> (zuaaaaaaaazzzzhaaaaa,{(first,zuaaaaaaaa),(last,zzzzhaaaaa)})
>>> (zwaaaaaaaaznaaaaaaaa,{(first,zwaaaaaaaa),(last,znaaaaaaaa)})
>>> (zziaaaaaaazfaaaaaaaa,{(first,zziaaaaaaa),(last,zfaaaaaaaa)})
>>> (zzkaaaaaaazzzdaaaaaa,{(first,zzkaaaaaaa),(last,zzzdaaaaaa)})
>>> 
>>> So far, so good.
>>> 
>>> 
>>> columns = foreach rows generate flatten(cols) as (name, value);        
>>> dump columns;
>>> 
>>> ()
>>> ()
>>> ()
>>> ()
>>> ()
>>> ()
>>> ()
>>> ()
>>> ()
>>> ()
>>> 
>>> 
>>> Not so good.
>>> 
>>> 
>>> 
>>> I've tried multiple combinations with no success.  If I leave the bag 
>>> schema empty in the initial load, i.e. cols:bag{}, and then leave off the 
>>> "as" clause in the flatten, I get something that looks like a list of 
>>> tuples. But trying to access $1 in that result gives me an Error 1000 
>>> index out of range. So that's not the ticket either.
>>> 
>>> What I'd really like is to flatten the bag into a map, but I've had about 
>>> as little success there.
>>> 
>>> Any help is most welcome.  (Cassandra 0.7.4 and Pig 0.8.0)
>>> 
>>> 
>> 
>
