On Apr 6, 2011, at 6:16 PM, bob wrote:

> Honestly, I'd rather have a keyed bag of maps on the initial load, but that'd
> work too. Is it really that hard to get cassandra data out that you need a
> UDF to do anything besides an initial dump?

That's what we're doing because it just makes it easier to deal with
tabular-like data - we don't have to munge through it quite as much. I'm still
pretty low on my pig-fu but others on the list might have better answers on how
to deal with that data structure.

>
> On Apr 6, 2011, at 3:51 PM, Jeremy Hanna wrote:
>
>> I'm going to put a UDF up on the pygmalion project hopefully today that will
>> convert that into something more usable. Props to Jacob from infochimps -
>> he and I have been creating UDFs like that lately for use with Cassandra.
>> There's an associated UDF for getting it back into the key, cols form to
>> output to cassandra as well. I'll try to get that pushed tonight but take a
>> look at:
>> https://github.com/jeromatron/pygmalion/
>> That's where I'll push the code - hopefully that will help.
>>
>> What it does is take the data structure returned from cassandra and let
>> you say, "give me the key and the values for these column names in a bag",
>> so for your example it would return:
>> {(faaaaaaaaazzzzzzeaaa,faaaaaaaaa,zzzzzzeaaa)}
>> and you could assign var names for each like key, first, last within pig.
>>
>> Anyway, if that helps, look for that soon. It's helping us use the output
>> as tabular data.
>>
>> On Apr 6, 2011, at 5:40 PM, bob wrote:
>>
>>> No matter what I try, I end up losing the tuples after the initial flatten.
>>> I'm using some auto-generated test data with first, last and a
>>> concatenation for the key. The script and outputs...
>>>
>>> rows = LOAD 'cassandra://Keyspace2/Standard1' USING CassandraStorage() as
>>> (key:chararray, cols:bag{T:tuple(name:chararray, value:chararray) } );
>>> dump rows;
>>>
>>> (faaaaaaaaazzzzzzeaaa,{(first,faaaaaaaaa),(last,zzzzzzeaaa)})
>>> (jaaaaaaaaazzzlaaaaaa,{(first,jaaaaaaaaa),(last,zzzlaaaaaa)})
>>> (naaaaaaaaazzzzzpaaaa,{(first,naaaaaaaaa),(last,zzzzzpaaaa)})
>>> (uaaaaaaaaazzzzzsaaaa,{(first,uaaaaaaaaa),(last,zzzzzsaaaa)})
>>> (vaaaaaaaaafaaaaaaaaa,{(first,vaaaaaaaaa),(last,faaaaaaaaa)})
>>> (zuaaaaaaaazpaaaaaaaa,{(first,zuaaaaaaaa),(last,zpaaaaaaaa)})
>>> (zuaaaaaaaazzzzhaaaaa,{(first,zuaaaaaaaa),(last,zzzzhaaaaa)})
>>> (zwaaaaaaaaznaaaaaaaa,{(first,zwaaaaaaaa),(last,znaaaaaaaa)})
>>> (zziaaaaaaazfaaaaaaaa,{(first,zziaaaaaaa),(last,zfaaaaaaaa)})
>>> (zzkaaaaaaazzzdaaaaaa,{(first,zzkaaaaaaa),(last,zzzdaaaaaa)})
>>>
>>> So far, so good.
>>>
>>> columns = foreach rows generate flatten(cols) as (name, value);
>>> dump columns;
>>>
>>> ()
>>> ()
>>> ()
>>> ()
>>> ()
>>> ()
>>> ()
>>> ()
>>> ()
>>> ()
>>>
>>> Not so good.
>>>
>>> I've tried multiple combinations w/ no success. If I just leave bag empty
>>> in the initial load, i.e. cols:bag{} and then leave off the as in the
>>> flatten I get something that looks like a list of tuples. But, trying to
>>> access $1 in that result gives me an Error 1000 index out of range. So,
>>> that's not the ticket either.
>>>
>>> What I'd really like is to flatten the bag into a map, but I'm about as
>>> successful there as well.
>>>
>>> Any help is most welcome. (Cassandra 0.7.4 and Pig 0.8.0)
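
For anyone following the thread, here is a rough sketch of the reshaping
described above: take a (key, bag of (name, value) tuples) row and pull out the
values for a chosen list of column names, so each row becomes one flat tuple.
It is written in plain Python only to model the data shapes; it is not the
pygmalion UDF and not Pig code, and the function name select_columns is
hypothetical. Returning None for a column that is absent from a row is also an
assumption.

```python
def select_columns(row, names):
    """Flatten one (key, [(name, value), ...]) row into (key, v1, v2, ...).

    `row` models the record shape shown in the dump above. `names` lists the
    column names to pull out, in the order they should appear in the result.
    Columns missing from the row come back as None (an assumption).
    """
    key, cols = row
    lookup = dict(cols)  # column name -> column value
    return (key,) + tuple(lookup.get(name) for name in names)


# Two of the rows from the dump above, modeled as Python tuples and lists.
rows = [
    ("faaaaaaaaazzzzzzeaaa", [("first", "faaaaaaaaa"), ("last", "zzzzzzeaaa")]),
    ("jaaaaaaaaazzzlaaaaaa", [("first", "jaaaaaaaaa"), ("last", "zzzlaaaaaa")]),
]

# Each row becomes a flat (key, first, last) tuple, i.e. tabular data.
table = [select_columns(row, ["first", "last"]) for row in rows]

for key, first, last in table:
    print(key, first, last)
```

Building a name-to-value lookup per row is also roughly what "flatten the bag
into a map" would amount to, with the listed names picking which map entries
become columns.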
