accessing values from column names when using bag returned by CassandraStorage

Gianni Moschini Fri, 06 May 2011 11:21:20 -0700

It is possible to access column values stored in Cassandra from Pig by the
column names defined in the Cassandra schema, using a UDF from pygmalion.



So imagine a schema like this:


create column family Users
   with column_type = Standard
   and comparator = UTF8Type
   and default_validation_class = UTF8Type;


and data like this:

RowKey: 1
=> (column=firstname, value=albert, timestamp=1304694447722746)
=> (column=city, value=london, timestamp=1304694447722746)
-------------------
RowKey: 2
=> (column=firstname, value=antonio, timestamp=1304694447140376)
=> (column=city, value=roma, timestamp=1304694447140376)


Note that CassandraStorage returns the columns of each row as a bag
{T: tuple(name, value)}, so the column names arrive as data, not as Pig fields.

So in Pig, your load statement will be something like:

rows = LOAD 'cassandra://Keyspace/Users' USING
org.apache.cassandra.hadoop.pig.CassandraStorage() as (key:chararray,
columns: bag{T:(columnname, columnvalue)});

If you ILLUSTRATE this, you get:

---------------------------------------------------------------------------------------------------
| rows | key: bytearray | columns: bytearray({T: (columnname: bytearray,columnvalue: bytearray)}) |
---------------------------------------------------------------------------------------------------
|      | 1              | {(firstname, albert), (city, london)}                                   |
---------------------------------------------------------------------------------------------------
---------------------------------------------------------------------------------------------
| rows | key: chararray | columns: bag({T: (columnname: bytearray,columnvalue: bytearray)}) |
---------------------------------------------------------------------------------------------
|      | 1              | {(firstname, antonio), (city, roma)}                              |
---------------------------------------------------------------------------------------------

Now, if you want to access those column values by name, here is the trick.
Register the pygmalion jar first (you need to build it, of course).

register 'pygmalion.jar';

And then, here is the magic part:

rows_namedcols = foreach rows generate key,
flatten(org.pygmalion.udf.FromCassandraBag('firstname, city', columns))
as (firstname: chararray, city: chararray);

Now you can query your columns directly from Pig. Isn't that awesome?

rows_london = filter rows_namedcols by city == 'london';
names_london = foreach rows_london generate firstname;
dump names_london;
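
For anyone wondering what the pivot does conceptually, here is a rough sketch
in plain Python. It is not pygmalion's actual code: from_cassandra_bag is a
hypothetical stand-in for the UDF, and I am assuming that the names string is
split on commas with surrounding whitespace trimmed, and that a column missing
from a row comes back as null (None here). The sample rows mirror the data
above.

```python
def from_cassandra_bag(names, columns):
    """Hypothetical stand-in for the UDF: pivot a bag of (name, value)
    pairs into a tuple ordered like the comma-separated `names`."""
    # Assumption: names are comma-separated and whitespace is ignored.
    wanted = [n.strip() for n in names.split(",")]
    lookup = dict(columns)
    # Assumption: a column that is absent from the row yields None.
    return tuple(lookup.get(n) for n in wanted)


# Each row is (key, bag of (columnname, columnvalue)), as loaded above.
rows = [
    ("1", [("firstname", "albert"), ("city", "london")]),
    ("2", [("firstname", "antonio"), ("city", "roma")]),
]

# Equivalent of: foreach rows generate key, flatten(FromCassandraBag(...))
rows_namedcols = [
    (key,) + from_cassandra_bag("firstname, city", columns)
    for key, columns in rows
]

# Equivalent of the filter on city followed by the projection of firstname.
names_london = [
    firstname for _key, firstname, city in rows_namedcols if city == "london"
]
print(names_london)  # prints ['albert']
```

The point is simply that the bag is treated as a name-to-value lookup, and the
order of the names you pass in decides the order of the fields you get back.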


You can download the UDF from here:
https://github.com/jeromatron/pygmalion

Thanks to jeromatron for this!
