We have a UDF that introspects the output schema, gets the field names there, 
and uses them in the exec method.

The UDF is found here: 
https://github.com/jeromatron/pygmalion/blob/master/udf/src/main/java/org/pygmalion/udf/ToCassandraBag.java

A simple example is found here: 
https://github.com/jeromatron/pygmalion/blob/master/scripts/from_to_cassandra_bag_example.pig

It takes the relation's aliases and uses them in the output so that the user 
doesn't have to specify them.  However, we've noticed that if a Pig script has 
more than one ToCassandraBag call, the calls are sometimes run at the same 
time, and they all use the same key in the UDF context: 
cassandra.input_field_schema.  So we think the calls collide on that key: we 
get array out of bounds exceptions when running the whole script, but not when 
running the statements in grunt one at a time.
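
To make the suspected failure concrete, here is a minimal, Pig-free sketch of 
what we think is happening.  It is an assumption about the cause, not a 
confirmed diagnosis: a plain java.util.Properties object stands in for the UDF 
context, and recordSchema/describe are hypothetical stand-ins for outputSchema 
and exec.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Properties;

public class SharedKeyCollision {
    // Hypothetical stand-in for the properties held in the UDF context.
    static final Properties udfProps = new Properties();
    static final String KEY = "cassandra.input_field_schema";

    // Stand-in for outputSchema: stores the field names under the one shared key.
    static void recordSchema(String... fieldNames) {
        udfProps.setProperty(KEY, String.join(",", fieldNames));
    }

    // Stand-in for exec: pairs each tuple value with the stored field name.
    static String describe(List<String> tuple) {
        String[] names = udfProps.getProperty(KEY).split(",");
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < tuple.size(); i++) {
            sb.append(names[i]).append('=').append(tuple.get(i)).append(';');
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        recordSchema("rowkey", "name", "email"); // first call site
        recordSchema("rowkey", "score");         // second call site overwrites it
        try {
            // A three-field tuple from the first call site now sees only two names.
            describe(Arrays.asList("k1", "alice", "a@example.com"));
        } catch (ArrayIndexOutOfBoundsException e) {
            System.out.println("collision: " + e.getClass().getSimpleName());
        }
    }
}
```

Running the two calls one at a time, as in grunt, never leaves the wrong 
schema under the key, which would match what we see.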

Is there a right way to do this parameter passing so that we don't get these 
errors when multiple calls exist?

We thought of using the schema hash code as a suffix (e.g. 
cassandra.input_field_schema.12344321) but we don't have access to the schema 
in the exec method.
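
As a rough sketch of the suffix idea (again Pig-free, with hypothetical names): 
Arrays.hashCode over the field names stands in for the schema hash code, and a 
plain Properties object stands in for the UDF context.

```java
import java.util.Arrays;
import java.util.Properties;

public class SuffixedSchemaKeys {
    static final String BASE_KEY = "cassandra.input_field_schema";
    // Hypothetical stand-in for the properties held in the UDF context.
    static final Properties udfProps = new Properties();

    // Builds a per-schema key; Arrays.hashCode stands in for the schema hash code.
    static String keyFor(String... fieldNames) {
        return BASE_KEY + "." + Arrays.hashCode(fieldNames);
    }

    // Stand-in for outputSchema: stores the field names under the suffixed key
    // and returns that key so the caller can look the value up again.
    static String recordSchema(String... fieldNames) {
        String key = keyFor(fieldNames);
        udfProps.setProperty(key, String.join(",", fieldNames));
        return key;
    }

    public static void main(String[] args) {
        String first = recordSchema("rowkey", "name", "email");
        String second = recordSchema("rowkey", "score");
        // Each call site keeps its own entry as long as the suffixes differ.
        System.out.println(first + " -> " + udfProps.getProperty(first));
        System.out.println(second + " -> " + udfProps.getProperty(second));
    }
}
```

The sketch only works because main holds on to the keys that recordSchema 
returns.  In the real UDF, exec would have to rebuild the suffix itself, and 
that is exactly where we get stuck without the schema.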

We thought of having the first parameter of the input tuple be a unique name 
that the script specifies, like 'filename.relationalias', as a convention to 
make the keys unique to the file.  However, in outputSchema we don't have 
access to the input tuple, just the schema itself, so the UDF couldn't read 
that value there.

Any ideas on how to keep the calls from stomping on each other within the pig 
script?  Is there a best way to do that?

Thanks!

Jeremy
