I think this is the same problem we were having earlier:
http://hadoop.markmail.org/thread/kgxhdgw6zdmadch4

One workaround is to use DEFINE statements to explicitly create a
separate instance of your UDF for each call site, and use each one
separately. It's ugly, but it works.
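
Roughly, the workaround could look like the sketch below. This is not
tested against pygmalion: the jar path, the input files, the relation
names and the two alias names (ToCassandraBagUsers, ToCassandraBagEvents)
are all made up for illustration, and I'm assuming the UDF has a no-arg
constructor.

```pig
-- Hypothetical jar location; adjust to wherever the UDF jar lives.
REGISTER pygmalion.jar;

-- One DEFINE per call site, so each use refers to its own alias
-- instead of sharing a single ToCassandraBag reference.
DEFINE ToCassandraBagUsers  org.pygmalion.udf.ToCassandraBag();
DEFINE ToCassandraBagEvents org.pygmalion.udf.ToCassandraBag();

-- Hypothetical inputs with different schemas.
users  = LOAD 'users.tsv'  AS (id:chararray, name:chararray);
events = LOAD 'events.tsv' AS (id:chararray, kind:chararray, ts:long);

-- Each relation goes through its own alias.
users_bag  = FOREACH users  GENERATE ToCassandraBagUsers(id, name);
events_bag = FOREACH events GENERATE ToCassandraBagEvents(id, kind, ts);
```

Whether this fully isolates the two calls depends on how the UDF keys
what it stores in the UDF context, so check that each alias really sees
its own field names.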

D

On Wed, Jul 6, 2011 at 9:42 AM, Jeremy Hanna <[email protected]> wrote:
> We have a UDF that introspects the output schema, gets the field names
> there, and uses them in the exec method.
>
> The UDF is found here: 
> https://github.com/jeromatron/pygmalion/blob/master/udf/src/main/java/org/pygmalion/udf/ToCassandraBag.java
>
> A simple example is found here: 
> https://github.com/jeromatron/pygmalion/blob/master/scripts/from_to_cassandra_bag_example.pig
>
> It takes the relation's aliases and uses them in the output so that the user 
> doesn't have to specify them.  However we've been noticing that if you have 
> more than one ToCassandraBag call in a pig script, sometimes they are run at 
> the same time and the key is the same in the UDF context: 
> cassandra.input_field_schema.  So we think there is an issue there (array out 
> of bounds exceptions when running the script, but when running in grunt one 
> at a time, it doesn't do that).
>
> Is there a right way to do this parameter passing so that we don't get these 
> errors when multiple calls exist?
>
> We thought of using the schema hash code as a suffix (e.g. 
> cassandra.input_field_schema.12344321) but we don't have access to the schema 
> in the exec method.
>
> We thought of having the first parameter of the input tuple be a unique name 
> that the script specifies, like 'filename.relationalias' as a convention to 
> make them unique to the file.  However in the outputSchema, we don't have 
> access to the input tuple, just the schema itself, so it couldn't get that 
> value in there.
>
> Any ideas on how to keep the calls from stomping on each other within the
> pig script?  Is there a best way to do that?
>
> Thanks!
>
> Jeremy
