I think this is the same problem we were having earlier: http://hadoop.markmail.org/thread/kgxhdgw6zdmadch4
One workaround is to use DEFINEs to explicitly create different instances of your
UDF and use them separately. It's ugly, but it works.

D

On Wed, Jul 6, 2011 at 9:42 AM, Jeremy Hanna <[email protected]> wrote:
> We have a UDF that introspects the output schema, gets the field names
> there, and uses them in the exec method.
>
> The UDF is found here:
> https://github.com/jeromatron/pygmalion/blob/master/udf/src/main/java/org/pygmalion/udf/ToCassandraBag.java
>
> A simple example is found here:
> https://github.com/jeromatron/pygmalion/blob/master/scripts/from_to_cassandra_bag_example.pig
>
> It takes the relation's aliases and uses them in the output so that the user
> doesn't have to specify them. However, we've noticed that if you have more
> than one ToCassandraBag call in a Pig script, sometimes they run at the same
> time and share the same key in the UDFContext: cassandra.input_field_schema.
> So we think there is an issue there (we get array out of bounds exceptions
> when running the script, but when running the calls one at a time in grunt,
> it doesn't happen).
>
> Is there a right way to do this parameter passing so that we don't get these
> errors when multiple calls exist?
>
> We thought of using the schema hash code as a suffix (e.g.
> cassandra.input_field_schema.12344321), but we don't have access to the
> schema in the exec method.
>
> We thought of having the first parameter of the input tuple be a unique name
> that the script specifies, like 'filename.relationalias', as a convention to
> make them unique to the file. However, in outputSchema we don't have access
> to the input tuple, just the schema itself, so we couldn't get that value in
> there.
>
> Any ideas on how to keep the calls from stomping on each other within the
> pig script? Is there a best way to do that?
>
> Thanks!
>
> Jeremy
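
For reference, here is a rough sketch of how the DEFINE approach can work when
the UDF keys its UDFContext properties by a constructor argument instead of one
shared property name. This is not the actual ToCassandraBag code: the class
name (ToCassandraBagKeyed), the instanceName constructor argument, and the
input_field_schema property are made up for illustration. The mechanism itself,
UDFContext.getUDFProperties(Class, String[]), is the standard Pig way to get
per-instance state.

    import java.io.IOException;
    import java.util.Properties;

    import org.apache.pig.EvalFunc;
    import org.apache.pig.data.Tuple;
    import org.apache.pig.impl.logicalLayer.schema.Schema;
    import org.apache.pig.impl.util.UDFContext;

    // Hypothetical variant of a schema-introspecting UDF: each DEFINEd
    // instance gets its own instanceName, so front-end and back-end code
    // read and write separate Properties objects instead of one shared key.
    //
    // In the Pig script, each DEFINE passes a different instance name:
    //   DEFINE ToFooBag org.pygmalion.udf.ToCassandraBagKeyed('foo');
    //   DEFINE ToBarBag org.pygmalion.udf.ToCassandraBagKeyed('bar');
    public class ToCassandraBagKeyed extends EvalFunc<Tuple> {
        private final String instanceName;

        public ToCassandraBagKeyed(String instanceName) {
            this.instanceName = instanceName;
        }

        // Properties scoped to (class, constructor args), so instances
        // created by different DEFINEs no longer stomp on each other.
        private Properties getProps() {
            return UDFContext.getUDFContext()
                    .getUDFProperties(getClass(), new String[] { instanceName });
        }

        @Override
        public Schema outputSchema(Schema input) {
            // Front end: stash this instance's input field schema.
            getProps().setProperty("input_field_schema", input.toString());
            return input;  // placeholder; the real UDF builds its own output schema
        }

        @Override
        public Tuple exec(Tuple input) throws IOException {
            // Back end: read back this instance's schema, not a shared one.
            String fieldSchema = getProps().getProperty("input_field_schema");
            // ... use fieldSchema to build the output tuple ...
            return input;  // placeholder
        }
    }

The existing ToCassandraBag would need its constructor and property lookups
extended along these lines for the DEFINE trick to isolate the two calls.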
