What is the guidance here on using member variables when implementing UDFs and passing properties? That is, what are the semantics for using them to store properties for a UDF instance? The docs talk a lot about making sure that no side effects happen from multiple calls to a UDF instance, but it is not clear whether that covers things like changing the location for a given instance of a UDF, or just calling it multiple times. PigStorage suggests not (since it keeps a member var location), but the UDFContext docs suggest that one keep all state in the UDFContext under an appropriate signature.
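To make the question concrete, here is a rough sketch of the "keep state under a signature" idea in plain Java. SharedContext is a hypothetical stand-in for a shared per-job context and is not Pig's UDFContext; ToBagSketch and its method names are made up for illustration too. The only assumption is that the framework may construct the UDF more than once, so a value stored in a member field on one instance is not visible to a later instance, while a value stored under (class, signature) is.

```java
import java.util.Map;
import java.util.Properties;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical stand-in for a shared per-job context. NOT Pig's UDFContext.
final class SharedContext {
    private static final Map<String, Properties> STORE = new ConcurrentHashMap<>();

    // One Properties object per (UDF class, signature) pair.
    static Properties propertiesFor(Class<?> udfClass, String signature) {
        String key = udfClass.getName() + "#" + signature;
        return STORE.computeIfAbsent(key, k -> new Properties());
    }
}

// Hypothetical UDF-like class; the constructor argument plays the role of the
// unique string passed in the define statement, e.g. ToCassandraBag('1').
final class ToBagSketch {
    private static final String SCHEMA_KEY = "cassandra.input_field_schema";
    private final String signature;

    ToBagSketch(String signature) {
        this.signature = signature;
    }

    // Store the field names under this instance's signature, not in a field.
    void rememberSchema(String fieldNames) {
        SharedContext.propertiesFor(ToBagSketch.class, signature)
                .setProperty(SCHEMA_KEY, fieldNames);
    }

    // Any instance built with the same signature reads the same value back.
    String recallSchema() {
        return SharedContext.propertiesFor(ToBagSketch.class, signature)
                .getProperty(SCHEMA_KEY);
    }
}

final class SignatureDemo {
    public static void main(String[] args) {
        ToBagSketch first = new ToBagSketch("1");
        ToBagSketch second = new ToBagSketch("2");
        first.rememberSchema("key,col_a,col_b");
        second.rememberSchema("key,col_x");

        // A fresh instance with signature '1' sees only the first schema.
        System.out.println(new ToBagSketch("1").recallSchema());
        System.out.println(new ToBagSketch("2").recallSchema());
    }
}
```

With the same property key but different signatures, the two uses no longer share one entry, which is the collision described further down the thread.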
See also https://issues.apache.org/jira/browse/CASSANDRA-2869 for another case where this has reared its head in an improper implementation.

-Grant

On Jul 7, 2011, at 3:24 AM, Jeremy Hanna wrote:

>
> On Jul 6, 2011, at 11:10 PM, Raghu Angadi wrote:
>
>> On Wed, Jul 6, 2011 at 7:20 PM, Jeremy Hanna
>> <[email protected]> wrote:
>>
>>>
>>> On Jul 6, 2011, at 12:47 PM, Dmitriy Ryaboy wrote:
>>>
>>>> I think this is the same problem we were having earlier:
>>>> http://hadoop.markmail.org/thread/kgxhdgw6zdmadch4
>>>>
>>>> One workaround is to use defines to explicitly create different
>>>> instances of your UDF, and use them separately.. it's ugly but it
>>>> works.
>>>
>>> Thanks Dmitriy.
>>>
>>> I tried doing something like:
>>> define ToCassandraBag1 org.pygmalion.udf.ToCassandraBag();
>>> define ToCassandraBag2 org.pygmalion.udf.ToCassandraBag();
>>>
>>
>> This still does not work since you can't distinguish the two. The way I was
>> thinking of doing this is to let user pass in some unique string as a
>> substitute for context:
>>
>> define ToCassandraBag1 ToCassandraBag('1');
>> define ToCassandraBag2 ToCassandraBag('2');
>
> Ah yes. I had misunderstood. Thanks for the clarification. Now the pig
> docs also make more sense in the Passing Configurations to UDFs section:
> http://pig.apache.org/docs/r0.8.1/udf.html#Passing+Configurations+to+UDFs
> It says:
> "The UDF can pass its constructor arguments, or some other identifying
> strings. This allows each instantiation of the UDF to have a different
> properties object thus avoiding name space collisions between instantiations
> of the UDF."
> and the HBaseStorage example was helpful to see that in action.
>
> Thanks both to Raghu and Dmitriy.
>
>>
>> inside the UDF, you would use this arg to make a 'contextString' (see
>> HBaseStorage.java for example use) to store any state.
>>
>> ideally UDFs shouldn't have to do this.. They should have the same context
>> info that is available for loaders and storage.
>>
>> Raghu.
>>
>>
>>>
>>> at the top and then using each one only once. That still produces the same
>>> error. I guess in this case we'll just have to require the field names be
>>> entered into the UDF and it won't introspect them. Ah well. Would be nice
>>> to be able to use it but I don't really see another way around this bug with
>>> the shared UDF context.
>>>
>>>>
>>>> D
>>>>
>>>> On Wed, Jul 6, 2011 at 9:42 AM, Jeremy Hanna <[email protected]> wrote:
>>>>> We have a UDF that introspects the output schema and gets the field
>>>>> names there and uses that in the exec method.
>>>>>
>>>>> The UDF is found here:
>>>>> https://github.com/jeromatron/pygmalion/blob/master/udf/src/main/java/org/pygmalion/udf/ToCassandraBag.java
>>>>>
>>>>> A simple example is found here:
>>>>> https://github.com/jeromatron/pygmalion/blob/master/scripts/from_to_cassandra_bag_example.pig
>>>>>
>>>>> It takes the relation's aliases and uses them in the output so that the
>>>>> user doesn't have to specify them. However we've been noticing that if you
>>>>> have more than one ToCassandraBag call in a pig script, sometimes they are
>>>>> run at the same time and the key is the same in the UDF context:
>>>>> cassandra.input_field_schema. So we think there is an issue there (array
>>>>> out of bounds exceptions when running the script, but when running in grunt
>>>>> one at a time, it doesn't do that).
>>>>>
>>>>> Is there a right way to do this parameter passing so that we don't get
>>>>> these errors when multiple calls exist?
>>>>>
>>>>> We thought of using the schema hash code as a suffix (e.g.
>>>>> cassandra.input_field_schema.12344321) but we don't have access to the
>>>>> schema in the exec method.
>>>>>
>>>>> We thought of having the first parameter of the input tuple be a unique
>>>>> name that the script specifies, like 'filename.relationalias' as a
>>>>> convention to make them unique to the file. However in the outputSchema, we
>>>>> don't have access to the input tuple, just the schema itself, so it couldn't
>>>>> get that value in there.
>>>>>
>>>>> Any ideas on how to make this so they don't stomp on each other within
>>>>> the pig script? Is there a best way to do that?
>>>>>
>>>>> Thanks!
>>>>>
>>>>> Jeremy
>>>
>>>
>
--------------------------
Grant Ingersoll
