But are you keeping member variables or do you put everything in the context?
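
To make the question concrete, here is a rough sketch of the second option as I understand it: the signature is the only member variable, and everything else lives in a Properties object looked up by that signature. It is only an illustration under assumptions. SketchContext is a made-up stand-in for Pig's UDFContext so the code runs without Pig on the classpath, and SketchStorage, rememberSchema and recallSchema are hypothetical names. Only setUDFContextSignature and the cassandra.input_field_schema key come from this thread.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Properties;

// HYPOTHETICAL stand-in for Pig's UDFContext. It only mimics the idea of
// handing out one Properties object per (class, identifying strings) pair.
final class SketchContext {
    private static final Map<String, Properties> STORE = new HashMap<>();

    static Properties getUDFProperties(Class<?> c, String[] args) {
        String key = c.getName() + "|" + String.join("|", args);
        return STORE.computeIfAbsent(key, k -> new Properties());
    }
}

// HYPOTHETICAL storage function: the signature is the only member state.
final class SketchStorage {
    private static final String SCHEMA_KEY = "cassandra.input_field_schema";

    private String loadSignature;

    void setUDFContextSignature(String signature) {
        this.loadSignature = signature;
    }

    // All other state is read from and written to the signature-scoped properties.
    private Properties props() {
        return SketchContext.getUDFProperties(
                SketchStorage.class, new String[] { loadSignature });
    }

    void rememberSchema(String schema) {
        props().setProperty(SCHEMA_KEY, schema);
    }

    String recallSchema() {
        return props().getProperty(SCHEMA_KEY);
    }
}

final class SignatureSketch {
    public static void main(String[] args) {
        SketchStorage a = new SketchStorage();
        a.setUDFContextSignature("sig-1");
        a.rememberSchema("name,age");

        SketchStorage b = new SketchStorage();
        b.setUDFContextSignature("sig-2");
        b.rememberSchema("id,score,ts");

        // A fresh instance given the same signature sees the same state,
        // while the two signatures never share a key.
        SketchStorage a2 = new SketchStorage();
        a2.setUDFContextSignature("sig-1");
        System.out.println(a2.recallSchema()); // name,age
        System.out.println(b.recallSchema());  // id,score,ts
    }
}
```

If that is the intended pattern, a plain member variable such as a location would only be safe when it is re-set on every instance before use, which is what I am trying to confirm.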

On Jul 8, 2011, at 3:21 PM, Raghu Angadi wrote:

> yes. that is exactly how HBaseStorage uses context.
>
> On Fri, Jul 8, 2011 at 10:19 AM, Jeremy Hanna <[email protected]> wrote:
>
>> In CassandraStorage, we had been using some load/store URL specific
>> information (keyspace, column family names) to make the
>> UDFContext.properties key unique, but with what Grant said was in the docs,
>> we just wrote a patch to instead use the udf context signatures for those
>> keys when setting and getting those property values. Is that the way to go
>> then? I'm setting those as member variables and then using them later.
>>
>> @Override
>> public void setUDFContextSignature(String signature)
>> {
>>     this.loadSignature = signature;
>> }
>>
>> /* StoreFunc methods */
>> public void setStoreFuncUDFContextSignature(String signature)
>> {
>>     this.storeSignature = signature;
>> }
>>
>>
>> On Jul 8, 2011, at 7:24 AM, Grant Ingersoll wrote:
>>
>>> What is the guidance here on using member variables when implementing
>>> UDFs and passing properties? That is, what are the semantics for using them
>>> to store properties for a UDF instance? The docs talk a lot about making
>>> sure that no side effects happen from multiple calls to a UDF instance, but
>>> it is not clear whether that means it's doing things like changing the
>>> Location for a given instance of a UDF or just calling it multiple times.
>>> PigStorage suggests not (since it keeps a member var location), but the
>>> UDFContext docs suggest that one keep all state in the UDFContext under an
>>> appropriate signature.
>>>
>>> See also https://issues.apache.org/jira/browse/CASSANDRA-2869 for
>>> another case where this has reared its head in an improper implementation.
>>>
>>> -Grant
>>>
>>> On Jul 7, 2011, at 3:24 AM, Jeremy Hanna wrote:
>>>
>>>>
>>>> On Jul 6, 2011, at 11:10 PM, Raghu Angadi wrote:
>>>>
>>>>> On Wed, Jul 6, 2011 at 7:20 PM, Jeremy Hanna <[email protected]> wrote:
>>>>>
>>>>>>
>>>>>> On Jul 6, 2011, at 12:47 PM, Dmitriy Ryaboy wrote:
>>>>>>
>>>>>>> I think this is the same problem we were having earlier:
>>>>>>> http://hadoop.markmail.org/thread/kgxhdgw6zdmadch4
>>>>>>>
>>>>>>> One workaround is to use defines to explicitly create different
>>>>>>> instances of your UDF, and use them separately.. it's ugly but it
>>>>>>> works.
>>>>>>
>>>>>> Thanks Dmitriy.
>>>>>>
>>>>>> I tried doing something like:
>>>>>> define ToCassandraBag1 org.pygmalion.udf.ToCassandraBag();
>>>>>> define ToCassandraBag2 org.pygmalion.udf.ToCassandraBag();
>>>>>>
>>>>>
>>>>> This still does not work since you can't distinguish the two. The way I was
>>>>> thinking of doing this is to let user pass in some unique string as a
>>>>> substitute for context:
>>>>>
>>>>> define ToCassandraBag1 ToCassandraBag('1');
>>>>> define ToCassandraBag2 ToCassandraBag('2');
>>>>
>>>> Ah yes. I had misunderstood. Thanks for the clarification. Now the
>>>> pig docs also make more sense in the Passing Configurations to UDFs section:
>>>> http://pig.apache.org/docs/r0.8.1/udf.html#Passing+Configurations+to+UDFs
>>>> It says:
>>>> "The UDF can pass its constructor arguments, or some other identifying
>>>> strings. This allows each instantiation of the UDF to have a different
>>>> properties object thus avoiding name space collisions between instantiations
>>>> of the UDF."
>>>> and the HBaseStorage example was helpful to see that in action.
>>>>
>>>> Thanks both to Raghu and Dmitriy.
>>>>
>>>>>
>>>>> inside the UDF, you would use this arg to make a 'contextString' (see
>>>>> HBaseStorage.java for example use) to store any state.
>>>>>
>>>>> ideally UDFs shouldn't have to do this.. They should have the same context
>>>>> info that is available for loaders and storage.
>>>>>
>>>>> Raghu.
>>>>>
>>>>>
>>>>>>
>>>>>> at the top and then using each one only once. That still produces the same
>>>>>> error. I guess in this case we'll just have to require the field names be
>>>>>> entered into the UDF and it won't introspect them. Ah well. Would be nice
>>>>>> to be able to use it but I don't really see another way around this bug with
>>>>>> the shared UDF context.
>>>>>>
>>>>>>>
>>>>>>> D
>>>>>>>
>>>>>>> On Wed, Jul 6, 2011 at 9:42 AM, Jeremy Hanna <[email protected]> wrote:
>>>>>>>> We have a UDF that introspects the output schema and gets the field
>>>>>>>> names there and uses that in the exec method.
>>>>>>>>
>>>>>>>> The UDF is found here:
>>>>>>>> https://github.com/jeromatron/pygmalion/blob/master/udf/src/main/java/org/pygmalion/udf/ToCassandraBag.java
>>>>>>>>
>>>>>>>> A simple example is found here:
>>>>>>>> https://github.com/jeromatron/pygmalion/blob/master/scripts/from_to_cassandra_bag_example.pig
>>>>>>>>
>>>>>>>> It takes the relation's aliases and uses them in the output so that the
>>>>>>>> user doesn't have to specify them. However we've been noticing that if you
>>>>>>>> have more than one ToCassandraBag call in a pig script, sometimes they are
>>>>>>>> run at the same time and the key is the same in the UDF context:
>>>>>>>> cassandra.input_field_schema. So we think there is an issue there (array
>>>>>>>> out of bounds exceptions when running the script, but when running in grunt
>>>>>>>> one at a time, it doesn't do that).
>>>>>>>>
>>>>>>>> Is there a right way to do this parameter passing so that we don't get
>>>>>>>> these errors when multiple calls exist?
>>>>>>>>
>>>>>>>> We thought of using the schema hash code as a suffix (e.g.
>>>>>>>> cassandra.input_field_schema.12344321) but we don't have access to the
>>>>>>>> schema in the exec method.
>>>>>>>>
>>>>>>>> We thought of having the first parameter of the input tuple be a unique
>>>>>>>> name that the script specifies, like 'filename.relationalias' as a
>>>>>>>> convention to make them unique to the file. However in the outputSchema, we
>>>>>>>> don't have access to the input tuple, just the schema itself, so it couldn't
>>>>>>>> get that value in there.
>>>>>>>>
>>>>>>>> Any ideas on how to make this so it doesn't stomp on each other within
>>>>>>>> the pig script? Is there a best way to do that?
>>>>>>>>
>>>>>>>> Thanks!
>>>>>>>>
>>>>>>>> Jeremy
>>>>>>
>>>>
>>>
>>> --------------------------
>>> Grant Ingersoll
>>>
>>

--------------------------
Grant Ingersoll
