What is the guidance here on using member variables when implementing UDFs and 
passing properties?  That is, what are the semantics of using them to store 
properties for a UDF instance?  The docs talk a lot about making sure that no 
side effects happen from multiple calls to a UDF instance, but it is not clear 
whether that covers things like changing the Location for a given instance of 
a UDF or just calling it multiple times.  PigStorage suggests not (since it 
keeps a member var location), but the UDFContext docs suggest that one should 
keep all state in the UDFContext under an appropriate signature.

See also https://issues.apache.org/jira/browse/CASSANDRA-2869 for another case 
where this has reared its head in an improper implementation.
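
To make the question concrete, here is a minimal, self-contained sketch of the
signature-keyed approach that the UDFContext docs and the thread below describe.
SketchUdfContext is a hypothetical stand-in, not Pig's UDFContext: it only
mimics the idea of looking up a Properties object by UDF class alone versus by
class plus constructor arguments. ToBagSketch and its methods are likewise
made up for illustration and are not the pygmalion code.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.Map;
import java.util.Properties;

// Hypothetical stand-in for a shared UDF context (NOT Pig's real class).
class SketchUdfContext {
    private final Map<String, Properties> store = new HashMap<>();

    // Keyed by class only: every instance of the UDF shares one Properties.
    Properties getUDFProperties(Class<?> c) {
        return store.computeIfAbsent(c.getName(), k -> new Properties());
    }

    // Keyed by class plus identifying strings: one Properties per signature.
    Properties getUDFProperties(Class<?> c, String[] args) {
        String key = c.getName() + Arrays.toString(args);
        return store.computeIfAbsent(key, k -> new Properties());
    }
}

// Hypothetical UDF shape: the constructor argument is the unique string
// passed in from the script, e.g. define ToCassandraBag1 ToCassandraBag('1');
class ToBagSketch {
    static final String SCHEMA_KEY = "cassandra.input_field_schema";
    private final String[] contextArgs;

    ToBagSketch(String id) {
        this.contextArgs = new String[] { id };
    }

    // Would be called where the schema is known (outputSchema in the thread).
    void rememberSchema(SketchUdfContext ctx, String fieldNames) {
        ctx.getUDFProperties(ToBagSketch.class, contextArgs)
           .setProperty(SCHEMA_KEY, fieldNames);
    }

    // Would be called where the schema is needed (exec in the thread).
    String recallSchema(SketchUdfContext ctx) {
        return ctx.getUDFProperties(ToBagSketch.class, contextArgs)
                  .getProperty(SCHEMA_KEY);
    }
}

class UdfContextDemo {
    public static void main(String[] args) {
        SketchUdfContext ctx = new SketchUdfContext();
        ToBagSketch first = new ToBagSketch("1");
        ToBagSketch second = new ToBagSketch("2");

        first.rememberSchema(ctx, "name,age");
        second.rememberSchema(ctx, "id,score,ts");

        // Each instance reads back its own value.
        System.out.println(first.recallSchema(ctx));
        System.out.println(second.recallSchema(ctx));

        // With class-only keying, the second write replaces the first.
        Properties shared = ctx.getUDFProperties(ToBagSketch.class);
        shared.setProperty(ToBagSketch.SCHEMA_KEY, "name,age");
        shared.setProperty(ToBagSketch.SCHEMA_KEY, "id,score,ts");
        System.out.println(shared.getProperty(ToBagSketch.SCHEMA_KEY));
    }
}
```

In this sketch the state lives in the context rather than in a member
variable, and only the identifying string is held by the instance. Whether
member variables are also safe in real Pig is exactly the open question above.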

-Grant

On Jul 7, 2011, at 3:24 AM, Jeremy Hanna wrote:

> 
> On Jul 6, 2011, at 11:10 PM, Raghu Angadi wrote:
> 
>> On Wed, Jul 6, 2011 at 7:20 PM, Jeremy Hanna <[email protected]> wrote:
>> 
>>> 
>>> On Jul 6, 2011, at 12:47 PM, Dmitriy Ryaboy wrote:
>>> 
>>>> I think this is the same problem we were having earlier:
>>>> http://hadoop.markmail.org/thread/kgxhdgw6zdmadch4
>>>> 
>>>> One workaround is to use defines to explicitly create different
>>>> instances of your UDF, and use them separately.. it's ugly but it
>>>> works.
>>> 
>>> Thanks Dmitriy.
>>> 
>>> I tried doing something like:
>>> define ToCassandraBag1 org.pygmalion.udf.ToCassandraBag();
>>> define ToCassandraBag2 org.pygmalion.udf.ToCassandraBag();
>>> 
>> 
>> This still does not work since you can't distinguish the two. The way I was
>> thinking of doing this is to let the user pass in some unique string as a
>> substitute for context:
>> 
>> define ToCassandraBag1 ToCassandraBag('1');
>> define ToCassandraBag2 ToCassandraBag('2');
> 
> Ah yes.  I had misunderstood.  Thanks for the clarification.  Now the Pig 
> docs also make more sense in the Passing Configurations to UDFs section:
> http://pig.apache.org/docs/r0.8.1/udf.html#Passing+Configurations+to+UDFs
> It says:
> "The UDF can pass its constructor arguments, or some other identifying 
> strings. This allows each instantiation of the UDF to have a different 
> properties object thus avoiding name space collisions between instantiations 
> of the UDF."
> and the HBaseStorage example was helpful to see that in action.
> 
> Thanks both to Raghu and Dmitriy.
> 
>> 
>> inside the UDF, you would use this arg to make a 'contextString' (see
>> HBaseStorage.java for example use) to store any state.
>> 
>> ideally UDFs shouldn't have to do this.. They should have the same context
>> info that is available for loaders and storage.
>> 
>> Raghu.
>> 
>> 
>>> 
>>> at the top and then using each one only once.  That still produces the same
>>> error.  I guess in this case we'll just have to require the field names be
>>> entered into the UDF and it won't introspect them.  Ah well.  Would be nice
>>> to be able to use it but I don't really see another way around this bug with
>>> the shared UDF context.
>>> 
>>>> 
>>>> D
>>>> 
>>>> On Wed, Jul 6, 2011 at 9:42 AM, Jeremy Hanna <[email protected]> wrote:
>>>>> We have a UDF that introspects the output schema, gets the field names
>>>>> there, and uses them in the exec method.
>>>>> 
>>>>> The UDF is found here:
>>>>> https://github.com/jeromatron/pygmalion/blob/master/udf/src/main/java/org/pygmalion/udf/ToCassandraBag.java
>>>>> 
>>>>> A simple example is found here:
>>>>> https://github.com/jeromatron/pygmalion/blob/master/scripts/from_to_cassandra_bag_example.pig
>>>>> 
>>>>> It takes the relation's aliases and uses them in the output so that the
>>>>> user doesn't have to specify them.  However, we've been noticing that if
>>>>> you have more than one ToCassandraBag call in a Pig script, sometimes they
>>>>> are run at the same time and the key is the same in the UDF context:
>>>>> cassandra.input_field_schema.  So we think there is an issue there (array
>>>>> out of bounds exceptions when running the script, but not when running the
>>>>> calls one at a time in grunt).
>>>>> 
>>>>> Is there a right way to do this parameter passing so that we don't get
>>>>> these errors when multiple calls exist?
>>>>> 
>>>>> We thought of using the schema hash code as a suffix (e.g.
>>>>> cassandra.input_field_schema.12344321), but we don't have access to the
>>>>> schema in the exec method.
>>>>> 
>>>>> We thought of having the first parameter of the input tuple be a unique
>>>>> name that the script specifies, like 'filename.relationalias', as a
>>>>> convention to make them unique to the file.  However, in outputSchema we
>>>>> don't have access to the input tuple, just the schema itself, so it
>>>>> couldn't get that value there.
>>>>> 
>>>>> Any ideas on how to make this work so that the calls don't stomp on each
>>>>> other within the Pig script?  Is there a best way to do that?
>>>>> 
>>>>> Thanks!
>>>>> 
>>>>> Jeremy
>>> 
>>> 
> 

--------------------------
Grant Ingersoll


