But are you keeping member variables or do you put everything in the context?


On Jul 8, 2011, at 3:21 PM, Raghu Angadi wrote:

> yes. that is exactly how HBaseStorage uses context.
> 
> On Fri, Jul 8, 2011 at 10:19 AM, Jeremy Hanna <[email protected]> wrote:
> 
>> In CassandraStorage, we had been using some load/store URL specific
>> information (keyspace, column family names) to make the
>> UDFContext.properties key unique, but with what Grant said was in the docs,
>> we just wrote a patch to instead use the udf context signatures for those
>> keys when setting and getting those property values.  Is that the way to go
>> then?  I'm setting those as member variables and then using them later.
>> 
>>   @Override
>>   public void setUDFContextSignature(String signature)
>>   {
>>       this.loadSignature = signature;
>>   }
>> 
>>   /* StoreFunc methods */
>>   public void setStoreFuncUDFContextSignature(String signature)
>>   {
>>       this.storeSignature = signature;
>>   }
>> 
>> 
>> On Jul 8, 2011, at 7:24 AM, Grant Ingersoll wrote:
>> 
>>> What is the guidance here on using member variables when implementing
>>> UDFs and passing properties?  That is, what are the semantics for using
>>> them to store properties for a UDF instance?  The docs talk a lot about
>>> making sure that no side effects happen from multiple calls to a UDF
>>> instance, but it is not clear whether that means doing things like
>>> changing the Location for a given instance of a UDF or just calling it
>>> multiple times.  PigStorage suggests not (since it keeps a member var
>>> location), but the UDFContext docs suggest that one keep all state in
>>> the UDFContext under an appropriate signature.
>>> 
>>> See also https://issues.apache.org/jira/browse/CASSANDRA-2869 for
>>> another case where this has reared its head in an improper implementation.
>>> 
>>> -Grant
>>> 
>>> On Jul 7, 2011, at 3:24 AM, Jeremy Hanna wrote:
>>> 
>>>> 
>>>> On Jul 6, 2011, at 11:10 PM, Raghu Angadi wrote:
>>>> 
>>>>> On Wed, Jul 6, 2011 at 7:20 PM, Jeremy Hanna <[email protected]> wrote:
>>>>> 
>>>>>> 
>>>>>> On Jul 6, 2011, at 12:47 PM, Dmitriy Ryaboy wrote:
>>>>>> 
>>>>>>> I think this is the same problem we were having earlier:
>>>>>>> http://hadoop.markmail.org/thread/kgxhdgw6zdmadch4
>>>>>>> 
>>>>>>> One workaround is to use defines to explicitly create different
>>>>>>> instances of your UDF, and use them separately.. it's ugly but it
>>>>>>> works.
>>>>>> 
>>>>>> Thanks Dmitriy.
>>>>>> 
>>>>>> I tried doing something like:
>>>>>> define ToCassandraBag1 org.pygmalion.udf.ToCassandraBag();
>>>>>> define ToCassandraBag2 org.pygmalion.udf.ToCassandraBag();
>>>>>> 
>>>>> 
>>>>> This still does not work since you can't distinguish the two. The way I
>>>>> was thinking of doing this is to let the user pass in some unique string
>>>>> as a substitute for context:
>>>>> 
>>>>> define ToCassandraBag1 ToCassandraBag('1');
>>>>> define ToCassandraBag2 ToCassandraBag('2');
>>>> 
>>>> Ah yes.  I had misunderstood.  Thanks for the clarification.  Now the
>>>> pig docs also make more sense in the Passing Configurations to UDFs
>>>> section:
>>>> 
>>>> http://pig.apache.org/docs/r0.8.1/udf.html#Passing+Configurations+to+UDFs
>>>> It says:
>>>> "The UDF can pass its constructor arguments, or some other identifying
>>>> strings. This allows each instantiation of the UDF to have a different
>>>> properties object thus avoiding name space collisions between
>>>> instantiations of the UDF."
>>>> and the HBaseStorage example was helpful to see that in action.
>>>> 
>>>> Thanks both to Raghu and Dmitriy.
>>>> 
>>>>> 
>>>>> inside the UDF, you would use this arg to make a 'contextString' (see
>>>>> HBaseStorage.java for example use) to store any state.
>>>>> 
>>>>> ideally UDFs shouldn't have to do this.. They should have the same
>>>>> context info that is available for loaders and storage.
>>>>> 
>>>>> Raghu.
>>>>> 
>>>>> 
>>>>>> 
>>>>>> at the top and then using each one only once.  That still produces the
>>>>>> same error.  I guess in this case we'll just have to require the field
>>>>>> names be entered into the UDF and it won't introspect them.  Ah well.
>>>>>> Would be nice to be able to use it but I don't really see another way
>>>>>> around this bug with the shared UDF context.
>>>>>> 
>>>>>>> 
>>>>>>> D
>>>>>>> 
>>>>>>> On Wed, Jul 6, 2011 at 9:42 AM, Jeremy Hanna <[email protected]> wrote:
>>>>>>>> We have a UDF that introspects the output schema, gets the field
>>>>>>>> names there, and uses them in the exec method.
>>>>>>>> 
>>>>>>>> The UDF is found here:
>>>>>>>> https://github.com/jeromatron/pygmalion/blob/master/udf/src/main/java/org/pygmalion/udf/ToCassandraBag.java
>>>>>>>> 
>>>>>>>> A simple example is found here:
>>>>>>>> https://github.com/jeromatron/pygmalion/blob/master/scripts/from_to_cassandra_bag_example.pig
>>>>>>>> 
>>>>>>>> It takes the relation's aliases and uses them in the output so that
>>>>>>>> the user doesn't have to specify them.  However we've been noticing
>>>>>>>> that if you have more than one ToCassandraBag call in a pig script,
>>>>>>>> sometimes they are run at the same time and the key is the same in
>>>>>>>> the UDF context: cassandra.input_field_schema.  So we think there is
>>>>>>>> an issue there (array out of bounds exceptions when running the
>>>>>>>> script, but when running in grunt one at a time, it doesn't do that).
>>>>>>>> 
>>>>>>>> Is there a right way to do this parameter passing so that we don't
>>>>>>>> get these errors when multiple calls exist?
>>>>>>>> 
>>>>>>>> We thought of using the schema hash code as a suffix (e.g.
>>>>>>>> cassandra.input_field_schema.12344321) but we don't have access to
>>>>>>>> the schema in the exec method.
>>>>>>>> 
>>>>>>>> We thought of having the first parameter of the input tuple be a
>>>>>>>> unique name that the script specifies, like 'filename.relationalias'
>>>>>>>> as a convention to make them unique to the file.  However in the
>>>>>>>> outputSchema, we don't have access to the input tuple, just the
>>>>>>>> schema itself, so it couldn't get that value in there.
>>>>>>>> 
>>>>>>>> Any ideas on how to make this so it doesn't stomp on each other
>>>>>>>> within the pig script?  Is there a best way to do that?
>>>>>>>> 
>>>>>>>> Thanks!
>>>>>>>> 
>>>>>>>> Jeremy
>>>>>> 
>>>>>> 
>>>> 
>>> 
>>> --------------------------
>>> Grant Ingersoll
>>> 
>>> 
>>> 
>> 
>> 

--------------------------
Grant Ingersoll
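
For readers following the thread, here is a minimal sketch of the collision
described in the quoted messages and of the per-instance key that fixes it.
It does not use Pig itself: SketchContext and BagUdfSketch are hypothetical
stand-ins invented for illustration, and the "signature" plays the role of
the constructor argument in define ToCassandraBag1 ToCassandraBag('1').
In real Pig code the corresponding lookup would go through
UDFContext.getUDFContext().getUDFProperties(...); treat the rest as an
assumption-laden model rather than the actual implementation.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Properties;

final class SignatureKeyDemo {

    // Hypothetical stand-in for a shared UDF context: one Properties
    // object per (UDF class, signature) pair.
    static final class SketchContext {
        private final Map<String, Properties> store = new HashMap<>();

        Properties getProperties(Class<?> udfClass, String signature) {
            String key = udfClass.getName() + "#" + signature;
            return store.computeIfAbsent(key, k -> new Properties());
        }
    }

    // Hypothetical UDF that saves its input field names for later use.
    static final class BagUdfSketch {
        static final String SCHEMA_KEY = "cassandra.input_field_schema";
        private final String signature;

        BagUdfSketch(String signature) {
            this.signature = signature;
        }

        void rememberSchema(SketchContext ctx, String fieldNames) {
            ctx.getProperties(BagUdfSketch.class, signature)
               .setProperty(SCHEMA_KEY, fieldNames);
        }

        String recallSchema(SketchContext ctx) {
            return ctx.getProperties(BagUdfSketch.class, signature)
                      .getProperty(SCHEMA_KEY);
        }
    }

    public static void main(String[] args) {
        SketchContext ctx = new SketchContext();

        // Same (empty) signature: both instances share one Properties
        // object, so the second write clobbers the first.
        BagUdfSketch a = new BagUdfSketch("");
        BagUdfSketch b = new BagUdfSketch("");
        a.rememberSchema(ctx, "name,age");
        b.rememberSchema(ctx, "id,score,rank");
        System.out.println(a.recallSchema(ctx)); // id,score,rank

        // Distinct signatures: each instance keeps its own schema.
        BagUdfSketch one = new BagUdfSketch("1");
        BagUdfSketch two = new BagUdfSketch("2");
        one.rememberSchema(ctx, "name,age");
        two.rememberSchema(ctx, "id,score,rank");
        System.out.println(one.recallSchema(ctx)); // name,age
        System.out.println(two.recallSchema(ctx)); // id,score,rank
    }
}
```

The first case would plausibly explain the array out of bounds exceptions
reported above: an instance reads back a schema with a different number of
fields than the tuples it actually receives.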



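On the member-variable question at the top of this message, the following
sketch models one common reading of the advice in the thread: keep only the
signature as a member (Pig's LoadFunc javadoc says the signature setter is
called on both the front end and the back end) and put everything else in
the context under that signature. ShippedContext and StorageSketch are
hypothetical names, and the "back end" here is simply a second object in the
same JVM, which is an assumption made to keep the example self-contained.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Properties;

final class FrontBackDemo {

    // Hypothetical stand-in for the state that travels with the job
    // from the front end to the back end.
    static final class ShippedContext {
        private final Map<String, Properties> bySignature = new HashMap<>();

        Properties forSignature(String signature) {
            return bySignature.computeIfAbsent(signature, k -> new Properties());
        }
    }

    // Hypothetical storage function, loosely shaped like the
    // CassandraStorage snippet quoted in the thread.
    static final class StorageSketch {
        private String signature;    // re-supplied to every instance
        private String columnFamily; // plain member: not carried across instances

        void setUDFContextSignature(String signature) {
            this.signature = signature;
        }

        // Front-end setup: record the value both as a member and in the context.
        void configureOnFrontEnd(ShippedContext ctx, String columnFamily) {
            this.columnFamily = columnFamily;
            ctx.forSignature(signature).setProperty("column_family", columnFamily);
        }

        String memberValue() {
            return columnFamily;
        }

        String contextValue(ShippedContext ctx) {
            return ctx.forSignature(signature).getProperty("column_family");
        }
    }

    public static void main(String[] args) {
        ShippedContext ctx = new ShippedContext();

        // Front end: the instance is configured while the script is planned.
        StorageSketch front = new StorageSketch();
        front.setUDFContextSignature("sig-A");
        front.configureOnFrontEnd(ctx, "users");

        // Back end: a fresh instance that only gets its signature again.
        StorageSketch back = new StorageSketch();
        back.setUDFContextSignature("sig-A");

        System.out.println(back.memberValue());      // null
        System.out.println(back.contextValue(ctx));  // users
    }
}
```

Under these assumptions, a member variable is fine for the signature itself,
while anything computed on the front end has to be read back from the context
before the back-end instance can rely on it.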