Yes, that is exactly how HBaseStorage uses the UDF context.
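
For anyone following along, here is a minimal sketch of the pattern: keep only the signature in a member variable and keep everything else in the properties object looked up under that signature. FakeUdfContext and SketchLoader are hypothetical stand-ins I made up so the snippet runs without Pig on the classpath; they are not Pig's implementation. In a real loader the lookup would be UDFContext.getUDFContext().getUDFProperties(this.getClass(), new String[] { signature }).

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Properties;

/** Hypothetical stand-in for Pig's UDFContext: one Properties object per (class, args) pair. */
final class FakeUdfContext {
    private final Map<String, Properties> store = new HashMap<>();

    /** Mirrors the shape of UDFContext.getUDFProperties(Class, String[]). */
    Properties getUDFProperties(Class<?> c, String[] args) {
        String key = c.getName() + ":" + String.join(",", args);
        return store.computeIfAbsent(key, k -> new Properties());
    }
}

/** Hypothetical loader: the signature is the only per-instance state it holds. */
final class SketchLoader {
    private final FakeUdfContext context;
    private String loadSignature;

    SketchLoader(FakeUdfContext context) {
        this.context = context;
    }

    /** Same name as the LoadFunc callback; here it just records the signature. */
    void setUDFContextSignature(String signature) {
        this.loadSignature = signature;
    }

    /** Stores a value under this instance's signature rather than a shared key. */
    void rememberSchema(String schema) {
        props().setProperty("schema", schema);
    }

    /** Reads the value back from the properties tied to this signature. */
    String recallSchema() {
        return props().getProperty("schema");
    }

    private Properties props() {
        return context.getUDFProperties(SketchLoader.class, new String[] { loadSignature });
    }
}
```

Two SketchLoader instances with different signatures get separate Properties objects, so they cannot overwrite each other's "schema" entry, while a second instance given the same signature sees the state the first one stored.
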
On Fri, Jul 8, 2011 at 10:19 AM, Jeremy Hanna <[email protected]> wrote:
> In CassandraStorage, we had been using some load/store URL specific
> information (keyspace, column family names) to make the
> UDFContext.properties key unique, but with what Grant said was in the docs,
> we just wrote a patch to instead use the udf context signatures for those
> keys when setting and getting those property values. Is that the way to go
> then? I'm setting those as member variables and then using them later.
>
> @Override
> public void setUDFContextSignature(String signature)
> {
>     this.loadSignature = signature;
> }
>
> /* StoreFunc methods */
> public void setStoreFuncUDFContextSignature(String signature)
> {
>     this.storeSignature = signature;
> }
>
>
> On Jul 8, 2011, at 7:24 AM, Grant Ingersoll wrote:
>
> > What is the guidance here on using member variables when implementing
> > UDFs and passing properties? That is, what are the semantics for using them
> > to store properties for a UDF instance? The docs talk a lot about making
> > sure that no side effects happen from multiple calls to a UDF instance, but
> > it is not clear whether that covers things like changing the
> > Location for a given instance of a UDF or just calling it multiple times.
> > PigStorage suggests not (since it keeps a member var location), but the
> > UDFContext docs suggest that one keep all state in the UDFContext under an
> > appropriate signature.
> >
> > See also https://issues.apache.org/jira/browse/CASSANDRA-2869 for
> > another case where this has reared its head in an improper implementation.
> >
> > -Grant
> >
> > On Jul 7, 2011, at 3:24 AM, Jeremy Hanna wrote:
> >
> >>
> >> On Jul 6, 2011, at 11:10 PM, Raghu Angadi wrote:
> >>
> >>> On Wed, Jul 6, 2011 at 7:20 PM, Jeremy Hanna <[email protected]> wrote:
> >>>
> >>>>
> >>>> On Jul 6, 2011, at 12:47 PM, Dmitriy Ryaboy wrote:
> >>>>
> >>>>> I think this is the same problem we were having earlier:
> >>>>> http://hadoop.markmail.org/thread/kgxhdgw6zdmadch4
> >>>>>
> >>>>> One workaround is to use defines to explicitly create different
> >>>>> instances of your UDF, and use them separately.. it's ugly but it
> >>>>> works.
> >>>>
> >>>> Thanks Dmitriy.
> >>>>
> >>>> I tried doing something like:
> >>>> define ToCassandraBag1 org.pygmalion.udf.ToCassandraBag();
> >>>> define ToCassandraBag2 org.pygmalion.udf.ToCassandraBag();
> >>>>
> >>>
> >>> This still does not work since you can't distinguish the two. The way I was
> >>> thinking of doing this is to let the user pass in some unique string as a
> >>> substitute for context:
> >>>
> >>> define ToCassandraBag1 ToCassandraBag('1');
> >>> define ToCassandraBag2 ToCassandraBag('2');
> >>
> >> Ah yes. I had misunderstood. Thanks for the clarification. Now the
> >> Pig docs also make more sense in the Passing Configurations to UDFs section:
> >>
> >> http://pig.apache.org/docs/r0.8.1/udf.html#Passing+Configurations+to+UDFs
> >> It says:
> >> "The UDF can pass its constructor arguments, or some other identifying
> >> strings. This allows each instantiation of the UDF to have a different
> >> properties object thus avoiding name space collisions between instantiations
> >> of the UDF."
> >> and the HBaseStorage example was helpful to see that in action.
> >>
> >> Thanks both to Raghu and Dmitriy.
> >>
> >>>
> >>> Inside the UDF, you would use this arg to make a 'contextString' (see
> >>> HBaseStorage.java for example use) to store any state.
> >>>
> >>> Ideally UDFs shouldn't have to do this. They should have the same context
> >>> info that is available for loaders and storage.
> >>>
> >>> Raghu.
> >>>
> >>>
> >>>>
> >>>> at the top and then using each one only once. That still produces the same
> >>>> error. I guess in this case we'll just have to require the field names be
> >>>> entered into the UDF and it won't introspect them. Ah well. Would be nice
> >>>> to be able to use it but I don't really see another way around this bug with
> >>>> the shared UDF context.
> >>>>
> >>>>>
> >>>>> D
> >>>>>
> >>>>> On Wed, Jul 6, 2011 at 9:42 AM, Jeremy Hanna <[email protected]> wrote:
> >>>>>> We have a UDF that introspects the output schema, gets the field
> >>>>>> names there, and uses them in the exec method.
> >>>>>>
> >>>>>> The UDF is found here:
> >>>>>> https://github.com/jeromatron/pygmalion/blob/master/udf/src/main/java/org/pygmalion/udf/ToCassandraBag.java
> >>>>>>
> >>>>>> A simple example is found here:
> >>>>>> https://github.com/jeromatron/pygmalion/blob/master/scripts/from_to_cassandra_bag_example.pig
> >>>>>>
> >>>>>> It takes the relation's aliases and uses them in the output so that the
> >>>>>> user doesn't have to specify them. However we've been noticing that if you
> >>>>>> have more than one ToCassandraBag call in a pig script, sometimes they are
> >>>>>> run at the same time and the key is the same in the UDF context:
> >>>>>> cassandra.input_field_schema. So we think there is an issue there (array
> >>>>>> out of bounds exceptions when running the script, but when running in grunt
> >>>>>> one at a time, it doesn't do that).
> >>>>>>
> >>>>>> Is there a right way to do this parameter passing so that we don't get
> >>>>>> these errors when multiple calls exist?
> >>>>>>
> >>>>>> We thought of using the schema hash code as a suffix (e.g.
> >>>>>> cassandra.input_field_schema.12344321) but we don't have access to the
> >>>>>> schema in the exec method.
> >>>>>>
> >>>>>> We thought of having the first parameter of the input tuple be a unique
> >>>>>> name that the script specifies, like 'filename.relationalias' as a
> >>>>>> convention to make them unique to the file. However in the outputSchema, we
> >>>>>> don't have access to the input tuple, just the schema itself, so it couldn't
> >>>>>> get that value in there.
> >>>>>>
> >>>>>> Any ideas on how to keep the calls from stomping on each other within
> >>>>>> the pig script? Is there a best way to do that?
> >>>>>>
> >>>>>> Thanks!
> >>>>>>
> >>>>>> Jeremy
> >>>>
> >>>>
> >>
> >
> > --------------------------
> > Grant Ingersoll
> >
> >
> >
>
>