Thanks for the feedback.  That's exactly what I was looking for.

On Mon, Feb 27, 2012 at 11:56 AM, Jonathan Coveney <[email protected]> wrote:
> Writing a custom StoreFunc, to me, is absolutely the way to do this. There
> are storers/loaders for various non-HDFS systems (Vertica, HBase,
> Cassandra, etc), and what you are doing is indeed storing to another system.
>
> Also, it's very dangerous to use an EvalFunc to store, because if a mapper
> fails halfway through, then when that split is reprocessed, your data will
> be reuploaded. That gotcha still exists with a custom StoreFunc, but at
> least the logic there is explicit.
>
> 2012/2/27 Stuart White <[email protected]>
>
>> I'm writing a pig script that will read a file of records and pass
>> them to a custom EvalFunc.  This EvalFunc has a side-effect; it
>> updates data in a separate datastore.
>>
>> In the simplest example, my pig script looks like this:
>>
>>   A = load 'data.txt' using PigStorage(',')
>>       as (dataelement1 : chararray, dataelement2 : chararray);
>>   B = foreach A generate
>>       com.example.MyEvalFunc(dataelement1, dataelement2);
>>
>> The problem is that pig recognizes that I never use the B records and
>> therefore optimizes my script to not execute the foreach/generate that
>> calls my UDF.  Pig doesn't realize that MyEvalFunc() updates a
>> separate datastore and therefore needs to go ahead and process the
>> records through the EvalFunc.
>>
>> Of course I could do a store/dump on B to force pig to execute that
>> line, but that feels like a hack.  There is nothing I want to
>> store/dump coming out of my EvalFunc.
>>
>> Is there any way to control pig's optimization to force it to execute a
>> line even though it doesn't think it should?
>>
>> Another thought is that maybe instead of writing an EvalFunc I should
>> write a custom StoreFunc to do this.  However, it looks like
>> StoreFuncs are very tied to writing to HDFS rather than writing to any
>> arbitrary data store.
>>
>> Thoughts?
>>
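For reference, a minimal sketch of how the script from the original question might look once the side effect moves from the EvalFunc into a storer. It assumes a hypothetical class com.example.MyStoreFunc that extends Pig's StoreFunc and writes each tuple to the external datastore; neither that class nor the 'unused' location string comes from the thread.

```pig
-- Load the records exactly as in the original script.
A = load 'data.txt' using PigStorage(',')
    as (dataelement1 : chararray, dataelement2 : chararray);

-- com.example.MyStoreFunc is hypothetical: a custom StoreFunc that
-- updates the external datastore for each tuple it receives.
-- 'unused' is a placeholder location; Pig hands this string to the
-- storer, which decides how (or whether) to interpret it.
store A into 'unused' using com.example.MyStoreFunc();
```

Because the script now ends in a store statement, Pig has an output to produce and runs the pipeline, so no dummy dump is needed. The retry caveat from the reply still applies: if a task fails partway and its split is reprocessed, the storer sees those records again.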
