Writing a custom StoreFunc is, to me, absolutely the way to do this. There
are storers/loaders for various non-HDFS systems (Vertica, HBase,
Cassandra, etc.), and what you are doing is indeed storing to another system.
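
A StoreFunc doesn't actually have to write to HDFS at all: Pig just asks it
for an OutputFormat and then hands it tuples via putNext(). A rough sketch of
the shape (assuming the Pig 0.8+ StoreFunc API; ExternalStoreClient and its
methods are made-up stand-ins for whatever client your datastore provides):

```java
import java.io.IOException;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.OutputFormat;
import org.apache.hadoop.mapreduce.RecordWriter;
import org.apache.hadoop.mapreduce.lib.output.NullOutputFormat;
import org.apache.pig.StoreFunc;
import org.apache.pig.data.Tuple;

public class MyDatastoreStorage extends StoreFunc {
    private ExternalStoreClient client;  // hypothetical client for your datastore

    @Override
    public OutputFormat getOutputFormat() {
        // Nothing is written to HDFS; the real writes happen in putNext().
        return new NullOutputFormat();
    }

    @Override
    public void setStoreLocation(String location, Job job) throws IOException {
        // "location" is whatever string the script passes to STORE ... INTO;
        // reuse it as a connection string rather than an HDFS path.
    }

    @Override
    public void prepareToWrite(RecordWriter writer) throws IOException {
        client = new ExternalStoreClient();  // open one connection per task
    }

    @Override
    public void putNext(Tuple t) throws IOException {
        client.update(t.get(0).toString(), t.get(1).toString());
    }
}
```

The script side would then be something like
STORE A INTO 'myconnection' USING com.example.MyDatastoreStorage();
and Pig can't optimize that away, because a STORE is an explicit sink.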

Also, it's very dangerous to use an EvalFunc to store, because if a mapper
fails halfway through, then when that split is reprocessed, your data will
be reuploaded. That gotcha still exists with a custom StoreFunc, but at
least the logic there is explicit.
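
To make that concrete (names here are ours, not anything from Pig): when a
failed mapper's split is rerun, every record in the split hits your external
store a second time. If each write is keyed deterministically from the record,
the retry overwrites instead of duplicating. A toy illustration using in-memory
collections as stand-ins for the datastore:

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class RetrySafetyDemo {
    // Append-style store: retried splits duplicate records.
    static List<String> appendStore = new ArrayList<>();
    // Upsert-style store: retried splits overwrite the same keys.
    static Map<String, String> upsertStore = new HashMap<>();

    static void processSplit(String[] records) {
        for (String r : records) {
            appendStore.add(r);                   // blind append
            upsertStore.put(r.split(",")[0], r);  // key derived from the record
        }
    }

    public static void main(String[] args) {
        String[] split = { "k1,a", "k2,b" };
        processSplit(split);  // first attempt "fails" after writing...
        processSplit(split);  // ...so the framework reruns the whole split
        System.out.println(appendStore.size());  // 4 -- duplicated
        System.out.println(upsertStore.size());  // 2 -- idempotent
    }
}
```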

2012/2/27 Stuart White <[email protected]>

> I'm writing a pig script that will read a file of records and pass
> them to a custom EvalFunc.  This EvalFunc has a side-effect; it
> updates data in a separate datastore.
>
> In the simplest example, my pig script looks like this:
>
>   A = load 'data.txt' using PigStorage(',') as (dataelement1  :
> chararray, dataelement2 : chararray);
>   B = foreach A generate com.example.MyEvalFunc(dataelement1,
> dataelement2);
>
> The problem is that pig recognizes that I never use the B records and
> therefore optimizes my script to not execute the foreach/generate that
> calls my UDF.  Pig doesn't realize that MyEvalFunc() updates a
> separate datastore and therefore needs to go ahead and process the
> records through the EvalFunc.
>
> Of course I could do a store/dump on B to force pig to execute that
> line, but that feels like a hack.  There is nothing I want to
> store/dump coming out of my EvalFunc.
>
> Is there any way to control pig's optimization to force it to execute a
> line even though it doesn't think it should?
>
> Another thought is that maybe instead of writing an EvalFunc I should
> write a custom StoreFunc to do this.  However, it looks like
> StoreFuncs are very tied to writing to HDFS rather than writing to any
> arbitrary data store.
>
> Thoughts?
>
