I'm writing a Pig script that will read a file of records and pass
them to a custom EvalFunc.  This EvalFunc has a side-effect; it
updates data in a separate datastore.

In the simplest example, my pig script looks like this:

   A = load 'data.txt' using PigStorage(',')
       as (dataelement1 : chararray, dataelement2 : chararray);
   B = foreach A generate com.example.MyEvalFunc(dataelement1, dataelement2);

The problem is that Pig recognizes that I never use the B records and
therefore optimizes my script to not execute the foreach/generate that
calls my UDF.  Pig doesn't realize that MyEvalFunc() updates a
separate datastore and therefore needs to go ahead and process the
records through the EvalFunc.

Of course I could do a store/dump on B to force Pig to execute that
line, but that feels like a hack.  There is nothing I want to
store/dump coming out of my EvalFunc.
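
Concretely, the workaround I have in mind would be something like the
following, where the output path is just a placeholder and the stored
records would be thrown away:

   -- placeholder path; the output exists only to make Pig run B
   store B into '/tmp/myevalfunc_discard';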

Is there any way to control Pig's optimization to force it to execute a
line even though it doesn't think it should?

Another thought is that maybe instead of writing an EvalFunc I should
write a custom StoreFunc to do this.  However, it looks like
StoreFuncs are very tied to writing to HDFS rather than writing to any
arbitrary data store.

Thoughts?
