Writing a custom StoreFunc, to me, is absolutely the way to do this. There are storers/loaders for various non-HDFS systems (Vertica, HBase, Cassandra, etc), and what you are doing is indeed storing to another system.
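To make that concrete, here is a minimal sketch of a StoreFunc that targets an external datastore instead of HDFS. The shape (extending org.apache.pig.StoreFunc, returning a Hadoop NullOutputFormat since nothing goes to HDFS, opening the connection in prepareToWrite, writing per record in putNext) follows the standard Pig 0.7+ store interface; MyDatastoreClient and its upsert method are purely hypothetical stand-ins for whatever client library your store provides.

```java
import java.io.IOException;

import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.OutputFormat;
import org.apache.hadoop.mapreduce.RecordWriter;
import org.apache.hadoop.mapreduce.lib.output.NullOutputFormat;
import org.apache.pig.StoreFunc;
import org.apache.pig.data.Tuple;

public class MyDatastoreStorage extends StoreFunc {

    // Hypothetical client for the external store -- not a real API.
    private MyDatastoreClient client;

    @Override
    public OutputFormat getOutputFormat() {
        // Nothing is written to HDFS; the side-effecting writes
        // happen in putNext(), so a no-op OutputFormat suffices.
        return new NullOutputFormat<Object, Object>();
    }

    @Override
    public void setStoreLocation(String location, Job job) throws IOException {
        // 'location' is whatever string follows INTO in the STORE
        // statement, e.g. a connection URI for the datastore.
        job.getConfiguration().set("my.datastore.uri", location);
    }

    @Override
    public void prepareToWrite(RecordWriter writer) throws IOException {
        // Called once per task attempt: open the connection here,
        // not once per record.
        client = new MyDatastoreClient();
    }

    @Override
    public void putNext(Tuple t) throws IOException {
        // Called once per record that reaches the STORE statement.
        client.upsert(t.get(0).toString(), t.get(1).toString());
    }
}
```

With this in place the script stores explicitly, so Pig's optimizer can no longer prune the operator:

STORE A INTO 'mystore://host/keyspace' USING com.example.MyDatastoreStorage();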
Also, it's very dangerous to use an EvalFunc to store: if a mapper fails halfway through, then when that split is reprocessed, your data will be re-uploaded. That gotcha still exists with a custom StoreFunc, but at least the logic there is explicit.

2012/2/27 Stuart White <[email protected]>

> I'm writing a Pig script that will read a file of records and pass
> them to a custom EvalFunc. This EvalFunc has a side effect: it
> updates data in a separate datastore.
>
> In the simplest example, my Pig script looks like this:
>
> A = load 'data.txt' using PigStorage(',') as (dataelement1 :
> chararray, dataelement2 : chararray);
> B = foreach A generate com.example.MyEvalFunc(dataelement1,
> dataelement2);
>
> The problem is that Pig recognizes that I never use the B records and
> therefore optimizes my script to not execute the foreach/generate that
> calls my UDF. Pig doesn't realize that MyEvalFunc() updates a
> separate datastore and therefore needs to process the records
> through the EvalFunc anyway.
>
> Of course, I could do a store/dump on B to force Pig to execute that
> line, but that feels like a hack. There is nothing I want to
> store/dump coming out of my EvalFunc.
>
> Is there any way to control Pig's optimization so it executes a
> line even though it doesn't think it should?
>
> Another thought is that maybe, instead of writing an EvalFunc, I should
> write a custom StoreFunc to do this. However, it looks like
> StoreFuncs are very tied to writing to HDFS rather than to any
> arbitrary data store.
>
> Thoughts?
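A quick way to see why the replay gotcha matters: when Hadoop reruns a failed split, every record in that split is processed again. An append-style write duplicates data on the rerun, while an upsert keyed by a record id is replay-safe. A toy model in plain Java (IdempotentSink and its names are illustrative, not a real Pig or Hadoop API):

```java
import java.util.HashMap;
import java.util.Map;

// Toy model of an idempotent sink: writes are upserts keyed by a
// record id, so replaying a failed split leaves the store unchanged.
public class IdempotentSink {
    private final Map<String, String> store = new HashMap<>();

    public void put(String key, String value) {
        store.put(key, value); // upsert: replay-safe
    }

    public int size() {
        return store.size();
    }

    public static void main(String[] args) {
        IdempotentSink sink = new IdempotentSink();
        String[][] records = {{"id1", "a"}, {"id2", "b"}};
        // First (failed) task attempt writes both records.
        for (String[] r : records) sink.put(r[0], r[1]);
        // Hadoop reruns the split; the same records arrive again.
        for (String[] r : records) sink.put(r[0], r[1]);
        System.out.println(sink.size()); // prints 2, not 4
    }
}
```

The same discipline applies whether the write lives in an EvalFunc or a StoreFunc; the StoreFunc just makes the side effect visible in the script.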
