@All,
Could someone give a little info about "updating the data in other data-store" in a Pig UDF? I have experience writing UDFs for processing data and am wondering about the above (updating the data in another data store).

Thanks much!

Regards,

On Mon, Feb 27, 2012 at 4:39 PM, Bill Graham <[email protected]> wrote:
> When writing to a non-HDFS data store in a StoreFunc, also be sure to
> disable speculative execution:
>
> SET mapred.map.tasks.speculative.execution false
>
> You'll also want to manage commits/roll-backs properly at task completion
> if the datasource you're writing to supports transactions.
>
>
> On Mon, Feb 27, 2012 at 11:57 AM, Stuart White <[email protected]> wrote:
>
> > Thanks for the feedback. That's exactly what I was looking for.
> >
> > On Mon, Feb 27, 2012 at 11:56 AM, Jonathan Coveney <[email protected]>
> > wrote:
> > > Writing a custom StoreFunc, to me, is absolutely the way to do this.
> > > There are storers/loaders for various non-HDFS systems (Vertica, HBase,
> > > Cassandra, etc), and what you are doing is indeed storing to another
> > > system.
> > >
> > > Also, it's very dangerous to use an EvalFunc to store, because if a
> > > mapper fails halfway through, then when that split is reprocessed, your
> > > data will be reuploaded. That gotcha still exists with a custom
> > > StoreFunc, but at least the logic there is explicit.
> > >
> > > 2012/2/27 Stuart White <[email protected]>
> > >
> > >> I'm writing a pig script that will read a file of records and pass
> > >> them to a custom EvalFunc. This EvalFunc has a side-effect; it
> > >> updates data in a separate datastore.
> > >>
> > >> In the simplest example, my pig script looks like this:
> > >>
> > >> A = load 'data.txt' using PigStorage(',') as (dataelement1 :
> > >> chararray, dataelement2 : chararray);
> > >> B = foreach A generate com.example.MyEvalFunc(dataelement1,
> > >> dataelement2);
> > >>
> > >> The problem is that pig recognizes that I never use the B records and
> > >> therefore optimizes my script to not execute the foreach/generate that
> > >> calls my UDF. Pig doesn't realize that MyEvalFunc() updates a
> > >> separate datastore and therefore needs to go ahead and process the
> > >> records through the EvalFunc.
> > >>
> > >> Of course I could do a store/dump on B to force pig to execute that
> > >> line, but that feels like a hack. There is nothing I want to
> > >> store/dump coming out of my EvalFunc.
> > >>
> > >> Is there any way to control pig's optimization to force it to execute a
> > >> line even though it doesn't think it should?
> > >>
> > >> Another thought is that maybe instead of writing an EvalFunc I should
> > >> write a custom StoreFunc to do this. However, it looks like
> > >> StoreFuncs are very tied to writing to HDFS rather than writing to any
> > >> arbitrary data store.
> > >>
> > >> Thoughts?
> > >>
> > >
> >
>
>
> --
> *Note that I'm no longer using my Yahoo! email address. Please email me at
> [email protected] going forward.*
>

--
Regards,
Srinivas
[email protected]
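
To make the "reprocessed split" and "commit/roll-back at task completion" points above concrete, here is a rough, self-contained Java sketch. It does not use any Pig or Hadoop classes: FakeStore, TaskAttemptWriter and RetryDemo are hypothetical names invented purely for illustration. The idea is only that each task attempt buffers its records and applies them to the external store when the attempt succeeds, so a failed attempt that gets retried should not leave duplicate rows behind.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Hypothetical stand-in for an external, transactional data store (not a Pig or Hadoop class). */
class FakeStore {
    private final List<String> committed = new ArrayList<>();

    /** Applies one task attempt's buffered records in a single step. */
    synchronized void commit(List<String> batch) {
        committed.addAll(batch);
    }

    /** Returns a copy of everything committed so far. */
    synchronized List<String> rows() {
        return new ArrayList<>(committed);
    }
}

/** Models one task attempt: buffer records, then commit or roll back when the attempt ends. */
class TaskAttemptWriter {
    private final FakeStore store;
    private final List<String> pending = new ArrayList<>();

    TaskAttemptWriter(FakeStore store) {
        this.store = store;
    }

    /** Plays the role a StoreFunc's putNext would play; nothing reaches the store yet. */
    void write(String record) {
        pending.add(record);
    }

    /** Called when the attempt succeeds. */
    void commit() {
        store.commit(pending);
        pending.clear();
    }

    /** Called when the attempt fails; buffered records are discarded. */
    void rollback() {
        pending.clear();
    }
}

class RetryDemo {
    /** Runs one attempt over the split; throws before record index failAt (use -1 for no failure). */
    static void runAttempt(FakeStore store, List<String> split, int failAt) {
        TaskAttemptWriter writer = new TaskAttemptWriter(store);
        try {
            for (int i = 0; i < split.size(); i++) {
                if (i == failAt) {
                    throw new RuntimeException("simulated task failure");
                }
                writer.write(split.get(i));
            }
            writer.commit();
        } catch (RuntimeException e) {
            writer.rollback();
        }
    }

    /** First attempt fails partway, the retry reprocesses the whole split. */
    static List<String> demo() {
        FakeStore store = new FakeStore();
        List<String> split = Arrays.asList("a", "b", "c");
        runAttempt(store, split, 2);   // fails after buffering "a" and "b"
        runAttempt(store, split, -1);  // retry of the same split, runs to completion
        return store.rows();
    }

    public static void main(String[] args) {
        System.out.println(demo());
    }
}
```

In a real StoreFunc the commit and rollback calls would presumably be driven by the OutputCommitter of the OutputFormat the StoreFunc returns (its commitTask and abortTask hooks), and buffering in memory only works if a task's output is small enough; both of those are assumptions on my part rather than something confirmed in this thread. Disabling speculative execution still matters, since otherwise two attempts for the same split could both run to completion.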
