HBaseStorage is an example of a StoreFunc that writes to another data store that's not (directly) HDFS:
http://svn.apache.org/repos/asf/pig/trunk/src/org/apache/pig/backend/hadoop/hbase/HBaseStorage.java
So is CassandraStorage:
http://javasourcecode.org/html/open-source/cassandra/cassandra-0.8.1/org/apache/cassandra/hadoop/pig/CassandraStorage.html

Your implementation will of course differ based on what you're writing to, but those examples show generally how to write a StoreFunc.

On Mon, Feb 27, 2012 at 8:49 PM, Srinivas Surasani <[email protected]> wrote:

> @All,
>
> - Could someone give a little info about "updating the data in
>   other data-store" in a Pig UDF? I have experience writing UDFs for
>   processing data and am wondering about the above (updating the data
>   in other data-store).
>
> Thanks much!
>
> Regards,
>
> On Mon, Feb 27, 2012 at 4:39 PM, Bill Graham <[email protected]> wrote:
>
>> When writing to a non-HDFS data store in a StoreFunc, also be sure to
>> disable speculative execution:
>>
>> SET mapred.map.tasks.speculative.execution false
>>
>> You'll also want to manage commits/roll-backs properly at task completion
>> if the datasource you're writing to supports transactions.
>>
>> On Mon, Feb 27, 2012 at 11:57 AM, Stuart White <[email protected]> wrote:
>>
>> > Thanks for the feedback. That's exactly what I was looking for.
>> >
>> > On Mon, Feb 27, 2012 at 11:56 AM, Jonathan Coveney <[email protected]> wrote:
>> > > Writing a custom StoreFunc, to me, is absolutely the way to do this.
>> > > There are storers/loaders for various non-HDFS systems (Vertica, HBase,
>> > > Cassandra, etc), and what you are doing is indeed storing to another
>> > > system.
>> > >
>> > > Also, it's very dangerous to use an EvalFunc to store, because if a
>> > > mapper fails halfway through, then when that split is reprocessed, your
>> > > data will be reuploaded. That gotcha still exists with a custom
>> > > StoreFunc, but at least the logic there is explicit.
>> > >
>> > > 2012/2/27 Stuart White <[email protected]>
>> > >
>> > >> I'm writing a pig script that will read a file of records and pass
>> > >> them to a custom EvalFunc. This EvalFunc has a side-effect; it
>> > >> updates data in a separate datastore.
>> > >>
>> > >> In the simplest example, my pig script looks like this:
>> > >>
>> > >> A = load 'data.txt' using PigStorage(',') as (dataelement1 :
>> > >> chararray, dataelement2 : chararray);
>> > >> B = foreach A generate com.example.MyEvalFunc(dataelement1,
>> > >> dataelement2);
>> > >>
>> > >> The problem is that pig recognizes that I never use the B records and
>> > >> therefore optimizes my script to not execute the foreach/generate that
>> > >> calls my UDF. Pig doesn't realize that MyEvalFunc() updates a
>> > >> separate datastore and therefore needs to go ahead and process the
>> > >> records through the EvalFunc.
>> > >>
>> > >> Of course I could do a store/dump on B to force pig to execute that
>> > >> line, but that feels like a hack. There is nothing I want to
>> > >> store/dump coming out of my EvalFunc.
>> > >>
>> > >> Is there any way to control pig's optimization to force it to execute a
>> > >> line even though it doesn't think it should?
>> > >>
>> > >> Another thought is that maybe instead of writing an EvalFunc I should
>> > >> write a custom StoreFunc to do this. However, it looks like
>> > >> StoreFuncs are very tied to writing to HDFS rather than writing to any
>> > >> arbitrary data store.
>> > >>
>> > >> Thoughts?
>>
>> --
>> *Note that I'm no longer using my Yahoo! email address. Please email me at
>> [email protected] going forward.*
>
> --
> Regards,
> Srinivas
> [email protected]
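
For anyone finding this thread later, the suggestions above could be put together roughly as follows. This is only a sketch: com.example.MyStoreFunc is a hypothetical class name standing in for whatever custom StoreFunc you write, the location string 'unused' is a placeholder that your StoreFunc may or may not interpret, and the second SET line (the reduce-side counterpart of the property mentioned in the thread) is my own addition in case the store ends up running in a reducer.

```
-- Disable speculative execution so duplicate task attempts
-- don't write the same records to the external store twice.
SET mapred.map.tasks.speculative.execution false;
SET mapred.reduce.tasks.speculative.execution false;

A = LOAD 'data.txt' USING PigStorage(',')
    AS (dataelement1:chararray, dataelement2:chararray);

-- com.example.MyStoreFunc is hypothetical; it would extend
-- org.apache.pig.StoreFunc and push each tuple to the datastore
-- in putNext(). The 'unused' location is only a placeholder.
STORE A INTO 'unused' USING com.example.MyStoreFunc();
```

Because the script now ends in a STORE, Pig has no reason to optimize the pipeline away, which was the original problem with the side-effecting EvalFunc. A failed and retried task can still resend records, so the writes should be idempotent or transactional where the datastore allows it.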
