Re: How to "force" pig to process records?

Srinivas Surasani Mon, 27 Feb 2012 20:50:16 -0800

@All,


   - Could someone give a little info about "updating the data in
   other data-store" in a Pig UDF? I have experience writing UDFs for
   processing data, and I am wondering about the above (updating the data
   in another data store).

Thanks Much !

Regards,


On Mon, Feb 27, 2012 at 4:39 PM, Bill Graham <[email protected]> wrote:

> When writing to a non-HDFS data store in a StoreFunc, also be sure to
> disable speculative execution:
>
> SET mapred.map.tasks.speculative.execution false
>
> You'll also want to manage commits/roll-backs properly at task completion
> if the datasource you're writing to supports transactions.
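>
> A minimal sketch of those settings, assuming the Hadoop 1.x property
> names. The reduce-side property is an addition here: it matters when the
> STORE ends up running in a reduce phase rather than a map phase.
>
>   -- stop Hadoop from launching duplicate (speculative) task attempts,
>   -- each of which would write to the external store
>   SET mapred.map.tasks.speculative.execution false;
>   SET mapred.reduce.tasks.speculative.execution false;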
>
>
> On Mon, Feb 27, 2012 at 11:57 AM, Stuart White <[email protected]> wrote:
>
> > Thanks for the feedback.  That's exactly what I was looking for.
> >
> > On Mon, Feb 27, 2012 at 11:56 AM, Jonathan Coveney <[email protected]> wrote:
> > > Writing a custom StoreFunc, to me, is absolutely the way to do this.
> > > There are storers/loaders for various non-HDFS systems (Vertica, HBase,
> > > Cassandra, etc.), and what you are doing is indeed storing to another
> > > system.
> > >
> > > Also, it's very dangerous to use an EvalFunc to store, because if a
> > > mapper fails halfway through, then when that split is reprocessed, your
> > > data will be reuploaded. That gotcha still exists with a custom
> > > StoreFunc, but at least the logic there is explicit.
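> > >
> > > As a rough sketch, the script side could look like the following.
> > > com.example.MyStoreFunc and the 'mydatastore' location string are
> > > hypothetical names; the class would extend StoreFunc and return an
> > > OutputFormat whose RecordWriter writes to the external system.
> > >
> > >   A = load 'data.txt' using PigStorage(',')
> > >       as (dataelement1:chararray, dataelement2:chararray);
> > >   -- the STORE gives Pig an output to produce, so the records are processed
> > >   store A into 'mydatastore' using com.example.MyStoreFunc();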
> > >
> > > 2012/2/27 Stuart White <[email protected]>
> > >
> > >> I'm writing a pig script that will read a file of records and pass
> > >> them to a custom EvalFunc.  This EvalFunc has a side-effect; it
> > >> updates data in a separate datastore.
> > >>
> > >> In the simplest example, my pig script looks like this:
> > >>
> > >>   A = load 'data.txt' using PigStorage(',')
> > >>       as (dataelement1:chararray, dataelement2:chararray);
> > >>   B = foreach A generate
> > >>       com.example.MyEvalFunc(dataelement1, dataelement2);
> > >>
> > >> The problem is that pig recognizes that I never use the B records and
> > >> therefore optimizes my script to not execute the foreach/generate that
> > >> calls my UDF.  Pig doesn't realize that MyEvalFunc() updates a
> > >> separate datastore and therefore needs to go ahead and process the
> > >> records through the EvalFunc.
> > >>
> > >> Of course I could do a store/dump on B to force pig to execute that
> > >> line, but that feels like a hack.  There is nothing I want to
> > >> store/dump coming out of my EvalFunc.
> > >>
> > >> Is there any way to control pig's optimization to force it to execute a
> > >> line even though it doesn't think it should?
> > >>
> > >> Another thought is that maybe instead of writing an EvalFunc I should
> > >> write a custom StoreFunc to do this.  However, it looks like
> > >> StoreFuncs are very tied to writing to HDFS rather than writing to any
> > >> arbitrary data store.
> > >>
> > >> Thoughts?
> > >>
> >
>



-- 
Regards,
Srinivas
[email protected]
