Re: How to "force" pig to process records?

Bill Graham Mon, 27 Feb 2012 13:40:56 -0800

When writing to a non-HDFS data store in a StoreFunc, also be sure to
disable speculative execution:


SET mapred.map.tasks.speculative.execution false

You'll also want to manage commits/roll-backs properly at task completion
if the datasource you're writing to supports transactions.
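
As a minimal sketch of how this might look in a script: the jar name, the
class com.example.MyStoreFunc, and the 'mydatastore' location string are
hypothetical placeholders and do not come from this thread. The reduce-side
property is included on the assumption that the store could run in a reduce
task, for example if the script groups or joins before storing.

```pig
-- Hypothetical jar containing the custom StoreFunc.
REGISTER mystore.jar;

-- Keep duplicate (speculative) task attempts from writing the same
-- records to the external store.
SET mapred.map.tasks.speculative.execution false;
SET mapred.reduce.tasks.speculative.execution false;

A = LOAD 'data.txt' USING PigStorage(',')
    AS (dataelement1:chararray, dataelement2:chararray);

-- The location string is interpreted by the StoreFunc itself;
-- 'mydatastore' is a placeholder.
STORE A INTO 'mydatastore' USING com.example.MyStoreFunc();
```

Disabling speculative execution only removes concurrent duplicate attempts.
Failed tasks are still retried, so the StoreFunc would still need idempotent
writes or a commit/roll-back at task completion.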


On Mon, Feb 27, 2012 at 11:57 AM, Stuart White <[email protected]> wrote:

> Thanks for the feedback.  That's exactly what I was looking for.
>
> On Mon, Feb 27, 2012 at 11:56 AM, Jonathan Coveney <[email protected]>
> wrote:
> > Writing a custom StoreFunc, to me, is absolutely the way to do this.
> > There are storers/loaders for various non-HDFS systems (Vertica, HBase,
> > Cassandra, etc), and what you are doing is indeed storing to another
> > system.
> >
> > Also, it's very dangerous to use an EvalFunc to store, because if a
> > mapper fails halfway through, then when that split is reprocessed, your
> > data will be reuploaded. That gotcha still exists with a custom
> > StoreFunc, but at least the logic there is explicit.
> >
> > 2012/2/27 Stuart White <[email protected]>
> >
> >> I'm writing a pig script that will read a file of records and pass
> >> them to a custom EvalFunc.  This EvalFunc has a side-effect; it
> >> updates data in a separate datastore.
> >>
> >> In the simplest example, my pig script looks like this:
> >>
> >>   A = load 'data.txt' using PigStorage(',')
> >>       as (dataelement1:chararray, dataelement2:chararray);
> >>   B = foreach A generate
> >>       com.example.MyEvalFunc(dataelement1, dataelement2);
> >>
> >> The problem is that pig recognizes that I never use the B records and
> >> therefore optimizes my script to not execute the foreach/generate that
> >> calls my UDF.  Pig doesn't realize that MyEvalFunc() updates a
> >> separate datastore and therefore needs to go ahead and process the
> >> records through the EvalFunc.
> >>
> >> Of course I could do a store/dump on B to force pig to execute that
> >> line, but that feels like a hack.  There is nothing I want to
> >> store/dump coming out of my EvalFunc.
> >>
> >> Is there any way to control pig's optimization to force it to execute a
> >> line even though it doesn't think it should?
> >>
> >> Another thought is that maybe instead of writing an EvalFunc I should
> >> write a custom StoreFunc to do this.  However, it looks like
> >> StoreFuncs are very tied to writing to HDFS rather than writing to any
> >> arbitrary data store.
> >>
> >> Thoughts?
> >>
>



-- 
*Note that I'm no longer using my Yahoo! email address. Please email me at
[email protected] going forward.*
