Re: How to "force" pig to process records?

Bill Graham Mon, 27 Feb 2012 23:37:57 -0800

HBaseStorage is an example of a StoreFunc that writes to another
data store that's not (directly) HDFS:
http://svn.apache.org/repos/asf/pig/trunk/src/org/apache/pig/backend/hadoop/hbase/HBaseStorage.java


So is CassandraStorage:
http://javasourcecode.org/html/open-source/cassandra/cassandra-0.8.1/org/apache/cassandra/hadoop/pig/CassandraStorage.html

Your implementation will of course differ based on what you're writing to,
but those examples show generally how to write a StoreFunc.
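
As a rough sketch of how this fits together on the Pig side, the script
below stores the loaded relation through a custom StoreFunc instead of
calling a side-effecting EvalFunc. The jar name, the
com.example.MyStoreFunc class, and the 'mystore://target' location string
are hypothetical placeholders, and the reduce-side speculative execution
setting is an assumption added alongside the map-side one mentioned
earlier in this thread:

```
-- Avoid duplicate task attempts writing the same records twice.
SET mapred.map.tasks.speculative.execution false;
SET mapred.reduce.tasks.speculative.execution false;

-- Hypothetical jar containing the custom StoreFunc.
REGISTER mystorefunc.jar;

A = LOAD 'data.txt' USING PigStorage(',')
    AS (dataelement1:chararray, dataelement2:chararray);

-- The location string is handed to the StoreFunc, which decides how to
-- interpret it; it does not have to be an HDFS path.
STORE A INTO 'mystore://target' USING com.example.MyStoreFunc();
```

Because STORE is an output statement, Pig has to run the pipeline that
feeds it, so there is nothing left for the optimizer to drop.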

On Mon, Feb 27, 2012 at 8:49 PM, Srinivas Surasani <[email protected]> wrote:

> @All,
>
>
>    - Could someone give a little info about "updating the data in
>    another data store" from a Pig UDF? I have experience writing UDFs for
>    processing data and am wondering about the above (updating the data in
>    another data store).
>
> Thanks Much !
>
> Regards,
>
>
> On Mon, Feb 27, 2012 at 4:39 PM, Bill Graham <[email protected]> wrote:
>
>> When writing to a non-HDFS data store in a StoreFunc, also be sure to
>> disable speculative execution:
>>
>> SET mapred.map.tasks.speculative.execution false
>>
>> You'll also want to manage commits/roll-backs properly at task completion
>> if the datasource you're writing to supports transactions.
>>
>>
>> On Mon, Feb 27, 2012 at 11:57 AM, Stuart White <[email protected]> wrote:
>>
>> > Thanks for the feedback.  That's exactly what I was looking for.
>> >
>> > On Mon, Feb 27, 2012 at 11:56 AM, Jonathan Coveney <[email protected]>
>> > wrote:
>> > > Writing a custom StoreFunc, to me, is absolutely the way to do this.
>> > > There are storers/loaders for various non-HDFS systems (Vertica, HBase,
>> > > Cassandra, etc.), and what you are doing is indeed storing to another
>> > > system.
>> > >
>> > > Also, it's very dangerous to use an EvalFunc to store, because if a
>> > > mapper fails halfway through, then when that split is reprocessed, your
>> > > data will be reuploaded. That gotcha still exists with a custom
>> > > StoreFunc, but at least the logic there is explicit.
>> > >
>> > > 2012/2/27 Stuart White <[email protected]>
>> > >
>> > >> I'm writing a pig script that will read a file of records and pass
>> > >> them to a custom EvalFunc.  This EvalFunc has a side-effect; it
>> > >> updates data in a separate datastore.
>> > >>
>> > >> In the simplest example, my pig script looks like this:
>> > >>
>> > >>   A = load 'data.txt' using PigStorage(',') as (dataelement1 : chararray, dataelement2 : chararray);
>> > >>   B = foreach A generate com.example.MyEvalFunc(dataelement1, dataelement2);
>> > >>
>> > >> The problem is that pig recognizes that I never use the B records and
>> > >> therefore optimizes my script to not execute the foreach/generate that
>> > >> calls my UDF.  Pig doesn't realize that MyEvalFunc() updates a
>> > >> separate datastore and therefore needs to go ahead and process the
>> > >> records through the EvalFunc.
>> > >>
>> > >> Of course I could do a store/dump on B to force pig to execute that
>> > >> line, but that feels like a hack.  There is nothing I want to
>> > >> store/dump coming out of my EvalFunc.
>> > >>
>> > >> Is there any way to control pig's optimization to force it to execute a
>> > >> line even though it doesn't think it should?
>> > >>
>> > >> Another thought is that maybe instead of writing an EvalFunc I should
>> > >> write a custom StoreFunc to do this.  However, it looks like
>> > >> StoreFuncs are very tied to writing to HDFS rather than writing to any
>> > >> arbitrary data store.
>> > >>
>> > >> Thoughts?
>> > >>
>> >
>>
>>
>>
>> --
>> *Note that I'm no longer using my Yahoo! email address. Please email me at
>> [email protected] going forward.*
>>
>
>
>
> --
> Regards,
> Srinivas
> [email protected]
>
>
