I have not tried this yet, but Hadoop has a built-in mechanism for skipping
bad records, and I'm guessing it would work fine with Pig:
http://hadoop.apache.org/docs/r1.1.2/mapred_tutorial.html#Skipping+Bad+Records

One caveat: skipping mode only detects one bad record per task failure, and
it is not turned on until after the second failure. So with a maximum of 4
attempts per task, it will detect only 1 bad record: the 1st attempt fails,
the 2nd attempt fails, the 3rd attempt runs with skipping enabled and fails
but finds the bad record, and the 4th attempt (assuming only 1 bad record)
succeeds by skipping it. Because of this, depending on how many bad records
you have in your data, you might have to increase the maximum number of task
attempts until the framework has found all of them.
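
To make the attempt counting above concrete, here is a rough sketch of the
arithmetic. It is not Hadoop code and I have not checked it against a real
cluster. It only encodes the model described in this mail: skipping starts
after 2 failed attempts, and each further failed attempt finds exactly one
bad record. The function names are made up for illustration.

```python
def attempts_needed(bad_records, attempts_before_skipping=2):
    """Attempts one task needs to finish, under the model in this mail.

    Assumptions (not verified against Hadoop itself):
      - the first `attempts_before_skipping` attempts fail normally,
      - each later failed attempt identifies exactly one bad record,
      - one final attempt succeeds once every bad record is known.
    """
    if bad_records == 0:
        return 1  # nothing to skip, the first attempt succeeds
    return attempts_before_skipping + bad_records + 1


def max_bad_records(max_attempts, attempts_before_skipping=2):
    """Bad records a single task can survive with `max_attempts` attempts."""
    return max(0, max_attempts - attempts_before_skipping - 1)


if __name__ == "__main__":
    # The example from this mail: 4 attempts tolerate 1 bad record.
    print(max_bad_records(4))
    # A task whose input holds 3 bad records would need this many attempts.
    print(attempts_needed(3))
```

So if you expect up to N bad records in the input of a single task, this
model suggests setting the maximum attempts to at least N + 3.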


On Thu, Jul 11, 2013 at 1:09 PM, Sajid Raza <[email protected]> wrote:

> Did a bit of googling before I posted. Saw that some folks proposed an
> ONERROR extension to the language, but that doesn't seem to be implemented
> yet.
>
> Do we have any Pig Latin built-ins or primitives for handling errors?
>
> My use case is simple: I have a large number of records and dropping one or
> two bad records would be better than having my whole job fail.
>
