I haven't tried this yet, but Hadoop has a built-in mechanism for skipping bad records, and I'd guess it works fine with Pig:
http://hadoop.apache.org/docs/r1.1.2/mapred_tutorial.html#Skipping+Bad+Records
One caveat: skipping mode detects only one bad record per task failure, and it is not turned on until after the second failure. So with a maximum of 4 attempts per task, it can detect just 1 bad record:

  1st attempt: fails
  2nd attempt: fails
  3rd attempt: skipping is on; the attempt fails, but the bad record is found
  4th attempt: succeeds by skipping the bad record (assuming there is only 1)

Because of this, depending on how many bad records your data contains, you may have to increase the maximum number of task attempts until the framework has found all of them.

On Thu, Jul 11, 2013 at 1:09 PM, Sajid Raza <[email protected]> wrote:
> Did a bit of googling before I posted. Saw that some folks proposed an
> ONERROR extension to the language, but that doesn't seem to be implemented
> yet.
>
> Do we have any Pig Latin built-ins or primitives for handling errors?
>
> My use case is simple: I have a large number of records and dropping one or
> two bad records would be better than having my whole job fail.
>
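To put numbers on the task-attempt caveat above, here is a rough Python sketch of the arithmetic. It only encodes the simplified model from my description (two failures before skipping starts, then one bad record found per failed attempt, then a final successful attempt), not Hadoop's actual implementation. The function name and its parameter are made up for illustration.

```python
def attempts_needed(bad_records, failures_before_skipping=2):
    """Attempts a single task needs under the simplified model in this thread.

    Assumption: skipping starts after `failures_before_skipping` failed
    attempts, and each failed attempt in skipping mode finds one bad record.
    """
    if bad_records == 0:
        return 1  # a clean task succeeds on its first attempt
    # failures before skipping + one failed attempt per bad record + final success
    return failures_before_skipping + bad_records + 1


for n in (1, 2, 5):
    print(f"{n} bad record(s) in a task -> {attempts_needed(n)} attempts")
```

So under this model a task with n bad records needs n + 3 attempts, and the maximum attempts setting for the job would have to be at least that for the worst task.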
