Right now error handling is controlled by the UDFs themselves, and there is
no way to direct it externally.
You could make an ErrorHandlingUDF that takes a UDF spec, invokes it, traps
errors, and then applies the specified error-handling behavior... that's a
bit ugly, though.
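
For illustration, here's a rough Java sketch of that wrapper. All names
are made up, and it assumes a Pig version where EvalFunc exposes warn()
and where UDFs can take constructor string arguments via DEFINE:

    package com.example; // hypothetical

    import java.io.IOException;
    import org.apache.pig.EvalFunc;
    import org.apache.pig.PigWarning;
    import org.apache.pig.data.Tuple;

    // Runs another EvalFunc, traps exceptions from its exec(), and emits
    // null for the failing record instead of killing the job.
    public class ErrorHandlingUdf extends EvalFunc<Object> {
        private final EvalFunc<?> wrapped;

        // Pig passes the string argument from the DEFINE statement, e.g.
        //   DEFINE safe_diff
        //     com.example.ErrorHandlingUdf('com.example.SomeDateDiffUdf');
        public ErrorHandlingUdf(String udfClassName) throws Exception {
            wrapped = (EvalFunc<?>) Class.forName(udfClassName).newInstance();
        }

        @Override
        public Object exec(Tuple input) throws IOException {
            try {
                return wrapped.exec(input);
            } catch (Exception e) {
                // Note the failure in the job's warning counters, move on.
                warn("trapped: " + e.getMessage(), PigWarning.UDF_WARNING_1);
                return null;
            }
        }
    }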

There is a problem with trapping general exceptions, of course: if they
happen 0.000001% of the time you can probably just ignore them, but if they
happen in half your dataset, you want the job to tell you something is
wrong. So this stuff gets non-trivial. If anyone wants to propose a design
to solve this general problem, I think that would be a welcome addition.
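
To make that concrete, a naive per-task version of the accounting could
replace the wrapper's exec() above. The threshold and minimum sample size
are invented knobs, and the counts are per task rather than global, so
treat this as a starting point only:

    private long seen = 0;
    private long failed = 0;
    private static final double MAX_ERROR_RATE = 0.01; // invented knob
    private static final long MIN_SAMPLE = 10000;      // invented knob

    @Override
    public Object exec(Tuple input) throws IOException {
        seen++;
        try {
            return wrapped.exec(input);
        } catch (Exception e) {
            failed++;
            if (seen >= MIN_SAMPLE && (double) failed / seen > MAX_ERROR_RATE) {
                // Half-broken input is a bug, not noise; fail the task.
                throw new IOException("UDF failed on " + failed + " of "
                        + seen + " records", e);
            }
            return null; // rare failure: drop the record and keep going
        }
    }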

D

On Tue, Jan 11, 2011 at 12:47 AM, hadoop n00b <[email protected]> wrote:

> Thanks. I sometimes get a date like 0001-01-01. This would be a valid date
> format, but when I try to get the seconds between this and another date,
> say 2011-01-01, I get an error that the value is too large to fit into an
> int, and the process stops. Do we have something like ifError(x-y, null,
> x-y)? Or would I have to implement this as a UDF?
>
> Thanks
>
> On Tue, Jan 11, 2011 at 11:40 AM, Dmitriy Ryaboy <[email protected]>
> wrote:
>
> > Create a UDF that verifies the format, and go through a filtering step
> > first.
> > If you would like to save the malformed records so you can look at them
> > later, you can use the SPLIT operator to route the good records to your
> > regular workflow and the bad records somewhere on HDFS (a sketch of both
> > steps is at the end of this thread).
> >
> > -D
> >
> > On Mon, Jan 10, 2011 at 9:58 PM, hadoop n00b <[email protected]> wrote:
> >
> > > Hello,
> > >
> > > I have a pig script that uses PiggyBank to calculate date differences.
> > > Sometimes, when I get a weird date or a wrong format in the input, the
> > > script throws an error and aborts.
> > >
> > > Is there a way I could trap these errors and move on without stopping
> > > the execution?
> > >
> > > Thanks
> > >
> > > PS: I'm using CDH2 with Pig 0.5
> > >
> >
>
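
For completeness, the filter-then-SPLIT approach quoted above might look
something like the following. The UDF is a made-up sketch, and the Pig
Latin in the comment spells out explicit conditions on both branches,
since Pig 0.5 has no OTHERWISE clause:

    // Intended usage (hypothetical names):
    //   DEFINE is_valid com.example.IsValidDate();
    //   SPLIT raw INTO good IF is_valid(datestr),
    //                  bad IF NOT is_valid(datestr);
    //   -- run the regular workflow on good; STORE bad on HDFS for later
    package com.example;

    import java.io.IOException;
    import java.text.ParseException;
    import java.text.SimpleDateFormat;
    import org.apache.pig.FilterFunc;
    import org.apache.pig.data.Tuple;

    public class IsValidDate extends FilterFunc {
        @Override
        public Boolean exec(Tuple input) throws IOException {
            if (input == null || input.size() == 0 || input.get(0) == null) {
                return false;
            }
            SimpleDateFormat fmt = new SimpleDateFormat("yyyy-MM-dd");
            fmt.setLenient(false); // reject things like 2011-13-45
            try {
                // Crudely treat pre-1970 dates as bad too, which catches
                // placeholders like 0001-01-01; adjust to taste.
                return fmt.parse((String) input.get(0)).getTime() > 0;
            } catch (ParseException e) {
                return false;
            }
        }
    }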
