Pipeline error handling

Kelsey RIDER Thu, 26 Jul 2018 00:45:02 -0700

I'm trying to figure out how to handle errors in my Pipeline.

Right now, my main transform is a DoFn<FileIO.ReadableFile, CSVRecord>. I have 
a few different TupleTag<CSVRecord> instances that I use depending on the data 
contained in the records.
If there's a problem with a line (due to one of several possible causes), I 
output it to a TupleTag<CSVRecord> named ERROR. However, this alone doesn't 
carry any information about the error.
I would like the ERROR tag to have a type other than CSVRecord, e.g. some 
sort of ErrorInfo class containing the row number, filename, a message about 
what went wrong, etc.
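
For concreteness, here is roughly the kind of class I have in mind. This is 
only a sketch: the name ErrorInfo, its fields, and its accessors are my own 
placeholders and not anything provided by Beam. I'm assuming it should be 
Serializable so that a coder can be found for it.

```java
import java.io.Serializable;
import java.util.Objects;

// Hypothetical value class describing one bad record.
// None of these names come from Beam; they are placeholders for illustration.
final class ErrorInfo implements Serializable {
    private static final long serialVersionUID = 1L;

    private final String filename;  // file the bad record came from
    private final long rowNumber;   // row number of the record within that file
    private final String message;   // description of what went wrong

    public ErrorInfo(String filename, long rowNumber, String message) {
        this.filename = Objects.requireNonNull(filename, "filename");
        this.rowNumber = rowNumber;
        this.message = Objects.requireNonNull(message, "message");
    }

    public String getFilename() { return filename; }
    public long getRowNumber() { return rowNumber; }
    public String getMessage() { return message; }

    @Override
    public boolean equals(Object o) {
        if (this == o) return true;
        if (!(o instanceof ErrorInfo)) return false;
        ErrorInfo other = (ErrorInfo) o;
        return rowNumber == other.rowNumber
                && filename.equals(other.filename)
                && message.equals(other.message);
    }

    @Override
    public int hashCode() {
        return Objects.hash(filename, rowNumber, message);
    }

    // Formats as "<filename>:<rowNumber>: <message>" for logging.
    @Override
    public String toString() {
        return filename + ":" + rowNumber + ": " + message;
    }
}
```

The ERROR tag would then ideally be a TupleTag<ErrorInfo> instead of a 
TupleTag<CSVRecord>.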


I can't use multiple TupleTag types with ParDo, because the withOutputTags() 
method forces them to all have the same generic parameter.

I saw the example here: 
https://medium.com/@vallerylancey/error-handling-elements-in-apache-beam-pipelines-fffdea91af2a
But I don't see how this can work, since they use multiple generic types in 
withOutputTags(). Also, is this good practice? It seems like they "cheat" by 
not calling apply() and instead transforming the PCollection directly, and in 
that case why even bother extending DoFn?

Finally, if I write my own 
PTransform<PCollection<FileIO.ReadableFile>, PCollectionTuple> class and start 
manually creating PCollections and so on, wouldn't this effectively become a 
bottleneck where everything has to be read at once, with no more sequential 
handling of the records as they're read?
