I'm trying to figure out how to handle errors in my Pipeline. Right now, my main transform is a DoFn<FileIO.ReadableFile, CSVRecord>. I have a few different TupleTag<CSVRecord> that I use depending on the data contained in the records. In the event there's a problem with a line (due to one of several possible causes), I created a TupleTag<CSVRecord> ERROR. However, just doing this doesn't carry with it any information about the error. I would like for the ERROR tag to have a type other than CSVRecord, e.g. some sort of ErrorInfo class containing the row number, filename, message about what went wrong, etc...
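To make that concrete, here is a rough sketch of the kind of ErrorInfo class I have in mind. Every name in it (the class, its fields, the accessors) is hypothetical, not something from Beam or from my current code. I made it Serializable on the assumption that Beam would need to encode it between pipeline steps.

```java
import java.io.Serializable;
import java.util.Objects;

// Hypothetical error-description type; all names here are assumptions for illustration.
final class ErrorInfo implements Serializable {
    private static final long serialVersionUID = 1L;

    private final String filename;  // file the bad record came from
    private final long rowNumber;   // row within that file
    private final String message;   // description of what went wrong

    ErrorInfo(String filename, long rowNumber, String message) {
        this.filename = Objects.requireNonNull(filename, "filename");
        this.rowNumber = rowNumber;
        this.message = Objects.requireNonNull(message, "message");
    }

    String getFilename() { return filename; }
    long getRowNumber() { return rowNumber; }
    String getMessage() { return message; }

    // Value equality, so two errors describing the same row compare equal.
    @Override
    public boolean equals(Object o) {
        if (this == o) return true;
        if (!(o instanceof ErrorInfo)) return false;
        ErrorInfo other = (ErrorInfo) o;
        return rowNumber == other.rowNumber
            && filename.equals(other.filename)
            && message.equals(other.message);
    }

    @Override
    public int hashCode() {
        return Objects.hash(filename, rowNumber, message);
    }

    // Renders as "<filename>:<rowNumber>: <message>".
    @Override
    public String toString() {
        return filename + ":" + rowNumber + ": " + message;
    }
}
```

The idea would be for the DoFn to emit something like new ErrorInfo("data.csv", 42L, "bad column count") to the ERROR tag instead of the raw CSVRecord.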
I can't use multiple TupleTag types with ParDo, because the withOutputTags() method appears to force them all to have the same generic parameter. I saw the example here: https://medium.com/@vallerylancey/error-handling-elements-in-apache-beam-pipelines-fffdea91af2a But I don't see how this can work, since they use multiple generic types in withOutputTags(). Is this even good practice? It seems like they "cheat" by not calling apply() and instead transforming the PCollection directly, in which case why bother extending DoFn at all? Finally, if I write my own PTransform<PCollection<FileIO.ReadableFile>, PCollectionTuple> class and start manually creating PCollections and whatnot, then this would effectively become a bottleneck where everything has to be read at once, and there's no longer any sequential handling of the records as they're read, right?
