TupleTags are a good way to proceed. You can give the failed files a dead-letter output with its own tag, and keep the correctly converted files on the main output. A sample implementation is here: <https://cloud.google.com/blog/products/gcp/handling-invalid-inputs-in-dataflow>.
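For what it's worth, here is a rough sketch of how the fork could look with the Beam Java SDK. It is only an illustration under assumptions: the class name `DeadLetterSketch`, the `looksValid` check, the sample elements, and the `/tmp/dead-letter/xml` output prefix are all made up for the example, and your real check for a bad Tika conversion would go where `looksValid` is.

```java
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.TextIO;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.transforms.Create;
import org.apache.beam.sdk.transforms.DoFn;
import org.apache.beam.sdk.transforms.ParDo;
import org.apache.beam.sdk.values.PCollection;
import org.apache.beam.sdk.values.PCollectionTuple;
import org.apache.beam.sdk.values.TupleTag;
import org.apache.beam.sdk.values.TupleTagList;

public class DeadLetterSketch {
  // Anonymous subclasses ({}) keep the element type available for coder inference.
  static final TupleTag<String> VALID = new TupleTag<String>() {};
  static final TupleTag<String> DEAD_LETTER = new TupleTag<String>() {};

  // Hypothetical check (an assumption): treat the Unicode replacement
  // character as a sign that the conversion went wrong.
  static boolean looksValid(String xml) {
    return xml != null && !xml.contains("\uFFFD");
  }

  // Routes each element to exactly one of the two tagged outputs.
  static class SplitFn extends DoFn<String, String> {
    @ProcessElement
    public void processElement(@Element String xml, MultiOutputReceiver out) {
      if (looksValid(xml)) {
        out.get(VALID).output(xml);
      } else {
        out.get(DEAD_LETTER).output(xml);
      }
    }
  }

  public static void main(String[] args) {
    Pipeline p = Pipeline.create(PipelineOptionsFactory.fromArgs(args).create());

    // Stand-in for the XML produced by the Tika parsing step.
    PCollection<String> parsed =
        p.apply(Create.of("<doc>ok</doc>", "<doc>bad \uFFFD</doc>"));

    // VALID is the main output; DEAD_LETTER is the additional output.
    PCollectionTuple results =
        parsed.apply(
            ParDo.of(new SplitFn()).withOutputTags(VALID, TupleTagList.of(DEAD_LETTER)));

    // Good documents would continue through the rest of the pipeline from here.
    PCollection<String> good = results.get(VALID);

    // Failed documents are written out for manual repair (hypothetical location).
    results.get(DEAD_LETTER).apply(TextIO.write().to("/tmp/dead-letter/xml"));

    p.run().waitUntilFinish();
  }
}
```

The alert could hang off the dead-letter collection as one more step, and the manually fixed files could be fed back in as a second input ahead of the processing that follows `good`, though how best to do that depends on your runner and on where the files live.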
Varun

On Wed, Apr 8, 2020 at 11:00 AM Cameron Bateman <[email protected]> wrote:

> I am trying to create a pipeline that intakes PDF files, parses the data
> using Tika and processes the data. A problem I have is that sometimes Tika
> doesn't perfectly convert certain pieces of text correctly.
>
> I can detect that this and would like to fork the output of my pipeline:
> for correctly converted PDF files, I want to continue processing the data.
> For the ones that have errors, I'd like to dump the intermediate XML data
> to a directory and raise an alert. For those files, I will go and manually
> fix the file and effective restart the pipeline from where it failed as if
> it was correct in the first place.
>
> Is there any facility to do this sort of handling of imperfect data
> inputs? I see that I can try to use MultiOutputReceiver and TupleTags to
> try to fork the data but I'm a little at a loss where to proceed.
>
> Thanks,
> Cameron
>
