I am trying to create a pipeline that takes in PDF files, parses them with Tika, and processes the extracted data. One problem is that Tika sometimes fails to convert certain pieces of text correctly.
I can detect when this happens, and I would like to fork the output of my pipeline. For correctly converted PDF files, I want to continue processing the data. For the ones that have errors, I'd like to dump the intermediate XML data to a directory and raise an alert. I will then fix those files manually and effectively restart the pipeline from the point where it failed, as if the conversion had been correct in the first place.

Is there any facility for handling imperfect input data like this? I see that I could use MultiOutputReceiver and TupleTags to fork the data, but I'm at a bit of a loss about how to proceed.

Thanks,
Cameron
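
To make the intended fork concrete, here is a minimal plain-Java sketch of the behaviour, written outside Beam. Everything in it is an assumption for illustration: the class and method names are hypothetical, the U+FFFD check only stands in for the real detection logic, and the alert is just a line on stderr. In a Beam pipeline, each branch would correspond to one TupleTag output of a ParDo.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Plain-Java sketch of the fork; every name here is hypothetical.
final class ConversionFork {

    // Placeholder check (an assumption): treat U+FFFD, the Unicode
    // replacement character, as the sign of a bad conversion.
    static boolean looksGarbled(String xml) {
        return xml.indexOf('\uFFFD') >= 0;
    }

    // Returns the names of cleanly converted documents, in input order.
    // Garbled documents are written to errorDir as <name>.xml for manual repair.
    static List<String> fork(Map<String, String> xmlByName, Path errorDir)
            throws IOException {
        Files.createDirectories(errorDir);
        List<String> clean = new ArrayList<>();
        for (Map.Entry<String, String> entry : xmlByName.entrySet()) {
            if (looksGarbled(entry.getValue())) {
                // Error branch: dump the intermediate XML and raise an alert.
                Path dump = errorDir.resolve(entry.getKey() + ".xml");
                Files.write(dump, entry.getValue().getBytes(StandardCharsets.UTF_8));
                System.err.println("ALERT: needs manual fix: " + dump);
            } else {
                // Main branch: the document continues through the pipeline.
                clean.add(entry.getKey());
            }
        }
        return clean;
    }

    public static void main(String[] args) throws IOException {
        Map<String, String> docs = new LinkedHashMap<>();
        docs.put("a", "<p>fine</p>");
        docs.put("b", "<p>bro\uFFFDken</p>");
        Path errorDir = Files.createTempDirectory("tika-errors");
        System.out.println("clean documents: " + fork(docs, errorDir));
    }
}
```

Once a dumped file has been fixed by hand, the idea is to feed the corrected XML back in at the stage right after the Tika conversion, so that the rest of the pipeline treats it like any cleanly converted document.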
