Handling imperfect data

Cameron Bateman Tue, 07 Apr 2020 22:30:45 -0700

I am trying to create a pipeline that takes in PDF files, parses them with
Tika, and processes the resulting data. The problem is that Tika sometimes
fails to convert certain pieces of text correctly.


I can detect this, and I would like to fork the output of my pipeline:
for correctly converted PDF files, I want to continue processing the data.
For the ones that have errors, I'd like to dump the intermediate XML data
to a directory and raise an alert. I will then fix those files manually
and effectively restart the pipeline from the point where it failed, as if
the conversion had been correct in the first place.

Is there any facility for this sort of handling of imperfect input data?
I see that I could use MultiOutputReceiver and TupleTags to fork the data,
but I'm a little at a loss as to how to proceed.
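
To make the intended fork concrete, here is a minimal plain-Java sketch of
the routing step, independent of Beam. Everything in it is hypothetical:
the class name ConversionRouter, the Routed holder, and especially
looksCorrupt, which only checks for the Unicode replacement character as a
stand-in for whatever detection is really used. In a Beam pipeline the two
lists would presumably correspond to two tagged outputs.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

/** Hypothetical sketch: forks converted documents into "good" and "needs review". */
final class ConversionRouter {

    /** Result of the fork. */
    static final class Routed {
        /** XML of documents that may continue through the pipeline. */
        final List<String> good = new ArrayList<>();
        /** Names of documents whose XML was dumped for manual repair. */
        final List<String> needsReview = new ArrayList<>();
    }

    /** Placeholder check (an assumption): U+FFFD marks text that was not decoded. */
    static boolean looksCorrupt(String xml) {
        return xml.indexOf('\uFFFD') >= 0;
    }

    /**
     * Sends each document either onward or to reviewDir as "<name>.xml".
     * Raising the alert is left out; needsReview is what an alert would report.
     */
    static Routed route(Map<String, String> xmlByName, Path reviewDir) throws IOException {
        Files.createDirectories(reviewDir);
        Routed routed = new Routed();
        for (Map.Entry<String, String> entry : xmlByName.entrySet()) {
            String name = entry.getKey();
            String xml = entry.getValue();
            if (looksCorrupt(xml)) {
                // Dump the intermediate XML so it can be fixed by hand.
                Files.write(reviewDir.resolve(name + ".xml"),
                        xml.getBytes(StandardCharsets.UTF_8));
                routed.needsReview.add(name);
            } else {
                routed.good.add(xml);
            }
        }
        return routed;
    }
}
```

Under this sketch, restarting after a manual fix would just mean feeding the
corrected XML from the review directory back through route, where it passes
the check and lands in the good list.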

Thanks,
Cameron
