Re: Handling imperfect data

Varun Dhussa Tue, 07 Apr 2020 23:05:45 -0700

TupleTags are a good way to proceed. You can add a dead-letter side output
with its own tag and send the records that fail your check to it. A sample
implementation is here:
<https://cloud.google.com/blog/products/gcp/handling-invalid-inputs-in-dataflow>
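
To make the fork concrete, here is a minimal, Beam-free sketch of the routing
logic in plain Java. It is only a model of the idea, not the code from the
linked post: the class DeadLetterRouter, the Routed holder and the validity
check (looking for the U+FFFD replacement character) are all hypothetical
names and assumptions chosen for illustration. In a real pipeline the two
lists would correspond to two tagged outputs (one TupleTag each) written
through a MultiOutputReceiver.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.Predicate;

// Hypothetical, Beam-free model of a dead-letter fork. In a Beam pipeline
// each list below would be a separate tagged output of one transform.
final class DeadLetterRouter {

    /** Holds the two branches of the fork. */
    static final class Routed {
        final List<String> valid = new ArrayList<>();      // continue processing
        final List<String> deadLetter = new ArrayList<>(); // dump to a directory and alert
    }

    /**
     * Sends each intermediate XML document to exactly one branch.
     * The validity check is supplied by the caller, since only the
     * pipeline author knows what a bad conversion looks like.
     */
    static Routed route(List<String> xmlDocs, Predicate<String> looksValid) {
        Routed out = new Routed();
        for (String xml : xmlDocs) {
            boolean ok;
            try {
                ok = looksValid.test(xml);
            } catch (RuntimeException e) {
                // A check that throws is treated as a failed document so that
                // one bad input does not stop the rest of the batch.
                ok = false;
            }
            (ok ? out.valid : out.deadLetter).add(xml);
        }
        return out;
    }

    public static void main(String[] args) {
        // Assumed check: the U+FFFD replacement character signals garbled text.
        Predicate<String> noReplacementChar = xml -> xml.indexOf('\uFFFD') < 0;
        Routed r = route(
                Arrays.asList("<p>clean text</p>", "<p>garbled \uFFFD text</p>"),
                noReplacementChar);
        System.out.println("valid=" + r.valid.size()
                + " deadLetter=" + r.deadLetter.size());
    }
}
```

The point of the design is that every document lands in exactly one branch,
so the dead-letter branch can be written out and alerted on separately while
the valid branch keeps flowing.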


Varun


On Wed, Apr 8, 2020 at 11:00 AM Cameron Bateman <[email protected]>
wrote:

> I am trying to create a pipeline that takes in PDF files, parses them
> using Tika, and processes the data.  A problem I have is that Tika
> sometimes doesn't convert certain pieces of text correctly.
>
> I can detect this and would like to fork the output of my pipeline:
> for correctly converted PDF files, I want to continue processing the data.
> For the ones that have errors, I'd like to dump the intermediate XML data
> to a directory and raise an alert.  For those files, I will go and manually
> fix the file and effectively restart the pipeline from where it failed, as if
> it had been correct in the first place.
>
> Is there any facility to do this sort of handling of imperfect data
> inputs?  I see that I can use MultiOutputReceiver and TupleTags to fork
> the data, but I'm a little at a loss as to how to proceed.
>
> Thanks,
> Cameron
>
