Re: Handling imperfect data

Cameron Bateman Fri, 10 Apr 2020 00:50:26 -0700

Thanks Varun, that worked.  A small note for anyone following this thread:
the API seems to have changed slightly since the blog post was written.  In
particular, processElement is no longer a method inherited from the DoFn
parent class in recent versions.  Instead, you mark your own method with the
@ProcessElement annotation.  Check the up-to-date API docs for more info.
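
For anyone who wants to see the shape of the dead-letter split, here is a
minimal plain-Java sketch.  It deliberately uses no Beam classes so it runs
as-is; the names DeadLetterSketch, Routed, route and looksCorrupted are all
hypothetical, and the U+FFFD check is only an assumed stand-in for whatever
test detects a bad Tika conversion.  In a real pipeline the body of the loop
would sit in the method annotated with @ProcessElement, and the two lists
would be two outputs identified by TupleTags.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Plain-Java stand-in for the dead-letter split; no Beam classes are used here.
final class DeadLetterSketch {

    // Holds the two outputs. In Beam these would be two tagged outputs.
    static final class Routed {
        final List<String> valid = new ArrayList<>();
        final List<String> deadLetter = new ArrayList<>();
    }

    // Hypothetical check: treat U+FFFD (the Unicode replacement character)
    // as a sign that text extraction went wrong. Substitute your own test.
    static boolean looksCorrupted(String xml) {
        return xml.indexOf('\uFFFD') >= 0;
    }

    // Per-element routing: each document goes to exactly one of the outputs.
    static Routed route(List<String> docs) {
        Routed out = new Routed();
        for (String xml : docs) {
            if (looksCorrupted(xml)) {
                out.deadLetter.add(xml);   // candidates for manual repair
            } else {
                out.valid.add(xml);        // continue normal processing
            }
        }
        return out;
    }

    public static void main(String[] args) {
        Routed r = route(Arrays.asList("<p>clean text</p>", "<p>bad \uFFFD text</p>"));
        System.out.println("valid: " + r.valid.size()
                + ", dead letter: " + r.deadLetter.size());
    }
}
```

The point of keeping the bad records in their own output, rather than
throwing an exception, is that the rest of the batch keeps flowing while the
failed XML can be written out and fixed by hand.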


On Tue, Apr 7, 2020 at 11:05 PM Varun Dhussa <[email protected]> wrote:

> TupleTags is a good way to proceed. You can add a dead letter side output
> for the tag. A sample implementation is here:
> <https://cloud.google.com/blog/products/gcp/handling-invalid-inputs-in-dataflow>
>
> Varun
>
>
> On Wed, Apr 8, 2020 at 11:00 AM Cameron Bateman <[email protected]>
> wrote:
>
>> I am trying to create a pipeline that takes in PDF files, parses them
>> using Tika and processes the data.  A problem I have is that sometimes Tika
>> doesn't convert certain pieces of text correctly.
>>
>> I can detect this and would like to fork the output of my pipeline:
>> for correctly converted PDF files, I want to continue processing the data.
>> For the ones that have errors, I'd like to dump the intermediate XML data
>> to a directory and raise an alert.  For those files, I will go and manually
>> fix the file and effectively restart the pipeline from where it failed, as
>> if it had been correct in the first place.
>>
>> Is there any facility to do this sort of handling of imperfect data
>> inputs?  I see that I can use MultiOutputReceiver and TupleTags to fork
>> the data, but I'm a little at a loss as to how to proceed.
>>
>> Thanks,
>> Cameron
>>
>
