Thanks Varun, that worked. A small note for anyone following this thread: the API seems to have changed slightly since the blog post was written. In particular, as of recent versions, processElement is no longer a method overridden from the DoFn parent class. Instead, the processing method is identified by the @ProcessElement annotation. Check the up-to-date API documentation for more info.
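
For anyone landing here later, a minimal sketch of the annotation-based style with a dead letter output might look like the following. It is not the code from the blog post or from my pipeline. The class name, the tag names, and the looksValid check are hypothetical placeholders, and it assumes the intermediate XML is carried as a String.

```java
import org.apache.beam.sdk.transforms.DoFn;
import org.apache.beam.sdk.transforms.ParDo;
import org.apache.beam.sdk.values.PCollection;
import org.apache.beam.sdk.values.PCollectionTuple;
import org.apache.beam.sdk.values.TupleTag;
import org.apache.beam.sdk.values.TupleTagList;

class DeadLetterSketch {
  // Anonymous subclasses ({}) let Beam retain the element type of each tag.
  static final TupleTag<String> VALID = new TupleTag<String>() {};
  static final TupleTag<String> DEAD_LETTER = new TupleTag<String>() {};

  // Hypothetical stand-in for whatever check detects a bad Tika conversion.
  // Assumption: non-empty text with no U+FFFD replacement characters is "valid".
  static boolean looksValid(String xml) {
    return xml != null && !xml.isEmpty() && xml.indexOf('\uFFFD') < 0;
  }

  static class ValidateFn extends DoFn<String, String> {
    // The method is found via the annotation; it does not override a DoFn method.
    @ProcessElement
    public void processElement(@Element String xml, MultiOutputReceiver out) {
      if (looksValid(xml)) {
        out.get(VALID).output(xml);        // main output: continue processing
      } else {
        out.get(DEAD_LETTER).output(xml);  // dead letter output: needs manual repair
      }
    }
  }

  // Forks the parsed documents into the two tagged outputs.
  static PCollectionTuple split(PCollection<String> parsed) {
    return parsed.apply(
        "ValidateTikaOutput",
        ParDo.of(new ValidateFn()).withOutputTags(VALID, TupleTagList.of(DEAD_LETTER)));
  }
}
```

With this shape, split(parsed).get(VALID) would feed the rest of the pipeline, while split(parsed).get(DEAD_LETTER) could be written out to a directory and used to trigger an alert.
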
On Tue, Apr 7, 2020 at 11:05 PM Varun Dhussa <[email protected]> wrote:

> TupleTags is a good way to proceed. You can add a dead letter side output
> for the tag. A sample implementation is here
> <https://cloud.google.com/blog/products/gcp/handling-invalid-inputs-in-dataflow>.
>
> Varun
>
> On Wed, Apr 8, 2020 at 11:00 AM Cameron Bateman <[email protected]> wrote:
>
>> I am trying to create a pipeline that intakes PDF files, parses the data
>> using Tika and processes the data. A problem I have is that sometimes Tika
>> doesn't convert certain pieces of text correctly.
>>
>> I can detect this and would like to fork the output of my pipeline:
>> for correctly converted PDF files, I want to continue processing the data.
>> For the ones that have errors, I'd like to dump the intermediate XML data
>> to a directory and raise an alert. For those files, I will go and manually
>> fix the file and effectively restart the pipeline from where it failed, as if
>> it had been correct in the first place.
>>
>> Is there any facility to do this sort of handling of imperfect data
>> inputs? I see that I can try to use MultiOutputReceiver and TupleTags to
>> fork the data, but I'm a little at a loss as to how to proceed.
>>
>> Thanks,
>> Cameron
