Re: Checkpointing Dataflow Pipeline

2021-04-08 Thread Michael Luckey
Hi, thanks for your help here. Really appreciated! @alexey We already found your answer (it magically showed up in some search engine results) and, iirc, while we were still using Spark as the underlying runner on another project, I do think we relied on the mechanism you mentioned. But as with our

Re: Checkpointing Dataflow Pipeline

2021-04-07 Thread Vincent Marquez
On Wed, Apr 7, 2021 at 11:55 AM Kenneth Knowles wrote: > [I think this has graduated to a +dev thread] > > Yea, in Beam it is left up to the IOs primarily, hence the bundle > finalization step, or allowed runners to have their own features of course. > Dataflow also does have in-place pipeline

Re: Checkpointing Dataflow Pipeline

2021-04-07 Thread Kenneth Knowles
[I think this has graduated to a +dev thread] Yea, in Beam it is left up to the IOs primarily, hence the bundle finalization step, or allowed runners to have their own features of course. Dataflow also does have in-place pipeline update that restores the persisted checkpoints from one pipeline

Re: Checkpointing Dataflow Pipeline

2021-04-07 Thread Vincent Marquez
Looks like this is a common source of confusion, I had similar questions about checkpointing in the beam slack. In Spark Structured Streaming, checkpoints are saved to an *external* HDFS location and persist *beyond* each run, so in the event of a stream crashing, you can just point your next
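The externally-persisted checkpoint Vincent describes can be illustrated with a toy sketch. This is not Spark code; it is a minimal pure-Python illustration of the idea that progress is stored outside the job (as Spark Structured Streaming does with its HDFS `checkpointLocation`), so a fresh run resumes where the crashed one stopped. The file name and JSON layout are invented for the example.

```python
import json
import os
import tempfile

def consume(stream, checkpoint_path):
    """Process records after the checkpointed offset, then persist progress."""
    offset = 0
    if os.path.exists(checkpoint_path):
        with open(checkpoint_path) as f:
            offset = json.load(f)["offset"]
    processed = stream[offset:]
    # Persist progress to the external location before returning.
    with open(checkpoint_path, "w") as f:
        json.dump({"offset": len(stream)}, f)
    return processed

ckpt = os.path.join(tempfile.mkdtemp(), "checkpoint.json")
stream = ["r1", "r2", "r3"]
print(consume(stream, ckpt))   # first run: ['r1', 'r2', 'r3']
stream += ["r4"]
print(consume(stream, ckpt))   # simulated crash and restart: ['r4']
```

Because the checkpoint lives outside the job, any later run pointed at the same path continues from the committed offset, which is exactly the behavior Vincent contrasts with Dataflow's runner-internal state.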

Re: Checkpointing Dataflow Pipeline

2021-04-07 Thread Alexey Romanenko
I can’t say exactly how it will work with Dataflow but for Spark Runner I answered it here [1]: “Since KinesisIO is based on UnboundedSource.CheckpointMark, it uses the standard checkpoint mechanism, provided by Beam UnboundedSource.UnboundedReader. Once a KinesisRecord has been read
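The `UnboundedSource.CheckpointMark` mechanism Alexey refers to can be sketched in a few lines. This is a hypothetical pure-Python illustration of the pattern, not Beam's actual Java API: the reader snapshots its position into a mark, and the runner calls finalize only after the mark is durably persisted, at which point earlier records may be treated as consumed. All class and method names here are invented.

```python
class CheckpointMark:
    """Snapshot of the reader's position (e.g. a Kinesis shard sequence number)."""
    def __init__(self, sequence_number, on_finalize):
        self.sequence_number = sequence_number
        self._on_finalize = on_finalize

    def finalize_checkpoint(self):
        # Called by the runner once the mark is durably persisted;
        # only now is it safe to consider earlier records consumed.
        self._on_finalize(self.sequence_number)

class UnboundedReader:
    """Toy unbounded reader over an in-memory list of (seq, record) pairs."""
    def __init__(self, records):
        self._records = records
        self._pos = 0
        self.committed = None  # last sequence number finalized by the runner

    def advance(self):
        if self._pos >= len(self._records):
            return False
        self._pos += 1
        return True

    def get_checkpoint_mark(self):
        seq = self._records[self._pos - 1][0]
        return CheckpointMark(seq, self._commit)

    def _commit(self, seq):
        self.committed = seq

reader = UnboundedReader([(1, "a"), (2, "b"), (3, "c")])
while reader.advance():
    pass
mark = reader.get_checkpoint_mark()
mark.finalize_checkpoint()
print(reader.committed)  # -> 3
```

The key point, matching Alexey's description, is that commit happens in the finalize step driven by the runner, not eagerly at read time.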

Re: Checkpointing Dataflow Pipeline

2021-04-06 Thread Kenneth Knowles
This sounds similar to the "Kafka Commit" in https://github.com/apache/beam/pull/12572 by +Boyuan Zhang and also to how PubsubIO ACKs messages in the finalizer. I don't know much about KinesisIO or how Kinesis works. I was just asking to clarify, in case other folks know more, like +Alexey

Re: Checkpointing Dataflow Pipeline

2021-04-06 Thread Michael Luckey
Hi Kenn, yes, resuming reading at the proper timestamp is exactly the issue we are currently struggling with. E.g. with the Kinesis Client Library we could store the last read position in some DynamoDB table. This mechanism is not used with Beam; as we understand it, the runner is responsible for tracking that

Re: Checkpointing Dataflow Pipeline

2021-04-06 Thread Kenneth Knowles
I would assume the main issue is resuming reading from the Kinesis stream from the last read? In the case for Pubsub (just as another example of the idea) this is part of the internal state of a pre-created subscription. Kenn On Tue, Apr 6, 2021 at 1:26 PM Michael Luckey wrote: > Hi list, > >

Checkpointing Dataflow Pipeline

2021-04-06 Thread Michael Luckey
Hi list, in our current project we are implementing our streaming pipeline on Google Dataflow. Essentially we receive input via Kinesis, do some filtering, enrichment and sessionizing, and output to PubSub and/or Google Cloud Storage. After a short investigation it is not clear to us how
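The stages Michael describes (filter, enrich, sessionize, then write out) can be sketched as a toy pure-Python pipeline. This is not Beam code; the field names, the enrichment attribute, and the sessionizing rule (group events per user) are invented purely to make the shape of the pipeline concrete.

```python
from itertools import groupby

def run_pipeline(events):
    # Filter: drop malformed events that lack a user id.
    valid = [e for e in events if e.get("user")]
    # Enrich: attach a derived attribute to each event.
    enriched = [{**e, "source": "kinesis"} for e in valid]
    # Sessionize: group each user's events, ordered by timestamp.
    enriched.sort(key=lambda e: (e["user"], e["ts"]))
    sessions = {
        user: [e["ts"] for e in grp]
        for user, grp in groupby(enriched, key=lambda e: e["user"])
    }
    return sessions  # in the real pipeline these would go to PubSub / GCS

events = [
    {"user": "u1", "ts": 1},
    {"user": None, "ts": 2},   # filtered out
    {"user": "u1", "ts": 3},
    {"user": "u2", "ts": 2},
]
print(run_pipeline(events))  # -> {'u1': [1, 3], 'u2': [2]}
```

In the actual Dataflow job each of these steps would be a transform over an unbounded input, which is where the checkpointing question discussed in the rest of the thread comes in.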