Hi Kenn,

yes, resuming reading at the proper timestamp is exactly the issue we are currently struggling with. E.g. with the Kinesis Client Library we could store the last read position in some DynamoDB table. As we understand it, this mechanism is not used with Beam; instead, the runner is responsible for tracking that checkpoint mark.
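For reference, a minimal sketch of how we wire up the source today (stream name, credentials and region are placeholders; the builder calls are those of the org.apache.beam.sdk.io.kinesis module as we understand it):

import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.kinesis.KinesisIO;
import org.apache.beam.sdk.io.kinesis.KinesisRecord;
import org.apache.beam.sdk.values.PCollection;

import com.amazonaws.regions.Regions;
import com.amazonaws.services.kinesis.clientlibrary.lib.worker.InitialPositionInStream;

public class KinesisReadSketch {
  public static void main(String[] args) {
    Pipeline p = Pipeline.create();

    // Unlike the KCL there is no external lease/checkpoint table here; the
    // reader's checkpoint mark is supposed to be held by the runner as part
    // of the job's own state.
    PCollection<KinesisRecord> records =
        p.apply("ReadKinesis",
            KinesisIO.read()
                .withStreamName("my-stream") // placeholder
                .withInitialPositionInStream(InitialPositionInStream.LATEST)
                .withAWSClientsProvider(
                    "ACCESS_KEY", "SECRET_KEY", Regions.US_EAST_1)); // placeholders

    // ... filtering, enrichment, sessionizing, writes to PubSub/GCS ...

    p.run();
  }
}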
Now, obviously when restarting the pipeline, e.g. after a non-compatible upgrade where an in-place pipeline update is just not feasible, there must be some mechanism in place for Dataflow to know where to continue. Is that simply the pipeline name? Or is there more involved? So how does checkpointing actually work here? If it is based on the 'name', wouldn't that imply that something like the following (example taken from https://cloud.google.com/dataflow/docs/guides/templates/using-flex-templates)

export REGION="us-central1"

gcloud dataflow flex-template run "streaming-beam-sql-`date +%Y%m%d-%H%M%S`" \
  --template-file-gcs-location "$TEMPLATE_PATH" \
  --parameters inputSubscription="$SUBSCRIPTION" \
  --parameters outputTable="$PROJECT:$DATASET.$TABLE" \
  --region "$REGION"

will not resume from the last read on rerun, because the name obviously changes here? If so, we would probably have to track the resume position ourselves; a sketch of that fallback follows the quoted thread below.

best,

michel

On Tue, Apr 6, 2021 at 10:38 PM Kenneth Knowles <[email protected]> wrote:

> I would assume the main issue is resuming reading from the Kinesis stream
> from the last read? In the case of Pubsub (just as another example of the
> idea) this is part of the internal state of a pre-created subscription.
>
> Kenn
>
> On Tue, Apr 6, 2021 at 1:26 PM Michael Luckey <[email protected]> wrote:
>
>> Hi list,
>>
>> with our current project we are implementing our streaming pipeline
>> based on Google Dataflow.
>>
>> Essentially we receive input via Kinesis, do some filtering,
>> enrichment and sessionizing, and output to PubSub and/or Google Storage.
>>
>> After a short investigation it is not clear to us how checkpointing
>> works when running on Dataflow in connection with KinesisIO. Is there
>> any documentation/discussion to get a better understanding of how that
>> works? Especially if we are forced to restart our pipelines, how could
>> we ensure not to lose any events?
>>
>> As far as I understand it currently, it should work 'auto-magically',
>> but it is not yet clear to us how it will actually behave. Before we
>> start testing our expectations or even try to implement some watermark
>> tracking ourselves, we hoped to get some insights from other users here.
>>
>> Any help appreciated.
>>
>> Best,
>>
>> michel
>>
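P.S. Here is the fallback sketch mentioned above, assuming a fresh job cannot inherit the old job's checkpoint: we would persist a high-watermark timestamp ourselves and hand it back to KinesisIO on startup. The helper loadLastProcessedTimestamp() is hypothetical and stands in for whatever store we would use (e.g. a small state file on GCS); withInitialTimestampInStream is the KinesisIO builder call as we understand it:

import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.kinesis.KinesisIO;
import org.apache.beam.sdk.io.kinesis.KinesisRecord;
import org.apache.beam.sdk.values.PCollection;
import org.joda.time.Instant;

import com.amazonaws.regions.Regions;

public class KinesisResumeSketch {

  // Hypothetical helper, not a Beam/Dataflow API: loads the last timestamp
  // the previous job reported as fully processed, e.g. from a file on GCS.
  static Instant loadLastProcessedTimestamp() {
    return Instant.parse("2021-04-06T20:00:00Z"); // stand-in value
  }

  public static void main(String[] args) {
    Pipeline p = Pipeline.create();

    // Start reading at the tracked timestamp instead of LATEST/TRIM_HORIZON.
    // Anything between that timestamp and the old job's true stop point will
    // be re-read, so downstream processing must tolerate duplicates.
    PCollection<KinesisRecord> records =
        p.apply("ReadKinesisFromTimestamp",
            KinesisIO.read()
                .withStreamName("my-stream") // placeholder
                .withInitialTimestampInStream(loadLastProcessedTimestamp())
                .withAWSClientsProvider(
                    "ACCESS_KEY", "SECRET_KEY", Regions.US_EAST_1)); // placeholders

    p.run();
  }
}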
