Hi Kenn,

yes, resuming reading at the proper timestamp is exactly the issue we are currently struggling with. E.g. with the Kinesis Client Library we could store the last read position in some DynamoDB table. As we understand it, this mechanism is not used with Beam; instead, the runner is responsible for tracking that checkpoint mark.
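For reference, a minimal sketch of how we wire up the source today (stream name, credentials and region are placeholders; the builder calls are those of the org.apache.beam.sdk.io.kinesis module as we understand it):

import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.kinesis.KinesisIO;
import org.apache.beam.sdk.io.kinesis.KinesisRecord;
import org.apache.beam.sdk.values.PCollection;

import com.amazonaws.regions.Regions;
import com.amazonaws.services.kinesis.clientlibrary.lib.worker.InitialPositionInStream;

public class KinesisReadSketch {
  public static void main(String[] args) {
    Pipeline p = Pipeline.create();

    // Unlike the KCL there is no external lease/checkpoint table here; the
    // reader's checkpoint mark is supposed to be held by the runner as part
    // of the job's own state.
    PCollection<KinesisRecord> records =
        p.apply("ReadKinesis",
            KinesisIO.read()
                .withStreamName("my-stream") // placeholder
                .withInitialPositionInStream(InitialPositionInStream.LATEST)
                .withAWSClientsProvider(
                    "ACCESS_KEY", "SECRET_KEY", Regions.US_EAST_1)); // placeholders

    // ... filtering, enrichment, sessionizing, writes to PubSub/GCS ...

    p.run();
  }
}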
Now, obviously when restarting the pipeline, e.g. after a non-compatible upgrade where an in-place pipeline update is just not feasible, there must be some mechanism in place for Dataflow to know where to continue. Is that simply the pipeline name? Or is there more involved? So how does checkpointing actually work here? If it is based on the 'name', wouldn't that imply that something like the following (example taken from https://cloud.google.com/dataflow/docs/guides/templates/using-flex-templates)

export REGION="us-central1"

gcloud dataflow flex-template run "streaming-beam-sql-`date +%Y%m%d-%H%M%S`" \
  --template-file-gcs-location "$TEMPLATE_PATH" \
  --parameters inputSubscription="$SUBSCRIPTION" \
  --parameters outputTable="$PROJECT:$DATASET.$TABLE" \
  --region "$REGION"

will not resume from the last read on rerun, because the name obviously changes here? If so, we would probably have to track the resume position ourselves; a sketch of that fallback follows the quoted thread below.

best,

michel

On Tue, Apr 6, 2021 at 10:38 PM Kenneth Knowles <[email protected]> wrote:

> I would assume the main issue is resuming reading from the Kinesis stream
> from the last read? In the case of Pubsub (just as another example of the
> idea) this is part of the internal state of a pre-created subscription.
>
> Kenn
>
> On Tue, Apr 6, 2021 at 1:26 PM Michael Luckey <[email protected]> wrote:
>
>> Hi list,
>>
>> with our current project we are implementing our streaming pipeline
>> based on Google Dataflow.
>>
>> Essentially we receive input via Kinesis, do some filtering,
>> enrichment and sessionizing, and output to PubSub and/or Google Storage.
>>
>> After a short investigation it is not clear to us how checkpointing
>> works when running on Dataflow in connection with KinesisIO. Is there
>> any documentation/discussion to get a better understanding of how that
>> works? Especially if we are forced to restart our pipelines, how could
>> we ensure not to lose any events?
>>
>> As far as I understand it currently, it should work 'auto-magically',
>> but it is not yet clear to us how it will actually behave. Before we
>> start testing our expectations or even try to implement some watermark
>> tracking ourselves, we hoped to get some insights from other users here.
>>
>> Any help appreciated.
>>
>> Best,
>>
>> michel
>>
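P.S. Here is the fallback sketch mentioned above, assuming a fresh job cannot inherit the old job's checkpoint: we would persist a high-watermark timestamp ourselves and hand it back to KinesisIO on startup. The helper loadLastProcessedTimestamp() is hypothetical and stands in for whatever store we would use (e.g. a small state file on GCS); withInitialTimestampInStream is the KinesisIO builder call as we understand it:

import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.kinesis.KinesisIO;
import org.apache.beam.sdk.io.kinesis.KinesisRecord;
import org.apache.beam.sdk.values.PCollection;
import org.joda.time.Instant;

import com.amazonaws.regions.Regions;

public class KinesisResumeSketch {

  // Hypothetical helper, not a Beam/Dataflow API: loads the last timestamp
  // the previous job reported as fully processed, e.g. from a file on GCS.
  static Instant loadLastProcessedTimestamp() {
    return Instant.parse("2021-04-06T20:00:00Z"); // stand-in value
  }

  public static void main(String[] args) {
    Pipeline p = Pipeline.create();

    // Start reading at the tracked timestamp instead of LATEST/TRIM_HORIZON.
    // Anything between that timestamp and the old job's true stop point will
    // be re-read, so downstream processing must tolerate duplicates.
    PCollection<KinesisRecord> records =
        p.apply("ReadKinesisFromTimestamp",
            KinesisIO.read()
                .withStreamName("my-stream") // placeholder
                .withInitialTimestampInStream(loadLastProcessedTimestamp())
                .withAWSClientsProvider(
                    "ACCESS_KEY", "SECRET_KEY", Regions.US_EAST_1)); // placeholders

    p.run();
  }
}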
