This sounds similar to the "Kafka Commit" in
https://github.com/apache/beam/pull/12572 by +Boyuan Zhang
<boyu...@google.com> and also to how PubsubIO ACKs messages in the
finalizer. I don't know much about KinesisIO or how Kinesis works. I was
just asking to clarify, in case other folks know more, like +Alexey
Romanenko <aromanenko....@gmail.com> and +Ismaël Mejía <ieme...@gmail.com>, who
have modified KinesisIO. If the feature does not exist today, perhaps we
can identify the best practices around this pattern.
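
For reference, the hook PubsubIO uses for its ACKs is
UnboundedSource.CheckpointMark.finalizeCheckpoint(), which the runner calls
after a checkpoint has been durably committed. A minimal sketch of the
pattern, with a hypothetical offset store standing in for the external
system:

  import java.io.IOException;
  import java.io.Serializable;
  import org.apache.beam.sdk.io.UnboundedSource;

  // Sketch of the finalizer pattern: commit read positions to the external
  // system only after the runner has durably checkpointed them.
  public class SequenceCheckpointMark implements UnboundedSource.CheckpointMark {

    // Hypothetical external store (a Kafka offset commit, a Pubsub ACK, or
    // a DynamoDB lease-table write would play this role).
    public interface OffsetStore extends Serializable {
      void commit(String shardId, String sequenceNumber) throws IOException;
    }

    private final OffsetStore store;
    private final String shardId;
    private final String lastSequenceNumber;

    public SequenceCheckpointMark(OffsetStore store, String shardId,
        String lastSequenceNumber) {
      this.store = store;
      this.shardId = shardId;
      this.lastSequenceNumber = lastSequenceNumber;
    }

    @Override
    public void finalizeCheckpoint() throws IOException {
      // Runs only once the runner has persisted the checkpoint, so the
      // external commit never gets ahead of the runner's own state.
      store.commit(shardId, lastSequenceNumber);
    }
  }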

Kenn

On Tue, Apr 6, 2021 at 1:59 PM Michael Luckey <adude3...@gmail.com> wrote:

> Hi Kenn,
>
> yes, resuming reading at the proper timestamp is exactly the issue we are
> currently struggling with. E.g. with the Kinesis Client Library we could
> store the last read position in a DynamoDB table. This mechanism is not
> used with Beam; as we understand it, the runner is responsible for
> tracking that checkpoint mark.
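>
> For illustration, the kind of explicit restart position we would expect
> to fall back on if no checkpoint survives a restart looks roughly like
> this (stream name and timestamp are placeholders, and credentials/region
> configuration is omitted):
>
>   import org.apache.beam.sdk.Pipeline;
>   import org.apache.beam.sdk.io.kinesis.KinesisIO;
>   import org.apache.beam.sdk.io.kinesis.KinesisRecord;
>   import org.apache.beam.sdk.options.PipelineOptionsFactory;
>   import org.apache.beam.sdk.values.PCollection;
>   import org.joda.time.Instant;
>
>   Pipeline pipeline = Pipeline.create(PipelineOptionsFactory.create());
>
>   // Sketch: on a fresh start, tell KinesisIO explicitly where to begin
>   // reading, e.g. from a timestamp we track ourselves.
>   PCollection<KinesisRecord> records =
>       pipeline.apply(
>           KinesisIO.read()
>               .withStreamName("my-stream")
>               .withInitialTimestampInStream(Instant.parse("2021-04-06T00:00:00Z")));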
>
> Now, obviously on restarting the pipeline, e.g. after a non-compatible
> upgrade where a pipeline update is just not feasible, there must be some
> mechanism in place for Dataflow to know where to continue. Is that simply
> the pipeline name? Or is there more involved? So how does checkpointing
> actually work here?
>
> If it is based on the 'name', wouldn't that imply that something like the
> following (example taken from
> https://cloud.google.com/dataflow/docs/guides/templates/using-flex-templates
> )
>
>   export REGION="us-central1"
>
>   gcloud dataflow flex-template run "streaming-beam-sql-`date +%Y%m%d-%H%M%S`" \
>     --template-file-gcs-location "$TEMPLATE_PATH" \
>     --parameters inputSubscription="$SUBSCRIPTION" \
>     --parameters outputTable="$PROJECT:$DATASET.$TABLE" \
>     --region "$REGION"
>
> will not resume from the last read position on rerun, because the name
> obviously changes here?
>
> best,
>
> michel
>
>
>
> On Tue, Apr 6, 2021 at 10:38 PM Kenneth Knowles <k...@apache.org> wrote:
>
>> I would assume the main issue is resuming reading from the Kinesis stream
>> at the last read position? In the case of Pubsub (just as another example
>> of the idea) this is part of the internal state of a pre-created
>> subscription.
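>>
>> As a sketch of what I mean (subscription path is a placeholder): since
>> the subscription itself remembers what has been ACKed, any new pipeline
>> reading from it resumes where the previous one left off.
>>
>>   import org.apache.beam.sdk.Pipeline;
>>   import org.apache.beam.sdk.io.gcp.pubsub.PubsubIO;
>>   import org.apache.beam.sdk.io.gcp.pubsub.PubsubMessage;
>>   import org.apache.beam.sdk.options.PipelineOptionsFactory;
>>   import org.apache.beam.sdk.values.PCollection;
>>
>>   Pipeline pipeline = Pipeline.create(PipelineOptionsFactory.create());
>>
>>   // Sketch: the read position lives in the pre-created subscription, not
>>   // in the pipeline, so a restarted job simply drains the un-ACKed backlog.
>>   PCollection<PubsubMessage> messages =
>>       pipeline.apply(
>>           PubsubIO.readMessages()
>>               .fromSubscription("projects/my-project/subscriptions/my-sub"));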
>>
>> Kenn
>>
>> On Tue, Apr 6, 2021 at 1:26 PM Michael Luckey <adude3...@gmail.com>
>> wrote:
>>
>>> Hi list,
>>>
>>> in our current project we are implementing our streaming pipeline on
>>> Google Dataflow.
>>>
>>> Essentially we receive input via Kinesis, do some filtering, enrichment
>>> and sessionizing, and output to PubSub and/or Google Cloud Storage.
>>>
>>> After a short investigation it is not clear to us how checkpointing
>>> works when running on Dataflow in connection with KinesisIO. Is there any
>>> documentation or discussion that would give us a better understanding of
>>> how it works? Especially if we are forced to restart our pipelines, how
>>> can we ensure that we do not lose any events?
>>>
>>> As far as I currently understand, it should work 'auto-magically', but
>>> it is not yet clear to us how it will actually behave. Before we start
>>> testing our expectations, or even try to implement some watermark
>>> tracking ourselves (a rough sketch follows), we hoped to get some
>>> insights from other users here.
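>>>
>>> In case it helps the discussion, the kind of manual tracking we have in
>>> mind is roughly the following, where PositionStore is a hypothetical
>>> durable store (e.g. a DynamoDB table):
>>>
>>>   import org.apache.beam.sdk.io.kinesis.KinesisRecord;
>>>   import org.apache.beam.sdk.transforms.DoFn;
>>>   import org.joda.time.Instant;
>>>
>>>   // Hypothetical durable store for the latest position seen per shard.
>>>   interface PositionStore {
>>>     static void recordIfNewer(String shardId, Instant arrivalTime) {
>>>       // e.g. a conditional DynamoDB write keeping max(arrivalTime)
>>>     }
>>>   }
>>>
>>>   // Sketch: pass records through unchanged while recording the latest
>>>   // arrival timestamp per shard, so that after a non-updatable restart
>>>   // we could resume via withInitialTimestampInStream(...).
>>>   class TrackPositionFn extends DoFn<KinesisRecord, KinesisRecord> {
>>>     @ProcessElement
>>>     public void processElement(ProcessContext c) {
>>>       KinesisRecord record = c.element();
>>>       PositionStore.recordIfNewer(
>>>           record.getShardId(), record.getApproximateArrivalTimestamp());
>>>       c.output(record);
>>>     }
>>>   }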
>>>
>>> Any help appreciated.
>>>
>>> Best,
>>>
>>> michel
>>>
>>
