Hi,
thanks for your help here. Really appreciated!
@alexey We already found your answer (it magically showed up in some search
engine results) and, IIRC, while we were still using Spark as the underlying
runner on some other project, I do think we relied on that mentioned
mechanism. But as with our
On Wed, Apr 7, 2021 at 11:55 AM Kenneth Knowles wrote:
[I think this has graduated to a +dev thread]
Yea, in Beam it is left up to the IOs primarily, hence the bundle
finalization step, though runners are allowed to have their own features, of
course. Dataflow also has an in-place pipeline update that restores the
persisted checkpoints from one pipeline
Looks like this is a common source of confusion; I had similar questions
about checkpointing in the Beam Slack.
In Spark Structured Streaming, checkpoints are saved to an *external* HDFS
location and persist *beyond* each run, so in the event of a stream
crashing, you can just point your next
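To make the Spark Structured Streaming behavior described above concrete, here is a minimal PySpark configuration sketch. The source, sink, paths, and app name are all placeholders, not anything from this thread; the only point is that `checkpointLocation` names an external directory that outlives the run, so a restarted query resumes from the last committed offsets.

```python
# Configuration sketch only (illustrative paths and source/sink choices).
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("resume-demo").getOrCreate()

stream = (
    spark.readStream.format("kafka")  # source is illustrative
    .option("kafka.bootstrap.servers", "broker:9092")
    .option("subscribe", "events")
    .load()
)

query = (
    stream.writeStream.format("parquet")
    .option("path", "hdfs:///data/events")
    # The checkpoint directory persists beyond the run; pointing the next
    # run at the same path is what makes recovery after a crash work.
    .option("checkpointLocation", "hdfs:///checkpoints/events")
    .start()
)
```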
I can’t say exactly how it will work with Dataflow, but for the Spark Runner
I answered it here [1]:
“Since KinesisIO is based on UnboundedSource.CheckpointMark, it uses the
standard checkpoint mechanism, provided by Beam UnboundedSource.UnboundedReader.
Once a KinesisRecord has been read
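For readers unfamiliar with the mechanism Alexey refers to, the following is a deliberately simplified, library-free Python sketch of the idea behind `UnboundedSource.CheckpointMark`: the reader hands the runner a mark describing its progress, the runner persists it and finalizes it once the output is durable, and on restore a new reader is created from the persisted mark. All class and method names here are illustrative, not Beam's actual API.

```python
# Toy model of the checkpoint-mark contract (illustrative, not Beam's API).

class CheckpointMark:
    """Snapshot of reader progress; the runner persists this."""
    def __init__(self, next_offset):
        self.next_offset = next_offset
        self.finalized = False

    def finalize_checkpoint(self):
        # Called by the runner once output up to this mark is durable.
        # A real source would e.g. ACK messages or commit offsets here.
        self.finalized = True


class Reader:
    def __init__(self, stream, mark=None):
        # Resume from a restored mark, or start at the beginning.
        self.stream = stream
        self.pos = mark.next_offset if mark else 0

    def read_next(self):
        record = self.stream[self.pos]
        self.pos += 1
        return record

    def get_checkpoint_mark(self):
        return CheckpointMark(self.pos)


stream = ["a", "b", "c", "d"]

reader = Reader(stream)
first_bundle = [reader.read_next(), reader.read_next()]
mark = reader.get_checkpoint_mark()  # runner persists this
mark.finalize_checkpoint()           # runner confirms durability

# Simulated crash: a new reader restored from the persisted mark
# continues where the previous one stopped.
restored = Reader(stream, mark)
assert restored.read_next() == "c"
```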
This sounds similar to the "Kafka Commit" in
https://github.com/apache/beam/pull/12572 by +Boyuan Zhang
and also to how PubsubIO ACKs messages in the
finalizer. I don't know much about KinesisIO or how Kinesis works. I was
just asking to clarify, in case other folks know more, like +Alexey
Hi Kenn,
yes, resuming reading at the proper timestamp is exactly the issue we are
currently struggling with. E.g., with the Kinesis Client Library we could
store the last read position in a DynamoDB table. As we understand it, this
mechanism is not used with Beam; instead, the runner is responsible for
tracking that
I would assume the main issue is resuming reading from the Kinesis stream at
the last read position? In the case of Pubsub (just as another example of
the idea), this is part of the internal state of a pre-created subscription.
Kenn
On Tue, Apr 6, 2021 at 1:26 PM Michael Luckey wrote:
Hi list,
with our current project we are implementing our streaming pipeline based
on Google Dataflow.
Essentially we receive input via Kinesis, do some filtering, enrichment,
and sessionizing, and output to PubSub and/or Google Cloud Storage.
After a short investigation it is not clear to us how