We are attempting to build a Beam pipeline running on Dataflow that reads
from a Cloud Pub/Sub subscription and writes 10-minute windows to files on
GCS via FileIO. At-least-once completeness is critical for our use case, so
we are seeking the simplest possible solution with obvious and verifiable
delivery guarantees.
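
For concreteness, here is a minimal sketch of the intended pipeline. The
project, subscription, bucket, and shard count are placeholders:

    import org.apache.beam.sdk.Pipeline;
    import org.apache.beam.sdk.io.FileIO;
    import org.apache.beam.sdk.io.TextIO;
    import org.apache.beam.sdk.io.gcp.pubsub.PubsubIO;
    import org.apache.beam.sdk.options.PipelineOptionsFactory;
    import org.apache.beam.sdk.transforms.windowing.FixedWindows;
    import org.apache.beam.sdk.transforms.windowing.Window;
    import org.apache.beam.sdk.values.PCollection;
    import org.joda.time.Duration;

    public class PubsubToGcs {
      public static void main(String[] args) {
        Pipeline p = Pipeline.create(
            PipelineOptionsFactory.fromArgs(args).withValidation().create());

        // Read raw message payloads from the subscription (placeholder names).
        PCollection<String> messages = p.apply("ReadPubsub",
            PubsubIO.readStrings()
                .fromSubscription("projects/our-project/subscriptions/our-sub"));

        // Assign each element to a fixed 10-minute window.
        PCollection<String> windowed = messages.apply("Window10m",
            Window.<String>into(FixedWindows.of(Duration.standardMinutes(10))));

        // Write one set of files per window. withNumShards is required
        // for unbounded input; see the discussion below.
        windowed.apply("WriteFiles",
            FileIO.<String>write()
                .via(TextIO.sink())
                .to("gs://our-bucket/output/")
                .withNumShards(10));

        p.run();
      }
    }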

We had assumed we could avoid intermediate storage, but it appears that
FileIO requires explicit sharding when the input is unbounded, as it is
when reading from Pub/Sub, and that sharding implicitly introduces a
GroupByKey [1]. PubsubIO, in turn, acks messages at the first GroupByKey
operation [0]. This leaves us vulnerable to data loss if the Dataflow
service becomes unavailable or if we improperly drain a pipeline on
deletion or update.
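
As we understand it, fixed sharding expands to something like the
following. This is our paraphrase of what WriteFiles does internally, not
the actual implementation, but it shows where the ack-triggering
GroupByKey would come from:

    import java.util.concurrent.ThreadLocalRandom;
    import org.apache.beam.sdk.transforms.GroupByKey;
    import org.apache.beam.sdk.transforms.WithKeys;
    import org.apache.beam.sdk.values.KV;
    import org.apache.beam.sdk.values.PCollection;
    import org.apache.beam.sdk.values.TypeDescriptors;

    // Our paraphrase of WriteFiles under fixed sharding, not its real code.
    static PCollection<KV<Integer, Iterable<String>>> shardForWriting(
        PCollection<String> windowed, int numShards) {
      return windowed
          // Assign each element a random shard key in [0, numShards).
          .apply(WithKeys.of(
                  (String e) -> ThreadLocalRandom.current().nextInt(numShards))
              .withKeyType(TypeDescriptors.integers()))
          // Per [0], Pub/Sub messages are acked once this GroupByKey's
          // input has been durably checkpointed -- i.e., before the
          // grouped elements are actually written to GCS.
          .apply(GroupByKey.create());
    }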

Is this assessment correct? Would it be possible to delay Pub/Sub acks
until after the FileIO writes complete? Are there other workarounds we
could use to keep Dataflow checkpoint data from becoming critical state
for the completeness of our pipeline?

[0] https://beam.apache.org/releases/javadoc/2.8.0/org/apache/beam/sdk/io/gcp/pubsub/PubsubUnboundedSource.html
[1] https://beam.apache.org/releases/javadoc/2.8.0/org/apache/beam/sdk/io/FileIO.html
