We are attempting to build a Beam pipeline running on Dataflow that reads from a Cloud Pub/Sub subscription and writes 10-minute windows to files on GCS via FileIO. At-least-once completeness is critical in our use case, so we are seeking the simplest possible solution with obvious and verifiable delivery guarantees.
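For concreteness, here is a minimal sketch of the pipeline we have in mind (written against the Beam 2.8 Java SDK; the subscription, bucket path, and shard count are placeholders):

    import org.apache.beam.sdk.Pipeline;
    import org.apache.beam.sdk.io.FileIO;
    import org.apache.beam.sdk.io.TextIO;
    import org.apache.beam.sdk.io.gcp.pubsub.PubsubIO;
    import org.apache.beam.sdk.options.PipelineOptionsFactory;
    import org.apache.beam.sdk.transforms.windowing.FixedWindows;
    import org.apache.beam.sdk.transforms.windowing.Window;
    import org.joda.time.Duration;

    public class PubsubToGcs {
      public static void main(String[] args) {
        // Launched with --runner=DataflowRunner plus the usual project/region flags.
        Pipeline p = Pipeline.create(PipelineOptionsFactory.fromArgs(args).create());

        p.apply("ReadFromPubsub", PubsubIO.readStrings()
                .fromSubscription("projects/my-project/subscriptions/my-sub"))
         .apply("Window10m", Window.into(FixedWindows.of(Duration.standardMinutes(10))))
         .apply("WriteToGcs", FileIO.<String>write()
                .via(TextIO.sink())
                .to("gs://my-bucket/output/")
                // Explicit sharding is required because the input is unbounded;
                // this is the step that concerns us below.
                .withNumShards(10));

        p.run();
      }
    }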
We had assumed we could avoid intermediate storage, but it appears that FileIO requires explicit sharding when its input is unbounded (as with a Pub/Sub read), which implicitly inserts a GroupByKey [1], and that PubsubIO acks messages at the first GroupByKey operation [0]. That makes us vulnerable to data loss if the Dataflow service becomes unavailable, or if we delete or update the pipeline without draining it properly.

Is this assessment correct? Would it be possible to delay the Pub/Sub acks until after the FileIO writes complete? Are there other workarounds that would keep Dataflow checkpoint data from becoming critical state for the completeness of our pipeline?

[0] https://beam.apache.org/releases/javadoc/2.8.0/org/apache/beam/sdk/io/gcp/pubsub/PubsubUnboundedSource.html
[1] https://beam.apache.org/releases/javadoc/2.8.0/org/apache/beam/sdk/io/FileIO.html
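P.S. Since verifiable guarantees are the point for us, this is roughly how we would surface any implicit GroupByKey from the expanded graph before launching, using the SDK's pipeline visitor (a sketch only; `p` is the pipeline from the code above):

    import org.apache.beam.sdk.Pipeline;
    import org.apache.beam.sdk.runners.TransformHierarchy;
    import org.apache.beam.sdk.transforms.GroupByKey;

    // Walk the fully expanded pipeline and print the location of every
    // GroupByKey, so the shuffle introduced by FileIO's sharding is visible.
    p.traverseTopologically(new Pipeline.PipelineVisitor.Defaults() {
      @Override
      public void visitPrimitiveTransform(TransformHierarchy.Node node) {
        if (node.getTransform() instanceof GroupByKey) {
          System.out.println("GroupByKey at: " + node.getFullName());
        }
      }
    });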
