On Wed, Dec 5, 2018 at 11:13 AM Jeff Klukas <[email protected]> wrote:
> We are attempting to build a Beam pipeline running on Dataflow that reads
> from a Cloud Pub/Sub subscription and writes 10 minute windows to files on
> GCS via FileIO. At-least-once completeness is critical in our use case, so
> we are seeking the simplest possible solution with obvious and verifiable
> delivery guarantees.
>
> We had assumed we'd be able to avoid intermediate storage, but it appears
> that it's necessary to specify sharding for FileIO when reading from
> Pub/Sub, which implicitly calls GroupByKey [1], and that PubsubIO will ack
> messages at the first GroupByKey operation [0]. Thus, we become vulnerable
> to data loss scenarios if the Dataflow service becomes unavailable or if we
> improperly drain a pipeline on deletion or update.
>
> Is this assessment correct?

Yes.

> Would it be possible to delay Pub/Sub acks until after the FileIO writes
> complete?

There is no option to delay the acks for downstream stages.

> Are there other workarounds we could use to avoid Dataflow checkpoint data
> becoming critical state for the completeness of our pipeline?

I don't see a good way to avoid depending on Dataflow checkpoints for
guarantees on a Dataflow pipeline. Others might have ideas here.

If you are willing to use external storage (or another Pub/Sub topic), you
could achieve at-least-once delivery. In that case, check other threads on
the mailing list about the Wait() transform, which lets you take an action
after a certain operation is done (in this case, the FileIO write).

> [0]
> https://beam.apache.org/releases/javadoc/2.8.0/org/apache/beam/sdk/io/gcp/pubsub/PubsubUnboundedSource.html
> [1]
> https://beam.apache.org/releases/javadoc/2.8.0/org/apache/beam/sdk/io/FileIO.html
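For what it's worth, a rough sketch of the Wait()-based workaround might look like the following. This is only an illustration, not a tested pipeline: `subscription`, `outputPrefix`, and `doneTopic` are placeholders, and it assumes the Beam 2.8 Java SDK (FileIO, Wait, PubsubIO):

```java
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.FileIO;
import org.apache.beam.sdk.io.TextIO;
import org.apache.beam.sdk.io.WriteFilesResult;
import org.apache.beam.sdk.io.gcp.pubsub.PubsubIO;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.transforms.Wait;
import org.apache.beam.sdk.transforms.windowing.FixedWindows;
import org.apache.beam.sdk.transforms.windowing.Window;
import org.apache.beam.sdk.values.PCollection;
import org.joda.time.Duration;

public class WaitOnFileIoSketch {
  public static void main(String[] args) {
    // Placeholder resources -- substitute your own.
    String subscription = "projects/my-project/subscriptions/my-sub";
    String outputPrefix = "gs://my-bucket/output/";
    String doneTopic = "projects/my-project/topics/files-written";

    Pipeline pipeline = Pipeline.create(PipelineOptionsFactory.fromArgs(args).create());

    // Read from Pub/Sub and window into 10-minute fixed windows.
    PCollection<String> messages =
        pipeline
            .apply(PubsubIO.readStrings().fromSubscription(subscription))
            .apply(Window.into(FixedWindows.of(Duration.standardMinutes(10))));

    // Windowed FileIO write; unbounded input requires explicit sharding,
    // which is what introduces the GroupByKey (and hence the early ack).
    WriteFilesResult<Void> written =
        messages.apply(
            FileIO.<String>write()
                .via(TextIO.sink())
                .to(outputPrefix)
                .withNumShards(10));

    // Wait.on() holds the signal branch until the files for each window are
    // durably written, then publishes a marker to a secondary topic.
    messages
        .apply(Wait.on(written.getPerDestinationOutputFilenames()))
        .apply(PubsubIO.writeStrings().to(doneTopic));

    pipeline.run();
  }
}
```

A downstream consumer of `doneTopic` can then treat a window's data as complete only once the marker arrives, keeping the end-to-end guarantee outside of Dataflow's checkpoint state.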
