Thanks for the response, Raghu. Does this sound like a use case that would be worth supporting in the future, or does that fall outside of Beam/Dataflow's goals? Is there a general design principle of optimizing to make complex pipelines possible in favor of allowing detailed control for a simple pipeline like this?
In effect, would you recommend we write a custom application using the Pub/Sub and Storage SDKs directly rather than trying to use Beam's abstractions? On Wed, Dec 5, 2018 at 2:24 PM Raghu Angadi <[email protected]> wrote: > > On Wed, Dec 5, 2018 at 11:13 AM Jeff Klukas <[email protected]> wrote: > >> We are attempting to build a Beam pipeline running on Dataflow that reads >> from a Cloud Pub/Sub subscription and writes 10 minute windows to files on >> GCS via FileIO. At-least-once completeness is critical in our use case, so >> we are seeking the simplest possible solution with obvious and verifiable >> delivery guarantees. >> >> We had assumed we'd be able to avoid intermediate storage, but it appears >> that it's necessary to specify sharding for FileIO when reading from >> Pub/Sub which implicitly calls GroupByKey [1] and that PubsubIO will ack >> messages at the first GroupByKey operation [0]. Thus, we become vulnerable >> to data loss scenarios if the Dataflow service becomes unavailable or if we >> improperly drain a pipeline on deletion or update. >> >> Is this assessment correct? >> > Yes. > > >> Would it be possible to delay pubsub acks until after the FileIO writes >> complete? >> > > There is no option to delay the acks for downstream stages. > > >> Other workarounds we could use to avoid Dataflow checkpoint data becoming >> critical state for the completeness of our pipeline? >> > > I don't see a good way not to depend on Dataflow checkpoint for guarantees > on a Dataflow pipeline. Others might have ideas here. If you are willing to > use external storage (or another pubsub topic) you could achieve at least > once. In that case, check other threads on the mailing list about Wait() > transform that lets you take an action after certain operation is done (in > this case FileIO). > > >> [0] >> https://beam.apache.org/releases/javadoc/2.8.0/org/apache/beam/sdk/io/gcp/pubsub/PubsubUnboundedSource.html >> [1] >> https://beam.apache.org/releases/javadoc/2.8.0/org/apache/beam/sdk/io/FileIO.html >> >
