Thanks for the response, Raghu. Does this sound like a use case that would
be worth supporting in the future, or does it fall outside of
Beam/Dataflow's goals? Is there a general design principle of optimizing to
make complex pipelines possible rather than allowing detailed control over a
simple pipeline like this?

In effect, would you recommend we write a custom application using the
Pub/Sub and Storage SDKs directly rather than trying to use Beam's
abstractions?
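
To make the question concrete, the "custom application" alternative could be
sketched roughly as below. This is a minimal in-memory illustration of
acking only after a window's file has been written, not code against the
real Pub/Sub or Storage SDKs: FakeSubscription, FakeBucket, flush_windows
and batch_id are all hypothetical names invented for this sketch, and a real
client would additionally have to handle ack deadlines and retries.

```python
from collections import defaultdict
from dataclasses import dataclass

WINDOW_SECONDS = 600  # 10-minute windows, as in the pipeline described


@dataclass(frozen=True)
class Message:
    ack_id: str
    publish_time: float  # seconds since epoch
    data: bytes


class FakeSubscription:
    """Hypothetical in-memory stand-in for a Pub/Sub subscription."""

    def __init__(self, messages):
        self.unacked = {m.ack_id: m for m in messages}

    def pull(self):
        # Anything not yet acked is delivered again (at-least-once).
        return list(self.unacked.values())

    def ack(self, ack_ids):
        for ack_id in ack_ids:
            self.unacked.pop(ack_id, None)


class FakeBucket:
    """Hypothetical in-memory stand-in for a GCS bucket."""

    def __init__(self, fail_on=()):
        self.objects = {}
        self.fail_on = set(fail_on)

    def write(self, name, payload):
        if name in self.fail_on:
            raise IOError(f"simulated write failure for {name}")
        self.objects[name] = payload


def flush_windows(subscription, bucket, batch_id):
    """Write one object per window, acking a window only after its write."""
    windows = defaultdict(list)
    for msg in subscription.pull():
        start = int(msg.publish_time // WINDOW_SECONDS) * WINDOW_SECONDS
        windows[start].append(msg)

    for start, msgs in sorted(windows.items()):
        # batch_id keeps names unique per flush, so late or redelivered
        # messages never overwrite an object written by an earlier flush.
        name = f"window-{start}-{batch_id}.txt"
        payload = b"\n".join(m.data for m in msgs)
        try:
            bucket.write(name, payload)
        except IOError:
            continue  # no ack: the messages stay in the subscription
        subscription.ack([m.ack_id for m in msgs])
```

In this sketch a failed write leaves the affected messages unacked, so a
later call to flush_windows picks them up again. Duplicates across files are
possible, which is acceptable under an at-least-once requirement.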

On Wed, Dec 5, 2018 at 2:24 PM Raghu Angadi <[email protected]> wrote:

>
> On Wed, Dec 5, 2018 at 11:13 AM Jeff Klukas <[email protected]> wrote:
>
>> We are attempting to build a Beam pipeline running on Dataflow that reads
>> from a Cloud Pub/Sub subscription and writes 10 minute windows to files on
>> GCS via FileIO. At-least-once completeness is critical in our use case, so
>> we are seeking the simplest possible solution with obvious and verifiable
>> delivery guarantees.
>>
>> We had assumed we'd be able to avoid intermediate storage, but it appears
>> that it's necessary to specify sharding for FileIO when reading from
>> Pub/Sub, which implicitly calls GroupByKey [1], and that PubsubIO will ack
>> messages at the first GroupByKey operation [0]. Thus, we become vulnerable
>> to data loss if the Dataflow service becomes unavailable or if we
>> improperly drain a pipeline on deletion or update.
>>
>> Is this assessment correct?
>>
> Yes.
>
>
>> Would it be possible to delay pubsub acks until after the FileIO writes
>> complete?
>>
>
> There is no option to delay the acks for downstream stages.
>
>
>> Are there other workarounds we could use to avoid Dataflow checkpoint data
>> becoming critical state for the completeness of our pipeline?
>>
>
> I don't see a good way to avoid depending on the Dataflow checkpoint for
> guarantees on a Dataflow pipeline. Others might have ideas here. If you are
> willing to use external storage (or another Pub/Sub topic), you could
> achieve at-least-once delivery. In that case, check other threads on the
> mailing list about the Wait() transform, which lets you take an action
> after a certain operation (in this case FileIO) is done.
>
>
>> [0]
>> https://beam.apache.org/releases/javadoc/2.8.0/org/apache/beam/sdk/io/gcp/pubsub/PubsubUnboundedSource.html
>> [1]
>> https://beam.apache.org/releases/javadoc/2.8.0/org/apache/beam/sdk/io/FileIO.html
>>
>
