That is correct, Derek: Google Cloud Dataflow will only ack the message to
Pub/Sub when a bundle completes.
For very simple pipelines, fusion will make it so that all downstream
actions happen within the same bundle.
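
For illustration, here is a minimal sketch of the kind of simple fused
pipeline being discussed, using the Beam Python SDK (the project,
subscription, and topic names below are placeholders, not taken from this
thread):

    import apache_beam as beam
    from apache_beam.options.pipeline_options import PipelineOptions

    def transform(message):
        # If this raises, the fused bundle fails and is retried, and the
        # source message is NOT acked on the Pub/Sub subscription.
        return message.decode('utf-8').upper().encode('utf-8')

    options = PipelineOptions(streaming=True)

    with beam.Pipeline(options=options) as p:
        (p
         | 'Read' >> beam.io.ReadFromPubSub(
               subscription='projects/my-project/subscriptions/my-sub')
         | 'Transform' >> beam.Map(transform)
         | 'Write' >> beam.io.WriteToPubSub(
               topic='projects/my-project/topics/my-topic'))

When Dataflow fuses these three steps into one stage, the read, the
transform, and the write all run in the same bundle, so the message is
acked only once that whole bundle commits.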

On Thu, Jan 4, 2018 at 1:00 PM, Derek Hao Hu <[email protected]> wrote:

> Hi Pablo,
>
> *Regarding elements coming from PubSub into your pipeline:*
> Once the data enters your pipeline, it is 'acknowledged' on your PubSub
> subscription, and you won't be able to retrieve it again from PubSub on the
> same subscription.
>
> This part differs from my understanding of consuming Pub/Sub messages in
> the Dataflow pipeline. I think the message will only be committed when a
> PCollection in the pipeline gets materialized
> (https://stackoverflow.com/questions/41727106/when-does-dataflow-acknowledge-a-message-of-batched-items-from-pubsubio),
> which means that if the pipeline is not complicated, fusion optimization
> will fuse multiple stages together, and if any of these stages throws an
> exception, the Pub/Sub message won't be acknowledged. I've also verified
> this behavior.
>
> Let me know if my understanding is correct. :)
>
> Thanks,
>
> Derek
>
>
> On Thu, Jan 4, 2018 at 11:42 AM, Pablo Estrada <[email protected]> wrote:
>
>> I am not a streaming expert, but I will answer according to how I
>> understand the system, and others can correct me if I get something wrong.
>>
>> *Regarding elements coming from PubSub into your pipeline:*
>> Once the data enters your pipeline, it is 'acknowledged' on your PubSub
>> subscription, and you won't be able to retrieve it again from PubSub on the
>> same subscription.
>>
>> *Regarding elements stuck within your pipeline:*
>> Bundles in a streaming pipeline are executed and committed individually.
>> This means that one bundle may be stuck, while all other bundles may be
>> moving forward in your pipeline. In a case like this, you won't be able to
>> drain the pipeline because there is one bundle that can't be drained out
>> (because exceptions are thrown every time processing for it is attempted).
>> On the other hand, if you cancel your pipeline, the information
>> regarding the progress made by each bundle will be lost, so you will drop
>> the data that was stuck within your pipeline and was never written out.
>> (That data was also acked on your PubSub subscription, so it won't come
>> out of PubSub if you reattach to the same subscription later.) So cancel
>> may not be what you're looking for either.
>>
>> For cases like these, what you'd need to do is live-update your
>> pipeline with code that can handle the problems in your current pipeline.
>> This new code will replace the code in your pipeline stages, and Dataflow
>> will continue processing your data from the state it was in before the
>> update. This means that if there's one bundle that was stuck, it will be
>> retried against the new code, and it will finally make progress across
>> your pipeline. (See the sketch of the update options below the quoted
>> thread.)
>>
>> If you want to completely change or stop your pipeline without dropping
>> stuck bundles, you will still need to live-update it and then drain it.
>>
>> I hope that was clear. Let me know if you need more clarification - and
>> perhaps others will have more to add / correct.
>> Best!
>> -P.
>>
>> On Wed, Jan 3, 2018 at 3:09 AM Andrew Jones <[email protected]>
>> wrote:
>>
>>> Hi,
>>>
>>> I'd like to confirm Beam's data guarantees when used with Google Cloud
>>> PubSub and Cloud Storage and running on Dataflow. I can't find any explicit
>>> documentation on it.
>>>
>>> If the Beam job is running successfully, then I believe all data will be
>>> delivered to GCS at least once. If I stop the job with 'Drain', then any
>>> inflight data will be processed and saved.
>>>
>>> What happens if the Beam job is not running successfully, and maybe
>>> throwing exceptions? Will the data still be available in PubSub when I
>>> cancel (not drain) the job? Does a drain work successfully if the data
>>> cannot be written to GCS because of the exceptions?
>>>
>>> Thanks,
>>> Andrew
>>>
>>
>
>
> --
> Derek Hao Hu
>
> Software Engineer | Snapchat
> Snap Inc.
>
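
As a rough sketch of the live-update Pablo describes above (assuming the
Beam Python SDK on Dataflow; the project, region, and job name are
placeholders), the corrected pipeline is relaunched with the same job name
plus the --update flag:

    from apache_beam.options.pipeline_options import PipelineOptions

    # Relaunch the corrected pipeline as an in-place update of the running
    # job. The job name must match the job being replaced; the other values
    # here are placeholders.
    options = PipelineOptions([
        '--runner=DataflowRunner',
        '--project=my-project',
        '--region=us-central1',
        '--streaming',
        '--update',
        '--job_name=my-streaming-job',
    ])

Dataflow then carries the persisted state, including any stuck bundles,
over to the new code rather than dropping it.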
