That is correct, Derek. Google Cloud Dataflow will only ack a message to PubSub when a bundle containing it completes. For very simple pipelines, fusion will make it so that all downstream actions happen within the same bundle.
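[Editor's note: a minimal sketch, not from the thread, of the kind of "very simple pipeline" being discussed; the project, subscription, and bucket names are made up. Because Dataflow fuses the read, the ParDo, and the write into one stage, they run in the same bundle, so an uncaught exception in the ParDo prevents the bundle from committing and the Pub/Sub messages stay unacked and get redelivered.]

    import org.apache.beam.sdk.Pipeline;
    import org.apache.beam.sdk.io.TextIO;
    import org.apache.beam.sdk.io.gcp.pubsub.PubsubIO;
    import org.apache.beam.sdk.options.PipelineOptionsFactory;
    import org.apache.beam.sdk.transforms.DoFn;
    import org.apache.beam.sdk.transforms.ParDo;
    import org.apache.beam.sdk.transforms.windowing.FixedWindows;
    import org.apache.beam.sdk.transforms.windowing.Window;
    import org.joda.time.Duration;

    public class PubsubToGcs {

      // If this DoFn throws, the whole fused bundle fails and is retried;
      // the Pub/Sub messages in that bundle remain unacked until a bundle
      // containing them commits successfully.
      static class ParseFn extends DoFn<String, String> {
        @ProcessElement
        public void processElement(ProcessContext c) {
          c.output(c.element().toUpperCase());
        }
      }

      public static void main(String[] args) {
        Pipeline p = Pipeline.create(PipelineOptionsFactory.fromArgs(args).create());

        p.apply(PubsubIO.readStrings()
                 .fromSubscription("projects/my-project/subscriptions/my-sub"))
         // Windowing is needed so TextIO can do windowed writes on an unbounded source.
         .apply(Window.<String>into(FixedWindows.of(Duration.standardMinutes(5))))
         .apply(ParDo.of(new ParseFn()))
         .apply(TextIO.write()
                 .to("gs://my-bucket/output/part")
                 .withWindowedWrites()
                 .withNumShards(1));

        p.run();
      }
    }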
On Thu, Jan 4, 2018 at 1:00 PM, Derek Hao Hu <[email protected]> wrote:
> Hi Pablo,
>
> *Regarding elements coming from PubSub into your pipeline:*
> Once the data enters your pipeline, it is 'acknowledged' on your PubSub
> subscription, and you won't be able to retrieve it again from PubSub on
> the same subscription.
>
> This part differs from my understanding of consuming Pub/Sub messages in
> the Dataflow pipeline. I think the message will only be committed when a
> PCollection in the pipeline gets materialized
> (https://stackoverflow.com/questions/41727106/when-does-dataflow-acknowledge-a-message-of-batched-items-from-pubsubio),
> which means that if the pipeline is not complicated, fusion optimization
> will fuse multiple stages together, and if any of these stages throws an
> exception, the Pub/Sub message won't be acknowledged. I've also verified
> this behavior.
>
> Let me know if my understanding is correct. :)
>
> Thanks,
>
> Derek
>
> On Thu, Jan 4, 2018 at 11:42 AM, Pablo Estrada <[email protected]> wrote:
>
>> I am not a streaming expert, but I will answer according to how I
>> understand the system, and others can correct me if I get something wrong.
>>
>> *Regarding elements coming from PubSub into your pipeline:*
>> Once the data enters your pipeline, it is 'acknowledged' on your PubSub
>> subscription, and you won't be able to retrieve it again from PubSub on
>> the same subscription.
>>
>> *Regarding elements stuck within your pipeline:*
>> Bundles in a streaming pipeline are executed and committed individually.
>> This means that one bundle may be stuck, while all other bundles may be
>> moving forward in your pipeline. In a case like this, you won't be able
>> to drain the pipeline, because there is one bundle that can't be drained
>> out (exceptions are thrown every time processing for it is attempted).
>> On the other hand, if you cancel your pipeline, then the information
>> regarding the progress made by each bundle will be lost, so you will drop
>> the data that was stuck within your pipeline and was never written out.
>> (That data was also acked on your PubSub subscription, so it won't come
>> out of PubSub if you reattach to the same subscription later.) So cancel
>> may not be what you're looking for either.
>>
>> For cases like these, what you'd need to do is live-update your pipeline
>> with code that can handle the problems in your current pipeline. This new
>> code will replace the code in your pipeline stages, and then Dataflow
>> will continue processing your data in the state it was in before the
>> update. This means that if there's one bundle that was stuck, it will be
>> retried against the new code, and it will finally make progress across
>> your pipeline.
>>
>> If you want to completely change, or stop, your pipeline without dropping
>> stuck bundles, you will still need to live-update it and then drain it.
>>
>> I hope that was clear. Let me know if you need more clarification - and
>> perhaps others will have more to add / correct.
>> Best!
>> -P.
>>
>> On Wed, Jan 3, 2018 at 3:09 AM Andrew Jones <[email protected]> wrote:
>>
>>> Hi,
>>>
>>> I'd like to confirm Beam's data guarantees when used with Google Cloud
>>> PubSub and Cloud Storage and running on Dataflow. I can't find any
>>> explicit documentation on it.
>>>
>>> If the Beam job is running successfully, then I believe all data will be
>>> delivered to GCS at least once. If I stop the job with 'Drain', then any
>>> inflight data will be processed and saved.
>>>
>>> What happens if the Beam job is not running successfully, and maybe
>>> throwing exceptions? Will the data still be available in PubSub when I
>>> cancel (not drain) the job? Does a drain work successfully if the data
>>> cannot be written to GCS because of the exceptions?
>>>
>>> Thanks,
>>> Andrew
>>>
>>
>
> --
> Derek Hao Hu
>
> Software Engineer | Snapchat
> Snap Inc.
>
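[Editor's note: a hedged sketch of Pablo's "live-update the pipeline with code that can handle the problems" advice; the class, tag, and method names here are made up. The replacement DoFn catches the exception that was failing the stuck bundle and routes the bad record to a dead-letter output instead of throwing, so the bundle can commit and the job can then be drained cleanly.]

    import org.apache.beam.sdk.transforms.DoFn;
    import org.apache.beam.sdk.transforms.ParDo;
    import org.apache.beam.sdk.values.PCollection;
    import org.apache.beam.sdk.values.PCollectionTuple;
    import org.apache.beam.sdk.values.TupleTag;
    import org.apache.beam.sdk.values.TupleTagList;

    public class FailureSafeParse {

      static final TupleTag<String> PARSED = new TupleTag<String>() {};
      static final TupleTag<String> DEAD_LETTER = new TupleTag<String>() {};

      static class FailureSafeParseFn extends DoFn<String, String> {
        @ProcessElement
        public void processElement(ProcessContext c) {
          try {
            c.output(parse(c.element()));          // main output on success
          } catch (Exception e) {
            c.output(DEAD_LETTER, c.element());    // dead-letter instead of throwing
          }
        }

        private static String parse(String raw) {
          // Stand-in for whatever transformation was throwing in the stuck bundle.
          return raw.trim();
        }
      }

      static PCollectionTuple applyTo(PCollection<String> input) {
        return input.apply("FailureSafeParse",
            ParDo.of(new FailureSafeParseFn())
                 .withOutputTags(PARSED, TupleTagList.of(DEAD_LETTER)));
      }
    }

For Dataflow to accept the replacement as an in-place update rather than a new job, the updated pipeline generally has to be submitted with the update option and keep compatible transform names (or supply a transform name mapping).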
