Hi,

I have pretty much the same setup. Regarding Terraform and Dataflow on GCP:

- `terraform apply` checks whether there is already a Dataflow job running with the same `job_name`.

- If there is not, it creates a new one and waits until it is in the "running" state.

- If there is one already, it tries to update the job: it creates a new job with the same `job_name` running the new version of the code and sends an "update" signal to the old one. The old job then halts and waits for the new one to fully start and take over the old job's state. Once that's done, the old job goes into the "updated" state and the new one starts processing messages. If the new one fails, the old one resumes processing.

- Note: for this to work, the new code needs to be update-compatible with the old one. If it is not, the new job will fail, and the old job will fall slightly behind because it had to wait for the new job to fail. (See the sketch after this list for how this plays out in practice.)

- Note 2: there is a way to run only the compatibility verification, so that no new job is started but you still get a check that the new code is compatible with the running job, avoiding possible delays in the old job.

- Note 3: there is an entirely separate update type called an "in-flight update", which does not change the job itself but lets you change autoscaler parameters (like the maximum number of workers) without introducing any delays in the pipeline.
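
To make Notes 1 and 2 a bit more concrete, here is a minimal sketch of submitting a replacement ("update") job with the Beam Python SDK on the Dataflow runner. All names (project, region, bucket, topic, job and transform names) are placeholders, and the exact option spellings may vary between SDK versions; as far as I understand, this is the same update mechanism Terraform drives for you when the `job_name` matches.

```python
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

# All identifiers below (project, region, bucket, topic, job name) are
# placeholders for illustration only.
options = PipelineOptions(
    runner="DataflowRunner",
    project="my-project",
    region="europe-west1",
    job_name="my-streaming-job",  # must match the name of the running job
    temp_location="gs://my-bucket/tmp",
    streaming=True,
    update=True,                  # submit this run as a replacement ("update") job
    # If a step was renamed between versions, a mapping from old to new
    # transform names lets Dataflow still transfer its state, e.g.:
    # transform_name_mapping={"OldReadName": "ReadMessages"},
)

with beam.Pipeline(options=options) as p:
    (
        p
        # Explicit, stable step names are what the compatibility check uses
        # to match state between the old job and the replacement job.
        | "ReadMessages" >> beam.io.ReadFromPubSub(
            topic="projects/my-project/topics/my-topic")
        | "Process" >> beam.Map(lambda msg: msg)  # business logic goes here
    )
```

Regarding Note 2: if I remember correctly, Dataflow's `graph_validate_only` service option (passed via `--dataflow_service_options`) runs only this compatibility check without launching the replacement job, but please verify the exact name against the current Dataflow docs.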

Given the above context, more information would be needed to fully diagnose your issue, but you might be hitting the problem Robert mentioned:

- If you use a topic with PubSubIO, each new job creates its own subscription on the topic at graph construction time. So if there are messages in the old subscription that were not yet processed (and acked) by the old pipeline, and the old pipeline gets the "update" signal and halts, there can be a period during which messages end up in the old subscription but are not delivered to the new one.

Workarounds:

- use a subscription with PubSubIO (see the sketch after this list), or

- use random job names in Terraform and drain the old pipelines.
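
To illustrate the first workaround, here is a small sketch of the difference, again with the Beam Python SDK and placeholder resource names:

```python
import apache_beam as beam

# Reading from a topic: the Dataflow job creates its own subscription at graph
# construction time, so every replacement job starts from a fresh, empty
# subscription and anything left unacked in the old one is stranded.
read_from_topic = beam.io.ReadFromPubSub(
    topic="projects/my-project/topics/my-topic")

# Reading from a subscription created once outside the job (e.g. by Terraform):
# the old and the replacement job share the same backlog, so unacked messages
# are simply picked up by whichever job reads next instead of being lost.
read_from_subscription = beam.io.ReadFromPubSub(
    subscription="projects/my-project/subscriptions/my-sub")
```

With a shared, externally managed subscription, a drained or updated job's unacked backlog is redelivered to the next reader rather than left behind in a per-job subscription that nothing reads anymore.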

Note that all of the above is just a hypothesis, but hopefully it is helpful.

Best

Wiśniowski Piotr


On 16.04.2024 05:15, Juan Romero wrote:
The deployment of the job is done by Terraform. I verified, and it seems that Terraform does it incorrectly under the hood, because it stops the current job and starts a new one. Thanks for the information!

On Mon, 15 Apr 2024 at 6:42 PM Robert Bradshaw via user <user@beam.apache.org> wrote:

    Are you draining[1] your pipeline or simply canceling it and
    starting a new one? Draining should close open windows and attempt
    to flush all in-flight data before shutting down. For PubSub you
    may also need to read from subscriptions rather than topics to
    ensure messages are processed by either one or the other.

    [1]
    https://cloud.google.com/dataflow/docs/guides/stopping-a-pipeline#drain

    On Mon, Apr 15, 2024 at 9:33 AM Juan Romero <jsrf...@gmail.com> wrote:

        Hi guys. Good morning.

        I have done some tests with Apache Beam on Dataflow to see
        whether I can do a hot update or hot swap while the pipeline is
        processing a bunch of messages that fall in a 10-minute time
        window. What I saw is that when I do a hot update of the
        pipeline while there are still messages in the time window
        (before they are sent to the target), the current job is shut
        down and Dataflow creates a new one. The problem is that the
        messages that were being processed by the old job seem to be
        lost and are not picked up by the new one, which implies we are
        losing data.

        Can you help me or recommend a strategy?

        Thanks!!
