Re: Hot update in Dataflow without losing messages

2024-04-28 Thread Wiśniowski Piotr

Hi,

I have pretty much the same setup. Regarding Terraform and Dataflow on GCP:

- `terraform apply` checks whether there is already a Dataflow job running with the same `job_name`.


- If there is none, it creates a new job and waits until it is in the "running" state.


- If there is one already, it tries to update the job. That means it creates a new job with the same `job_name` running the new version of the code and sends an "update" signal to the old one. The old job then halts and waits for the new one to fully start and take over its state. Once that is done, the old job goes into the "updated" state and the new one processes the messages. If the new job fails, the old one resumes processing. (A minimal sketch of this update mechanism, seen from the pipeline side, follows this list.)


- Note: for this to work, the new code must be update-compatible with the old one. If it is not, the new job will fail, and the old job will fall slightly behind because it had to wait for the new job to fail.


- Note 2: there is a way to run only the compatibility check, so that the new job does not start but you still verify that it is compatible with the old one, avoiding possible delays in the old job.


- Note 3: there is an entirely separate update type called an "in-flight update", which does not replace the job but allows changing autoscaler parameters (such as the maximum number of workers) without introducing any delay in the pipeline.
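To make the update flow above more concrete, here is a minimal sketch using the Beam Python SDK; the project, bucket, subscription, topic, and job names are placeholders, and the assumption is that Terraform's Dataflow resource drives essentially the same `job_name`/`update` options through the Dataflow API:

```python
# Minimal sketch (Beam Python SDK): relaunching the pipeline with the same
# job_name and update=True asks Dataflow to update the running job instead
# of starting an independent one. All names below are placeholders.
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

options = PipelineOptions(
    runner="DataflowRunner",
    project="my-project",                      # placeholder
    region="us-central1",                      # placeholder
    temp_location="gs://my-bucket/tmp",        # placeholder
    streaming=True,
    job_name="my-streaming-job",               # must match the job being updated
    update=True,                               # request an update, not a fresh job
)

with beam.Pipeline(options=options) as p:
    (
        p
        | "Read" >> beam.io.ReadFromPubSub(
            subscription="projects/my-project/subscriptions/my-sub")      # placeholder
        | "Process" >> beam.Map(lambda msg: msg)                          # new version of the logic
        | "Write" >> beam.io.WriteToPubSub("projects/my-project/topics/out")  # placeholder
    )
```

If the new graph is not update-compatible with the old one, the new job fails and the old one resumes, as described in the first note above.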


Given the above context, more information is needed to fully diagnose your issue, but you might be hitting the problem Robert mentioned:


- If you use a topic for PubSubIO, each new job creates its own subscription on that topic at graph construction time. So if there are messages in the old subscription that were not yet processed (and acked) by the old pipeline when it receives the "update" signal and halts, there can be a window during which messages are delivered to the old subscription but never to the new one.


Workarounds:

- use a subscription for PubSubIO (see the sketch after this list), or

- use random job names in Terraform and drain the old pipelines.
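For the first workaround, a minimal Python SDK sketch (the subscription name and pipeline options are placeholders): the subscription is created once, outside of any job, so the old and the new job attach to the same durable subscription and no messages land in a subscription that no job will ever read.

```python
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

options = PipelineOptions(streaming=True)  # plus the Dataflow options from the sketch above

with beam.Pipeline(options=options) as p:
    # Topic-based read: each job creates its own subscription at graph
    # construction time (the behaviour described above), e.g.
    #   p | beam.io.ReadFromPubSub(topic="projects/my-project/topics/in")
    #
    # Subscription-based read: the subscription outlives any single job.
    messages = p | "ReadFromSub" >> beam.io.ReadFromPubSub(
        subscription="projects/my-project/subscriptions/my-durable-sub")  # placeholder
    _ = messages | "Process" >> beam.Map(lambda m: m)
```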

Note that all of the above is just a hypothesis, but hopefully it is helpful.

Best

Wiśniowski Piotr


On 16.04.2024 05:15, Juan Romero wrote:
The deployment of the job is done by Terraform. I checked, and it seems that Terraform does it incorrectly under the hood, because it stops the current job and starts a new one. Thanks for the information!


On Mon, 15 Apr 2024 at 6:42 PM Robert Bradshaw via user wrote:


Are you draining[1] your pipeline or simply canceling it and
starting a new one? Draining should close open windows and attempt
to flush all in-flight data before shutting down. For PubSub you
may also need to read from subscriptions rather than topics to
ensure messages are processed by either one or the other.

[1]
https://cloud.google.com/dataflow/docs/guides/stopping-a-pipeline#drain

On Mon, Apr 15, 2024 at 9:33 AM Juan Romero wrote:

Hi guys. Good morning.

I have run some tests with Apache Beam on Dataflow to see whether I can do a hot update (hot swap) while the pipeline is processing a batch of messages that fall into a 10-minute window. What I saw is that when I do a hot update on the pipeline while there are still messages in the window (before they are sent to the target), the current job is shut down and Dataflow creates a new one. The problem is that the messages that were being processed by the old job seem to be lost and are not picked up by the new one, which means we are losing data.

Can you help me or recommend a strategy?

Thanks!!

