Re: Backporting "MEDIUM: mworker: reexec in waitpid mode after successful loading" to 2.4

2022-05-13 Thread William Lallemand
On Tue, May 10, 2022 at 02:21:42PM +0200, William Lallemand wrote:
> On Tue, May 10, 2022 at 12:09:59PM +0200, Christian Ruppert wrote:
> > 
> > It even just happened when running with gdb, without a reload.
> > 
> 
> What the patch does is re-executing the master in wait-mode once the
> worker was started in order to free the master memory of huge data
> (maps, SSL certificates etc).
> 
> I thought your crash was unrelated, but what is weird is that you
> experienced a watchdog crash in the master... which is really
> surprising since the master does not do much.
> 
> > Does that help? Your mworker commit *seems*, at least at first
> > glance, to fix that. Without it I have multiple coredumps within 24 hours.
> > Often I can trigger some by just reloading/restarting. With your commit 
> > I couldn't for almost 24 hours + doing 100 reloads with 10s sleep 
> > between each.
> > Let me know if you want me to turn on some debug flags or something 
> > else. Or do you want a dump? I'd share it off-list then.
> 
> That does help indeed, but I will need a full coredump with the binaries
> to analyze what provoked this watchdog in the master!
> 
> Is it a problem you have had for a while, or did it happen with an update?
> It's not impossible that a fix provoked this.
> 

Christian and I had some private exchanges about this issue, which
resulted in a fix for the watchdog.

I pushed the fix to the master repository:
http://git.haproxy.org/?p=haproxy.git;a=commit;h=ae053b30da4db588f7fabe09e5f85cbebdc421ad

-- 
William Lallemand



Re: Backporting "MEDIUM: mworker: reexec in waitpid mode after successful loading" to 2.4

2022-05-10 Thread William Lallemand
On Tue, May 10, 2022 at 12:09:59PM +0200, Christian Ruppert wrote:
> 
> It even just happened when running with gdb, without a reload.
> 

What the patch does is re-executing the master in wait-mode once the
worker was started in order to free the master memory of huge data
(maps, SSL certificates etc).
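
To illustrate the idea only, here is a minimal sketch in Python of a master that loads data, forks a worker, and then re-executes itself in a wait-only mode. This is not HAProxy's code: the `--wait-mode` flag, `wait_mode_argv()` and `run_master()` are hypothetical names made up for this example, and the 50 MB buffer merely stands in for maps and certificates.

```python
import os
import sys

WAIT_FLAG = "--wait-mode"  # hypothetical flag, not an HAProxy option


def wait_mode_argv(argv):
    """Return the argv used to re-execute the master in wait mode."""
    if WAIT_FLAG in argv:
        return list(argv)
    return list(argv) + [WAIT_FLAG]


def run_master(argv):
    """Load data, fork a worker, then re-exec to drop the loaded data."""
    if WAIT_FLAG in argv:
        # Fresh process image: nothing big is loaded, we only reap workers.
        while True:
            try:
                pid, status = os.wait()
            except ChildProcessError:
                return 0  # no workers left
            print(f"worker {pid} exited with status {status}")

    big_data = bytearray(50 * 1024 * 1024)  # stands in for maps, certificates
    pid = os.fork()
    if pid == 0:
        # Worker: keeps its own copy of big_data and would serve traffic here.
        os._exit(0 if len(big_data) > 0 else 1)

    # Master: exec replaces the process image, so big_data is released,
    # while the PID and the parent/child relationship are preserved.
    os.execv(sys.executable, [sys.executable] + wait_mode_argv(argv))
```

Calling `run_master(sys.argv)` from a script on a Unix-like system would start the worker and leave a small master that only waits for it, since the re-executed process never rebuilds the large buffer.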

I thought your crash was unrelated, but what is weird is that you
experienced a watchdog crash in the master... which is really
surprising since the master does not do much.

> Does that help? Your mworker commit *seems*, at least at first
> glance, to fix that. Without it I have multiple coredumps within 24 hours.
> Often I can trigger some by just reloading/restarting. With your commit 
> I couldn't for almost 24 hours + doing 100 reloads with 10s sleep 
> between each.
> Let me know if you want me to turn on some debug flags or something 
> else. Or do you want a dump? I'd share it off-list then.

That does help indeed, but I will need a full coredump with the binaries
to analyze what provoked this watchdog in the master!

Is it a problem you have had for a while, or did it happen with an update?
It's not impossible that a fix provoked this.

-- 
William Lallemand



Re: Backporting "MEDIUM: mworker: reexec in waitpid mode after successful loading" to 2.4

2022-05-10 Thread William Lallemand
On Tue, May 10, 2022 at 11:05:01AM +0200, Christian Ruppert wrote:
> Hi guys, William,
> 
> can we please get that "MEDIUM: mworker: reexec in waitpid mode after 
> successful loading" - fab0fdce981149a4e3172f2b81113f505f415595 
> backported to 2.4?
> I seem to run into it, at least on one of our 40 LBs. This one is a VM 
> though. It sometimes crashes after each reload. Running 2.5 with 
> fab0fdce981149a4e3172f2b81113f505f415595 seems to fix the issue for me.
> 
> https://github.com/haproxy/haproxy/commit/fab0fdce981149a4e3172f2b81113f505f415595
> 

Hello Christian,

Honestly, we ran into a lot of issues and bugfixes after this patch was
pushed. I don't think it's even possible to backport this without
breaking 2.4: there are a lot of corner cases, and I don't want to
break this branch, which is pretty stable now.

2.5 has already been running with this architecture for a while in some
places, which makes it more robust, but it was not easy to get there.
Also, the next LTS version, 2.6, is almost there!

What kind of crashes are you experiencing? The patch is supposed to help
with possible OOM on reload when too much memory is consumed by the
master.

-- 
William Lallemand