Re: master-worker no-exit-on-failure with SO_REUSEPORT and a port being already in use
On 2019-11-20 11:05, William Lallemand wrote:
> On Wed, Nov 20, 2019 at 10:19:20AM +0100, Christian Ruppert wrote:
> > Hi William,
> >
> > thanks for the patch. I'll test it later today. What I actually wanted to
> > achieve is: https://cbonte.github.io/haproxy-dconv/2.0/management.html#4
> >
> > Then HAProxy tries to bind to all listening ports. If some fatal errors happen
> > (eg: address not present on the system, permission denied), the process quits
> > with an error. If a socket binding fails because a port is already in use,
> > then the process will first send a SIGTTOU signal to all the pids specified
> > in the "-st" or "-sf" pid list. This is what is called the "pause" signal. It
> > instructs all existing haproxy processes to temporarily stop listening to
> > their ports so that the new process can try to bind again. During this time,
> > the old process continues to process existing connections. If the binding
> > still fails (because for example a port is shared with another daemon), then
> > the new process sends a SIGTTIN signal to the old processes to instruct them
> > to resume operations just as if nothing happened. The old processes will then
> > restart listening to the ports and continue to accept connections. Note that
> > this mechanism is system
> >
> > In my test case though it failed to do so.
>
> Well, it only works with HAProxy processes, not with other processes. There is
> no mechanism to ask a process which is neither an haproxy process nor a process
> which uses SO_REUSEPORT.
>
> With HAProxy processes it will bind with SO_REUSEPORT, and will only use the
> SIGTTOU/SIGTTIN signals if it fails to do so.
>
> This part of the documentation is for HAProxy without master-worker mode. In
> master-worker mode, once the master is launched successfully it is never
> supposed to quit upon a reload (kill -USR2).
>
> During a reload in master-worker mode, the master will do a -sf .
> If the reload failed for any reason (bad configuration, unable to bind etc.),
> the behavior is to keep the previous workers. It only tries to kill the workers
> if the reload succeeds.
>
> So this is the default behavior.

Your patch seems to fix the issue. The master process won't exit anymore.
Fallback seems to work during my initial tests. Thanks!

-- 
Regards,
Christian Ruppert
Re: master-worker no-exit-on-failure with SO_REUSEPORT and a port being already in use
On Wed, Nov 20, 2019 at 10:19:20AM +0100, Christian Ruppert wrote:
> Hi William,
>
> thanks for the patch. I'll test it later today. What I actually wanted to
> achieve is: https://cbonte.github.io/haproxy-dconv/2.0/management.html#4
>
> Then HAProxy tries to bind to all listening ports. If some fatal errors happen
> (eg: address not present on the system, permission denied), the process quits
> with an error. If a socket binding fails because a port is already in use,
> then the process will first send a SIGTTOU signal to all the pids specified
> in the "-st" or "-sf" pid list. This is what is called the "pause" signal. It
> instructs all existing haproxy processes to temporarily stop listening to
> their ports so that the new process can try to bind again. During this time,
> the old process continues to process existing connections. If the binding
> still fails (because for example a port is shared with another daemon), then
> the new process sends a SIGTTIN signal to the old processes to instruct them
> to resume operations just as if nothing happened. The old processes will then
> restart listening to the ports and continue to accept connections. Note that
> this mechanism is system
>
> In my test case though it failed to do so.

Well, it only works with HAProxy processes, not with other processes. There is
no mechanism to ask a process which is neither an haproxy process nor a process
which uses SO_REUSEPORT.

With HAProxy processes it will bind with SO_REUSEPORT, and will only use the
SIGTTOU/SIGTTIN signals if it fails to do so.

This part of the documentation is for HAProxy without master-worker mode. In
master-worker mode, once the master is launched successfully it is never
supposed to quit upon a reload (kill -USR2).

During a reload in master-worker mode, the master will do a -sf .
If the reload failed for any reason (bad configuration, unable to bind etc.),
the behavior is to keep the previous workers. It only tries to kill the workers
if the reload succeeds.

So this is the default behavior.

-- 
William Lallemand
Re: master-worker no-exit-on-failure with SO_REUSEPORT and a port being already in use
Hi William,

thanks for the patch. I'll test it later today. What I actually wanted to
achieve is: https://cbonte.github.io/haproxy-dconv/2.0/management.html#4

Then HAProxy tries to bind to all listening ports. If some fatal errors happen
(eg: address not present on the system, permission denied), the process quits
with an error. If a socket binding fails because a port is already in use,
then the process will first send a SIGTTOU signal to all the pids specified
in the "-st" or "-sf" pid list. This is what is called the "pause" signal. It
instructs all existing haproxy processes to temporarily stop listening to
their ports so that the new process can try to bind again. During this time,
the old process continues to process existing connections. If the binding
still fails (because for example a port is shared with another daemon), then
the new process sends a SIGTTIN signal to the old processes to instruct them
to resume operations just as if nothing happened. The old processes will then
restart listening to the ports and continue to accept connections. Note that
this mechanism is system

In my test case though it failed to do so.

On 2019-11-19 17:27, William Lallemand wrote:
> On Tue, Nov 19, 2019 at 04:19:26PM +0100, William Lallemand wrote:
> > > I then add another bind for port 80, which is in use by squid already
> > > and try to reload HAProxy. It takes some time until it fails:
> > >
> > > Nov 19 14:39:21 894a0f616fec haproxy[2978]: [WARNING] 322/143921 (2978)
> > > : Reexecuting Master process
> > > ...
> > > Nov 19 14:39:28 894a0f616fec haproxy[2978]: [ALERT] 322/143922 (2978) :
> > > Starting frontend somefrontend: cannot bind socket [0.0.0.0:80]
> > > ...
> > > Nov 19 14:39:28 894a0f616fec systemd[1]: haproxy.service: Main process
> > > exited, code=exited, status=1/FAILURE
> > >
> > > The reload itself is still running (systemd) and will timeout after
> > > about 90s. After that, because of the Restart=always, I guess, it ends
> > > up in a restart loop.
> > >
> > > So I would have expected that the master process will fall back to the
> > > old process and proceed with the old child until the problem has been
> > > fixed.
> > >
>
> The patch in attachment fixes a bug where haproxy could reexecute itself in
> waitpid mode with -sf -1.
>
> I'm not sure this is your bug, but if this is the case you should see haproxy
> in waitpid mode, then the master exiting with the usage message in your logs.

-- 
Regards,
Christian Ruppert
Re: master-worker no-exit-on-failure with SO_REUSEPORT and a port being already in use
On Tue, Nov 19, 2019 at 04:19:26PM +0100, William Lallemand wrote:
> > I then add another bind for port 80, which is in use by squid already
> > and try to reload HAProxy. It takes some time until it failes:
> >
> > Nov 19 14:39:21 894a0f616fec haproxy[2978]: [WARNING] 322/143921 (2978)
> > : Reexecuting Master process
> > ...
> > Nov 19 14:39:28 894a0f616fec haproxy[2978]: [ALERT] 322/143922 (2978) :
> > Starting frontend somefrontend: cannot bind socket [0.0.0.0:80]
> > ...
> > Nov 19 14:39:28 894a0f616fec systemd[1]: haproxy.service: Main process
> > exited, code=exited, status=1/FAILURE
> >
> > The reload itself is still running (systemd) and will timeout after
> > about 90s. After that, because of the Restart=always, I guess, it ends
> > up in a restart loop.
> >
> > So I would have expected that the master process will fallback to the
> > old process and proceed with the old child until the problem has been
> > fixed.
> >

The patch in attachment fixes a bug where haproxy could reexecute itself in
waitpid mode with -sf -1.

I'm not sure this is your bug, but if this is the case you should see haproxy
in waitpid mode, then the master exiting with the usage message in your logs.

-- 
William Lallemand

From 481a3c62a622974587c731b1bdc1478538fd6527 Mon Sep 17 00:00:00 2001
From: William Lallemand
Date: Tue, 19 Nov 2019 17:04:18 +0100
Subject: [PATCH] BUG/MEDIUM: mworker: don't fill the -sf argument with -1
 during the reexec

Upon a reexec_on_failure, if the process tried to exit after the
initialization of the process structure but before it was filled with a
PID, the PID in the mworker_proc structure is set to -1.

In this particular case the -sf argument is filled with -1 and haproxy
will exit with the usage message because of that argument.

Should be backported in 2.0.
---
 src/haproxy.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/src/haproxy.c b/src/haproxy.c
index a0e630dfa..1d4771e64 100644
--- a/src/haproxy.c
+++ b/src/haproxy.c
@@ -673,7 +673,7 @@ void mworker_reload()
 		next_argv[next_argc++] = "-sf";
 
 		list_for_each_entry(child, &proc_list, list) {
-			if (!(child->options & (PROC_O_TYPE_WORKER|PROC_O_TYPE_PROG)))
+			if (!(child->options & (PROC_O_TYPE_WORKER|PROC_O_TYPE_PROG)) || child->pid <= -1 )
 				continue;
 			next_argv[next_argc] = memprintf(&msg, "%d", child->pid);
 			if (next_argv[next_argc] == NULL)
-- 
2.21.0
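
The effect of the one-line change can be sketched in isolation. The C below is
a simplified stand-in, not the HAProxy source: struct child_proc,
append_sf_args() and PID_STR_LEN are hypothetical names introduced for this
example. The sketch also emits "-sf" only once a valid PID has been found,
which is an assumption of this example and not something the patch does.

```c
#include <assert.h>
#include <stdio.h>
#include <sys/types.h>

#define PID_STR_LEN 16 /* hypothetical: enough room for a decimal PID */

/* Hypothetical, simplified stand-in for haproxy's struct mworker_proc. */
struct child_proc {
    pid_t pid;       /* stays -1 until the PID has been filled in */
    int   is_worker; /* stand-in for PROC_O_TYPE_WORKER|PROC_O_TYPE_PROG */
};

/*
 * Append "-sf <pid> <pid> ..." to argv, starting at index argc, and return
 * the new argc. Children that are not workers, or whose PID was never
 * filled in (pid <= -1), are skipped, so "-1" can never be passed to -sf.
 * pid_bufs provides one string buffer per child and must outlive argv.
 */
static int append_sf_args(const char **argv, int argc, int max_argc,
                          char (*pid_bufs)[PID_STR_LEN],
                          const struct child_proc *children, size_t n)
{
    int flag_added = 0;

    assert(argv != NULL && pid_bufs != NULL);

    for (size_t i = 0; i < n; i++) {
        if (!children[i].is_worker || children[i].pid <= -1)
            continue;

        /* one slot for the PID, plus one for "-sf" the first time */
        if (argc + (flag_added ? 1 : 2) > max_argc)
            break;

        if (!flag_added) {
            argv[argc++] = "-sf";
            flag_added = 1;
        }
        snprintf(pid_bufs[i], PID_STR_LEN, "%d", (int)children[i].pid);
        argv[argc++] = pid_bufs[i];
    }
    return argc;
}
```

With one started worker, one worker whose PID is still -1 and one non-worker
entry, only the started worker's PID ends up after "-sf". If no child has a
valid PID, the argument vector is left unchanged.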
Re: master-worker no-exit-on-failure with SO_REUSEPORT and a port being already in use
On Tue, Nov 19, 2019 at 03:45:09PM +0100, Christian Ruppert wrote:
> Hi list,
>

Hello,

> I'm facing some issues with already in use ports and the fallback
> feature, during a reload. SO_REUSEPORT already makes it easier/better
> but not perfect, as there are still cases where it fails.
> In my test case I've got a Squid running on port 80 and a HAProxy with
> "master-worker no-exit-on-failure".

The "no-exit-on-failure" option is only useful when you don't want the master
to kill all the HAProxy processes when one of the workers was killed by
something other than the master (segv, OOM, bug..). In this case you still need
another worker available to do the job. It's mostly used with a configuration
with nbproc > 1.

> I am using the shipped (2.0.8)
> systemd unit file and start up HAProxy with some frontend and a bind on
> like 1337 or something.
> I then add another bind for port 80, which is in use by squid already
> and try to reload HAProxy. It takes some time until it fails:
>
> Nov 19 14:39:21 894a0f616fec haproxy[2978]: [WARNING] 322/143921 (2978)
> : Reexecuting Master process
> ...
> Nov 19 14:39:28 894a0f616fec haproxy[2978]: [ALERT] 322/143922 (2978) :
> Starting frontend somefrontend: cannot bind socket [0.0.0.0:80]
> ...
> Nov 19 14:39:28 894a0f616fec systemd[1]: haproxy.service: Main process
> exited, code=exited, status=1/FAILURE
>
> The reload itself is still running (systemd) and will timeout after
> about 90s. After that, because of the Restart=always, I guess, it ends
> up in a restart loop.
>
> So I would have expected that the master process will fall back to the
> old process and proceed with the old child until the problem has been
> fixed.
>
> Can anybody confirm that? Is that intended?
>
> https://cbonte.github.io/haproxy-dconv/2.0/management.html#4
> https://cbonte.github.io/haproxy-dconv/2.0/configuration.html#3.1-master-worker
>

Looks like a bug to me, the master should have fallen back to the "waitpid
mode" in this case. Maybe we don't send the sd_notify OK when we are in waitpid
mode and systemd kills the process after the reload timeout.

I'll do some tests to check what's going on.

-- 
William Lallemand
master-worker no-exit-on-failure with SO_REUSEPORT and a port being already in use
Hi list,

I'm facing some issues with already in use ports and the fallback
feature, during a reload. SO_REUSEPORT already makes it easier/better
but not perfect, as there are still cases where it fails.
In my test case I've got a Squid running on port 80 and a HAProxy with
"master-worker no-exit-on-failure". I am using the shipped (2.0.8)
systemd unit file and start up HAProxy with some frontend and a bind on
like 1337 or something.
I then add another bind for port 80, which is in use by squid already
and try to reload HAProxy. It takes some time until it fails:

Nov 19 14:39:21 894a0f616fec haproxy[2978]: [WARNING] 322/143921 (2978) : Reexecuting Master process
...
Nov 19 14:39:28 894a0f616fec haproxy[2978]: [ALERT] 322/143922 (2978) : Starting frontend somefrontend: cannot bind socket [0.0.0.0:80]
...
Nov 19 14:39:28 894a0f616fec systemd[1]: haproxy.service: Main process exited, code=exited, status=1/FAILURE

The reload itself is still running (systemd) and will timeout after
about 90s. After that, because of the Restart=always, I guess, it ends
up in a restart loop.

So I would have expected that the master process will fall back to the
old process and proceed with the old child until the problem has been
fixed.

Can anybody confirm that? Is that intended?

https://cbonte.github.io/haproxy-dconv/2.0/management.html#4
https://cbonte.github.io/haproxy-dconv/2.0/configuration.html#3.1-master-worker

-- 
Regards,
Christian Ruppert