Re: haproxy-systemd-wrapper exit code problem

2016-11-03 Thread Willy Tarreau
Hi Gabriele,

I assigned some time on this issue and fixed it. The wrapper's code was
simply wrong as the wait()'s status must not be passed directly to exit() 
otherwise it truncates it. It's now OK, I've just merged the attached patch
that I'll backport to 1.6 and 1.5.

Thanks for reporting this bug,
Willy
>From f7659cb10cb0420c7ca06fad1067207021d2a078 Mon Sep 17 00:00:00 2001
From: Willy Tarreau 
Date: Thu, 3 Nov 2016 20:31:40 +0100
Subject: BUG/MEDIUM: systemd-wrapper: return correct exit codes
X-Bogosity: Ham, tests=bogofilter, spamicity=0.00, version=1.2.4

Gabriele Cerami reported the the exit codes of the systemd-wrapper are
wrong. In short, it directly returns the output of the wait syscall's
status, which is a composite value made of error code an signal numbers.
In general it contains the signal number on the lower bits and the error
code on the higher bits, but exit() truncates it to the lowest 8 bits,
causing config validations to incorrectly report a success. Example :

  $ ./haproxy-systemd-wrapper -c -f /dev/null
  <7>haproxy-systemd-wrapper: executing /tmp/haproxy -c -f /dev/null -Ds
  Configuration file has no error but will not start (no listener) => exit(2).
  <5>haproxy-systemd-wrapper: exit, haproxy RC=512
  $ echo $?
  0

If the process is killed however, the signal number is directly reported
in the exit code.

Let's fix all this to ensure that the exit code matches what the shell does,
which means that codes 0..127 are for exit codes, codes 128..254 for signals,
and code 255 for unknown exit code. Now the return code is correct :

  $ ./haproxy-systemd-wrapper -c -f /dev/null
  <7>haproxy-systemd-wrapper: executing /tmp/haproxy -c -f /dev/null -Ds
  Configuration file has no error but will not start (no listener) => exit(2).
  <5>haproxy-systemd-wrapper: exit, haproxy RC=2
  $ echo $?
  2

  $ ./haproxy-systemd-wrapper -f /tmp/cfg.conf
  <7>haproxy-systemd-wrapper: executing /tmp/haproxy -f /dev/null -Ds
  ^C
  <5>haproxy-systemd-wrapper: exit, haproxy RC=130
  $ echo $?
  130

This fix must be backported to 1.6 and 1.5.
---
 src/haproxy-systemd-wrapper.c | 10 ++
 1 file changed, 10 insertions(+)

diff --git a/src/haproxy-systemd-wrapper.c b/src/haproxy-systemd-wrapper.c
index b426f96..f6a9c85 100644
--- a/src/haproxy-systemd-wrapper.c
+++ b/src/haproxy-systemd-wrapper.c
@@ -295,6 +295,16 @@ int main(int argc, char **argv)
}
}
 
+   /* return either exit code or signal+128 */
+   if (WIFEXITED(status))
+   status = WEXITSTATUS(status);
+   else if (WIFSIGNALED(status))
+   status = 128 + WTERMSIG(status);
+   else if (WIFSTOPPED(status))
+   status = 128 + WSTOPSIG(status);
+   else
+   status = 255;
+
fprintf(stderr, SD_NOTICE "haproxy-systemd-wrapper: exit, haproxy 
RC=%d\n",
status);
return status;
-- 
1.7.12.1



Re: haproxy-systemd-wrapper exit code problem

2016-10-26 Thread Gabriele Cerami
On 26 Oct, Willy Tarreau wrote:
> to think of it. Are you sure your version doesn't contain certain patches
> affecting the error return path, maybe just some local debugging patches
> that were forgotten ?


I'm only seeing this applied before creating the package
https://git.centos.org/blob/rpms!haproxy.git/7463da74923be7492605bf742540ca6abe44d86e/SOURCES!haproxy-tcp-user-timeout.patch

I've not checked, but this kind of patches are usually only backports 
of already existing patches.


Thanks for the quick reply.



Re: haproxy-systemd-wrapper exit code problem

2016-10-26 Thread Willy Tarreau
On Wed, Oct 26, 2016 at 11:53:08AM +0200, Gabriele Cerami wrote:
> On 22 Oct, Willy Tarreau wrote:
> > Hi Gabriele,
> > 
> > On Tue, Oct 18, 2016 at 09:40:14PM +0200, Gabriele Cerami wrote:
> > > Hi,
> > > 
> > > We're having a problem with version 1.5.14 of haproxy, packaged for
> > > CentOS 7, but it seems even the code in master is affected.
> > > 
> > > In situations where bind is not possible (in our case, the address was
> > > already in use) tcp_connect_server returns with a status of 256
> > > (ERR_ALERT). This value is then passed down as exit code for
> > > haproxy-systemd-wrapper.
> > 
> > Huh ? First, tcp_connect_server() is not involved here, it's used to
> > connect to a server, so it never provides an exit code. And even if
> > instead you meant a bind issue instead, all error codes are limited
> > to 5 bits so the higher value you can have is 0x1F = 31 if all flags
> 
> Yeah, sorry about that hastily made root cause analysis.
> I'll stick to the facts.
> This is what we're seeing
> 
> http://paste.openstack.org/show/586174/

OK so it's the bind for the listener which fails, that makes more sense.
I'm amazed to see that some warnings and errors are reported as coming
from the wrapper while they're emitted by haproxy. Is it another funny
trick of systemd to make your troubleshooting more complicated ?

I don't understand why yet, but I'm not seeing the haproxy message
which should say "[haproxy.main()] cannot bind socket.", instead we're
seeing a different syntax and it is reported from the wrapper. Also
the exit path here clearly does "exit(1)". So I don't know much what
to think of it. Are you sure your version doesn't contain certain patches
affecting the error return path, maybe just some local debugging patches
that were forgotten ?

> The complete output is somewhat difficult to get, it's available here
> http://logs.openstack.org/periodic/periodic-tripleo-ci-centos-7-ovb-nonha/9697c38/logs/undercloud/var/log/undercloud_install.txt.gz
> 
> but you'd have to download it, ungzip it, then open the journal log with
> journalctl --file
> var/log/journal/d327fbd51a9f4c4997bdeba685dc22d2/system.journal  | less
> -AXFR

Unfortunately you just made me realize another good reason for staying
away from systemd-based distros, which is that your logs can only be
processed by people using the same stuff, but I personally can't do
anything with these files :-/

> As you can see, the thread spawned by the wrapper is returning 256 as
> status code, and this is interpreded as 0/success by systemd, even if
> it's actually failing to bind to a port, and thus exiting.

That's normal because exit codes are supposed to be in the lower 8 (well
7) bits. So in practice systemd really receives zero here. However I'm
seeing something now. We're directly returning status here which is wrong
as it contains some higher bits, we should use WEXITSTATUS() there if
WIFEXITED() is true. In fact, now I'm pretty sure that this 256 simply
is exit code 1 shifted 8 bits left. Please note that the wrapper in 1.5.14
still has lots of defects that were fixed later, specifically in the
return path. But this one was not fixed yet and definitely needs to be
addressed.

I'll try to propose a patch (but you'll have to adjust it by hand if you
want to put it into 1.5.14).

Willy



Re: haproxy-systemd-wrapper exit code problem

2016-10-26 Thread Gabriele Cerami
On 22 Oct, Willy Tarreau wrote:
> Hi Gabriele,
> 
> On Tue, Oct 18, 2016 at 09:40:14PM +0200, Gabriele Cerami wrote:
> > Hi,
> > 
> > We're having a problem with version 1.5.14 of haproxy, packaged for
> > CentOS 7, but it seems even the code in master is affected.
> > 
> > In situations where bind is not possible (in our case, the address was
> > already in use) tcp_connect_server returns with a status of 256
> > (ERR_ALERT). This value is then passed down as exit code for
> > haproxy-systemd-wrapper.
> 
> Huh ? First, tcp_connect_server() is not involved here, it's used to
> connect to a server, so it never provides an exit code. And even if
> instead you meant a bind issue instead, all error codes are limited
> to 5 bits so the higher value you can have is 0x1F = 31 if all flags

Yeah, sorry about that hastily made root cause analysis.
I'll stick to the facts.
This is what we're seeing

http://paste.openstack.org/show/586174/

The complete output is somewhat difficult to get, it's available here
http://logs.openstack.org/periodic/periodic-tripleo-ci-centos-7-ovb-nonha/9697c38/logs/undercloud/var/log/undercloud_install.txt.gz

but you'd have to download it, ungzip it, then open the journal log with
journalctl --file
var/log/journal/d327fbd51a9f4c4997bdeba685dc22d2/system.journal  | less
-AXFR

As you can see, the thread spawned by the wrapper is returning 256 as
status code, and this is interpreded as 0/success by systemd, even if
it's actually failing to bind to a port, and thus exiting.

thanks for any help



Re: haproxy-systemd-wrapper exit code problem

2016-10-22 Thread Willy Tarreau
Hi Gabriele,

On Tue, Oct 18, 2016 at 09:40:14PM +0200, Gabriele Cerami wrote:
> Hi,
> 
> We're having a problem with version 1.5.14 of haproxy, packaged for
> CentOS 7, but it seems even the code in master is affected.
> 
> In situations where bind is not possible (in our case, the address was
> already in use) tcp_connect_server returns with a status of 256
> (ERR_ALERT). This value is then passed down as exit code for
> haproxy-systemd-wrapper.

Huh ? First, tcp_connect_server() is not involved here, it's used to
connect to a server, so it never provides an exit code. And even if
instead you meant a bind issue instead, all error codes are limited
to 5 bits so the higher value you can have is 0x1F = 31 if all flags
are reported together.

> The problem is that exit value is truncated to the least significant 8
> bits, so even if haproxy fails, systemd gets an exit code of 0 and
> thinks the service start succeded.

I agree this would be a problem but if you really observe an exit code
of 256, I'm interested in knowing how it is produced because I see no
way to produce this!

Thanks,
Willy



haproxy-systemd-wrapper exit code problem

2016-10-18 Thread Gabriele Cerami
Hi,

We're having a problem with version 1.5.14 of haproxy, packaged for
CentOS 7, but it seems even the code in master is affected.

In situations where bind is not possible (in our case, the address was
already in use) tcp_connect_server returns with a status of 256
(ERR_ALERT). This value is then passed down as exit code for
haproxy-systemd-wrapper.

The problem is that exit value is truncated to the least significant 8
bits, so even if haproxy fails, systemd gets an exit code of 0 and
thinks the service start succeded.

Can you confirm this analysis ? (credits should go to Sofer Athlan-Guyot )
Any workarounds ?
Thanks.