Re: Segfault with 1.8.0 build (RHEL5, old gcc).

2017-12-07 Thread Christopher Lane
On Thu, Dec 7, 2017 at 4:36 AM, Olivier Houchard  wrote:
> Hi Christopher,
>
> On Wed, Dec 06, 2017 at 05:34:15PM -0800, Christopher Lane wrote:
>> On Mon, Dec 4, 2017 at 11:56 AM, Christopher Lane
>>  wrote:
>> >
>> >
>> >
>> > On Mon, Dec 4, 2017 at 4:22 AM Lukas Tribus  wrote:
>> >
>> >>Hello Christopher,
>> >
>> >
>> >>2017-12-01 20:59 GMT+01:00 Christopher Lane :
>> >>>
>> >>> gist with backtrace, -vv output, and config file. Also strace.
>> >>>
>> >>> https://gist.github.com/jayalane/c6dbe7918aa9635b62c874d20f57dfec
>> >>>
>> >>> It does all the listens and then right after the first epoll is done it
>> >>> has this segv. all the local variables are "optimized out"
>> >>>
>> >>> (We really want the hard-stop-after -- we get a leak of children with our
>> >>> super frequent soft-reloads).
>> >
>> >>FYI, hard-stop-after has been backported to 1.7 stable and is in all
>> >>rebuilds starting with 1.7.4:
>> >
>> >> https://www.mail-archive.com/haproxy@formilux.org/msg25494.html
>> >
>> >
>> >
>> >> Regards,
>> >> Lukas
>> >
>> > Olivier:
>> >
>> > The patch worked beautifully to keep the 1.8.0 from crashing.
>> >
>> > Lukas:
>> >
>> > Thanks for the tip, we'll consider 1.7.9 then (settled on 1.8.0 by starting
>> > out with reading the release notes for it; we are upgrading from 1.5.0).
>> >
>> > --Chris
>>
>> Unrelated to the prior contents of this thread, I found the root cause
>> for our issue (child leak).
>>
>> The haproxy was being started from a Java process using Runtime.exec()
>> and the PIDs were jammed into 1 cell of argv.  It killed the first
>> child but not the later ones, as atol("13233 13234 13235 13236")
>> returns 13233.  We have fixed the Java code.
>>
>> http://git.haproxy.org/?p=haproxy.git;a=blob;f=src/haproxy.c;h=eb5e65b40e7b8b2a4f8fb04b3552401e42fb0a89;hb=HEAD#l1421
>>
>> I note that the -sf parsing code uses atol and has no error checking.
>> Would the project be interested in a patch to use strtol with error
>> checks?  Could log a warning if unconsumed bytes in the arg (safer),
>> or fail to start (unsafe).  I'm sure I'm not the only one with this
>> sort of bug, given how tricky shell escaping and so on is.
>
> Yes, replacing ato* with something more reasonnable would certainly be
> appreciated :)
>
> Regards,
>
> Olivier

Didn't really see a place to put in tests for it, but did build git
master and try it with various correct and incorrect args.  Patch is
under subject:

[PATCH] CONTRIB: log: emit warning when -sf/-sd cannot parse argument


Thanks,


--Chris



Re: Segfault with 1.8.0 build (RHEL5, old gcc).

2017-12-07 Thread Olivier Houchard
Hi Christopher,

On Wed, Dec 06, 2017 at 05:34:15PM -0800, Christopher Lane wrote:
> On Mon, Dec 4, 2017 at 11:56 AM, Christopher Lane
>  wrote:
> >
> >
> >
> > On Mon, Dec 4, 2017 at 4:22 AM Lukas Tribus  wrote:
> >
> >>Hello Christopher,
> >
> >
> >>2017-12-01 20:59 GMT+01:00 Christopher Lane :
> >>>
> >>> gist with backtrace, -vv output, and config file. Also strace.
> >>>
> >>> https://gist.github.com/jayalane/c6dbe7918aa9635b62c874d20f57dfec
> >>>
> >>> It does all the listens and then right after the first epoll is done it
> >>> has this segv. all the local variables are "optimized out"
> >>>
> >>> (We really want the hard-stop-after -- we get a leak of children with our
> >>> super frequent soft-reloads).
> >
> >>FYI, hard-stop-after has been backported to 1.7 stable and is in all
> >>rebuilds starting with 1.7.4:
> >
> >> https://www.mail-archive.com/haproxy@formilux.org/msg25494.html
> >
> >
> >
> >> Regards,
> >> Lukas
> >
> > Olivier:
> >
> > The patch worked beautifully to keep the 1.8.0 from crashing.
> >
> > Lukas:
> >
> > Thanks for the tip, we'll consider 1.7.9 then (settled on 1.8.0 by starting
> > out with reading the release notes for it; we are upgrading from 1.5.0).
> >
> > --Chris
> 
> Unrelated to the prior contents of this thread, I found the root cause
> for our issue (child leak).
> 
> The haproxy was being started from a Java process using Runtime.exec()
> and the PIDs were jammed into 1 cell of argv.  It killed the first
> child but not the later ones, as atol("13233 13234 13235 13236")
> returns 13233.  We have fixed the Java code.
> 
> http://git.haproxy.org/?p=haproxy.git;a=blob;f=src/haproxy.c;h=eb5e65b40e7b8b2a4f8fb04b3552401e42fb0a89;hb=HEAD#l1421
> 
> I note that the -sf parsing code uses atol and has no error checking.
> Would the project be interested in a patch to use strtol with error
> checks?  Could log a warning if unconsumed bytes in the arg (safer),
> or fail to start (unsafe).  I'm sure I'm not the only one with this
> sort of bug, given how tricky shell escaping and so on is.

Yes, replacing ato* with something more reasonnable would certainly be
appreciated :)

Regards,

Olivier



Re: Segfault with 1.8.0 build (RHEL5, old gcc).

2017-12-06 Thread Christopher Lane
On Mon, Dec 4, 2017 at 11:56 AM, Christopher Lane
 wrote:
>
>
>
> On Mon, Dec 4, 2017 at 4:22 AM Lukas Tribus  wrote:
>
>>Hello Christopher,
>
>
>>2017-12-01 20:59 GMT+01:00 Christopher Lane :
>>>
>>> gist with backtrace, -vv output, and config file. Also strace.
>>>
>>> https://gist.github.com/jayalane/c6dbe7918aa9635b62c874d20f57dfec
>>>
>>> It does all the listens and then right after the first epoll is done it
>>> has this segv. all the local variables are "optimized out"
>>>
>>> (We really want the hard-stop-after -- we get a leak of children with our
>>> super frequent soft-reloads).
>
>>FYI, hard-stop-after has been backported to 1.7 stable and is in all
>>rebuilds starting with 1.7.4:
>
>> https://www.mail-archive.com/haproxy@formilux.org/msg25494.html
>
>
>
>> Regards,
>> Lukas
>
> Olivier:
>
> The patch worked beautifully to keep the 1.8.0 from crashing.
>
> Lukas:
>
> Thanks for the tip, we'll consider 1.7.9 then (settled on 1.8.0 by starting
> out with reading the release notes for it; we are upgrading from 1.5.0).
>
> --Chris

Unrelated to the prior contents of this thread, I found the root cause
for our issue (child leak).

The haproxy was being started from a Java process using Runtime.exec()
and the PIDs were jammed into 1 cell of argv.  It killed the first
child but not the later ones, as atol("13233 13234 13235 13236")
returns 13233.  We have fixed the Java code.

http://git.haproxy.org/?p=haproxy.git;a=blob;f=src/haproxy.c;h=eb5e65b40e7b8b2a4f8fb04b3552401e42fb0a89;hb=HEAD#l1421

I note that the -sf parsing code uses atol and has no error checking.
Would the project be interested in a patch to use strtol with error
checks?  Could log a warning if unconsumed bytes in the arg (safer),
or fail to start (unsafe).  I'm sure I'm not the only one with this
sort of bug, given how tricky shell escaping and so on is.

Thanks again, and we are enjoying 1.8.1 and much lower loadavgs.

--Chris



Re: Segfault with 1.8.0 build (RHEL5, old gcc).

2017-12-04 Thread Christopher Lane
On Mon, Dec 4, 2017 at 4:22 AM Lukas Tribus  wrote:

>Hello Christopher,


>2017-12-01 20:59 GMT+01:00 Christopher Lane :
>>
>> gist with backtrace, -vv output, and config file. Also strace.
>>
>> https://gist.github.com/jayalane/c6dbe7918aa9635b62c874d20f57dfec
>>
>> It does all the listens and then right after the first epoll is done it
has this segv. all the local variables are "optimized out"
>>
>> (We really want the hard-stop-after -- we get a leak of children with
our super frequent soft-reloads).

>FYI, hard-stop-after has been backported to 1.7 stable and is in all
>rebuilds starting with 1.7.4:

> https://www.mail-archive.com/haproxy@formilux.org/msg25494.html




> Regards,
> Lukas

Olivier:

The patch worked beautifully to keep the 1.8.0 from crashing.

Lukas:

Thanks for the tip, we'll consider 1.7.9 then (settled on 1.8.0 by starting
out with reading the release notes for it; we are upgrading from 1.5.0).

--Chris


Re: Segfault with 1.8.0 build (RHEL5, old gcc).

2017-12-04 Thread Lukas Tribus
Hello Christopher,


2017-12-01 20:59 GMT+01:00 Christopher Lane :
>
> gist with backtrace, -vv output, and config file.  Also strace.
>
> https://gist.github.com/jayalane/c6dbe7918aa9635b62c874d20f57dfec
>
> It does all the listens and then right after the first epoll is done it has 
> this segv.  all the local variables are "optimized out"
>
> (We really want the hard-stop-after -- we get a leak of children with our 
> super frequent soft-reloads).

FYI, hard-stop-after has been backported to 1.7 stable and is in all
rebuilds starting with 1.7.4:

https://www.mail-archive.com/haproxy@formilux.org/msg25494.html



Regards,
Lukas



Re: Segfault with 1.8.0 build (RHEL5, old gcc).

2017-12-01 Thread Olivier Houchard
Hi Christopher,

On Fri, Dec 01, 2017 at 11:59:14AM -0800, Christopher Lane wrote:
> gist with backtrace, -vv output, and config file.  Also strace.
> 
> https://gist.github.com/jayalane/c6dbe7918aa9635b62c874d20f57dfec
> 
> It does all the listens and then right after the first epoll is done it has
> this segv.  all the local variables are "optimized out"
> 
> (We really want the hard-stop-after -- we get a leak of children with our
> super frequent soft-reloads).
> 
> It's possible something is messed up in my compiling/linking environment,
> but I don't see anything, and I thought maybe a .0 release had a chance of
> issues on very old RHEL versions.  We currently run 1.5 and have to
> manually clean up the left over children (but are hoping to scale up to
> nbproc 4 and use CPU pinning).
> 
> I'm debugging also, but hoped someone would have something off the top of
> their head (apologies for not having read the list for a while before
> posting).
> 

Thanks a lot for the very detailled report !
I think I know what's going on.
In your config file, you have :
 server stage.qa.example.com:11302 stage.qa.example.com check inter 2m 
fastinter 2s fall 2
It should probably be :
 server stage.qa.example.com stage.qa.example.com:11302 check inter 2m 
fastinter 2s fall 2

However this is of course no reason to segfault.
Can you please try the attached patch, and let me know if it fixes your issue ?

Thanks a lot !

Olivier
>From 5236a1a4ac19cc27c6f06d328b2df0c4cdfe220c Mon Sep 17 00:00:00 2001
From: Olivier Houchard 
Date: Fri, 1 Dec 2017 22:04:05 +0100
Subject: [PATCH] MINOR: checks: Be sure we have a mux if we created a cs.

In connect_conn_chk(), there were one case we could return with a new
conn_stream created, but no mux attached. With no mux, cs_destroy() would
segfault. Fix that by setting the mux before we can fail.

This should be backported to 1.8.
---
 src/checks.c | 18 --
 1 file changed, 8 insertions(+), 10 deletions(-)

diff --git a/src/checks.c b/src/checks.c
index 63747201e..eaf84a225 100644
--- a/src/checks.c
+++ b/src/checks.c
@@ -1564,25 +1564,23 @@ static int connect_conn_chk(struct task *t)
conn->addr.to = s->addr;
}
 
+   proto = protocol_by_family(conn->addr.to.ss_family);
+
+   conn_prepare(conn, proto, check->xprt);
+   conn_install_mux(conn, &mux_pt_ops, cs);
+   cs_attach(cs, check, &check_conn_cb);
+   conn->target = &s->obj_type;
+
if ((conn->addr.to.ss_family == AF_INET) || (conn->addr.to.ss_family == 
AF_INET6)) {
int i = 0;
 
i = srv_check_healthcheck_port(check);
-   if (i == 0) {
-   cs->data = check;
+   if (i == 0)
return SF_ERR_CHK_PORT;
-   }
 
set_host_port(&conn->addr.to, i);
}
 
-   proto = protocol_by_family(conn->addr.to.ss_family);
-
-   conn_prepare(conn, proto, check->xprt);
-   conn_install_mux(conn, &mux_pt_ops, cs);
-   cs_attach(cs, check, &check_conn_cb);
-   conn->target = &s->obj_type;
-
/* no client address */
clear_addr(&conn->addr.from);
 
-- 
2.14.3