Re: Segfault on 2.1.3

2020-03-23 Thread Sean Reifschneider
I'm pretty sure that we are seeing "the service is down", though only
briefly.  We started looking at the logs because we were seeing testing
failures and failures with our code deploys, which check the haproxy status
as part of rolling the code update out to the machines.  To clarify: we
aren't having to manually restart the service via "service haproxy restart"
or anything like that.

I'm not sure I'm answering your first question, if the above doesn't cover
it; how do I tell if it's the "old one" or the "new one"?  I think you mean
that an haproxy process is restarting at some point during the run, and you
want to know which one.  Or do you mean the 2.0.13 process before the update
("old one") versus the 2.1.3 process after the update ("new one")?  To
clarify: we install the 2.1.3 package, and then, with no further interaction
that I know of, we get a few handfuls of segfaults in the logs.

It happens infrequently: 1-10 times a day.  I wondered if it might be
related to a lot of hits from a single IP address, but manually doing a
bunch of reloads in my web browser didn't trigger it.  We really have no
idea what triggers it and can't reproduce it at will.

I'm pretty sure we don't have anything that updates the ACL.  At some point
in the past we had a cron job that would query the rate lists to store off
stats, but nothing that updated them, and I don't see that job on the
system after the upgrade (we spun up a new system when we did the haproxy
update).
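
For reference, the kind of read-only query that cron job did would look
roughly like this over the stats socket (the socket path here is an
assumption, not taken from our config):

    # Sketch: read-only inspection of the rate-limiting state.
    # Adjust the path to whatever the "stats socket" line in haproxy.cfg says.
    echo "show table main" | socat stdio /run/haproxy/admin.sock   # gpc0 table
    echo "show table www"  | socat stdio /run/haproxy/admin.sock   # req-rate table
    echo "show acl"        | socat stdio /run/haproxy/admin.sock   # list ACLs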

Thanks,
Sean

On Sat, Mar 21, 2020 at 3:33 AM Willy Tarreau  wrote:

> On Sat, Mar 21, 2020 at 10:08:15AM +0100, Willy Tarreau wrote:
> > On Fri, Mar 20, 2020 at 08:10:25AM -0600, Sean Reifschneider wrote:
> > > I grabbed the source from the PPA and rebuilt it, installed the dbg
> > > package, and here's one of the "bt full"s:
> >
> > Thanks!
> >
> > > (gdb) bt full
> > > #0  pattern_exec_match (head=head@entry=0x55e4dd275478,
> > > smp=smp@entry=0x7fbf9ef650c0,
> > > fill=fill@entry=0) at src/pattern.c:2541
> > > __pl_l = 
> > > __pl_r = 
> > > list = 0x0
> > > pat = 
> >
> > This is very strange. The "list" field is null for the expression. That
> > doesn't make much sense in a linked list. This makes me suspect that the
> > previous element was added, then freed without being unlinked, and then
> > reused and zeroed.
> >
> > I wanted to issue dev5 right now, but I'll first try to figure out whether
> > this is reproducible and, if so, how.
>
> I obviously can't reproduce it and the only line in your config making
> use of L4 rules is perfectly fine and straightforward.
>
> Thus I have two questions:
>   - is it the new or the old process that occasionally crashes on reload?
> If it's the new one, the service is down. If it's the old one, the
> service continues and you only know about it from your logs.
>
>   - do you have anything that tries to update the "rate_whitelist" ACL
> over the stats socket? We could for example imagine that you're
> maintaining a whitelist in a separate file that you're uploading
> upon reloads.
>
> Thanks,
> Willy
>


Re: Segfault on 2.1.3

2020-03-21 Thread Willy Tarreau
On Sat, Mar 21, 2020 at 10:08:15AM +0100, Willy Tarreau wrote:
> On Fri, Mar 20, 2020 at 08:10:25AM -0600, Sean Reifschneider wrote:
> > I grabbed the source from the PPA and rebuilt it, installed the dbg
> > package, and here's one of the "bt full"s:
> 
> Thanks!
> 
> > (gdb) bt full
> > #0  pattern_exec_match (head=head@entry=0x55e4dd275478,
> > smp=smp@entry=0x7fbf9ef650c0,
> > fill=fill@entry=0) at src/pattern.c:2541
> > __pl_l = 
> > __pl_r = 
> > list = 0x0
> > pat = 
> 
> This is very strange. The "list" field is null for the expression. That
> doesn't make much sense in a linked list. This makes me suspect that the
> previous element was added, then freed without being unlinked, and then
> reused and zeroed.
> 
> I wanted to issue dev5 right now, but I'll first try to figure out whether
> this is reproducible and, if so, how.

I obviously can't reproduce it and the only line in your config making
use of L4 rules is perfectly fine and straightforward.

Thus I have two questions:
  - is it the new or the old process that occasionally crashes on reload?
If it's the new one, the service is down. If it's the old one, the
service continues and you only know about it from your logs.

  - do you have anything that tries to update the "rate_whitelist" ACL
over the stats socket? We could for example imagine that you're
maintaining a whitelist in a separate file that you're uploading
upon reloads.
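
For illustration, an updater of that kind would look something like the
sketch below. Note that "add acl" over the socket applies to file-backed
ACLs; the paths here are hypothetical, not taken from the posted config:

    # Hypothetical whitelist updater pushed over the stats socket.
    # rate_whitelist is declared inline in the posted config, so this is
    # only a sketch of the pattern described above, with assumed paths.
    while read ip; do
      echo "add acl /etc/haproxy/rate_whitelist.lst $ip" |
        socat stdio /run/haproxy/admin.sock
    done < /etc/haproxy/rate_whitelist.lst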

Thanks,
Willy



Re: Segfault on 2.1.3

2020-03-21 Thread Willy Tarreau
On Fri, Mar 20, 2020 at 08:10:25AM -0600, Sean Reifschneider wrote:
> I grabbed the source from the PPA and rebuilt it, installed the dbg
> package, and here's one of the "bt full"s:

Thanks!

> (gdb) bt full
> #0  pattern_exec_match (head=head@entry=0x55e4dd275478,
> smp=smp@entry=0x7fbf9ef650c0,
> fill=fill@entry=0) at src/pattern.c:2541
> __pl_l = 
> __pl_r = 
> list = 0x0
> pat = 

This is very strange. The "list" field is null for the expression. That
doesn't make much sense in a linked list. This makes me suspect that the
previous element was added, then freed without being unlinked, and then
reused and zeroed.

I wanted to issue dev5 right now, but I'll first try to figure out whether
this is reproducible and, if so, how.
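
A brute-force way to look for a reload-related use-after-free like this
would be to hammer reloads while keeping connections flowing through the
L4 rules, along these lines (a sketch only, not a confirmed reproducer):

    # Reload in a loop while opening many short-lived connections.
    ( while true; do systemctl reload haproxy; sleep 0.5; done ) &
    RELOADER=$!
    for i in $(seq 1 100000); do
      curl -s -o /dev/null http://127.0.0.1/ || echo "request $i failed"
    done
    kill "$RELOADER"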

Thanks,
Willy



Re: Segfault on 2.1.3

2020-03-20 Thread Sean Reifschneider
I grabbed the source from the PPA and rebuilt it, installed the dbg
package, and here's one of the "bt full" outputs (a rough sketch of the
rebuild steps follows after the trace):

(gdb) bt full
#0  pattern_exec_match (head=head@entry=0x55e4dd275478,
smp=smp@entry=0x7fbf9ef650c0,
fill=fill@entry=0) at src/pattern.c:2541
__pl_l = 
__pl_r = 
list = 0x0
pat = 
#1  0x55e4db757cda in acl_exec_cond (cond=0x55e4dd2760a0,
px=0x55e4dd23a600, sess=sess@entry=0x7fbf9812a1a0, strm=strm@entry=0x0,
opt=6, opt@entry=2) at src/acl.c:1160
suite = 0x55e4dd2760f0
term = 0x55e4dd271c70
expr = 0x55e4dd275470
acl = 0x55e4dd2755b0
smp = {flags = 0, data = {type = 4, u = {sint = 1728908554, ipv4 =
{s_addr = 1728908554}, ipv6 = {__in6_u = {__u6_addr8 = "\n\r\rg", '\000'
, __u6_addr16 = {3338, 26381, 0, 0, 0, 0, 0, 0},
__u6_addr32 = {1728908554, 0, 0, 0}}}, str = {size = 1728908554, area =
0x0, data = 0, head = 0}, meth = {meth = 10, str = {
  size = 0, area = 0x0, data = 0, head = 0, ctx = {p =
0x0, i = 0, ll = 0, d = 0, a = {0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0}},
px = 0x55e4dd23a600, sess = 0x7fbf9812a1a0, strm = 0x0, opt = 6}
acl_res = ACL_TEST_FAIL
suite_res = ACL_TEST_PASS
cond_res = ACL_TEST_FAIL
#2  0x55e4db748648 in tcp_exec_l4_rules (sess=sess@entry=0x7fbf9812a1a0)
at src/tcp_rules.c:420
rule = 0x55e4dd275f00
ts = 
t = 
conn = 
result = 1
ret = ACL_TEST_PASS
#3  0x55e4db73ec54 in session_accept_fd (l=0x55e4dd23c660, cfd=26,
addr=) at src/session.c:193
cli_conn = 
p = 0x55e4dd23a600
sess = 0x7fbf9812a1a0
ret = -1
#4  0x55e4db729994 in accept_queue_process (t=,
context=0x55e4dba2c0c0 , state=)
at src/listener.c:176
ring = 0x55e4dba2c0c0 
li = 0x55e4dd23c660
addr = {ss_family = 2,
  __ss_padding =
"\272\342\n\r\rg\000\000\000\000\000\000\000\000x\322S\335\344U\000\000
\226!\230\277\177\000\000`y\031\230\277\177\000\000\000\000\000\000\000\000\000\000\060S\366\236\277\177\000\000\255>\275\025\377\177\000\000@S\366\236\277\177\000\000`<\275\025\377\177\000\000\260S\366\236\277\177\000\000\001\000\000\000\002\000\000\000PS\366\236\277\177\000\000\001\000\000\000\000\000\000\000`S\366\236\277\177\000",
__ss_align = 1}
max_accept = 64
addr_len = 16
ret = 
fd = 
#5  0x55e4db74b68e in process_runnable_tasks () at src/task.c:413
t = 
state = 
ctx = 
process = 
tt = 0x55e4dbb2d540 
lrq = 
grq = 
t = 
max_processed = 200
tmp_list = 
#6  0x55e4db6f6c62 in run_poll_loop () at src/haproxy.c:2645
next = 
wake = 
next = 
wake = 
#7  run_thread_poll_loop (data=) at src/haproxy.c:2762
ptaf = 
ptif = 
ptdf = 
ptff = 
init_left = 0
init_mutex = pthread_mutex_t = {Type = Normal, Status = Not
acquired, Robust = No, Shared = No, Protocol = None}
init_cond = pthread_cond_t = {Threads known to still execute a wait
function = 0, Clock ID = CLOCK_REALTIME, Shared = No}
#8  0x7fbfa1c406db in start_thread (arg=0x7fbf9ef88700) at
pthread_create.c:463
pd = 0x7fbf9ef88700
now = 
unwind_buf = {cancel_jmp_buf = {{jmp_buf = {140460982568704,
6118752029739392961, 140460982424832, 0, 1, 140733557630496,
-6082936609730125887, -6082833652346708031}, mask_was_saved = 0}}, priv =
{pad = {0x0, 0x0, 0x0, 0x0}, data = {prev = 0x0, cleanup = 0x0, canceltype
= 0}}}
not_first_call = 
#9  0x7fbfa06b488f in clone () at
../sysdeps/unix/sysv/linux/x86_64/clone.S:95
No locals.
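
For anyone wanting to do the same, the rebuild steps would look roughly
like this (a sketch assuming a deb-src entry for the vbernat PPA is
enabled; the unpacked directory name depends on the exact package version):

    sudo apt-get update
    sudo apt-get build-dep haproxy     # install the build dependencies
    apt-get source haproxy             # fetch and unpack the PPA source
    cd haproxy-2.1.3/                  # directory name varies with version
    DEB_BUILD_OPTIONS="nostrip noopt" dpkg-buildpackage -us -uc -b
    sudo dpkg -i ../haproxy_*.deb      # install the unstripped build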



On Tue, Mar 3, 2020 at 11:25 PM Vincent Bernat  wrote:

>  ❦  3 March 2020 15:34 -07, Sean Reifschneider :
>
> > We've been running haproxy 1.8 series for quite a while.  We're currently
> > in the process of updating to 2.1, and have installed from the vbernat
> PPA
> > on Ubuntu 18.04 using the same old config file.
> >
> > Now we are seeing segfaults a few times a day:
>
> You can easily collect core information if you install systemd-coredump.
> Then, use "coredumpctl list" to locate the collected core, then
> "coredumpctl info XXX" to get some stack traces. If you install the
> -dbgsym package, you can also use "coredumpctl debug XXX" then use "bt
> full" and send the output.
> --
> Don't stop with your first draft.
> - The Elements of Programming Style (Kernighan & Plauger)
>


Re: Segfault on 2.1.3

2020-03-19 Thread Christopher Faulet

On 17/03/2020 at 16:41, Sean Reifschneider wrote:
The only place tcp-request appears in my config is in relation to rate-limiting, 
which we have set up to track but not enforce.  Here are the associated rules:


frontend main
     [...]
     acl rate_whitelist src 10.0.0.1
     acl rate_whitelist src 10.0.1.1
     acl rate_whitelist src 10.0.1.2
     acl rate_whitelist src 10.0.1.3
     acl rate_whitelist src 10.0.1.4
     stick-table type ip size 200k expire 60s store gpc0
     tcp-request connection track-sc0 src if ! rate_whitelist
     #use_backend throttled if { sc0_get_gpc0 gt 0 }

backend www
     [...]
     stick-table type ip size 200k expire 1m store http_req_rate(30s)
     acl abuse_req_rate sc1_http_req_rate gt 1000
     acl mark_as_abuser sc0_inc_gpc0(main) gt 0
     tcp-request content track-sc1 src
     tcp-request content reject if abuse_req_rate mark_as_abuser

Here's a pastebin of the full config: https://paste.ubuntu.com/p/nM6xq4Vp2z/



OK, so the failing ACL is rate_whitelist. But there is nothing strange here,
and your configuration is pretty clean. It is probably a side effect of
another bug. Without a core file it will be hard to investigate.



--
Christopher Faulet



Re: Segfault on 2.1.3

2020-03-17 Thread Sean Reifschneider
The only place tcp-request appears in my config is in relation to
rate-limiting, which we have set up to track but not enforce.  Here are the
associated rules:

frontend main
[...]
acl rate_whitelist src 10.0.0.1
acl rate_whitelist src 10.0.1.1
acl rate_whitelist src 10.0.1.2
acl rate_whitelist src 10.0.1.3
acl rate_whitelist src 10.0.1.4
stick-table type ip size 200k expire 60s store gpc0
tcp-request connection track-sc0 src if ! rate_whitelist
#use_backend throttled if { sc0_get_gpc0 gt 0 }

backend www
[...]
stick-table type ip size 200k expire 1m store http_req_rate(30s)
acl abuse_req_rate sc1_http_req_rate gt 1000
acl mark_as_abuser sc0_inc_gpc0(main) gt 0
tcp-request content track-sc1 src
tcp-request content reject if abuse_req_rate mark_as_abuser

Here's a pastebin of the full config:
https://paste.ubuntu.com/p/nM6xq4Vp2z/

Thanks!

On Tue, Mar 17, 2020 at 1:24 AM Christopher Faulet  wrote:

> On 06/03/2020 at 18:53, Sean Reifschneider wrote:
> > Here's what the stack traces look like; they all seem to be showing
> > "pattern_exec_match" and "epoll_wait":
> >
> > PID: 14348 (haproxy)
> > UID: 0 (root)
> > GID: 0 (root)
> >  Signal: 11 (SEGV)
> >   Timestamp: Thu 2020-03-05 19:59:05 MST (14h ago)
> >Command Line: /usr/sbin/haproxy -Ws -f /etc/haproxy/haproxy.cfg -p
> > /run/haproxy.pid -S /run/haproxy-master.sock
> >  Executable: /usr/sbin/haproxy
> >   Control Group: /system.slice/haproxy.service
> >Unit: haproxy.service
> >   Slice: system.slice
> > Boot ID: 847e3549533c4b9b970c6ec86776621d
> >  Machine ID: 90c4e8de95634bd898f918ea24b07374
> >Hostname: fw1
> > Storage:
> >
> /var/lib/systemd/coredump/core.haproxy.0.847e3549533c4b9b970c6ec86776621d.14348.158346354500.lz4
> > Message: Process 14348 (haproxy) of user 0 dumped core.
> >
> >  Stack trace of thread 14349:
> >  #0  0x564a9deaed08 pattern_exec_match (haproxy)
> >  #1  0x564a9dee8eda acl_exec_cond (haproxy)
> >  #2  0x564a9ded9848 tcp_exec_l4_rules (haproxy)
> >  #3  0x564a9decfe24 session_accept_fd (haproxy)
> >  #4  0x564a9debab44 n/a (haproxy)
> >  #5  0x564a9dedc88e process_runnable_tasks (haproxy)
> >  #6  0x564a9de87dd2 n/a (haproxy)
> >  #7  0x7f0f0de6a6db start_thread (libpthread.so.0)
> >  #8  0x7f0f0c8de88f __clone (libc.so.6)
> >
> >  Stack trace of thread 14348:
> >  #0  0x7f0f0c8debb7 epoll_wait (libc.so.6)
> >  #1  0x564a9dda7cef n/a (haproxy)
> >  #2  0x564a9de87dbf n/a (haproxy)
> >  #3  0x564a9dda5a4e main (haproxy)
> >  #4  0x7f0f0c7deb97 __libc_start_main (libc.so.6)
> >  #5  0x564a9dda672a _start (haproxy)
> >
> > I have a bunch of ACLs to select the backend based on the host header,
> > like:
> >
> >  acl sitedown_stg_acl hdr(host) -m reg -i ^sitedown.example.com
> >  use_backend sitedown_stg if sitedown_stg_acl
> >
> > I'm not seeing anything particularly weird about those; the most
> > complicated is probably:
> >
> >  acl aerial_acl hdr(host) -m reg -i ^aerial[1-4].(dev|stg).example.com
> >  use_backend aerial if aerial_acl
> >
> > Thoughts?
> >
>
> Hi,
>
> If the segfault happens during execution of L4 rules, it means the faulty
> acl is on a "tcp-request connection" rule, not on a use_backend directive.
> Could you share this part of your configuration?
>
> --
> Christopher Faulet
>


Re: Segfault on 2.1.3

2020-03-17 Thread Christopher Faulet

On 06/03/2020 at 18:53, Sean Reifschneider wrote:
Here's what the stack traces look like; they all seem to be showing
"pattern_exec_match" and "epoll_wait":


            PID: 14348 (haproxy)
            UID: 0 (root)
            GID: 0 (root)
         Signal: 11 (SEGV)
      Timestamp: Thu 2020-03-05 19:59:05 MST (14h ago)
   Command Line: /usr/sbin/haproxy -Ws -f /etc/haproxy/haproxy.cfg -p 
/run/haproxy.pid -S /run/haproxy-master.sock

     Executable: /usr/sbin/haproxy
  Control Group: /system.slice/haproxy.service
           Unit: haproxy.service
          Slice: system.slice
        Boot ID: 847e3549533c4b9b970c6ec86776621d
     Machine ID: 90c4e8de95634bd898f918ea24b07374
       Hostname: fw1
        Storage: 
/var/lib/systemd/coredump/core.haproxy.0.847e3549533c4b9b970c6ec86776621d.14348.158346354500.lz4

        Message: Process 14348 (haproxy) of user 0 dumped core.

                 Stack trace of thread 14349:
                 #0  0x564a9deaed08 pattern_exec_match (haproxy)
                 #1  0x564a9dee8eda acl_exec_cond (haproxy)
                 #2  0x564a9ded9848 tcp_exec_l4_rules (haproxy)
                 #3  0x564a9decfe24 session_accept_fd (haproxy)
                 #4  0x564a9debab44 n/a (haproxy)
                 #5  0x564a9dedc88e process_runnable_tasks (haproxy)
                 #6  0x564a9de87dd2 n/a (haproxy)
                 #7  0x7f0f0de6a6db start_thread (libpthread.so.0)
                 #8  0x7f0f0c8de88f __clone (libc.so.6)

                 Stack trace of thread 14348:
                 #0  0x7f0f0c8debb7 epoll_wait (libc.so.6)
                 #1  0x564a9dda7cef n/a (haproxy)
                 #2  0x564a9de87dbf n/a (haproxy)
                 #3  0x564a9dda5a4e main (haproxy)
                 #4  0x7f0f0c7deb97 __libc_start_main (libc.so.6)
                 #5  0x564a9dda672a _start (haproxy)

I have a bunch of ACLs to select the backend based on the host header, like:

     acl sitedown_stg_acl hdr(host) -m reg -i ^sitedown.example.com
     use_backend sitedown_stg if sitedown_stg_acl

I'm not seeing anything particularly weird about those; the most complicated
is probably:


     acl aerial_acl hdr(host) -m reg -i ^aerial[1-4].(dev|stg).example.com
     use_backend aerial if aerial_acl

Thoughts?



Hi,

If the segfault happens during execution of L4 rules, it means the faulty acl is
on a "tcp-request connection" rule, not on a use_backend directive. Could you
share this part of your configuration?


--
Christopher Faulet



Re: Segfault on 2.1.3

2020-03-16 Thread Vincent Bernat
 ❦ 16 March 2020 16:02 -06, Sean Reifschneider:

> I reverted back to haproxy 2.0.13 from the PPA last Wednesday and have
> verified that we get no segfaults on that.  If there's anything else I can
> provide for you, let me know.  Otherwise I'm just gonna close this ticket
> in our bugtracker.  :-)

Sorry, I can't help you more. Maybe wait for the next release: it will get
a -dbgsym package, which should provide more info on why you get the problem.
-- 
Don't go around saying the world owes you a living.  The world owes you
nothing.  It was here first.
-- Mark Twain



Re: Segfault on 2.1.3

2020-03-16 Thread Sean Reifschneider
I reverted back to haproxy 2.0.13 from the PPA last Wednesday and have
verified that we get no segfaults on that.  If there's anything else I can
provide for you, let me know.  Otherwise I'm just gonna close this ticket
in our bugtracker.  :-)

Sean

On Fri, Mar 6, 2020 at 10:53 AM Sean Reifschneider  wrote:

> Here's what the stack traces look like; they all seem to be showing
> "pattern_exec_match" and "epoll_wait":
>
>PID: 14348 (haproxy)
>UID: 0 (root)
>GID: 0 (root)
> Signal: 11 (SEGV)
>  Timestamp: Thu 2020-03-05 19:59:05 MST (14h ago)
>   Command Line: /usr/sbin/haproxy -Ws -f /etc/haproxy/haproxy.cfg -p
> /run/haproxy.pid -S /run/haproxy-master.sock
> Executable: /usr/sbin/haproxy
>  Control Group: /system.slice/haproxy.service
>   Unit: haproxy.service
>  Slice: system.slice
>Boot ID: 847e3549533c4b9b970c6ec86776621d
> Machine ID: 90c4e8de95634bd898f918ea24b07374
>   Hostname: fw1
>Storage:
> /var/lib/systemd/coredump/core.haproxy.0.847e3549533c4b9b970c6ec86776621d.14348.158346354500.lz4
>Message: Process 14348 (haproxy) of user 0 dumped core.
>
> Stack trace of thread 14349:
> #0  0x564a9deaed08 pattern_exec_match (haproxy)
> #1  0x564a9dee8eda acl_exec_cond (haproxy)
> #2  0x564a9ded9848 tcp_exec_l4_rules (haproxy)
> #3  0x564a9decfe24 session_accept_fd (haproxy)
> #4  0x564a9debab44 n/a (haproxy)
> #5  0x564a9dedc88e process_runnable_tasks (haproxy)
> #6  0x564a9de87dd2 n/a (haproxy)
> #7  0x7f0f0de6a6db start_thread (libpthread.so.0)
> #8  0x7f0f0c8de88f __clone (libc.so.6)
>
> Stack trace of thread 14348:
> #0  0x7f0f0c8debb7 epoll_wait (libc.so.6)
> #1  0x564a9dda7cef n/a (haproxy)
> #2  0x564a9de87dbf n/a (haproxy)
> #3  0x564a9dda5a4e main (haproxy)
> #4  0x7f0f0c7deb97 __libc_start_main (libc.so.6)
> #5  0x564a9dda672a _start (haproxy)
>
> I have a bunch of ACLs to select the backend based on the host header,
> like:
>
> acl sitedown_stg_acl hdr(host)  -m reg -i ^sitedown.example.com
> use_backend sitedown_stg if sitedown_stg_acl
>
> I'm not seeing anything particularly weird about those; the most
> complicated is probably:
>
> acl aerial_acl hdr(host)  -m reg -i ^aerial[1-4].(dev|stg).example.com
> use_backend aerial if aerial_acl
>
> Thoughts?
>
> On Wed, Mar 4, 2020 at 1:56 PM Vincent Bernat  wrote:
>
>>  ❦  4 March 2020 13:19 -07, Sean Reifschneider :
>>
>> > I've upgraded back to 2.1, and installed the systemd-coredump, I'll
>> update
>> > when I have additional information.  I wasn't able to find a -dbgsym
>> > package, I even looked in the debian pool directory for the PPA.  We're
>> > talking like a haproxy-dbgsym package, right?  Or am I missing
>> > something?
>>
>> Sorry, I forgot to enable this option for the 2.1 PPA. You should still be
>> able to get tracebacks without the dbgsym package (with "coredumpctl
>> info XXX").
>> --
>> Indent to show the logical structure of a program.
>> - The Elements of Programming Style (Kernighan & Plauger)
>>
>


Re: Segfault on 2.1.3

2020-03-06 Thread Sean Reifschneider
Here's what the stack traces look like; they all seem to be showing
"pattern_exec_match" and "epoll_wait":

   PID: 14348 (haproxy)
   UID: 0 (root)
   GID: 0 (root)
Signal: 11 (SEGV)
 Timestamp: Thu 2020-03-05 19:59:05 MST (14h ago)
  Command Line: /usr/sbin/haproxy -Ws -f /etc/haproxy/haproxy.cfg -p
/run/haproxy.pid -S /run/haproxy-master.sock
Executable: /usr/sbin/haproxy
 Control Group: /system.slice/haproxy.service
  Unit: haproxy.service
 Slice: system.slice
   Boot ID: 847e3549533c4b9b970c6ec86776621d
Machine ID: 90c4e8de95634bd898f918ea24b07374
  Hostname: fw1
   Storage:
/var/lib/systemd/coredump/core.haproxy.0.847e3549533c4b9b970c6ec86776621d.14348.158346354500.lz4
   Message: Process 14348 (haproxy) of user 0 dumped core.

Stack trace of thread 14349:
#0  0x564a9deaed08 pattern_exec_match (haproxy)
#1  0x564a9dee8eda acl_exec_cond (haproxy)
#2  0x564a9ded9848 tcp_exec_l4_rules (haproxy)
#3  0x564a9decfe24 session_accept_fd (haproxy)
#4  0x564a9debab44 n/a (haproxy)
#5  0x564a9dedc88e process_runnable_tasks (haproxy)
#6  0x564a9de87dd2 n/a (haproxy)
#7  0x7f0f0de6a6db start_thread (libpthread.so.0)
#8  0x7f0f0c8de88f __clone (libc.so.6)

Stack trace of thread 14348:
#0  0x7f0f0c8debb7 epoll_wait (libc.so.6)
#1  0x564a9dda7cef n/a (haproxy)
#2  0x564a9de87dbf n/a (haproxy)
#3  0x564a9dda5a4e main (haproxy)
#4  0x7f0f0c7deb97 __libc_start_main (libc.so.6)
#5  0x564a9dda672a _start (haproxy)

I have a bunch of ACLs to select the backend based on the host header, like:

acl sitedown_stg_acl hdr(host)  -m reg -i ^sitedown.example.com
use_backend sitedown_stg if sitedown_stg_acl

I'm not seeing anything particularly weird about those; the most
complicated is probably:

acl aerial_acl hdr(host)  -m reg -i ^aerial[1-4].(dev|stg).example.com
use_backend aerial if aerial_acl

Thoughts?

On Wed, Mar 4, 2020 at 1:56 PM Vincent Bernat  wrote:

>  ❦  4 March 2020 13:19 -07, Sean Reifschneider :
>
> > I've upgraded back to 2.1, and installed the systemd-coredump, I'll
> update
> > when I have additional information.  I wasn't able to find a -dbgsym
> > package, I even looked in the debian pool directory for the PPA.  We're
> > talking like a haproxy-dbgsym package, right?  Or am I missing
> > something?
>
> Sorry, I forgot to enable this option for the 2.1 PPA. You should still be
> able to get tracebacks without the dbgsym package (with "coredumpctl
> info XXX").
> --
> Indent to show the logical structure of a program.
> - The Elements of Programming Style (Kernighan & Plauger)
>


Re: Segfault on 2.1.3

2020-03-04 Thread Vincent Bernat
 ❦  4 March 2020 13:19 -07, Sean Reifschneider :

> I've upgraded back to 2.1, and installed the systemd-coredump, I'll update
> when I have additional information.  I wasn't able to find a -dbgsym
> package, I even looked in the debian pool directory for the PPA.  We're
> talking like a haproxy-dbgsym package, right?  Or am I missing
> something?

Sorry, I forgot to enable this option for the 2.1 PPA. You should still be
able to get tracebacks without the dbgsym package (with "coredumpctl
info XXX").
-- 
Indent to show the logical structure of a program.
- The Elements of Programming Style (Kernighan & Plauger)



Re: Segfault on 2.1.3

2020-03-04 Thread Sean Reifschneider
(Sorry, I meant version 2.0.13, not 2.0.3.)

On Wed, Mar 4, 2020 at 1:19 PM Sean Reifschneider  wrote:

> It's maybe a little bit early to say, but 2.0.3 has not segfaulted since I
> installed it, around 20 hours ago.  The previous 20 hours had maybe a dozen
> segfaults, so this might tell us something.
>
> I've upgraded back to 2.1, and installed the systemd-coredump, I'll update
> when I have additional information.  I wasn't able to find a -dbgsym
> package, I even looked in the debian pool directory for the PPA.  We're
> talking like a haproxy-dbgsym package, right?  Or am I missing something?
>
> Thanks,
> Sean
>
> On Tue, Mar 3, 2020 at 11:25 PM Vincent Bernat  wrote:
>
>>  ❦  3 March 2020 15:34 -07, Sean Reifschneider :
>>
>> > We've been running haproxy 1.8 series for quite a while.  We're
>> currently
>> > in the process of updating to 2.1, and have installed from the vbernat
>> PPA
>> > on Ubuntu 18.04 using the same old config file.
>> >
>> > Now we are seeing segfaults a few times a day:
>>
>> You can easily collect core information if you install systemd-coredump.
>> Then, use "coredumpctl list" to locate the collected core, then
>> "coredumpctl info XXX" to get some stack traces. If you install the
>> -dbgsym package, you can also use "coredumpctl debug XXX" then use "bt
>> full" and send the output.
>> --
>> Don't stop with your first draft.
>> - The Elements of Programming Style (Kernighan & Plauger)
>>
>


Re: Segfault on 2.1.3

2020-03-04 Thread Sean Reifschneider
It's maybe a little bit early to say, but 2.0.3 has not segfaulted since I
installed it, around 20 hours ago.  The previous 20 hours had maybe a dozen
segfaults, so this might tell us something.

I've upgraded back to 2.1, and installed the systemd-coredump, I'll update
when I have additional information.  I wasn't able to find a -dbgsym
package, I even looked in the debian pool directory for the PPA.  We're
talking like a haproxy-dbgsym package, right?  Or am I missing something?

Thanks,
Sean

On Tue, Mar 3, 2020 at 11:25 PM Vincent Bernat  wrote:

>  ❦  3 March 2020 15:34 -07, Sean Reifschneider :
>
> > We've been running haproxy 1.8 series for quite a while.  We're currently
> > in the process of updating to 2.1, and have installed from the vbernat
> PPA
> > on Ubuntu 18.04 using the same old config file.
> >
> > Now we are seeing segfaults a few times a day:
>
> You can easily collect core information if you install systemd-coredump.
> Then, use "coredumpctl list" to locate the collected core, then
> "coredumpctl info XXX" to get some stack traces. If you install the
> -dbgsym package, you can also use "coredumpctl debug XXX" then use "bt
> full" and send the output.
> --
> Don't stop with your first draft.
> - The Elements of Programming Style (Kernighan & Plauger)
>


Re: Segfault on 2.1.3

2020-03-03 Thread Vincent Bernat
 ❦  3 March 2020 15:34 -07, Sean Reifschneider :

> We've been running haproxy 1.8 series for quite a while.  We're currently
> in the process of updating to 2.1, and have installed from the vbernat PPA
> on Ubuntu 18.04 using the same old config file.
>
> Now we are seeing segfaults a few times a day:

You can easily collect core information if you install systemd-coredump.
Then, use "coredumpctl list" to locate the collected core, then
"coredumpctl info XXX" to get some stack traces. If you install the
-dbgsym package, you can also use "coredumpctl debug XXX" then use "bt
full" and send the output.
-- 
Don't stop with your first draft.
- The Elements of Programming Style (Kernighan & Plauger)



Segfault on 2.1.3

2020-03-03 Thread Sean Reifschneider
We've been running haproxy 1.8 series for quite a while.  We're currently
in the process of updating to 2.1, and have installed from the vbernat PPA
on Ubuntu 18.04 using the same old config file.

Now we are seeing segfaults a few times a day:

Mar 03 14:53:52 fw1.dev.realgo.com kernel: haproxy[8654]: segfault at 18 ip
56389a674d08 sp 7fac18dba030 error 4 in haproxy[56389a52b000+235000]
Mar 03 14:53:52 fw1.dev.realgo.com haproxy[8649]: [ALERT] 062/145352 (8649)
: Current worker #1 (8653) exited with code 139 (Segmentation fault)
Mar 03 14:53:52 fw1.dev.realgo.com haproxy[8649]: [ALERT] 062/145352 (8649)
: exit-on-failure: killing every processes with SIGTERM
Mar 03 14:53:52 fw1.dev.realgo.com haproxy[8649]: [WARNING] 062/145352
(8649) : All workers exited. Exiting... (139)
Mar 03 14:53:52 fw1.dev.realgo.com systemd[1]: haproxy.service: Main
process exited, code=exited, status=139/n/a
Mar 03 14:53:52 fw1.dev.realgo.com systemd[1]: haproxy.service: Failed with
result 'exit-code'.

Looks like it is restarting 5-10 times a day during the work week, and less
during the weekend, when only our monitoring system is hitting it.
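
A quick way to tally these from the journal (unit name and message text as
in the log excerpt above):

    # Count worker crashes over the last week.
    journalctl -u haproxy --since "7 days ago" | grep -c "Segmentation fault"
    # The kernel's segfault lines are logged outside the unit:
    journalctl -k --since "7 days ago" | grep -ci "haproxy.*segfault"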

I haven't tried disabling htx, which the other recent Segfault thread said
resolved it.  For now, I'm trying a revert to 2.0.13.

I'm happy to provide the config if that helps.  It's 419 lines.

Sean