Re: DNS resolver + threads, 100% cpu usage / hang 1.9dev
Hi Olivier, On Wed, May 23, 2018 at 04:13:20PM +0200, Olivier Houchard wrote: > Thanks a lot for testing, and your usual investigation work ! > > Willy, can you please apply this ? Sure, now done, thanks! Willy
Re: DNS resolver + threads, 100% cpu usage / hang 1.9dev
Hi Pieter, On Tue, May 22, 2018 at 09:00:24PM +0200, PiBa-NL wrote: > Hi Olivier, > > Op 22-5-2018 om 18:46 schreef Olivier Houchard: > > Hi Pieter, > > > > Does the attached patch fix it for you ? It's been generated from master, > > but will probably apply against 1.8 as well. > > > > Thanks ! > > > > Olivier > > Patch works for me (on master, didn't try with 1.8). Or at least i'm running > the same testbox for an hour now without issue. > Thanks ! > Thanks a lot for testing, and your usual investigation work ! Willy, can you please apply this ? Thanks ! Olivier >From b938f86e1fe51e95adc73f9e583dd225f5ecf88d Mon Sep 17 00:00:00 2001 From: Olivier Houchard Date: Tue, 22 May 2018 18:40:07 +0200 Subject: [PATCH] BUG/MEDIUM: dns: Delay the attempt to run a DNS resolution on check failure. When checks fail, the code tries to run a dns resolution, in case the IP changed. The old way of doing that was to check, in case the last dns resolution hadn't expired yet, if there were an applicable IP, which should be useless, because it has already be done when the resolution was first done, or to run a new resolution. Both are a locking nightmare, and lead to deadlocks, so instead, just wake the resolvers task, that should do the trick. This should be backported to 1.8. --- src/dns.c | 9 - 1 file changed, 4 insertions(+), 5 deletions(-) diff --git a/src/dns.c b/src/dns.c index e42950212..e2d46a658 100644 --- a/src/dns.c +++ b/src/dns.c @@ -348,11 +348,10 @@ void dns_trigger_resolution(struct dns_requester *req) /* The resolution must not be triggered yet. Use the cached response, if * valid */ exp = tick_add(res->last_resolution, resolvers->hold.valid); - if (res->status == RSLV_STATUS_VALID && - tick_isset(res->last_resolution) && !tick_is_expired(exp, now_ms)) - req->requester_cb(req, NULL); - else - dns_run_resolution(res); + if (resolvers->t && (res->status != RSLV_STATUS_VALID || + !tick_isset(res->last_resolution) || tick_is_expired(exp, now_ms))) + task_wakeup(resolvers->t, TASK_WOKEN_OTHER); + } -- 2.14.3
Re: DNS resolver + threads, 100% cpu usage / hang 1.9dev
Hi Olivier, Op 22-5-2018 om 18:46 schreef Olivier Houchard: Hi Pieter, Does the attached patch fix it for you ? It's been generated from master, but will probably apply against 1.8 as well. Thanks ! Olivier Patch works for me (on master, didn't try with 1.8). Or at least i'm running the same testbox for an hour now without issue. Thanks ! Regards, PiBa-NL (Pieter)
Re: DNS resolver + threads, 100% cpu usage / hang 1.9dev
Hi Pieter, On Sun, May 20, 2018 at 02:07:43AM +0200, PiBa-NL wrote: > Hi List, > > With 1.8.8 ran into this, tried latest 1.9dev snapshot seems to have the > same issue.. > > Running with 3 threads, a template for 8 servers, and only 2 ip's in the dns > response, neither of which is actually 'up' one responds with 'L4TOUT in > 1004ms' the other with 'L4CON in 0ms' on stats page.. DNS comes from a local > unbound dns-server on the same host as the haproxy process. after a little > while (10 minutes+-) haproxy go's to 300% usage on all the 3 spin-locks and > doesn't respond anymore. > > There was no actual traffic passing through this machine. I am watching the > stats page. > Running with 1 thread the issue does not seem to appear. > > I think 2 threads are deadlocking each other.?. and then later the 3rd joins > the waiting game. > I've added logging between a lot of most of the lock/unlock functions.. and > it seems that the 2 succeeded locks below from lines 332 and 333 in attached > logfile are the last ones where threads 1 and 0 are doing anything.. > > For example added logging like this everywhere ,the thread-id, the > servername, and the function its inside of is logged with printf > statements..: > > static inline void health_adjust(struct server *s, short status) > { > HA_SPIN_LOCK(SERVER_LOCK, &s->lock); > printf("tid:%u LOCKED srv: %s health_adjust\n", tid, s->id); > /* return now if observing nor health check is not enabled */ > if (!s->observe || !s->check.task) { > printf("tid:%u UNLOCK srv: %s health_adjust a\n", tid, s->id); > HA_SPIN_UNLOCK(SERVER_LOCK, &s->lock); > return; > } > > __health_adjust(s, status); > printf("tid:%u UNLOCK srv: %s health_adjust b\n", tid, s->id); > HA_SPIN_UNLOCK(SERVER_LOCK, &s->lock); > } > > This leads to the following calls that are seemingly never unlocked the tid: > stops appearing, and actual callstacks go several functions deeper that try > to gain another lock: > > tid:1 LOCKED srv: alias_srv1 process_chk_conn > tid:0 LOCKED srv: alias_srv8 event_srv_chk_w > > Then a little later the tid:2 also hangs.. probably it was time to perform > another check or dns lookup or so... > > (gdb) info threads > Id Target Id Frame > 1 LWP 100730 of process 15238 0x004ab6bc in > snr_check_ip_callback (srv=0x8022d0400, ip=0x80288e394, > ip_family=0x7fffe207 "\002\224\343\210\002\b") at src/server.c:3781 > 2 LWP 100924 of process 15238 0x004ab6b4 in > snr_check_ip_callback (srv=0x8022cb000, ip=0x80288e394, > ip_family=0x7fffdfffda97 "\002\224\343\210\002\b") at src/server.c:3781 > * 3 LWP 100925 of process 15238 0x0051ef21 in dns_resolve_recv > (dgram=0x8022483c0) at src/dns.c:1646 > > More gdb 'bt full' output of all 3 threads is at the bottom of the attached > logfile. > > Hope someone can try and fix this :) Config and haproxy -vv also added > below. > > Thanks > PiBa-NL (Pieter) > As usual, you did most of the work :) I think I understand what is going on, and it's ugly as hell. Does the attached patch fix it for you ? It's been generated from master, but will probably apply against 1.8 as well. Thanks ! Olivier >From b938f86e1fe51e95adc73f9e583dd225f5ecf88d Mon Sep 17 00:00:00 2001 From: Olivier Houchard Date: Tue, 22 May 2018 18:40:07 +0200 Subject: [PATCH] BUG/MEDIUM: dns: Delay the attempt to run a DNS resolution on check failure. When checks fail, the code tries to run a dns resolution, in case the IP changed. The old way of doing that was to check, in case the last dns resolution hadn't expired yet, if there were an applicable IP, which should be useless, because it has already be done when the resolution was first done, or to run a new resolution. Both are a locking nightmare, and lead to deadlocks, so instead, just wake the resolvers task, that should do the trick. This should be backported to 1.8. --- src/dns.c | 9 - 1 file changed, 4 insertions(+), 5 deletions(-) diff --git a/src/dns.c b/src/dns.c index e42950212..e2d46a658 100644 --- a/src/dns.c +++ b/src/dns.c @@ -348,11 +348,10 @@ void dns_trigger_resolution(struct dns_requester *req) /* The resolution must not be triggered yet. Use the cached response, if * valid */ exp = tick_add(res->last_resolution, resolvers->hold.valid); - if (res->status == RSLV_STATUS_VALID && - tick_isset(res->last_resolution) && !tick_is_expired(exp, now_ms)) - req->requester_cb(req, NULL); - else - dns_run_resolution(res); + if (resolvers->t && (res->status != RSLV_STATUS_VALID || + !tick_isset(res->last_resolution) || tick_is_expired(exp, now_ms))) + task_wakeup(resolvers->t, TASK_WOKEN_OTHER); + } -- 2.14.3
DNS resolver + threads, 100% cpu usage / hang 1.9dev
Hi List, With 1.8.8 ran into this, tried latest 1.9dev snapshot seems to have the same issue.. Running with 3 threads, a template for 8 servers, and only 2 ip's in the dns response, neither of which is actually 'up' one responds with 'L4TOUT in 1004ms' the other with 'L4CON in 0ms' on stats page.. DNS comes from a local unbound dns-server on the same host as the haproxy process. after a little while (10 minutes+-) haproxy go's to 300% usage on all the 3 spin-locks and doesn't respond anymore. There was no actual traffic passing through this machine. I am watching the stats page. Running with 1 thread the issue does not seem to appear. I think 2 threads are deadlocking each other.?. and then later the 3rd joins the waiting game. I've added logging between a lot of most of the lock/unlock functions.. and it seems that the 2 succeeded locks below from lines 332 and 333 in attached logfile are the last ones where threads 1 and 0 are doing anything.. For example added logging like this everywhere ,the thread-id, the servername, and the function its inside of is logged with printf statements..: static inline void health_adjust(struct server *s, short status) { HA_SPIN_LOCK(SERVER_LOCK, &s->lock); printf("tid:%u LOCKED srv: %s health_adjust\n", tid, s->id); /* return now if observing nor health check is not enabled */ if (!s->observe || !s->check.task) { printf("tid:%u UNLOCK srv: %s health_adjust a\n", tid, s->id); HA_SPIN_UNLOCK(SERVER_LOCK, &s->lock); return; } __health_adjust(s, status); printf("tid:%u UNLOCK srv: %s health_adjust b\n", tid, s->id); HA_SPIN_UNLOCK(SERVER_LOCK, &s->lock); } This leads to the following calls that are seemingly never unlocked the tid: stops appearing, and actual callstacks go several functions deeper that try to gain another lock: tid:1 LOCKED srv: alias_srv1 process_chk_conn tid:0 LOCKED srv: alias_srv8 event_srv_chk_w Then a little later the tid:2 also hangs.. probably it was time to perform another check or dns lookup or so... (gdb) info threads Id Target Id Frame 1 LWP 100730 of process 15238 0x004ab6bc in snr_check_ip_callback (srv=0x8022d0400, ip=0x80288e394, ip_family=0x7fffe207 "\002\224\343\210\002\b") at src/server.c:3781 2 LWP 100924 of process 15238 0x004ab6b4 in snr_check_ip_callback (srv=0x8022cb000, ip=0x80288e394, ip_family=0x7fffdfffda97 "\002\224\343\210\002\b") at src/server.c:3781 * 3 LWP 100925 of process 15238 0x0051ef21 in dns_resolve_recv (dgram=0x8022483c0) at src/dns.c:1646 More gdb 'bt full' output of all 3 threads is at the bottom of the attached logfile. Hope someone can try and fix this :) Config and haproxy -vv also added below. Thanks PiBa-NL (Pieter) global maxconn 3002 log /var/run/log local0 info stats socket /tmp/haproxy.socket level admin expose-fd listeners uid 80 gid 80 nbproc 1 nbthread 3 hard-stop-after 15m chroot /tmp/haproxy_chroot daemon tune.ssl.default-dh-param 2048 log-send-hostname haproxy-pb-test defaults # never fail on address resolution last,libc,none default-server init-addr last,none stats show-legends userlist myuserlist user admin insecure-password pass listen HAProxyLocalStats bind :80 name localstats mode http stats enable stats refresh 5 stats admin if TRUE stats uri / stats show-legends timeout client 5000 timeout connect 5000 timeout server 5000 mailers globalmailers mailer pbmail 192.168.0.40:25 resolvers globalresolvers nameserver goog 127.0.0.1:53 resolve_retries 3 timeout retry 1s hold valid 10s frontend alias_test bind 1.2.3.1:21 name 1.2.3.1:21 transparent bind 1.2.3.1:1-10002 name 1.2.3.1:1-10002 transparent bind 1.2.3.2:21 name 1.2.3.2:21 transparent bind 1.2.3.2:1-10002 name 1.2.3.2:1-10002 transparent bind 1:2:3::3:21 name 1:2:3::3:21 transparent bind 1:2:3::3:1-10002 name 1:2:3::3:1-10002 transparent bind 1:2:3::4:21 name 1:2:3::4:21 transparent bind 1:2:3::4:1-10002 name 1:2:3::4:1-10002 transparent bind 1.2.3.9:21 name 1.2.3.9:21 transparent bind 1.2.3.9:1-10002 name 1.2.3.9:1-10002 transparent mode http log global option socket-stats option http-keep-alive timeout client 3 default_backend alias_back_http_ipvANY backend alias_back_http_ipvANY mode http id 134 log global # use mailers # level info email-alert mailers globalmailers email-aler