Re: DNS resolver + threads, 100% cpu usage / hang 1.9dev

2018-05-23 Thread Willy Tarreau
Hi Olivier,

On Wed, May 23, 2018 at 04:13:20PM +0200, Olivier Houchard wrote:
> Thanks a lot for testing, and your usual investigation work !
> 
> Willy, can you please apply this ?

Sure, now done, thanks!

Willy



Re: DNS resolver + threads, 100% cpu usage / hang 1.9dev

2018-05-23 Thread Olivier Houchard
Hi Pieter,

On Tue, May 22, 2018 at 09:00:24PM +0200, PiBa-NL wrote:
> Hi Olivier,
> 
> Op 22-5-2018 om 18:46 schreef Olivier Houchard:
> > Hi Pieter,
> > 
> > Does the attached patch fix it for you ? It's been generated from master,
> > but will probably apply against 1.8 as well.
> > 
> > Thanks !
> > 
> > Olivier
> 
> Patch works for me (on master, didn't try with 1.8). Or at least i'm running
> the same testbox for an hour now without issue.
> Thanks !
> 

Thanks a lot for testing, and your usual investigation work !

Willy, can you please apply this ?

Thanks !

Olivier
>From b938f86e1fe51e95adc73f9e583dd225f5ecf88d Mon Sep 17 00:00:00 2001
From: Olivier Houchard 
Date: Tue, 22 May 2018 18:40:07 +0200
Subject: [PATCH] BUG/MEDIUM: dns: Delay the attempt to run a DNS resolution on
 check failure.

When checks fail, the code tries to run a dns resolution, in case the IP
changed.
The old way of doing that was to check, in case the last dns resolution
hadn't expired yet, if there were an applicable IP, which should be useless,
because it has already be done when the resolution was first done, or to
run a new resolution.
Both are a locking nightmare, and lead to deadlocks, so instead, just wake the
resolvers task, that should do the trick.

This should be backported to 1.8.
---
 src/dns.c | 9 -
 1 file changed, 4 insertions(+), 5 deletions(-)

diff --git a/src/dns.c b/src/dns.c
index e42950212..e2d46a658 100644
--- a/src/dns.c
+++ b/src/dns.c
@@ -348,11 +348,10 @@ void dns_trigger_resolution(struct dns_requester *req)
/* The resolution must not be triggered yet. Use the cached response, if
 * valid */
exp = tick_add(res->last_resolution, resolvers->hold.valid);
-   if (res->status == RSLV_STATUS_VALID &&
-   tick_isset(res->last_resolution) && !tick_is_expired(exp, now_ms))
-   req->requester_cb(req, NULL);
-   else
-   dns_run_resolution(res);
+   if (resolvers->t && (res->status != RSLV_STATUS_VALID ||
+   !tick_isset(res->last_resolution) || tick_is_expired(exp, now_ms)))
+   task_wakeup(resolvers->t, TASK_WOKEN_OTHER);
+
 }
 
 
-- 
2.14.3



Re: DNS resolver + threads, 100% cpu usage / hang 1.9dev

2018-05-22 Thread PiBa-NL

Hi Olivier,

Op 22-5-2018 om 18:46 schreef Olivier Houchard:

Hi Pieter,

Does the attached patch fix it for you ? It's been generated from master,
but will probably apply against 1.8 as well.

Thanks !

Olivier


Patch works for me (on master, didn't try with 1.8). Or at least i'm 
running the same testbox for an hour now without issue.

Thanks !

Regards,
PiBa-NL (Pieter)




Re: DNS resolver + threads, 100% cpu usage / hang 1.9dev

2018-05-22 Thread Olivier Houchard
Hi Pieter,

On Sun, May 20, 2018 at 02:07:43AM +0200, PiBa-NL wrote:
> Hi List,
> 
> With 1.8.8 ran into this, tried latest 1.9dev snapshot seems to have the
> same issue..
> 
> Running with 3 threads, a template for 8 servers, and only 2 ip's in the dns
> response, neither of which is actually 'up' one responds with 'L4TOUT in
> 1004ms' the other with 'L4CON in 0ms' on stats page.. DNS comes from a local
> unbound dns-server on the same host as the haproxy process. after a little
> while (10 minutes+-) haproxy go's to 300% usage on all the 3 spin-locks and
> doesn't respond anymore.
> 
> There was no actual traffic passing through this machine. I am watching the
> stats page.
> Running with 1 thread the issue does not seem to appear.
> 
> I think 2 threads are deadlocking each other.?. and then later the 3rd joins
> the waiting game.
> I've added logging between a lot of most of the lock/unlock functions.. and
> it seems that the 2 succeeded locks below from lines 332 and 333 in attached
> logfile are the last ones where threads 1 and 0 are doing anything..
> 
> For example added logging like this everywhere ,the thread-id, the
> servername, and the function its inside of is logged with printf
> statements..:
> 
> static inline void health_adjust(struct server *s, short status)
> {
>     HA_SPIN_LOCK(SERVER_LOCK, >lock);
>     printf("tid:%u LOCKED srv: %s health_adjust\n", tid, s->id);
>     /* return now if observing nor health check is not enabled */
>     if (!s->observe || !s->check.task) {
>         printf("tid:%u UNLOCK srv: %s health_adjust a\n", tid, s->id);
>         HA_SPIN_UNLOCK(SERVER_LOCK, >lock);
>         return;
>     }
> 
>     __health_adjust(s, status);
>     printf("tid:%u UNLOCK srv: %s health_adjust b\n", tid, s->id);
>     HA_SPIN_UNLOCK(SERVER_LOCK, >lock);
> }
> 
> This leads to the following calls that are seemingly never unlocked the tid:
> stops appearing, and actual callstacks go several functions deeper that try
> to gain another lock:
> 
> tid:1 LOCKED srv: alias_srv1 process_chk_conn
> tid:0 LOCKED srv: alias_srv8 event_srv_chk_w
> 
> Then a little later the tid:2 also hangs.. probably it was time to perform
> another check or dns lookup or so...
> 
> (gdb) info threads
>   Id   Target Id Frame
>   1    LWP 100730 of process 15238 0x004ab6bc in
> snr_check_ip_callback (srv=0x8022d0400, ip=0x80288e394,
>     ip_family=0x7fffe207 "\002\224\343\210\002\b") at src/server.c:3781
>   2    LWP 100924 of process 15238 0x004ab6b4 in
> snr_check_ip_callback (srv=0x8022cb000, ip=0x80288e394,
>     ip_family=0x7fffdfffda97 "\002\224\343\210\002\b") at src/server.c:3781
> * 3    LWP 100925 of process 15238 0x0051ef21 in dns_resolve_recv
> (dgram=0x8022483c0) at src/dns.c:1646
> 
> More gdb 'bt full' output of all 3 threads is at the bottom of the attached
> logfile.
> 
> Hope someone can try and fix this :) Config and haproxy -vv also added
> below.
> 
> Thanks
> PiBa-NL (Pieter)
> 

As usual, you did most of the work :)
I think I understand what is going on, and it's ugly as hell.
Does the attached patch fix it for you ? It's been generated from master,
but will probably apply against 1.8 as well.

Thanks !

Olivier
>From b938f86e1fe51e95adc73f9e583dd225f5ecf88d Mon Sep 17 00:00:00 2001
From: Olivier Houchard 
Date: Tue, 22 May 2018 18:40:07 +0200
Subject: [PATCH] BUG/MEDIUM: dns: Delay the attempt to run a DNS resolution on
 check failure.

When checks fail, the code tries to run a dns resolution, in case the IP
changed.
The old way of doing that was to check, in case the last dns resolution
hadn't expired yet, if there were an applicable IP, which should be useless,
because it has already be done when the resolution was first done, or to
run a new resolution.
Both are a locking nightmare, and lead to deadlocks, so instead, just wake the
resolvers task, that should do the trick.

This should be backported to 1.8.
---
 src/dns.c | 9 -
 1 file changed, 4 insertions(+), 5 deletions(-)

diff --git a/src/dns.c b/src/dns.c
index e42950212..e2d46a658 100644
--- a/src/dns.c
+++ b/src/dns.c
@@ -348,11 +348,10 @@ void dns_trigger_resolution(struct dns_requester *req)
/* The resolution must not be triggered yet. Use the cached response, if
 * valid */
exp = tick_add(res->last_resolution, resolvers->hold.valid);
-   if (res->status == RSLV_STATUS_VALID &&
-   tick_isset(res->last_resolution) && !tick_is_expired(exp, now_ms))
-   req->requester_cb(req, NULL);
-   else
-   dns_run_resolution(res);
+   if (resolvers->t && (res->status != RSLV_STATUS_VALID ||
+   !tick_isset(res->last_resolution) || tick_is_expired(exp, now_ms)))
+   task_wakeup(resolvers->t, TASK_WOKEN_OTHER);
+
 }
 
 
-- 
2.14.3



DNS resolver + threads, 100% cpu usage / hang 1.9dev

2018-05-19 Thread PiBa-NL

Hi List,

With 1.8.8 ran into this, tried latest 1.9dev snapshot seems to have the 
same issue..


Running with 3 threads, a template for 8 servers, and only 2 ip's in the 
dns response, neither of which is actually 'up' one responds with 
'L4TOUT in 1004ms' the other with 'L4CON in 0ms' on stats page.. DNS 
comes from a local unbound dns-server on the same host as the haproxy 
process. after a little while (10 minutes+-) haproxy go's to 300% usage 
on all the 3 spin-locks and doesn't respond anymore.


There was no actual traffic passing through this machine. I am watching 
the stats page.

Running with 1 thread the issue does not seem to appear.

I think 2 threads are deadlocking each other.?. and then later the 3rd 
joins the waiting game.
I've added logging between a lot of most of the lock/unlock functions.. 
and it seems that the 2 succeeded locks below from lines 332 and 333 in 
attached logfile are the last ones where threads 1 and 0 are doing 
anything..


For example added logging like this everywhere ,the thread-id, the 
servername, and the function its inside of is logged with printf 
statements..:


static inline void health_adjust(struct server *s, short status)
{
    HA_SPIN_LOCK(SERVER_LOCK, >lock);
    printf("tid:%u LOCKED srv: %s health_adjust\n", tid, s->id);
    /* return now if observing nor health check is not enabled */
    if (!s->observe || !s->check.task) {
        printf("tid:%u UNLOCK srv: %s health_adjust a\n", tid, s->id);
        HA_SPIN_UNLOCK(SERVER_LOCK, >lock);
        return;
    }

    __health_adjust(s, status);
    printf("tid:%u UNLOCK srv: %s health_adjust b\n", tid, s->id);
    HA_SPIN_UNLOCK(SERVER_LOCK, >lock);
}

This leads to the following calls that are seemingly never unlocked the 
tid: stops appearing, and actual callstacks go several functions deeper 
that try to gain another lock:


tid:1 LOCKED srv: alias_srv1 process_chk_conn
tid:0 LOCKED srv: alias_srv8 event_srv_chk_w

Then a little later the tid:2 also hangs.. probably it was time to 
perform another check or dns lookup or so...


(gdb) info threads
  Id   Target Id Frame
  1    LWP 100730 of process 15238 0x004ab6bc in 
snr_check_ip_callback (srv=0x8022d0400, ip=0x80288e394,

    ip_family=0x7fffe207 "\002\224\343\210\002\b") at src/server.c:3781
  2    LWP 100924 of process 15238 0x004ab6b4 in 
snr_check_ip_callback (srv=0x8022cb000, ip=0x80288e394,

    ip_family=0x7fffdfffda97 "\002\224\343\210\002\b") at src/server.c:3781
* 3    LWP 100925 of process 15238 0x0051ef21 in 
dns_resolve_recv (dgram=0x8022483c0) at src/dns.c:1646


More gdb 'bt full' output of all 3 threads is at the bottom of the 
attached logfile.


Hope someone can try and fix this :) Config and haproxy -vv also added 
below.


Thanks
PiBa-NL (Pieter)

global
    maxconn            3002
    log            /var/run/log    local0    info
    stats socket /tmp/haproxy.socket level admin  expose-fd listeners
    uid            80
    gid            80
    nbproc            1
    nbthread            3
    hard-stop-after        15m
    chroot                /tmp/haproxy_chroot
    daemon
    tune.ssl.default-dh-param    2048
    log-send-hostname        haproxy-pb-test

    defaults
      # never fail on address resolution last,libc,none
      default-server init-addr last,none

      stats show-legends

    userlist myuserlist
      user admin insecure-password pass

listen HAProxyLocalStats
    bind :80 name localstats
    mode http
    stats enable
    stats refresh 5
    stats admin if TRUE
    stats uri /
stats show-legends
    timeout client 5000
    timeout connect 5000
    timeout server 5000

mailers globalmailers
    mailer pbmail 192.168.0.40:25

resolvers globalresolvers
    nameserver goog 127.0.0.1:53
    resolve_retries 3
    timeout retry 1s
    hold valid 10s

frontend alias_test
    bind            1.2.3.1:21 name 1.2.3.1:21   transparent
    bind            1.2.3.1:1-10002 name 1.2.3.1:1-10002 
transparent

    bind            1.2.3.2:21 name 1.2.3.2:21   transparent
    bind            1.2.3.2:1-10002 name 1.2.3.2:1-10002 
transparent

    bind            1:2:3::3:21 name 1:2:3::3:21   transparent
    bind            1:2:3::3:1-10002 name 1:2:3::3:1-10002 
transparent

    bind            1:2:3::4:21 name 1:2:3::4:21   transparent
    bind            1:2:3::4:1-10002 name 1:2:3::4:1-10002 
transparent

    bind            1.2.3.9:21 name 1.2.3.9:21   transparent
    bind            1.2.3.9:1-10002 name 1.2.3.9:1-10002 
transparent

    mode            http
    log            global
    option            socket-stats
    option            http-keep-alive
    timeout client        3
    default_backend alias_back_http_ipvANY

backend alias_back_http_ipvANY
    mode            http
    id            134
    log            global
    # use mailers
    # level  info
    email-alert mailers            globalmailers
    email-alert level