Re: haproxy 1.8.12 / 1.9- 20180623 / stopping process hangs with threads (100% cpu) on -sf reload / FreeBSD

2018-07-20 Thread Willy Tarreau
On Fri, Jul 20, 2018 at 09:56:23PM +0200, PiBa-NL wrote:
> Thanks Christopher & Willy,
> 
> Op 20-7-2018 om 14:26 schreef Willy Tarreau:
> > > Op 20-7-2018 om 10:43 schreef Christopher Faulet:
> > OK finally I've merged it because it obviously fixes a bug.
> > Willy
> 
> Confirmed fixed with current master's:
> HA-Proxy version 1.8.12-5e100b4 2018/07/20
> HA-Proxy version 1.9-dev0-842ed9b 2018/07/20
> 
> (Well at least my reproduction doesn't work anymore.. while it did quite
> easily before.) So thats good ;)

Excellent, thanks for confirming Pieter.
Have a nice week-end!

Willy



Re: haproxy 1.8.12 / 1.9- 20180623 / stopping process hangs with threads (100% cpu) on -sf reload / FreeBSD

2018-07-20 Thread PiBa-NL

Thanks Christopher & Willy,

Op 20-7-2018 om 14:26 schreef Willy Tarreau:

Op 20-7-2018 om 10:43 schreef Christopher Faulet:

OK finally I've merged it because it obviously fixes a bug.
Willy


Confirmed fixed with current master's:
HA-Proxy version 1.8.12-5e100b4 2018/07/20
HA-Proxy version 1.9-dev0-842ed9b 2018/07/20

(Well at least my reproduction doesn't work anymore.. while it did quite 
easily before.) So thats good ;)


Regards,
PiBa-NL (Pieter)




Re: haproxy 1.8.12 / 1.9- 20180623 / stopping process hangs with threads (100% cpu) on -sf reload / FreeBSD

2018-07-20 Thread Willy Tarreau
On Fri, Jul 20, 2018 at 10:43:13AM +0200, Christopher Faulet wrote:
> Damn! I forgot to check that. We talked about it 30 min ago though!

OK finally I've merged it because it obviously fixes a bug and William
wants to prepare another release which I think is a good idea. We can
revisit this later if it's not enough anyway.

Willy



Re: haproxy 1.8.12 / 1.9- 20180623 / stopping process hangs with threads (100% cpu) on -sf reload / FreeBSD

2018-07-20 Thread Christopher Faulet

Le 20/07/2018 à 10:38, Willy Tarreau a écrit :

On Fri, Jul 20, 2018 at 10:27:42AM +0200, Christopher Faulet wrote:

In thread_sync_barrier, we exit when all threads have set their own bit in the
barrier mask. It is done by comparing it to all_threads_mask. But we must not
use a simple equality to do so, becaue all_threads_mask may change. Since commit
ba86c6c25 ("MINOR: threads: Be sure to remove threads from all_threads_mask on
exit"), when a thread exit, its bit is removed from all_threads_mask. Instead,
we must use a bitwise AND to test is all bits of all_threads_mask are set.


I agree on the principle, however it may still not work because
all_threads_mask is not marked volatile, so the compiler will most of
the time fetch the value into a register and will loop on it, keeping
the wrong value all the time. The case where it matters is when the
all_threads_mask is fetched and cached when entering the function,
then loses the value while barrier still has this bit set.

Thus I think that adding only this to your patch will definitely kill
this bug :


Damn! I forgot to check that. We talked about it 30 min ago though!

--
Christopher Faulet



Re: haproxy 1.8.12 / 1.9- 20180623 / stopping process hangs with threads (100% cpu) on -sf reload / FreeBSD

2018-07-20 Thread Willy Tarreau
On Fri, Jul 20, 2018 at 10:27:42AM +0200, Christopher Faulet wrote:
> In thread_sync_barrier, we exit when all threads have set their own bit in the
> barrier mask. It is done by comparing it to all_threads_mask. But we must not
> use a simple equality to do so, becaue all_threads_mask may change. Since 
> commit
> ba86c6c25 ("MINOR: threads: Be sure to remove threads from all_threads_mask on
> exit"), when a thread exit, its bit is removed from all_threads_mask. Instead,
> we must use a bitwise AND to test is all bits of all_threads_mask are set.

I agree on the principle, however it may still not work because
all_threads_mask is not marked volatile, so the compiler will most of
the time fetch the value into a register and will loop on it, keeping
the wrong value all the time. The case where it matters is when the
all_threads_mask is fetched and cached when entering the function,
then loses the value while barrier still has this bit set.

Thus I think that adding only this to your patch will definitely kill
this bug :

diff --git a/include/common/hathreads.h b/include/common/hathreads.h
index 5c4ceca..274f988 100644
--- a/include/common/hathreads.h
+++ b/include/common/hathreads.h
@@ -260,7 +260,7 @@ void thread_exit_sync(void);
 int  thread_no_sync(void);
 int  thread_need_sync(void);
 
-extern unsigned long all_threads_mask;
+extern volatile unsigned long all_threads_mask;
 
 #define ha_sigmask(how, set, oldset)  pthread_sigmask(how, set, oldset)
 
diff --git a/src/hathreads.c b/src/hathreads.c
index 5db3c21..aada8ed 100644
--- a/src/hathreads.c
+++ b/src/hathreads.c
@@ -31,7 +31,7 @@ void thread_sync_io_handler(int fd)
 static HA_SPINLOCK_T sync_lock;
 static int   threads_sync_pipe[2];
 static unsigned long threads_want_sync = 0;
-unsigned long all_threads_mask  = 0;
+volatile unsigned long all_threads_mask  = 0;
 
 #if defined(DEBUG_THREAD) || defined(DEBUG_FULL)
 struct lock_stat lock_stats[LOCK_LABELS];

Pieter, your feedback is welcome, as usual :-)

Don't bother respinning the patch Christopher, I can add this myself
after Pieter's test.

Cheers,
Willy



Re: haproxy 1.8.12 / 1.9- 20180623 / stopping process hangs with threads (100% cpu) on -sf reload / FreeBSD

2018-07-20 Thread Christopher Faulet

Le 17/07/2018 à 19:39, PiBa-NL a écrit :

Hi Christopher,

Op 17-7-2018 om 10:09 schreef Christopher Faulet:

Could you try to revert the following commit please ?

  * ba86c6c25 MINOR: threads: Be sure to remove threads from
all_threads_mask on exit


Without this specific commit the termination of the old process works
'properly'.
That is.. for testing i used 1.9 snapshot of 20180714 and included a
little patch to remove the 'atomic and'.. which is basically what that
commit added..
   #ifdef USE_THREAD
-    HA_ATOMIC_AND(_threads_mask, ~tid_bit);
   if (tid > 0)
       pthread_exit(NULL);
   #endif

Also snapshot of 20180622 +
'0461-BUG-MEDIUM-threads-Use-the-sync-point-to-che-1.9-dev0.patch' works
okay.

Though i guess just reverting that line is not the right fix ;).



Hi,

Thanks Pieter. Sorry for the delay, I was busy on something else...

The bug is really in the sync point. Reverting the patch was just an 
easy way to spot it. Here is the right fix. It should be backported in 1.8.


Thanks,
--
Christopher Faulet
>From 7da071dfe4cff3cc4cf7a195d4f5e056c7281803 Mon Sep 17 00:00:00 2001
From: Christopher Faulet 
Date: Fri, 20 Jul 2018 09:31:53 +0200
Subject: [PATCH] BUG/MEDIUM: threads: Fix the exit condition of the thread
 barrier

In thread_sync_barrier, we exit when all threads have set their own bit in the
barrier mask. It is done by comparing it to all_threads_mask. But we must not
use a simple equality to do so, becaue all_threads_mask may change. Since commit
ba86c6c25 ("MINOR: threads: Be sure to remove threads from all_threads_mask on
exit"), when a thread exit, its bit is removed from all_threads_mask. Instead,
we must use a bitwise AND to test is all bits of all_threads_mask are set.

This patch must be backported in 1.8.
---
 src/hathreads.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/src/hathreads.c b/src/hathreads.c
index 5db3c2197..19a787fb8 100644
--- a/src/hathreads.c
+++ b/src/hathreads.c
@@ -106,7 +106,7 @@ static inline void thread_sync_barrier(volatile unsigned long *barrier)
 
 	HA_ATOMIC_CAS(barrier, , 0);
 	HA_ATOMIC_OR(barrier, tid_bit);
-	while (*barrier != all_threads_mask)
+	while ((*barrier & all_threads_mask) != all_threads_mask)
 		pl_cpu_relax();
 }
 
-- 
2.17.1



Re: haproxy 1.8.12 / 1.9- 20180623 / stopping process hangs with threads (100% cpu) on -sf reload / FreeBSD

2018-07-17 Thread PiBa-NL

Hi Christopher,

Op 17-7-2018 om 10:09 schreef Christopher Faulet:

Could you try to revert the following commit please ?

 * ba86c6c25 MINOR: threads: Be sure to remove threads from 
all_threads_mask on exit


Without this specific commit the termination of the old process works 
'properly'.
That is.. for testing i used 1.9 snapshot of 20180714 and included a 
little patch to remove the 'atomic and'.. which is basically what that 
commit added..

 #ifdef USE_THREAD
-    HA_ATOMIC_AND(_threads_mask, ~tid_bit);
 if (tid > 0)
     pthread_exit(NULL);
 #endif

Also snapshot of 20180622 + 
'0461-BUG-MEDIUM-threads-Use-the-sync-point-to-che-1.9-dev0.patch' works 
okay.


Though i guess just reverting that line is not the right fix ;).

Regards,
PiBa-NL (Pieter)



Re: haproxy 1.8.12 / 1.9- 20180623 / stopping process hangs with threads (100% cpu) on -sf reload / FreeBSD

2018-07-17 Thread Christopher Faulet

Le 17/07/2018 à 01:08, PiBa-NL a écrit :

Hi List,

With a build of 1.8.12 (and the 1.9 snapshot of 20180623 ) im getting 
the 'old' haproxy process take up 100% cpu usage when using 3 threads in 
the config and reloading with -sf parameter. I'm using FreeBSD.. (It 
also happens with the 14-7 snapshot.)


It seems to happen after 1 thread quits, one of the others gets out of 
control.

Most of the time it happens after the first reload.:
haproxy -f /var/etc/haproxy/haproxy.cfg -D
haproxy -f /var/etc/haproxy/haproxy.cfg -D -sf 19110

The main features i use are:  ssl offloading / lua / threads
Only a little to no traffic passing through though, im seeing this 
behavior also on my in-active production node, the strange part sofar 
though is that i could not reproduce it yet on my test machine.
If someone has got a idea on how to patch or what direction to search 
for a fix i'm happy to try.


If there is nothing obvious that can be spotted with the info from the 
stack-traces of both 1.8 and 1.9 below ill try and dig further tomorrow 
:).Thanks in advance for anyone's time :).


Hi Pieter,

This is a problem with the sync point... again. Could you try to revert 
the following commit please ?


 * ba86c6c25 MINOR: threads: Be sure to remove threads from 
all_threads_mask on exit


This fix was probably added too quickly. I suspect a problem when a 
thread exits, removing its tid_bit from all_threads_mask while some 
others are looping in the sync point.


Regards,
--
Christopher Faulet



haproxy 1.8.12 / 1.9- 20180623 / stopping process hangs with threads (100% cpu) on -sf reload / FreeBSD

2018-07-16 Thread PiBa-NL

Hi List,

With a build of 1.8.12 (and the 1.9 snapshot of 20180623 ) im getting 
the 'old' haproxy process take up 100% cpu usage when using 3 threads in 
the config and reloading with -sf parameter. I'm using FreeBSD.. (It 
also happens with the 14-7 snapshot.)


It seems to happen after 1 thread quits, one of the others gets out of 
control.

Most of the time it happens after the first reload.:
haproxy -f /var/etc/haproxy/haproxy.cfg -D
haproxy -f /var/etc/haproxy/haproxy.cfg -D -sf 19110

The main features i use are:  ssl offloading / lua / threads
Only a little to no traffic passing through though, im seeing this 
behavior also on my in-active production node, the strange part sofar 
though is that i could not reproduce it yet on my test machine.
If someone has got a idea on how to patch or what direction to search 
for a fix i'm happy to try.


If there is nothing obvious that can be spotted with the info from the 
stack-traces of both 1.8 and 1.9 below ill try and dig further tomorrow 
:).Thanks in advance for anyone's time :).


Regards,
PiBa-NL (Pieter)

p.s.
I CC'ed Christopher as he seems to have made the last 2 patches going 
into 20180623. Im hoping he has a clue on what to do next :).


## Below stack traces are from 1.8.12.. ##

[2.4.3-RELEASE][admin@pfsense_3]/root: /usr/local/bin/gdb --pid 2136 
/usr/local/sbin/haproxy

GNU gdb (GDB) 8.0.1 [GDB v8.0.1 for FreeBSD]
Copyright (C) 2017 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later 


This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.  Type "show copying"
and "show warranty" for details.
This GDB was configured as "x86_64-portbld-freebsd11.1".
Type "show configuration" for configuration details.
For bug reporting instructions, please see:
.
Find the GDB manual and other documentation resources online at:
.
For help, type "help".
Type "apropos word" to search for commands related to "word"...
Reading symbols from /usr/local/sbin/haproxy...done.
Attaching to program: /usr/local/sbin/haproxy, process 2136
[New LWP 101099 of process 2136]
Reading symbols from /lib/libcrypt.so.5...(no debugging symbols 
found)...done.

Reading symbols from /lib/libz.so.6...(no debugging symbols found)...done.
Reading symbols from /lib/libthr.so.3...(no debugging symbols found)...done.
Reading symbols from /usr/lib/libssl.so.8...(no debugging symbols 
found)...done.
Reading symbols from /lib/libcrypto.so.8...(no debugging symbols 
found)...done.
Reading symbols from /usr/local/lib/liblua-5.3.so...(no debugging 
symbols found)...done.

Reading symbols from /lib/libm.so.5...(no debugging symbols found)...done.
Reading symbols from /lib/libc.so.7...(no debugging symbols found)...done.
Reading symbols from /libexec/ld-elf.so.1...(no debugging symbols 
found)...done.

[Switching to LWP 100345 of process 2136]
0x000800f0f91c in ?? () from /lib/libthr.so.3
(gdb) thread info
Invalid thread ID: info
(gdb) info thread
  Id   Target Id Frame
* 1    LWP 100345 of process 2136 0x000800f0f91c in ?? () from 
/lib/libthr.so.3
  2    LWP 101099 of process 2136 thread_sync_barrier (barrier=0x8bc4e0 
) at src/hathreads.c:112

(gdb) bt full
#0  0x000800f0f91c in ?? () from /lib/libthr.so.3
No symbol table info available.
#1  0x000800f0bf97 in ?? () from /lib/libthr.so.3
No symbol table info available.
#2  0x0050bd4f in main (argc=6, argv=0x7fffec70) at 
src/haproxy.c:3077

    tids = 0x80249cf40
    threads = 0x802487240
    i = 2
    old_sig = {__bits = {1073741824, 0, 0, 0}}
    blocked_sig = {__bits = {4227856759, 4294967295, 4294967295, 
4294967295}}

    err = 0
    retry = 200
    limit = {rlim_cur = 2282, rlim_max = 2282}
    errmsg = 
"\000\354\377\377\377\177\000\000\250\354\377\377\377\177\000\000p\354\377\377\377\177\000\000\006\000\000\000\000\000\000\000\270\030\241\374+/\213\032`e\213\000\000\000\000\000h\354\377\377\377\177\000\000\250\354\377\377\377\177\000\000p\354\377\377\377\177\000\000\006\000\000\000\000\000\000\000\020\354\377\377\377\177\000\000r3\340\001\b\000\000\000\001\000\000"

    pidfd = 27
(gdb) thread 2
[Switching to thread 2 (LWP 101099 of process 2136)]
#0  thread_sync_barrier (barrier=0x8bc4e0 ) at 
src/hathreads.c:112

112 src/hathreads.c: No such file or directory.
(gdb) bt full
#0  thread_sync_barrier (barrier=0x8bc4e0 ) at 
src/hathreads.c:112

    old = 7
#1  0x005ae23f in thread_exit_sync () at src/hathreads.c:151
    barrier = 7
#2  0x0051260d in sync_poll_loop () at src/haproxy.c:2391
    stop = 1
#3  0x00512533 in run_poll_loop () at src/haproxy.c:2438
    next = -1636293614
    exp = -1636293614
#4  0x0050f10b in run_thread_poll_loop (data=0x80249cf48) at 
src/haproxy.c:2470
    start_lock =