Re: 100% cpu usage 1.9-dev0-48d92ee 2018/07/30, task.?. but keeps working.. (nbthread 1)
Hi Pieter,

On Mon, Aug 20, 2018 at 09:33:49PM +0200, PiBa-NL wrote:
> Hi Olivier,
>
> On 17-8-2018 at 14:51, Willy Tarreau wrote:
> > On Fri, Aug 17, 2018 at 01:41:54PM +0200, Olivier Houchard wrote:
> > > That is true, this one is not a bug, but a pessimization, by using the
> > > global update_list which is more costly than the local one.
> > >
> > > Patches attached to do as suggested.
> > Applied, thank you!
> > willy
>
> Just a little update :)
>
> The '1.9-dev0-48d92ee 2018/07/30' + your initial 2 patches has been running
> correctly for 4+ days now, using as little cpu as can be expected for its
> small workload. I think I can call it 'fix confirmed', as you already knew ;)
> previously the issue would likely have returned within this time period.
>
> I'll keep it running for a few more days, and then switch back to nbthread 3.
> Till next time ;)
>
> Thanks again!
>
> Best regards,
> PiBa-NL (Pieter)

Thanks for the feedback :) Let us know how it goes with 3 threads!

Regards,
Olivier
Re: 100% cpu usage 1.9-dev0-48d92ee 2018/07/30, task.?. but keeps working.. (nbthread 1)
Hi Olivier,

On 17-8-2018 at 14:51, Willy Tarreau wrote:
> On Fri, Aug 17, 2018 at 01:41:54PM +0200, Olivier Houchard wrote:
> > That is true, this one is not a bug, but a pessimization, by using the
> > global update_list which is more costly than the local one.
> >
> > Patches attached to do as suggested.
> Applied, thank you!
> willy

Just a little update :)

The '1.9-dev0-48d92ee 2018/07/30' + your initial 2 patches has been running
correctly for 4+ days now, using as little cpu as can be expected for its
small workload. I think I can call it 'fix confirmed', as you already knew ;)
previously the issue would likely have returned within this time period.

I'll keep it running for a few more days, and then switch back to nbthread 3.
Till next time ;)

Thanks again!

Best regards,
PiBa-NL (Pieter)
Re: 100% cpu usage 1.9-dev0-48d92ee 2018/07/30, task.?. but keeps working.. (nbthread 1)
On Fri, Aug 17, 2018 at 01:41:54PM +0200, Olivier Houchard wrote:
> That is true, this one is not a bug, but a pessimization, by using the global
> update_list which is more costly than the local one.
>
> Patches attached to do as suggested.

Applied, thank you!
willy
Re: 100% cpu usage 1.9-dev0-48d92ee 2018/07/30, task.?. but keeps working.. (nbthread 1)
On Thu, Aug 16, 2018 at 07:31:17PM +0200, Willy Tarreau wrote:
> Both patches applied, thanks guys!
>
> Olivier, I have a suggestion for this one :
> On Thu, Aug 16, 2018 at 07:17:07PM +0200, Olivier Houchard wrote:
> > From 90fc92f72c6b47d88769bb73680702d7b8e6 Mon Sep 17 00:00:00 2001
> > From: Olivier Houchard
> > Date: Thu, 16 Aug 2018 19:03:02 +0200
> > Subject: [PATCH 1/2] BUG/MEDIUM: tasks: Don't insert in the global rqueue if
> >  nbthread == 1
> >
> > Make sure we don't insert a task in the global run queue if nbthread == 1,
> > as, as an optimisation, we avoid reading from it if nbthread == 1.
> > ---
> >  src/task.c | 3 ++-
> >  1 file changed, 2 insertions(+), 1 deletion(-)
> >
> > diff --git a/src/task.c b/src/task.c
> > index de097baf7..e357bc169 100644
> > --- a/src/task.c
> > +++ b/src/task.c
> > @@ -395,7 +395,8 @@ void process_runnable_tasks()
> >  		state = HA_ATOMIC_AND(&t->state, ~TASK_RUNNING);
> >  		if (state)
> >  #ifdef USE_THREAD
> > -			__task_wakeup(t, (t->thread_mask == tid_bit) ?
> > +			__task_wakeup(t, (t->thread_mask == tid_bit ||
> > +			    global.nbthread == 1) ?
> >  			    &rqueue_local[tid] : &rqueue);
> >  #else
>
> I'm pretty sure we'll get caught by this sort of stuff in the future
> again. I think it would be safer to proceed like this :
>
>     __task_wakeup(t, ((t->thread_mask & all_threads_mask) == tid_bit) ?
>
> It should always be valid regardless of the number of threads. The same
> could be done in task_wakeup().

Sure, why not.

> I suspect we may have a similar issue with the fd cache applied to listeners
> here :
>
>     static inline void updt_fd_polling(const int fd)
>     {
>     	if (fdtab[fd].thread_mask == tid_bit) {
>     		unsigned int oldupdt;
>
>     		/* note: we don't have a test-and-set yet in hathreads */
>
> In this case the thread_mask might be larger than all_threads_mask and
> we might fail this test. Or we may see that when threads exit and the
> threads mask changes.
>
> Just my two cents, both patches were applied anyway.
That is true, this one is not a bug, but a pessimization, by using the global
update_list which is more costly than the local one.

Patches attached to do as suggested.

Regards,
Olivier

>From c61f3f76f7e546aac14d7db33552dba11ce71e12 Mon Sep 17 00:00:00 2001
From: Olivier Houchard
Date: Fri, 17 Aug 2018 13:36:08 +0200
Subject: [PATCH 1/2] MINOR: tasks: Don't special-case when nbthreads == 1

Instead of checking if nbthreads == 1, just AND thread_mask with
all_threads_mask to know if we're supposed to add the task to the local or
the global runqueue.
---
 src/task.c | 3 +--
 1 file changed, 1 insertion(+), 2 deletions(-)

diff --git a/src/task.c b/src/task.c
index e357bc169..ce5b4f907 100644
--- a/src/task.c
+++ b/src/task.c
@@ -395,8 +395,7 @@ void process_runnable_tasks()
 		state = HA_ATOMIC_AND(&t->state, ~TASK_RUNNING);
 		if (state)
 #ifdef USE_THREAD
-			__task_wakeup(t, (t->thread_mask == tid_bit ||
-			    global.nbthread == 1) ?
+			__task_wakeup(t, ((t->thread_mask & all_threads_mask) == tid_bit) ?
 			    &rqueue_local[tid] : &rqueue);
 #else
 			__task_wakeup(t, &rqueue_local[tid]);
-- 
2.14.3

>From ea3b07fd2ed96b46ea107c9d862b5a1c27e728c2 Mon Sep 17 00:00:00 2001
From: Olivier Houchard
Date: Fri, 17 Aug 2018 13:37:59 +0200
Subject: [PATCH 2/2] MINOR: fd cache: AND the thread_mask with
 all_threads_mask.

When we choose to insert a fd in either the global or the local fd update
list, AND the thread_mask with all_threads_mask before checking whether it
equals tid_bit. That way, if we run with nbthreads == 1, we will always use
the local list, which is cheaper than the global one.
---
 include/proto/fd.h | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/include/proto/fd.h b/include/proto/fd.h
index a4cee3220..a3ec5e854 100644
--- a/include/proto/fd.h
+++ b/include/proto/fd.h
@@ -109,7 +109,7 @@ void fd_rm_from_fd_list(volatile struct fdlist *list, int fd, int off);
  */
 static inline void updt_fd_polling(const int fd)
 {
-	if (fdtab[fd].thread_mask == tid_bit) {
+	if ((fdtab[fd].thread_mask & all_threads_mask) == tid_bit) {
 		unsigned int oldupdt;
 
 		/* note: we don't have a test-and-set yet in hathreads */
-- 
2.14.3
Re: 100% cpu usage 1.9-dev0-48d92ee 2018/07/30, task.?. but keeps working.. (nbthread 1)
Both patches applied, thanks guys!

Olivier, I have a suggestion for this one :

On Thu, Aug 16, 2018 at 07:17:07PM +0200, Olivier Houchard wrote:
> From 90fc92f72c6b47d88769bb73680702d7b8e6 Mon Sep 17 00:00:00 2001
> From: Olivier Houchard
> Date: Thu, 16 Aug 2018 19:03:02 +0200
> Subject: [PATCH 1/2] BUG/MEDIUM: tasks: Don't insert in the global rqueue if
>  nbthread == 1
>
> Make sure we don't insert a task in the global run queue if nbthread == 1,
> as, as an optimisation, we avoid reading from it if nbthread == 1.
> ---
>  src/task.c | 3 ++-
>  1 file changed, 2 insertions(+), 1 deletion(-)
>
> diff --git a/src/task.c b/src/task.c
> index de097baf7..e357bc169 100644
> --- a/src/task.c
> +++ b/src/task.c
> @@ -395,7 +395,8 @@ void process_runnable_tasks()
>  		state = HA_ATOMIC_AND(&t->state, ~TASK_RUNNING);
>  		if (state)
>  #ifdef USE_THREAD
> -			__task_wakeup(t, (t->thread_mask == tid_bit) ?
> +			__task_wakeup(t, (t->thread_mask == tid_bit ||
> +			    global.nbthread == 1) ?
>  			    &rqueue_local[tid] : &rqueue);
>  #else

I'm pretty sure we'll get caught by this sort of stuff in the future
again. I think it would be safer to proceed like this :

    __task_wakeup(t, ((t->thread_mask & all_threads_mask) == tid_bit) ?

It should always be valid regardless of the number of threads. The same
could be done in task_wakeup().

I suspect we may have a similar issue with the fd cache applied to listeners
here :

    static inline void updt_fd_polling(const int fd)
    {
    	if (fdtab[fd].thread_mask == tid_bit) {
    		unsigned int oldupdt;

    		/* note: we don't have a test-and-set yet in hathreads */

In this case the thread_mask might be larger than all_threads_mask and
we might fail this test. Or we may see that when threads exit and the
threads mask changes.

Just my two cents, both patches were applied anyway.

Willy
Re: 100% cpu usage 1.9-dev0-48d92ee 2018/07/30, task.?. but keeps working.. (nbthread 1)
Hi again,

On Thu, Aug 16, 2018 at 05:50:27PM +0200, Olivier Houchard wrote:
> Hi Pieter,
>
> On Thu, Aug 16, 2018 at 12:24:04AM +0200, PiBa-NL wrote:
> > Hi List,
> >
> > Anyone got an idea how to debug this further?
> > Currently it's running at 100% again; any pointers to debug the process
> > as it's running would be appreciated.
> >
> > Or should I compile again from current master and 'hope' it doesn't return?
> >
> > b.t.w. truss output is as follows:
> > kevent(3,0x0,0,{ },200,{ 0.0 })  = 0 (0x0)
> > kevent(3,0x0,0,{ },200,{ 0.0 })  = 0 (0x0)
> > kevent(3,0x0,0,{ },200,{ 0.0 })  = 0 (0x0)
> > kevent(3,0x0,0,{ },200,{ 0.0 })  = 0 (0x0)
> > kevent(3,0x0,0,{ },200,{ 0.0 })  = 0 (0x0)
> > kevent(3,0x0,0,{ },200,{ 0.0 })  = 0 (0x0)
> > kevent(3,0x0,0,{ },200,{ 0.0 })  = 0 (0x0)
> >
> > Regards,
> > PiBa-NL (Pieter)
>
> I'm interested in figuring that one out.
> From the look of it, it seems the scheduler thinks there's a task to be run,
> and so won't let the poller sleep in kevent().
> Can you
> - update to the latest master, even though I don't think any relevant fix
>   was applied since Jul 30.
> - compile it with -O0, so that we can get meaningful information from gdb.
> - when/if that happens again, get a core, and send it to me with the
>   haproxy binary, assuming there's no confidential information in that
>   core, of course.
>
> Thanks !

So after discussing on IRC, I'm pretty sure I figured it out. The two
attached patches should fix it.

Thanks a lot !

Olivier

>From 90fc92f72c6b47d88769bb73680702d7b8e6 Mon Sep 17 00:00:00 2001
From: Olivier Houchard
Date: Thu, 16 Aug 2018 19:03:02 +0200
Subject: [PATCH 1/2] BUG/MEDIUM: tasks: Don't insert in the global rqueue if
 nbthread == 1

Make sure we don't insert a task in the global run queue if nbthread == 1,
as, as an optimisation, we avoid reading from it if nbthread == 1.
---
 src/task.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/src/task.c b/src/task.c
index de097baf7..e357bc169 100644
--- a/src/task.c
+++ b/src/task.c
@@ -395,7 +395,8 @@ void process_runnable_tasks()
 		state = HA_ATOMIC_AND(&t->state, ~TASK_RUNNING);
 		if (state)
 #ifdef USE_THREAD
-			__task_wakeup(t, (t->thread_mask == tid_bit) ?
+			__task_wakeup(t, (t->thread_mask == tid_bit ||
+			    global.nbthread == 1) ?
 			    &rqueue_local[tid] : &rqueue);
 #else
 			__task_wakeup(t, &rqueue_local[tid]);
-- 
2.14.3

>From 7640aa3de3c9ad00fe82cda4a50351e46fc0bf48 Mon Sep 17 00:00:00 2001
From: Olivier Houchard
Date: Thu, 16 Aug 2018 19:03:50 +0200
Subject: [PATCH 2/2] BUG/MEDIUM: sessions: Don't use t->state.

In session_expire_embryonic(), don't use t->state, use the "state"
argument instead, as t->state has been cleaned before we're being called.
---
 src/session.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/src/session.c b/src/session.c
index 1d66b739f..d7d8544c7 100644
--- a/src/session.c
+++ b/src/session.c
@@ -388,7 +388,7 @@ static struct task *session_expire_embryonic(struct task *t, void *context, unsi
 {
 	struct session *sess = context;
 
-	if (!(t->state & TASK_WOKEN_TIMER))
+	if (!(state & TASK_WOKEN_TIMER))
 		return t;
 
 	session_kill_embryonic(sess);
-- 
2.14.3
Re: 100% cpu usage 1.9-dev0-48d92ee 2018/07/30, task.?. but keeps working.. (nbthread 1)
Hi Pieter,

On Thu, Aug 16, 2018 at 12:24:04AM +0200, PiBa-NL wrote:
> Hi List,
>
> Anyone got an idea how to debug this further?
> Currently it's running at 100% again; any pointers to debug the process
> as it's running would be appreciated.
>
> Or should I compile again from current master and 'hope' it doesn't return?
>
> b.t.w. truss output is as follows:
> kevent(3,0x0,0,{ },200,{ 0.0 })  = 0 (0x0)
> kevent(3,0x0,0,{ },200,{ 0.0 })  = 0 (0x0)
> kevent(3,0x0,0,{ },200,{ 0.0 })  = 0 (0x0)
> kevent(3,0x0,0,{ },200,{ 0.0 })  = 0 (0x0)
> kevent(3,0x0,0,{ },200,{ 0.0 })  = 0 (0x0)
> kevent(3,0x0,0,{ },200,{ 0.0 })  = 0 (0x0)
> kevent(3,0x0,0,{ },200,{ 0.0 })  = 0 (0x0)
>
> Regards,
> PiBa-NL (Pieter)

I'm interested in figuring that one out.
From the look of it, it seems the scheduler thinks there's a task to be run,
and so won't let the poller sleep in kevent().
Can you
- update to the latest master, even though I don't think any relevant fix
  was applied since Jul 30.
- compile it with -O0, so that we can get meaningful information from gdb.
- when/if that happens again, get a core, and send it to me with the
  haproxy binary, assuming there's no confidential information in that
  core, of course.

Thanks !

Olivier
Re: 100% cpu usage 1.9-dev0-48d92ee 2018/07/30, task.?. but keeps working.. (nbthread 1)
Hi List,

Anyone got an idea how to debug this further?
Currently it's running at 100% again; any pointers to debug the process as
it's running would be appreciated.

Or should I compile again from current master and 'hope' it doesn't return?

b.t.w. truss output is as follows:
kevent(3,0x0,0,{ },200,{ 0.0 })  = 0 (0x0)
kevent(3,0x0,0,{ },200,{ 0.0 })  = 0 (0x0)
kevent(3,0x0,0,{ },200,{ 0.0 })  = 0 (0x0)
kevent(3,0x0,0,{ },200,{ 0.0 })  = 0 (0x0)
kevent(3,0x0,0,{ },200,{ 0.0 })  = 0 (0x0)
kevent(3,0x0,0,{ },200,{ 0.0 })  = 0 (0x0)
kevent(3,0x0,0,{ },200,{ 0.0 })  = 0 (0x0)

Regards,
PiBa-NL (Pieter)

On 8-8-2018 at 22:49, PiBa-NL wrote:

Hi List,

I've got a weird issue, and I'm not sure where/how to continue digging at
the moment... Using nbthread=1 nbproc=1, a few lua scripts, ssl
offloading / http traffic. Only a few connections (< 100). Sometimes
haproxy starts using 100% cpu, after a few days (which makes it hard to
debug). Currently running with version 'HA-Proxy version 1.9-dev0-48d92ee
2018/07/30'.

I've run some commands against the haproxy socket like 'show activity';
as can be seen there are lots of loops and tasks in just a second of time.

[2.4.3-RELEASE][root@pfsense_3]/root: /usr/local/pkg/haproxy/haproxy_socket.sh show activity
show activity
thread_id: 0
date_now: 1533754729.799405
loops: 828928664
wake_cache: 845396
wake_tasks: 827400248
wake_signal: 0
poll_exp: 828245644
poll_drop: 17451
poll_dead: 0
poll_skip: 0
fd_skip: 0
fd_lock: 0
fd_del: 0
conn_dead: 0
stream: 101147
empty_rq: 1242050
long_rq: 0

[2.4.3-RELEASE][root@pfsense_3]/root: /usr/local/pkg/haproxy/haproxy_socket.sh show activity
show activity
thread_id: 0
date_now: 1533754731.084664
loops: 829000230
wake_cache: 845398
wake_tasks: 827471812
wake_signal: 0
poll_exp: 828317210
poll_drop: 17452
poll_dead: 0
poll_skip: 0
fd_skip: 0
fd_lock: 0
fd_del: 0
conn_dead: 0
stream: 101149
empty_rq: 1242050
long_rq: 0

Other than that I've tried to attach gdb and step through / log some
functions; it passes through.
With a gdb 'command' file like below I created a little log of function
breakpoints hit:

set pagination off
set height 0
set logging on
delete
rbreak haproxy.c:.
rbreak session.c:.
rbreak hlua.c:.
rbreak task.c:.
commands 1-9
cont
end
cont

Which got me a log with the following content. As you can see it 'seems' to
be looping over the same task multiple times, which might not even be a
problem..? The t=0x802545a60 expires and wakes up, expires and wakes up.?.:

Breakpoint 249, __task_queue (task=0x802545a60) at src/task.c:185
185	src/task.c: No such file or directory.

Breakpoint 253, wake_expired_tasks () at src/task.c:209
209	in src/task.c

Breakpoint 250, __task_wakeup (t=0x802545a60, root=0x8ced50 ) at src/task.c:72
72	in src/task.c

Breakpoint 41, sync_poll_loop () at src/haproxy.c:2378
2378	src/haproxy.c: No such file or directory.

Breakpoint 252, process_runnable_tasks () at src/task.c:275
275	src/task.c: No such file or directory.

Breakpoint 51, session_expire_embryonic (t=0x802545a60, context=0x8024483a0, state=513) at src/session.c:389
389	src/session.c: No such file or directory.

Breakpoint 249, __task_queue (task=0x802545a60) at src/task.c:185
185	src/task.c: No such file or directory.

Breakpoint 253, wake_expired_tasks () at src/task.c:209
209	in src/task.c

Breakpoint 250, __task_wakeup (t=0x802545a60, root=0x8ced50 ) at src/task.c:72
72	in src/task.c

Breakpoint 41, sync_poll_loop () at src/haproxy.c:2378
2378	src/haproxy.c: No such file or directory.

Breakpoint 252, process_runnable_tasks () at src/task.c:275
275	src/task.c: No such file or directory.

Breakpoint 51, session_expire_embryonic (t=0x802545a60, context=0x8024483a0, state=513) at src/session.c:389
389	src/session.c: No such file or directory.

Breakpoint 249, __task_queue (task=0x802545a60) at src/task.c:185
185	src/task.c: No such file or directory.
Breakpoint 253, wake_expired_tasks () at src/task.c:209
209	in src/task.c

Breakpoint 250, __task_wakeup (t=0x802545a60, root=0x8ced50 ) at src/task.c:72
72	in src/task.c

Breakpoint 41, sync_poll_loop () at src/haproxy.c:2378
2378	src/haproxy.c: No such file or directory.

Breakpoint 252, process_runnable_tasks () at src/task.c:275
275	src/task.c: No such file or directory.

Breakpoint 51, session_expire_embryonic (t=0x802545a60, context=0x8024483a0, state=513) at src/session.c:389
389	src/session.c: No such file or directory.

Breakpoint 249, __task_queue (task=0x802545a60) at src/task.c:185
185	src/task.c: No such file or directory.

Breakpoint 253, wake_expired_tasks () at src/task.c:209
209	in src/task.c

Breakpoint 250, __task_wakeup (t=0x802545a60, root=0x8ced50 ) at src/task.c:72
72	in src/task.c

Breakpoint 41, sync_poll_loop () at src/haproxy.c:2378
2378	src/haprox