On 15/02/17 18:04 +0100, Jan Pokorný wrote:
> On 15/02/17 15:13 +0000, Christine Caulfield wrote:
>> On 15/02/17 14:50, Jan Friesse wrote:
>>>> Hi all,
>>>>
>>>> Corosync Cluster Engine, version '2.3.4'
>>>> Copyright (c) 2006-2009 Red Hat, Inc.
>>>>
>>>> Today I found corosync consuming 100% CPU. Strace showed the following:
>>>>
>>>> write(7, "\v\0\0\0", 4) = -1 EAGAIN (Resource temporarily unavailable)
>>>> write(7, "\v\0\0\0", 4) = -1 EAGAIN (Resource temporarily unavailable)
>>>>
>>>> Then I used gcore to get the coredump.
>>>>
>>>> (gdb) bt
>>>> #0  0x00007f038b74b1cd in write () from /lib64/libpthread.so.0
>>>> #1  0x00007f038b9656ed in _handle_real_signal_ (signal_num=<optimized out>, si=<optimized out>, context=<optimized out>) at loop_poll.c:474
>>>> #2  <signal handler called>
>>>> #3  0x0000000000000000 in ?? ()
>>>> #4  0x00007f038c220a3d in schedwrk_processor (context=<optimized out>) at sync.c:551
>>>> #5  0x00007f038c23042b in schedwrk_do (type=<optimized out>, context=0x6a12d56300000001) at schedwrk.c:77
>>>> #6  0x00007f038bdd49f7 in token_callbacks_execute (type=TOTEM_CALLBACK_TOKEN_SENT, instance=<optimized out>) at totemsrp.c:3493
>>>> #7  message_handler_orf_token (instance=<optimized out>, msg=<optimized out>, endian_conversion_needed=<optimized out>, msg_len=<optimized out>) at totemsrp.c:3894
>>>> #8  0x00007f038bdd65a5 in message_handler_orf_token (instance=<optimized out>, msg=<optimized out>, msg_len=<optimized out>, endian_conversion_needed=<optimized out>) at totemsrp.c:3609
>>>> #9  0x00007f038bdcdfb9 in rrp_deliver_fn (context=0x7f038d541840, msg=0x7f038d541af8, msg_len=70) at totemrrp.c:1941
>>>> #10 0x00007f038bdca01e in net_deliver_fn (fd=<optimized out>, revents=<optimized out>, data=0x7f038d541a90) at totemudpu.c:499
>>>> #11 0x00007f038b96576f in _poll_dispatch_and_take_back_ (item=0x7f038d4fe168, p=<optimized out>) at loop_poll.c:108
>>>> #12 0x00007f038b965300 in qb_loop_run_level (level=0x7f038d4fde08) at loop.c:43
>>>> #13 qb_loop_run (lp=<optimized out>) at loop.c:210
>>>> #14 0x00007f038c21b6d0 in main (argc=<optimized out>, argv=<optimized out>, envp=<optimized out>) at main.c:1383
>>>>
>>>> (gdb) f 1
>>>> #1  0x00007f038b9656ed in _handle_real_signal_ (signal_num=<optimized out>, si=<optimized out>, context=<optimized out>) at loop_poll.c:474
>>>> 474             res = write(pipe_fds[1], &sig, sizeof(int32_t));
>>>> (gdb) info locals
>>>> sig = 11
>>>> res = <optimized out>
>>>> __func__ = "_handle_real_signal_"
>>>> (gdb) f 4
>>>> #4  0x00007f038c220a3d in schedwrk_processor (context=<optimized out>) at sync.c:551
>>>> 551             my_service_list[my_processing_idx].sync_init (my_trans_list,
>>>> (gdb) p my_processing_idx
>>>> $31 = 3
>>>> (gdb) p my_service_list[3]
>>>> $32 = {service_id = 0, sync_init = 0x0, sync_abort = 0x0, sync_process = 0x0, sync_activate = 0x0, state = PROCESS, name = '\000' <repeats 127 times>}
>>>>
>>>> So it seems corosync is dead-looping in the segfault handler.
>>>> I have not found any related changelog entry in the release notes after 2.3.4.
>>>>
>>>> Can anyone help please?
>>>
>>> Yep. It looks like (for some reason) the signal pipe was not processed
>>> and libqb's _handle_real_signal_ is looping. Corosync really cannot do
>>> anything about it. It looks like a regular libqb bug, so even you can't
>>> do anything about it either. CCing Chrissie so she is aware.
>>>
>>
>> Yes, it seems that some corosync SEGVs trigger this obscure bug in
>> libqb. I've chased a few possible causes and none have been fruitful.
>>
>> If you get this then corosync has crashed, and this other bug is
>> masking the actual diagnostics - I know, helpful :/
>
> This particularly resembles a recent discovery in corosync -- the
> segfault handler is not expecting a nested segfault, leading to a
> tight loop on signal processing and, due to its priority, eating
> up the CPU:
> https://github.com/corosync/corosync/issues/159
>
> Shifting towards the possible solution blueprint side in libqb:
> https://github.com/ClusterLabs/libqb/pull/245
>
> We could do better if we knew which signal in particular is the
> culprit in this case -- was it indeed SIGSEGV (I don't actually
> think so, but it's hard to say)?
Ah, I missed "sig = 11" above, so indeed SIGSEGV.

Anyway, there is a chance that libqb v1.0.1 (containing this PR:
https://github.com/ClusterLabs/libqb/pull/230) alleviates the issue.

I am still missing some parts of the picture.

-- 
Jan (Poki)
_______________________________________________
Users mailing list: Users@clusterlabs.org
http://lists.clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org