Hi David,

Almost ten years ago you requested more info in a similar case; let's see if we can get further now!
We're running a 6-node Corosync cluster. DLM is started by systemd:

● dlm.service - dlm control daemon
   Loaded: loaded (/lib/systemd/system/dlm.service; enabled)
   Active: active (running) since Thu 2018-08-09 17:13:18 CEST; 33min ago
     Docs: man:dlm_controld
           man:dlm.conf
           man:dlm_stonith
  Process: 3690 ExecStartPre=/sbin/modprobe dlm (code=exited, status=0/SUCCESS)
 Main PID: 3692 (dlm_controld)
   CGroup: /system.slice/dlm.service
           └─3692 /usr/sbin/dlm_controld --foreground -D --enable_fencing=0

All other nodes have cLVM volumes activated, but activation is stuck on this node (in the last step of a rolling cluster reboot):

[  136.729172] dlm: Using TCP for communications
[  136.743935] dlm: clvmd: joining the lockspace group...
[  136.749419] dlm: clvmd: dlm_recover 1
[  136.749485] dlm: clvmd: add member 167773710
[  136.749493] dlm: clvmd: add member 167773709
[  136.749497] dlm: clvmd: add member 167773708
[  136.749499] dlm: clvmd: add member 167773707
[  136.749504] dlm: clvmd: add member 167773706
[  136.749506] dlm: clvmd: add member 167773705
[  136.749519] dlm: connecting to 167773709
[  136.752848] dlm: connecting to 167773708
[  136.752889] dlm: connecting to 167773707
[  136.752918] dlm: connecting to 167773706
[  136.752943] dlm: connecting to 167773705
[  136.768589] dlm: clvmd: dlm_recover_members 6 nodes
[  136.941888] dlm: clvmd: group event done 0 0
[  136.960496] dlm: clvmd: join complete
[  137.019929] device enp5s0f1 entered promiscuous mode
[  137.036637] device enp5s0f0 entered promiscuous mode
[  137.054869] device vhbond entered promiscuous mode
[  137.207059] be2net 0000:05:00.0 enp5s0f0: Link is Up
[  137.252901] be2net 0000:05:00.1 enp5s0f1: Link is Up
[  138.009742] device vlan39 entered promiscuous mode
[  138.151755] device vlan894 entered promiscuous mode
[  153.861395] scsi host1: BM_2032 : Event CXN_KILLED_RST_RCVD[10] received on CID : 9
[  153.886619] connection2:0: detected conn error (1011)
[  364.687306] INFO: task clvmd:5242 blocked for more than 120 seconds.
[  364.708222]       Not tainted 4.9.0-0.bpo.6-amd64 #1 Debian 4.9.88-1+deb9u1~bpo8+1
[  364.733131] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[  364.758896] clvmd           D    0  5242      1 0x00000000
[  364.776934]  ffff98eff5ecfc00 0000000000000000 ffff98fff8543040 ffff98efe9744140
[  364.801322]  ffff98ffff518ec0 ffffa8a64ddc7cb0 ffffffffaf20e973 ffffffffaedde780
[  364.825720]  bd9dfe3d7034cec9 0000000041fef240 fffffa4841fef240 ffff98efe9744140
[  364.850103] Call Trace:
[  364.858140]  [<ffffffffaf20e973>] ? __schedule+0x243/0x6f0
[  364.876181]  [<ffffffffaedde780>] ? alloc_pages_vma+0xb0/0x240
[  364.895364]  [<ffffffffaf20ee52>] ? schedule+0x32/0x80
[  364.912261]  [<ffffffffaf2118ca>] ? rwsem_down_read_failed+0x10a/0x160
[  364.933732]  [<ffffffffaef3f004>] ? call_rwsem_down_read_failed+0x14/0x30
[  364.956059]  [<ffffffffaf2110fc>] ? down_read+0x1c/0x30
[  364.973259]  [<ffffffffc099e5e7>] ? dlm_user_request+0x47/0x200 [dlm]
[  364.994443]  [<ffffffffaede50df>] ? cache_alloc_refill+0x20f/0x2b0
[  365.014773]  [<ffffffffaede5ee2>] ? kmem_cache_alloc_trace+0xc2/0x200
[  365.035962]  [<ffffffffc09aa0d6>] ? device_write+0x5b6/0x7a0 [dlm]
[  365.056290]  [<ffffffffaee079c3>] ? vfs_write+0xb3/0x1a0
[  365.073754]  [<ffffffffaee08e22>] ? SyS_write+0x52/0xc0
[  365.090937]  [<ffffffffaec03bd1>] ? do_syscall_64+0x91/0x1a0
[  365.109559]  [<ffffffffaf2137ce>] ? entry_SYSCALL_64_after_swapgs+0x58/0xc6

Here's the output of "dlm_tool dump" on the stuck node:

131 dlm_controld 4.0.5 started
131 our_nodeid 167773710
131 found /dev/misc/dlm-control minor 58
131 found /dev/misc/dlm-monitor minor 57
131 found /dev/misc/dlm_plock minor 56
131 /sys/kernel/config/dlm/cluster/comms: opendir failed: 2
131 /sys/kernel/config/dlm/cluster/spaces: opendir failed: 2
131 set recover_callbacks 1
131 cmap totem.cluster_name = 'vhbl'
131 set cluster_name vhbl
131 /dev/misc/dlm-monitor fd 10
131 cluster quorum 0 seq 3680 nodes 1
131 cluster node 167773710 added seq 3680
131 set_configfs_node 167773710 10.0.6.14 local 1
131 cpg_join dlm:controld ...
131 setup_cpg_daemon 12
135 dlm:controld ring 167773705:3724 6 memb 167773705 167773706 167773707 167773708 167773709 167773710
135 dlm:controld ring 167773705:3724 6 memb 167773705 167773706 167773707 167773708 167773709 167773710
135 dlm:controld conf 6 1 0 memb 167773705 167773706 167773707 167773708 167773709 167773710 join 167773710 left
135 daemon joined 167773705
135 daemon joined 167773706
135 daemon joined 167773707
135 daemon joined 167773708
135 daemon joined 167773709
135 daemon joined 167773710
135 receive_protocol 167773707 max 3.1.1.0 run 3.1.1.1
135 daemon node 167773707 prot max 0.0.0.0 run 0.0.0.0
135 daemon node 167773707 save max 3.1.1.0 run 3.1.1.1
135 run protocol from nodeid 167773707
135 daemon run 3.1.1 max 3.1.1 kernel run 1.1.1 max 1.1.1
135 plocks 13
135 cluster quorum 1 seq 3724 nodes 6
135 cluster node 167773705 added seq 3724
135 set_configfs_node 167773705 10.0.6.9 local 0
135 cluster node 167773706 added seq 3724
135 set_configfs_node 167773706 10.0.6.10 local 0
135 cluster node 167773707 added seq 3724
135 set_configfs_node 167773707 10.0.6.11 local 0
135 cluster node 167773708 added seq 3724
135 set_configfs_node 167773708 10.0.6.12 local 0
135 cluster node 167773709 added seq 3724
135 set_configfs_node 167773709 10.0.6.13 local 0
135 receive_protocol 167773708 max 3.1.1.0 run 3.1.1.1
135 daemon node 167773708 prot max 0.0.0.0 run 0.0.0.0
135 daemon node 167773708 save max 3.1.1.0 run 3.1.1.1
135 fence_in_progress_unknown 0 startup
135 receive_protocol 167773709 max 3.1.1.0 run 3.1.1.1
135 daemon node 167773709 prot max 0.0.0.0 run 0.0.0.0
135 daemon node 167773709 save max 3.1.1.0 run 3.1.1.1
135 receive_protocol 167773710 max 3.1.1.0 run 0.0.0.0
135 daemon node 167773710 prot max 0.0.0.0 run 0.0.0.0
135 daemon node 167773710 save max 3.1.1.0 run 0.0.0.0
135 receive_protocol 167773710 max 3.1.1.0 run 3.1.1.0
135 daemon node 167773710 prot max 3.1.1.0 run 0.0.0.0
135 daemon node 167773710 save max 3.1.1.0 run 3.1.1.0
135 receive_protocol 167773705 max 3.1.1.0 run 3.1.1.1
135 daemon node 167773705 prot max 0.0.0.0 run 0.0.0.0
135 daemon node 167773705 save max 3.1.1.0 run 3.1.1.1
135 receive_protocol 167773706 max 3.1.1.0 run 3.1.1.1
135 daemon node 167773706 prot max 0.0.0.0 run 0.0.0.0
135 daemon node 167773706 save max 3.1.1.0 run 3.1.1.1
135 uevent: add@/kernel/dlm/clvmd
135 kernel: add@ clvmd
135 uevent: online@/kernel/dlm/clvmd
135 kernel: online@ clvmd
135 clvmd cpg_join dlm:ls:clvmd ...
135 dlm:ls:clvmd conf 6 1 0 memb 167773705 167773706 167773707 167773708 167773709 167773710 join 167773710 left
135 clvmd add_change cg 1 joined nodeid 167773710
135 clvmd add_change cg 1 we joined
135 clvmd add_change cg 1 counts member 6 joined 1 remove 0 failed 0
135 clvmd check_ringid cluster 3724 cpg 0:0
135 dlm:ls:clvmd ring 167773705:3724 6 memb 167773705 167773706 167773707 167773708 167773709 167773710
135 clvmd check_ringid done cluster 3724 cpg 167773705:3724
135 clvmd check_fencing disabled
135 clvmd send_start 167773710:1 counts 0 6 1 0 0
135 clvmd wait_messages cg 1 need 6 of 6
135 clvmd receive_start 167773709:8 len 96
135 clvmd match_change 167773709:8 matches cg 1
135 clvmd wait_messages cg 1 need 5 of 6
135 clvmd receive_start 167773710:1 len 96
135 clvmd match_change 167773710:1 matches cg 1
135 clvmd wait_messages cg 1 need 4 of 6
135 clvmd receive_start 167773705:12 len 96
135 clvmd match_change 167773705:12 matches cg 1
135 clvmd wait_messages cg 1 need 3 of 6
135 clvmd receive_start 167773706:14 len 96
135 clvmd match_change 167773706:14 matches cg 1
135 clvmd wait_messages cg 1 need 2 of 6
135 clvmd receive_start 167773707:5 len 96
135 clvmd match_change 167773707:5 matches cg 1
135 clvmd wait_messages cg 1 need 1 of 6
135 clvmd receive_start 167773708:10 len 96
135 clvmd match_change 167773708:10 matches cg 1
135 clvmd wait_messages cg 1 got all 6
135 clvmd start_kernel cg 1 member_count 6
135 write "1090842362" to "/sys/kernel/dlm/clvmd/id"
135 set_members mkdir "/sys/kernel/config/dlm/cluster/spaces/clvmd/nodes/167773705"
135 set_members mkdir "/sys/kernel/config/dlm/cluster/spaces/clvmd/nodes/167773706"
135 set_members mkdir "/sys/kernel/config/dlm/cluster/spaces/clvmd/nodes/167773707"
135 set_members mkdir "/sys/kernel/config/dlm/cluster/spaces/clvmd/nodes/167773708"
135 set_members mkdir "/sys/kernel/config/dlm/cluster/spaces/clvmd/nodes/167773709"
135 set_members mkdir "/sys/kernel/config/dlm/cluster/spaces/clvmd/nodes/167773710"
135 write "1" to "/sys/kernel/dlm/clvmd/control"
135 write "0" to "/sys/kernel/dlm/clvmd/event_done"
135 clvmd prepare_plocks
135 clvmd set_plock_data_node from 0 to 167773705
135 clvmd save_plocks start
135 uevent: add@/devices/virtual/misc/dlm_clvmd

I've got all the output with -D from all nodes, and I can probably afford to keep the cluster in the current degraded state for a couple of days if deemed useful. If you can't look into this in the short term, please tell me what other diagnostics to collect before I reset the stuck node. For a start I attached the dump output from another node.
-- 
Thanks,
Feri

_______________________________________________
Users mailing list: Users@clusterlabs.org
https://lists.clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org
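P.S. Unless you suggest otherwise, this is roughly what I plan to capture on the stuck node before resetting it. The helper function and output layout are just my own sketch (the `dlm_tool` subcommands should exist in dlm 4.x, but please correct me if there is something more useful to grab):

```shell
# Sketch of a pre-reset diagnostics grab; collect_dlm_diag and the
# output paths are my own invention, not an established tool.
collect_dlm_diag() {
    outdir=${1:-/tmp/dlm-diag}
    mkdir -p "$outdir"
    # dlm_controld's view: daemon state, lockspaces, debug buffer, locks
    dlm_tool status          > "$outdir/status.txt"    2>&1
    dlm_tool ls              > "$outdir/ls.txt"        2>&1
    dlm_tool dump            > "$outdir/dump.txt"      2>&1
    dlm_tool lockdebug clvmd > "$outdir/lockdebug.txt" 2>&1
    dlm_tool plocks clvmd    > "$outdir/plocks.txt"    2>&1
    # kernel-side lockspace state, if debugfs is mounted
    cp -a /sys/kernel/debug/dlm "$outdir/debugfs" 2>/dev/null
    # stack of the stuck clvmd, plus all blocked tasks via sysrq-w
    cat "/proc/$(pidof clvmd)/stack" > "$outdir/clvmd-stack.txt" 2>&1
    echo w > /proc/sysrq-trigger
    dmesg > "$outdir/dmesg.txt"
}
```

I'd run it as `collect_dlm_diag /tmp/dlm-diag` and attach the resulting directory.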