Re: [ClusterLabs] Restoring network connection breaks cluster services
On Wed, Aug 7, 2019 at 1:00 PM Klaus Wenninger wrote:
> On 8/7/19 12:26 PM, Momcilo Medic wrote:
> > We have a three-node cluster that is set up to stop resources on lost
> > quorum. Failure (network going down) handling is done properly, but
> > recovery doesn't seem to work.
>
> What do you mean by 'network going down'?
> Loss of link? Does the IP persist on the interface
> in that case?

Yes, we simulate a faulty cable by turning the switch ports down and up.
In such a case, the IP does not persist on the interface.

> That there are issues reconnecting the CPG API
> sounds strange to me. Already the fact that
> something has to be reconnected. I got it
> that your nodes were persistently up during the
> network disconnection. Although I would have
> expected fencing to kick in, at least on those
> which are part of the non-quorate cluster partition.
> Maybe a few more words on your scenario
> (fencing setup e.g.) would help to understand what
> is going on.

We don't use any fencing mechanisms; we rely on quorum to run the services.

In more detail, we run a three-node Linbit LINSTOR storage setup that is hyperconverged, meaning we run the clustered storage on the virtualization hypervisors. We use pcs in order to have the linstor-controller service in high-availability mode. The policy for no quorum is to stop the resources.

In such a hyperconverged setup, we can't fence a node without impact. It may happen that network instability causes the primary node to no longer be primary. In that case, we don't want running VMs to go down with the ship, as there was no impact for them. However, we would like to have high availability of that service upon network restoration, without manual actions.

> Klaus

> > What happens is, services crash when we re-enable the network connection.
> >
> > From journal:
> >
> > ```
> > ...
> > Jul 12 00:27:32 itaftestkvmls02.dc.itaf.eu corosync[9069]: corosync: totemsrp.c:1328: memb_consensus_agreed: Assertion `token_memb_entries >= 1' failed.
> > Jul 12 00:27:33 itaftestkvmls02.dc.itaf.eu attrd[9104]: error: Connection to the CPG API failed: Library error (2)
> > Jul 12 00:27:33 itaftestkvmls02.dc.itaf.eu stonith-ng[9100]: error: Connection to the CPG API failed: Library error (2)
> > Jul 12 00:27:33 itaftestkvmls02.dc.itaf.eu systemd[1]: corosync.service: Main process exited, code=dumped, status=6/ABRT
> > Jul 12 00:27:33 itaftestkvmls02.dc.itaf.eu cib[9098]: error: Connection to the CPG API failed: Library error (2)
> > Jul 12 00:27:33 itaftestkvmls02.dc.itaf.eu systemd[1]: corosync.service: Failed with result 'core-dump'.
> > Jul 12 00:27:33 itaftestkvmls02.dc.itaf.eu pacemakerd[9087]: error: Connection to the CPG API failed: Library error (2)
> > Jul 12 00:27:33 itaftestkvmls02.dc.itaf.eu systemd[1]: pacemaker.service: Main process exited, code=exited, status=107/n/a
> > Jul 12 00:27:33 itaftestkvmls02.dc.itaf.eu systemd[1]: pacemaker.service: Failed with result 'exit-code'.
> > Jul 12 00:27:33 itaftestkvmls02.dc.itaf.eu systemd[1]: Stopped Pacemaker High Availability Cluster Manager.
> > Jul 12 00:27:33 itaftestkvmls02.dc.itaf.eu lrmd[9102]: warning: new_event_notification (9102-9107-7): Bad file descriptor (9)
> > ...
> > ```
> > Pacemaker's log shows no relevant info.
> > This is from corosync's log:
> >
> > ```
> > Jul 12 00:27:33 [9107] itaftestkvmls02.dc.itaf.eu crmd: info: qb_ipcs_us_withdraw: withdrawing server sockets
> > Jul 12 00:27:33 [9104] itaftestkvmls02.dc.itaf.eu attrd: error: pcmk_cpg_dispatch: Connection to the CPG API failed: Library error (2)
> > Jul 12 00:27:33 [9100] itaftestkvmls02.dc.itaf.eu stonith-ng: error: pcmk_cpg_dispatch: Connection to the CPG API failed: Library error (2)
> > Jul 12 00:27:33 [9098] itaftestkvmls02.dc.itaf.eu cib: error: pcmk_cpg_dispatch: Connection to the CPG API failed: Library error (2)
> > Jul 12 00:27:33 [9087] itaftestkvmls02.dc.itaf.eu pacemakerd: error: pcmk_cpg_dispatch: Connection to the CPG API failed: Library error (2)
> > Jul 12 00:27:33 [9104] itaftestkvmls02.dc.itaf.eu attrd: info: qb_ipcs_us_withdraw: withdrawing server sockets
> > Jul 12 00:27:33 [9087] itaftestkvmls02.dc.itaf.eu pacemakerd: info: crm_xml_cleanup: Cleaning up memory from libxml2
> > Jul 12 00:27:33 [9107] itaftestkvmls02.dc.itaf.eu crmd: info: crm_xml_cleanup: Cleaning up memory from libxml2
> > Jul 12 00:27:33 [9100] itaftestkvmls02.dc.itaf.eu stonith-ng: info: qb_ipcs_us_withdraw: withdrawing server sockets
> > Jul 12 00:27:33 [9104] itaftestkvmls02.dc.itaf.eu attrd: info: crm_xml_cleanup: Cleaning up memory from libxml2
> > Jul 12 00:27:33 [9098] itaftestkvmls02.dc.itaf.eu cib: info: qb_
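The "stop resources on lost quorum" behaviour discussed above corresponds to Pacemaker's `no-quorum-policy` cluster property. A minimal sketch of setting and verifying it with pcs (the tool used in this setup) would be:

```shell
# Sketch: tell Pacemaker to stop all resources in a partition
# that has lost quorum (the policy described in this thread).
pcs property set no-quorum-policy=stop

# Confirm the current value
pcs property list | grep no-quorum-policy
```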
[ClusterLabs] Restoring network connection breaks cluster services
We have a three-node cluster that is set up to stop resources on lost quorum. Failure (network going down) handling is done properly, but recovery doesn't seem to work. What happens is, services crash when we re-enable the network connection.

From journal:

```
...
Jul 12 00:27:32 itaftestkvmls02.dc.itaf.eu corosync[9069]: corosync: totemsrp.c:1328: memb_consensus_agreed: Assertion `token_memb_entries >= 1' failed.
Jul 12 00:27:33 itaftestkvmls02.dc.itaf.eu attrd[9104]: error: Connection to the CPG API failed: Library error (2)
Jul 12 00:27:33 itaftestkvmls02.dc.itaf.eu stonith-ng[9100]: error: Connection to the CPG API failed: Library error (2)
Jul 12 00:27:33 itaftestkvmls02.dc.itaf.eu systemd[1]: corosync.service: Main process exited, code=dumped, status=6/ABRT
Jul 12 00:27:33 itaftestkvmls02.dc.itaf.eu cib[9098]: error: Connection to the CPG API failed: Library error (2)
Jul 12 00:27:33 itaftestkvmls02.dc.itaf.eu systemd[1]: corosync.service: Failed with result 'core-dump'.
Jul 12 00:27:33 itaftestkvmls02.dc.itaf.eu pacemakerd[9087]: error: Connection to the CPG API failed: Library error (2)
Jul 12 00:27:33 itaftestkvmls02.dc.itaf.eu systemd[1]: pacemaker.service: Main process exited, code=exited, status=107/n/a
Jul 12 00:27:33 itaftestkvmls02.dc.itaf.eu systemd[1]: pacemaker.service: Failed with result 'exit-code'.
Jul 12 00:27:33 itaftestkvmls02.dc.itaf.eu systemd[1]: Stopped Pacemaker High Availability Cluster Manager.
Jul 12 00:27:33 itaftestkvmls02.dc.itaf.eu lrmd[9102]: warning: new_event_notification (9102-9107-7): Bad file descriptor (9)
...
```

Pacemaker's log shows no relevant info.
This is from corosync's log:

```
Jul 12 00:27:33 [9107] itaftestkvmls02.dc.itaf.eu crmd: info: qb_ipcs_us_withdraw: withdrawing server sockets
Jul 12 00:27:33 [9104] itaftestkvmls02.dc.itaf.eu attrd: error: pcmk_cpg_dispatch: Connection to the CPG API failed: Library error (2)
Jul 12 00:27:33 [9100] itaftestkvmls02.dc.itaf.eu stonith-ng: error: pcmk_cpg_dispatch: Connection to the CPG API failed: Library error (2)
Jul 12 00:27:33 [9098] itaftestkvmls02.dc.itaf.eu cib: error: pcmk_cpg_dispatch: Connection to the CPG API failed: Library error (2)
Jul 12 00:27:33 [9087] itaftestkvmls02.dc.itaf.eu pacemakerd: error: pcmk_cpg_dispatch: Connection to the CPG API failed: Library error (2)
Jul 12 00:27:33 [9104] itaftestkvmls02.dc.itaf.eu attrd: info: qb_ipcs_us_withdraw: withdrawing server sockets
Jul 12 00:27:33 [9087] itaftestkvmls02.dc.itaf.eu pacemakerd: info: crm_xml_cleanup: Cleaning up memory from libxml2
Jul 12 00:27:33 [9107] itaftestkvmls02.dc.itaf.eu crmd: info: crm_xml_cleanup: Cleaning up memory from libxml2
Jul 12 00:27:33 [9100] itaftestkvmls02.dc.itaf.eu stonith-ng: info: qb_ipcs_us_withdraw: withdrawing server sockets
Jul 12 00:27:33 [9104] itaftestkvmls02.dc.itaf.eu attrd: info: crm_xml_cleanup: Cleaning up memory from libxml2
Jul 12 00:27:33 [9098] itaftestkvmls02.dc.itaf.eu cib: info: qb_ipcs_us_withdraw: withdrawing server sockets
Jul 12 00:27:33 [9100] itaftestkvmls02.dc.itaf.eu stonith-ng: info: crm_xml_cleanup: Cleaning up memory from libxml2
Jul 12 00:27:33 [9098] itaftestkvmls02.dc.itaf.eu cib: info: qb_ipcs_us_withdraw: withdrawing server sockets
Jul 12 00:27:33 [9098] itaftestkvmls02.dc.itaf.eu cib: info: qb_ipcs_us_withdraw: withdrawing server sockets
Jul 12 00:27:33 [9098] itaftestkvmls02.dc.itaf.eu cib: info: crm_xml_cleanup: Cleaning up memory from libxml2
Jul 12 00:27:33 [9102] itaftestkvmls02.dc.itaf.eu lrmd: warning: qb_ipcs_event_sendv: new_event_notification (9102-9107-7): Bad file descriptor (9)
```

Please let me know if you need any further info, I'll be more than happy to provide it.

This is always reproducible in our environment:
Ubuntu 18.04.2
corosync 2.4.3-0ubuntu1.1
pcs 0.9.164-1
pacemaker 1.1.18-0ubuntu1.1

Kind regards,
Momo.

___
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users
ClusterLabs home: https://www.clusterlabs.org/
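For reference, the environment details listed above can be gathered on each node with something like the following (a sketch assuming Debian/Ubuntu tooling):

```shell
# Distro release and cluster-stack package versions, as listed in
# this report; useful when filing a bug against these components.
lsb_release -ds
dpkg-query -W -f='${Package} ${Version}\n' corosync pcs pacemaker
```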
Re: [ClusterLabs] dlm reason for leaving the cluster changes when stopping gfs2-utils service
>> At least you've got interactive debugging ability then. So try to find
>> out why the Corosync membership broke down. The output of
>> corosync-quorumtool and corosync-cpgtool might help. Also try pinging
>> the Corosync ring0 addresses between the nodes.

Dear Feri and all,

Just following up with feedback, since I've managed to track down this issue. It was due to Open vSwitch, which we have in our environment. It was not using SysV init but Upstart, and I identified that it brought our network interface down as soon as the system hit runlevel 6. The advice to ping the ring0 addresses was really spot on.

Kind regards,
Momcilo.

___
Users mailing list: Users@clusterlabs.org
http://clusterlabs.org/mailman/listinfo/users
Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org
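The membership checks suggested in the quoted advice can be sketched as follows (the ring0 addresses here are placeholders for the actual node IPs):

```shell
# Quorum state and member list as corosync sees them
corosync-quorumtool -s

# Closed process groups (CPG) and the nodes connected to each
corosync-cpgtool

# Reachability of the corosync ring0 addresses (placeholder IPs)
for ip in 10.0.0.1 10.0.0.2 10.0.0.3; do
    ping -c 3 "$ip"
done
```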
Re: [ClusterLabs] dlm reason for leaving the cluster changes when stopping gfs2-utils service
On Wed, Mar 23, 2016 at 6:33 PM, Ferenc Wágner <wf...@niif.hu> wrote:
> (Please post only to the list, or at least keep it amongst the Cc-s.)
>
> Momcilo Medic <fedorau...@fedoraproject.org> writes:
>
>> On Wed, Mar 23, 2016 at 1:56 PM, Ferenc Wágner <wf...@niif.hu> wrote:
>>> Momcilo Medic <fedorau...@fedoraproject.org> writes:
>>>
>>>> I have three hosts set up in my test environment.
>>>> They each have two connections to the SAN which has GFS2 on it.
>>>>
>>>> Everything works like a charm, except when I reboot a host.
>>>> Once it tries to stop the gfs2-utils service, it will just hang.
>>>
>>> Are you sure the OS reboot sequence does not stop the network or
>>> corosync before GFS and DLM?
>>
>> I specifically configured services to start in this order:
>> Corosync - DLM - GFS2-utils
>> and to shut down in this order:
>> GFS2-utils - DLM - Corosync.
>>
>> I've accomplished this with:
>> update-rc.d -f corosync remove
>> update-rc.d -f corosync-notifyd remove
>> update-rc.d -f dlm remove
>> update-rc.d -f gfs2-utils remove
>> update-rc.d -f xendomains remove
>> update-rc.d corosync start 25 2 3 4 5 . stop 35 0 1 6 .
>> update-rc.d corosync-notifyd start 25 2 3 4 5 . stop 35 0 1 6 .
>> update-rc.d dlm start 30 2 3 4 5 . stop 30 0 1 6 .
>> update-rc.d gfs2-utils start 35 2 3 4 5 . stop 25 0 1 6 .
>> update-rc.d xendomains start 40 2 3 4 5 . stop 20 0 1 6 .
>
> I don't know your OS, the above may or may not work.
>
>> Also, the moment I was capturing logs, corosync and dlm were not
>> running as services, but in foreground debugging mode.
>> The SSH connection did not break until I powered down the host, so the
>> network was not stopped either.
>
> At least you've got interactive debugging ability then. So try to find
> out why the Corosync membership broke down. The output of
> corosync-quorumtool and corosync-cpgtool might help. Also try pinging
> the Corosync ring0 addresses between the nodes.
Dear Feri,

Sorry for leaving the lists out of my reply, it was a hasty mistake :)

Just so I put all the information out there: I am using Ubuntu 14.04 across all hosts. I've attached debugging logs in my first post, but I cannot figure out what the key info there is. Today, I'll try to use the tools you mentioned to see their output before and during the issue.

Kind regards,
Momcilo "Momo" Medic.
(fedorauser)
Re: [ClusterLabs] GFS2 with Pacemaker, Corosync on Ubuntu 14.04
>> 769897 cannot find device /dev/misc/dlm-control with minor 52
>
> Check that dlm's udev rules are installed in the location (lib/udev vs
> /usr/lib/udev) appropriate for your system. That changed recently in the
> upstream dlm.

Thank you for your reply, and sorry for the late response. Rebooting the server and mounting configfs made everything work. Now dlm loads, and the devices and symlinks to them are created correctly.

Kind regards,
Momcilo 'Momo' Medic.
(fedorauser)
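The fix reported above, mounting configfs before starting dlm, can be sketched as follows (run as root; paths are the standard Linux locations):

```shell
# dlm_controld expects /sys/kernel/config/dlm to exist, which
# requires configfs to be mounted and the dlm module loaded.
mountpoint -q /sys/kernel/config || mount -t configfs none /sys/kernel/config
modprobe dlm

# The control devices should now be present
ls -l /dev/dlm-control /dev/dlm-monitor /dev/dlm_plock
```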
[ClusterLabs] GFS2 with Pacemaker, Corosync on Ubuntu 14.04
Dear all,

I am trying to set up GFS2 on two Ubuntu 14.04 servers. Every guide I can find online is for 12.04 and uses the cman package, which was abandoned in 13.10. So I tried using Pacemaker with Corosync as instructed in your guide [1]. In that guide pcs is used, which is not available in Ubuntu, so I am translating the commands to crmsh.

I installed everything and noticed bad packaging of DLM, which has wrong init scripts. I reported [2] this bug to Ubuntu. This also might be the reason DLM is not finding the /dev/misc/dlm-control device (which is actually located at /dev/dlm-control).

This is the output that I am getting:

```
# dlm_controld -D
769887 dlm_controld 4.0.1 started
769887 our_nodeid 739311650
769897 cannot find device /dev/misc/dlm-control with minor 52
769897 shutdown
769897 /sys/kernel/config/dlm/cluster/comms: opendir failed: 2
769897 /sys/kernel/config/dlm/cluster/spaces: opendir failed: 2

# lsmod | grep dlm
dlm                   156389  1 gfs2
sctp                  247248  3 dlm
configfs               35358  2 dlm

# ls -hal /dev/dlm*
crw------- 1 root root 10, 52 Jan 14 17:55 /dev/dlm-control
crw------- 1 root root 10, 51 Jan 14 17:55 /dev/dlm-monitor
crw------- 1 root root 10, 50 Jan 14 17:55 /dev/dlm_plock
```

I checked man dlm, man dlm.conf, and man dlm_controld, but didn't find an option anywhere to specify the device location. Any help would be highly appreciated.

[1] http://clusterlabs.org/doc/en-US/Pacemaker/1.1-crmsh/html-single/Clusters_from_Scratch/index.html#_install_the_cluster_software
[2] https://bugs.launchpad.net/ubuntu/+source/dlm/+bug/1535242

Kind regards,
Momcilo 'Momo' Medic.
(fedorauser)
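One possible stop-gap for the path mismatch, not tried in this thread and purely an assumption, is to give dlm_controld the /dev/misc/ paths it expects as symlinks to the device nodes udev actually created:

```shell
# Untested sketch: dlm_controld 4.0.1 here looks for
# /dev/misc/dlm-control, while udev created /dev/dlm-control.
mkdir -p /dev/misc
ln -sf /dev/dlm-control /dev/misc/dlm-control
ln -sf /dev/dlm-monitor /dev/misc/dlm-monitor
ln -sf /dev/dlm_plock   /dev/misc/dlm_plock
```

Fixing the udev rule location, as suggested in the follow-up, would be the cleaner solution; the symlinks would not survive a reboot without a udev rule or rc.local entry recreating them.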