Re: [ClusterLabs] Restoring network connection breaks cluster services

2019-08-07 Thread Momcilo Medic
On Wed, Aug 7, 2019 at 1:00 PM Klaus Wenninger  wrote:

> On 8/7/19 12:26 PM, Momcilo Medic wrote:
>
> We have a three-node cluster that is set up to stop resources on lost quorum.
> Failure handling (the network going down) is done properly, but recovery
> doesn't seem to work.
>
> What do you mean by 'network going down'?
> Loss of link? Does the IP persist on the interface
> in that case?
>

Yes, we simulate a faulty cable by turning the switch ports down and up.
In such a case, the IP does not persist on the interface.
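
For completeness, the same kind of failure can be approximated from the host
side as well (a rough sketch only, assuming the cluster interface is called
eth0; in our tests we actually toggle the switch ports):

```
# Simulate loss of link on the cluster interface (hypothetical interface name)
ip link set dev eth0 down
sleep 60
# Restore the link
ip link set dev eth0 up
```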


> That there are issues reconnecting the CPG API
> sounds strange to me. Already the fact that
> something has to be reconnected. I got it
> that your nodes were persistently up during the
> network-disconnection. Although I would have
> expected fencing to kick in at least on those
> which are part of the non-quorate cluster-partition.
> Maybe a few words more on your scenario
> (fencing setup e.g.) would help to understand what
> is going on.
>

We don't use any fencing mechanisms; we rely on quorum to run the services.
In more detail, we run a three-node Linbit LINSTOR storage setup that is
hyperconverged, meaning we run the clustered storage on the virtualization
hypervisors themselves.

We use pcs in order to run the linstor-controller service in high-availability
mode.
The no-quorum policy is to stop the resources.
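
Roughly, that configuration looks like the following (a minimal sketch only;
the resource name, agent and monitor interval here are assumptions, not our
exact setup):

```
# Stop all resources when this partition loses quorum
pcs property set no-quorum-policy=stop

# Manage the LINSTOR controller as a systemd resource (hypothetical parameters)
pcs resource create linstor-controller systemd:linstor-controller \
    op monitor interval=30s
```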

In such a hyperconverged setup, we can't fence a node without impact.
It may happen that network instability causes the primary node to no longer
be primary.
In that case, we don't want the running VMs to go down with the ship, as they
were not affected themselves.

However, we would like the service to become highly available again once the
network is restored, without any manual intervention.


>
> Klaus
>
>
> What happens is that the services crash when we re-enable the network connection.
>
> From journal:
>
> ```
> ...
> Jul 12 00:27:32 itaftestkvmls02.dc.itaf.eu corosync[9069]: corosync:
> totemsrp.c:1328: memb_consensus_agreed: Assertion `token_memb_entries >= 1'
> failed.
> Jul 12 00:27:33 itaftestkvmls02.dc.itaf.eu attrd[9104]: error:
> Connection to the CPG API failed: Library error (2)
> Jul 12 00:27:33 itaftestkvmls02.dc.itaf.eu stonith-ng[9100]: error:
> Connection to the CPG API failed: Library error (2)
> Jul 12 00:27:33 itaftestkvmls02.dc.itaf.eu systemd[1]: corosync.service:
> Main process exited, code=dumped, status=6/ABRT
> Jul 12 00:27:33 itaftestkvmls02.dc.itaf.eu cib[9098]: error:
> Connection to the CPG API failed: Library error (2)
> Jul 12 00:27:33 itaftestkvmls02.dc.itaf.eu systemd[1]: corosync.service:
> Failed with result 'core-dump'.
> Jul 12 00:27:33 itaftestkvmls02.dc.itaf.eu pacemakerd[9087]: error:
> Connection to the CPG API failed: Library error (2)
> Jul 12 00:27:33 itaftestkvmls02.dc.itaf.eu systemd[1]: pacemaker.service:
> Main process exited, code=exited, status=107/n/a
> Jul 12 00:27:33 itaftestkvmls02.dc.itaf.eu systemd[1]: pacemaker.service:
> Failed with result 'exit-code'.
> Jul 12 00:27:33 itaftestkvmls02.dc.itaf.eu systemd[1]: Stopped Pacemaker
> High Availability Cluster Manager.
> Jul 12 00:27:33 itaftestkvmls02.dc.itaf.eu lrmd[9102]:  warning:
> new_event_notification (9102-9107-7): Bad file descriptor (9)
> ...
> ```
> Pacemaker's log shows no relevant info.
>
> This is from corosync's log:
>
> ```
> Jul 12 00:27:33 [9107] itaftestkvmls02.dc.itaf.eu   crmd: info:
> qb_ipcs_us_withdraw:withdrawing server sockets
> Jul 12 00:27:33 [9104] itaftestkvmls02.dc.itaf.eu  attrd: error:
> pcmk_cpg_dispatch:  Connection to the CPG API failed: Library error (2)
> Jul 12 00:27:33 [9100] itaftestkvmls02.dc.itaf.eu stonith-ng: error:
> pcmk_cpg_dispatch:  Connection to the CPG API failed: Library error (2)
> Jul 12 00:27:33 [9098] itaftestkvmls02.dc.itaf.eu    cib: error:
> pcmk_cpg_dispatch:  Connection to the CPG API failed: Library error (2)
> Jul 12 00:27:33 [9087] itaftestkvmls02.dc.itaf.eu pacemakerd: error:
> pcmk_cpg_dispatch:  Connection to the CPG API failed: Library error (2)
> Jul 12 00:27:33 [9104] itaftestkvmls02.dc.itaf.eu  attrd: info:
> qb_ipcs_us_withdraw:withdrawing server sockets
> Jul 12 00:27:33 [9087] itaftestkvmls02.dc.itaf.eu pacemakerd: info:
> crm_xml_cleanup:Cleaning up memory from libxml2
> Jul 12 00:27:33 [9107] itaftestkvmls02.dc.itaf.eu   crmd: info:
> crm_xml_cleanup:Cleaning up memory from libxml2
> Jul 12 00:27:33 [9100] itaftestkvmls02.dc.itaf.eu stonith-ng: info:
> qb_ipcs_us_withdraw:withdrawing server sockets
> Jul 12 00:27:33 [9104] itaftestkvmls02.dc.itaf.eu  attrd: info:
> crm_xml_cleanup:Cleaning up memory from libxml2
> Jul 12 00:27:33 [9098] itaftestkvmls02.dc.itaf.eu    cib: info:
> qb_ipcs_us_withdraw:withdrawing server sockets
> ```

[ClusterLabs] Restoring network connection breaks cluster services

2019-08-07 Thread Momcilo Medic
We have a three-node cluster that is set up to stop resources on lost quorum.
Failure handling (the network going down) is done properly, but recovery
doesn't seem to work.

What happens is that the services crash when we re-enable the network connection.

From journal:

```
...
Jul 12 00:27:32 itaftestkvmls02.dc.itaf.eu corosync[9069]: corosync:
totemsrp.c:1328: memb_consensus_agreed: Assertion `token_memb_entries >= 1'
failed.
Jul 12 00:27:33 itaftestkvmls02.dc.itaf.eu attrd[9104]: error:
Connection to the CPG API failed: Library error (2)
Jul 12 00:27:33 itaftestkvmls02.dc.itaf.eu stonith-ng[9100]: error:
Connection to the CPG API failed: Library error (2)
Jul 12 00:27:33 itaftestkvmls02.dc.itaf.eu systemd[1]: corosync.service:
Main process exited, code=dumped, status=6/ABRT
Jul 12 00:27:33 itaftestkvmls02.dc.itaf.eu cib[9098]: error: Connection
to the CPG API failed: Library error (2)
Jul 12 00:27:33 itaftestkvmls02.dc.itaf.eu systemd[1]: corosync.service:
Failed with result 'core-dump'.
Jul 12 00:27:33 itaftestkvmls02.dc.itaf.eu pacemakerd[9087]: error:
Connection to the CPG API failed: Library error (2)
Jul 12 00:27:33 itaftestkvmls02.dc.itaf.eu systemd[1]: pacemaker.service:
Main process exited, code=exited, status=107/n/a
Jul 12 00:27:33 itaftestkvmls02.dc.itaf.eu systemd[1]: pacemaker.service:
Failed with result 'exit-code'.
Jul 12 00:27:33 itaftestkvmls02.dc.itaf.eu systemd[1]: Stopped Pacemaker
High Availability Cluster Manager.
Jul 12 00:27:33 itaftestkvmls02.dc.itaf.eu lrmd[9102]:  warning:
new_event_notification (9102-9107-7): Bad file descriptor (9)
...
```
Pacemaker's log shows no relevant info.

This is from corosync's log:

```
Jul 12 00:27:33 [9107] itaftestkvmls02.dc.itaf.eu   crmd: info:
qb_ipcs_us_withdraw:withdrawing server sockets
Jul 12 00:27:33 [9104] itaftestkvmls02.dc.itaf.eu  attrd: error:
pcmk_cpg_dispatch:  Connection to the CPG API failed: Library error (2)
Jul 12 00:27:33 [9100] itaftestkvmls02.dc.itaf.eu stonith-ng: error:
pcmk_cpg_dispatch:  Connection to the CPG API failed: Library error (2)
Jul 12 00:27:33 [9098] itaftestkvmls02.dc.itaf.eu    cib: error:
pcmk_cpg_dispatch:  Connection to the CPG API failed: Library error (2)
Jul 12 00:27:33 [9087] itaftestkvmls02.dc.itaf.eu pacemakerd: error:
pcmk_cpg_dispatch:  Connection to the CPG API failed: Library error (2)
Jul 12 00:27:33 [9104] itaftestkvmls02.dc.itaf.eu  attrd: info:
qb_ipcs_us_withdraw:withdrawing server sockets
Jul 12 00:27:33 [9087] itaftestkvmls02.dc.itaf.eu pacemakerd: info:
crm_xml_cleanup:Cleaning up memory from libxml2
Jul 12 00:27:33 [9107] itaftestkvmls02.dc.itaf.eu   crmd: info:
crm_xml_cleanup:Cleaning up memory from libxml2
Jul 12 00:27:33 [9100] itaftestkvmls02.dc.itaf.eu stonith-ng: info:
qb_ipcs_us_withdraw:withdrawing server sockets
Jul 12 00:27:33 [9104] itaftestkvmls02.dc.itaf.eu  attrd: info:
crm_xml_cleanup:Cleaning up memory from libxml2
Jul 12 00:27:33 [9098] itaftestkvmls02.dc.itaf.eu    cib: info:
qb_ipcs_us_withdraw:withdrawing server sockets
Jul 12 00:27:33 [9100] itaftestkvmls02.dc.itaf.eu stonith-ng: info:
crm_xml_cleanup:Cleaning up memory from libxml2
Jul 12 00:27:33 [9098] itaftestkvmls02.dc.itaf.eu    cib: info:
qb_ipcs_us_withdraw:withdrawing server sockets
Jul 12 00:27:33 [9098] itaftestkvmls02.dc.itaf.eu    cib: info:
qb_ipcs_us_withdraw:withdrawing server sockets
Jul 12 00:27:33 [9098] itaftestkvmls02.dc.itaf.eu    cib: info:
crm_xml_cleanup:Cleaning up memory from libxml2
Jul 12 00:27:33 [9102] itaftestkvmls02.dc.itaf.eu   lrmd:  warning:
qb_ipcs_event_sendv:new_event_notification (9102-9107-7): Bad file
descriptor (9)
```

Please let me know if you need any further info, I'll be more than happy to
provide it.

This is always reproducible in our environment:
Ubuntu 18.04.2
corosync 2.4.3-0ubuntu1.1
pcs 0.9.164-1
pacemaker 1.1.18-0ubuntu1.1

Kind regards,
Momo.

Re: [ClusterLabs] dlm reason for leaving the cluster changes when stopping gfs2-utils service

2016-04-07 Thread Momcilo Medic
>> At least you've got interactive debugging ability then.  So try to find
>> out why the Corosync membership broke down.  The output of
>> corosync-quorumtool and corosync-cpgtool might help.  Also try pinging
>> the Corosync ring0 addresses between the nodes.

Dear Feri and all,

Just to follow up with feedback, since I've managed to track down this issue.
It was due to Open vSwitch, which we have in our environment.
It was not using the SysV init system but Upstart.

I identified that it brought our network interface down as soon as it
hit runlevel 6.
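
To confirm that, something along these lines shows the job's stop conditions
(a sketch; the Upstart job name is an assumption for the Ubuntu packaging):

```
# Show start/stop conditions of the Upstart job (hypothetical job name)
initctl show-config openvswitch-switch
# Check whether the job is still running
initctl list | grep -i openvswitch
```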

The advice to ping the ring0 addresses was really spot on.

Kind regards,
Momcilo.



Re: [ClusterLabs] dlm reason for leaving the cluster changes when stopping gfs2-utils service

2016-03-24 Thread Momcilo Medic
On Wed, Mar 23, 2016 at 6:33 PM, Ferenc Wágner <wf...@niif.hu> wrote:
> (Please post only to the list, or at least keep it amongst the Cc-s.)
>
> Momcilo Medic <fedorau...@fedoraproject.org> writes:
>
>> On Wed, Mar 23, 2016 at 1:56 PM, Ferenc Wágner <wf...@niif.hu> wrote:
>>> Momcilo Medic <fedorau...@fedoraproject.org> writes:
>>>
>>>> I have three hosts setup in my test environment.
>>>> They each have two connections to the SAN which has GFS2 on it.
>>>>
>>>> Everything works like a charm, except when I reboot a host.
>>>> Once it tries to stop gfs2-utils service it will just hang.
>>>
>>> Are you sure the OS reboot sequence does not stop the network or
>>> corosync before GFS and DLM?
>>
>> I specifically configured services to start in this order:
>> Corosync - DLM - GFS2-utils
>> and to shutdown in this order:
>> GFS2-utils - DLM - Corosync.
>>
>> I've accomplished this with:
>>  update-rc.d -f corosync remove
>>  update-rc.d -f corosync-notifyd remove
>>  update-rc.d -f dlm remove
>>  update-rc.d -f gfs2-utils remove
>>  update-rc.d -f xendomains remove
>>  update-rc.d corosync start 25 2 3 4 5 . stop 35 0 1 6 .
>>  update-rc.d corosync-notifyd start 25 2 3 4 5 . stop 35 0 1 6 .
>>  update-rc.d dlm start 30 2 3 4 5 . stop 30 0 1 6 .
>>  update-rc.d gfs2-utils start 35 2 3 4 5 . stop 25 0 1 6 .
>>  update-rc.d xendomains start 40 2 3 4 5 . stop 20 0 1 6 .
>
> I don't know your OS, the above may or may not work.
>
>> Also, the moment I was capturing logs, corosync and dlm were not
>> running as services, but in foreground debugging mode.
>> SSH connection did not break until I powered down the host so network
>> is not stopped either.
>
> At least you've got interactive debugging ability then.  So try to find
> out why the Corosync membership broke down.  The output of
> corosync-quorumtool and corosync-cpgtool might help.  Also try pinging
> the Corosync ring0 addresses between the nodes.

Dear Feri,

Sorry for leaving the lists out of the reply, it was a hasty mistake :)
Just so I put all the information out there: I am using Ubuntu 14.04
across all hosts.

I've attached debugging logs in my first post. I cannot figure out
what the key info there is.
Today, I'll try to use the tools you mentioned to see their output before
and during the issue.
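
For reference, these are the checks in question (a sketch; the addresses are
placeholders for the other nodes' ring0 addresses):

```
# Quorum state and membership as corosync sees it
corosync-quorumtool -s
# CPG groups and their members
corosync-cpgtool
# Reachability of the other nodes over ring0 (placeholder addresses)
ping -c 3 192.168.1.2
ping -c 3 192.168.1.3
```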

Kind regards,
Momcilo "Momo" Medic.
(fedorauser)



Re: [ClusterLabs] GFS2 with Pacemaker, Corosync on Ubuntu 14.04

2016-01-25 Thread Momcilo Medic


>> 769897 cannot find device /dev/misc/dlm-control with minor 52
>
>
> Check that dlm's udev rules are installed in the location (lib/udev vs
> /usr/lib/udev) appropriate for your system. That changed recently in the
> upstream dlm.
>



Thank you for your reply, and sorry for the late response.
Rebooting the server and mounting configfs made everything work.

Now DLM loads, and the devices and symlinks to them are created correctly.
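
For anyone hitting the same thing, the two pieces were roughly as follows (a
sketch; the udev rule is only an illustration of the idea, the real rules ship
with the dlm package):

```
# configfs must be mounted so dlm_controld can populate /sys/kernel/config/dlm
mount -t configfs none /sys/kernel/config

# Illustrative udev rules creating the /dev/misc/* symlinks dlm_controld expects,
# e.g. in /etc/udev/rules.d/60-dlm.rules (hypothetical file name):
# KERNEL=="dlm-control", SYMLINK+="misc/dlm-control"
# KERNEL=="dlm-monitor", SYMLINK+="misc/dlm-monitor"
# KERNEL=="dlm_plock",   SYMLINK+="misc/dlm_plock"
```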

Kind regards,
Momcilo 'Momo' Medic.
(fedorauser)



[ClusterLabs] GFS2 with Pacemaker, Corosync on Ubuntu 14.04

2016-01-20 Thread Momcilo Medic
Dear all,

I am trying to set up GFS2 on two Ubuntu 14.04 servers.
Every guide I can find online is for 12.04 and uses the cman package, which
was abandoned in 13.10.

So, I tried using Pacemaker with Corosync as instructed in your guide [1].
In this guide pcs is used, which is not available in Ubuntu, so I am
translating the commands to crmsh.
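
For example, a typical DLM resource definition translates roughly like this
(a sketch only; the exact options in the guide may differ):

```
# pcs, as shown in guides of this kind:
pcs resource create dlm ocf:pacemaker:controld op monitor interval=60s \
    clone interleave=true ordered=true

# crmsh equivalent (roughly):
crm configure primitive dlm ocf:pacemaker:controld op monitor interval=60s
crm configure clone dlm-clone dlm meta interleave=true ordered=true
```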

I installed everything and noticed bad packaging of DLM, which has wrong
init scripts. I reported [2] this bug to Ubuntu.
This might also be why DLM cannot find the /dev/misc/dlm-control device
(which is actually located at /dev/dlm-control).

This is the output that I am getting:
# dlm_controld -D
769887 dlm_controld 4.0.1 started
769887 our_nodeid 739311650
769897 cannot find device /dev/misc/dlm-control with minor 52
769897 shutdown
769897 /sys/kernel/config/dlm/cluster/comms: opendir failed: 2
769897 /sys/kernel/config/dlm/cluster/spaces: opendir failed: 2

# lsmod | grep dlm
dlm   156389  1 gfs2
sctp  247248  3 dlm
configfs   35358  2 dlm

# ls -hal /dev/dlm*
crw--- 1 root root 10, 52 Jan 14 17:55 /dev/dlm-control
crw--- 1 root root 10, 51 Jan 14 17:55 /dev/dlm-monitor
crw--- 1 root root 10, 50 Jan 14 17:55 /dev/dlm_plock

I checked man dlm, man dlm.conf, and man dlm_controld, but didn't find an
option anywhere to specify the device location.

Any help would be highly appreciated.

[1] 
http://clusterlabs.org/doc/en-US/Pacemaker/1.1-crmsh/html-single/Clusters_from_Scratch/index.html#_install_the_cluster_software
[2] https://bugs.launchpad.net/ubuntu/+source/dlm/+bug/1535242

Kind regards,
Momcilo 'Momo' Medic.
(fedorauser)
