Momcilo,
On Wed, Aug 7, 2019 at 1:00 PM Klaus Wenninger <kwenn...@redhat.com> wrote:

On 8/7/19 12:26 PM, Momcilo Medic wrote:

We have a three-node cluster that is set up to stop resources on lost quorum.
Failure handling (network going down) works properly, but recovery
doesn't seem to work.

What do you mean by 'network going down'?
Loss of link? Does the IP persist on the interface
in that case?


Yes, we simulate a faulty cable by turning the switch ports down and up.
In that case, the IP does not persist on the interface.
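(For reference, a quick way to confirm that behaviour is to watch the
kernel's address events while the port is toggled; the interface name
below is only a placeholder, not taken from our setup:)

```
# Watch address add/remove events while the switch port is toggled; if the
# address disappears when the link goes down, corosync loses its bind address.
# "eth0" is just an example name for the cluster interface.
ip monitor address | grep eth0

# Or simply check afterwards whether the address is still configured:
ip -o -4 addr show dev eth0
```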

What corosync version do you have? Corosync was really bad at handling ifdown (removal of the IP) until version 3 with knet, which solved the problem completely, and 2.4.5, where it is so-so for udpu (udp is still affected).

The solution is either to upgrade corosync or to configure the system to keep the IP intact.
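For illustration, a rough sketch of both options; the cluster name,
interface name and address below are placeholders, and the exact option
names depend on your network stack and versions:

```
# Option 1: corosync 3.x with the knet transport (corosync.conf, totem section)
totem {
    version: 2
    cluster_name: mycluster      # placeholder name
    transport: knet              # knet copes with link flaps / ifdown
}

# Option 2: keep the IP on the interface across carrier loss, e.g. with a
# sufficiently recent systemd-networkd (.network file); treat the option
# names as an assumption to verify against your systemd version:
[Match]
Name=eth0                        # placeholder interface

[Network]
Address=192.0.2.10/24            # placeholder address
ConfigureWithoutCarrier=yes
IgnoreCarrierLoss=yes
```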


Honza


That there are issues reconnecting the CPG API sounds strange to me,
starting with the fact that something has to be reconnected at all.
I understood that your nodes stayed up during the network disconnection,
although I would have expected fencing to kick in at least on the nodes
that are part of the non-quorate cluster partition.
A few more words on your scenario (the fencing setup, e.g.) would help
to understand what is going on.


We don't use any fencing mechanisms; we rely on quorum to run the services.
In more detail, we run a three-node Linbit LINSTOR storage setup that is
hyperconverged, meaning we run the clustered storage on the virtualization
hypervisors.

We use pcs in order to run the linstor-controller service in
high-availability mode.
The policy for no quorum is to stop the resources (sketched below).
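Roughly, the relevant pieces look like this; a sketch only, as the exact
resource agent and options in our setup may differ:

```
# Stop all resources when the partition loses quorum
pcs property set no-quorum-policy=stop

# linstor-controller managed as a cluster resource (approximate; assumes a
# systemd-managed linstor-controller unit)
pcs resource create linstor-controller systemd:linstor-controller \
    op monitor interval=30s
```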

In such a hyperconverged setup, we can't fence a node without impact.
It may happen that network instability causes the primary node to no longer
be primary. In that case, we don't want the running VMs to go down with the
ship, as they themselves were not affected.

However, we would like that service to be highly available again once the
network is restored, without any manual action.



Klaus


What happens is that the services crash when we re-enable the network
connection.

From the journal:

```
...
Jul 12 00:27:32 itaftestkvmls02.dc.itaf.eu corosync[9069]: corosync:
totemsrp.c:1328: memb_consensus_agreed: Assertion `token_memb_entries >= 1'
failed.
Jul 12 00:27:33 itaftestkvmls02.dc.itaf.eu attrd[9104]:    error:
Connection to the CPG API failed: Library error (2)
Jul 12 00:27:33 itaftestkvmls02.dc.itaf.eu stonith-ng[9100]:    error:
Connection to the CPG API failed: Library error (2)
Jul 12 00:27:33 itaftestkvmls02.dc.itaf.eu systemd[1]: corosync.service:
Main process exited, code=dumped, status=6/ABRT
Jul 12 00:27:33 itaftestkvmls02.dc.itaf.eu cib[9098]:    error:
Connection to the CPG API failed: Library error (2)
Jul 12 00:27:33 itaftestkvmls02.dc.itaf.eu systemd[1]: corosync.service:
Failed with result 'core-dump'.
Jul 12 00:27:33 itaftestkvmls02.dc.itaf.eu pacemakerd[9087]:    error:
Connection to the CPG API failed: Library error (2)
Jul 12 00:27:33 itaftestkvmls02.dc.itaf.eu systemd[1]: pacemaker.service:
Main process exited, code=exited, status=107/n/a
Jul 12 00:27:33 itaftestkvmls02.dc.itaf.eu systemd[1]: pacemaker.service:
Failed with result 'exit-code'.
Jul 12 00:27:33 itaftestkvmls02.dc.itaf.eu systemd[1]: Stopped Pacemaker
High Availability Cluster Manager.
Jul 12 00:27:33 itaftestkvmls02.dc.itaf.eu lrmd[9102]:  warning:
new_event_notification (9102-9107-7): Bad file descriptor (9)
...
```
Pacemaker's log shows no relevant info.

This is from corosync's log:

```
Jul 12 00:27:33 [9107] itaftestkvmls02.dc.itaf.eu       crmd:     info:
qb_ipcs_us_withdraw:    withdrawing server sockets
Jul 12 00:27:33 [9104] itaftestkvmls02.dc.itaf.eu      attrd:    error:
pcmk_cpg_dispatch:      Connection to the CPG API failed: Library error (2)
Jul 12 00:27:33 [9100] itaftestkvmls02.dc.itaf.eu stonith-ng:    error:
pcmk_cpg_dispatch:      Connection to the CPG API failed: Library error (2)
Jul 12 00:27:33 [9098] itaftestkvmls02.dc.itaf.eu        cib:    error:
pcmk_cpg_dispatch:      Connection to the CPG API failed: Library error (2)
Jul 12 00:27:33 [9087] itaftestkvmls02.dc.itaf.eu pacemakerd:    error:
pcmk_cpg_dispatch:      Connection to the CPG API failed: Library error (2)
Jul 12 00:27:33 [9104] itaftestkvmls02.dc.itaf.eu      attrd:     info:
qb_ipcs_us_withdraw:    withdrawing server sockets
Jul 12 00:27:33 [9087] itaftestkvmls02.dc.itaf.eu pacemakerd:     info:
crm_xml_cleanup:        Cleaning up memory from libxml2
Jul 12 00:27:33 [9107] itaftestkvmls02.dc.itaf.eu       crmd:     info:
crm_xml_cleanup:        Cleaning up memory from libxml2
Jul 12 00:27:33 [9100] itaftestkvmls02.dc.itaf.eu stonith-ng:     info:
qb_ipcs_us_withdraw:    withdrawing server sockets
Jul 12 00:27:33 [9104] itaftestkvmls02.dc.itaf.eu      attrd:     info:
crm_xml_cleanup:        Cleaning up memory from libxml2
Jul 12 00:27:33 [9098] itaftestkvmls02.dc.itaf.eu        cib:     info:
qb_ipcs_us_withdraw:    withdrawing server sockets
Jul 12 00:27:33 [9100] itaftestkvmls02.dc.itaf.eu stonith-ng:     info:
crm_xml_cleanup:        Cleaning up memory from libxml2
Jul 12 00:27:33 [9098] itaftestkvmls02.dc.itaf.eu        cib:     info:
qb_ipcs_us_withdraw:    withdrawing server sockets
Jul 12 00:27:33 [9098] itaftestkvmls02.dc.itaf.eu        cib:     info:
qb_ipcs_us_withdraw:    withdrawing server sockets
Jul 12 00:27:33 [9098] itaftestkvmls02.dc.itaf.eu        cib:     info:
crm_xml_cleanup:        Cleaning up memory from libxml2
Jul 12 00:27:33 [9102] itaftestkvmls02.dc.itaf.eu       lrmd:  warning:
qb_ipcs_event_sendv:    new_event_notification (9102-9107-7): Bad file
descriptor (9)
```

Please let me know if you need any further info; I'll be more than happy
to provide it.

This is always reproducible in our environment:
Ubuntu 18.04.2
corosync 2.4.3-0ubuntu1.1
pcs 0.9.164-1
pacemaker 1.1.18-0ubuntu1.1

Kind regards,
Momo.

_______________________________________________
Manage your subscription: https://lists.clusterlabs.org/mailman/listinfo/users

ClusterLabs home: https://www.clusterlabs.org/