On 06/01/20 11:53 -0600, Ken Gaillot wrote:
> On Fri, 2020-01-03 at 13:23 +0000, S Sathish S wrote:
>> The pacemaker-controld process is getting restarted frequently; the
>> reason for failure is a disconnect from the CIB/Internal Error (or)
>> high CPU on the system. The same has been recorded in our system
>> logs. Please find the pacemaker and corosync versions installed on
>> the system.
>>
>> Kindly let us know why we are getting the below error on the system.
>>
>> corosync-2.4.4 -> https://github.com/corosync/corosync/tree/v2.4.4
>> pacemaker-2.0.2 -> https://github.com/ClusterLabs/pacemaker/tree/Pacemaker-2.0.2
The libqb version is missing (to be explained later on).

>> [root@vmc0621 ~]# ps -eo pid,lstart,cmd | grep -iE 'corosync|pacemaker' | grep -v grep
>>  2039 Wed Dec 25 15:56:15 2019 corosync
>>  3048 Wed Dec 25 15:56:15 2019 /usr/sbin/pacemakerd -f
>>  3101 Wed Dec 25 15:56:15 2019 /usr/libexec/pacemaker/pacemaker-based
>>  3102 Wed Dec 25 15:56:15 2019 /usr/libexec/pacemaker/pacemaker-fenced
>>  3103 Wed Dec 25 15:56:15 2019 /usr/libexec/pacemaker/pacemaker-execd
>>  3104 Wed Dec 25 15:56:15 2019 /usr/libexec/pacemaker/pacemaker-attrd
>>  3105 Wed Dec 25 15:56:15 2019 /usr/libexec/pacemaker/pacemaker-schedulerd
>> 25371 Tue Dec 31 17:38:53 2019 /usr/libexec/pacemaker/pacemaker-controld
>>
>> In the system message logs:
>>
>> Dec 30 10:02:37 vmc0621 pacemaker-controld[7517]: error: Node update 4419 failed: Timer expired (-62)
>> Dec 30 10:02:37 vmc0621 pacemaker-controld[7517]: error: Node update 4420 failed: Timer expired (-62)
>
> This means that the controller is not getting a response back from the
> CIB manager (pacemaker-based) within a reasonable time. If the DC can't
> record the status of nodes, it can't make correct decisions, so it has
> no choice but to exit (which should lead another node to fence it).

I am not sure whether it would be feasible in this other, mutual daemon
relationship, but my first idea was that it might have something to do
with a deadlock-prone arrangement of priorities, akin to what was
resolved between pacemaker-fenced and pacemaker-based not too long ago
(perhaps -based would be bombarding -controld with updates rather than
responding to some of its prior queries?):
https://github.com/ClusterLabs/pacemaker/commit/3401f25994e8cc059898550082f9b75f2d07f103

Satish, you haven't included any metrics of your cluster (node count,
resource count, load of the affected machine/all machines around the
time the problem occurred), nor have you provided wider excerpts of
the log.
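As an aside, the restart is already visible in the `ps` excerpt above:
pacemaker-controld's start time differs from pacemakerd's. A minimal
Python sketch (a hypothetical helper, not part of Pacemaker) that flags
subdaemons started later than their parent from
`ps -eo pid,lstart,cmd` output:

```python
from datetime import datetime

def find_restarted(ps_lines):
    """Return commands of pacemaker subdaemons started after pacemakerd,
    i.e. daemons that were respawned at some point."""
    procs = []
    for line in ps_lines:
        parts = line.split()
        # lstart occupies five fields: weekday, month, day, time, year
        started = datetime.strptime(" ".join(parts[1:6]),
                                    "%a %b %d %H:%M:%S %Y")
        cmd = " ".join(parts[6:])
        procs.append((cmd, started))
    parent_start = next(t for c, t in procs if "pacemakerd" in c)
    return [c for c, t in procs if "pacemaker-" in c and t > parent_start]

ps = [
    "3048 Wed Dec 25 15:56:15 2019 /usr/sbin/pacemakerd -f",
    "3101 Wed Dec 25 15:56:15 2019 /usr/libexec/pacemaker/pacemaker-based",
    "25371 Tue Dec 31 17:38:53 2019 /usr/libexec/pacemaker/pacemaker-controld",
]
print(find_restarted(ps))
# -> ['/usr/libexec/pacemaker/pacemaker-controld']
```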
All in all, I'd start by updating libqb to 1.9.0, which supposedly
contains https://github.com/ClusterLabs/libqb/pull/352, the fix for the
event-priority glitch, just in case.

> The default timeout is the number of active nodes in the cluster times
> 10 seconds, with a minimum of 30 seconds. That's a lot of time, so I
> would be concerned if the CIB isn't responsive for that long.
>
> The logs from pacemaker-based before this point might be helpful,
> although if it's not getting scheduled any CPU time there wouldn't be
> any indication of that.
>
> It is possible to set the timeout explicitly using the PCMK_cib_timeout
> environment variable, but the underlying problem would be likely to
> cause other issues.

--
Poki
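For illustration, the default timeout rule Ken describes (10 seconds per
active node, with a 30-second floor) can be sketched as follows; this is
a hypothetical helper for clarity, not actual Pacemaker code:

```python
def default_cib_timeout(active_nodes: int) -> int:
    """Default controller timeout for CIB node updates, in seconds:
    10 s per active node, but never less than 30 s."""
    return max(30, active_nodes * 10)

print(default_cib_timeout(2))  # 30 -- small clusters hit the floor
print(default_cib_timeout(5))  # 50
```

So on the two-node cluster shown here, pacemaker-based would have had a
full 30 seconds to answer before the controller gave up.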
_______________________________________________
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users

ClusterLabs home: https://www.clusterlabs.org/