Re: [ClusterLabs] pacemaker-controld getting respawned
On 06/01/20 11:53 -0600, Ken Gaillot wrote:
> On Fri, 2020-01-03 at 13:23 +, S Sathish S wrote:
>> Pacemaker-controld process is getting restarted frequently reason for
>> failure disconnect from CIB/Internal Error (or) high cpu on the
>> system, same has been recorded in our system logs, Please find the
>> pacemaker and corosync version installed on the system.
>>
>> Kindly let us know why we are getting below error on the system.
>>
>> corosync-2.4.4 --> https://github.com/corosync/corosync/tree/v2.4.4
>> pacemaker-2.0.2 -->
>> https://github.com/ClusterLabs/pacemaker/tree/Pacemaker-2.0.2

libqb version is missing (to be explained later on)

>> [root@vmc0621 ~]# ps -eo pid,lstart,cmd | grep -iE 'corosync|pacemaker' | grep -v grep
>> 2039 Wed Dec 25 15:56:15 2019 corosync
>> 3048 Wed Dec 25 15:56:15 2019 /usr/sbin/pacemakerd -f
>> 3101 Wed Dec 25 15:56:15 2019 /usr/libexec/pacemaker/pacemaker-based
>> 3102 Wed Dec 25 15:56:15 2019 /usr/libexec/pacemaker/pacemaker-fenced
>> 3103 Wed Dec 25 15:56:15 2019 /usr/libexec/pacemaker/pacemaker-execd
>> 3104 Wed Dec 25 15:56:15 2019 /usr/libexec/pacemaker/pacemaker-attrd
>> 3105 Wed Dec 25 15:56:15 2019 /usr/libexec/pacemaker/pacemaker-schedulerd
>> 25371 Tue Dec 31 17:38:53 2019 /usr/libexec/pacemaker/pacemaker-controld
>>
>> In system message logs :
>>
>> Dec 30 10:02:37 vmc0621 pacemaker-controld[7517]: error: Node update 4419 failed: Timer expired (-62)
>> Dec 30 10:02:37 vmc0621 pacemaker-controld[7517]: error: Node update 4420 failed: Timer expired (-62)
>
> This means that the controller is not getting a response back from the
> CIB manager (pacemaker-based) within a reasonable time. If the DC can't
> record the status of nodes, it can't make correct decisions, so it has
> no choice but to exit (which should lead another node to fence it).
I am not sure whether it would be feasible in this other, mutual daemon relationship, but my first idea was that it might have something to do with a deadlock-prone arrangement of priorities, akin to what was resolved between pacemaker-fenced and pacemaker-based not too long ago (perhaps -based would be bombarding -controld with updates rather than responding to some of its earlier queries?):
https://github.com/ClusterLabs/pacemaker/commit/3401f25994e8cc059898550082f9b75f2d07f103

Sathish, you haven't included any metrics of your cluster (node count, resource count, load of the affected machine and of the surrounding machines around the time the problem occurred), nor have you provided wider excerpts of the log. All in all, I'd start by updating libqb to 1.9.0, which supposedly contains https://github.com/ClusterLabs/libqb/pull/352, the fix for the priority-related glitch in event handling, just in case.

> The default timeout is the number of active nodes in the cluster times
> 10 seconds, with a minimum of 30 seconds. That's a lot of time, so I
> would be concerned if the CIB isn't responsive for that long.
>
> The logs from pacemaker-based before this point might be helpful,
> although if it's not getting scheduled any CPU time there wouldn't be
> any indication of that.
>
> It is possible to set the timeout explicitly using the PCMK_cib_timeout
> environment variable, but the underlying problem would be likely to
> cause other issues.

-- 
Poki

___
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users

ClusterLabs home: https://www.clusterlabs.org/
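Since the advice above hinges on which libqb version is actually installed, a quick check could look like the sketch below. This is not from the thread: `version_at_least` is a helper invented here, the `rpm` query mentioned in the comments assumes an RPM-based system like the poster's, and 1.9.0 is simply the version named above.

```shell
# Minimal sketch: decide whether an installed libqb already carries the
# event-priority fix (libqb >= 1.9.0, per the thread).

# version_at_least A B -- true (exit 0) when A >= B under version ordering.
# Relies on GNU sort's -V (version sort).
version_at_least() {
    [ "$(printf '%s\n' "$2" "$1" | sort -V | head -n1)" = "$2" ]
}

# On an RPM-based system the installed version could be obtained with:
#   rpm -q --qf '%{VERSION}' libqb
# Here we use an example value instead, so the sketch runs anywhere:
installed=1.0.3

if version_at_least "$installed" 1.9.0; then
    echo "libqb has the priority fix"
else
    echo "libqb predates the priority fix; consider updating"
fi
```

It can also be worth confirming which library the daemons actually load (e.g. with `ldd` on the pacemaker binaries), since a newer package is no help if an older copy is still being linked.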
Re: [ClusterLabs] pacemaker-controld getting respawned
On Fri, 2020-01-03 at 13:23 +, S Sathish S wrote:
> Hi Team,
>
> Pacemaker-controld process is getting restarted frequently reason for
> failure disconnect from CIB/Internal Error (or) high cpu on the
> system, same has been recorded in our system logs, Please find the
> pacemaker and corosync version installed on the system.
>
> Kindly let us know why we are getting below error on the system.
>
> corosync-2.4.4 --> https://github.com/corosync/corosync/tree/v2.4.4
> pacemaker-2.0.2 -->
> https://github.com/ClusterLabs/pacemaker/tree/Pacemaker-2.0.2
>
> [root@vmc0621 ~]# ps -eo pid,lstart,cmd | grep -iE 'corosync|pacemaker' | grep -v grep
> 2039 Wed Dec 25 15:56:15 2019 corosync
> 3048 Wed Dec 25 15:56:15 2019 /usr/sbin/pacemakerd -f
> 3101 Wed Dec 25 15:56:15 2019 /usr/libexec/pacemaker/pacemaker-based
> 3102 Wed Dec 25 15:56:15 2019 /usr/libexec/pacemaker/pacemaker-fenced
> 3103 Wed Dec 25 15:56:15 2019 /usr/libexec/pacemaker/pacemaker-execd
> 3104 Wed Dec 25 15:56:15 2019 /usr/libexec/pacemaker/pacemaker-attrd
> 3105 Wed Dec 25 15:56:15 2019 /usr/libexec/pacemaker/pacemaker-schedulerd
> 25371 Tue Dec 31 17:38:53 2019 /usr/libexec/pacemaker/pacemaker-controld
>
> In system message logs :
>
> Dec 30 10:02:37 vmc0621 pacemaker-controld[7517]: error: Node update 4419 failed: Timer expired (-62)
> Dec 30 10:02:37 vmc0621 pacemaker-controld[7517]: error: Node update 4420 failed: Timer expired (-62)

This means that the controller is not getting a response back from the CIB manager (pacemaker-based) within a reasonable time. If the DC can't record the status of nodes, it can't make correct decisions, so it has no choice but to exit (which should lead another node to fence it).

The default timeout is the number of active nodes in the cluster times 10 seconds, with a minimum of 30 seconds. That's a lot of time, so I would be concerned if the CIB isn't responsive for that long.
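The default-timeout rule described above (active nodes times 10 seconds, floor of 30) can be sketched as a tiny helper. `default_cib_timeout` is a made-up name for illustration, not anything Pacemaker itself exports:

```shell
# Sketch of the default rule: timeout = max(active_nodes * 10, 30) seconds.
default_cib_timeout() {
    t=$(( $1 * 10 ))
    [ "$t" -lt 30 ] && t=30
    echo "$t"
}

default_cib_timeout 2   # prints 30 (the 30-second floor applies)
default_cib_timeout 5   # prints 50
```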
The logs from pacemaker-based before this point might be helpful, although if it's not getting scheduled any CPU time there wouldn't be any indication of that.

It is possible to set the timeout explicitly using the PCMK_cib_timeout environment variable, but the underlying problem would be likely to cause other issues.

> Dec 30 10:02:37 vmc0621 pacemaker-controld[7517]: error: Input I_ERROR received in state S_IDLE from crmd_node_update_complete
> Dec 30 10:02:37 vmc0621 pacemaker-controld[7517]: notice: State transition S_IDLE -> S_RECOVERY
> Dec 30 10:02:37 vmc0621 pacemaker-controld[7517]: warning: Fast-tracking shutdown in response to errors
> Dec 30 10:02:37 vmc0621 pacemaker-controld[7517]: warning: Not voting in election, we're in state S_RECOVERY
> Dec 30 10:02:37 vmc0621 pacemaker-controld[7517]: error: Input I_ERROR received in state S_RECOVERY from node_list_update_callback
> Dec 30 10:02:37 vmc0621 pacemaker-controld[7517]: error: Input I_TERMINATE received in state S_RECOVERY from do_recover
> Dec 30 10:02:37 vmc0621 pacemaker-controld[7517]: notice: Stopped 0 recurring operations at shutdown (12 remaining)
> Dec 30 10:02:37 vmc0621 pacemaker-controld[7517]: notice: Recurring action XXX_vmc0621:241 (XXX_vmc0621_monitor_1) incomplete at shutdown
> Dec 30 10:02:37 vmc0621 pacemaker-controld[7517]: notice: Recurring action XXX_vmc0621:261 (XXX_vmc0621_monitor_1) incomplete at shutdown
> Dec 30 10:02:37 vmc0621 pacemaker-controld[7517]: notice: Recurring action XXX_vmc0621:249 (XXX_vmc0621_monitor_1) incomplete at shutdown
> Dec 30 10:02:37 vmc0621 pacemaker-controld[7517]: notice: Recurring action XXX_vmc0621:258 (XXX_vmc0621_monitor_1) incomplete at shutdown
> Dec 30 10:02:37 vmc0621 pacemaker-controld[7517]: notice: Recurring action XXX_vmc0621:253 (XXX_vmc0621_monitor_1) incomplete at shutdown
> Dec 30 10:02:37 vmc0621 pacemaker-controld[7517]: notice: Recurring action XXX_vmc0621:250 (XXX_vmc0621_monitor_1) incomplete at shutdown
> Dec 30 10:02:37 vmc0621 pacemaker-controld[7517]: notice: Recurring action XXX_vmc0621:244 (XXX_vmc0621_monitor_1) incomplete at shutdown
> Dec 30 10:02:37 vmc0621 pacemaker-controld[7517]: notice: Recurring action XXX_OCC:237 (XXX_monitor_1) incomplete at shutdown
> Dec 30 10:02:37 vmc0621 pacemaker-controld[7517]: notice: Recurring action XXX_vmc0621:264 (XXX_vmc0621_monitor_1) incomplete at shutdown
> Dec 30 10:02:37 vmc0621 pacemaker-controld[7517]: notice: Recurring action XXX_vmc0621:270 (XXX_vmc0621_monitor_1) incomplete at shutdown
> Dec 30 10:02:37 vmc0621 pacemaker-controld[7517]: notice: Recurring action XXX_vmc0621:238 (XXX_vmc0621_monitor_1) incomplete at shutdown
> Dec 30 10:02:37 vmc0621 pacemaker-controld[7517]: notice: Recurring action XXX_vmc0621:267 (XXX_vmc0621_monitor_1) incomplete at shutdown
> Dec 30 10:02:37 vmc0621 pacemaker-controld[7517]: error: 12 resources were active at shutdown
> Dec 30 10:02:37 vmc0621
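The PCMK_cib_timeout override mentioned above would typically be applied through Pacemaker's environment file. This is a sketch only: the path assumes an RHEL-like layout, and the unit is assumed to be seconds; verify both against the documentation for your installed version before relying on it.

```shell
# Sketch: raising the controller's CIB timeout on an RHEL-like system.
# In /etc/sysconfig/pacemaker (Debian-like systems: /etc/default/pacemaker):
PCMK_cib_timeout=120   # assumed to be seconds; check your version's docs

# The override takes effect only after the cluster stack on that node is
# restarted, e.g.:
#   systemctl restart pacemaker
```

That said, as noted above, a CIB that stays unresponsive for 30+ seconds points at a scheduling or load problem that a longer timeout would only hide.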
[ClusterLabs] pacemaker-controld getting respawned
Hi Team,

The pacemaker-controld process is getting restarted frequently; the reason for the failure is a disconnect from the CIB / an internal error, or high CPU on the system. The same has been recorded in our system logs. Please find the pacemaker and corosync versions installed on the system below.

Kindly let us know why we are getting the below errors on the system.

corosync-2.4.4 --> https://github.com/corosync/corosync/tree/v2.4.4
pacemaker-2.0.2 --> https://github.com/ClusterLabs/pacemaker/tree/Pacemaker-2.0.2

[root@vmc0621 ~]# ps -eo pid,lstart,cmd | grep -iE 'corosync|pacemaker' | grep -v grep
2039 Wed Dec 25 15:56:15 2019 corosync
3048 Wed Dec 25 15:56:15 2019 /usr/sbin/pacemakerd -f
3101 Wed Dec 25 15:56:15 2019 /usr/libexec/pacemaker/pacemaker-based
3102 Wed Dec 25 15:56:15 2019 /usr/libexec/pacemaker/pacemaker-fenced
3103 Wed Dec 25 15:56:15 2019 /usr/libexec/pacemaker/pacemaker-execd
3104 Wed Dec 25 15:56:15 2019 /usr/libexec/pacemaker/pacemaker-attrd
3105 Wed Dec 25 15:56:15 2019 /usr/libexec/pacemaker/pacemaker-schedulerd
25371 Tue Dec 31 17:38:53 2019 /usr/libexec/pacemaker/pacemaker-controld

In system message logs:

Dec 30 10:02:37 vmc0621 pacemaker-controld[7517]: error: Node update 4419 failed: Timer expired (-62)
Dec 30 10:02:37 vmc0621 pacemaker-controld[7517]: error: Node update 4420 failed: Timer expired (-62)
Dec 30 10:02:37 vmc0621 pacemaker-controld[7517]: error: Input I_ERROR received in state S_IDLE from crmd_node_update_complete
Dec 30 10:02:37 vmc0621 pacemaker-controld[7517]: notice: State transition S_IDLE -> S_RECOVERY
Dec 30 10:02:37 vmc0621 pacemaker-controld[7517]: warning: Fast-tracking shutdown in response to errors
Dec 30 10:02:37 vmc0621 pacemaker-controld[7517]: warning: Not voting in election, we're in state S_RECOVERY
Dec 30 10:02:37 vmc0621 pacemaker-controld[7517]: error: Input I_ERROR received in state S_RECOVERY from node_list_update_callback
Dec 30 10:02:37 vmc0621 pacemaker-controld[7517]: error: Input I_TERMINATE received in state S_RECOVERY from do_recover
Dec 30 10:02:37 vmc0621 pacemaker-controld[7517]: notice: Stopped 0 recurring operations at shutdown (12 remaining)
Dec 30 10:02:37 vmc0621 pacemaker-controld[7517]: notice: Recurring action XXX_vmc0621:241 (XXX_vmc0621_monitor_1) incomplete at shutdown
Dec 30 10:02:37 vmc0621 pacemaker-controld[7517]: notice: Recurring action XXX_vmc0621:261 (XXX_vmc0621_monitor_1) incomplete at shutdown
Dec 30 10:02:37 vmc0621 pacemaker-controld[7517]: notice: Recurring action XXX_vmc0621:249 (XXX_vmc0621_monitor_1) incomplete at shutdown
Dec 30 10:02:37 vmc0621 pacemaker-controld[7517]: notice: Recurring action XXX_vmc0621:258 (XXX_vmc0621_monitor_1) incomplete at shutdown
Dec 30 10:02:37 vmc0621 pacemaker-controld[7517]: notice: Recurring action XXX_vmc0621:253 (XXX_vmc0621_monitor_1) incomplete at shutdown
Dec 30 10:02:37 vmc0621 pacemaker-controld[7517]: notice: Recurring action XXX_vmc0621:250 (XXX_vmc0621_monitor_1) incomplete at shutdown
Dec 30 10:02:37 vmc0621 pacemaker-controld[7517]: notice: Recurring action XXX_vmc0621:244 (XXX_vmc0621_monitor_1) incomplete at shutdown
Dec 30 10:02:37 vmc0621 pacemaker-controld[7517]: notice: Recurring action XXX_OCC:237 (XXX_monitor_1) incomplete at shutdown
Dec 30 10:02:37 vmc0621 pacemaker-controld[7517]: notice: Recurring action XXX_vmc0621:264 (XXX_vmc0621_monitor_1) incomplete at shutdown
Dec 30 10:02:37 vmc0621 pacemaker-controld[7517]: notice: Recurring action XXX_vmc0621:270 (XXX_vmc0621_monitor_1) incomplete at shutdown
Dec 30 10:02:37 vmc0621 pacemaker-controld[7517]: notice: Recurring action XXX_vmc0621:238 (XXX_vmc0621_monitor_1) incomplete at shutdown
Dec 30 10:02:37 vmc0621 pacemaker-controld[7517]: notice: Recurring action XXX_vmc0621:267 (XXX_vmc0621_monitor_1) incomplete at shutdown
Dec 30 10:02:37 vmc0621 pacemaker-controld[7517]: error: 12 resources were active at shutdown
Dec 30 10:02:37 vmc0621 pacemaker-controld[7517]: notice: Disconnected from the executor
Dec 30 10:02:37 vmc0621 pacemaker-controld[7517]: notice: Disconnected from Corosync
Dec 30 10:02:37 vmc0621 pacemaker-controld[7517]: notice: Disconnected from the CIB manager
Dec 30 10:02:37 vmc0621 pacemaker-controld[7517]: error: Could not recover from internal error
Dec 30 10:02:37 vmc0621 pacemakerd[3048]: error: pacemaker-controld[7517] exited with status 1 (Error occurred)
Dec 30 10:02:37 vmc0621 pacemakerd[3048]: notice: Respawning failed child process: pacemaker-controld

Please let us know if any further logs are required from our end.

Thanks and Regards,
S Sathish S