On 06/01/20 11:53 -0600, Ken Gaillot wrote:
> On Fri, 2020-01-03 at 13:23 +0000, S Sathish S wrote:
>> The pacemaker-controld process is getting restarted frequently; the
>> reason for failure is a disconnect from the CIB/Internal Error (or)
>> high CPU on the system. The same has been recorded in our system
>> logs. Please find the pacemaker and corosync versions installed on
>> the system.
>>
>> Kindly let us know why we are getting the below error on the system.
>>
>> corosync-2.4.4 -> https://github.com/corosync/corosync/tree/v2.4.4
>> pacemaker-2.0.2 -> https://github.com/ClusterLabs/pacemaker/tree/Pacemaker-2.0.2
The libqb version is missing (to be explained later on).

>> [root@vmc0621 ~]# ps -eo pid,lstart,cmd | grep -iE 'corosync|pacemaker' | grep -v grep
>>  2039 Wed Dec 25 15:56:15 2019 corosync
>>  3048 Wed Dec 25 15:56:15 2019 /usr/sbin/pacemakerd -f
>>  3101 Wed Dec 25 15:56:15 2019 /usr/libexec/pacemaker/pacemaker-based
>>  3102 Wed Dec 25 15:56:15 2019 /usr/libexec/pacemaker/pacemaker-fenced
>>  3103 Wed Dec 25 15:56:15 2019 /usr/libexec/pacemaker/pacemaker-execd
>>  3104 Wed Dec 25 15:56:15 2019 /usr/libexec/pacemaker/pacemaker-attrd
>>  3105 Wed Dec 25 15:56:15 2019 /usr/libexec/pacemaker/pacemaker-schedulerd
>> 25371 Tue Dec 31 17:38:53 2019 /usr/libexec/pacemaker/pacemaker-controld
>>
>> In the system message logs:
>>
>> Dec 30 10:02:37 vmc0621 pacemaker-controld[7517]: error: Node update 4419 failed: Timer expired (-62)
>> Dec 30 10:02:37 vmc0621 pacemaker-controld[7517]: error: Node update 4420 failed: Timer expired (-62)
>
> This means that the controller is not getting a response back from the
> CIB manager (pacemaker-based) within a reasonable time. If the DC can't
> record the status of nodes, it can't make correct decisions, so it has
> no choice but to exit (which should lead another node to fence it).

I am not sure whether it would be feasible in this other, mutual daemon
relationship, but my first idea was that it might have something to do
with a deadlock-prone arrangement of priorities, akin to what was
resolved between pacemaker-fenced and pacemaker-based not too long ago
(perhaps -based would be bombarding -controld with updates rather than
responding to some of its prior queries?):
https://github.com/ClusterLabs/pacemaker/commit/3401f25994e8cc059898550082f9b75f2d07f103

Satish, you haven't included any metrics of your cluster (node count,
resource count, load of the affected machine/all machines around the
time the problem occurred), nor have you provided wider excerpts of
the log.
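As an aside, the restart is already visible in the `ps` excerpt above:
pacemaker-controld's start time differs from pacemakerd's. A minimal
Python sketch (a hypothetical helper, not part of Pacemaker) that flags
subdaemons started later than their parent from
`ps -eo pid,lstart,cmd` output:

```python
from datetime import datetime

def find_restarted(ps_lines):
    """Return commands of pacemaker subdaemons started after pacemakerd,
    i.e. daemons that were respawned at some point."""
    procs = []
    for line in ps_lines:
        parts = line.split()
        # lstart occupies five fields: weekday, month, day, time, year
        started = datetime.strptime(" ".join(parts[1:6]),
                                    "%a %b %d %H:%M:%S %Y")
        cmd = " ".join(parts[6:])
        procs.append((cmd, started))
    parent_start = next(t for c, t in procs if "pacemakerd" in c)
    return [c for c, t in procs if "pacemaker-" in c and t > parent_start]

ps = [
    "3048 Wed Dec 25 15:56:15 2019 /usr/sbin/pacemakerd -f",
    "3101 Wed Dec 25 15:56:15 2019 /usr/libexec/pacemaker/pacemaker-based",
    "25371 Tue Dec 31 17:38:53 2019 /usr/libexec/pacemaker/pacemaker-controld",
]
print(find_restarted(ps))
# -> ['/usr/libexec/pacemaker/pacemaker-controld']
```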
All in all, I'd start by updating libqb to 1.9.0, which supposedly
contains https://github.com/ClusterLabs/libqb/pull/352, the fix for the
event-priority glitch, just in case.

> The default timeout is the number of active nodes in the cluster times
> 10 seconds, with a minimum of 30 seconds. That's a lot of time, so I
> would be concerned if the CIB isn't responsive for that long.
>
> The logs from pacemaker-based before this point might be helpful,
> although if it's not getting scheduled any CPU time there wouldn't be
> any indication of that.
>
> It is possible to set the timeout explicitly using the PCMK_cib_timeout
> environment variable, but the underlying problem would be likely to
> cause other issues.

--
Poki
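For illustration, the default timeout rule Ken describes (10 seconds per
active node, with a 30-second floor) can be sketched as follows; this is
a hypothetical helper for clarity, not actual Pacemaker code:

```python
def default_cib_timeout(active_nodes: int) -> int:
    """Default controller timeout for CIB node updates, in seconds:
    10 s per active node, but never less than 30 s."""
    return max(30, active_nodes * 10)

print(default_cib_timeout(2))  # 30 -- small clusters hit the floor
print(default_cib_timeout(5))  # 50
```

So on the two-node cluster shown here, pacemaker-based would have had a
full 30 seconds to answer before the controller gave up.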
_______________________________________________
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users

ClusterLabs home: https://www.clusterlabs.org/