Re: [ClusterLabs] pacemaker-controld getting respawned

2020-01-07 Thread Jan Pokorný
On 06/01/20 11:53 -0600, Ken Gaillot wrote:
> On Fri, 2020-01-03 at 13:23 +, S Sathish S wrote:
>> The pacemaker-controld process is getting restarted frequently; the
>> reason for failure is "disconnect from CIB/Internal Error" or high CPU
>> on the system, as recorded in our system logs. Please find the
>> pacemaker and corosync versions installed on the system below.
>>  
>> Kindly let us know why we are getting the errors below on the system.
>>  
>> corosync-2.4.4 --> https://github.com/corosync/corosync/tree/v2.4.4
>> pacemaker-2.0.2 -->
>> https://github.com/ClusterLabs/pacemaker/tree/Pacemaker-2.0.2

libqb version is missing (to be explained later on)

>> [root@vmc0621 ~]# ps -eo pid,lstart,cmd  | grep -iE
>> 'corosync|pacemaker' | grep -v grep
>> 2039 Wed Dec 25 15:56:15 2019 corosync
>> 3048 Wed Dec 25 15:56:15 2019 /usr/sbin/pacemakerd -f
>> 3101 Wed Dec 25 15:56:15 2019 /usr/libexec/pacemaker/pacemaker-based
>> 3102 Wed Dec 25 15:56:15 2019 /usr/libexec/pacemaker/pacemaker-fenced
>> 3103 Wed Dec 25 15:56:15 2019 /usr/libexec/pacemaker/pacemaker-execd
>> 3104 Wed Dec 25 15:56:15 2019 /usr/libexec/pacemaker/pacemaker-attrd
>> 3105 Wed Dec 25 15:56:15 2019 /usr/libexec/pacemaker/pacemaker-schedulerd
>> 25371 Tue Dec 31 17:38:53 2019 /usr/libexec/pacemaker/pacemaker-controld
>>  
>>  
>> In system message logs :
>>  
>> Dec 30 10:02:37 vmc0621 pacemaker-controld[7517]: error: Node update 4419 
>> failed: Timer expired (-62)
>> Dec 30 10:02:37 vmc0621 pacemaker-controld[7517]: error: Node update 4420 
>> failed: Timer expired (-62)
> 
> This means that the controller is not getting a response back from the
> CIB manager (pacemaker-based) within a reasonable time. If the DC can't
> record the status of nodes, it can't make correct decisions, so it has
> no choice but to exit (which should lead another node to fence it).

I am not sure whether it would apply in this other, mutual daemon
relationship, but my first idea was that it might have something to
do with a deadlock-prone arrangement of priorities, akin to what
was resolved not too long ago between pacemaker-fenced and
pacemaker-based (perhaps -based would be bombarding -controld with
updates rather than responding to some of its prior queries?):
https://github.com/ClusterLabs/pacemaker/commit/3401f25994e8cc059898550082f9b75f2d07f103

Satish, you haven't included any metrics of your cluster (number of
nodes, number of resources, load of the affected machine and of the
machines around it at the time the problem occurred), nor have you
provided wider excerpts of the logs.
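
For a first picture, something along these lines could do (the time
window is purely illustrative, and sar output is only available if
sysstat collection is enabled on the machine):

  # node/resource counts and current cluster status
  crm_mon -1
  # system load around the time of the incident
  sar -q -f /var/log/sa/sa30 || uptime
  # wider log/configuration bundle covering the period in question
  crm_report -f "2019-12-30 09:00" -t "2019-12-30 11:00" /tmp/vmc0621-report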

All in all, I'd start with updating libqb to 1.9.0, which supposedly
contains https://github.com/ClusterLabs/libqb/pull/352, a fix for
a glitch concerning event priorities, just in case.
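
To check which libqb build is currently in use (assuming an RPM-based
system; otherwise see what the daemons are actually linked against):

  rpm -q libqb
  ldd /usr/libexec/pacemaker/pacemaker-based | grep libqb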

> The default timeout is the number of active nodes in the cluster times
> 10 seconds, with a minimum of 30 seconds. That's a lot of time, so I
> would be concerned if the CIB isn't responsive for that long.
> 
> The logs from pacemaker-based before this point might be helpful,
> although if it's not getting scheduled any CPU time there wouldn't be
> any indication of that.
> 
> It is possible to set the timeout explicitly using the PCMK_cib_timeout
> environment variable, but the underlying problem would be likely to
> cause other issues.

-- 
Poki



Re: [ClusterLabs] pacemaker-controld getting respawned

2020-01-06 Thread Ken Gaillot
On Fri, 2020-01-03 at 13:23 +, S Sathish S wrote:
> Hi Team,
>  
> The pacemaker-controld process is getting restarted frequently; the
> reason for failure is "disconnect from CIB/Internal Error" or high CPU
> on the system, as recorded in our system logs. Please find the
> pacemaker and corosync versions installed on the system below.
>  
> Kindly let us know why we are getting the errors below on the system.
>  
> corosync-2.4.4 --> https://github.com/corosync/corosync/tree/v2.4.4
> pacemaker-2.0.2 -->
> https://github.com/ClusterLabs/pacemaker/tree/Pacemaker-2.0.2
>  
> [root@vmc0621 ~]# ps -eo pid,lstart,cmd  | grep -iE
> 'corosync|pacemaker' | grep -v grep
> 2039 Wed Dec 25 15:56:15 2019 corosync
> 3048 Wed Dec 25 15:56:15 2019 /usr/sbin/pacemakerd -f
> 3101 Wed Dec 25 15:56:15 2019 /usr/libexec/pacemaker/pacemaker-based
> 3102 Wed Dec 25 15:56:15 2019 /usr/libexec/pacemaker/pacemaker-fenced
> 3103 Wed Dec 25 15:56:15 2019 /usr/libexec/pacemaker/pacemaker-execd
> 3104 Wed Dec 25 15:56:15 2019 /usr/libexec/pacemaker/pacemaker-attrd
> 3105 Wed Dec 25 15:56:15 2019 /usr/libexec/pacemaker/pacemaker-schedulerd
> 25371 Tue Dec 31 17:38:53 2019 /usr/libexec/pacemaker/pacemaker-controld
>  
>  
> In system message logs :
>  
> Dec 30 10:02:37 vmc0621 pacemaker-controld[7517]: error: Node update
> 4419 failed: Timer expired (-62)
> Dec 30 10:02:37 vmc0621 pacemaker-controld[7517]: error: Node update
> 4420 failed: Timer expired (-62)

This means that the controller is not getting a response back from the
CIB manager (pacemaker-based) within a reasonable time. If the DC can't
record the status of nodes, it can't make correct decisions, so it has
no choice but to exit (which should lead another node to fence it).

The default timeout is the number of active nodes in the cluster times
10 seconds, with a minimum of 30 seconds. That's a lot of time, so I
would be concerned if the CIB isn't responsive for that long.
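
As a back-of-the-envelope illustration of that default (plain shell
arithmetic, not a Pacemaker command; the node count is made up):

  nodes=3
  timeout=$(( nodes * 10 ))             # 10 seconds per active node
  [ "$timeout" -lt 30 ] && timeout=30   # floor of 30 seconds
  echo "default CIB timeout: ${timeout}s"   # -> 30s for a 3-node cluster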

The logs from pacemaker-based before this point might be helpful,
although if it's not getting scheduled any CPU time there wouldn't be
any indication of that.

It is possible to set the timeout explicitly using the PCMK_cib_timeout
environment variable, but the underlying problem would be likely to
cause other issues.
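
If you do want to experiment with it, a minimal sketch, assuming a
sysconfig-style installation that reads /etc/sysconfig/pacemaker (the
value 60 is just an example; going by the defaults above, it is taken
as seconds):

  # in /etc/sysconfig/pacemaker
  PCMK_cib_timeout=60

  # then restart the cluster services on that node, e.g.
  systemctl restart pacemaker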

> Dec 30 10:02:37 vmc0621 pacemaker-controld[7517]: error: Input
> I_ERROR received in state S_IDLE from crmd_node_update_complete
> Dec 30 10:02:37 vmc0621 pacemaker-controld[7517]: notice: State
> transition S_IDLE -> S_RECOVERY
> Dec 30 10:02:37 vmc0621 pacemaker-controld[7517]: warning: Fast-
> tracking shutdown in response to errors
> Dec 30 10:02:37 vmc0621 pacemaker-controld[7517]: warning: Not voting
> in election, we're in state S_RECOVERY
> Dec 30 10:02:37 vmc0621 pacemaker-controld[7517]: error: Input
> I_ERROR received in state S_RECOVERY from node_list_update_callback
> Dec 30 10:02:37 vmc0621 pacemaker-controld[7517]: error: Input
> I_TERMINATE received in state S_RECOVERY from do_recover
> Dec 30 10:02:37 vmc0621 pacemaker-controld[7517]: notice: Stopped 0
> recurring operations at shutdown (12 remaining)
> Dec 30 10:02:37 vmc0621 pacemaker-controld[7517]: notice: Recurring
> action XXX_vmc0621:241 (XXX_vmc0621_monitor_1) incomplete at
> shutdown
> Dec 30 10:02:37 vmc0621 pacemaker-controld[7517]: notice: Recurring
> action XXX_vmc0621:261 (XXX_vmc0621_monitor_1) incomplete at
> shutdown
> Dec 30 10:02:37 vmc0621 pacemaker-controld[7517]: notice: Recurring
> action XXX_vmc0621:249 (XXX_vmc0621_monitor_1) incomplete at
> shutdown
> Dec 30 10:02:37 vmc0621 pacemaker-controld[7517]: notice: Recurring
> action XXX_vmc0621:258 (XXX_vmc0621_monitor_1) incomplete at
> shutdown
> Dec 30 10:02:37 vmc0621 pacemaker-controld[7517]: notice: Recurring
> action XXX_vmc0621:253 (XXX_vmc0621_monitor_1) incomplete at
> shutdown
> Dec 30 10:02:37 vmc0621 pacemaker-controld[7517]: notice: Recurring
> action XXX_vmc0621:250 (XXX_vmc0621_monitor_1) incomplete at
> shutdown
> Dec 30 10:02:37 vmc0621 pacemaker-controld[7517]: notice: Recurring
> action XXX_vmc0621:244 (XXX_vmc0621_monitor_1) incomplete at
> shutdown
> Dec 30 10:02:37 vmc0621 pacemaker-controld[7517]: notice: Recurring
> action XXX_OCC:237 (XXX_monitor_1) incomplete at shutdown
> Dec 30 10:02:37 vmc0621 pacemaker-controld[7517]: notice: Recurring
> action XXX_vmc0621:264 (XXX_vmc0621_monitor_1) incomplete at
> shutdown
> Dec 30 10:02:37 vmc0621 pacemaker-controld[7517]: notice: Recurring
> action XXX_vmc0621:270 (XXX_vmc0621_monitor_1) incomplete at
> shutdown
> Dec 30 10:02:37 vmc0621 pacemaker-controld[7517]: notice: Recurring
> action XXX_vmc0621:238 (XXX_vmc0621_monitor_1) incomplete at
> shutdown
> Dec 30 10:02:37 vmc0621 pacemaker-controld[7517]: notice: Recurring
> action XXX_vmc0621:267 (XXX_vmc0621_monitor_1) incomplete at
> shutdown
> Dec 30 10:02:37 vmc0621 pacemaker-controld[7517]: error: 12 resources
> were active at shutdown
> Dec 30 10:02:37 vmc0621 

[ClusterLabs] pacemaker-controld getting respawned

2020-01-04 Thread S Sathish S
Hi Team,

The pacemaker-controld process is getting restarted frequently; the reason for
failure is "disconnect from CIB/Internal Error" or high CPU on the system, as
recorded in our system logs. Please find the pacemaker and corosync versions
installed on the system below.

Kindly let us know why we are getting the errors below on the system.

corosync-2.4.4 -->  https://github.com/corosync/corosync/tree/v2.4.4
pacemaker-2.0.2 --> 
https://github.com/ClusterLabs/pacemaker/tree/Pacemaker-2.0.2

[root@vmc0621 ~]# ps -eo pid,lstart,cmd  | grep -iE 'corosync|pacemaker' | grep 
-v grep
2039 Wed Dec 25 15:56:15 2019 corosync
3048 Wed Dec 25 15:56:15 2019 /usr/sbin/pacemakerd -f
3101 Wed Dec 25 15:56:15 2019 /usr/libexec/pacemaker/pacemaker-based
3102 Wed Dec 25 15:56:15 2019 /usr/libexec/pacemaker/pacemaker-fenced
3103 Wed Dec 25 15:56:15 2019 /usr/libexec/pacemaker/pacemaker-execd
3104 Wed Dec 25 15:56:15 2019 /usr/libexec/pacemaker/pacemaker-attrd
3105 Wed Dec 25 15:56:15 2019 /usr/libexec/pacemaker/pacemaker-schedulerd
25371 Tue Dec 31 17:38:53 2019 /usr/libexec/pacemaker/pacemaker-controld


In system message logs :

Dec 30 10:02:37 vmc0621 pacemaker-controld[7517]: error: Node update 4419 
failed: Timer expired (-62)
Dec 30 10:02:37 vmc0621 pacemaker-controld[7517]: error: Node update 4420 
failed: Timer expired (-62)
Dec 30 10:02:37 vmc0621 pacemaker-controld[7517]: error: Input I_ERROR received 
in state S_IDLE from crmd_node_update_complete
Dec 30 10:02:37 vmc0621 pacemaker-controld[7517]: notice: State transition 
S_IDLE -> S_RECOVERY
Dec 30 10:02:37 vmc0621 pacemaker-controld[7517]: warning: Fast-tracking 
shutdown in response to errors
Dec 30 10:02:37 vmc0621 pacemaker-controld[7517]: warning: Not voting in 
election, we're in state S_RECOVERY
Dec 30 10:02:37 vmc0621 pacemaker-controld[7517]: error: Input I_ERROR received 
in state S_RECOVERY from node_list_update_callback
Dec 30 10:02:37 vmc0621 pacemaker-controld[7517]: error: Input I_TERMINATE 
received in state S_RECOVERY from do_recover
Dec 30 10:02:37 vmc0621 pacemaker-controld[7517]: notice: Stopped 0 recurring 
operations at shutdown (12 remaining)
Dec 30 10:02:37 vmc0621 pacemaker-controld[7517]: notice: Recurring action 
XXX_vmc0621:241 (XXX_vmc0621_monitor_1) incomplete at shutdown
Dec 30 10:02:37 vmc0621 pacemaker-controld[7517]: notice: Recurring action 
XXX_vmc0621:261 (XXX_vmc0621_monitor_1) incomplete at shutdown
Dec 30 10:02:37 vmc0621 pacemaker-controld[7517]: notice: Recurring action 
XXX_vmc0621:249 (XXX_vmc0621_monitor_1) incomplete at shutdown
Dec 30 10:02:37 vmc0621 pacemaker-controld[7517]: notice: Recurring action 
XXX_vmc0621:258 (XXX_vmc0621_monitor_1) incomplete at shutdown
Dec 30 10:02:37 vmc0621 pacemaker-controld[7517]: notice: Recurring action 
XXX_vmc0621:253 (XXX_vmc0621_monitor_1) incomplete at shutdown
Dec 30 10:02:37 vmc0621 pacemaker-controld[7517]: notice: Recurring action 
XXX_vmc0621:250 (XXX_vmc0621_monitor_1) incomplete at shutdown
Dec 30 10:02:37 vmc0621 pacemaker-controld[7517]: notice: Recurring action 
XXX_vmc0621:244 (XXX_vmc0621_monitor_1) incomplete at shutdown
Dec 30 10:02:37 vmc0621 pacemaker-controld[7517]: notice: Recurring action 
XXX_OCC:237 (XXX_monitor_1) incomplete at shutdown
Dec 30 10:02:37 vmc0621 pacemaker-controld[7517]: notice: Recurring action 
XXX_vmc0621:264 (XXX_vmc0621_monitor_1) incomplete at shutdown
Dec 30 10:02:37 vmc0621 pacemaker-controld[7517]: notice: Recurring action 
XXX_vmc0621:270 (XXX_vmc0621_monitor_1) incomplete at shutdown
Dec 30 10:02:37 vmc0621 pacemaker-controld[7517]: notice: Recurring action 
XXX_vmc0621:238 (XXX_vmc0621_monitor_1) incomplete at shutdown
Dec 30 10:02:37 vmc0621 pacemaker-controld[7517]: notice: Recurring action 
XXX_vmc0621:267 (XXX_vmc0621_monitor_1) incomplete at shutdown
Dec 30 10:02:37 vmc0621 pacemaker-controld[7517]: error: 12 resources were 
active at shutdown
Dec 30 10:02:37 vmc0621 pacemaker-controld[7517]: notice: Disconnected from the 
executor
Dec 30 10:02:37 vmc0621 pacemaker-controld[7517]: notice: Disconnected from 
Corosync
Dec 30 10:02:37 vmc0621 pacemaker-controld[7517]: notice: Disconnected from the 
CIB manager
Dec 30 10:02:37 vmc0621 pacemaker-controld[7517]: error: Could not recover from 
internal error
Dec 30 10:02:37 vmc0621 pacemakerd[3048]: error: pacemaker-controld[7517] 
exited with status 1 (Error occurred)
Dec 30 10:02:37 vmc0621 pacemakerd[3048]: notice: Respawning failed child 
process: pacemaker-controld

Please let us know if any further logs required from our end.

Thanks and Regards,
S Sathish S