Re: [ClusterLabs] corosync service stopping

2024-04-30 Thread Alexander Eastwood via Users
Hi Honza

I would say there is still a certain ambiguity in “shutdown by cfg request”, 
but by not using the term “sysadmin” it at least doesn’t suggest that the 
shutdown was triggered by a human. So yes, I think this phrasing is less 
misleading.

Cheers,

Alex

> On 29.04.2024, at 09:56, Jan Friesse  wrote:
> 
> Hi,
> I will reply just to "sysadmin" question:
> 
> On 26/04/2024 14:43, Alexander Eastwood via Users wrote:
>> Dear Reid,
> ...
> 
>> Why does the corosync log say ’shutdown by sysadmin’ when the shutdown was 
>> triggered by pacemaker? Isn’t this misleading?
> 
> This basically means the shutdown was triggered by a call to the corosync cfg 
> API. I can agree "sysadmin" is misleading. The problem is that the same cfg 
> API call is used by corosync-cfgtool, and corosync-cfgtool is used in the 
> systemd service file, where it really is most likely a sysadmin who initiated 
> the shutdown.
> 
> Currently, the function where this log message is printed has no information 
> about which process initiated the shutdown; it only knows the nodeid.
> 
> It would be possible to log more information (probably including proc_name) 
> in the cfg API function call, but that would probably be a better candidate 
> for the DEBUG log level.
> 
> So do you think "shutdown by cfg request" would be less misleading?
> 
> Regards
>  Honza

___
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users

ClusterLabs home: https://www.clusterlabs.org/


Re: [ClusterLabs] corosync service stopping

2024-04-29 Thread Jan Friesse

Hi,
I will reply just to "sysadmin" question:

On 26/04/2024 14:43, Alexander Eastwood via Users wrote:

Dear Reid,


...



Why does the corosync log say ’shutdown by sysadmin’ when the shutdown was 
triggered by pacemaker? Isn’t this misleading?


This basically means the shutdown was triggered by a call to the corosync cfg 
API. I can agree "sysadmin" is misleading. The problem is that the same cfg 
API call is used by corosync-cfgtool, and corosync-cfgtool is used in the 
systemd service file, where it really is most likely a sysadmin who initiated 
the shutdown.
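
For context, a minimal sketch of what such a cfg-API caller looks like, 
assuming the interface declared in corosync/cfg.h (corosync_cfg_initialize, 
corosync_cfg_try_shutdown, corosync_cfg_finalize and the 
COROSYNC_CFG_SHUTDOWN_FLAG_REQUEST flag); the build line, flag choice and 
error handling are illustrative only, not the exact code pacemaker uses:

    /* cfg_shutdown_request.c - ask the local corosync to shut down via the
     * cfg API. Build (illustrative): gcc cfg_shutdown_request.c -lcfg */
    #include <stdio.h>
    #include <corosync/corotypes.h>
    #include <corosync/cfg.h>

    int main(void)
    {
        corosync_cfg_handle_t handle;
        corosync_cfg_callbacks_t callbacks = { 0 };  /* no callbacks needed here */
        cs_error_t err;

        /* Connect to the cfg service of the local corosync instance. */
        err = corosync_cfg_initialize(&handle, &callbacks);
        if (err != CS_OK) {
            fprintf(stderr, "corosync_cfg_initialize failed: %d\n", (int)err);
            return 1;
        }

        /* Request an orderly shutdown. Per the discussion above, a shutdown
         * initiated through this API (whether by pacemakerd or by
         * corosync-cfgtool) is what corosync logs as "shut down by sysadmin". */
        err = corosync_cfg_try_shutdown(handle, COROSYNC_CFG_SHUTDOWN_FLAG_REQUEST);
        if (err != CS_OK) {
            fprintf(stderr, "corosync_cfg_try_shutdown failed: %d\n", (int)err);
        }

        corosync_cfg_finalize(handle);
        return (err == CS_OK) ? 0 : 1;
    }

Since the same request is made whether it comes from pacemakerd or from the 
corosync-cfgtool invocation in the systemd unit, the log message by itself 
cannot tell you which process asked for the shutdown.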


Currently, the function where this log message is printed has no information 
about which process initiated the shutdown; it only knows the nodeid.


It would be possible to log more information (probably including proc_name) 
in the cfg API function call, but that would probably be a better candidate 
for the DEBUG log level.


So do you think "shutdown by cfg request" would be less misleading?

Regards
  Honza

___
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users

ClusterLabs home: https://www.clusterlabs.org/


Re: [ClusterLabs] corosync service stopping

2024-04-26 Thread Alexander Eastwood via Users
Dear Reid,

Thanks for the reply. Yes, lots of pacemaker logs - I have included just over a minute of them below and 5m of them as an attached .log file. The same behaviour occurs for a period of roughly 6 minutes before the corosync shutdown happens and can be summarised like so:

- Both cluster nodes (testcluster-c1 and testcluster-c2) are online
- High CPU load is detected on the active cluster node (testcluster-c1)
- High CPU load (presumably) leads to monitor operations on cluster resources timing out
- Pacemaker attempts to recover/restart managed resources several times - unsuccessfully
- Cluster processes start shutting down
- Pacemaker is responsible for corosync shutting down ((pcmkd_shutdown_corosync)      info: Asking Corosync to shut down)

This leads to several more questions…

- Why does the corosync log say ’shutdown by sysadmin’ when the shutdown was triggered by pacemaker? Isn’t this misleading?
- Why was there no transition to the other node?
- What is the purpose of shutting down corosync? Isn’t this how the 2 cluster nodes communicate with each other? Since testcluster-c1 requests a shutdown - and this is acknowledged by testcluster-c2 - isn’t corosync then required for the transition to occur?

Any help is much appreciated!

Cheers

Alex

pacemaker.log.gz
Description: GNU Zip compressed data
Apr 23 11:04:59.462 testcluster-c1 pacemaker-execd     [1295872] (async_action_complete)        warning: virtual_ip_monitor_3[724966] timed out after 2ms
Apr 23 11:05:20.190 testcluster-c1 pacemaker-execd     [1295872] (async_action_complete)        warning: PingChecks_monitor_1[724970] timed out after 3ms
Apr 23 11:05:24.754 testcluster-c1 pacemaker-controld  [1295875] (throttle_check_thresholds)    notice: High CPU load detected: 108.660004
Apr 23 11:05:29.558 testcluster-c1 pacemaker-controld  [1295875] (qb_ipcs_us_withdraw)  info: withdrawing server sockets
Apr 23 11:05:30.270 testcluster-c1 pacemaker-controld  [1295875] (pe_ipc_destroy)       info: Connection to the scheduler released
Apr 23 11:05:34.982 testcluster-c1 pacemaker-controld  [1295875] (tengine_stonith_connection_destroy)   info: Fencing daemon disconnected
Apr 23 11:05:35.350 testcluster-c1 pacemaker-controld  [1295875] (crmd_exit)    notice: Forcing immediate exit with status 100 (Fatal error occurred, will not respawn)
Apr 23 11:05:35.538 testcluster-c1 pacemaker-controld  [1295875] (crm_xml_cleanup)      info: Cleaning up memory from libxml2
Apr 23 11:05:35.850 testcluster-c1 pacemaker-controld  [1295875] (crm_exit)     info: Exiting pacemaker-controld | with status 100
Apr 23 11:05:38.630 testcluster-c1 pacemaker-execd     [1295872] (cancel_recurring_action)      info: Cancelling ocf operation virtual_ip_monitor_3
Apr 23 11:05:38.630 testcluster-c1 pacemaker-execd     [1295872] (services_action_cancel)       info: Terminating in-flight op virtual_ip_monitor_3[724993] early because it was cancelled
Apr 23 11:05:38.610 testcluster-c1 pacemakerd          [1295869] (pcmk_child_exit)      warning: Shutting cluster down because pacemaker-controld[1295875] had fatal failure
Apr 23 11:05:38.630 testcluster-c1 pacemakerd          [1295869] (pcmk_shutdown_worker)         notice: Shutting down Pacemaker
Apr 23 11:05:38.630 testcluster-c1 pacemakerd          [1295869] (stop_child)   notice: Stopping pacemaker-schedulerd | sent signal 15 to process 1295874
Apr 23 11:05:38.634 testcluster-c1 pacemaker-execd     [1295872] (async_action_complete)        info: virtual_ip_monitor_3[724993] terminated with signal 9 (Killed)
Apr 23 11:05:38.634 testcluster-c1 pacemaker-execd     [1295872] (cancel_recurring_action)      info: Cancelling ocf operation virtual_ip_monitor_3
Apr 23 11:05:38.650 testcluster-c1 pacemaker-execd     [1295872] (send_client_notify)   warning: Could not notify client crmd: Bad file descriptor | rc=9
Apr 23 11:05:38.650 testcluster-c1 pacemaker-execd     [1295872] (cancel_recurring_action)      info: Cancelling systemd operation docker-services_status_6
Apr 23 11:05:38.650 testcluster-c1 pacemaker-execd     [1295872] (cancel_recurring_action)      info: Cancelling ocf operation PingChecks_monitor_1
Apr 23 11:05:38.650 testcluster-c1 pacemaker-execd     [1295872] (services_action_cancel)       info: Terminating in-flight op PingChecks_monitor_1[724994] early because it was cancelled
Apr 23 11:05:38.650 testcluster-c1 pacemaker-execd     [1295872] (async_action_complete)        info: PingChecks_monitor_1[724994] terminated with signal 9 (Killed)
Apr 23 11:05:38.650 testcluster-c1 pacemaker-execd     [1295872] (cancel_recurring_action)      info: Cancelling ocf operation PingChecks_monitor_1
Apr 23 11:05:38.654 testcluster-c1 pacemaker-execd     [1295872] (cancel_recurring_action)      info: Cancelling ocf operation DrbdFS_monitor_2
Apr 23 11:05:38.654 testcluster-c1 pacemaker-execd     [1295872] (services_action_cancel)       info: Terminating in-flight op DrbdFS_monitor_2[724990] early because it was cancelled
Apr 23 

Re: [ClusterLabs] corosync service stopping

2024-04-25 Thread Reid Wahl
Any logs from Pacemaker?

On Thu, Apr 25, 2024 at 3:46 AM Alexander Eastwood via Users
 wrote:
>
> Hi all,
>
> I’m trying to get a better understanding of why our cluster - or specifically 
> corosync.service - entered a failed state. Here are all of the relevant 
> corosync logs from this event, with the last line showing when I manually 
> started the service again:
>
> Apr 23 11:06:10 [1295854] testcluster-c1 corosync notice  [CFG   ] Node 1 was 
> shut down by sysadmin
> Apr 23 11:06:10 [1295854] testcluster-c1 corosync notice  [SERV  ] Unloading 
> all Corosync service engines.
> Apr 23 11:06:10 [1295854] testcluster-c1 corosync info    [QB    ] 
> withdrawing server sockets
> Apr 23 11:06:10 [1295854] testcluster-c1 corosync notice  [SERV  ] Service 
> engine unloaded: corosync vote quorum service v1.0
> Apr 23 11:06:10 [1295854] testcluster-c1 corosync info    [QB    ] 
> withdrawing server sockets
> Apr 23 11:06:10 [1295854] testcluster-c1 corosync notice  [SERV  ] Service 
> engine unloaded: corosync configuration map access
> Apr 23 11:06:10 [1295854] testcluster-c1 corosync info    [QB    ] 
> withdrawing server sockets
> Apr 23 11:06:10 [1295854] testcluster-c1 corosync notice  [SERV  ] Service 
> engine unloaded: corosync configuration service
> Apr 23 11:06:10 [1295854] testcluster-c1 corosync info    [QB    ] 
> withdrawing server sockets
> Apr 23 11:06:10 [1295854] testcluster-c1 corosync notice  [SERV  ] Service 
> engine unloaded: corosync cluster closed process group service v1.01
> Apr 23 11:06:10 [1295854] testcluster-c1 corosync info    [QB    ] 
> withdrawing server sockets
> Apr 23 11:06:10 [1295854] testcluster-c1 corosync notice  [SERV  ] Service 
> engine unloaded: corosync cluster quorum service v0.1
> Apr 23 11:06:10 [1295854] testcluster-c1 corosync notice  [SERV  ] Service 
> engine unloaded: corosync profile loading service
> Apr 23 11:06:10 [1295854] testcluster-c1 corosync notice  [SERV  ] Service 
> engine unloaded: corosync resource monitoring service
> Apr 23 11:06:10 [1295854] testcluster-c1 corosync notice  [SERV  ] Service 
> engine unloaded: corosync watchdog service
> Apr 23 11:06:11 [1295854] testcluster-c1 corosync info[KNET  ] host: 
> host: 1 (passive) best link: 0 (pri: 0)
> Apr 23 11:06:11 [1295854] testcluster-c1 corosync warning [KNET  ] host: 
> host: 1 has no active links
> Apr 23 11:06:11 [1295854] testcluster-c1 corosync notice  [MAIN  ] Corosync 
> Cluster Engine exiting normally
> Apr 23 13:18:36 [796246] testcluster-c1 corosync notice  [MAIN  ] Corosync 
> Cluster Engine 3.1.6 starting up
>
> The first line suggests a manual shutdown of one of the cluster nodes; 
> however, neither I nor any of my colleagues did this. The ‘sysadmin’ surely 
> must mean a person logging on to the server and running some command, as 
> opposed to a system process?
>
> Then, in the 3rd row from the bottom, there is the warning “host: host: 1 has 
> no active links”, which is followed by “Corosync Cluster Engine exiting 
> normally”. Does this mean that the reason for the Cluster Engine exiting is 
> the fact that there are no active links?
>
> Finally, I am considering adding a systemd override file for the corosync 
> service with the following content:
>
> [Service]
> Restart=on-failure
>
> Is there any reason not to do this? And, given that the process exited 
> normally, would I need to use Restart=always instead?
>
> Many thanks
>
> Alex
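
For reference, a drop-in along the lines described above could look like the 
sketch below; the path is what `systemctl edit corosync.service` would create, 
and the RestartSec value is only illustrative. Note that Restart=on-failure 
covers unclean exits (non-zero exit code, fatal signal, timeout), so a clean 
exit such as "Corosync Cluster Engine exiting normally" would not be retried, 
whereas Restart=always restarts the service regardless of how it exited.

    # /etc/systemd/system/corosync.service.d/override.conf
    # (created with: systemctl edit corosync.service)
    [Service]
    Restart=on-failure
    # Restart=always would also restart after a clean exit.
    # RestartSec is optional; systemd's default is 100ms.
    RestartSec=5s

After adding the drop-in, run systemctl daemon-reload so systemd picks it up.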



-- 
Regards,

Reid Wahl (He/Him)
Senior Software Engineer, Red Hat
RHEL High Availability - Pacemaker

___
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users

ClusterLabs home: https://www.clusterlabs.org/