>>> Ken Gaillot <[email protected]> wrote on 26.04.2022 at 21:24 in message
<[email protected]>:
> On Tue, 2022-04-26 at 15:20 -0300, Salatiel Filho wrote:
>> I have a question about OCF_TIMEOUT. Sometimes my cluster shows me
>> this on pcs status:
>> Failed Resource Actions:
>>   * fence-server02_monitor_60000 on server01 'OCF_TIMEOUT' (198):
>>     call=419, status='Timed Out', exitreason='',
>>     last-rc-change='2022-04-26 14:47:32 -03:00', queued=0ms, exec=20004ms
>>
>> I can see in the same pcs status output that the fence device is
>> started, so does that mean it failed some moment in the past and now
>> it is OK? Or do I have to do something to recover it?
>
> Correct, the status shows failures that have happened in the past. The

However the "past" was rather recently ;-)

> cluster tries to recover failed resources automatically according to
> whatever policy has been configured (the default being to stop and
> start the resource).

AFAIR the cluster stops monitoring after that, and you have to clean up
the error first. Am I wrong?

> Since the resource is shown as active, there's nothing you have to do.
> You can investigate the timeout (for example look at the system logs
> around that timestamp to see if anything else unusual was reported),
> and you can clear the failure from the status display with
> "crm_resource --cleanup" (or "pcs resource cleanup").
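To clear just that one failure rather than the failure history of every
resource, naming the fence device from the status output should work
(exact syntax may differ slightly between pcs versions):

# pcs resource cleanup fence-server02

or, with the lower-level tool:

# crm_resource --cleanup --resource fence-server02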
20 seconds can be rather short for some monitors on a busy system. Maybe
you suffer from "read stalls" (when a lot of dirty buffers are written)?
You could use the classic "sa/sar" tools to monitor your system, or if
you have some specific suspect you might use monit to check. For example
I'm monitoring the /var filesystem here in a VM:

# monit status fs_var
Monit 5.29.0 uptime: 6h 20m

Filesystem 'fs_var'
  status                       OK
  monitoring status            Monitored
  monitoring mode              active
  on reboot                    start
  filesystem type              ext3
  filesystem flags             rw,relatime,data=ordered
  permission                   755
  uid                          0
  gid                          0
  block size                   4 kB
  space total                  5.5 GB (of which 10.9% is reserved for root user)
  space free for non superuser 2.8 GB [51.4%]
  space free total             3.4 GB [62.3%]
  inodes total                 786432
  inodes free                  781794 [99.4%]
  read bytes                   34.1 B/s [113.3 MB total]
  disk read operations         0.0 reads/s [4269 reads total]
  write bytes                  4.2 kB/s [75.5 MB total]
  disk write operations        1.0 writes/s [15037 writes total]
  service time                 0.007ms/operation (of which read 0.000ms, write 0.007ms)
  data collected               Wed, 27 Apr 2022 08:46:17

(You can trigger alerts if any of those values exceeds some threshold.)
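In case it helps, such a check is configured roughly like this in the
monit configuration (the path and thresholds below are only examples,
and the exact test names can vary a little between monit versions):

  check filesystem fs_var with path /var
    if space usage > 90% then alert
    if inode usage > 90% then alert
    if service time > 300 milliseconds for 3 cycles then alert

monit then alerts as soon as a threshold is crossed, which makes it
easier to correlate a slow resource monitor with disk pressure at that
moment.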
Regards,
Ulrich

>>
>> # pcs status
>> Cluster name: cluster1
>> Cluster Summary:
>>   * Stack: corosync
>>   * Current DC: server02 (version 2.1.0-8.el8-7c3f660707) - partition
>>     with quorum
>>   * Last updated: Tue Apr 26 14:52:56 2022
>>   * Last change: Tue Apr 26 14:37:22 2022 by hacluster via crmd on
>>     server01
>>   * 2 nodes configured
>>   * 11 resource instances configured
>>
>> Node List:
>>   * Online: [ server01 server02 ]
>>
>> Full List of Resources:
>>   * fence-server01 (stonith:fence_vmware_rest): Started server02
>>   * fence-server02 (stonith:fence_vmware_rest): Started server01
>> ...
>>
>> Is "pcs resource cleanup" the right way to remove those messages?
>>
>> Atenciosamente/Kind regards,
>> Salatiel
> --
> Ken Gaillot <[email protected]>

_______________________________________________
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users

ClusterLabs home: https://www.clusterlabs.org/