(subject changed for existing thread)

Hi!
After I thought the problem with the sticky failed monitor was finally solved, I realized that after each cluster recheck interval I'm getting a message I don't really understand:

pengine[7280]: notice: Re-initiated expired calculated failure prm_nfs_server_monitor_60000 (rc=7, magic=0:7;4:6568:0:d941efc1-de73-4ee4-b593-f65be9e90726) on h11

The message repeats, absolutely identical each time. So what does it mean? The monitor did not fail between cluster rechecks, and crm_mon is not displaying any failed operations.

Regards,
Ulrich

>>> Ulrich Windl wrote on 13.08.2019 at 11:06 in message <5D527D91.124 : 161 : 60728>:
> Hi,
>
> an update:
> After setting a failure-timeout for the resource, that stale monitor failure
> was removed automatically at the next cluster recheck (it seems).
> Still I wonder why a resource cleanup didn't do that (bug?).
>
> Regards,
> Ulrich
>
> >>> "Ulrich Windl" <ulrich.wi...@rz.uni-regensburg.de> wrote on 13.08.2019 at 10:07 in message <5d526fb0020000a100032...@gwsmtp.uni-regensburg.de>:
> >>>> Ken Gaillot <kgail...@redhat.com> wrote on 13.08.2019 at 01:03 in message <ec355e98371d22d6c83e4927f65fdfa711266c21.ca...@redhat.com>:
> >> On Mon, 2019-08-12 at 17:46 +0200, Ulrich Windl wrote:
> >>> Hi!
> >>>
> >>> I just noticed that a "crm resource cleanup <rsc>" caused some
> >>> unexpected behavior and the syslog message:
> >>> crmd[7281]: warning: new_event_notification (7281-97955-15): Broken
> >>> pipe (32)
> >>>
> >>> It's SLES14 SP4, last updated Sept. 2018 (up since then,
> >>> pacemaker-1.1.19+20180928.0d2680780-1.8.x86_64).
> >>>
> >>> The cleanup was due to a failed monitor. As an unexpected consequence
> >>> of this cleanup, CRM seemed to restart the complete resource (and
> >>> dependencies), even though it was running.
> >>
> >> I assume the monitor failure was old, and recovery had already
> >> completed? If not, recovery might have been initiated before the
> >> clean-up was recorded.
> >>
> >>> I noticed that a manual "crm_resource -C -r <rsc> -N <node>" command
> >>> has the same effect (multiple resources are "Cleaned up", and resources
> >>> are restarted, seemingly before the "probe" is done).
> >>
> >> Can you verify whether the probes were done? The DC should log a
> >> message when each <rsc>_monitor_0 result comes in.
> >
> > So here's a rough sketch of events:
> > 17:10:23 crmd[7281]: notice: State transition S_IDLE -> S_POLICY_ENGINE
> > ...no probes yet...
> > 17:10:24 pengine[7280]: warning: Processing failed monitor of prm_nfs_server on rksaph11: not running
> > ...lots of starts/restarts...
> > 17:10:24 pengine[7280]: notice:  * Restart prm_nfs_server
> > ...
> > 17:10:24 crmd[7281]: notice: Processing graph 6628 (ref=pe_calc-dc-1565622624-7313) derived from /var/lib/pacemaker/pengine/pe-input-1810.bz2
> > ...monitors are being called...
> > 17:10:24 crmd[7281]: notice: Result of probe operation for prm_nfs_vg on h11: 0 (ok)
> > ...the above was the first probe result...
> > 17:10:24 crmd[7281]: warning: Action 33 (prm_nfs_vg_monitor_0) on h11 failed (target: 7 vs. rc: 0): Error
> > ...not surprising to me: the resource was running; I don't know why the
> > cluster wants to start it...
> > 17:10:24 crmd[7281]: notice: Transition 6629 (Complete=9, Pending=0, Fired=0, Skipped=0, Incomplete=0, Source=/var/lib/pacemaker/pengine/pe-input-1811.bz2): Complete
> > 17:10:24 crmd[7281]: notice: State transition S_TRANSITION_ENGINE -> S_IDLE
> >
> > The really bad thing after this is that the "cleaned up" resource still has a
> > failed status (dated in the past: last-rc-change='Mon Aug 12 04:52:23 2019'),
> > even though it is "running".
> >
> > I tend to believe that the cluster is in a bad state, or the software has a
> > problem cleaning the status of the monitor.
> >
> > The CIB status for the resource looks like this:
> > <lrm_resource id="prm_nfs_server" type="nfsserver" class="ocf" provider="heartbeat">
> >   <lrm_rsc_op id="prm_nfs_server_last_0" operation_key="prm_nfs_server_start_0"
> >     operation="start" crm-debug-origin="do_update_resource" crm_feature_set="3.0.14"
> >     transition-key="67:6583:0:d941efc1-de73-4ee4-b593-f65be9e90726"
> >     transition-magic="0:0;67:6583:0:d941efc1-de73-4ee4-b593-f65be9e90726"
> >     exit-reason="" on_node="h11" call-id="799" rc-code="0" op-status="0" interval="0"
> >     last-run="1565582351" last-rc-change="1565582351" exec-time="708" queue-time="0"
> >     op-digest="73311a0ef4ba8e9f1f97e05e989f6348"/>
> >   <lrm_rsc_op id="prm_nfs_server_monitor_60000" operation_key="prm_nfs_server_monitor_60000"
> >     operation="monitor" crm-debug-origin="do_update_resource" crm_feature_set="3.0.14"
> >     transition-key="68:6583:0:d941efc1-de73-4ee4-b593-f65be9e90726"
> >     transition-magic="0:0;68:6583:0:d941efc1-de73-4ee4-b593-f65be9e90726"
> >     exit-reason="" on_node="h11" call-id="800" rc-code="0" op-status="0" interval="60000"
> >     last-rc-change="1565582351" exec-time="499" queue-time="0"
> >     op-digest="9d8aa17b2a741c8328d7896459733e56"/>
> >   <lrm_rsc_op id="prm_nfs_server_last_failure_0" operation_key="prm_nfs_server_monitor_60000"
> >     operation="monitor" crm-debug-origin="do_update_resource" crm_feature_set="3.0.14"
> >     transition-key="4:6568:0:d941efc1-de73-4ee4-b593-f65be9e90726"
> >     transition-magic="0:7;4:6568:0:d941efc1-de73-4ee4-b593-f65be9e90726"
> >     exit-reason="" on_node="h11" call-id="738" rc-code="7" op-status="0" interval="60000"
> >     last-rc-change="1565578343" exec-time="0" queue-time="0"
> >     op-digest="9d8aa17b2a741c8328d7896459733e56"/>
> > </lrm_resource>
> >
> > Regards,
> > Ulrich
> >
> >>
> >>> Actually the manual says when cleaning up a single primitive, the
> >>> whole group is cleaned up, unless using --force. Well, I don't like
> >>> this default, as I expect any status change from a probe would
> >>> propagate to the group anyway...
> >>
> >> In 1.1, clean-up always wipes the history of the affected resources,
> >> regardless of whether the history is for success or failure. That means
> >> all the cleaned resources will be reprobed. In 2.0, clean-up by default
> >> wipes the history only if there's a failed action (--refresh/-R is
> >> required to get the 1.1 behavior). That lessens the impact of the
> >> "default to whole group" behavior.
> >>
> >> I think the original idea was that a group indicates that the resources
> >> are closely related, so changing the status of one member might affect
> >> what status the others report.
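If I understand that correctly, on a 2.0 cluster the two variants would look roughly like this (an untested sketch using the resource and node names from this thread; exact option spellings may differ between versions):

  # 2.0 default: wipe the resource's history only where a failed action is recorded
  crm_resource --cleanup --resource prm_nfs_server --node h11

  # 2.0 with --refresh: wipe all history and reprobe (the old 1.1 "cleanup" behavior)
  crm_resource --refresh --resource prm_nfs_server --node h11

So with --refresh every member of the affected group would be reprobed, while a plain --cleanup should only touch resources that actually have a failure recorded.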
> >>
> >>> Regards,
> >>> Ulrich
> >> --
> >> Ken Gaillot <kgail...@redhat.com>

_______________________________________________
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users

ClusterLabs home: https://www.clusterlabs.org/