On Tue, Mar 8, 2016 at 7:03 AM, Ferenc Wágner <wf...@niif.hu> wrote:
> Ken Gaillot <kgail...@redhat.com> writes: > > > On 03/07/2016 07:31 AM, Ferenc Wágner wrote: > > > >> 12:55:13 vhbl07 crmd[8484]: notice: Transition aborted by > vm-eiffel_monitor_60000 'create' on vhbl05: Foreign event > (magic=0:0;521:0:0:634eef05-39c1-4093-94d4-8d624b423bb7, cib=0.613.98, > source=process_graph_event:600, 0) > > > > That means the action was initiated by a different node (the previous DC > > presumably), I suspect s/previous/other/ With a stuck machine its entirely possible that the other nodes elected a new leader. Would I be right in guessing that fencing is disabled? > so the new DC wants to recalculate everything. > > Time travel was sort of possible in that situation, and recurring > monitor operations are not logged, so this is indeed possible. The main > thing is that it wasn't mishandled. > > >> recovery actions turned into start actions for the resources stopped > >> during the previous transition. However, almost all other recovery > >> actions just disappeared without any comment. This was actually > >> correct, but I really wonder why the cluster decided to paper over > >> the previous monitor operation timeouts. Maybe the operations > >> finished meanwhile and got accounted somehow, just not logged? > > > > I'm not sure why the PE decided recovery was not necessary. Operation > > results wouldn't be accepted without being logged. > > At which logging level? I can't see recurring monitor operation logs in > syslog (at default logging level: notice) nor in /var/log/pacemaker.log > (which contains info level messages as well). > The DC will log that the recurring monitor was successfully started, but due to noise it doesn't log subsequent successes. > > However, the info level logs contain more "Transition aborted" lines, as > if only the first of them got logged with notice level. This would make > sense, since the later ones don't make any difference on an already > aborted transition, so they aren't that important. And in fact such > lines were suppressed from the syslog I checked first, for example: > > 12:55:39 [8479] vhbl07 cib: info: cib_perform_op: Diff: --- > 0.613.120 2 > 12:55:39 [8479] vhbl07 cib: info: cib_perform_op: Diff: +++ > 0.613.121 (null) > 12:55:39 [8479] vhbl07 cib: info: cib_perform_op: + /cib: > @num_updates=121 > 12:55:39 [8479] vhbl07 cib: info: cib_perform_op: ++ > /cib/status/node_state[@id='167773707']/lrm[@id='167773707']/lrm_resources/lrm_resource[@id='vm-elm']: > <lrm_rsc_op id="vm-elm_monitor_60000" operation_key="vm-elm_monitor_60000" > operation="monitor" crm-debug-origin="do_update_resource" > crm_feature_set="3.0.10" > transition-key="473:0:0:634eef05-39c1-4093-94d4-8d624b423bb7" > transition-magic="0:0;473:0:0:634eef05-39c1-4093-94d4-8d624b423bb7" > on_node="vhbl05" call-id="645" rc-code="0" op-st > 12:55:39 [8479] vhbl07 cib: info: cib_process_request: > Completed cib_modify operation for section status: OK (rc=0, > origin=vhbl05/crmd/362, version=0.613.121) > 12:55:39 [8484] vhbl07 crmd: info: abort_transition_graph: > Transition aborted by vm-elm_monitor_60000 'create' on vhbl05: Foreign > event (magic=0:0;473:0:0:634eef05-39c1-4093-94d4-8d624b423bb7, > cib=0.613.121, source=process_graph_event:600, 0) > 12:55:39 [8484] vhbl07 crmd: info: process_graph_event: > Detected action (0.473) vm-elm_monitor_60000.645=ok: initiated by a > different node > > I can very much imagine this cancelling the FAILED state induced by a > monitor timeout like: > > 12:54:52 [8479] vhbl07 cib: info: cib_perform_op: ++ > <lrm_resource id="vm-elm" > type="TransientDomain" class="ocf" provider="niif"> > 12:54:52 [8479] vhbl07 cib: info: cib_perform_op: ++ > <lrm_rsc_op > id="vm-elm_last_failure_0" operation_key="vm-elm_monitor_60000" > operation="monitor" crm-debug-origin="build_active_RAs" > crm_feature_set="3.0.10" > transition-key="473:0:0:634eef05-39c1-4093-94d4-8d624b423bb7" > transition-magic="2:1;473:0:0:634eef05-39c1-4093-94d4-8d624b423bb7" > on_node="vhbl05" call-id="645" rc-code="1" op-status="2" interval="60000" > last-rc-change="1456833279" exe > 12:54:52 [8479] vhbl07 cib: info: cib_perform_op: ++ > <lrm_rsc_op id="vm-elm_last_0" > operation_key="vm-elm_start_0" operation="start" > crm-debug-origin="build_active_RAs" crm_feature_set="3.0.10" > transition-key="472:0:0:634eef05-39c1-4093-94d4-8d624b423bb7" > transition-magic="0:0;472:0:0:634eef05-39c1-4093-94d4-8d624b423bb7" > on_node="vhbl05" call-id="602" rc-code="0" op-status="0" interval="0" > last-run="1456091121" last-rc-change="1456091121" e > 12:54:52 [8479] vhbl07 cib: info: cib_perform_op: ++ > </lrm_resource> > > The transition-keys match, does this mean that the above is a late > result from the monitor operation which was considered timed-out > previously? How did it reach vhbl07, if the DC at that time was vhbl03? > Everything goes into the cib (replicated datastore) and the DC(s) get notified. > > > The pe-input files from the transitions around here should help. > > They are available. What shall I look for? > > >> Basically, the cluster responded beyond my expectations, sparing lots of > >> unnecessary recoveries or fencing. I'm happy, thanks for this wonderful > >> software! But I'm left with these "Processing failed op monitor" > >> warnings emitted every 15 minutes (timer pops). Is it safe and clever > >> to cleanup the affected resources? Would that get rid of them without > >> invoking other effects, like recoveries for example? > > > > That's normal; it's how the cluster maintains the effect of a failure > > that has not been cleared. The logs can be confusing, because it's not > > apparent from that message alone whether the failure is new or old. > > Ah, do you mean that these are the same thing that appears after "Failed > Actions:" at the end of the crm_mon output? They certainly match, and > the logs are confusing indeed. > > > Cleaning up the resource will end the failure condition, so what happens > > next depends on the configuration and state of the cluster. If the > > failure was preventing a preferred node from running the resource, the > > resource could move, depending on other factors such as stickiness. > > These resources are (still) running fine, suffered only monitor failures > and are node-neutral, so it should be safe to cleanup them, I suppose. > -- > Thanks for your quick and enlightening answer! I feared the mere length > of my message would scare everybody away... > Regards, > Feri > > _______________________________________________ > Users mailing list: Users@clusterlabs.org > http://clusterlabs.org/mailman/listinfo/users > > Project Home: http://www.clusterlabs.org > Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf > Bugs: http://bugs.clusterlabs.org >
_______________________________________________ Users mailing list: Users@clusterlabs.org http://clusterlabs.org/mailman/listinfo/users Project Home: http://www.clusterlabs.org Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf Bugs: http://bugs.clusterlabs.org