On 01/05/2013, at 2:51 AM, Brian J. Murrell <br...@interlinx.bc.ca> wrote:
> Using 1.1.8 on EL6.4, I am seeing this sort of thing: > > pengine[1590]: warning: unpack_rsc_op: Processing failed op monitor for > my_resource on node1: unknown error (1) > > The full log from the point of adding the resource until the errors: > > Apr 30 11:46:30 node1 cibadmin[3380]: notice: crm_log_args: Invoked: > cibadmin -o resources -C -x /tmp/tmpHrgNZv > Apr 30 11:46:30 node1 crmd[1591]: notice: do_state_transition: State > transition S_IDLE -> S_POLICY_ENGINE [ input=I_PE_CALC cause=C_FSA_INTERNAL > origin=abort_transition_graph ] > Apr 30 11:46:30 node1 cib[1586]: notice: cib:diff: Diff: --- 0.24.5 > Apr 30 11:46:30 node1 cib[1586]: notice: cib:diff: Diff: +++ 0.25.1 > 8a4aac3dcddc2689e4b336e1bf2078ff > Apr 30 11:46:30 node1 cib[1586]: notice: cib:diff: -- <cib admin_epoch="0" > epoch="24" num_updates="5" /> > Apr 30 11:46:30 node1 cib[1586]: notice: cib:diff: ++ <primitive > class="ocf" provider="my_provider" type="my_RA" id="my_resource" > > Apr 30 11:46:30 node1 cib[1586]: notice: cib:diff: ++ > <meta_attributes id="my_resource-meta_attributes" > > Apr 30 11:46:30 node1 cib[1586]: notice: cib:diff: ++ <nvpair > name="my_RA-role" id="my_resource-meta_attributes-my_RA-role" value="Stopped" > /> > Apr 30 11:46:30 node1 cib[1586]: notice: cib:diff: ++ > </meta_attributes> > Apr 30 11:46:30 node1 cib[1586]: notice: cib:diff: ++ <operations > id="my_resource-operations" > > Apr 30 11:46:30 node1 cib[1586]: notice: cib:diff: ++ <op > id="my_resource-monitor-5" interval="5" name="monitor" timeout="60" /> > Apr 30 11:46:30 node1 cib[1586]: notice: cib:diff: ++ <op > id="my_resource-start-0" interval="0" name="start" timeout="300" /> > Apr 30 11:46:30 node1 cib[1586]: notice: cib:diff: ++ <op > id="my_resource-stop-0" interval="0" name="stop" timeout="300" /> > Apr 30 11:46:30 node1 cib[1586]: notice: cib:diff: ++ </operations> > Apr 30 11:46:30 node1 cib[1586]: notice: cib:diff: ++ > <instance_attributes id="my_resource-instance_attributes" > > Apr 30 11:46:30 node1 cib[1586]: notice: cib:diff: ++ <nvpair > id="my_resource-instance_attributes-my_RA" name="my_RA" > value="33bb17d2-350b-495f-bd8d-8427baabeed9" /> > Apr 30 11:46:30 node1 cib[1586]: notice: cib:diff: ++ > </instance_attributes> > Apr 30 11:46:30 node1 cib[1586]: notice: cib:diff: ++ </primitive> > Apr 30 11:46:30 node1 pengine[1590]: notice: unpack_config: On loss of CCM > Quorum: Ignore > Apr 30 11:46:30 node1 pengine[1590]: crit: get_timet_now: Defaulting to > 'now' > Apr 30 11:46:30 node1 pengine[1590]: crit: get_timet_now: Defaulting to > 'now' > Apr 30 11:46:30 node1 pengine[1590]: crit: get_timet_now: Defaulting to > 'now' > Apr 30 11:46:30 node1 pengine[1590]: crit: get_timet_now: Defaulting to > 'now' > Apr 30 11:46:30 node1 pengine[1590]: crit: get_timet_now: Defaulting to > 'now' > Apr 30 11:46:30 node1 pengine[1590]: notice: process_pe_message: Calculated > Transition 5: /var/lib/pacemaker/pengine/pe-input-10.bz2 > Apr 30 11:46:30 node1 cibadmin[3386]: notice: crm_log_args: Invoked: > cibadmin -o constraints -C -X <rsc_location id="my_resource-primary" > node="node1" rsc="my_resource" score="20"/> > Apr 30 11:46:30 node1 cib[1586]: notice: log_cib_diff: cib:diff: Local-only > Change: 0.26.1 > Apr 30 11:46:30 node1 cib[1586]: notice: cib:diff: -- <cib admin_epoch="0" > epoch="25" num_updates="1" /> > Apr 30 11:46:30 node1 cib[1586]: notice: cib:diff: ++ <rsc_location > id="my_resource-primary" node="node1" rsc="my_resource" score="20" /> > Apr 30 11:46:33 node1 cib[1586]: notice: cib:diff: Diff: --- 0.26.3 > Apr 30 11:46:33 node1 cib[1586]: notice: cib:diff: Diff: +++ 0.27.1 > 8305c8fe19d06a6204bd04f437eb923a > Apr 30 11:46:33 node1 cib[1586]: notice: cib:diff: -- <nvpair > value="1367322378" id="cib-bootstrap-options-last-lrm-refresh" /> > Apr 30 11:46:33 node1 cib[1586]: notice: cib:diff: ++ <nvpair > id="cib-bootstrap-options-last-lrm-refresh" name="last-lrm-refresh" > value="1367322393" /> > Apr 30 11:46:33 node1 cib[1586]: notice: cib:diff: Diff: --- 0.27.2 > Apr 30 11:46:33 node1 cib[1586]: notice: cib:diff: Diff: +++ 0.28.1 > 0dbddb3084f7cd76bffe21916538be94 > Apr 30 11:46:33 node1 cib[1586]: notice: cib:diff: -- <nvpair > value="Stopped" id="my_resource-meta_attributes-my_RA-role" /> > Apr 30 11:46:33 node1 cib[1586]: notice: cib:diff: ++ <nvpair > name="my_RA-role" id="my_resource-meta_attributes-my_RA-role" value="Started" > /> > Apr 30 11:46:33 node1 crmd[1591]: warning: do_update_resource: Resource > my_resource no longer exists in the lrmd > Apr 30 11:46:33 node1 crmd[1591]: notice: process_lrm_event: LRM operation > my_resource_monitor_0 (call=31, rc=7, cib-update=0, confirmed=true) not > running > Apr 30 11:46:33 node1 crmd[1591]: warning: decode_transition_key: Bad UUID > (crm_resource.c) in sscanf result (4) for 3397:0:0:crm_resource.c > Apr 30 11:46:33 node1 crmd[1591]: error: send_msg_via_ipc: Unknown > Sub-system (3397_crm_resource)... discarding message. > Apr 30 11:47:50 node1 crmd[1591]: warning: action_timer_callback: Timer > popped (timeout=20000, abort_level=1000000, complete=false) > Apr 30 11:47:50 node1 crmd[1591]: error: print_synapse: [Action 5]: > In-flight rsc op my_resource_monitor_0 on node1 (priority: 0, waiting: none) > Apr 30 11:47:50 node1 crmd[1591]: warning: cib_action_update: rsc_op 5: > my_resource_monitor_0 on node1 timed out > Apr 30 11:47:50 node1 crmd[1591]: notice: run_graph: Transition 5 > (Complete=4, Pending=0, Fired=0, Skipped=1, Incomplete=0, > Source=/var/lib/pacemaker/pengine/pe-input-10.bz2): Stopped > Apr 30 11:47:50 node1 pengine[1590]: notice: unpack_config: On loss of CCM > Quorum: Ignore > Apr 30 11:47:50 node1 pengine[1590]: crit: get_timet_now: Defaulting to > 'now' > Apr 30 11:47:50 node1 pengine[1590]: crit: get_timet_now: Defaulting to > 'now' > Apr 30 11:47:50 node1 pengine[1590]: crit: get_timet_now: Defaulting to > 'now' > Apr 30 11:47:50 node1 pengine[1590]: warning: unpack_rsc_op: Processing > failed op monitor for my_resource on node1: unknown error (1) > Apr 30 11:47:50 node1 pengine[1590]: crit: get_timet_now: Defaulting to > 'now' > Apr 30 11:47:50 node1 pengine[1590]: crit: get_timet_now: Defaulting to > 'now' > Apr 30 11:47:50 node1 pengine[1590]: notice: LogActions: Recover > my_resource#011(Started node1) > Apr 30 11:47:50 node1 pengine[1590]: notice: process_pe_message: Calculated > Transition 6: /var/lib/pacemaker/pengine/pe-input-11.bz2 > Apr 30 11:47:52 node1 cib[1586]: notice: cib:diff: Diff: --- 0.28.5 > Apr 30 11:47:52 node1 cib[1586]: notice: cib:diff: Diff: +++ 0.29.1 > c79afeec7d8f0d54fb425821e4e910f9 > Apr 30 11:47:52 node1 cib[1586]: notice: cib:diff: -- <nvpair > value="Started" id="my_resource-meta_attributes-my_RA-role" /> > Apr 30 11:47:52 node1 cib[1586]: notice: cib:diff: ++ <nvpair > name="my_RA-role" id="my_resource-meta_attributes-my_RA-role" value="Stopped" > /> > Apr 30 11:47:53 node1 crmd[1591]: notice: run_graph: Transition 6 > (Complete=2, Pending=0, Fired=0, Skipped=5, Incomplete=0, > Source=/var/lib/pacemaker/pengine/pe-input-11.bz2): Stopped > Apr 30 11:47:53 node1 pengine[1590]: notice: unpack_config: On loss of CCM > Quorum: Ignore > Apr 30 11:47:53 node1 pengine[1590]: crit: get_timet_now: Defaulting to > 'now' > Apr 30 11:47:53 node1 pengine[1590]: crit: get_timet_now: Defaulting to > 'now' > Apr 30 11:47:53 node1 pengine[1590]: crit: get_timet_now: Defaulting to > 'now' > Apr 30 11:47:53 node1 pengine[1590]: warning: unpack_rsc_op: Processing > failed op monitor for my_resource on node1: unknown error (1) > Apr 30 11:47:53 node1 pengine[1590]: crit: get_timet_now: Defaulting to > 'now' > Apr 30 11:47:53 node1 pengine[1590]: crit: get_timet_now: Defaulting to > 'now' > Apr 30 11:47:53 node1 pengine[1590]: crit: get_timet_now: Defaulting to > 'now' > Apr 30 11:47:53 node1 pengine[1590]: notice: LogActions: Stop > my_resource#011(node1) > Apr 30 11:47:53 node1 pengine[1590]: notice: process_pe_message: Calculated > Transition 7: /var/lib/pacemaker/pengine/pe-input-12.bz2 > Apr 30 11:47:57 node1 crmd[1591]: notice: process_lrm_event: LRM operation > my_resource_stop_0 (call=38, rc=0, cib-update=89, confirmed=true) ok > Apr 30 11:47:57 node1 crmd[1591]: notice: run_graph: Transition 7 > (Complete=2, Pending=0, Fired=0, Skipped=0, Incomplete=0, > Source=/var/lib/pacemaker/pengine/pe-input-12.bz2): Complete > Apr 30 11:47:57 node1 crmd[1591]: notice: do_state_transition: State > transition S_TRANSITION_ENGINE -> S_IDLE [ input=I_TE_SUCCESS > cause=C_FSA_INTERNAL origin=notify_crmd ] > Apr 30 11:50:10 node1 crmd[1591]: notice: do_state_transition: State > transition S_IDLE -> S_POLICY_ENGINE [ input=I_PE_CALC cause=C_FSA_INTERNAL > origin=abort_transition_graph ] > Apr 30 11:50:10 node1 cib[1586]: notice: cib:diff: Diff: --- 0.29.3 > Apr 30 11:50:10 node1 cib[1586]: notice: cib:diff: Diff: +++ 0.30.1 > 628db64f8576694a86a58110e174731d > Apr 30 11:50:10 node1 cib[1586]: notice: cib:diff: -- <nvpair > value="Started" id="MGS_18e873-meta_attributes-my_RA-role" /> > Apr 30 11:50:10 node1 cib[1586]: notice: cib:diff: ++ <nvpair > name="my_RA-role" id="MGS_18e873-meta_attributes-my_RA-role" value="Stopped" > /> > Apr 30 11:50:10 node1 pengine[1590]: notice: unpack_config: On loss of CCM > Quorum: Ignore > Apr 30 11:50:10 node1 pengine[1590]: crit: get_timet_now: Defaulting to > 'now' > Apr 30 11:50:10 node1 pengine[1590]: crit: get_timet_now: Defaulting to > 'now' > Apr 30 11:50:10 node1 pengine[1590]: crit: get_timet_now: Defaulting to > 'now' > Apr 30 11:50:10 node1 pengine[1590]: warning: unpack_rsc_op: Processing > failed op monitor for my_resource on node1: unknown error (1) > Apr 30 11:50:10 node1 pengine[1590]: crit: get_timet_now: Defaulting to > 'now' > Apr 30 11:50:10 node1 pengine[1590]: crit: get_timet_now: Defaulting to > 'now' > > I'm wondering what: > > warning: unpack_rsc_op: Processing failed op monitor for my_resource on > node1: unknown error (1) > > really means. Is that that the RA ran (i.e. did it's job) but didn't > return something pacemaker was expecting Normally it means the RA returned 1 (aka. OCF_ERR_GENERIC): http://clusterlabs.org/doc/en-US/Pacemaker/1.1-crmsh/html/Pacemaker_Explained/s-ocf-return-codes.html > or was there a problem even > trying to run the RA. It might be possible, but I'd expect some more errors in that case. > In the case of this RA the monitor operation only takes a few seconds at > most and as you can see, I give it 60 seconds so it's difficult to > understand how this: > > Apr 30 11:47:50 node1 crmd[1591]: warning: cib_action_update: rsc_op 5: > my_resource_monitor_0 on node1 timed out > > can happen. Could it be blocked on I/O or similar? I've seen even IPaddr take ridiculously long on a sufficiently overloaded system. > > Additionally, I have my RA rigged to log Can we see it? > and provide debugging for > every invocation and I see no evidence of the RA being run at that > time. > > Any thoughts? > > b. > > _______________________________________________ > Pacemaker mailing list: Pacemaker@oss.clusterlabs.org > http://oss.clusterlabs.org/mailman/listinfo/pacemaker > > Project Home: http://www.clusterlabs.org > Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf > Bugs: http://bugs.clusterlabs.org _______________________________________________ Pacemaker mailing list: Pacemaker@oss.clusterlabs.org http://oss.clusterlabs.org/mailman/listinfo/pacemaker Project Home: http://www.clusterlabs.org Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf Bugs: http://bugs.clusterlabs.org