Hi all, thanks for the reply.
Recently, I ran the following command:

(clustera) # crm_simulate --xml-file pe-warn.last

It returned the following results:

error: crm_abort: xpath_search: Triggered assert at xpath.c:153 : xml_top != NULL
error: crm_element_value: Couldn't find validate-with in NULL
error: crm_abort: crm_element_value: Triggered assert at xml.c:5135 : data != NULL
Configuration validation is currently disabled. It is highly encouraged and prevents many common cluster issues.
error: crm_element_value: Couldn't find validate-with in NULL
error: crm_abort: crm_element_value: Triggered assert at xml.c:5135 : data != NULL
error: crm_element_value: Couldn't find ignore-dtd in NULL
error: crm_abort: crm_element_value: Triggered assert at xml.c:5135 : data != NULL
error: crm_abort: validate_with: Triggered assert at schemas.c:522 : xml != NULL
[the schemas.c:522 assert is repeated many more times]
error: crm_abort: crm_xml_add: Triggered assert at xml.c:2494 : node != NULL
error: write_xml_stream: Cannot write NULL to /var/lib/pacemaker/cib/shadow.20008
Could not create '/var/lib/pacemaker/cib/shadow.20008': Success

Could anyone help me read those messages and understand what's going on with my server? Thanks a lot.

On Fri, Jun 8, 2018 at 4:49 AM, Ken Gaillot <kgail...@redhat.com> wrote:
> On Thu, 2018-06-07 at 08:37 +0800, Albert Weng wrote:
> > Hi Andrei,
> >
> > Thanks for your quick reply. I still need help as below:
> >
> > On Wed, Jun 6, 2018 at 11:58 AM, Andrei Borzenkov <arvidj...@gmail.com> wrote:
> > > 06.06.2018 04:27, Albert Weng wrote:
> > > > Hi All,
> > > >
> > > > I have created an active/passive pacemaker cluster on RHEL 7.
> > > >
> > > > Here is my environment:
> > > > clustera : 192.168.11.1 (passive)
> > > > clusterb : 192.168.11.2 (master)
> > > > clustera-ilo4 : 192.168.11.10
> > > > clusterb-ilo4 : 192.168.11.11
> > > >
> > > > Cluster resource status:
> > > > cluster_fs started on clusterb
> > > > cluster_vip started on clusterb
> > > > cluster_sid started on clusterb
> > > > cluster_listnr started on clusterb
> > > >
> > > > Both cluster nodes are online.
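[Editorial aside on the crm_simulate output above: those asserts generally mean crm_simulate could not parse any XML from the given file at all (hence xml_top != NULL), most often because the path is wrong relative to the current directory or the file is empty. A hedged sketch of how one might verify that, assuming the default pengine directory on RHEL 7:]

```shell
# Check that the input actually exists and is non-empty
# (PE files normally live in the pengine directory, not the cwd):
ls -lL /var/lib/pacemaker/pengine/pe-warn.last

# Verify it really contains XML; PE inputs are bzip2-compressed:
bzcat /var/lib/pacemaker/pengine/pe-warn.last | head -n 3

# Re-run the simulation against the full path:
crm_simulate --simulate --xml-file /var/lib/pacemaker/pengine/pe-warn.last
```

If `ls` fails or `bzcat` prints nothing, the asserts are just the downstream symptom of the missing input.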
> > > > I found my corosync.log contains many records like the ones below:
> > > >
> > > > clustera pengine: info: determine_online_status_fencing: Node clusterb is active
> > > > clustera pengine: info: determine_online_status: Node clusterb is online
> > > > clustera pengine: info: determine_online_status_fencing: Node clustera is active
> > > > clustera pengine: info: determine_online_status: Node clustera is online
> > > >
> > > > *clustera pengine: warning: unpack_rsc_op_failure: Processing failed op start for cluster_sid on clustera: unknown error (1)*
> > > > *=> Question: Why does pengine always try to start cluster_sid on the passive node? How do I fix it?*
> > >
> > > Pacemaker does not have a concept of a "passive" or "master" node - it is up to you to decide placement when you configure resources. By default pacemaker will attempt to spread resources across all eligible nodes. You can influence node selection by using constraints. See
> > > https://clusterlabs.org/pacemaker/doc/en-US/Pacemaker/1.1/html/Pacemaker_Explained/_deciding_which_nodes_a_resource_can_run_on.html
> > > for details.
> > >
> > > But in any case - all your resources MUST be capable of running on both nodes, otherwise the cluster makes no sense. If one resource A depends on something that another resource B provides and can be started only together with resource B (and after it is ready) - you must tell pacemaker by using resource colocations and ordering. See the same document for details.
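[As a concrete sketch of the constraints Andrei describes, assuming pcs and the resource/group names from this thread; the scores are arbitrary examples:]

```shell
# Prefer clusterb for the resource group, while still allowing clustera
# (a finite score, unlike INFINITY, does not forbid the other node):
pcs constraint location cluster prefers clusterb=100

# If cluster_sid genuinely requires the filesystem, say so explicitly
# with colocation plus ordering rather than relying on placement luck:
pcs constraint colocation add cluster_sid with cluster_fs INFINITY
pcs constraint order cluster_fs then cluster_sid
```

Note that since these resources are already in the group "cluster", the group itself implies colocation and ordering among its members; the explicit constraints above are only illustrative.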
> > > > clustera pengine: info: native_print: ipmi-fence-clustera (stonith:fence_ipmilan): Started clustera
> > > > clustera pengine: info: native_print: ipmi-fence-clusterb (stonith:fence_ipmilan): Started clustera
> > > > clustera pengine: info: group_print: Resource Group: cluster
> > > > clustera pengine: info: native_print: cluster_fs (ocf::heartbeat:Filesystem): Started clusterb
> > > > clustera pengine: info: native_print: cluster_vip (ocf::heartbeat:IPaddr2): Started clusterb
> > > > clustera pengine: info: native_print: cluster_sid (ocf::heartbeat:oracle): Started clusterb
> > > > clustera pengine: info: native_print: cluster_listnr (ocf::heartbeat:oralsnr): Started clusterb
> > > > clustera pengine: info: get_failcount_full: cluster_sid has failed INFINITY times on clustera
> > > >
> > > > *clustera pengine: warning: common_apply_stickiness: Forcing cluster_sid away from clustera after 1000000 failures (max=1000000)*
> > > > *=> Question: Did too many failed attempts result in the resource being forbidden to start on clustera?*
> > >
> > > Yes.
> >
> > How do I find out the root cause of the 1000000 failures? Which log will contain the error message?
>
> As an aside, 1,000,000 is "infinity" to pacemaker. It could mean 1,000,000 actual failures, or a "fatal" failure that causes pacemaker to set the fail count to infinity.
>
> The most recent failure of each resource will be shown in the status display (crm_mon, pcs status, etc.). They will have a basic exit code (which you can use to distinguish a timeout from an error received from the agent), and, if the agent provided one, an "exit-reason". That's the first place to look.
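[In practice, the check Ken describes might look like this - a sketch using the resource and node names from this thread:]

```shell
# One-shot status display, including per-resource fail counts
# and the most recent failed actions with their exit codes:
crm_mon -1 -f

# Or, with pcs, show the fail count for the suspect resource:
pcs resource failcount show cluster_sid
```

A fail count of INFINITY on clustera here would match the "Forcing cluster_sid away from clustera" warning in the logs above.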
> Failures will remain in the status display, and affect the placement of resources, until one of two things happens: you manually clean up the failure (crm_resource --cleanup, pcs resource cleanup, etc.), or, if you configured a failure-timeout for the resource, that much time has passed with no more failures.
>
> For deeper investigation, check the system log (wherever it's kept on your distro). You can use the timestamp from the failure in the status to know where to look.
>
> For even more detail, you can look at pacemaker's detail log (the one you posted excerpts from). This will have additional messages beyond the system log, but they are harder to follow and more intended for developers and advanced troubleshooting.
>
> > > > A couple of days ago, clusterb was fenced (stonith) for an unknown reason, but only "cluster_fs" and "cluster_vip" moved to clustera successfully; "cluster_sid" and "cluster_listnr" went to "STOP" status. Per the messages below - is it related to "op start for cluster_sid on clustera..."?
> > >
> > > Yes. Node clustera is now marked as being incapable of running the resource, so if node clusterb fails, the resource cannot be started anywhere.
> >
> > How could I fix it? I need some hints for troubleshooting.
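[Acting on Ken's cleanup advice above might look like the following sketch; the failure-timeout value is an arbitrary example, and cleanup only helps after the underlying oracle start failure on clustera has actually been fixed:]

```shell
# Clear the INFINITY fail count so clustera becomes eligible again:
pcs resource cleanup cluster_sid
# or, equivalently, with the low-level tool:
crm_resource --cleanup --resource cluster_sid --node clustera

# Optionally, let failures expire on their own after 10 minutes
# instead of requiring manual cleanup:
pcs resource update cluster_sid meta failure-timeout=10m
```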
> > > > clustera pengine: warning: unpack_rsc_op_failure: Processing failed op start for cluster_sid on clustera: unknown error (1)
> > > > clustera pengine: info: native_print: ipmi-fence-clustera (stonith:fence_ipmilan): Started clustera
> > > > clustera pengine: info: native_print: ipmi-fence-clusterb (stonith:fence_ipmilan): Started clustera
> > > > clustera pengine: info: group_print: Resource Group: cluster
> > > > clustera pengine: info: native_print: cluster_fs (ocf::heartbeat:Filesystem): Started clusterb (UNCLEAN)
> > > > clustera pengine: info: native_print: cluster_vip (ocf::heartbeat:IPaddr2): Started clusterb (UNCLEAN)
> > > > clustera pengine: info: native_print: cluster_sid (ocf::heartbeat:oracle): Started clusterb (UNCLEAN)
> > > > clustera pengine: info: native_print: cluster_listnr (ocf::heartbeat:oralsnr): Started clusterb (UNCLEAN)
> > > > clustera pengine: info: get_failcount_full: cluster_sid has failed INFINITY times on clustera
> > > > clustera pengine: warning: common_apply_stickiness: Forcing cluster_sid away from clustera after 1000000 failures (max=1000000)
> > > > clustera pengine: info: rsc_merge_weights: cluster_fs: Rolling back scores from cluster_sid
> > > > clustera pengine: info: rsc_merge_weights: cluster_vip: Rolling back scores from cluster_sid
> > > > clustera pengine: info: rsc_merge_weights: cluster_sid: Rolling back scores from cluster_listnr
> > > > clustera pengine: info: native_color: Resource cluster_sid cannot run anywhere
> > > > clustera pengine: info: native_color: Resource cluster_listnr cannot run anywhere
> > > > clustera pengine: warning: custom_action: Action cluster_fs_stop_0 on clusterb is unrunnable (offline)
> > > > clustera pengine: info: RecurringOp: Start recurring monitor (20s) for cluster_fs on clustera
> > > > clustera pengine: warning: custom_action: Action cluster_vip_stop_0 on clusterb is unrunnable (offline)
> > > > clustera pengine: info: RecurringOp: Start recurring monitor (10s) for cluster_vip on clustera
> > > > clustera pengine: warning: custom_action: Action cluster_sid_stop_0 on clusterb is unrunnable (offline)
> > > > clustera pengine: warning: custom_action: Action cluster_sid_stop_0 on clusterb is unrunnable (offline)
> > > > clustera pengine: warning: custom_action: Action cluster_listnr_stop_0 on clusterb is unrunnable (offline)
> > > > clustera pengine: warning: custom_action: Action cluster_listnr_stop_0 on clusterb is unrunnable (offline)
> > > > clustera pengine: warning: stage6: Scheduling Node clusterb for STONITH
> > > > clustera pengine: info: native_stop_constraints: cluster_fs_stop_0 is implicit after clusterb is fenced
> > > > clustera pengine: info: native_stop_constraints: cluster_vip_stop_0 is implicit after clusterb is fenced
> > > > clustera pengine: info: native_stop_constraints: cluster_sid_stop_0 is implicit after clusterb is fenced
> > > > clustera pengine: info: native_stop_constraints: cluster_listnr_stop_0 is implicit after clusterb is fenced
> > > > clustera pengine: info: LogActions: Leave ipmi-fence-db01 (Started clustera)
> > > > clustera pengine: info: LogActions: Leave ipmi-fence-db02 (Started clustera)
> > > > clustera pengine: notice: LogActions: Move cluster_fs (Started clusterb -> clustera)
> > > > clustera pengine: notice: LogActions: Move cluster_vip (Started clusterb -> clustera)
> > > > clustera pengine: notice: LogActions: Stop cluster_sid (clusterb)
> > > > clustera pengine: notice: LogActions: Stop cluster_listnr (clusterb)
> > > > clustera pengine: warning: process_pe_message: Calculated Transition 26821: /var/lib/pacemaker/pengine/pe-warn-7.bz2
> > > > clustera crmd: info: do_state_transition: State transition S_POLICY_ENGINE -> S_TRANSITION_ENGINE [ input=I_PE_SUCCESS cause=C_IPC_MESSAGE origin=handle_response ]
> > > > clustera crmd: info: do_te_invoke: Processing graph 26821 (ref=pe_calc-dc-1526868653-26882) derived from /var/lib/pacemaker/pengine/pe-warn-7.bz2
> > > > clustera crmd: notice: te_fence_node: Executing reboot fencing operation (23) on clusterb (timeout=60000)
> > > >
> > > > Thanks ~~~~
> --
> Ken Gaillot <kgail...@redhat.com>
> _______________________________________________
> Users mailing list: Users@clusterlabs.org
> https://lists.clusterlabs.org/mailman/listinfo/users
>
> Project Home: http://www.clusterlabs.org
> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
> Bugs: http://bugs.clusterlabs.org

--
Kind regards,
Albert Weng