----- On 29 Jul 2020 at 18:53, kgaillot kgail...@redhat.com wrote:
> On Wed, 2020-07-29 at 17:26 +0200, Lentes, Bernd wrote:
>> Hi,
>>
>> a few days ago one of my nodes was fenced and I don't know why, which
>> is something I really don't like.
>> What I did:
>> I put one node (ha-idg-1) in standby. The resources on it (mostly
>> virtual domains) were migrated to ha-idg-2, except one domain
>> (vm_nextcloud): on ha-idg-2 a mountpoint that the domain's XML points
>> to was missing.
>> The cluster then tried to start vm_nextcloud on ha-idg-2, which of
>> course also failed.
>> Then ha-idg-1 was fenced.
>> I did a "crm history" over the respective time period; you can find it
>> here:
>> https://hmgubox2.helmholtz-muenchen.de/index.php/s/529dfcXf5a72ifF
>>
>> Here, from my point of view, is the most interesting part of the logs.
>>
>> ha-idg-1:
>> Jul 20 16:59:33 [23763] ha-idg-1 cib: info: cib_perform_op: Diff: --- 2.16196.19 2
>> Jul 20 16:59:33 [23763] ha-idg-1 cib: info: cib_perform_op: Diff: +++ 2.16197.0 bc9a558dfbe6d7196653ce56ad1ee758
>> Jul 20 16:59:33 [23763] ha-idg-1 cib: info: cib_perform_op: + /cib: @epoch=16197, @num_updates=0
>> Jul 20 16:59:33 [23763] ha-idg-1 cib: info: cib_perform_op: + /cib/configuration/nodes/node[@id='1084777482']/instance_attributes[@id='nodes-1084777482']/nvpair[@id='nodes-1084777482-standby']: @value=on
>> ha-idg-1 set to standby
>>
>> Jul 20 16:59:34 [23768] ha-idg-1 crmd: notice: process_lrm_event: ha-idg-1-vm_nextcloud_migrate_to_0:3169 [ error: Cannot access storage file '/mnt/mcd/AG_BioInformatik/Technik/software_und_treiber/linux/ubuntu/ubuntu-18.04.4-live-server-amd64.iso': No such file or directory\nocf-exit-reason:vm_nextcloud: live migration to ha-idg-2 failed: 1\n ]
>> migration failed
>>
>> Jul 20 17:04:01 [23767] ha-idg-1 pengine: error: native_create_actions: Resource vm_nextcloud is active on 2 nodes (attempting recovery)
>> ???
>
> This is standard for a failed live migration -- the cluster doesn't
> know how far the migration actually got before failing, so it has to
> assume the VM could be active on either node. (The log message would
> make more sense saying "might be active" rather than "is active".)
>
>> Jul 20 17:04:01 [23767] ha-idg-1 pengine: notice: LogAction: * Recover vm_nextcloud ( ha-idg-2 )
>
> The recovery from that situation is a full stop on both nodes and a
> start on one of them.
>
>> Jul 20 17:04:01 [23768] ha-idg-1 crmd: notice: te_rsc_command: Initiating stop operation vm_nextcloud_stop_0 on ha-idg-2 | action 106
>> Jul 20 17:04:01 [23768] ha-idg-1 crmd: notice: te_rsc_command: Initiating stop operation vm_nextcloud_stop_0 locally on ha-idg-1 | action 2
>>
>> Jul 20 17:04:01 [23768] ha-idg-1 crmd: info: match_graph_event: Action vm_nextcloud_stop_0 (106) confirmed on ha-idg-2 (rc=0)
>>
>> Jul 20 17:04:06 [23768] ha-idg-1 crmd: notice: process_lrm_event: Result of stop operation for vm_nextcloud on ha-idg-1: 0 (ok) | call=3197 key=vm_nextcloud_stop_0 confirmed=true cib-update=5960
>
> It looks like both stops succeeded.
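For anyone retracing this offline: the scheduler's decisions for a given transition can be replayed from the saved pe-input file, without touching the live cluster. A minimal sketch, assuming crm_simulate is installed and using the pe-input-3117.bz2 file that the run_graph log message names (in a "crm history"/crm_report archive the path will differ):

```shell
# Replay a saved scheduler input offline; nothing here changes the live cluster.
# The file name comes from the crmd "run_graph ... Source=" log line.
crm_simulate --simulate --xml-file /var/lib/pacemaker/pengine/pe-input-3117.bz2

# Add placement scores to see why the scheduler chose (or refused) a node:
crm_simulate --simulate --show-scores \
    --xml-file /var/lib/pacemaker/pengine/pe-input-3117.bz2
```

The transition summary printed by the first command should correspond to the stop/stop/start sequence visible in the te_rsc_command lines above.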
>
>> Jul 20 17:05:29 [23761] ha-idg-1 pacemakerd: notice: crm_signal_dispatch: Caught 'Terminated' signal | 15 (invoking handler)
>> systemctl stop pacemaker.service
>>
>> ha-idg-2:
>> Jul 20 17:04:03 [10691] ha-idg-2 crmd: notice: process_lrm_event: Result of stop operation for vm_nextcloud on ha-idg-2: 0 (ok) | call=157 key=vm_nextcloud_stop_0 confirmed=true cib-update=57
>> the log from ha-idg-2 is two seconds ahead of ha-idg-1
>>
>> Jul 20 17:04:08 [10688] ha-idg-2 lrmd: notice: log_execute: executing - rsc:vm_nextcloud action:start call_id:192
>> Jul 20 17:04:09 [10688] ha-idg-2 lrmd: notice: operation_finished: vm_nextcloud_start_0:29107:stderr [ error: Failed to create domain from /mnt/share/vm_nextcloud.xml ]
>> Jul 20 17:04:09 [10688] ha-idg-2 lrmd: notice: operation_finished: vm_nextcloud_start_0:29107:stderr [ error: Cannot access storage file '/mnt/mcd/AG_BioInformatik/Technik/software_und_treiber/linux/ubuntu/ubuntu-18.04.4-live-server-amd64.iso': No such file or directory ]
>> Jul 20 17:04:09 [10688] ha-idg-2 lrmd: notice: operation_finished: vm_nextcloud_start_0:29107:stderr [ ocf-exit-reason:Failed to start virtual domain vm_nextcloud. ]
>> Jul 20 17:04:09 [10688] ha-idg-2 lrmd: notice: log_finished: finished - rsc:vm_nextcloud action:start call_id:192 pid:29107 exit-code:1 exec-time:581ms queue-time:0ms
>> start on ha-idg-2 failed
>
> The start failed ...
>
>> Jul 20 17:05:32 [10691] ha-idg-2 crmd: info: do_dc_takeover: Taking over DC status for this partition
>> ha-idg-1 stopped pacemaker
>
> Since ha-idg-1 is now shutting down, ha-idg-2 becomes DC.
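As an aside, whether an operation result actually made it into the CIB can be checked directly, since per-node operation history is recorded under the CIB's status section. A hedged sketch with an illustrative XPath (the resource name is taken from the logs above; the saved-CIB path is a placeholder):

```shell
# Query the live CIB for everything recorded about vm_nextcloud, including
# the lrm_rsc_op entries that hold the last operation results per node.
cibadmin --query --xpath "//lrm_resource[@id='vm_nextcloud']"

# The CIB tools honor the CIB_file environment variable, so the same query
# can be run against a CIB snapshot saved in a crm_report archive:
CIB_file=/path/to/archived/cib.xml \
    cibadmin --query --xpath "//lrm_resource[@id='vm_nextcloud']"
```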
>
>> Jul 20 17:05:33 [10690] ha-idg-2 pengine: warning: unpack_rsc_op_failure: Processing failed migrate_to of vm_nextcloud on ha-idg-1: unknown error | rc=1
>> Jul 20 17:05:33 [10690] ha-idg-2 pengine: warning: unpack_rsc_op_failure: Processing failed start of vm_nextcloud on ha-idg-2: unknown error | rc
>>
>> Jul 20 17:05:33 [10690] ha-idg-2 pengine: info: native_color: Resource vm_nextcloud cannot run anywhere
>> logical
>>
>> Jul 20 17:05:33 [10690] ha-idg-2 pengine: warning: custom_action: Action vm_nextcloud_stop_0 on ha-idg-1 is unrunnable (pending)
>> ???
>
> So this appears to be the problem. From these logs I would guess the
> successful stop on ha-idg-1 did not get written to the CIB for some
> reason. I'd look at the pe input from this transition on ha-idg-2 to
> confirm that.
>
> Without the DC knowing about the stop, it tries to schedule a new one,
> but the node is shutting down so it can't do it, which means it has to
> be fenced.
>
>> Jul 20 17:05:35 [10690] ha-idg-2 pengine: warning: custom_action: Action vm_nextcloud_stop_0 on ha-idg-1 is unrunnable (offline)
>> Jul 20 17:05:35 [10690] ha-idg-2 pengine: warning: pe_fence_node: Cluster node ha-idg-1 will be fenced: resource actions are unrunnable
>> Jul 20 17:05:35 [10690] ha-idg-2 pengine: warning: stage6: Scheduling Node ha-idg-1 for STONITH
>> Jul 20 17:05:35 [10690] ha-idg-2 pengine: info: native_stop_constraints: vm_nextcloud_stop_0 is implicit after ha-idg-1 is fenced
>> Jul 20 17:05:35 [10690] ha-idg-2 pengine: notice: LogNodeActions: * Fence (Off) ha-idg-1 'resource actions are unrunnable'
>>
>> Why does it say
>> "Jul 20 17:05:35 [10690] ha-idg-2 pengine: warning: custom_action: Action vm_nextcloud_stop_0 on ha-idg-1 is unrunnable (offline)"
>> although
>> "Jul 20 17:04:06 [23768] ha-idg-1 crmd: notice: process_lrm_event: Result of stop operation for vm_nextcloud on ha-idg-1: 0 (ok) | call=3197 key=vm_nextcloud_stop_0
>> confirmed=true cib-update=5960"
>> says that the stop was ok?

I'm still digging through the logs, trying to understand what happened.
What I'm wondering about:

Jul 20 17:04:26 [23768] ha-idg-1 crmd: notice: run_graph: Transition 4515 (Complete=10, Pending=0, Fired=0, Skipped=0, Incomplete=0, Source=/var/lib/pacemaker/pengine/pe-input-3117.bz2): Complete
Jul 20 17:05:29 [23768] ha-idg-1 crmd: info: abort_transition_graph: Transition 4515 aborted by status-1084777482-shutdown doing create shutdown=1595257529: Transient attribute change | cib=2.16197.72 source=abort_unless_down:317 path=/cib/status/node_state[@id='1084777482']/transient_attributes[@id='1084777482']/instance_attributes[@id='status-1084777482'] complete=true

A transition is completed, and about a minute later it is aborted. How can that be?

Bernd

Helmholtz Zentrum München
Deutsches Forschungszentrum für Gesundheit und Umwelt (GmbH)
Ingolstädter Landstr. 1
85764 Neuherberg
www.helmholtz-muenchen.de

_______________________________________________
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users

ClusterLabs home: https://www.clusterlabs.org/