On Sun, 2020-08-09 at 22:17 +0200, Lentes, Bernd wrote:
> ----- On 29 Jul 2020 at 18:53, kgaillot [email protected] wrote:
> 
> > On Wed, 2020-07-29 at 17:26 +0200, Lentes, Bernd wrote:
> > > Hi,
> > > 
> > > a few days ago one of my nodes was fenced and I don't know why,
> > > which is something I really don't like.
> > > What I did:
> > > I put one node (ha-idg-1) in standby. The resources on it (most of
> > > them virtual domains) were migrated to ha-idg-2, except one domain
> > > (vm_nextcloud): a mount point that the domain's XML points to was
> > > missing on ha-idg-2.
> > > The cluster then tried to start vm_nextcloud on ha-idg-2, which of
> > > course also failed.
> > > Then ha-idg-1 was fenced.
> > > I did a "crm history" over the respective time period; you find it
> > > here:
> > > https://hmgubox2.helmholtz-muenchen.de/index.php/s/529dfcXf5a72ifF
> > > 
> > > Here, from my point of view, is the most interesting part of the
> > > logs.
> > > 
> > > ha-idg-1:
> > > Jul 20 16:59:33 [23763] ha-idg-1 cib: info: cib_perform_op: Diff: --- 2.16196.19 2
> > > Jul 20 16:59:33 [23763] ha-idg-1 cib: info: cib_perform_op: Diff: +++ 2.16197.0 bc9a558dfbe6d7196653ce56ad1ee758
> > > Jul 20 16:59:33 [23763] ha-idg-1 cib: info: cib_perform_op: + /cib: @epoch=16197, @num_updates=0
> > > Jul 20 16:59:33 [23763] ha-idg-1 cib: info: cib_perform_op: + /cib/configuration/nodes/node[@id='1084777482']/instance_attributes[@id='nodes-1084777482']/nvpair[@id='nodes-1084777482-standby']: @value=on
> > > ha-idg-1 set to standby
> > > 
> > > Jul 20 16:59:34 [23768] ha-idg-1 crmd: notice: process_lrm_event: ha-idg-1-vm_nextcloud_migrate_to_0:3169 [ error: Cannot access storage file '/mnt/mcd/AG_BioInformatik/Technik/software_und_treiber/linux/ubuntu/ubuntu-18.04.4-live-server-amd64.iso': No such file or directory\nocf-exit-reason:vm_nextcloud: live migration to ha-idg-2 failed: 1\n ]
> > > migration failed
> > > 
> > > Jul 20 17:04:01 [23767] ha-idg-1 pengine: error: native_create_actions: Resource vm_nextcloud is active on 2 nodes (attempting recovery)
> > > ???
> > 
> > This is standard for a failed live migration -- the cluster doesn't
> > know how far the migration actually got before failing, so it has to
> > assume the VM could be active on either node. (The log message would
> > make more sense saying "might be active" rather than "is active".)
> > 
> > > Jul 20 17:04:01 [23767] ha-idg-1 pengine: notice: LogAction: * Recover vm_nextcloud ( ha-idg-2 )
> > 
> > The recovery from that situation is a full stop on both nodes and a
> > start on one of them.
> > 
> > > Jul 20 17:04:01 [23768] ha-idg-1 crmd: notice: te_rsc_command: Initiating stop operation vm_nextcloud_stop_0 on ha-idg-2 | action 106
> > > Jul 20 17:04:01 [23768] ha-idg-1 crmd: notice: te_rsc_command: Initiating stop operation vm_nextcloud_stop_0 locally on ha-idg-1 | action 2
> > > 
> > > Jul 20 17:04:01 [23768] ha-idg-1 crmd: info: match_graph_event: Action vm_nextcloud_stop_0 (106) confirmed on ha-idg-2 (rc=0)
> > > 
> > > Jul 20 17:04:06 [23768] ha-idg-1 crmd: notice: process_lrm_event: Result of stop operation for vm_nextcloud on ha-idg-1: 0 (ok) | call=3197 key=vm_nextcloud_stop_0 confirmed=true cib-update=5960
> > 
> > It looks like both stops succeeded.
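As an aside, the underlying failure here (a VM whose backing storage
sits on a mount that was absent on ha-idg-2) is a dependency the
cluster itself can enforce. A minimal crm shell sketch; the resource
names, NFS device, and fstype are hypothetical (only vm_nextcloud and
/mnt/mcd come from the logs above):

    # Manage the mount as a cloned Filesystem resource on both nodes,
    # and only place the VM where a clone instance is actually running:
    primitive fs_mcd ocf:heartbeat:Filesystem \
            params device="nas:/export/mcd" directory="/mnt/mcd" fstype="nfs" \
            op monitor interval=20s
    clone cl_fs_mcd fs_mcd meta interleave=true
    colocation col_vm_with_fs inf: vm_nextcloud cl_fs_mcd
    order ord_fs_before_vm inf: cl_fs_mcd vm_nextcloud

With something like this in place, a missing mount shows up as a
stopped filesystem instance, and the scheduler never places (or
live-migrates) vm_nextcloud onto that node in the first place.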
> > > Jul 20 17:05:29 [23761] ha-idg-1 pacemakerd: notice: crm_signal_dispatch: Caught 'Terminated' signal | 15 (invoking handler)
> > > systemctl stop pacemaker.service
> > > 
> > > ha-idg-2:
> > > Jul 20 17:04:03 [10691] ha-idg-2 crmd: notice: process_lrm_event: Result of stop operation for vm_nextcloud on ha-idg-2: 0 (ok) | call=157 key=vm_nextcloud_stop_0 confirmed=true cib-update=57
> > > the log from ha-idg-2 is two seconds ahead of ha-idg-1
> > > 
> > > Jul 20 17:04:08 [10688] ha-idg-2 lrmd: notice: log_execute: executing - rsc:vm_nextcloud action:start call_id:192
> > > Jul 20 17:04:09 [10688] ha-idg-2 lrmd: notice: operation_finished: vm_nextcloud_start_0:29107:stderr [ error: Failed to create domain from /mnt/share/vm_nextcloud.xml ]
> > > Jul 20 17:04:09 [10688] ha-idg-2 lrmd: notice: operation_finished: vm_nextcloud_start_0:29107:stderr [ error: Cannot access storage file '/mnt/mcd/AG_BioInformatik/Technik/software_und_treiber/linux/ubuntu/ubuntu-18.04.4-live-server-amd64.iso': No such file or directory ]
> > > Jul 20 17:04:09 [10688] ha-idg-2 lrmd: notice: operation_finished: vm_nextcloud_start_0:29107:stderr [ ocf-exit-reason:Failed to start virtual domain vm_nextcloud. ]
> > > Jul 20 17:04:09 [10688] ha-idg-2 lrmd: notice: log_finished: finished - rsc:vm_nextcloud action:start call_id:192 pid:29107 exit-code:1 exec-time:581ms queue-time:0ms
> > > start on ha-idg-2 failed
> > 
> > The start failed ...
> > 
> > > Jul 20 17:05:32 [10691] ha-idg-2 crmd: info: do_dc_takeover: Taking over DC status for this partition
> > > ha-idg-1 stopped pacemaker
> > 
> > Since ha-idg-1 is now shutting down, ha-idg-2 becomes DC.
> > 
> > > Jul 20 17:05:33 [10690] ha-idg-2 pengine: warning: unpack_rsc_op_failure: Processing failed migrate_to of vm_nextcloud on ha-idg-1: unknown error | rc=1
> > > Jul 20 17:05:33 [10690] ha-idg-2 pengine: warning: unpack_rsc_op_failure: Processing failed start of vm_nextcloud on ha-idg-2: unknown error | rc
> > > 
> > > Jul 20 17:05:33 [10690] ha-idg-2 pengine: info: native_color: Resource vm_nextcloud cannot run anywhere
> > > logical
> > > 
> > > Jul 20 17:05:33 [10690] ha-idg-2 pengine: warning: custom_action: Action vm_nextcloud_stop_0 on ha-idg-1 is unrunnable (pending)
> > > ???
> > 
> > So this appears to be the problem. From these logs I would guess the
> > successful stop on ha-idg-1 did not get written to the CIB for some
> > reason. I'd look at the pe input from this transition on ha-idg-2 to
> > confirm that.
> > 
> > Without the DC knowing about the stop, it tries to schedule a new
> > one, but the node is shutting down so it can't run it there, which
> > means the node has to be fenced.
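The pe input mentioned above can be replayed offline to confirm this.
A sketch, assuming default Pacemaker file locations; the pe-input
number NNNN is a placeholder for the one named in ha-idg-2's own logs
for that transition:

    # Replay the scheduler's view of the transition, with placement scores:
    crm_simulate --xml-file /var/lib/pacemaker/pengine/pe-input-NNNN.bz2 \
                 --simulate --show-scores

    # Check whether ha-idg-1's successful stop made it into that CIB snapshot:
    bzcat /var/lib/pacemaker/pengine/pe-input-NNNN.bz2 | grep vm_nextcloud_stop_0

    # For comparison, what the live CIB's status section records:
    cibadmin --query --scope status | grep vm_nextcloud

If the stop result is missing from the snapshot but present in the
live CIB, that would confirm the guess above.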
> > > Jul 20 17:05:35 [10690] ha-idg-2 pengine: warning: custom_action: Action vm_nextcloud_stop_0 on ha-idg-1 is unrunnable (offline)
> > > Jul 20 17:05:35 [10690] ha-idg-2 pengine: warning: pe_fence_node: Cluster node ha-idg-1 will be fenced: resource actions are unrunnable
> > > Jul 20 17:05:35 [10690] ha-idg-2 pengine: warning: stage6: Scheduling Node ha-idg-1 for STONITH
> > > Jul 20 17:05:35 [10690] ha-idg-2 pengine: info: native_stop_constraints: vm_nextcloud_stop_0 is implicit after ha-idg-1 is fenced
> > > Jul 20 17:05:35 [10690] ha-idg-2 pengine: notice: LogNodeActions: * Fence (Off) ha-idg-1 'resource actions are unrunnable'
> > > 
> > > Why does it say
> > > "Jul 20 17:05:35 [10690] ha-idg-2 pengine: warning: custom_action: Action vm_nextcloud_stop_0 on ha-idg-1 is unrunnable (offline)"
> > > although
> > > "Jul 20 17:04:06 [23768] ha-idg-1 crmd: notice: process_lrm_event: Result of stop operation for vm_nextcloud on ha-idg-1: 0 (ok) | call=3197 key=vm_nextcloud_stop_0 confirmed=true cib-update=5960"
> > > says that stop was ok?
> 
> I'm still digging in the logs, trying to understand what happened.
> What I'm wondering about:
> 
> Jul 20 17:04:26 [23768] ha-idg-1 crmd: notice: run_graph: Transition 4515 (Complete=10, Pending=0, Fired=0, Skipped=0, Incomplete=0, Source=/var/lib/pacemaker/pengine/pe-input-3117.bz2): Complete
> 
> Jul 20 17:05:29 [23768] ha-idg-1 crmd: info: abort_transition_graph: Transition 4515 aborted by status-1084777482-shutdown doing create shutdown=1595257529: Transient attribute change | cib=2.16197.72 source=abort_unless_down:317 path=/cib/status/node_state[@id='1084777482']/transient_attributes[@id='1084777482']/instance_attributes[@id='status-1084777482'] complete=true
> 
> A transition is completed and three seconds later aborted. How can
> that be?
The most recent transition is aborted, but since all its actions are
complete, the only effect is to trigger a new transition. We should
probably rephrase the log message. In fact, the whole "transition"
terminology is kind of obscure; it's hard to come up with something
better, though. (The "shutdown" attribute that triggered the abort can
also be queried directly; see the sketch below.)

> Bernd
-- 
Ken Gaillot <[email protected]>
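The sketch referenced above: the shutdown request is an ordinary
transient node attribute (reboot lifetime, i.e. stored in the CIB
status section), so it can be inspected with crm_attribute:

    # Query ha-idg-1's transient "shutdown" attribute; unset or 0 means no
    # shutdown requested (in the logs above it held the epoch 1595257529):
    crm_attribute --node ha-idg-1 --name shutdown --lifetime reboot --query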
