On Wed, 2023-07-26 at 13:29 -0700, Reid Wahl wrote:
> On Fri, Jul 21, 2023 at 9:51 AM Novik Arthur
> wrote:
> > Hello Andrew, Ken and the entire community!
> >
> > I faced a problem and I would like to ask for help.
> >
> > Preamble:
> > I have a dual-controller storage system (C0, C1) with 2 VMs per
> > controller (vm0[1,2] on C0, vm0[3,4] on C1).
> > I performed an online controller upgrade (updating the firmware on a
> > physical controller), for which we have a special procedure:
> >
> > 1. Put all VMs on the controller to be updated into standby mode
> >    (vm0[3,4] in the logs).
> > 2. Once all resources have moved to the spare controller's VMs, turn
> >    on maintenance-mode (the DC machine is vm01).
> > 3. Shut down vm0[3,4] and perform the firmware update on C1
> >    (OS + KVM + HCA/HBA + BMC drivers will be updated).
> > 4. Reboot C1.
> > 5. Start vm0[3,4]. On this step I hit the problem.
> > 6. Do the same steps for C0 (turn off maintenance-mode, put nodes
> >    3-4 back online, put nodes 1-2 into standby and maintenance
> >    mode, etc.).
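> >
> > For reference, the standby/maintenance part of the procedure can be
> > sketched with the pcs CLI (a sketch only — our actual tooling wraps
> > these; node names are taken from the logs above):
> >
> > ```shell
> > # Step 1: move vm03/vm04 into standby so resources migrate away
> > pcs node standby vm03 vm04
> >
> > # Wait until no resources remain on vm03/vm04, then step 2:
> > # freeze resource management cluster-wide before shutting nodes down
> > pcs property set maintenance-mode=true
> >
> > # After the firmware update and reboot, the reverse:
> > pcs property set maintenance-mode=false
> > pcs node unstandby vm03 vm04
> > ```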
> >
> > Here is what I observed during step 5.
> > Machine vm03 started without problems, but vm04 hit a critical error
> > and its HA stack died. If I manually start Pacemaker one more time,
> > it starts without problems and vm04 joins the cluster.
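> >
> > The manual recovery is nothing more than restarting the service and
> > watching the node rejoin (assuming a systemd-managed Pacemaker, as
> > on our systems):
> >
> > ```shell
> > # pacemakerd exited with "staying down after fatal error",
> > # so the unit must be started again by hand
> > systemctl start pacemaker
> >
> > # confirm vm04 rejoins and resources settle
> > crm_mon -1
> > ```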
> >
> > Some logs from vm04:
> >
> > Jul 21 04:05:39 vm04 corosync[3061]: [QUORUM] This node is within
> > the primary component and will provide service.
> > Jul 21 04:05:39 vm04 corosync[3061]: [QUORUM] Members[4]: 1 2 3 4
> > Jul 21 04:05:39 vm04 corosync[3061]: [MAIN ] Completed service
> > synchronization, ready to provide service.
> > Jul 21 04:05:39 vm04 corosync[3061]: [KNET ] rx: host: 3 link: 1
> > is up
> > Jul 21 04:05:39 vm04 corosync[3061]: [KNET ] link: Resetting MTU
> > for link 1 because host 3 joined
> > Jul 21 04:05:39 vm04 corosync[3061]: [KNET ] host: host: 3
> > (passive) best link: 0 (pri: 1)
> > Jul 21 04:05:39 vm04 pacemaker-attrd[4240]: notice: Setting
> > ifspeed-lnet-o2ib-o2ib[vm02]: (unset) -> 600
> > Jul 21 04:05:40 vm04 corosync[3061]: [KNET ] pmtud: PMTUD link
> > change for host: 3 link: 1 from 453 to 65413
> > Jul 21 04:05:40 vm04 corosync[3061]: [KNET ] pmtud: Global data
> > MTU changed to: 1397
> > Jul 21 04:05:40 vm04 pacemaker-attrd[4240]: notice: Setting ping-
> > lnet-o2ib-o2ib[vm02]: (unset) -> 4000
> > Jul 21 04:05:40 vm04 pacemaker-attrd[4240]: notice: Setting
> > ifspeed-lnet-o2ib-o2ib[vm01]: (unset) -> 600
> > Jul 21 04:05:40 vm04 pacemaker-attrd[4240]: notice: Setting ping-
> > lnet-o2ib-o2ib[vm01]: (unset) -> 4000
> > Jul 21 04:05:47 vm04 pacemaker-controld[4257]: notice: State
> > transition S_NOT_DC -> S_STOPPING
> > Jul 21 04:05:48 vm04 pacemaker-controld[4257]: error: Cannot
> > execute monitor of sfa-home-vd: No executor connection
> > Jul 21 04:05:48 vm04 pacemaker-controld[4257]: warning: Cannot
> > calculate digests for operation sfa-home-vd_monitor_0 because we
> > have no connection to executor for vm04
> > Jul 21 04:05:48 vm04 pacemaker-controld[4257]: error: Result of
> > probe operation for sfa-home-vd on vm04: Error (No executor
> > connection)
> > Jul 21 04:05:48 vm04 pacemaker-controld[4257]: error: Cannot
> > execute monitor of ifspeed-lnet-o2ib-o2ib: No executor connection
> > Jul 21 04:05:48 vm04 pacemaker-controld[4257]: warning: Cannot
> > calculate digests for operation ifspeed-lnet-o2ib-o2ib_monitor_0
> > because we have no connection to executor for vm04
> > Jul 21 04:05:48 vm04 pacemaker-controld[4257]: error: Result of
> > probe operation for ifspeed-lnet-o2ib-o2ib on vm04: Error (No
> > executor connection)
> > Jul 21 04:05:48 vm04 pacemaker-controld[4257]: error: Cannot
> > execute monitor of ping-lnet-o2ib-o2ib: No executor connection
> > Jul 21 04:05:48 vm04 pacemaker-controld[4257]: warning: Cannot
> > calculate digests for operation ping-lnet-o2ib-o2ib_monitor_0
> > because we have no connection to executor for vm04
> > Jul 21 04:05:48 vm04 pacemaker-controld[4257]: error: Result of
> > probe operation for ping-lnet-o2ib-o2ib on vm04: Error (No executor
> > connection)
> > Jul 21 04:05:49 vm04 pacemakerd[4127]: notice: pacemaker-
> > controld[4257] is unresponsive to ipc after 1 tries
> > Jul 21 04:05:52 vm04 pacemakerd[4127]: warning: Shutting cluster
> > down because pacemaker-controld[4257] had fatal failure
> > Jul 21 04:05:52 vm04 pacemakerd[4127]: notice: Shutting down
> > Pacemaker
> > Jul 21 04:05:52 vm04 pacemakerd[4127]: notice: Stopping pacemaker-
> > schedulerd
> > Jul 21 04:05:52 vm04 pacemakerd[4127]: notice: Stopping pacemaker-
> > attrd
> > Jul 21 04:05:52 vm04 pacemakerd[4127]: notice: Stopping pacemaker-
> > execd
> > Jul 21 04:05:52 vm04 pacemakerd[4127]: notice: Stopping pacemaker-
> > fenced
> > Jul 21 04:05:52 vm04 pacemakerd[4127]: notice: Stopping pacemaker-
> > based
> > Jul 21 04:05:52 vm04 pacemakerd[4127]: notice: Shutdown complete
> > Jul 21 04:05:52 vm04 pacemakerd[4127]: notice: Shutting down and
> > staying down after fatal error
> >
> > Jul 21 04:05:44 vm04 root