I wonder how the subject is related to the problem described below...

-----Original Message-----
From: Users <users-boun...@clusterlabs.org> On Behalf Of Novik Arthur
Sent: Friday, July 21, 2023 2:42 PM
To: users@clusterlabs.org
Subject: [EXT] [ClusterLabs] Need a help with "(crm_glib_handler) crit: GLib: 
g_hash_table_lookup: assertion 'hash_table != NULL' failed"

Hello Andrew, Ken and the entire community!

I faced a problem and I would like to ask for help.

Preamble:
I have a dual-controller storage system (C0, C1) with 2 VMs per controller (vm0[1,2] on
C0, vm0[3,4] on C1).
I performed an online controller upgrade (updating the firmware on a physical controller),
and for that purpose we have a special procedure:

1. Put all VMs on the controller to be updated into standby mode (vm0[3,4] in the logs).
2. Once all resources have moved to the spare controller's VMs, turn on maintenance-mode (the DC machine is vm01).
3. Shut down vm0[3,4] and perform the firmware update on C1 (OS + KVM + HCA/HBA + BMC drivers are updated).
4. Reboot C1.
5. Start vm0[3,4]. (This is the step where I hit the problem.)
6. Repeat the same steps for C0 (turn off maintenance-mode, bring nodes 3-4 online, put nodes 1-2 into standby, maintenance-mode, etc.).
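
For clarity, steps 1-5 correspond roughly to the commands below. This is only a dry-run sketch: I am assuming pcs here, and the actual tooling on the controllers may differ, so run() just prints instead of executing.

```shell
# Dry-run sketch of steps 1-5 (pcs command names are assumptions).
run() { echo "+ $*"; }                          # print instead of executing

run pcs node standby vm03 vm04                  # step 1: standby vm0[3,4]
run pcs property set maintenance-mode=true      # step 2: maintenance on (DC is vm01)
run pcs cluster stop vm03 vm04                  # step 3: stop HA stack before VM shutdown
# firmware update and reboot of C1 happen outside the cluster (steps 3-4)
run pcs cluster start vm03 vm04                 # step 5: where the failure shows up
```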

Here is what I observed during step 5.

Machine vm03 started without problems, but vm04 hit a critical error and its HA
stack died. If I manually start Pacemaker one more time, it starts
without problems and vm04 joins the cluster.

Some logs from vm04:

Jul 21 04:05:39 vm04 corosync[3061]:  [QUORUM] This node is within the primary component and will provide service.
Jul 21 04:05:39 vm04 corosync[3061]:  [QUORUM] Members[4]: 1 2 3 4
Jul 21 04:05:39 vm04 corosync[3061]:  [MAIN  ] Completed service synchronization, ready to provide service.
Jul 21 04:05:39 vm04 corosync[3061]:  [KNET  ] rx: host: 3 link: 1 is up
Jul 21 04:05:39 vm04 corosync[3061]:  [KNET  ] link: Resetting MTU for link 1 because host 3 joined
Jul 21 04:05:39 vm04 corosync[3061]:  [KNET  ] host: host: 3 (passive) best link: 0 (pri: 1)
Jul 21 04:05:39 vm04 pacemaker-attrd[4240]: notice: Setting ifspeed-lnet-o2ib-o2ib[vm02]: (unset) -> 600
Jul 21 04:05:40 vm04 corosync[3061]:  [KNET  ] pmtud: PMTUD link change for host: 3 link: 1 from 453 to 65413
Jul 21 04:05:40 vm04 corosync[3061]:  [KNET  ] pmtud: Global data MTU changed to: 1397
Jul 21 04:05:40 vm04 pacemaker-attrd[4240]: notice: Setting ping-lnet-o2ib-o2ib[vm02]: (unset) -> 4000
Jul 21 04:05:40 vm04 pacemaker-attrd[4240]: notice: Setting ifspeed-lnet-o2ib-o2ib[vm01]: (unset) -> 600
Jul 21 04:05:40 vm04 pacemaker-attrd[4240]: notice: Setting ping-lnet-o2ib-o2ib[vm01]: (unset) -> 4000
Jul 21 04:05:47 vm04 pacemaker-controld[4257]: notice: State transition S_NOT_DC -> S_STOPPING
Jul 21 04:05:48 vm04 pacemaker-controld[4257]: error: Cannot execute monitor of sfa-home-vd: No executor connection
Jul 21 04:05:48 vm04 pacemaker-controld[4257]: warning: Cannot calculate digests for operation sfa-home-vd_monitor_0 because we have no connection to executor for vm04
Jul 21 04:05:48 vm04 pacemaker-controld[4257]: error: Result of probe operation for sfa-home-vd on vm04: Error (No executor connection)
Jul 21 04:05:48 vm04 pacemaker-controld[4257]: error: Cannot execute monitor of ifspeed-lnet-o2ib-o2ib: No executor connection
Jul 21 04:05:48 vm04 pacemaker-controld[4257]: warning: Cannot calculate digests for operation ifspeed-lnet-o2ib-o2ib_monitor_0 because we have no connection to executor for vm04
Jul 21 04:05:48 vm04 pacemaker-controld[4257]: error: Result of probe operation for ifspeed-lnet-o2ib-o2ib on vm04: Error (No executor connection)
Jul 21 04:05:48 vm04 pacemaker-controld[4257]: error: Cannot execute monitor of ping-lnet-o2ib-o2ib: No executor connection
Jul 21 04:05:48 vm04 pacemaker-controld[4257]: warning: Cannot calculate digests for operation ping-lnet-o2ib-o2ib_monitor_0 because we have no connection to executor for vm04
Jul 21 04:05:48 vm04 pacemaker-controld[4257]: error: Result of probe operation for ping-lnet-o2ib-o2ib on vm04: Error (No executor connection)
Jul 21 04:05:49 vm04 pacemakerd[4127]: notice: pacemaker-controld[4257] is unresponsive to ipc after 1 tries
Jul 21 04:05:52 vm04 pacemakerd[4127]: warning: Shutting cluster down because pacemaker-controld[4257] had fatal failure
Jul 21 04:05:52 vm04 pacemakerd[4127]: notice: Shutting down Pacemaker
Jul 21 04:05:52 vm04 pacemakerd[4127]: notice: Stopping pacemaker-schedulerd
Jul 21 04:05:52 vm04 pacemakerd[4127]: notice: Stopping pacemaker-attrd
Jul 21 04:05:52 vm04 pacemakerd[4127]: notice: Stopping pacemaker-execd
Jul 21 04:05:52 vm04 pacemakerd[4127]: notice: Stopping pacemaker-fenced
Jul 21 04:05:52 vm04 pacemakerd[4127]: notice: Stopping pacemaker-based
Jul 21 04:05:52 vm04 pacemakerd[4127]: notice: Shutdown complete
Jul 21 04:05:52 vm04 pacemakerd[4127]: notice: Shutting down and staying down after fatal error

Jul 21 04:05:44 vm04 root[10111]: openibd: Set node_desc for mlx5_0: vm04 HCA-1
Jul 21 04:05:44 vm04 root[10113]: openibd: Set node_desc for mlx5_1: vm04 HCA-2
Jul 21 04:05:47 vm04 pacemaker-controld[4257]:  error: Shutting down controller after unexpected shutdown request from vm01
Jul 21 04:05:47 vm04 pacemaker-controld[4257]: Problem detected at handle_shutdown_ack:954 (controld_messages.c), please see /var/lib/pacemaker/blackbox/pacemaker-controld-4257.1 for additional details
Jul 21 04:05:47 vm04 pacemaker-controld[4257]:  notice: State transition S_NOT_DC -> S_STOPPING
Jul 21 04:05:47 vm04 pacemaker-controld[4257]:  notice: Disconnected from the executor
Jul 21 04:05:47 vm04 pacemaker-controld[4257]:  notice: Disconnected from Corosync
Jul 21 04:05:47 vm04 pacemaker-controld[4257]:  notice: Disconnected from the CIB manager
Jul 21 04:05:48 vm04 pacemaker-controld[4257]:  notice: Disconnected from the CIB manager
Jul 21 04:05:48 vm04 pacemaker-controld[4257]:  crit: GLib: g_hash_table_lookup: assertion 'hash_table != NULL' failed
Jul 21 04:05:48 vm04 pacemaker-controld[4257]:  error: Cannot execute monitor of sfa-home-vd: No executor connection
Jul 21 04:05:48 vm04 pacemaker-controld[4257]:  crit: GLib: g_hash_table_lookup: assertion 'hash_table != NULL' failed
Jul 21 04:05:48 vm04 pacemaker-controld[4257]:  warning: Cannot calculate digests for operation sfa-home-vd_monitor_0 because we have no connection to executor for vm04
Jul 21 04:05:48 vm04 pacemaker-controld[4257]:  warning: Resource update -107 failed: (rc=-107) Transport endpoint is not connected
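
The "Problem detected at handle_shutdown_ack" line points at a blackbox dump, which can be decoded with qb-blackbox from libqb. A dry-run sketch (the path is copied straight from the log; run() only prints):

```shell
# Dry-run sketch: decode the controller blackbox, then restart the stack.
run() { echo "+ $*"; }                          # print instead of executing

run qb-blackbox /var/lib/pacemaker/blackbox/pacemaker-controld-4257.1
run systemctl start pacemaker                   # manual restart, after which vm04 rejoins
```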


The log from the DC, vm01:

Jul 21 04:05:39 vm01 pacemaker-controld[4048]: notice: Transition 16 aborted: Peer Halt
Jul 21 04:05:39 vm01 pacemaker-attrd[4017]: notice: Detected another attribute writer (vm04), starting new election
Jul 21 04:05:39 vm01 pacemaker-attrd[4017]: notice: Setting #attrd-protocol[vm04]: (unset) -> 5
Jul 21 04:05:40 vm01 pacemaker-controld[4048]: notice: Finalizing join-2 for 1 node (sync'ing CIB from vm02)
Jul 21 04:05:40 vm01 pacemaker-controld[4048]: notice: Requested CIB version <generation_tuple crm_feature_set="3.16.2" validate-with="pacemaker-3.9" epoch="567" num_updates="111" admin_epoch="0" cib-last-written="Fri Jul 21 03:48:43 2023" update-origin="vm01" update-client="cibadmin" update-user="root" have-quorum="0" dc-uuid="1"/>
Jul 21 04:05:40 vm01 pacemaker-attrd[4017]: notice: Recorded local node as attribute writer (was unset)
Jul 21 04:05:40 vm01 pacemaker-attrd[4017]: notice: Setting #feature-set[vm04]: (unset) -> 3.16.2
Jul 21 04:05:41 vm01 pacemaker-controld[4048]: notice: Transition 16 aborted by deletion of lrm[@id='4']: Resource state removal

Jul 21 04:05:47 vm01 pacemaker-schedulerd[4028]: notice: No fencing will be done until there are resources to manage
Jul 21 04:05:47 vm01 pacemaker-schedulerd[4028]: notice:  * Shutdown vm04
Jul 21 04:05:47 vm01 pacemaker-schedulerd[4028]: notice: Calculated transition 17, saving inputs in /var/lib/pacemaker/pengine/pe-input-940.bz2



As far as I understand, vm04 was shut down by the DC during the election of a new
attribute writer?



The issue is reproducible from time to time. The Pacemaker version is
2.1.5-8.1.el8_8-a3f44794f94 from Rocky Linux 8.8.



I attached a crm_report with the blackbox. I also have debug logs, but they are pretty
heavy (~40 MB with bzip --best). Please tell me if you need them.
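
If the full debug logs are needed, I can re-collect them scoped to the incident window so the archive stays manageable, roughly like this (dry-run sketch; the time window and destination path are illustrative, run() only prints):

```shell
# Dry-run sketch: time-scoped crm_report around the 04:05 incident.
run() { echo "+ $*"; }                          # print instead of executing

run crm_report -f "2023-07-21 03:30" -t "2023-07-21 04:30" /tmp/vm04-incident
```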


Thanks,
Arthur


