Re: [ClusterLabs] pcs node removal: crm_node still lists the removed node as lost

2023-07-27 Thread Ken Gaillot
On Thu, 2023-07-13 at 11:03 -0500, Ken Gaillot wrote:
> On Thu, 2023-07-13 at 09:58 +, S Sathish S via Users wrote:
> > Hi Team,
> >  
> > Problem Statement: we are trying to remove a node from the pcs
> > cluster, but after the command completes, crm_node still lists the
> > removed node as a lost node.
> >  
> > We have checked the corosync.conf file and the node has been removed
> > there, yet it still shows up in the output of crm_node -l.
> >  
> > [root@node1 ~]# pcs cluster node remove node2 --force
> > Destroying cluster on hosts: 'node2'...
> > node2: Successfully destroyed cluster
> > Sending updated corosync.conf to nodes...
> > node1: Succeeded
> > node1: Corosync configuration reloaded
> >  
> > [root@node1 ~]# crm_node -l
> > 1 node1 member
> > 2 node2 lost
> 
> This looks like a possible regression. The "node remove" command
> should erase all knowledge of the node, but I can reproduce this, and
> I don't see the log messages I would expect. I'll have to investigate
> further.

This turned out to be a regression introduced in Pacemaker 2.0.5. It is
now fixed in the main branch by commit 3e31da00, expected to land in
2.1.7 toward the end of this year.

It only affected "crm_node -l".
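
(For anyone hitting this before 2.1.7: only the cached output of crm_node -l
is affected, so the checks below, reusing the node names from the example
above, should confirm the node really is gone from corosync and the CIB. The
crm_node -R call is a possible way to purge the stale cache entry, offered as
an assumption rather than a confirmed workaround.)

[root@node1 ~]# corosync-cmapctl | grep nodelist.node    # corosync's current node list
[root@node1 ~]# cibadmin --query --scope nodes           # nodes section of the CIB
[root@node1 ~]# crm_node -R node2 --force                # possibly purge the stale entry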

> 
> >  
> > In RHEL 7.x we are using the rpm versions below and do not see this
> > issue when removing a node.
> > pacemaker-2.0.2-2.el7
> > corosync-2.4.4-2.el7
> > pcs-0.9.170-1.el7
> >  
> > In RHEL 8.x we are using the rpm versions below and do see the issue
> > described above.
> > pacemaker-2.1.6-1.el8
> > corosync-3.1.7-1.el8
> > pcs-0.10.16-1.el8
> >  
> > Thanks and Regards,
> > S Sathish S
-- 
Ken Gaillot 

___
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users

ClusterLabs home: https://www.clusterlabs.org/


Re: [ClusterLabs] Need help with "(crm_glib_handler) crit: GLib: g_hash_table_lookup: assertion 'hash_table != NULL' failed"

2023-07-27 Thread Ken Gaillot
On Wed, 2023-07-26 at 13:29 -0700, Reid Wahl wrote:
> On Fri, Jul 21, 2023 at 9:51 AM Novik Arthur 
> wrote:
> > Hello Andrew, Ken and the entire community!
> > 
> > I faced a problem and I would like to ask for help.
> > 
> > Preamble:
> > I have a dual-controller storage system (C0, C1) with 2 VMs per
> > controller (vm0[1,2] on C0, vm0[3,4] on C1).
> > I did an online controller upgrade (updating the firmware on the
> > physical controller), and for that purpose we have a special procedure:
> > 
> > 1. Put all VMs on the controller that will be updated into standby
> > mode (vm0[3,4] in the logs).
> > 2. Once all resources have moved to the spare controller's VMs, turn
> > on maintenance-mode (the DC machine is vm01).
> > 3. Shut down vm0[3,4] and perform the firmware update on C1 (OS +
> > KVM + HCA/HBA + BMC drivers will be updated).
> > 4. Reboot C1.
> > 5. Start vm0[3,4]. On this step I hit the problem.
> > 6. Do the same steps for C0 (turn off maintenance-mode, put nodes 3
> > and 4 back online, put nodes 1 and 2 into standby and maintenance,
> > etc.). [A rough pcs-level sketch of these steps follows below.]
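
(A rough pcs-level sketch of the steps above; these commands are
hypothetical, not taken from the thread, and reuse the node names from the
logs. They would be run from any online node, e.g. vm01:)

[root@vm01 ~]# pcs node standby vm03 vm04               # step 1: move resources off C1's VMs
[root@vm01 ~]# pcs property set maintenance-mode=true   # step 2: cluster-wide maintenance
  (steps 3-5 happen outside pcs: update firmware, reboot C1, start vm0[3,4])
[root@vm01 ~]# pcs property set maintenance-mode=false  # step 6: leave maintenance
[root@vm01 ~]# pcs node unstandby vm03 vm04             # step 6: bring nodes 3 and 4 back online
[root@vm01 ~]# pcs node standby vm01 vm02               # step 6: repeat the procedure for C0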
> > 
> > Here is what I observed during step 5.
> > Machine vm03 started without problems, but vm04 hit a critical
> > error and the HA stack died. If I manually start Pacemaker one more
> > time, it starts without problems and vm04 joins the cluster.
> > 
> > Some logs from vm04:
> > 
> > Jul 21 04:05:39 vm04 corosync[3061]:  [QUORUM] This node is within
> > the primary component and will provide service.
> > Jul 21 04:05:39 vm04 corosync[3061]:  [QUORUM] Members[4]: 1 2 3 4
> > Jul 21 04:05:39 vm04 corosync[3061]:  [MAIN  ] Completed service
> > synchronization, ready to provide service.
> > Jul 21 04:05:39 vm04 corosync[3061]:  [KNET  ] rx: host: 3 link: 1
> > is up
> > Jul 21 04:05:39 vm04 corosync[3061]:  [KNET  ] link: Resetting MTU
> > for link 1 because host 3 joined
> > Jul 21 04:05:39 vm04 corosync[3061]:  [KNET  ] host: host: 3
> > (passive) best link: 0 (pri: 1)
> > Jul 21 04:05:39 vm04 pacemaker-attrd[4240]: notice: Setting
> > ifspeed-lnet-o2ib-o2ib[vm02]: (unset) -> 600
> > Jul 21 04:05:40 vm04 corosync[3061]:  [KNET  ] pmtud: PMTUD link
> > change for host: 3 link: 1 from 453 to 65413
> > Jul 21 04:05:40 vm04 corosync[3061]:  [KNET  ] pmtud: Global data
> > MTU changed to: 1397
> > Jul 21 04:05:40 vm04 pacemaker-attrd[4240]: notice: Setting ping-
> > lnet-o2ib-o2ib[vm02]: (unset) -> 4000
> > Jul 21 04:05:40 vm04 pacemaker-attrd[4240]: notice: Setting
> > ifspeed-lnet-o2ib-o2ib[vm01]: (unset) -> 600
> > Jul 21 04:05:40 vm04 pacemaker-attrd[4240]: notice: Setting ping-
> > lnet-o2ib-o2ib[vm01]: (unset) -> 4000
> > Jul 21 04:05:47 vm04 pacemaker-controld[4257]: notice: State
> > transition S_NOT_DC -> S_STOPPING
> > Jul 21 04:05:48 vm04 pacemaker-controld[4257]: error: Cannot
> > execute monitor of sfa-home-vd: No executor connection
> > Jul 21 04:05:48 vm04 pacemaker-controld[4257]: warning: Cannot
> > calculate digests for operation sfa-home-vd_monitor_0 because we
> > have no connection to executor for vm04
> > Jul 21 04:05:48 vm04 pacemaker-controld[4257]: error: Result of
> > probe operation for sfa-home-vd on vm04: Error (No executor
> > connection)
> > Jul 21 04:05:48 vm04 pacemaker-controld[4257]: error: Cannot
> > execute monitor of ifspeed-lnet-o2ib-o2ib: No executor connection
> > Jul 21 04:05:48 vm04 pacemaker-controld[4257]: warning: Cannot
> > calculate digests for operation ifspeed-lnet-o2ib-o2ib_monitor_0
> > because we have no connection to executor for vm04
> > Jul 21 04:05:48 vm04 pacemaker-controld[4257]: error: Result of
> > probe operation for ifspeed-lnet-o2ib-o2ib on vm04: Error (No
> > executor connection)
> > Jul 21 04:05:48 vm04 pacemaker-controld[4257]: error: Cannot
> > execute monitor of ping-lnet-o2ib-o2ib: No executor connection
> > Jul 21 04:05:48 vm04 pacemaker-controld[4257]: warning: Cannot
> > calculate digests for operation ping-lnet-o2ib-o2ib_monitor_0
> > because we have no connection to executor for vm04
> > Jul 21 04:05:48 vm04 pacemaker-controld[4257]: error: Result of
> > probe operation for ping-lnet-o2ib-o2ib on vm04: Error (No executor
> > connection)
> > Jul 21 04:05:49 vm04 pacemakerd[4127]: notice: pacemaker-
> > controld[4257] is unresponsive to ipc after 1 tries
> > Jul 21 04:05:52 vm04 pacemakerd[4127]: warning: Shutting cluster
> > down because pacemaker-controld[4257] had fatal failure
> > Jul 21 04:05:52 vm04 pacemakerd[4127]: notice: Shutting down
> > Pacemaker
> > Jul 21 04:05:52 vm04 pacemakerd[4127]: notice: Stopping pacemaker-
> > schedulerd
> > Jul 21 04:05:52 vm04 pacemakerd[4127]: notice: Stopping pacemaker-
> > attrd
> > Jul 21 04:05:52 vm04 pacemakerd[4127]: notice: Stopping pacemaker-
> > execd
> > Jul 21 04:05:52 vm04 pacemakerd[4127]: notice: Stopping pacemaker-
> > fenced
> > Jul 21 04:05:52 vm04 pacemakerd[4127]: notice: Stopping pacemaker-
> > based
> > Jul 21 04:05:52 vm04 pacemakerd[4127]: notice: Shutdown complete
> > Jul 21 04:05:52 vm04 pacemakerd[4127]: notice: Shutting down and
> > staying down after fatal error
> > 
> > Jul 21 04:05:44 vm04 root

Re: [ClusterLabs] udpu transport support with corosync

2023-07-27 Thread Jan Friesse

Hi,

On 24/07/2023 18:13, Abhijeet Singh wrote:

> Hello,
> 
> We have a 2-node corosync/pacemaker cluster setup. We recently updated
> corosync from v2.3.4 to v3.0.3. I have a couple of questions related to
> the corosync transport mechanism:
> 
> 1. Found the article below, which indicates udpu support might be
> deprecated in the future. Is there a timeline for when udpu might be
> deprecated? udpu


Probably in corosync 4.x, but there is currently no plan for corosync
4.x yet :) So if you are happy with udpu (i.e. you have no need for
encryption or multi-link), just keep using it.
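
(For reference, a minimal corosync 3.x configuration sketch for staying on
udpu; the cluster name and addresses below are made up, not from this thread:)

totem {
    version: 2
    cluster_name: examplecluster
    transport: udpu
    # encryption is not available with udpu
    crypto_cipher: none
    crypto_hash: none
}

nodelist {
    node {
        ring0_addr: 192.168.1.1
        name: node1
        nodeid: 1
    }
    node {
        ring0_addr: 192.168.1.2
        name: node2
        nodeid: 2
    }
}

quorum {
    provider: corosync_votequorum
    two_node: 1
}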



> seems to be performing better than knet in our setup.


Do you have numbers? We did quite extensive testing, and knet was
always both faster and had better (lower) latency.



> https://www.mail-archive.com/users@clusterlabs.org/msg12806.html

> 2. We are using Linux 5.15.x. We noticed that with the knet transport,
> corosync takes up almost double the memory compared to udpu. Is this
> expected? Are


Knet pre-allocates buffers on startup, and corosync's send buffers are
also larger, so yes, it is expected. It shouldn't be especially bad,
particularly over time.
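
(To put numbers on it, one simple check is to compare corosync's resident
memory after starting it with each transport; a sketch using standard tools,
not from this thread:)

# run once with "transport: knet" and once with "transport: udpu" configured
ps -o rss,vsz,cmd -p $(pidof corosync)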



> there any config changes which can help reduce the memory footprint?


Not really. You can change some compile-time #defines in the source
code, but that is really asking for huge trouble.


Honza


> Thanks
> Abhijeet


___
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users

ClusterLabs home: https://www.clusterlabs.org/


