Hi,
thanks again for the response!
Thomas,

Thomas Lamprecht wrote:
Hi,

thanks for the response!
I added some information and clarification below.

On 10/01/2015 09:23 AM, Jan Friesse wrote:
Hi,

Thomas Lamprecht wrote:
Hello,

we are using corosync needle (2.3.5) for our cluster filesystem
(pmxcfs).
The situation is the following: first we start up pmxcfs, which is
a FUSE filesystem, and if a cluster configuration exists, we also
start corosync.
This allows the filesystem to exist on a one-node 'cluster', or to
force it into a local mode. We use CPG to send our messages to all
members; the filesystem resides in RAM, and all fs operations are
sent 'over the wire'.
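
For context, the basic CPG pattern we use looks roughly like this (a
minimal, self-contained sketch, not the actual pmxcfs code; the group
name and payload are placeholders; link with -lcpg):

#include <stdio.h>
#include <string.h>
#include <sys/uio.h>
#include <corosync/cpg.h>

int main(void)
{
    cpg_handle_t handle;
    cpg_callbacks_t callbacks = {
        .cpg_deliver_fn = NULL,  /* message delivery callback omitted */
        .cpg_confchg_fn = NULL,  /* membership change callback omitted */
    };
    struct cpg_name group;
    struct iovec iov;
    cs_error_t err;

    /* Connect to the local corosync instance; the handle is only
     * valid if this returns CS_OK. */
    err = cpg_initialize(&handle, &callbacks);
    if (err != CS_OK) {
        fprintf(stderr, "cpg_initialize failed: %d\n", err);
        return 1;
    }

    /* Join a closed process group ("pmxcfs" is a placeholder name). */
    strcpy(group.value, "pmxcfs");
    group.length = strlen(group.value);
    err = cpg_join(handle, &group);
    if (err != CS_OK) {
        fprintf(stderr, "cpg_join failed: %d\n", err);
        return 1;
    }

    /* Multicast one message to all current members of the group. */
    iov.iov_base = (void *)"hello";
    iov.iov_len = 5;
    err = cpg_mcast_joined(handle, CPG_TYPE_AGREED, &iov, 1);
    if (err != CS_OK)
        fprintf(stderr, "cpg_mcast_joined failed: %d\n", err);

    cpg_finalize(handle);
    return 0;
}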

The problem is now the following:
When we're restarting all (in my test case 3) nodes at the same time,
in roughly 1 out of 10 cases I get only CS_ERR_BAD_HANDLE back when
calling
I'm really unsure what you are doing here. You are
restarting all nodes and get CS_ERR_BAD_HANDLE? I mean, if you are
restarting all nodes, which node returns CS_ERR_BAD_HANDLE? Or are you
restarting just pmxcfs? Or just corosync?
To clarify, sorry, I was a bit unspecific. I can see the error
behaviour in two cases:
1) I restart three physical hosts (= nodes) at the same time. One of
them - normally the last one coming up again - successfully joins the
corosync cluster; the filesystem (pmxcfs) notices that, but then
cpg_mcast_joined returns only CS_ERR_BAD_HANDLE errors.
Ok, that is weird. Are you able to reproduce the same behavior by restarting pmxcfs, or is a membership change (= restart of a node) really needed? Also, are you sure the network interface is up when corosync starts?
No, I tried quite a few times to restart pmxcfs, but that didn't trigger the problem yet. I could trigger it once by restarting only one node, though, so restarting all of them only makes the problem worse but isn't needed in the first place.


The corosync.log of the failing node may be interesting.

My nodes' hostnames are [one, two, three]; this time they came up in the order they're named, and the failure showed up on two nodes, the first and second ones to come up again. The corosync log seems normal, although I didn't have debug mode enabled; I don't know what difference that makes when no error shows up in the normal log.
Oct 07 09:06:36 [1335] two corosync notice  [MAIN  ] Corosync Cluster Engine ('2.3.5'): started and ready to provide service.
Oct 07 09:06:36 [1335] two corosync info    [MAIN  ] Corosync built-in features: augeas systemd pie relro bindnow
Oct 07 09:06:36 [1335] two corosync notice  [TOTEM ] Initializing transport (UDP/IP Multicast).
Oct 07 09:06:36 [1335] two corosync notice  [TOTEM ] Initializing transmit/receive security (NSS) crypto: aes256 hash: sha1
Oct 07 09:06:36 [1335] two corosync notice  [TOTEM ] The network interface [10.10.1.152] is now up.
Oct 07 09:06:36 [1335] two corosync notice  [SERV  ] Service engine loaded: corosync configuration map access [0]
Oct 07 09:06:36 [1335] two corosync info    [QB    ] server name: cmap
Oct 07 09:06:36 [1335] two corosync notice  [SERV  ] Service engine loaded: corosync configuration service [1]
Oct 07 09:06:36 [1335] two corosync info    [QB    ] server name: cfg
Oct 07 09:06:36 [1335] two corosync notice  [SERV  ] Service engine loaded: corosync cluster closed process group service v1.01 [2]
Oct 07 09:06:36 [1335] two corosync info    [QB    ] server name: cpg
Oct 07 09:06:36 [1335] two corosync notice  [SERV  ] Service engine loaded: corosync profile loading service [4]
Oct 07 09:06:36 [1335] two corosync notice  [QUORUM] Using quorum provider corosync_votequorum
Oct 07 09:06:36 [1335] two corosync notice  [SERV  ] Service engine loaded: corosync vote quorum service v1.0 [5]
Oct 07 09:06:36 [1335] two corosync info    [QB    ] server name: votequorum
Oct 07 09:06:36 [1335] two corosync notice  [SERV  ] Service engine loaded: corosync cluster quorum service v0.1 [3]
Oct 07 09:06:36 [1335] two corosync info    [QB    ] server name: quorum
Oct 07 09:06:36 [1335] two corosync notice  [TOTEM ] A new membership (10.10.1.152:92) was formed. Members joined: 2
Oct 07 09:06:36 [1335] two corosync notice  [QUORUM] Members[1]: 2
Oct 07 09:06:36 [1335] two corosync notice  [MAIN  ] Completed service synchronization, ready to provide service.
Then pmxcfs results in:
Oct 07 09:06:38 two pmxcfs[952]: [status] crit: cpg_send_message failed: 9
Oct 07 09:06:38 two pmxcfs[952]: [status] notice: Bad handle 0
Oct 07 09:06:38 two pmxcfs[952]: [status] crit: cpg_send_message failed: 9
Oct 07 09:06:38 two pmxcfs[952]: [status] crit: cpg_send_message failed: 9
Oct 07 09:06:38 two pmxcfs[952]: [status] notice: Bad handle 0
Oct 07 09:06:38 two pmxcfs[952]: [status] crit: cpg_send_message failed: 9
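
For reference, error 9 here is CS_ERR_BAD_HANDLE; corosync's error
numbering, abridged from corosync/corotypes.h, is:

typedef enum {
    CS_OK = 1,
    CS_ERR_LIBRARY = 2,
    CS_ERR_VERSION = 3,
    CS_ERR_INIT = 4,
    CS_ERR_TIMEOUT = 5,
    CS_ERR_TRY_AGAIN = 6,
    CS_ERR_INVALID_PARAM = 7,
    CS_ERR_NO_MEMORY = 8,
    CS_ERR_BAD_HANDLE = 9,
    /* ... */
} cs_error_t;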

After that the third node joins, but the CS_ERR_BAD_HANDLE errors persist.

Oct 07 09:06:41 [1335] two corosync notice  [TOTEM ] A new membership (10.10.1.151:100) was formed. Members joined: 1
Oct 07 09:06:41 [1335] two corosync notice  [QUORUM] This node is within the primary component and will provide service.
Oct 07 09:06:41 [1335] two corosync notice  [QUORUM] Members[2]: 1 2
Oct 07 09:06:41 [1335] two corosync notice  [MAIN  ] Completed service synchronization, ready to provide service.
Oct 07 09:06:52 [1335] two corosync notice  [TOTEM ] A new membership (10.10.1.151:104) was formed. Members joined: 3
Oct 07 09:06:52 [1335] two corosync notice  [QUORUM] Members[3]: 1 2 3
Oct 07 09:06:52 [1335] two corosync notice  [MAIN  ] Completed service synchronization, ready to provide service.

Funnily enough, the third node also has some initial problems, but is able to recover from them:
Oct 07 09:06:53 three pmxcfs[943]: [status] crit: cpg_send_message failed: 9
Oct 07 09:06:53 three pmxcfs[943]: [status] notice: Bad handle 0
Oct 07 09:06:53 three pmxcfs[943]: [status] crit: cpg_send_message failed: 9
Oct 07 09:06:53 three pmxcfs[943]: [status] crit: cpg_send_message failed: 9
Oct 07 09:06:53 three pmxcfs[943]: [status] notice: Bad handle 0
Oct 07 09:06:53 three pmxcfs[943]: [status] crit: cpg_send_message failed: 9
[...]
Oct 07 09:06:56 three pmxcfs[943]: [status] notice: update cluster info (cluster name teachings, version = 15)
Oct 07 09:06:56 three pmxcfs[943]: [status] notice: node has quorum
Oct 07 09:06:56 three pmxcfs[943]: [dcdb] notice: members: 3/943
Oct 07 09:06:56 three pmxcfs[943]: [dcdb] notice: all data is up to date
Oct 07 09:06:56 three pmxcfs[943]: [status] notice: members: 3/943
Oct 07 09:06:56 three pmxcfs[943]: [status] notice: all data is up to date



2) I disconnect the network interface on which corosync runs, and
reconnect it a bit later. This triggers the same behavior as above,
but again not every time.

Just to make sure: don't do ifdown. Corosync reacts to ifdown pretty badly. Also, NetworkManager does ifdown on cable unplug if not configured in server mode. If you want to test a network split, either use iptables (make sure to block all traffic needed by corosync, so if you are using multicast make sure to block both unicast and multicast packets on input and output - https://github.com/jfriesse/csts/blob/master/tests/inc/fw.sh), or do the blocking on the switch.

No, I'm aware of the ifdown problems and we do not use NetworkManager in Proxmox VE. I often used a 'virtual' cluster, i.e. three virtual machines (nested virtualization) in KVM/qemu, and then 'disconnected' the network interface through qemu, which should be equivalent to a physical unplug.
Corosync also recovers when 'connecting' the interface again through qemu.
Thanks for the script hint, I will do some more physical tests involving the network.

Currently I'm trying to get a somewhat reproducible test and to try it
also on bigger setups and with other possible causes; I need to do a
bit more homework here and will report back later.

Actually, smaller clusters are better for debugging, but yes, a larger setup may show the problem faster.
Ok, good to know.


cpg_mcast_joined to send out the data, but only on one node.
corosync-quorumtool shows that we have quorum, and the logs also show
a healthy connection to the corosync cluster. The failing handle is
initialized once at the initialization of our filesystem. Should it be
reinitialized on every reconnect?

Again, I'm unsure what you mean by reconnect. On corosync shutdown you
have to reconnect (I believe this is not the case here because you are
getting the error only with ~10% probability).
Yes, we reconnect to corosync, and it's not a corosync shutdown; the
whole host reboots, or the network interface goes down and comes back
up a bit later. The probability is just an estimate, but the main
problem is that I cannot reproduce it reliably.

Restarting the filesystem solves this problem; the strange thing is
that it isn't clearly reproducible and often everything works just fine.

Are there some known problems or steps we should look for?

Hard to tell, but generally:
- Make sure cpg_initialize really returns CS_OK. If not, the returned
handle is invalid (see the sketch after this list).
- Make sure there is no memory corruption and the handle is really
valid (valgrind may be helpful).
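
A minimal version of that first check might look like this (a sketch,
callbacks omitted):

#include <stdio.h>
#include <corosync/cpg.h>

/* Only treat the handle as usable when cpg_initialize() returned
 * CS_OK; otherwise its contents are undefined. */
static cs_error_t connect_cpg(cpg_handle_t *handle,
                              cpg_callbacks_t *callbacks)
{
    cs_error_t err = cpg_initialize(handle, callbacks);

    if (err != CS_OK) {
        fprintf(stderr, "cpg_initialize failed: %d\n", err);
        *handle = 0; /* make accidental reuse fail fast */
    }
    return err;
}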
The cpg_initialize checks are in place and should be OK.
Yes, I will use Valgrind, but one question ahead:

Can the handle get lost somehow? Is there a need to reinitialize CPG
with cpg_initialize/cpg_model_initialize after we have left and later
rejoined the cluster?

I'm still unsure what you mean by "after we left and later rejoined". As long as corosync is running, the client application "doesn't need to care about" membership changes; that's corosync's problem. So if a network split happens, you don't have to call cpg_initialize. The only places where cpg_initialize is needed are the initial connection and a reconnection after the corosync main process exits.
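
For illustration, such a reconnect could look roughly like this (a
hedged sketch; connect_cpg() is the helper from the previous sketch
and rejoin_groups() is a hypothetical helper re-running cpg_join()):

#include <corosync/cpg.h>

static cs_error_t connect_cpg(cpg_handle_t *handle,
                              cpg_callbacks_t *callbacks);
static void rejoin_groups(cpg_handle_t handle); /* hypothetical */

static void dispatch_loop(cpg_handle_t handle, cpg_callbacks_t *callbacks)
{
    for (;;) {
        cs_error_t err = cpg_dispatch(handle, CS_DISPATCH_ONE);

        if (err == CS_OK || err == CS_ERR_TRY_AGAIN)
            continue; /* normal operation */

        if (err == CS_ERR_LIBRARY || err == CS_ERR_BAD_HANDLE) {
            /* Connection to the corosync main process was lost: the
             * only case where a fresh cpg_initialize() is needed. */
            cpg_finalize(handle);
            if (connect_cpg(&handle, callbacks) != CS_OK)
                break;
            rejoin_groups(handle);
            continue;
        }

        break; /* unexpected error: give up */
    }
}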
What I mean is leaving the CPG group after quorum was lost and regained (unplugging and replugging the NIC).

I'm a bit confused by two things:

-) Restarting pmxcfs, thus making a clean new initialization, solves the problem, but reinitializing isn't intended, as you also stated. If I'm not mistaken, corosync even has problems when cpg_initialize is called twice from the same PID with a finalize in between.

-) The handle can only be changed by cpg_initialize, as that's the only place where it is passed by reference, so where does it get set to zero? I assume it wasn't zero from the beginning, as then cpg_initialize should have failed? (A debugging sketch for this follows below.)
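
For that second point, one way to catch such corruption would be to
shadow the handle and check it before every use (a hypothetical
debugging sketch, not pmxcfs code):

#include <assert.h>
#include <sys/uio.h>
#include <corosync/cpg.h>

/* cpg_handle_t is a plain 64-bit value that only cpg_initialize()
 * writes to; a shadow copy makes it easy to detect if something in
 * the application zeroes the real one. */
static cpg_handle_t cpg_handle;
static cpg_handle_t cpg_handle_shadow; /* set right after cpg_initialize() */

static cs_error_t send_checked(const struct iovec *iov, unsigned int iov_len)
{
    assert(cpg_handle == cpg_handle_shadow); /* catch corruption early */
    return cpg_mcast_joined(cpg_handle, CPG_TYPE_AGREED, iov, iov_len);
}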

Is there an intended way to handle a CS_ERR_BAD_HANDLE?

Thanks for your time.

Regards,
Thomas


_______________________________________________
Users mailing list: Users@clusterlabs.org
http://clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org
