Re: [ClusterLabs] corosync - CS_ERR_BAD_HANDLE when multiple nodes are starting up

Thomas Lamprecht Sat, 03 Oct 2015 23:50:20 -0700

Hi,

thanks for the response!
I added some information and clarification below.


On 10/01/2015 09:23 AM, Jan Friesse wrote:

Hi,

Thomas Lamprecht napsal(a):

Hello,

we are using corosync version needle (2.3.5) for our cluster filesystem
(pmxcfs).
The situation is the following. First we start up pmxcfs, which is
a FUSE filesystem. If there is a cluster configuration, we also start
corosync.
This allows the filesystem to exist on one-node 'clusters' or to be forced
into a local mode. We use CPG to send our messages to all members,
the filesystem lives in RAM and all fs operations are sent 'over the
wire'.

The problem is now the following:
When we restart all nodes (3 in my test case) at the same time, in
roughly 1 out of 10 cases I get only CS_ERR_BAD_HANDLE back when calling

I'm really unsure how to understand what you are doing. You are restarting all nodes and get CS_ERR_BAD_HANDLE? I mean, if you are restarting all nodes, which node returns CS_ERR_BAD_HANDLE? Or are you restarting just pmxcfs? Or just corosync?

To clarify, sorry, I was a bit unspecific. I can see the error behaviour in two cases:

1) I restart three physical hosts (= nodes) at the same time. One of them, normally the last one coming up again, joins the corosync cluster successfully and the filesystem (pmxcfs) notices that, but then cpg_mcast_joined returns only CS_ERR_BAD_HANDLE errors.

2) I disconnect the network interface on which corosync runs and reconnect it a bit later. This triggers the same behaviour as above, but also not every time.

Currently I'm trying to get a somewhat reproducible test, and to try it on bigger setups and with other possible causes. I need to do a bit more homework here and will report back later.

cpg_mcast_joined to send out the data, but only on one node.
corosync-quorumtool shows that we have quorum, and the logs also
show a healthy connection to the corosync cluster. The failing handle is
initialized once at the initialization of our filesystem. Should it be
reinitialized on every reconnect?

Again, I'm unsure what you mean by reconnect. On Corosync shutdown you have to reconnect (I believe this is not the case because you are getting the error only with 10% probability).

Yes, we reconnect to Corosync, and it's not a corosync shutdown: the whole host reboots, or the network interface goes down and then comes up again a bit later. The probability is just an estimate, but the main problem is that I cannot reproduce it reliably.

Restarting the filesystem solves this problem. The strange thing is that
it isn't reliably reproducible and often works just fine.

Are there some known problems or steps we should look for?

Hard to tell but generally:
- Make sure cpg_init really returns CS_OK. If not, the returned handle is invalid.
- Make sure there is no memory corruption and the handle is really valid (valgrind may be helpful).

The cpg_init checks are in place and should be OK.
Yes, I will use Valgrind, but one question first:

Can the handle get lost somehow? Is there a need to reinitialize the cpg with cpg_initialize/cpg_model_initialize after we left and later rejoined the cluster?
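
To make the question concrete, here is a minimal sketch of what "reinitialize on a bad handle" could look like. It is an illustration only, not how pmxcfs or libcpg actually behave: all sketch_* and fake_* names are hypothetical, the types and error names are stand-ins modelled on those in <corosync/cpg.h> and <corosync/corotypes.h> so that the snippet compiles without corosync installed, and whether a reconnect is really required in this situation is exactly the open question. In real code, connect would presumably wrap cpg_model_initialize plus cpg_join, send would wrap cpg_mcast_joined, and disconnect would wrap cpg_finalize.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Stand-ins so the sketch compiles on its own; names only mirror the
 * corosync ones (CS_OK, CS_ERR_LIBRARY, CS_ERR_BAD_HANDLE). */
typedef uint64_t sketch_handle_t;
typedef enum { SK_OK, SK_ERR_LIBRARY, SK_ERR_BAD_HANDLE, SK_ERR_OTHER } sketch_err_t;

/* Hypothetical operations table wrapping the real library calls. */
struct sketch_ops {
    sketch_err_t (*connect)(sketch_handle_t *handle);
    sketch_err_t (*send)(sketch_handle_t handle, const void *buf, size_t len);
    void (*disconnect)(sketch_handle_t handle);
};

struct sketch_conn {
    const struct sketch_ops *ops;
    sketch_handle_t handle;
    int connected;
};

/* Send once; if the handle is rejected, drop it, obtain a fresh one and
 * retry, at most max_reconnects times. Other errors are returned as-is. */
sketch_err_t sketch_send(struct sketch_conn *c, const void *buf, size_t len,
                         int max_reconnects)
{
    assert(c != NULL && c->ops != NULL);
    sketch_err_t err = SK_ERR_BAD_HANDLE;

    for (int attempt = 0; attempt <= max_reconnects; attempt++) {
        if (!c->connected) {
            err = c->ops->connect(&c->handle);
            if (err != SK_OK)
                continue;              /* a failed connect uses up an attempt */
            c->connected = 1;
        }
        err = c->ops->send(c->handle, buf, len);
        if (err != SK_ERR_BAD_HANDLE && err != SK_ERR_LIBRARY)
            return err;                /* success, or nothing a reconnect fixes */
        c->ops->disconnect(c->handle); /* handle is stale: forget it */
        c->connected = 0;
    }
    return err;
}

/* Tiny fake backend so the logic can be exercised without a cluster:
 * only the most recently issued handle is accepted. */
static sketch_handle_t fake_valid = 0;
static int fake_connects = 0;

static sketch_err_t fake_connect(sketch_handle_t *h)
{
    fake_connects++;
    *h = ++fake_valid;
    return SK_OK;
}

static sketch_err_t fake_send(sketch_handle_t h, const void *buf, size_t len)
{
    (void)buf; (void)len;
    return (h != 0 && h == fake_valid) ? SK_OK : SK_ERR_BAD_HANDLE;
}

static void fake_disconnect(sketch_handle_t h) { (void)h; }

static const struct sketch_ops fake_ops = { fake_connect, fake_send, fake_disconnect };
```

With the fake backend, bumping fake_valid simulates a handle going stale. A call with max_reconnects set to 0 then reports the bad handle to the caller, while a call with 1 transparently picks up a fresh handle and succeeds. If a reconnect like this turned out to hide the problem, the handle itself would still be worth checking with Valgrind as suggested above.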


Regards,
  Honza



_______________________________________________
Users mailing list: [email protected]
http://clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org



