On Wed, 27 Sep 2017, Jan Friesse wrote:

I don't think scheduling is the cause. If the scheduler were the problem,
the other message ("Corosync main process was not scheduled for ...") would
kick in. This looks more like something being blocked in totemsrp.

Ah, interesting!

Also, it looks like the side effect is that corosync drops important
messages (I think "join" messages?), and I fear that this can lead to

Do you mean membership join messages? There are a lot of them (327)
in the log you've sent.

Yes. In my test setup I didn't see any issue where we lost membership join
messages, but the reason I am looking into this is the following:

We had one problem on a real deployment of DLM+corosync (5 voters and 20
non-voters, with dlm on those 20, for a specific application that uses

What do you mean by voters and non-voters? There are 25 nodes in total, and each of them is running corosync?

libdlm). On a reboot of one server running just corosync (which thus did
NOT run dlm), a large number of other servers got briefly evicted from the

This is kind of weird. AFAIK DLM joins a CPG group and uses CPG membership. So if DLM was not running on that node, the other nodes joined to the DLM CPG group should not even notice it leaving.

corosync ring; and when rejoining, dlm complained about a "stateful merge",
which forces a reboot. Note: dlm fencing is disabled.

In that system, it was "legal" for corosync to kick out these servers
(they had zero votes), but it was highly unexpected (they were running
fine) and the impact was high (a reboot).

What do you mean by zero vote? Do you mean a DLM vote, or the corosync number of votes (related to quorum)?
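(For clarity, if "zero vote" refers to corosync quorum votes, the setup being described would look roughly like the sketch below. This is a hypothetical corosync.conf fragment, not the actual config from the deployment; node names and nodeids are made up.)

```
# Hypothetical corosync.conf fragment: "voters" carry quorum_votes: 1,
# "non-voters" carry quorum_votes: 0 and so never affect quorum,
# even though they are full ring members.
nodelist {
    node {
        ring0_addr: voter1
        nodeid: 1
        quorum_votes: 1
    }
    node {
        ring0_addr: nonvoter1
        nodeid: 6
        quorum_votes: 0
    }
}

quorum {
    provider: corosync_votequorum
}
```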


We did see "Process pause detected" in the logs on that system when the
incident happened, which is why I think it could be a clue.

I've tried to reproduce the problem and was not successful with a 3-node cluster using a more or less default config (not changing join/consensus/...). I'll try 5 nodes, possibly with tuned totem values, and see if the problem appears.

Regards,
  Honza


I'll definitely try to reproduce this bug and let you know. I don't
think any messages get lost, but it's better to be on the safe side.

Thanks!


Cheers,
JM



_______________________________________________
Users mailing list: Users@clusterlabs.org
http://lists.clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org
