Hi,

On 10/11/2016 02:18 PM, Ulrich Windl wrote:
>>>> emmanuel segura <emi2f...@gmail.com> wrote on 10.10.2016 at 16:49 in message
<CAE7pJ3CBJR3pctT3N_jaMCXBuUGD3nta=ya8fznbnfaifk3...@mail.gmail.com>:


Node h01 (old DC) was fenced at Oct 10 10:06:33
Node h01 went down around Oct 10 10:06:37.
DLM noticed that on node h05:
Oct 10 10:06:44 h05 cluster-dlm[12063]: dlm_process_node: Removed inactive node 739512321: born-on=3180, last-seen=3208, this-event=3212, last-event=3208
cLVM and OCFS2 also noticed the event:
Oct 10 10:06:44 h05 ocfs2_controld[12147]: Sending notification of node 739512321 for "490B9FCAFA3D4B2F9A586A5893E00730"
Oct 10 10:06:44 h05 ocfs2_controld[12147]: Notified for "490B9FCAFA3D4B2F9A586A5893E00730", node 739512321, status 0

Similarly on node h10 (new DC):
Oct 10 10:06:44 h10 cluster-dlm[32150]: dlm_process_node: Removed inactive node 739512321: born-on=3180, last-seen=3208, this-event=3212, last-event=3208
Oct 10 10:06:44 h10 ocfs2_controld[32271]:   notice: crm_update_peer_state: plugin_handle_membership: Node h01[739512321] - state is now lost (was member)
Oct 10 10:06:44 h10 ocfs2_controld[32271]: node daemon left 739512321
Oct 10 10:06:44 h10 ocfs2_controld[32271]: Sending notification of node 739512321 for "490B9FCAFA3D4B2F9A586A5893E00730"

My point is this: For a resource that can only run exclusively on one node, 
it's important that the other node is down before taking action. But cLVM 
and OCFS2 resources can run concurrently on every node, so I don't see why 
every node virtually freezes until STONITH has completed.

OCFS2 uses DLM (fs/dlm in the kernel), and DLM uses the CPG service provided by corosync [1] to track node membership.
The membership only becomes stable after STONITH has completed.

[1] https://en.wikipedia.org/wiki/Corosync_Cluster_Engine
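The membership-driven removal seen in the dlm_process_node log lines above can be sketched with a toy model (plain Python, purely illustrative; the real logic lives in dlm_controld, and the function and field names below are made up):

```python
# Toy model of membership tracking driven by a CPG configuration change.
# NOT the real dlm_controld code; it only illustrates the idea that a node
# missing from the new member list is treated as inactive and removed.

def process_confchg(known_nodes, member_list, event_seq):
    """Remove nodes that are no longer in the CPG member list."""
    removed = []
    for nodeid, info in list(known_nodes.items()):
        if nodeid not in member_list:
            removed.append(nodeid)
            print(f"Removed inactive node {nodeid}: "
                  f"born-on={info['born_on']}, last-seen={info['last_seen']}, "
                  f"this-event={event_seq}")
            del known_nodes[nodeid]
    # Nodes still present are marked as seen in this event.
    for nodeid in member_list:
        known_nodes.setdefault(nodeid, {"born_on": event_seq, "last_seen": event_seq})
        known_nodes[nodeid]["last_seen"] = event_seq
    return removed

# Example mirroring the log above: node 739512321 (h01) leaves at event 3212.
# The second node id is hypothetical, standing in for h05.
nodes = {
    739512321: {"born_on": 3180, "last_seen": 3208},
    739512325: {"born_on": 3180, "last_seen": 3208},
}
gone = process_confchg(nodes, member_list={739512325}, event_seq=3212)
```

Until STONITH confirms the lost node is really down, DLM cannot safely hand its locks to the survivors, which is why everything appears to freeze in the meantime.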

If you have a large cluster (maybe 100 nodes), OCFS will be unavailable most of 
the time if any node fails.
The upper limit is 32 nodes AFAIK. But I think it's unusual to see clusters 
with more than 3 nodes?

Assuming such a case exists, yes, recovery from a node failure will take 
much longer.

Assuming node h01 was still alive when communication failed, wouldn't quorum 
prevent h01 from doing anything with DLM and OCFS2 anyway?
Not sure I understand you correctly. By default, losing quorum will make DLM stop service. See `man dlm_controld`:
```
--enable_quorum_lockspace 0|1
               enable/disable quorum requirement for lockspace operations
```
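For reference, the same knob can also be set persistently in /etc/dlm/dlm.conf instead of on the dlm_controld command line (a sketch, assuming the config-file key matches the option name without the leading dashes, as dlm.conf keys generally do; check `man dlm.conf` on your distribution):

```
# /etc/dlm/dlm.conf
# Allow lockspace operations to continue without quorum. Use with care:
# this weakens the protection that quorum provides against split-brain.
enable_quorum_lockspace=0
```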

Eric

Regards,
Ulrich




_______________________________________________
Users mailing list: Users@clusterlabs.org
http://clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


