>>> Ferenc Wágner <wf...@niif.hu> wrote on 28.08.2017 at 18:07 in message <87mv6jk75r....@lant.ki.iif.hu>:
[...] cLVM under I/O load can be really slow (I'm talking about delays in
the range of a few seconds). Be sure to have any timeouts adjusted
accordingly (a note on that follows below the quoted message). I wrote a
tool that lets me monitor the read latency as seen by applications, so I
know these numbers (a rough sketch of such a probe is also included
below). And things get significantly worse if you do cLVM mirroring with
a mirrorlog replicated to each device. Maybe cLVM slows down as n^2,
where n is the number of nodes; I don't know ;-)

Regards,
Ulrich

> So Pacemaker does nothing, basically, and I can't see any adverse effect
> to resource management, but DLM seems to have some problem, which may or
> may not be related. When the TOTEM error appears, all nodes log this:
>
> vhbl03 dlm_controld[3914]: 2801675 dlm:controld ring 167773705:3056 6 memb
> 167773705 167773706 167773707 167773708 167773709 167773710
> vhbl03 dlm_controld[3914]: 2801675 fence work wait for cluster ringid
> vhbl03 dlm_controld[3914]: 2801675 dlm:ls:clvmd ring 167773705:3056 6 memb
> 167773705 167773706 167773707 167773708 167773709 167773710
> vhbl03 dlm_controld[3914]: 2801675 clvmd wait_messages cg 9 need 1 of 6
> vhbl03 dlm_controld[3914]: 2801675 fence work wait for cluster ringid
> vhbl03 dlm_controld[3914]: 2801675 cluster quorum 1 seq 3056 nodes 6
>
> dlm_controld is running with --enable_fencing=0. Pacemaker does its own
> fencing if resource management requires it, but DLM is used by cLVM
> only, which does not warrant such harsh measures. Right now cLVM is
> blocked; I don't know since when, because we seldom do cLVM operations
> on this cluster. My immediate aim is to unblock cLVM somehow.
>
> While dlm_tool status reports (similar on all nodes):
>
> cluster nodeid 167773705 quorate 1 ring seq 3088 3088
> daemon now 2941405 fence_pid 0
> node 167773705 M add 196 rem 0 fail 0 fence 0 at 0 0
> node 167773706 M add 5960 rem 5730 fail 0 fence 0 at 0 0
> node 167773707 M add 2089 rem 1802 fail 0 fence 0 at 0 0
> node 167773708 M add 3646 rem 3413 fail 0 fence 0 at 0 0
> node 167773709 M add 2588921 rem 2588920 fail 0 fence 0 at 0 0
> node 167773710 M add 196 rem 0 fail 0 fence 0 at 0 0
>
> dlm_tool ls shows "kern_stop":
>
> dlm lockspaces
> name clvmd
> id 0x4104eefa
> flags 0x00000004 kern_stop
> change member 5 joined 0 remove 1 failed 1 seq 8,8
> members 167773705 167773706 167773707 167773708 167773710
> new change member 6 joined 1 remove 0 failed 0 seq 9,9
> new status wait messages 1
> new members 167773705 167773706 167773707 167773708 167773709 167773710
>
> on all nodes except for vhbl07 (167773709), where it gives
>
> dlm lockspaces
> name clvmd
> id 0x4104eefa
> flags 0x00000000
> change member 6 joined 1 remove 0 failed 0 seq 11,11
> members 167773705 167773706 167773707 167773708 167773709 167773710
>
> instead.
>
> Does anybody have an idea what the problem(s) might be? Why is Corosync
> deteriorating on this cluster? (It's running with RR PRIO 99.) Could
> that have hurt DLM? Is there a way to unblock DLM without rebooting all
> nodes? (Actually, rebooting is problematic in itself with blocked cLVM,
> but that's tractable.)
> --
> Thanks,
> Feri
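
For illustration only -- this is not the monitoring tool I mentioned
above, and the LV path is a placeholder -- a crude application-level
read-latency probe can be as simple as timing one small direct-I/O read
per second:

  #!/bin/sh
  # Placeholder path: point it at any clustered LV that is safe to read.
  LV=/dev/vg_cluster/lv_scratch
  while sleep 1; do
      # One 4 KiB read with the page cache bypassed (iflag=direct), so the
      # measured time reflects the real device/cLVM path, not cached data.
      t=$( { /usr/bin/time -f '%e' \
             dd if="$LV" of=/dev/null bs=4k count=1 iflag=direct; } 2>&1 | tail -n 1 )
      echo "$(date '+%F %T')  read latency ${t}s"
  done

If the printed values regularly get close to the timeouts of your monitor
operations (or of corosync/DLM), those timeouts are too tight for the load.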
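
And on the timeout side: which knobs matter depends on your stack, but
when TOTEM retransmit errors show up under I/O load, the corosync totem
token timeout (totem { token: ... } in corosync.conf, in milliseconds) is
a usual first candidate. Assuming corosync 2.x, the currently effective
value can be read at runtime:

  # effective token timeout (ms) as corosync is running right now
  corosync-cmapctl runtime.config.totem.token

Only raise it after measuring, and keep corosync.conf identical on all
nodes; an oversized token timeout just delays failure detection.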