>>> Stuart Massey <[email protected]> schrieb am 22.01.2021 um 14:08 in Nachricht <cabq68ntgdmxvo_uvlxg0hytlgsmrgucvcssa3ergqfov+cj...@mail.gmail.com>: > Hi Ulrich, > Thank you for your response. > It makes sense that this would be happening on the failing, secondary/slave > node, in which case we might expect drbd to be restarted (the service > entirely, since it is already demoted) on the slave. I don't understand how > it would affect the master, unless the failing secondary is causing some > issue with drbd on the primary that causes the monitor on the master to > time out for some reason. This does not (so far) seem to be the case, as > the failing node has now been in maintenance mode for a couple of days with > drbd still running as secondary, so if drbd failures on the secondary were > causing the monitor on the Master/Primary to timeout, we should still be > seeing that; we are not. The master has yet to demote the drbd resource > since we put the failing node in maintenance.
> We will watch for a bit longer.
> Thanks again
>
> On Thu, Jan 21, 2021 at 2:23 AM Ulrich Windl <
> [email protected]> wrote:
>
>> >>> Stuart Massey <[email protected]> wrote on 20.01.2021 at 03:41
>> in message
>> <cajfrb75upumzjpjxcoacrdgog-bqdcjhff5c_omvbfya53d...@mail.gmail.com>:
>> > Strahil,
>> > That is very kind of you, thanks.
>> > I see that in your (feature set 3.4.1) cib, drbd is in a <clone> with
>> > some meta_attributes and operations having to do with promotion, while
>> > in our (feature set 3.0.14) cib, drbd is in a <master> which does not
>> > have those (maybe since promotion is implicit).
>> > Our cluster has been working quite well for some time, too. I wonder
>> > what would happen if you could hang the OS in one of your nodes? If a
>> > VM, maybe
>>
>> Unless some other fencing mechanism (like a watchdog timeout) kicks in,
>> the monitor operation is the only thing that can detect a problem (from
>> the cluster's view): the monitor operation would time out. Then the
>> cluster would try to restart the resource (stop, then start). If the
>> stop also times out, the node will be fenced.
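(To put commands behind that paragraph of mine: this is only a sketch, the
resource name is taken from the log further down in this thread, and I'm
merely assuming you have stonith configured. On CentOS 7 / pcs 0.9 you can
check which timeouts and which escalation path actually apply:

    # the operation timeouts configured on the drbd primitive
    pcs resource show drbd_ourApp

    # whether a timed-out stop can escalate to fencing at all
    pcs property show stonith-enabled
    pcs stonith show

    # failed operations and fail counts as the cluster currently sees them
    crm_mon -1rf
)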
>> > the constrained secondary could be starved by setting disk IOPS to
>> > something really low. Of course, you are using different versions of
>> > just about everything, as we're on CentOS 7.
>> > Regards,
>> > Stuart
>> >
>> > On Tue, Jan 19, 2021 at 6:20 PM Strahil Nikolov <[email protected]>
>> > wrote:
>> >
>> >> I have just built a test cluster (CentOS 8.3) for testing DRBD and it
>> >> works quite fine.
>> >> Actually I followed my notes from
>> >> https://forums.centos.org/viewtopic.php?t=65539 with the exception of
>> >> point 8 due to the "promotable" stuff.
>> >>
>> >> I'm attaching the output of 'pcs cluster cib file' and I hope it helps
>> >> you fix your issue.
>> >>
>> >> Best Regards,
>> >> Strahil Nikolov
>> >>
>> >> At 09:32 -0500 on 19.01.2021 (Tue), Stuart Massey wrote:
>> >>
>> >> Ulrich,
>> >> Thank you for that observation. We share that concern.
>> >> We have 4 ea 1G NICs active, bonded in pairs. One bonded pair serves
>> >> the "public" (to the intranet) IPs, and the other bonded pair is
>> >> private to the cluster, used for drbd replication. HA will, I hope, be
>> >> using the "public" IP, since that is the route to the IP addresses
>> >> resolved for the host names; that will certainly be the only route to
>> >> the quorum device. I can say that this cluster has run reasonably well
>> >> for quite some time with this configuration prior to the recently
>> >> developed hardware issues on one of the nodes.
>> >> Regards,
>> >> Stuart
>> >>
>> >> On Tue, Jan 19, 2021 at 2:49 AM Ulrich Windl <
>> >> [email protected]> wrote:
>> >>
>> >> >>> Stuart Massey <[email protected]> wrote on 19.01.2021 at
>> >> 04:46 in message
>> >> <cabq68nqutyyxcygwcupg5txxajjwhsp+c6gcokfowgyrqsa...@mail.gmail.com>:
>> >> > So, we have a 2-node cluster with a quorum device. One of the nodes
>> >> > (node1) is having some trouble, so we have added constraints to
>> >> > prevent any resources migrating to it, but have not put it in
>> >> > standby, so that drbd in secondary on that node stays in sync. The
>> >> > problems it is having lead to OS lockups that eventually resolve
>> >> > themselves - but that causes it to be temporarily dropped from the
>> >> > cluster by the current master (node2).
>> >> > Sometimes when node1 rejoins, node2 will demote the drbd ms
>> >> > resource. That causes all resources that depend on it to be stopped,
>> >> > leading to a service outage. They are then restarted on node2, since
>> >> > they can't run on node1 (due to constraints).
>> >> > We are having a hard time understanding why this happens. It seems
>> >> > like there may be some sort of DC contention happening. Does anyone
>> >> > have any idea how we might prevent this from happening?
>> >>
>> >> I think if you are routing high-volume DRBD traffic through "the same
>> >> pipe" as the cluster communication, cluster communication may fail if
>> >> the pipe is saturated.
>> >> I'm not happy with that, but it seems to be that way.
>> >>
>> >> Maybe running a combination of iftop and iotop could help you
>> >> understand what's going on...
>> >>
>> >> Regards,
>> >> Ulrich
>> >>
>> >> > Selected messages (de-identified) from pacemaker.log that illustrate
>> >> > the suspicion re DC confusion are below. The update_dc and
>> >> > abort_transition_graph re deletion of lrm seem to always precede the
>> >> > demotion, and a demotion seems to always follow (when not already
>> >> > demoted).
>> >> >
>> >> > Jan 18 16:52:17 [21938] node02.example.com crmd: info:
>> >> > do_dc_takeover: Taking over DC status for this partition
>> >> > Jan 18 16:52:17 [21938] node02.example.com crmd: info: update_dc:
>> >> > Set DC to node02.example.com (3.0.14)
>> >> > Jan 18 16:52:17 [21938] node02.example.com crmd: info:
>> >> > abort_transition_graph: Transition aborted by deletion of
>> >> > lrm[@id='1']: Resource state removal | cib=0.89.327
>> >> > source=abort_unless_down:357
>> >> > path=/cib/status/node_state[@id='1']/lrm[@id='1'] complete=true
>> >> > Jan 18 16:52:19 [21937] node02.example.com pengine: info:
>> >> > master_color: ms_drbd_ourApp: Promoted 0 instances of a possible 1
>> >> > to master
>> >> > Jan 18 16:52:19 [21937] node02.example.com pengine: notice: LogAction:
>> >> >  * Demote drbd_ourApp:1 ( Master -> Slave node02.example.com )
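PS: If you want to see why the policy engine chose that demote, it saves its
input for every transition (it logs the pe-input file path when it calculates
the transition). Replaying that file shows the allocation and promotion
scores; something along these lines, with the file name being whatever your
log actually references:

    # show scores and the resulting actions for the transition in question
    crm_simulate -Ss -x /var/lib/pacemaker/pengine/pe-input-NNN.bz2

    # or look at the scores of the live cluster right now
    crm_simulate -Ls

That, together with iftop/iotop while node1 is flapping, might tell you
whether it is really a scoring/membership issue or just starved communication.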
