>>> Stuart Massey <stuart.e.mas...@gmail.com> wrote on 20.01.2021 at 03:41 in
message <cajfrb75upumzjpjxcoacrdgog-bqdcjhff5c_omvbfya53d...@mail.gmail.com>:
> Strahil,
> That is very kind of you, thanks.
> I see that in your (feature set 3.4.1) cib, drbd is in a <clone> with some
> meta_attributes and operations having to do with promotion, while in our
> (feature set 3.0.14) cib, drbd is in a <master> which does not have those
> (maybe since promotion is implicit).
> Our cluster has been working quite well for some time, too. I wonder what
> would happen if you could hang the OS in one of your nodes? If a VM, maybe
Unless some other fencing mechanism (like a watchdog timeout) kicks in, the
monitor operation is the only thing that can detect a problem (from the
cluster's view): the monitor operation would time out. The cluster would then
try to restart the resource (stop, then start). If the stop also times out,
the node will be fenced.

> the constrained secondary could be starved by setting disk IOPS to
> something really low. Of course, you are using different versions of just
> about everything, as we're on CentOS 7.
> Regards,
> Stuart
>
> On Tue, Jan 19, 2021 at 6:20 PM Strahil Nikolov <hunter86...@yahoo.com>
> wrote:
>
>> I have just built a test cluster (CentOS 8.3) for testing DRBD and it
>> works quite fine.
>> Actually I followed my notes from
>> https://forums.centos.org/viewtopic.php?t=65539 with the exception of
>> point 8 due to the "promotable" stuff.
>>
>> I'm attaching the output of 'pcs cluster cib file' and I hope it helps
>> you fix your issue.
>>
>> Best Regards,
>> Strahil Nikolov
>>
>>
>> On 19.01.2021 (Tue) at 09:32 -0500, Stuart Massey wrote:
>>
>> Ulrich,
>> Thank you for that observation. We share that concern.
>> We have 4 ea 1G NICs active, bonded in pairs. One bonded pair serves the
>> "public" (to the intranet) IPs, and the other bonded pair is private to
>> the cluster, used for DRBD replication. HA will, I hope, be using the
>> "public" IP, since that is the route to the IP addresses resolved for
>> the host names; that will certainly be the only route to the quorum
>> device. I can say that this cluster has run reasonably well for quite
>> some time with this configuration prior to the recently developed
>> hardware issues on one of the nodes.
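As a rough sketch of the escalation path described above (the resource name
"drbd_data" and all interval/timeout values are hypothetical examples, not
taken from either poster's configuration):

```shell
# The monitor interval and timeout bound how quickly a hung resource is
# noticed; a monitor timeout triggers the restart (stop, then start) attempt.
pcs resource update drbd_data op monitor interval=20s timeout=30s role=Master
pcs resource update drbd_data op monitor interval=30s timeout=30s role=Slave

# Fencing must be enabled for the "stop also timed out -> fence the node"
# step to actually happen:
pcs property set stonith-enabled=true
```

With stonith disabled, a stop timeout leaves the cluster stuck instead of
recovering, which is why fencing is the backstop of the whole sequence.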
>> Regards,
>> Stuart
>>
>> On Tue, Jan 19, 2021 at 2:49 AM Ulrich Windl <
>> ulrich.wi...@rz.uni-regensburg.de> wrote:
>>
>> >>> Stuart Massey <djangosc...@gmail.com> wrote on 19.01.2021 at 04:46 in
>> message
>> <cabq68nqutyyxcygwcupg5txxajjwhsp+c6gcokfowgyrqsa...@mail.gmail.com>:
>> > So, we have a 2-node cluster with a quorum device. One of the nodes
>> > (node1) is having some trouble, so we have added constraints to prevent
>> > any resources migrating to it, but have not put it in standby, so that
>> > the DRBD secondary on that node stays in sync. The problems it is
>> > having lead to OS lockups that eventually resolve themselves - but
>> > that causes it to be temporarily dropped from the cluster by the
>> > current master (node2).
>> > Sometimes when node1 rejoins, node2 will demote the DRBD ms resource.
>> > That causes all resources that depend on it to be stopped, leading to
>> > a service outage. They are then restarted on node2, since they can't
>> > run on node1 (due to constraints).
>> > We are having a hard time understanding why this happens. It seems like
>> > there may be some sort of DC contention happening. Does anyone have any
>> > idea how we might prevent this from happening?
>>
>> I think if you are routing high-volume DRBD traffic through "the same
>> pipe" as the cluster communication, cluster communication may fail if the
>> pipe is saturated.
>> I'm not happy with that, but it seems to be that way.
>>
>> Maybe running a combination of iftop and iotop could help you understand
>> what's going on...
>>
>> Regards,
>> Ulrich
>>
>> > Selected messages (de-identified) from pacemaker.log that illustrate
>> > the suspicion of DC confusion are below. The update_dc and
>> > abort_transition_graph re deletion of lrm seem to always precede the
>> > demotion, and a demotion seems to always follow (when not already
>> > demoted).
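For context, the kind of constraint Stuart describes (keeping resources off
node1 without putting it in standby, so node1's DRBD secondary keeps
replicating) might look like the following sketch; the group name
"grp_ourApp" and the node name are hypothetical:

```shell
# A -INFINITY location constraint keeps the group off node1 but, unlike
# putting the node in standby, leaves node1 a full cluster member, so its
# DRBD secondary continues to replicate:
pcs constraint location grp_ourApp avoids node01.example.com

# 'pcs resource ban' creates the same kind of constraint as a one-liner:
pcs resource ban grp_ourApp node01.example.com
```

Note that the promotable DRBD resource itself is left unconstrained here;
banning it outright would stop the secondary and defeat the purpose.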
>> >
>> > Jan 18 16:52:17 [21938] node02.example.com crmd: info:
>> > do_dc_takeover: Taking over DC status for this partition
>> > Jan 18 16:52:17 [21938] node02.example.com crmd: info: update_dc:
>> > Set DC to node02.example.com (3.0.14)
>> > Jan 18 16:52:17 [21938] node02.example.com crmd: info:
>> > abort_transition_graph: Transition aborted by deletion of
>> > lrm[@id='1']: Resource state removal | cib=0.89.327
>> > source=abort_unless_down:357
>> > path=/cib/status/node_state[@id='1']/lrm[@id='1'] complete=true
>> > Jan 18 16:52:19 [21937] node02.example.com pengine: info:
>> > master_color: ms_drbd_ourApp: Promoted 0 instances of a possible 1 to
>> > master
>> > Jan 18 16:52:19 [21937] node02.example.com pengine: notice: LogAction:
>> > * Demote drbd_ourApp:1 ( Master -> Slave node02.example.com )

_______________________________________________
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users

ClusterLabs home: https://www.clusterlabs.org/