Sorry for the top posting. My iSCSILogicalUnit is blocking failover on "standby" (I think it's a bug in the resource), yet without it, drbd fails over properly:

[root@drbd1 ~]# pcs resource show DRBD
 Resource: DRBD (class=ocf provider=linbit type=drbd)
  Attributes: drbd_resource=drbd0
  Operations: demote interval=0s timeout=90 (DRBD-demote-interval-0s)
              monitor interval=30 role=Slave (DRBD-monitor-interval-30)
              monitor interval=15 role=Master (DRBD-monitor-interval-15)
              notify interval=0s timeout=90 (DRBD-notify-interval-0s)
              promote interval=0s timeout=90 (DRBD-promote-interval-0s)
              reload interval=0s timeout=30 (DRBD-reload-interval-0s)
              start interval=0s timeout=240 (DRBD-start-interval-0s)
              stop interval=0s timeout=100 (DRBD-stop-interval-0s)

[root@drbd1 ~]# pcs resource show DRBD-CLONE
 Master: DRBD-CLONE
  Meta Attrs: clone-max=3 clone-node-max=1 master-max=1 master-node-max=1 notify=true
  Resource: DRBD (class=ocf provider=linbit type=drbd)
   Attributes: drbd_resource=drbd0
   Operations: demote interval=0s timeout=90 (DRBD-demote-interval-0s)
               monitor interval=30 role=Slave (DRBD-monitor-interval-30)
               monitor interval=15 role=Master (DRBD-monitor-interval-15)
               notify interval=0s timeout=90 (DRBD-notify-interval-0s)
               promote interval=0s timeout=90 (DRBD-promote-interval-0s)
               reload interval=0s timeout=30 (DRBD-reload-interval-0s)
               start interval=0s timeout=240 (DRBD-start-interval-0s)
               stop interval=0s timeout=100 (DRBD-stop-interval-0s)
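For anyone hitting the same symptom: the usual way to tie an iSCSILogicalUnit to the DRBD master is one colocation plus one ordering constraint, roughly like the sketch below ("ISCSI-LUN" is only a placeholder name, not a resource from this cluster):

# colocate the LUN with the DRBD master, and only start it after promotion
pcs constraint colocation add ISCSI-LUN with master DRBD-CLONE INFINITY
pcs constraint order promote DRBD-CLONE then start ISCSI-LUN

Without the ordering on the promote action, a demote on the standby side can leave the LUN blocking the failover as described above.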
Best Regards,
Strahil Nikolov

On Thu, 21.01.2021 at 23:30 -0500, Stuart Massey wrote:
> Hi Ulrich,
> Thank you for your response.
> It makes sense that this would be happening on the failing,
> secondary/slave node, in which case we might expect drbd to be
> restarted (entirely, since it is already demoted) on the slave. I
> don't see how it would affect the master, unless the failing
> secondary is causing some issue with drbd on the primary that causes
> the monitor on the master to time out for some reason. This does not
> (so far) seem to be the case, as the failing node has now been in
> maintenance mode for a couple of days with drbd still running as
> secondary, so if drbd failures on the secondary were causing the
> monitor on the Master/Primary to time out, we should still be seeing
> that; we are not. The master has yet to demote the drbd resource
> since we put the failing node in maintenance.
> We will watch for a bit longer.
> Thanks again
>
> On Thu, Jan 21, 2021, 2:23 AM Ulrich Windl
> <[email protected]> wrote:
> > >>> Stuart Massey <[email protected]> wrote on 20.01.2021 at 03:41
> > in message
> > <cajfrb75upumzjpjxcoacrdgog-bqdcjhff5c_omvbfya53d...@mail.gmail.com>:
> > > Strahil,
> > > That is very kind of you, thanks.
> > > I see that in your (feature set 3.4.1) cib, drbd is in a <clone>
> > > with some meta_attributes and operations having to do with
> > > promotion, while in our (feature set 3.0.14) cib, drbd is in a
> > > <master> which does not have those (maybe since promotion is
> > > implicit).
> > > Our cluster has been working quite well for some time, too. I
> > > wonder what would happen if you could hang the OS in one of your
> > > nodes? If a VM, maybe
> >
> > Unless some other fencing mechanism (like a watchdog timeout) kicks
> > in, the monitor operation is the only thing that can detect a
> > problem (from the cluster's view): the monitor operation would time
> > out. Then the cluster would try to restart the resource (stop, then
> > start). If stop also times out, the node will be fenced.
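> > For example, if monitor/stop were timing out only because the node
> > is slow rather than dead, one could lengthen those timeouts
> > (values below are purely illustrative; verify the result with
> > "pcs resource show DRBD" afterwards):
> >
> > pcs resource update DRBD op monitor interval=15 role=Master timeout=60
> > pcs resource update DRBD op stop interval=0s timeout=180
> >
> > Fencing still remains the only safe outcome once stop really fails.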
> > > the constrained secondary could be starved by setting disk IOPs
> > > to something really low. Of course, you are using different
> > > versions of just about everything, as we're on CentOS 7.
> > > Regards,
> > > Stuart
> > >
> > > On Tue, Jan 19, 2021 at 6:20 PM Strahil Nikolov
> > > <[email protected]> wrote:
> > > > I have just built a test cluster (CentOS 8.3) for testing DRBD
> > > > and it works quite fine.
> > > > Actually I followed my notes from
> > > > https://forums.centos.org/viewtopic.php?t=65539 with the
> > > > exception of point 8 due to the "promotable" stuff.
> > > >
> > > > I'm attaching the output of 'pcs cluster cib file' and I hope
> > > > it helps you fix your issue.
> > > >
> > > > Best Regards,
> > > > Strahil Nikolov
> > > >
> > > > On Tue, 19.01.2021 at 09:32 -0500, Stuart Massey wrote:
> > > > > Ulrich,
> > > > > Thank you for that observation. We share that concern.
> > > > > We have 4 1G NICs active, bonded in pairs. One bonded pair
> > > > > serves the "public" (to the intranet) IPs, and the other
> > > > > bonded pair is private to the cluster, used for drbd
> > > > > replication. HA will, I hope, be using the "public" IP,
> > > > > since that is the route to the IP addresses resolved for
> > > > > the host names; that will certainly be the only route to
> > > > > the quorum device. I can say that this cluster has run
> > > > > reasonably well for quite some time with this configuration
> > > > > prior to the recently developed hardware issues on one of
> > > > > the nodes.
> > > > > Regards,
> > > > > Stuart
> > > > >
> > > > > On Tue, Jan 19, 2021 at 2:49 AM Ulrich Windl
> > > > > <[email protected]> wrote:
> > > > > > >>> Stuart Massey <[email protected]> wrote on
> > > > > > 19.01.2021 at 04:46 in message
> > > > > > <cabq68nqutyyxcygwcupg5txxajjwhsp+c6gcokfowgyrqsa...@mail.gmail.com>:
> > > > > > > So, we have a 2-node cluster with a quorum device. One
> > > > > > > of the nodes (node1) is having some trouble, so we have
> > > > > > > added constraints to prevent any resources migrating to
> > > > > > > it, but have not put it in standby, so that drbd in
> > > > > > > secondary on that node stays in sync. The problems it
> > > > > > > is having lead to OS lockups that eventually resolve
> > > > > > > themselves - but that causes it to be temporarily
> > > > > > > dropped from the cluster by the current master (node2).
> > > > > > > Sometimes when node1 rejoins, node2 will demote the
> > > > > > > drbd ms resource. That causes all resources that depend
> > > > > > > on it to be stopped, leading to a service outage. They
> > > > > > > are then restarted on node2, since they can't run on
> > > > > > > node1 (due to constraints).
> > > > > > > We are having a hard time understanding why this
> > > > > > > happens. It seems like there may be some sort of DC
> > > > > > > contention happening. Does anyone have any idea how we
> > > > > > > might prevent this from happening?
> > > > > >
> > > > > > I think if you are routing high-volume DRBD traffic
> > > > > > through "the same pipe" as the cluster communication,
> > > > > > cluster communication may fail if the pipe is saturated.
> > > > > > I'm not happy with that, but it seems to be that way.
> > > > > >
> > > > > > Maybe running a combination of iftop and iotop could help
> > > > > > you understand what's going on...
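> > > > > > Something along these lines, run on both nodes while the
> > > > > > problem is occurring (the interface name is just an
> > > > > > example; use your replication bond):
> > > > > >
> > > > > > iftop -i bond1    # per-connection traffic on the DRBD link
> > > > > > iotop -o -d 5     # only processes actually doing I/O, 5s refresh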
> > > > > > Regards,
> > > > > > Ulrich
> > > > > >
> > > > > > > Selected messages (de-identified) from pacemaker.log
> > > > > > > that illustrate our suspicion of DC confusion are
> > > > > > > below. The update_dc and abort_transition_graph re
> > > > > > > deletion of lrm seem to always precede the demotion,
> > > > > > > and a demotion seems to always follow (when not already
> > > > > > > demoted).
> > > > > > >
> > > > > > > Jan 18 16:52:17 [21938] node02.example.com  crmd:  info: do_dc_takeover:  Taking over DC status for this partition
> > > > > > > Jan 18 16:52:17 [21938] node02.example.com  crmd:  info: update_dc:  Set DC to node02.example.com (3.0.14)
> > > > > > > Jan 18 16:52:17 [21938] node02.example.com  crmd:  info: abort_transition_graph:  Transition aborted by deletion of lrm[@id='1']: Resource state removal | cib=0.89.327 source=abort_unless_down:357 path=/cib/status/node_state[@id='1']/lrm[@id='1'] complete=true
> > > > > > > Jan 18 16:52:19 [21937] node02.example.com  pengine:  info: master_color:  ms_drbd_ourApp: Promoted 0 instances of a possible 1 to master
> > > > > > > Jan 18 16:52:19 [21937] node02.example.com  pengine:  notice: LogAction:  * Demote  drbd_ourApp:1  ( Master -> Slave node02.example.com )
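> > > > > > > One way to check for DC contention is to ask each node
> > > > > > > which host it currently considers the DC, and to grep
> > > > > > > the log for takeover events, for example:
> > > > > > >
> > > > > > > crmadmin -D    # prints the node this host believes is DC
> > > > > > > grep -E 'do_dc_takeover|update_dc' pacemaker.log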
_______________________________________________
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users

ClusterLabs home: https://www.clusterlabs.org/
