*** HELP ***

Our healthy Primary/Master demoted itself again. This time it did not re-promote anything until we "refresh"-ed the ms drbd resource. Note that the failing Slave/Secondary node is in maintenance mode, as it has been for several days now.

I have posted the pacemaker.log here:
http://project.ibss.net/samples/deidPacemakeLog.2021-01-25.txt

Any insight anyone could offer would be very much appreciated!
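
In case the exact commands matter, the sequence on our side was roughly the following (CentOS 7, pcs 0.9.x). This is reconstructed from memory rather than copied from the shell history, and "node01.example.com" just stands in for the failing node's real name, so treat it as a sketch:

    # put only the failing node into maintenance; the healthy master is untouched
    pcs node maintenance node01.example.com

    # what finally got the master re-promoted today: clearing the op history so the
    # ms resource gets re-probed
    pcs resource cleanup ms_drbd_ourApp
    # (on newer pacemaker/pcs the equivalent re-probe is "pcs resource refresh")

    # once the hardware issue is resolved
    pcs node unmaintenance node01.example.com
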
On Mon, Jan 25, 2021 at 8:04 AM Stuart Massey <[email protected]> wrote:
> Ok, that is exactly what one might expect -- and: Note that only the failing node is in maintenance mode. The current master/primary is not in maintenance mode, and on that node we continue to see messages in pacemaker.log that seem to indicate that it is doing monitor operations. Logically, if one has a multi-node cluster and puts only one of the nodes in maintenance mode while there are no managed resources running on it, wouldn't the other nodes continue to manage the resources among themselves?
>
> On Mon, Jan 25, 2021 at 2:07 AM Ulrich Windl <[email protected]> wrote:
>
>> >>> Stuart Massey <[email protected]> schrieb am 22.01.2021 um 14:08 in Nachricht <cabq68ntgdmxvo_uvlxg0hytlgsmrgucvcssa3ergqfov+cj...@mail.gmail.com>:
>> > Hi Ulrich,
>> > Thank you for your response.
>> > It makes sense that this would be happening on the failing secondary/slave node, in which case we might expect drbd to be restarted (the service entirely, since it is already demoted) on the slave. I don't understand how it would affect the master, unless the failing secondary is causing some issue with drbd on the primary that causes the monitor on the master to time out for some reason. This does not (so far) seem to be the case, as the failing node has now been in maintenance mode for a couple of days with drbd still running as secondary, so if drbd failures on the secondary were causing the monitor on the Master/Primary to time out, we should still be seeing that; we are not. The master has yet to demote the drbd resource since we put the failing node in maintenance.
>>
>> When you are in maintenance mode, monitor operations won't run AFAIK.
>>
>> > We will watch for a bit longer.
>> > Thanks again
>> >
>> > On Thu, Jan 21, 2021 at 2:23 AM Ulrich Windl <[email protected]> wrote:
>> >
>> >> >>> Stuart Massey <[email protected]> schrieb am 20.01.2021 um 03:41 in Nachricht <cajfrb75upumzjpjxcoacrdgog-bqdcjhff5c_omvbfya53d...@mail.gmail.com>:
>> >> > Strahil,
>> >> > That is very kind of you, thanks.
>> >> > I see that in your (feature set 3.4.1) cib, drbd is in a <clone> with some meta_attributes and operations having to do with promotion, while in our (feature set 3.0.14) cib, drbd is in a <master> which does not have those (maybe since promotion is implicit).
>> >> > Our cluster has been working quite well for some time, too. I wonder what would happen if you could hang the os in one of your nodes? If a VM, maybe
>> >>
>> >> Unless some other fencing mechanism (like watchdog timeout) kicks in, the monitor operation is the only thing that can detect a problem (from the cluster's view): the monitor operation would time out. Then the cluster would try to restart the resource (stop, then start). If stop also times out, the node will be fenced.
>> >>
>> >> > the constrained secondary could be starved by setting disk IOPs to something really low. Of course, you are using different versions of just about everything, as we're on centos7.
>> >> > Regards,
>> >> > Stuart
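
Since the <master> vs. <clone> difference keeps coming up above: for anyone comparing CIBs, our ms resource was created with the old pcs 0.9 "master" syntax on CentOS 7 / pacemaker 1.1, roughly as below. This is a from-memory sketch using our de-identified names (drbd_ourApp, ourApp), not a copy of our actual config, so double-check options against your pcs version:

    # DRBD primitive with separate monitor intervals for each role
    pcs resource create drbd_ourApp ocf:linbit:drbd drbd_resource=ourApp \
        op monitor interval=29s role=Master op monitor interval=31s role=Slave

    # wrap it in an ms (master/slave) resource; pacemaker 1.1 renders this as <master>
    pcs resource master ms_drbd_ourApp drbd_ourApp \
        master-max=1 master-node-max=1 clone-max=2 clone-node-max=1 notify=true

On pacemaker 2.x / pcs 0.10 (Strahil's CentOS 8.3 cluster) the same thing would presumably be a promotable clone ("pcs resource promotable drbd_ourApp ..."), which is why that CIB shows a <clone> with promotion meta_attributes while ours shows a bare <master>.
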
>> >> > On Tue, Jan 19, 2021 at 6:20 PM Strahil Nikolov <[email protected]> wrote:
>> >> >
>> >> >> I have just built a test cluster (CentOS 8.3) for testing DRBD and it works quite fine.
>> >> >> Actually I followed my notes from https://forums.centos.org/viewtopic.php?t=65539 with the exception of point 8 due to the "promotable" stuff.
>> >> >>
>> >> >> I'm attaching the output of 'pcs cluster cib file' and I hope it helps you fix your issue.
>> >> >>
>> >> >> Best Regards,
>> >> >> Strahil Nikolov
>> >> >>
>> >> >> At 09:32 -0500 on 19.01.2021 (Tue), Stuart Massey wrote:
>> >> >>
>> >> >> Ulrich,
>> >> >> Thank you for that observation. We share that concern.
>> >> >> We have 4 ea 1G nics active, bonded in pairs. One bonded pair serves the "public" (to the intranet) IPs, and the other bonded pair is private to the cluster, used for drbd replication. HA will, I hope, be using the "public" IP, since that is the route to the IP addresses resolved for the host names; that will certainly be the only route to the quorum device. I can say that this cluster has run reasonably well for quite some time with this configuration prior to the recently developed hardware issues on one of the nodes.
>> >> >> Regards,
>> >> >> Stuart
>> >> >>
>> >> >> On Tue, Jan 19, 2021 at 2:49 AM Ulrich Windl <[email protected]> wrote:
>> >> >>
>> >> >> >>> Stuart Massey <[email protected]> schrieb am 19.01.2021 um 04:46 in Nachricht <cabq68nqutyyxcygwcupg5txxajjwhsp+c6gcokfowgyrqsa...@mail.gmail.com>:
>> >> >> > So, we have a 2-node cluster with a quorum device. One of the nodes (node1) is having some trouble, so we have added constraints to prevent any resources migrating to it, but have not put it in standby, so that drbd in secondary on that node stays in sync. The problems it is having lead to OS lockups that eventually resolve themselves - but that causes it to be temporarily dropped from the cluster by the current master (node2). Sometimes when node1 rejoins, then node2 will demote the drbd ms resource. That causes all resources that depend on it to be stopped, leading to a service outage. They are then restarted on node2, since they can't run on node1 (due to constraints).
>> >> >> > We are having a hard time understanding why this happens. It seems like there may be some sort of DC contention happening. Does anyone have any idea how we might prevent this from happening?
>> >> >>
>> >> >> I think if you are routing high-volume DRBD traffic through "the same pipe" as the cluster communication, cluster communication may fail if the pipe is saturated.
>> >> >> I'm not happy with that, but it seems to be that way.
>> >> >>
>> >> >> Maybe running a combination of iftop and iotop could help you understand what's going on...
>> >> >>
>> >> >> Regards,
>> >> >> Ulrich
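
On the shared-pipe point: corosync should be riding on the public bond and DRBD replication on the private bond here, but that is worth verifying rather than assuming. Something along these lines is what we plan to run the next time the secondary acts up (bond0/bond1 are placeholders for our actual bond names):

    # per-link traffic on each bond (interface names are placeholders)
    iftop -i bond0    # public bond: corosync + client traffic
    iftop -i bond1    # private bond: DRBD replication

    # confirm which ring corosync is using and that it is not marked faulty
    corosync-cfgtool -s

    # local disk pressure, and DRBD replication/resync activity
    iotop -o
    cat /proc/drbd
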
>> >> >> > Selected messages (de-identified) from pacemaker.log that illustrate suspicion re DC confusion are below. The update_dc and abort_transition_graph re deletion of lrm seem to always precede the demotion, and a demotion seems to always follow (when not already demoted).
>> >> >> >
>> >> >> > Jan 18 16:52:17 [21938] node02.example.com crmd: info: do_dc_takeover: Taking over DC status for this partition
>> >> >> > Jan 18 16:52:17 [21938] node02.example.com crmd: info: update_dc: Set DC to node02.example.com (3.0.14)
>> >> >> > Jan 18 16:52:17 [21938] node02.example.com crmd: info: abort_transition_graph: Transition aborted by deletion of lrm[@id='1']: Resource state removal | cib=0.89.327 source=abort_unless_down:357 path=/cib/status/node_state[@id='1']/lrm[@id='1'] complete=true
>> >> >> > Jan 18 16:52:19 [21937] node02.example.com pengine: info: master_color: ms_drbd_ourApp: Promoted 0 instances of a possible 1 to master
>> >> >> > Jan 18 16:52:19 [21937] node02.example.com pengine: notice: LogAction: * Demote drbd_ourApp:1 ( Master -> Slave node02.example.com )
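
For anyone digging into the posted log: the demote decisions above were made by the pengine on node02, so they should be reproducible from the pe-input files it saved around those timestamps. Something like the following should show the scores behind the decision (the file name is a placeholder; pick the pe-input that the log references just before the demote):

    # on the DC (node02), replay a saved transition and show allocation/promotion scores
    crm_simulate --simulate --show-scores \
        --xml-file /var/lib/pacemaker/pengine/pe-input-NNN.bz2
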
_______________________________________________ Manage your subscription: https://lists.clusterlabs.org/mailman/listinfo/users ClusterLabs home: https://www.clusterlabs.org/
