On 9/11/2018 1:59 AM, Andrei Borzenkov wrote:
07.09.2018 23:07, Dan Ragle wrote:

On an active-active two-node cluster with DRBD, dlm, filesystem mounts, a web server, and some crons, I can't figure out how to have the crons jump from node to node in the correct order. Specifically, I have two crontabs (managed via symlink creation/deletion) which normally run one on node1 and the other on node2. When a node goes down, I want both to run on the remaining node until the original node comes back up, at which point they should split across the nodes again. However, when returning to the original node, the crontab being moved must wait until the underlying FS mount is done on that node before jumping.

DRBD, dlm, the filesystem mounts, and the web server are all working as expected: when I mark the second node as standby, Apache stops, the FS unmounts, dlm stops, and DRBD stops on that node; and when I mark the same node unstandby, the reverse happens as expected. All of those are cloned resources. The crontab resources are not cloned; they create symlinks, one resource preferring the first node and the other preferring the second. Each is colocated with and order-dependent on the filesystem mounts (which in turn are colocated with and dependent on dlm, which in turn is colocated with and dependent on DRBD promotion).

I thought this would be sufficient, but when the original node is marked unstandby, the crontab that prefers that node attempts to jump over immediately, before the FS is mounted there. Of course the crontab link fails because the underlying filesystem hasn't been mounted yet.

pcs version is 0.9.162. Here's the obfuscated detailed list of commands for the config. I'm still trying to set it up, so it's not production-ready yet, but I want to get this much sorted before I add too much more.
# pcs config export pcs-commands
#!/usr/bin/sh
# sequence generated on 2018-09-07 15:21:15 with: clufter 0.77.0
# invoked as: ['/usr/sbin/pcs', 'config', 'export', 'pcs-commands']
# targeting system: ('linux', 'centos', '7.5.1804', 'Core')
# using interpreter: CPython 2.7.5
pcs cluster auth node1.mydomain.com node2.mydomain.com <> /dev/tty
pcs cluster setup --name MyCluster \
  node1.mydomain.com node2.mydomain.com --transport udpu
pcs cluster start --all --wait=60
pcs cluster cib tmp-cib.xml
cp tmp-cib.xml tmp-cib.xml.deltasrc
pcs -f tmp-cib.xml property set stonith-enabled=false
pcs -f tmp-cib.xml property set no-quorum-policy=freeze
pcs -f tmp-cib.xml resource defaults resource-stickiness=100
pcs -f tmp-cib.xml resource create DRBD ocf:linbit:drbd drbd_resource=r0 \
  op demote interval=0s timeout=90 monitor interval=60s notify interval=0s \
  timeout=90 promote interval=0s timeout=90 reload interval=0s timeout=30 \
  start interval=0s timeout=240 stop interval=0s timeout=100
pcs -f tmp-cib.xml resource create dlm ocf:pacemaker:controld \
  allow_stonith_disabled=1 \
  op monitor interval=60s start interval=0s timeout=90 stop interval=0s \
  timeout=100
pcs -f tmp-cib.xml resource create WWWMount ocf:heartbeat:Filesystem \
  device=/dev/drbd1 directory=/var/www fstype=gfs2 \
  options=_netdev,nodiratime,noatime \
  op monitor interval=20 timeout=40 notify interval=0s timeout=60 start \
  interval=0s timeout=120s stop interval=0s timeout=120s
pcs -f tmp-cib.xml resource create WebServer ocf:heartbeat:apache \
  configfile=/etc/httpd/conf/httpd.conf statusurl=http://localhost/server-status \
  op monitor interval=1min start interval=0s timeout=40s stop interval=0s \
  timeout=60s
pcs -f tmp-cib.xml resource create SharedRootCrons ocf:heartbeat:symlink \
  link=/etc/cron.d/root-shared target=/var/www/crons/root-shared \
  op monitor interval=60 timeout=15 start interval=0s timeout=15 stop \
  interval=0s timeout=15
pcs -f tmp-cib.xml resource create SharedUserCrons ocf:heartbeat:symlink \
  link=/etc/cron.d/User-shared target=/var/www/crons/User-shared \
  op monitor interval=60 timeout=15 start interval=0s timeout=15 stop \
  interval=0s timeout=15
pcs -f tmp-cib.xml resource create PrimaryUserCrons ocf:heartbeat:symlink \
  link=/etc/cron.d/User-server1 target=/var/www/crons/User-server1 \
  op monitor interval=60 timeout=15 start interval=0s timeout=15 stop \
  interval=0s timeout=15 meta resource-stickiness=0
pcs -f tmp-cib.xml \
  resource create SecondaryUserCrons ocf:heartbeat:symlink \
  link=/etc/cron.d/User-server2 target=/var/www/crons/User-server2 \
  op monitor interval=60 timeout=15 start interval=0s timeout=15 stop \
  interval=0s timeout=15 meta resource-stickiness=0
pcs -f tmp-cib.xml \
  resource clone dlm clone-max=2 clone-node-max=1 interleave=true
pcs -f tmp-cib.xml resource clone WWWMount interleave=true
pcs -f tmp-cib.xml resource clone WebServer interleave=true
pcs -f tmp-cib.xml resource clone SharedRootCrons interleave=true
pcs -f tmp-cib.xml resource clone SharedUserCrons interleave=true
pcs -f tmp-cib.xml \
  resource master DRBDClone DRBD master-node-max=1 clone-max=2 master-max=2 \
  interleave=true notify=true clone-node-max=1
pcs -f tmp-cib.xml \
  constraint colocation add dlm-clone with DRBDClone \
  id=colocation-dlm-clone-DRBDClone-INFINITY
pcs -f tmp-cib.xml constraint order promote DRBDClone \
  then dlm-clone id=order-DRBDClone-dlm-clone-mandatory
pcs -f tmp-cib.xml \
  constraint colocation add WWWMount-clone with dlm-clone \
  id=colocation-WWWMount-clone-dlm-clone-INFINITY
pcs -f tmp-cib.xml constraint order dlm-clone \
  then WWWMount-clone id=order-dlm-clone-WWWMount-clone-mandatory
pcs -f tmp-cib.xml \
  constraint colocation add WebServer-clone with WWWMount-clone \
  id=colocation-WebServer-clone-WWWMount-clone-INFINITY
pcs -f tmp-cib.xml constraint order WWWMount-clone \
  then WebServer-clone id=order-WWWMount-clone-WebServer-clone-mandatory
pcs -f tmp-cib.xml \
  constraint colocation add SharedRootCrons-clone with WWWMount-clone \
  id=colocation-SharedRootCrons-clone-WWWMount-clone-INFINITY
pcs -f tmp-cib.xml \
  constraint colocation add SharedUserCrons-clone with WWWMount-clone \
  id=colocation-SharedUserCrons-clone-WWWMount-clone-INFINITY
pcs -f tmp-cib.xml constraint order WWWMount-clone \
  then SharedRootCrons-clone \
  id=order-WWWMount-clone-SharedRootCrons-clone-mandatory
pcs -f tmp-cib.xml constraint order WWWMount-clone \
  then SharedUserCrons-clone \
  id=order-WWWMount-clone-SharedUserCrons-clone-mandatory
pcs -f tmp-cib.xml \
  constraint location PrimaryUserCrons prefers node1.mydomain.com=500
pcs -f tmp-cib.xml \
  constraint colocation add PrimaryUserCrons with WWWMount-clone \
  id=colocation-PrimaryUserCrons-WWWMount-clone-INFINITY
pcs -f tmp-cib.xml constraint order WWWMount-clone \
  then PrimaryUserCrons \
  id=order-WWWMount-clone-PrimaryUserCrons-mandatory
pcs -f tmp-cib.xml \
  constraint location SecondaryUserCrons prefers node2.mydomain.com=500

I can't answer your question, but just an observation: it appears that only the resources with explicit location preferences misbehave. Is it possible, as a workaround, not to use them?
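If you want to try that, something like the following should do it (a sketch only; the location-constraint ids are whatever pcs auto-generated, so check the actual ids first):

```shell
# List all constraints with their ids, so the auto-generated
# location-constraint ids can be copied exactly.
pcs constraint --full

# Remove the two location preferences. The ids below are
# illustrative guesses; use the ids printed by the command above.
pcs constraint remove location-PrimaryUserCrons-node1.mydomain.com-500
pcs constraint remove location-SecondaryUserCrons-node2.mydomain.com-500
```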
I suppose it's not *critical* that PrimaryCrons be on node1 and SecondaryCrons on node2, so long as they remain split during normal operation. I could try something like a negative colocation constraint to keep them separate, if nothing else to see whether that allows them to bounce back and forth cleanly with respect to their other constraints. I'll give that a shot this morning.
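Something like this, maybe (untested sketch; a finite negative score such as -500 rather than -INFINITY, since -INFINITY would also forbid the two crontab resources from sharing the surviving node when one node is down):

```shell
# Keep the two crontab resources on different nodes when possible,
# but only as a preference: with a finite negative score they can
# still share a node when only one node is available.
pcs -f tmp-cib.xml constraint colocation add SecondaryUserCrons \
  with PrimaryUserCrons -500
```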
pcs -f tmp-cib.xml \
  constraint colocation add SecondaryUserCrons with WWWMount-clone \
  id=colocation-SecondaryUserCrons-WWWMount-clone-INFINITY
pcs -f tmp-cib.xml constraint order WWWMount-clone \
  then SecondaryUserCrons \
  id=order-WWWMount-clone-SecondaryUserCrons-mandatory
pcs cluster cib-push tmp-cib.xml diff-against=tmp-cib.xml.deltasrc

When I standby node2, SecondaryUserCrons bounces over to node1 as expected. When I unstandby node2, it bounces back to node2 immediately, before the WWWMount start is performed, and thus it fails. What am I missing? Here are the log messages from the unstandby operation:

Sep 7 15:02:28 node2 crmd[58188]: notice: State transition S_IDLE -> S_POLICY_ENGINE
Sep 7 15:02:28 node2 pengine[58187]: notice: * Start DRBD:1 ( node2.mydomain.com )
Sep 7 15:02:28 node2 pengine[58187]: notice: * Start dlm:1 ( node2.mydomain.com ) due to unrunnable DRBD:1 promote (blocked)
Sep 7 15:02:28 node2 pengine[58187]: notice: * Start WWWMount:1 ( node2.mydomain.com ) due to unrunnable dlm:1 start (blocked)
Sep 7 15:02:28 node2 pengine[58187]: notice: * Start WebServer:1 ( node2.mydomain.com ) due to unrunnable WWWMount:1 start (blocked)
Sep 7 15:02:28 node2 pengine[58187]: notice: * Start SharedRootCrons:1 ( node2.mydomain.com ) due to unrunnable WWWMount:1 start (blocked)
Sep 7 15:02:28 node2 pengine[58187]: notice: * Start SharedUserCrons:1 ( node2.mydomain.com ) due to unrunnable WWWMount:1 start (blocked)
Sep 7 15:02:28 node2 pengine[58187]: notice: * Move SecondaryUserCrons ( node1.mydomain.com -> node2.mydomain.com )
Sep 7 15:02:28 node2 pengine[58187]: notice: Calculated transition 129, saving inputs in /var/lib/pacemaker/pengine/pe-input-2795.bz2

This file would be useful to have.
Reran the test this morning; the file you noted is enclosed. I removed the WebServer and SharedCrons resources from the test setup in an attempt to simplify, but other than that it should be the same. I'm still getting the same issue.
Remember, this file is the transition generated when I connect to node2 and execute pcs node unstandby. I.e.:

Sep 11 08:42:52 node1 pengine[103342]: notice: * Start DRBD:1 ( node2.mydomain.com )
Sep 11 08:42:52 node1 pengine[103342]: notice: * Start dlm:1 ( node2.mydomain.com ) due to unrunnable DRBD:1 promote (blocked)
Sep 11 08:42:52 node1 pengine[103342]: notice: * Start WWWMount:1 ( node2.mydomain.com ) due to unrunnable dlm:1 start (blocked)
Sep 11 08:42:52 node1 pengine[103342]: notice: * Move SecondaryUserCrons ( node1.mydomain.com -> node2.mydomain.com )
Sep 11 08:42:52 node1 pengine[103342]: notice: Calculated transition 72, saving inputs in /var/lib/pacemaker/pengine/pe-input-1412.bz2
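If it helps, the saved pe-input file can be replayed offline to see why the policy engine chose that placement; something along these lines (assuming the pacemaker CLI tools are installed and the path matches the log above):

```shell
# Replay the saved transition input against nothing (no live cluster
# needed) and show the simulated actions plus the allocation scores
# that drove the placement decisions.
crm_simulate --simulate --show-scores \
  --xml-file=/var/lib/pacemaker/pengine/pe-input-1412.bz2
```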
_______________________________________________
Users mailing list: Users@clusterlabs.org
https://lists.clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org
[Attachment: pe-input-1412.bz2 (binary data)]