Thanks for the comments. Replies within.

On 9/11/2018 1:52 PM, Ken Gaillot wrote:
On Fri, 2018-09-07 at 16:07 -0400, Dan Ragle wrote:
On an active-active two node cluster with DRBD, dlm, filesystem
mounts, a Web Server, and some crons I can't figure out how to have
the crons jump from node to node in the correct order. Specifically,
I have two crontabs (managed via symlink creation/deletion)
which normally will run one on node1 and the other on node2. When a
node goes down, I want both to run on the remaining node until
the original node comes back up, at which time they should split
between the nodes again. However, when returning to the original node the
crontab that is being moved must wait until the underlying FS mount
is done on the original node before jumping.

DRBD, dlm, the filesystem mounts, and the Web Server are all working
as expected; when I mark the second node as standby, Apache
stops, the FS unmounts, dlm stops, and DRBD stops on the node; and
when I mark that same node unstandby, the reverse happens as
expected. All three of those are cloned resources.

The crontab resources are not cloned and create symlinks, one
resource preferring the first node and the other preferring the
second. Each is colocated with and order-dependent on the filesystem
mounts (which in turn are colocated and dependent on dlm, which in
turn is colocated and dependent on DRBD promotion). I thought this
would be sufficient, but when the original node is marked
unstandby, the crontab that prefers to be on that node attempts to
jump over immediately, before the FS is mounted on that node. Of
course the crontab link fails because the underlying filesystem
hasn't been mounted yet.

pcs version is 0.9.162.

Here's the (obfuscated) detailed list of commands for the config. I'm
still setting it up, so it's not production-ready yet, but I
want to get this much sorted before I add too much more.

# pcs config export pcs-commands
#!/usr/bin/sh
# sequence generated on 2018-09-07 15:21:15 with: clufter 0.77.0
# invoked as: ['/usr/sbin/pcs', 'config', 'export', 'pcs-commands']
# targeting system: ('linux', 'centos', '7.5.1804', 'Core')
# using interpreter: CPython 2.7.5
pcs cluster auth node1.mydomain.com node2.mydomain.com <> /dev/tty
pcs cluster setup --name MyCluster \
    node1.mydomain.com node2.mydomain.com --transport udpu
pcs cluster start --all --wait=60
pcs cluster cib tmp-cib.xml
cp tmp-cib.xml tmp-cib.xml.deltasrc
pcs -f tmp-cib.xml property set stonith-enabled=false
pcs -f tmp-cib.xml property set no-quorum-policy=freeze
pcs -f tmp-cib.xml resource defaults resource-stickiness=100

Just a note, scores are all added together, and highest wins. For
example, if resource-stickiness + location preference for current node
> colocation with resource on different node, then the colocation will
be ignored.

I don't think that's what's happening here; the resource is moving *where* I want/expect it to, just not in the right order.
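(For what it's worth, running that addition with the numbers in this config: staying put is worth the stickiness, which is 0 for the two per-node cron resources (100 is the default elsewhere), while the location preference for the preferred node is 500. So 500 wins either way and the resource moves back where I want it; the only problem is the order it happens in.)

    stay on current node:    resource-stickiness   0 (cron symlinks) / 100 (default)
    move to preferred node:  location preference 500  <- higher score, so it moves back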


pcs -f tmp-cib.xml resource create DRBD ocf:linbit:drbd drbd_resource=r0 \
    op demote interval=0s timeout=90 monitor interval=60s \
    notify interval=0s timeout=90 promote interval=0s timeout=90 \
    reload interval=0s timeout=30 \
    start interval=0s timeout=240 stop interval=0s timeout=100
pcs -f tmp-cib.xml resource create dlm ocf:pacemaker:controld \
    allow_stonith_disabled=1 \
    op monitor interval=60s start interval=0s timeout=90 \
    stop interval=0s timeout=100
pcs -f tmp-cib.xml resource create WWWMount ocf:heartbeat:Filesystem \
    device=/dev/drbd1 directory=/var/www fstype=gfs2 \
    options=_netdev,nodiratime,noatime \
    op monitor interval=20 timeout=40 notify interval=0s timeout=60 \
    start interval=0s timeout=120s stop interval=0s timeout=120s
pcs -f tmp-cib.xml resource create WebServer ocf:heartbeat:apache \
    configfile=/etc/httpd/conf/httpd.conf \
    statusurl=http://localhost/server-status \
    op monitor interval=1min start interval=0s timeout=40s \
    stop interval=0s timeout=60s
pcs -f tmp-cib.xml resource create SharedRootCrons ocf:heartbeat:symlink \
    link=/etc/cron.d/root-shared target=/var/www/crons/root-shared \
    op monitor interval=60 timeout=15 start interval=0s timeout=15 \
    stop interval=0s timeout=15

Another note, I seem to remember some implementations of the cron
daemon refuse to work from symlinks, and some require a restart when a
cron is changed outside of the crontab command. That may or may not
apply in your situation; the system or cron daemon logs should show
whether the change took effect when the resource is started/stopped.


Yup. We're actually already doing this much in production, just not with any type of cluster-based management, and it's working well. Creating and deleting the symlinks works fine (crond picks up the change without a problem). When updating the underlying cron definitions you do, however, need to touch -h the symlink file itself; just updating the underlying file isn't enough to get crond to notice. We just issue that touch command as part of our roll-in tools whenever we update the underlying cron definitions.
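
For reference, the roll-in step amounts to something like this (just a sketch using the User-server2 paths from this config; the staging path is made up):

    # copy the updated cron definition onto the shared filesystem ...
    cp /some/staging/dir/User-server2 /var/www/crons/User-server2
    # ... then bump the symlink's own mtime (-h = don't dereference) so crond re-reads it
    touch -h /etc/cron.d/User-server2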

An alternative design for working around those issues is to have all
the crons always active (on host storage) on both nodes, but have the
cron jobs somehow check whether they're on the active node, and exit
when they're not where they need to be.
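
That could work too; each shared cron entry would just wrap its command with a quick check, something along these lines (rough sketch; it approximates "active" as "the shared mount is present", using /var/www from this config, and real-job.sh is a placeholder):

    # only run the real job where the shared filesystem is actually mounted
    mountpoint -q /var/www || exit 0
    /var/www/crons/bin/real-job.sh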

pcs -f tmp-cib.xml resource create SharedUserCrons ocf:heartbeat:symlink \
    link=/etc/cron.d/User-shared target=/var/www/crons/User-shared \
    op monitor interval=60 timeout=15 start interval=0s timeout=15 \
    stop interval=0s timeout=15
pcs -f tmp-cib.xml resource create PrimaryUserCrons ocf:heartbeat:symlink \
    link=/etc/cron.d/User-server1 target=/var/www/crons/User-server1 \
    op monitor interval=60 timeout=15 start interval=0s timeout=15 \
    stop interval=0s timeout=15 meta resource-stickiness=0
pcs -f tmp-cib.xml \
    resource create SecondaryUserCrons ocf:heartbeat:symlink \
    link=/etc/cron.d/User-server2 target=/var/www/crons/User-server2 \
    op monitor interval=60 timeout=15 start interval=0s timeout=15 \
    stop interval=0s timeout=15 meta resource-stickiness=0
pcs -f tmp-cib.xml \
    resource clone dlm clone-max=2 clone-node-max=1 interleave=true
pcs -f tmp-cib.xml resource clone WWWMount interleave=true
pcs -f tmp-cib.xml resource clone WebServer interleave=true
pcs -f tmp-cib.xml resource clone SharedRootCrons interleave=true
pcs -f tmp-cib.xml resource clone SharedUserCrons interleave=true
pcs -f tmp-cib.xml \
    resource master DRBDClone DRBD master-node-max=1 clone-max=2 \
    master-max=2 interleave=true notify=true clone-node-max=1
pcs -f tmp-cib.xml \
    constraint colocation add dlm-clone with DRBDClone \
    id=colocation-dlm-clone-DRBDClone-INFINITY

Even though your DRBD is multi-master, that doesn't mean it will
*always* be in primary mode (e.g. it will start in secondary and then
be promoted to primary, or the promotion may fail). I think you want to
colocate DLM with the DRBD master role, so DLM (and further
dependencies) don't run if DRBD is in secondary mode.


I'll be buggered. That fixed it. Though I don't understand why.

    pcs constraint remove colocation-dlm-clone-DRBDClone-INFINITY
    pcs constraint colocation add dlm-clone with DRBDClone \
        with-rsc-role=Master INFINITY

And all is well; the symlink resource is now behaving as expected. I would never have thought of that as part of the problem, because DRBD, dlm, and the FS mount were working as expected all along; i.e., dlm waited for the DRBD promotion (see the ordering constraint in the next command) before starting, and WWWMount waited for dlm to start.

With the above in place, I now see a new transition sequence: DRBD is started by itself (the only operation in the first transition), and then the next transition promotes DRBD, starts dlm, starts WWWMount, and *then* moves the symlink.

pcs -f tmp-cib.xml constraint order promote DRBDClone \
    then dlm-clone id=order-DRBDClone-dlm-clone-mandatory
pcs -f tmp-cib.xml \
    constraint colocation add WWWMount-clone with dlm-clone \
    id=colocation-WWWMount-clone-dlm-clone-INFINITY
pcs -f tmp-cib.xml constraint order dlm-clone \
    then WWWMount-clone id=order-dlm-clone-WWWMount-clone-mandatory
pcs -f tmp-cib.xml \
    constraint colocation add WebServer-clone with WWWMount-clone \
    id=colocation-WebServer-clone-WWWMount-clone-INFINITY
pcs -f tmp-cib.xml constraint order WWWMount-clone \
    then WebServer-clone \
    id=order-WWWMount-clone-WebServer-clone-mandatory

Yet another side note: you can clone a group, so it might simplify
slightly to clone a group of DLM + WWWMount + WebServer, then
colocate/order the cloned group relative to DRBD master.

Cool. Yah, hadn't even gotten as far as groups yet.
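
For when I do get to it, I'm guessing the group version would look roughly like this, built from the plain uncloned primitives (untested sketch; "Base" is just a made-up group name):

    pcs resource group add Base dlm WWWMount WebServer
    pcs resource clone Base interleave=true
    pcs constraint colocation add Base-clone with DRBDClone \
        with-rsc-role=Master INFINITY
    pcs constraint order promote DRBDClone then start Base-clone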


pcs -f tmp-cib.xml \
    constraint colocation add SharedRootCrons-clone with WWWMount-clone \
    id=colocation-SharedRootCrons-clone-WWWMount-clone-INFINITY
pcs -f tmp-cib.xml \
    constraint colocation add SharedUserCrons-clone with WWWMount-clone \
    id=colocation-SharedUserCrons-clone-WWWMount-clone-INFINITY
pcs -f tmp-cib.xml constraint order WWWMount-clone \
    then SharedRootCrons-clone \
    id=order-WWWMount-clone-SharedRootCrons-clone-mandatory
pcs -f tmp-cib.xml constraint order WWWMount-clone \
    then SharedUserCrons-clone \
    id=order-WWWMount-clone-SharedUserCrons-clone-mandatory
pcs -f tmp-cib.xml \
    constraint location PrimaryUserCrons prefers node1.mydomain.com=500

This score is higher than stickiness, but I think that was intentional,
so it would move back if a node is lost and recovered. Another way to
do that would be to set resource-stickiness=0 on these resources to
override the default, ideally with a small anti-colocation as you tried
later.

Yup, just trying to make sure each one really likes its preferred node as long as that node is available and all the other dependencies are met.
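
The two per-node cron resources already have resource-stickiness=0 set, so if I go that route I think the only missing piece would be the small anti-colocation, something like this (untested sketch; -500 is an arbitrary small negative score):

    pcs constraint colocation add SecondaryUserCrons with PrimaryUserCrons -500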


pcs -f tmp-cib.xml \
    constraint colocation add PrimaryUserCrons with WWWMount-clone \
    id=colocation-PrimaryUserCrons-WWWMount-clone-INFINITY
pcs -f tmp-cib.xml constraint order WWWMount-clone \
    then PrimaryUserCrons \
    id=order-WWWMount-clone-PrimaryUserCrons-mandatory
pcs -f tmp-cib.xml \
    constraint location SecondaryUserCrons prefers node2.mydomain.com=500
pcs -f tmp-cib.xml \
    constraint colocation add SecondaryUserCrons with WWWMount-clone \
    id=colocation-SecondaryUserCrons-WWWMount-clone-INFINITY
pcs -f tmp-cib.xml constraint order WWWMount-clone \
    then SecondaryUserCrons \
    id=order-WWWMount-clone-SecondaryUserCrons-mandatory
pcs cluster cib-push tmp-cib.xml diff-against=tmp-cib.xml.deltasrc

When I standby node2, the SecondaryUserCrons bounces over to node1 as
expected. When I unstandby node2, it bounces back to node2
immediately, before WWWMount is performed, and thus it fails. What am
I missing? Here are the log messages from the unstandby operation:

Sep  7 15:02:28 node2 crmd[58188]:   notice: State transition S_IDLE
-> S_POLICY_ENGINE
Sep  7 15:02:28 node2 pengine[58187]:   notice:  * Start      DRBD:1                ( node2.mydomain.com )
Sep  7 15:02:28 node2 pengine[58187]:   notice:  * Start      dlm:1                 ( node2.mydomain.com )  due to unrunnable DRBD:1 promote (blocked)
Sep  7 15:02:28 node2 pengine[58187]:   notice:  * Start      WWWMount:1            ( node2.mydomain.com )  due to unrunnable dlm:1 start (blocked)
Sep  7 15:02:28 node2 pengine[58187]:   notice:  * Start      WebServer:1           ( node2.mydomain.com )  due to unrunnable WWWMount:1 start (blocked)
Sep  7 15:02:28 node2 pengine[58187]:   notice:  * Start      SharedRootCrons:1     ( node2.mydomain.com )  due to unrunnable WWWMount:1 start (blocked)
Sep  7 15:02:28 node2 pengine[58187]:   notice:  * Start      SharedUserCrons:1     ( node2.mydomain.com )  due to unrunnable WWWMount:1 start (blocked)
Sep  7 15:02:28 node2 pengine[58187]:   notice:  * Move       SecondaryUserCrons    ( node1.mydomain.com -> node2.mydomain.com )
Sep  7 15:02:28 node2 pengine[58187]:   notice: Calculated transition 129, saving inputs in /var/lib/pacemaker/pengine/pe-input-2795.bz2

Please open a bug at bugs.clusterlabs.org; this is definitely broken.


Filed as https://bugs.clusterlabs.org/show_bug.cgi?id=5368 (I had done so before realizing the fix from above). It might still be worth looking into; I'll leave it there and let you guys decide what you want to do with it. I'll add a note to the bug about the dlm -> DRBD master-role fix above.

Dan

Sep  7 15:02:28 node2 crmd[58188]:   notice: Initiating stop
operation SecondaryUserCrons_stop_0 on node1.mydomain.com
Sep  7 15:02:28 node2 crmd[58188]:   notice: Initiating notify
operation DRBD_pre_notify_start_0 on node1.mydomain.com
Sep  7 15:02:28 node2 crmd[58188]:   notice: Initiating start
operation SecondaryUserCrons_start_0 locally on node2.mydomain.com
Sep  7 15:02:28 node2 symlink(SecondaryUserCrons)[52196]: WARNING:
/var/www/crons/User-server2 does not exist!
Sep  7 15:02:28 node2 crmd[58188]:   notice: Initiating start
operation DRBD_start_0 locally on node2.mydomain.com
Sep  7 15:02:28 node2 symlink(SecondaryUserCrons)[52196]: INFO:
'/etc/cron.d/User-server2' -> '/var/www/crons/User-server2'
Sep  7 15:02:28 node2 symlink(SecondaryUserCrons)[52196]: ERROR: /etc/cron.d/User-server2 does not point to /var/www/crons/User-server2!
Sep  7 15:02:28 node2 lrmd[58185]:   notice: SecondaryUserCrons_start_0:52196:stderr [ ocf-exit-reason:/etc/cron.d/User-server2 does not point to /var/www/crons/User-server2! ]
Sep  7 15:02:28 node2 crmd[58188]:   notice: Result of start operation for SecondaryUserCrons on node2.mydomain.com: 5 (not installed)
Sep  7 15:02:28 node2 crmd[58188]:   notice: node2.mydomain.com-SecondaryUserCrons_start_0:390 [ ocf-exit-reason:/etc/cron.d/User-server2 does not point to /var/www/crons/User-server2!\n ]
Sep  7 15:02:28 node2 crmd[58188]:  warning: Action 109 (SecondaryUserCrons_start_0) on node2.mydomain.com failed (target: 0 vs. rc: 5): Error
Sep  7 15:02:28 node2 crmd[58188]:   notice: Transition aborted by operation SecondaryUserCrons_start_0 'modify' on node2.mydomain.com: Event failed
Sep  7 15:02:28 node2 crmd[58188]:  warning: Action 109 (SecondaryUserCrons_start_0) on node2.mydomain.com failed (target: 0 vs. rc: 5): Error
Sep  7 15:02:28 node2 crmd[58188]:   notice: Transition aborted by status-2-fail-count-SecondaryUserCrons.start_0 doing create fail-count-SecondaryUserCrons#start_0=INFINITY: Transient attribute change
Sep  7 15:02:28 node2 kernel: drbd r0: Starting worker thread (from
drbdsetup [52264])
Sep  7 15:02:28 node2 kernel: drbd r0/0 drbd1: disk( Diskless ->
Attaching )
Sep  7 15:02:28 node2 kernel: drbd r0/0 drbd1: Maximum number of peer
devices = 1
Sep  7 15:02:28 node2 kernel: drbd r0: Method to ensure write
ordering: drain
Sep  7 15:02:28 node2 kernel: drbd r0/0 drbd1: drbd_bm_resize called
with capacity == 1048543928
Sep  7 15:02:28 node2 kernel: drbd r0/0 drbd1: resync bitmap:
bits=131067991 words=2047938 pages=4000
Sep  7 15:02:28 node2 kernel: drbd r0/0 drbd1: size = 500 GB
(524271964 KB)
Sep  7 15:02:28 node2 kernel: drbd r0/0 drbd1: size = 500 GB
(524271964 KB)
Sep  7 15:02:28 node2 kernel: drbd r0/0 drbd1: recounting of set bits
took additional 13ms
Sep  7 15:02:28 node2 kernel: drbd r0/0 drbd1: disk( Attaching ->
Outdated )
Sep  7 15:02:28 node2 kernel: drbd r0/0 drbd1: attached to current
UUID: A2457506F4D44F1C
Sep  7 15:02:28 node2 kernel: drbd r0/1 drbd2: disk( Diskless ->
Attaching )
Sep  7 15:02:28 node2 kernel: drbd r0/1 drbd2: Maximum number of peer
devices = 1
Sep  7 15:02:28 node2 kernel: drbd r0/1 drbd2: drbd_bm_resize called
with capacity == 2097016
Sep  7 15:02:28 node2 kernel: drbd r0/1 drbd2: resync bitmap:
bits=262127 words=4096 pages=8
Sep  7 15:02:28 node2 kernel: drbd r0/1 drbd2: size = 1024 MB
(1048508 KB)
Sep  7 15:02:28 node2 kernel: drbd r0/1 drbd2: size = 1024 MB
(1048508 KB)
Sep  7 15:02:28 node2 kernel: drbd r0/1 drbd2: recounting of set bits
took additional 0ms
Sep  7 15:02:28 node2 kernel: drbd r0/1 drbd2: disk( Attaching ->
Outdated )
Sep  7 15:02:28 node2 kernel: drbd r0/1 drbd2: attached to current
UUID: 0EC5D56AEE53C6B6
Sep  7 15:02:28 node2 kernel: drbd r0 node1.mydomain.com: Starting
sender thread (from drbdsetup [52291])
Sep  7 15:02:28 node2 kernel: drbd r0 node1.mydomain.com: conn(
StandAlone -> Unconnected )
Sep  7 15:02:28 node2 kernel: drbd r0 node1.mydomain.com: Starting
receiver thread (from drbd_w_r0 [52265])
Sep  7 15:02:28 node2 kernel: drbd r0 node1.mydomain.com: conn(
Unconnected -> Connecting )
Sep  7 15:02:28 node2 crmd[58188]:   notice: Result of start
operation for DRBD on node2.mydomain.com: 0 (ok)
Sep  7 15:02:28 node2 crmd[58188]:   notice: Initiating notify
operation DRBD_post_notify_start_0 on node1.mydomain.com
Sep  7 15:02:28 node2 crmd[58188]:   notice: Initiating notify
operation DRBD_post_notify_start_0 locally on node2.mydomain.com
Sep  7 15:02:28 node2 crmd[58188]:   notice: Result of notify
operation for DRBD on node2.mydomain.com: 0 (ok)
Sep  7 15:02:28 node2 crmd[58188]:   notice: Transition 129
(Complete=29, Pending=0, Fired=0, Skipped=1, Incomplete=7,
Source=/var/lib/pacemaker/pengine/pe-input-2795.bz2): Stopped
Sep  7 15:02:28 node2 pengine[58187]:  warning: Processing failed op
start for SecondaryUserCrons on node2.mydomain.com: not
installed (5)
Sep  7 15:02:28 node2 pengine[58187]:   notice: Preventing
SecondaryUserCrons from re-starting on node2.mydomain.com: operation
start failed 'not installed' (5)
Sep  7 15:02:28 node2 pengine[58187]:  warning: Processing failed op
start for SecondaryUserCrons on node2.mydomain.com: not
installed (5)
Sep  7 15:02:28 node2 pengine[58187]:   notice: Preventing
SecondaryUserCrons from re-starting on node2.mydomain.com: operation
start failed 'not installed' (5)
Sep  7 15:02:28 node2 pengine[58187]:  warning: Forcing
SecondaryUserCrons away from node2.mydomain.com after 1000000
failures
(max=1000000)
Sep  7 15:02:28 node2 pengine[58187]:   notice:  * Start      dlm:1                 ( node2.mydomain.com )  due to unrunnable DRBD:1 promote (blocked)
Sep  7 15:02:28 node2 pengine[58187]:   notice:  * Start      WWWMount:1            ( node2.mydomain.com )  due to unrunnable dlm:1 start (blocked)
Sep  7 15:02:28 node2 pengine[58187]:   notice:  * Start      WebServer:1           ( node2.mydomain.com )  due to unrunnable WWWMount:1 start (blocked)
Sep  7 15:02:28 node2 pengine[58187]:   notice:  * Start      SharedRootCrons:1     ( node2.mydomain.com )  due to unrunnable WWWMount:1 start (blocked)
Sep  7 15:02:28 node2 pengine[58187]:   notice:  * Start      SharedUserCrons:1     ( node2.mydomain.com )  due to unrunnable WWWMount:1 start (blocked)
Sep  7 15:02:28 node2 pengine[58187]:   notice:  * Recover    SecondaryUserCrons    ( node2.mydomain.com -> node1.mydomain.com )
Sep  7 15:02:28 node2 pengine[58187]:   notice: Calculated transition 130, saving inputs in /var/lib/pacemaker/pengine/pe-input-2796.bz2
Sep  7 15:02:28 node2 crmd[58188]:   notice: Initiating monitor
operation DRBD_monitor_60000 locally on node2.mydomain.com
Sep  7 15:02:28 node2 crmd[58188]:   notice: Initiating stop
operation SecondaryUserCrons_stop_0 locally on node2.mydomain.com
Sep  7 15:02:28 node2 symlink(SecondaryUserCrons)[52329]: WARNING:
/var/www/crons/User-server2 does not exist!
Sep  7 15:02:28 node2 symlink(SecondaryUserCrons)[52329]: ERROR: /etc/cron.d/User-server2 does not point to /var/www/crons/User-server2!
Sep  7 15:02:28 node2 lrmd[58185]:   notice: SecondaryUserCrons_stop_0:52329:stderr [ ocf-exit-reason:/etc/cron.d/User-server2 does not point to /var/www/crons/User-server2! ]
Sep  7 15:02:28 node2 crmd[58188]:   notice: Result of stop operation for SecondaryUserCrons on node2.mydomain.com: 5 (not installed)
Sep  7 15:02:28 node2 crmd[58188]:   notice: node2.mydomain.com-SecondaryUserCrons_stop_0:394 [ ocf-exit-reason:/etc/cron.d/User-server2 does not point to /var/www/crons/User-server2!\n ]
Sep  7 15:02:28 node2 crmd[58188]:  warning: Action 10 (SecondaryUserCrons_stop_0) on node2.mydomain.com failed (target: 0 vs. rc: 5): Error
Sep  7 15:02:28 node2 crmd[58188]:   notice: Transition aborted by operation SecondaryUserCrons_stop_0 'modify' on node2.mydomain.com: Event failed
Sep  7 15:02:28 node2 crmd[58188]:  warning: Action 10 (SecondaryUserCrons_stop_0) on node2.mydomain.com failed (target: 0 vs. rc: 5): Error
Sep  7 15:02:28 node2 crmd[58188]:   notice: Transition aborted by status-2-fail-count-SecondaryUserCrons.stop_0 doing create fail-count-SecondaryUserCrons#stop_0=INFINITY: Transient attribute change
Sep  7 15:02:28 node2 crmd[58188]:   notice: Transition 130
(Complete=18, Pending=0, Fired=0, Skipped=0, Incomplete=8,
Source=/var/lib/pacemaker/pengine/pe-input-2796.bz2): Complete
Sep  7 15:02:29 node2 pengine[58187]:    error: No further recovery
can be attempted for SecondaryUserCrons: stop action failed with
'not installed' (5)
Sep  7 15:02:29 node2 pengine[58187]:  warning: Processing failed op
stop for SecondaryUserCrons on node2.mydomain.com: not
installed (5)
Sep  7 15:02:29 node2 pengine[58187]:   notice: Preventing
SecondaryUserCrons from re-starting on node2.mydomain.com: operation
stop
failed 'not installed' (5)
Sep  7 15:02:29 node2 pengine[58187]:    error: No further recovery
can be attempted for SecondaryUserCrons: stop action failed with
'not installed' (5)
Sep  7 15:02:29 node2 pengine[58187]:  warning: Processing failed op
stop for SecondaryUserCrons on node2.mydomain.com: not
installed (5)
Sep  7 15:02:29 node2 pengine[58187]:   notice: Preventing
SecondaryUserCrons from re-starting on node2.mydomain.com: operation
stop
failed 'not installed' (5)
Sep  7 15:02:29 node2 pengine[58187]:  warning: Forcing
SecondaryUserCrons away from node2.mydomain.com after 1000000
failures
(max=1000000)
Sep  7 15:02:29 node2 pengine[58187]:   notice:  * Start      dlm:1                 ( node2.mydomain.com )  due to unrunnable DRBD:1 promote (blocked)
Sep  7 15:02:29 node2 pengine[58187]:   notice:  * Start      WWWMount:1            ( node2.mydomain.com )  due to unrunnable dlm:1 start (blocked)
Sep  7 15:02:29 node2 pengine[58187]:   notice:  * Start      WebServer:1           ( node2.mydomain.com )  due to unrunnable WWWMount:1 start (blocked)
Sep  7 15:02:29 node2 pengine[58187]:   notice:  * Start      SharedRootCrons:1     ( node2.mydomain.com )  due to unrunnable WWWMount:1 start (blocked)
Sep  7 15:02:29 node2 pengine[58187]:   notice:  * Start      SharedUserCrons:1     ( node2.mydomain.com )  due to unrunnable WWWMount:1 start (blocked)
Sep  7 15:02:29 node2 pengine[58187]:    error: Calculated transition 131 (with errors), saving inputs in /var/lib/pacemaker/pengine/pe-error-26.bz2
Sep  7 15:02:29 node2 crmd[58188]:  warning: Transition 131
(Complete=16, Pending=0, Fired=0, Skipped=0, Incomplete=5,
Source=/var/lib/pacemaker/pengine/pe-error-26.bz2): Terminated
Sep  7 15:02:29 node2 crmd[58188]:  warning: Transition failed:
terminated
Sep  7 15:02:29 node2 crmd[58188]:   notice: Graph 131 with 21
actions: batch-limit=0 jobs, network-delay=60000ms
Sep  7 15:02:29 node2 crmd[58188]:   notice: [Action   47]: Completed
pseudo op dlm-clone_running_0            on N/A (priority:
1000000, waiting: none)
Sep  7 15:02:29 node2 crmd[58188]:   notice: [Action   46]: Completed
pseudo op dlm-clone_start_0              on N/A (priority: 0,
waiting: none)
Sep  7 15:02:29 node2 crmd[58188]:   notice: [Action   55]: Completed
pseudo op WWWMount-clone_running_0       on N/A (priority:
1000000, waiting: none)
Sep  7 15:02:29 node2 crmd[58188]:   notice: [Action   54]: Completed
pseudo op WWWMount-clone_start_0         on N/A (priority: 0,
waiting: none)
Sep  7 15:02:29 node2 crmd[58188]:   notice: [Action   69]: Pending
rsc op WebServer_monitor_60000             on node2.mydomain.com
(priority: 0, waiting: none)
Sep  7 15:02:29 node2 crmd[58188]:   notice:  * [Input 68]:
Unresolved dependency rsc op WebServer_start_0 on node2.mydomain.com
Sep  7 15:02:29 node2 crmd[58188]:   notice: [Action   71]: Completed
pseudo op WebServer-clone_running_0        on N/A (priority:
1000000, waiting: none)
Sep  7 15:02:29 node2 crmd[58188]:   notice: [Action   70]: Completed
pseudo op WebServer-clone_start_0          on N/A (priority:
0, waiting: none)
Sep  7 15:02:29 node2 crmd[58188]:   notice: [Action   93]: Pending
rsc op SharedRootCrons_monitor_60000       on node2.mydomain.com
(priority: 0, waiting: none)
Sep  7 15:02:29 node2 crmd[58188]:   notice:  * [Input 92]:
Unresolved dependency rsc op SharedRootCrons_start_0 on
node2.mydomain.com
Sep  7 15:02:29 node2 crmd[58188]:   notice: [Action   95]: Completed
pseudo op SharedRootCrons-clone_running_0 on N/A (priority:
1000000, waiting: none)
Sep  7 15:02:29 node2 crmd[58188]:   notice: [Action   94]: Completed
pseudo op SharedRootCrons-clone_start_0  on N/A (priority: 0,
waiting: none)
Sep  7 15:02:29 node2 crmd[58188]:   notice: [Action  101]: Pending
rsc op SharedUserCrons_monitor_60000   on node2.mydomain.com
(priority: 0, waiting: none)
Sep  7 15:02:29 node2 crmd[58188]:   notice:  * [Input 100]:
Unresolved dependency rsc op SharedUserCrons_start_0 on
node2.mydomain.com
Sep  7 15:02:29 node2 crmd[58188]:   notice: [Action  103]: Completed
pseudo op SharedUserCrons-clone_running_0 on N/A (priority:
1000000, waiting: none)
Sep  7 15:02:29 node2 crmd[58188]:   notice: [Action  102]: Completed
pseudo op SharedUserCrons-clone_start_0 on N/A (priority: 0,
waiting: none)
Sep  7 15:02:29 node2 crmd[58188]:   notice: State transition
S_TRANSITION_ENGINE -> S_IDLE
Sep  7 15:02:29 node2 kernel: drbd r0 node1.mydomain.com: Handshake
to peer 0 successful: Agreed network protocol version 113
Sep  7 15:02:29 node2 kernel: drbd r0 node1.mydomain.com: Feature
flags enabled on protocol level: 0xf TRIM THIN_RESYNC WRITE_SAME
WRITE_ZEROES.
Sep  7 15:02:29 node2 kernel: drbd r0 node1.mydomain.com: Starting
ack_recv thread (from drbd_r_r0 [52295])
Sep  7 15:02:29 node2 kernel: drbd r0 node1.mydomain.com: Preparing
remote state change 2019156377
Sep  7 15:02:29 node2 kernel: drbd r0 node1.mydomain.com: Committing
remote state change 2019156377 (primary_nodes=1)
Sep  7 15:02:29 node2 kernel: drbd r0 node1.mydomain.com: conn(
Connecting -> Connected ) peer( Unknown -> Primary )
Sep  7 15:02:29 node2 kernel: drbd r0/0 drbd1 node1.mydomain.com:
drbd_sync_handshake:
Sep  7 15:02:29 node2 kernel: drbd r0/0 drbd1 node1.mydomain.com:
self
A2457506F4D44F1C:0000000000000000:B13E5D392CF268C4:FE2F70857D64FB02
bits:0 flags:20
Sep  7 15:02:29 node2 kernel: drbd r0/0 drbd1 node1.mydomain.com:
peer
D355B0F942665879:A2457506F4D44F1D:B13E5D392CF268C4:E56E164C51EEFAB0
bits:6 flags:120
Sep  7 15:02:29 node2 kernel: drbd r0/0 drbd1 node1.mydomain.com:
uuid_compare()=-2 by rule 50
Sep  7 15:02:29 node2 kernel: drbd r0/0 drbd1 node1.mydomain.com:
pdsk( DUnknown -> UpToDate ) repl( Off -> WFBitMapT )
Sep  7 15:02:29 node2 kernel: drbd r0/1 drbd2 node1.mydomain.com:
drbd_sync_handshake:
Sep  7 15:02:29 node2 kernel: drbd r0/1 drbd2 node1.mydomain.com:
self
0EC5D56AEE53C6B6:0000000000000000:0000000000000000:0000000000000000
bits:0 flags:20
Sep  7 15:02:29 node2 kernel: drbd r0/1 drbd2 node1.mydomain.com:
peer
0EC5D56AEE53C6B6:0000000000000000:B62926494645765C:0000000000000000
bits:0 flags:120
Sep  7 15:02:29 node2 kernel: drbd r0/1 drbd2 node1.mydomain.com:
uuid_compare()=0 by rule 38
Sep  7 15:02:29 node2 kernel: drbd r0/1 drbd2: disk( Outdated ->
UpToDate )
Sep  7 15:02:29 node2 kernel: drbd r0/1 drbd2 node1.mydomain.com:
pdsk( DUnknown -> UpToDate ) repl( Off -> Established )
Sep  7 15:02:29 node2 kernel: drbd r0/0 drbd1 node1.mydomain.com:
receive bitmap stats [Bytes(packets)]: plain 0(0), RLE 27(1),
total 27; compression: 100.0%
Sep  7 15:02:29 node2 kernel: drbd r0/0 drbd1 node1.mydomain.com:
send bitmap stats [Bytes(packets)]: plain 0(0), RLE 27(1), total
27; compression: 100.0%
Sep  7 15:02:29 node2 kernel: drbd r0/0 drbd1 node1.mydomain.com:
helper command: /sbin/drbdadm before-resync-target
Sep  7 15:02:29 node2 kernel: drbd r0/0 drbd1 node1.mydomain.com:
helper command: /sbin/drbdadm before-resync-target exit code 0 (0x0)
Sep  7 15:02:29 node2 kernel: drbd r0/0 drbd1: disk( Outdated ->
Inconsistent )
Sep  7 15:02:29 node2 kernel: drbd r0/0 drbd1 node1.mydomain.com:
repl( WFBitMapT -> SyncTarget )
Sep  7 15:02:29 node2 kernel: drbd r0/0 drbd1 node1.mydomain.com:
Began resync as SyncTarget (will sync 24 KB [6 bits set]).
Sep  7 15:02:29 node2 kernel: drbd r0/0 drbd1 node1.mydomain.com:
Resync done (total 1 sec; paused 0 sec; 24 K/sec)
Sep  7 15:02:29 node2 kernel: drbd r0/0 drbd1 node1.mydomain.com:
updated UUIDs
D355B0F942665878:0000000000000000:A2457506F4D44F1C:E2BDB50A1BFBAE5E
Sep  7 15:02:29 node2 kernel: drbd r0/0 drbd1: disk( Inconsistent ->
UpToDate )
Sep  7 15:02:29 node2 kernel: drbd r0/0 drbd1 node1.mydomain.com:
repl( SyncTarget -> Established )
Sep  7 15:02:29 node2 kernel: drbd r0/0 drbd1 node1.mydomain.com:
helper command: /sbin/drbdadm after-resync-target
Sep  7 15:02:29 node2 kernel: drbd r0/0 drbd1 node1.mydomain.com:
helper command: /sbin/drbdadm after-resync-target exit code 0 (0x0)
Sep  7 15:03:29 node2 crmd[58188]:   notice: State transition S_IDLE
-> S_POLICY_ENGINE
Sep  7 15:03:29 node2 pengine[58187]:    error: No further recovery
can be attempted for SecondaryUserCrons: stop action failed with
'not installed' (5)
Sep  7 15:03:29 node2 pengine[58187]:  warning: Processing failed op
stop for SecondaryUserCrons on node2.mydomain.com: not
installed (5)
Sep  7 15:03:29 node2 pengine[58187]:   notice: Preventing
SecondaryUserCrons from re-starting on node2.mydomain.com: operation
stop
failed 'not installed' (5)
Sep  7 15:03:29 node2 pengine[58187]:    error: No further recovery
can be attempted for SecondaryUserCrons: stop action failed with
'not installed' (5)
Sep  7 15:03:29 node2 pengine[58187]:  warning: Processing failed op
stop for SecondaryUserCrons on node2.mydomain.com: not
installed (5)
Sep  7 15:03:29 node2 pengine[58187]:   notice: Preventing
SecondaryUserCrons from re-starting on node2.mydomain.com: operation
stop
failed 'not installed' (5)
Sep  7 15:03:29 node2 pengine[58187]:  warning: Forcing
SecondaryUserCrons away from node2.mydomain.com after 1000000
failures
(max=1000000)
Sep  7 15:03:29 node2 pengine[58187]:   notice:  * Promote    DRBD:1                ( Slave -> Master node2.mydomain.com )
Sep  7 15:03:29 node2 pengine[58187]:   notice:  * Start      dlm:1                 ( node2.mydomain.com )
Sep  7 15:03:29 node2 pengine[58187]:   notice:  * Start      WWWMount:1            ( node2.mydomain.com )
Sep  7 15:03:29 node2 pengine[58187]:   notice:  * Start      WebServer:1           ( node2.mydomain.com )
Sep  7 15:03:29 node2 pengine[58187]:   notice:  * Start      SharedRootCrons:1     ( node2.mydomain.com )
Sep  7 15:03:29 node2 pengine[58187]:   notice:  * Start      SharedUserCrons:1     ( node2.mydomain.com )
Sep  7 15:03:29 node2 pengine[58187]:    error: Calculated transition 132 (with errors), saving inputs in /var/lib/pacemaker/pengine/pe-error-27.bz2
Sep  7 15:03:29 node2 crmd[58188]:   notice: Initiating cancel
operation DRBD_monitor_60000 locally on node2.mydomain.com
Sep  7 15:03:29 node2 crmd[58188]:   notice: Initiating notify
operation DRBD_pre_notify_promote_0 on node1.mydomain.com
Sep  7 15:03:29 node2 crmd[58188]:   notice: Initiating notify
operation DRBD_pre_notify_promote_0 locally on node2.mydomain.com
Sep  7 15:03:29 node2 crmd[58188]:   notice: Result of notify
operation for DRBD on node2.mydomain.com: 0 (ok)
Sep  7 15:03:29 node2 crmd[58188]:   notice: Initiating promote
operation DRBD_promote_0 locally on node2.mydomain.com
Sep  7 15:03:29 node2 kernel: drbd r0: Preparing cluster-wide state
change 360863446 (1->-1 3/1)
Sep  7 15:03:29 node2 kernel: drbd r0: State change 360863446:
primary_nodes=3, weak_nodes=FFFFFFFFFFFFFFFC
Sep  7 15:03:29 node2 kernel: drbd r0: Committing cluster-wide state
change 360863446 (0ms)
Sep  7 15:03:29 node2 kernel: drbd r0: role( Secondary -> Primary )
Sep  7 15:03:29 node2 crmd[58188]:   notice: Result of promote
operation for DRBD on node2.mydomain.com: 0 (ok)
Sep  7 15:03:29 node2 crmd[58188]:   notice: Initiating notify
operation DRBD_post_notify_promote_0 on node1.mydomain.com
Sep  7 15:03:29 node2 crmd[58188]:   notice: Initiating notify
operation DRBD_post_notify_promote_0 locally on node2.mydomain.com
Sep  7 15:03:29 node2 crmd[58188]:   notice: Result of notify
operation for DRBD on node2.mydomain.com: 0 (ok)
Sep  7 15:03:29 node2 crmd[58188]:   notice: Initiating start
operation dlm_start_0 locally on node2.mydomain.com
Sep  7 15:03:29 node2 dlm_controld[53127]: 693403 dlm_controld 4.0.7
started
Sep  7 15:03:30 node2 crmd[58188]:   notice: Result of start
operation for dlm on node2.mydomain.com: 0 (ok)
Sep  7 15:03:30 node2 crmd[58188]:   notice: Initiating monitor
operation dlm_monitor_60000 locally on node2.mydomain.com
Sep  7 15:03:30 node2 crmd[58188]:   notice: Initiating start
operation WWWMount_start_0 locally on node2.mydomain.com
Sep  7 15:03:30 node2 Filesystem(WWWMount)[53154]: INFO: Running
start for /dev/drbd1 on /var/www
Sep  7 15:03:30 node2 kernel: dlm: Using TCP for communications
Sep  7 15:03:30 node2 kernel: GFS2: fsid=MyCluster:www: Trying to
join cluster "lock_dlm", "MyCluster:www"
Sep  7 15:03:30 node2 kernel: dlm: connecting to 1
Sep  7 15:03:30 node2 kernel: dlm: got connection from 1
Sep  7 15:03:31 node2 kernel: GFS2: fsid=MyCluster:www: Joined
cluster. Now mounting FS...
Sep  7 15:03:31 node2 kernel: GFS2: fsid=MyCluster:www.1: jid=1,
already locked for use
Sep  7 15:03:31 node2 kernel: GFS2: fsid=MyCluster:www.1: jid=1:
Looking at journal...
Sep  7 15:03:31 node2 kernel: GFS2: fsid=MyCluster:www.1: jid=1: Done
Sep  7 15:03:31 node2 crmd[58188]:   notice: Result of start
operation for WWWMount on node2.mydomain.com: 0 (ok)
Sep  7 15:03:31 node2 crmd[58188]:   notice: Initiating monitor
operation WWWMount_monitor_20000 locally on node2.mydomain.com
Sep  7 15:03:31 node2 crmd[58188]:   notice: Initiating start
operation WebServer_start_0 locally on node2.mydomain.com
Sep  7 15:03:31 node2 crmd[58188]:   notice: Initiating start
operation SharedRootCrons_start_0 locally on node2.mydomain.com
Sep  7 15:03:31 node2 crmd[58188]:   notice: Initiating start
operation SharedUserCrons_start_0 locally on node2.mydomain.com
Sep  7 15:03:31 node2 symlink(SharedRootCrons)[53328]: INFO:
'/etc/cron.d/root-shared' -> '/var/www/crons/root-shared'
Sep  7 15:03:31 node2 symlink(SharedUserCrons)[53329]: INFO:
'/etc/cron.d/User-shared' -> '/var/www/crons/User-shared'
Sep  7 15:03:31 node2 crmd[58188]:   notice: Result of start
operation for SharedRootCrons on node2.mydomain.com: 0 (ok)
Sep  7 15:03:31 node2 crmd[58188]:   notice: Result of start
operation for SharedUserCrons on node2.mydomain.com: 0 (ok)
Sep  7 15:03:31 node2 crmd[58188]:   notice: Initiating monitor
operation SharedRootCrons_monitor_60000 locally on node2.mydomain.com
Sep  7 15:03:31 node2 crmd[58188]:   notice: Initiating monitor
operation SharedUserCrons_monitor_60000 locally on node2.mydomain.com
Sep  7 15:03:31 node2 apache(WebServer)[53325]: INFO: apache not
running
Sep  7 15:03:31 node2 apache(WebServer)[53325]: INFO: waiting for
apache /etc/httpd/conf/httpd.conf to come up
Sep  7 15:03:32 node2 crmd[58188]:   notice: Result of start
operation for WebServer on node2.mydomain.com: 0 (ok)
Sep  7 15:03:32 node2 crmd[58188]:   notice: Initiating monitor
operation WebServer_monitor_60000 locally on node2.mydomain.com
Sep  7 15:03:33 node2 crmd[58188]:   notice: Transition 132
(Complete=44, Pending=0, Fired=0, Skipped=0, Incomplete=0,
Source=/var/lib/pacemaker/pengine/pe-error-27.bz2): Complete
Sep  7 15:03:33 node2 crmd[58188]:   notice: State transition
S_TRANSITION_ENGINE -> S_IDLE



_______________________________________________
Users mailing list: [email protected]
https://lists.clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org
