Here are some more log files from both nodes (note the error on 'helper command: /sbin/drbdadm disconnected').

Master (Primary) Node going Standby
-----------------------------------
Jan 17 11:48:14 nfs6 pacemaker-controld[1693]: notice: Result of stop operation for nfs5-stonith on nfs6: ok
Jan 17 11:48:14 nfs6 pacemaker-controld[1693]: notice: Result of notify operation for drbd0 on nfs6: ok
Jan 17 11:48:14 nfs6 Filesystem(fs_drbd)[797290]: INFO: Running stop for /dev/drbd0 on /data
Jan 17 11:48:14 nfs6 Filesystem(fs_drbd)[797290]: INFO: Trying to unmount /data
Jan 17 11:48:14 nfs6 systemd[1923]: data.mount: Succeeded.
Jan 17 11:48:14 nfs6 systemd[1]: data.mount: Succeeded.
Jan 17 11:48:14 nfs6 kernel: XFS (drbd0): Unmounting Filesystem
Jan 17 11:48:14 nfs6 Filesystem(fs_drbd)[797290]: INFO: unmounted /data successfully
Jan 17 11:48:14 nfs6 pacemaker-controld[1693]: notice: Result of stop operation for fs_drbd on nfs6: ok
Jan 17 11:48:14 nfs6 kernel: drbd r0: role( Primary -> Secondary )
Jan 17 11:48:14 nfs6 pacemaker-controld[1693]: notice: Result of demote operation for drbd0 on nfs6: ok
Jan 17 11:48:14 nfs6 pacemaker-controld[1693]: notice: Result of notify operation for drbd0 on nfs6: ok
Jan 17 11:48:14 nfs6 pacemaker-controld[1693]: notice: Result of notify operation for drbd0 on nfs6: ok
Jan 17 11:48:14 nfs6 kernel: drbd r0: Preparing cluster-wide state change 59605293 (1->0 496/16)
Jan 17 11:48:14 nfs6 kernel: drbd r0: State change 59605293: primary_nodes=0, weak_nodes=0
Jan 17 11:48:14 nfs6 kernel: drbd r0 nfs5: Cluster is now split
Jan 17 11:48:14 nfs6 kernel: drbd r0: Committing cluster-wide state change 59605293 (0ms)
Jan 17 11:48:14 nfs6 kernel: drbd r0 nfs5: conn( Connected -> Disconnecting ) peer( Secondary -> Unknown )
Jan 17 11:48:14 nfs6 kernel: drbd r0/0 drbd0 nfs5: pdsk( UpToDate -> DUnknown ) repl( Established -> Off )
Jan 17 11:48:14 nfs6 kernel: drbd r0 nfs5: ack_receiver terminated
Jan 17 11:48:14 nfs6 kernel: drbd r0 nfs5: Terminating ack_recv thread
Jan 17 11:48:14 nfs6 kernel: drbd r0 nfs5: Restarting sender thread
Jan 17 11:48:14 nfs6 kernel: drbd r0 nfs5: Connection closed
Jan 17 11:48:14 nfs6 kernel: drbd r0 nfs5: helper command: /sbin/drbdadm disconnected
Jan 17 11:48:14 nfs6 drbdadm[797503]: drbdadm: Unknown command 'disconnected'
Jan 17 11:48:14 nfs6 kernel: drbd r0 nfs5: helper command: /sbin/drbdadm disconnected exit code 1 (0x100)
Jan 17 11:48:14 nfs6 kernel: drbd r0 nfs5: conn( Disconnecting -> StandAlone )
Jan 17 11:48:14 nfs6 kernel: drbd r0 nfs5: Terminating receiver thread
Jan 17 11:48:14 nfs6 kernel: drbd r0 nfs5: Terminating sender thread
Jan 17 11:48:14 nfs6 kernel: drbd r0/0 drbd0: disk( UpToDate -> Detaching )
Jan 17 11:48:14 nfs6 kernel: drbd r0/0 drbd0: disk( Detaching -> Diskless )
Jan 17 11:48:14 nfs6 kernel: drbd r0/0 drbd0: drbd_bm_resize called with capacity == 0
Jan 17 11:48:14 nfs6 kernel: drbd r0: Terminating worker thread
Jan 17 11:48:14 nfs6 pacemaker-attrd[1691]: notice: Setting master-drbd0[nfs6]: 10000 -> (unset)
Jan 17 11:48:14 nfs6 pacemaker-controld[1693]: notice: Result of stop operation for drbd0 on nfs6: ok
Jan 17 11:48:14 nfs6 pacemaker-attrd[1691]: notice: Setting master-drbd0[nfs5]: 10000 -> 1000
Jan 17 11:48:14 nfs6 pacemaker-controld[1693]: notice: Current ping state: S_NOT_DC


Secondary Node going primary (fails)
------------------------------------
Jan 17 11:48:14 nfs5 pacemaker-controld[207005]: notice: State transition S_IDLE -> S_POLICY_ENGINE
Jan 17 11:48:14 nfs5 pacemaker-schedulerd[207004]: notice: On loss of quorum: Ignore
Jan 17 11:48:14 nfs5 pacemaker-schedulerd[207004]: notice:  * Move       fs_drbd          (         nfs6 -> nfs5 )
Jan 17 11:48:14 nfs5 pacemaker-schedulerd[207004]: notice:  * Stop       nfs5-stonith     (                 nfs6 )   due to node availability
Jan 17 11:48:14 nfs5 pacemaker-schedulerd[207004]: notice:  * Stop       drbd0:0          (          Master nfs6 )   due to node availability
Jan 17 11:48:14 nfs5 pacemaker-schedulerd[207004]: notice:  * Promote    drbd0:1          ( Slave -> Master nfs5 )
Jan 17 11:48:14 nfs5 pacemaker-schedulerd[207004]: notice: Calculated transition 490, saving inputs in /var/lib/pacemaker/pengine/pe-input-123.bz2
Jan 17 11:48:14 nfs5 pacemaker-controld[207005]: notice: Initiating stop operation fs_drbd_stop_0 on nfs6
Jan 17 11:48:14 nfs5 pacemaker-controld[207005]: notice: Initiating stop operation nfs5-stonith_stop_0 on nfs6
Jan 17 11:48:14 nfs5 pacemaker-controld[207005]: notice: Initiating cancel operation drbd0_monitor_20000 locally on nfs5
Jan 17 11:48:14 nfs5 pacemaker-controld[207005]: notice: Initiating notify operation drbd0_pre_notify_demote_0 on nfs6
Jan 17 11:48:14 nfs5 pacemaker-controld[207005]: notice: Initiating notify operation drbd0_pre_notify_demote_0 locally on nfs5
Jan 17 11:48:14 nfs5 pacemaker-controld[207005]: notice: Result of notify operation for drbd0 on nfs5: ok
Jan 17 11:48:14 nfs5 pacemaker-controld[207005]: notice: Initiating demote operation drbd0_demote_0 on nfs6
Jan 17 11:48:14 nfs5 kernel: drbd r0 nfs6: peer( Primary -> Secondary )
Jan 17 11:48:14 nfs5 pacemaker-controld[207005]: notice: Initiating notify operation drbd0_post_notify_demote_0 on nfs6
Jan 17 11:48:14 nfs5 pacemaker-controld[207005]: notice: Initiating notify operation drbd0_post_notify_demote_0 locally on nfs5
Jan 17 11:48:14 nfs5 pacemaker-controld[207005]: notice: Result of notify operation for drbd0 on nfs5: ok
Jan 17 11:48:14 nfs5 pacemaker-controld[207005]: notice: Initiating notify operation drbd0_pre_notify_stop_0 on nfs6
Jan 17 11:48:14 nfs5 pacemaker-controld[207005]: notice: Initiating notify operation drbd0_pre_notify_stop_0 locally on nfs5
Jan 17 11:48:14 nfs5 pacemaker-controld[207005]: notice: Result of notify operation for drbd0 on nfs5: ok
Jan 17 11:48:14 nfs5 pacemaker-controld[207005]: notice: Initiating stop operation drbd0_stop_0 on nfs6
Jan 17 11:48:14 nfs5 kernel: drbd r0 nfs6: Preparing remote state change 59605293
Jan 17 11:48:14 nfs5 kernel: drbd r0 nfs6: Committing remote state change 59605293 (primary_nodes=0)
Jan 17 11:48:14 nfs5 kernel: drbd r0 nfs6: conn( Connected -> TearDown ) peer( Secondary -> Unknown )
Jan 17 11:48:14 nfs5 kernel: drbd r0/0 drbd0 nfs6: pdsk( UpToDate -> DUnknown ) repl( Established -> Off )
Jan 17 11:48:14 nfs5 kernel: drbd r0 nfs6: ack_receiver terminated
Jan 17 11:48:14 nfs5 kernel: drbd r0 nfs6: Terminating ack_recv thread
Jan 17 11:48:14 nfs5 kernel: drbd r0 nfs6: Restarting sender thread
Jan 17 11:48:14 nfs5 kernel: drbd r0 nfs6: Connection closed
Jan 17 11:48:14 nfs5 kernel: drbd r0 nfs6: helper command: /sbin/drbdadm disconnected
Jan 17 11:48:14 nfs5 drbdadm[570326]: drbdadm: Unknown command 'disconnected'
Jan 17 11:48:14 nfs5 kernel: drbd r0 nfs6: helper command: /sbin/drbdadm disconnected exit code 1 (0x100)
Jan 17 11:48:14 nfs5 kernel: drbd r0 nfs6: conn( TearDown -> Unconnected )
Jan 17 11:48:14 nfs5 kernel: drbd r0 nfs6: Restarting receiver thread
Jan 17 11:48:14 nfs5 kernel: drbd r0 nfs6: conn( Unconnected -> Connecting )
Jan 17 11:48:14 nfs5 pacemaker-attrd[207003]: notice: Setting master-drbd0[nfs6]: 10000 -> (unset)
Jan 17 11:48:14 nfs5 pacemaker-controld[207005]: notice: Transition 490 aborted by deletion of nvpair[@id='status-2-master-drbd0']: Transient attribute change
Jan 17 11:48:14 nfs5 pacemaker-controld[207005]: notice: Initiating notify operation drbd0_post_notify_stop_0 locally on nfs5
Jan 17 11:48:14 nfs5 pacemaker-attrd[207003]: notice: Setting master-drbd0[nfs5]: 10000 -> 1000
Jan 17 11:48:14 nfs5 pacemaker-controld[207005]: notice: Result of notify operation for drbd0 on nfs5: ok
Jan 17 11:48:14 nfs5 pacemaker-controld[207005]: notice: Transition 490 (Complete=27, Pending=0, Fired=0, Skipped=1, Incomplete=13, Source=/var/lib/pacemaker/pengine/pe-input-123.bz2): Stopped
Jan 17 11:48:14 nfs5 pacemaker-schedulerd[207004]: notice: On loss of quorum: Ignore
Jan 17 11:48:14 nfs5 pacemaker-schedulerd[207004]: notice:  * Start      fs_drbd          (                 nfs5 )
Jan 17 11:48:14 nfs5 pacemaker-schedulerd[207004]: notice:  * Promote    drbd0:0          ( Slave -> Master nfs5 )
Jan 17 11:48:14 nfs5 pacemaker-schedulerd[207004]: notice: Calculated transition 491, saving inputs in /var/lib/pacemaker/pengine/pe-input-124.bz2
Jan 17 11:48:14 nfs5 pacemaker-controld[207005]: notice: Initiating notify operation drbd0_pre_notify_promote_0 locally on nfs5
Jan 17 11:48:14 nfs5 pacemaker-controld[207005]: notice: Result of notify operation for drbd0 on nfs5: ok
Jan 17 11:48:14 nfs5 pacemaker-controld[207005]: notice: Initiating promote operation drbd0_promote_0 locally on nfs5
Jan 17 11:48:14 nfs5 kernel: drbd r0 nfs6: helper command: /sbin/drbdadm fence-peer
Jan 17 11:48:14 nfs5 crm-fence-peer.9.sh[570403]: DRBD_BACKING_DEV_0=/dev/sdb1 DRBD_CONF=/etc/drbd.conf DRBD_CSTATE=Connecting DRBD_LL_DISK=/dev/sdb1 DRBD_MINOR=0 DRBD_MINOR_0=0 DRBD_MY_ADDRESS=10.1.3.35 DRBD_MY_AF=ipv4 DRBD_MY_NODE_ID=0 DRBD_NODE_ID_0=nfs5 DRBD_NODE_ID_1=nfs6 DRBD_PEER_ADDRESS=10.1.3.36 DRBD_PEER_AF=ipv4 DRBD_PEER_NODE_ID=1 DRBD_RESOURCE=r0 DRBD_VOLUME=0 UP_TO_DATE_NODES=0x00000001 /usr/lib/drbd/crm-fence-peer.9.sh
Jan 17 11:48:14 nfs5 crm-fence-peer.9.sh[570403]: (qb_rb_open_2) #011debug: shm size:131085; real_size:135168; rb->word_size:33792
Jan 17 11:48:14 nfs5 crm-fence-peer.9.sh[570403]: (qb_rb_open_2) #011debug: shm size:131085; real_size:135168; rb->word_size:33792
Jan 17 11:48:14 nfs5 crm-fence-peer.9.sh[570403]: (qb_rb_open_2) #011debug: shm size:131085; real_size:135168; rb->word_size:33792
Jan 17 11:48:14 nfs5 crm-fence-peer.9.sh[570403]: (connect_with_main_loop) #011debug: Connected to controller IPC (attached to main loop)
Jan 17 11:48:14 nfs5 crm-fence-peer.9.sh[570403]: (post_connect) #011debug: Sent IPC hello to controller
Jan 17 11:48:14 nfs5 crm-fence-peer.9.sh[570403]: (qb_ipcc_disconnect) #011debug: qb_ipcc_disconnect()
Jan 17 11:48:14 nfs5 crm-fence-peer.9.sh[570403]: (qb_rb_close_helper) #011debug: Closing ringbuffer: /dev/shm/qb-207005-570438-16-rx95PO/qb-request-crmd-header
Jan 17 11:48:14 nfs5 crm-fence-peer.9.sh[570403]: (qb_rb_close_helper) #011debug: Closing ringbuffer: /dev/shm/qb-207005-570438-16-rx95PO/qb-response-crmd-header
Jan 17 11:48:14 nfs5 crm-fence-peer.9.sh[570403]: (qb_rb_close_helper) #011debug: Closing ringbuffer: /dev/shm/qb-207005-570438-16-rx95PO/qb-event-crmd-header
Jan 17 11:48:14 nfs5 crm-fence-peer.9.sh[570403]: (ipc_post_disconnect) #011info: Disconnected from controller IPC API
Jan 17 11:48:14 nfs5 crm-fence-peer.9.sh[570403]: (pcmk_free_ipc_api) #011debug: Releasing controller IPC API
Jan 17 11:48:14 nfs5 crm-fence-peer.9.sh[570403]: (crm_xml_cleanup) #011info: Cleaning up memory from libxml2
Jan 17 11:48:14 nfs5 crm-fence-peer.9.sh[570403]: (crm_exit) #011info: Exiting crm_node | with status 0
Jan 17 11:48:14 nfs5 crm-fence-peer.9.sh[570403]: /
Jan 17 11:48:14 nfs5 crm-fence-peer.9.sh[570403]: Could not connect to the CIB: No such device or address
Jan 17 11:48:14 nfs5 crm-fence-peer.9.sh[570403]: Init failed, could not perform requested operations
Jan 17 11:48:14 nfs5 crm-fence-peer.9.sh[570403]: WARNING DATA INTEGRITY at RISK: could not place the fencing constraint!
Jan 17 11:48:14 nfs5 kernel: drbd r0 nfs6: helper command: /sbin/drbdadm fence-peer exit code 1 (0x100)
Jan 17 11:48:14 nfs5 kernel: drbd r0 nfs6: fence-peer helper broken, returned 1
Jan 17 11:48:14 nfs5 kernel: drbd r0: State change failed: Refusing to be Primary while peer is not outdated
Jan 17 11:48:14 nfs5 kernel: drbd r0: Failed: role( Secondary -> Primary )
Jan 17 11:48:14 nfs5 kernel: drbd r0 nfs6: helper command: /sbin/drbdadm fence-peer
Jan 17 11:48:14 nfs5 crm-fence-peer.9.sh[570455]: DRBD_BACKING_DEV_0=/dev/sdb1 DRBD_CONF=/etc/drbd.conf DRBD_CSTATE=Connecting DRBD_LL_DISK=/dev/sdb1 DRBD_MINOR=0 DRBD_MINOR_0=0 DRBD_MY_ADDRESS=10.1.3.35 DRBD_MY_AF=ipv4 DRBD_MY_NODE_ID=0 DRBD_NODE_ID_0=nfs5 DRBD_NODE_ID_1=nfs6 DRBD_PEER_ADDRESS=10.1.3.36 DRBD_PEER_AF=ipv4 DRBD_PEER_NODE_ID=1 DRBD_RESOURCE=r0 DRBD_VOLUME=0 UP_TO_DATE_NODES=0x00000001 /usr/lib/drbd/crm-fence-peer.9.sh
Jan 17 11:48:14 nfs5 crm-fence-peer.9.sh[570455]: (qb_rb_open_2) #011debug: shm size:131085; real_size:135168; rb->word_size:33792
Jan 17 11:48:14 nfs5 crm-fence-peer.9.sh[570455]: (qb_rb_open_2) #011debug: shm size:131085; real_size:135168; rb->word_size:33792
Jan 17 11:48:14 nfs5 crm-fence-peer.9.sh[570455]: (qb_rb_open_2) #011debug: shm size:131085; real_size:135168; rb->word_size:33792
Jan 17 11:48:14 nfs5 crm-fence-peer.9.sh[570455]: (connect_with_main_loop) #011debug: Connected to controller IPC (attached to main loop)
Jan 17 11:48:14 nfs5 crm-fence-peer.9.sh[570455]: (post_connect) #011debug: Sent IPC hello to controller
Jan 17 11:48:14 nfs5 crm-fence-peer.9.sh[570455]: (qb_ipcc_disconnect) #011debug: qb_ipcc_disconnect()
Jan 17 11:48:14 nfs5 crm-fence-peer.9.sh[570455]: (qb_rb_close_helper) #011debug: Closing ringbuffer: /dev/shm/qb-207005-570490-16-D2a84t/qb-request-crmd-header
Jan 17 11:48:14 nfs5 crm-fence-peer.9.sh[570455]: (qb_rb_close_helper) #011debug: Closing ringbuffer: /dev/shm/qb-207005-570490-16-D2a84t/qb-response-crmd-header
Jan 17 11:48:14 nfs5 crm-fence-peer.9.sh[570455]: (qb_rb_close_helper) #011debug: Closing ringbuffer: /dev/shm/qb-207005-570490-16-D2a84t/qb-event-crmd-header
Jan 17 11:48:14 nfs5 crm-fence-peer.9.sh[570455]: (ipc_post_disconnect) #011info: Disconnected from controller IPC API
Jan 17 11:48:14 nfs5 crm-fence-peer.9.sh[570455]: (pcmk_free_ipc_api) #011debug: Releasing controller IPC API
Jan 17 11:48:14 nfs5 crm-fence-peer.9.sh[570455]: (crm_xml_cleanup) #011info: Cleaning up memory from libxml2
Jan 17 11:48:14 nfs5 crm-fence-peer.9.sh[570455]: (crm_exit) #011info: Exiting crm_node | with status 0
Jan 17 11:48:14 nfs5 crm-fence-peer.9.sh[570455]: /
Jan 17 11:48:14 nfs5 crm-fence-peer.9.sh[570455]: Could not connect to the CIB: No such device or address
Jan 17 11:48:14 nfs5 crm-fence-peer.9.sh[570455]: Init failed, could not perform requested operations
Jan 17 11:48:14 nfs5 crm-fence-peer.9.sh[570455]: WARNING DATA INTEGRITY at RISK: could not place the fencing constraint!
Jan 17 11:48:14 nfs5 kernel: drbd r0 nfs6: helper command: /sbin/drbdadm fence-peer exit code 1 (0x100)
Jan 17 11:48:14 nfs5 kernel: drbd r0 nfs6: fence-peer helper broken, returned 1
Jan 17 11:48:14 nfs5 drbd(drbd0)[570375]: ERROR: r0: Called drbdadm -c /etc/drbd.conf primary r0
Jan 17 11:48:14 nfs5 drbd(drbd0)[570375]: ERROR: r0: Exit code 11
Jan 17 11:48:14 nfs5 drbd(drbd0)[570375]: ERROR: r0: Command output:
Jan 17 11:48:15 nfs5 drbd(drbd0)[570375]: ERROR: r0: Command stderr: r0: State change failed: (-7) Refusing to be Primary while peer is not outdated#012Command 'drbdsetup primary r0' terminated with exit code 11
Jan 17 11:48:15 nfs5 kernel: drbd r0 nfs6: helper command: /sbin/drbdadm fence-peer
...
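Side note on the repeated "drbdadm: Unknown command 'disconnected'" lines above: as far as I can tell, that happens when the drbd kernel module invokes a handler ('disconnected') that the installed drbd-utils does not know about yet, i.e. module and userspace may be out of step. A generic way to compare the two (nothing here is specific to this cluster):

    drbdadm --version                # DRBDADM_VERSION vs. DRBD_KERNEL_VERSION lines
    modinfo drbd | grep '^version'   # version of the drbd module installed on disk
    cat /proc/drbd                   # version of the module actually loaded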



On 1/16/2021 11:07 AM, Strahil Nikolov wrote:
On Fri, 15.01.2021 at 14:10 -0700, Brent Jensen wrote:

Problem: When performing "pcs node standby" on the current master, this node demotes fine but the slave doesn't promote to master. It keeps looping the same error, including "Refusing to be Primary while peer is not outdated" and "Could not connect to the CIB." At this point the old master has already unloaded DRBD. The only way to fix it is to start DRBD on the standby node (e.g. drbdadm up r0). Logs contained herein are from the node trying to become master.
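Roughly, the sequence and the workaround look like this (just a sketch; r0 and the node names are the ones from the logs):

    # on the current master (nfs6): put it in standby
    pcs node standby nfs6
    # nfs6 demotes and stops DRBD, but nfs5 never promotes; it loops on
    #   "Refusing to be Primary while peer is not outdated"
    drbdadm status r0      # on nfs5: still Secondary, connection stuck in Connecting
    # only workaround found so far: bring DRBD back up on the node now in standby
    drbdadm up r0          # on nfs6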

In order to debug, stop the cluster and verify that drbd is running properly. Promote one of the nodes, then demote and promote another one...
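Something along these lines, as a sketch (assuming the resource is r0 and the cluster is stopped on both nodes first):

    pcs cluster stop --all     # take Pacemaker out of the picture
    drbdadm up r0              # on both nodes
    drbdadm status r0          # wait for Connected / UpToDate on both sides
    drbdadm primary r0         # promote node A, then
    drbdadm secondary r0       # demote node A again
    drbdadm primary r0         # promote node B, to prove promotion works both ways

If that works cleanly outside the cluster, the problem is in the cluster/fencing layer rather than in DRBD itself.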
I have done this on DRBD9/CentOS7/Pacemaker1 without any problems, so I don't know where the issue is (crm-fence-peer.9.sh?).

Another odd data point: on the slave, if I do a "pcs node standby" and then unstandby, DRBD is loaded again; however, when I do this on the master (which should then become the slave), DRBD doesn't get loaded.

Stonith/Fencing doesn't seem to make a difference. Not sure if auto-promote is required.

Quote from the official documentation (https://www.linbit.com/drbd-user-guide/drbd-guide-9_0-en/#s-pacemaker-crm-drbd-backed-service): If you are employing the DRBD OCF resource agent, it is recommended that you defer DRBD startup, shutdown, promotion, and demotion /exclusively/ to the OCF resource agent. That means that you should disable the DRBD init script:
So remove auto-promote and disable the drbd service entirely.
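In DRBD 9 terms that would be something like this (assuming the stock file layout; adjust paths to your setup):

    systemctl disable drbd     # don't start DRBD at boot; the cluster manages it

and in /etc/drbd.d/global_common.conf (or the r0 resource file):

    options {
        auto-promote no;   # or drop the line; the resource agent promotes explicitly
    }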

Best Regards, Strahil Nikolov

_______________________________________________
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users

ClusterLabs home: https://www.clusterlabs.org/

