On 03/30/2016 11:20 AM, Sam Gardner wrote:
> I have configured some network resources to automatically standby their node
> if the system detects a failure on them. However, the DRBD slave that I have
> configured does not automatically restart after the node is "unstandby-ed"
> after the failure-timeout expires.
> Is there any way to make the "stopped" DRBDSlave resource automatically start
> again once the node is recovered?
>
> See the progression of events below:
>
> Running cluster:
> Wed Mar 30 16:04:20 UTC 2016
> Cluster name:
> Last updated: Wed Mar 30 16:04:20 2016
> Last change: Wed Mar 30 16:03:24 2016
> Stack: classic openais (with plugin)
> Current DC: ha-d1.tw.com - partition with quorum
> Version: 1.1.12-561c4cf
> 2 Nodes configured, 2 expected votes
> 7 Resources configured
>
> Online: [ ha-d1.tw.com ha-d2.tw.com ]
>
> Full list of resources:
>
> Resource Group: network
>     inif (ocf::custom:ip.sh): Started ha-d1.tw.com
>     outif (ocf::custom:ip.sh): Started ha-d1.tw.com
>     dmz1 (ocf::custom:ip.sh): Started ha-d1.tw.com
> Master/Slave Set: DRBDMaster [DRBDSlave]
>     Masters: [ ha-d1.tw.com ]
>     Slaves: [ ha-d2.tw.com ]
> Resource Group: filesystem
>     DRBDFS (ocf::heartbeat:Filesystem): Started ha-d1.tw.com
> Resource Group: application
>     service_failover (ocf::custom:service_failover): Started ha-d1.tw.com
>
> version: 8.4.5 (api:1/proto:86-101)
> srcversion: 315FB2BBD4B521D13C20BF4
>
> 1: cs:Connected ro:Primary/Secondary ds:UpToDate/UpToDate C r-----
>     ns:4 nr:0 dw:4 dr:757 al:1 bm:0 lo:0 pe:0 ua:0 ap:0 ep:1 wo:f oos:0
> [153766.565352] block drbd1: send bitmap stats [Bytes(packets)]: plain 0(0), RLE 21(1), total 21; compression: 100.0%
> [153766.568303] block drbd1: receive bitmap stats [Bytes(packets)]: plain 0(0), RLE 21(1), total 21; compression: 100.0%
> [153766.568316] block drbd1: helper command: /sbin/drbdadm before-resync-source minor-1
> [153766.568356] block drbd1: helper command: /sbin/drbdadm before-resync-source minor-1 exit code 255 (0xfffffffe)
> [153766.568363] block drbd1: conn( WFBitMapS -> SyncSource ) pdsk( Consistent -> Inconsistent )
> [153766.568374] block drbd1: Began resync as SyncSource (will sync 4 KB [1 bits set]).
> [153766.568444] block drbd1: updated sync UUID B0DA745C79C56591:36E0631B6F022952:36DF631B6F022952:133127197CF097C6
> [153766.577695] block drbd1: Resync done (total 1 sec; paused 0 sec; 4 K/sec)
> [153766.577700] block drbd1: updated UUIDs B0DA745C79C56591:0000000000000000:36E0631B6F022952:36DF631B6F022952
> [153766.577705] block drbd1: conn( SyncSource -> Connected ) pdsk( Inconsistent -> UpToDate )
>
> Failure detected:
> Wed Mar 30 16:08:22 UTC 2016
> Cluster name:
> Last updated: Wed Mar 30 16:08:22 2016
> Last change: Wed Mar 30 16:03:24 2016
> Stack: classic openais (with plugin)
> Current DC: ha-d1.tw.com - partition with quorum
> Version: 1.1.12-561c4cf
> 2 Nodes configured, 2 expected votes
> 7 Resources configured
>
> Node ha-d1.tw.com: standby (on-fail)
> Online: [ ha-d2.tw.com ]
>
> Full list of resources:
>
> Resource Group: network
>     inif (ocf::custom:ip.sh): Started ha-d1.tw.com
>     outif (ocf::custom:ip.sh): Started ha-d1.tw.com
>     dmz1 (ocf::custom:ip.sh): FAILED ha-d1.tw.com
> Master/Slave Set: DRBDMaster [DRBDSlave]
>     Masters: [ ha-d1.tw.com ]
>     Slaves: [ ha-d2.tw.com ]
> Resource Group: filesystem
>     DRBDFS (ocf::heartbeat:Filesystem): Started ha-d1.tw.com
> Resource Group: application
>     service_failover (ocf::custom:service_failover): Started ha-d1.tw.com
>
> Failed actions:
>     dmz1_monitor_7000 on ha-d1.tw.com 'not running' (7): call=156, status=complete, last-rc-change='Wed Mar 30 16:08:19 2016', queued=0ms, exec=0ms
>
> version: 8.4.5 (api:1/proto:86-101)
> srcversion: 315FB2BBD4B521D13C20BF4
>
> 1: cs:Connected ro:Primary/Secondary ds:UpToDate/UpToDate C r-----
>     ns:4 nr:0 dw:4 dr:765 al:1 bm:0 lo:0 pe:0 ua:0 ap:0 ep:1 wo:f oos:0
> [153766.568356] block drbd1: helper command: /sbin/drbdadm before-resync-source minor-1 exit code 255 (0xfffffffe)
> [153766.568363] block drbd1: conn( WFBitMapS -> SyncSource ) pdsk( Consistent -> Inconsistent )
> [153766.568374] block drbd1: Began resync as SyncSource (will sync 4 KB [1 bits set]).
> [153766.568444] block drbd1: updated sync UUID B0DA745C79C56591:36E0631B6F022952:36DF631B6F022952:133127197CF097C6
> [153766.577695] block drbd1: Resync done (total 1 sec; paused 0 sec; 4 K/sec)
> [153766.577700] block drbd1: updated UUIDs B0DA745C79C56591:0000000000000000:36E0631B6F022952:36DF631B6F022952
> [153766.577705] block drbd1: conn( SyncSource -> Connected ) pdsk( Inconsistent -> UpToDate )
> [154057.455270] e1000: eth2 NIC Link is Down
> [154057.455451] e1000 0000:02:02.0 eth2: Reset adapter
>
> Failover complete:
> Wed Mar 30 16:09:02 UTC 2016
> Cluster name:
> Last updated: Wed Mar 30 16:09:02 2016
> Last change: Wed Mar 30 16:03:24 2016
> Stack: classic openais (with plugin)
> Current DC: ha-d1.tw.com - partition with quorum
> Version: 1.1.12-561c4cf
> 2 Nodes configured, 2 expected votes
> 7 Resources configured
>
> Node ha-d1.tw.com: standby (on-fail)
> Online: [ ha-d2.tw.com ]
>
> Full list of resources:
>
> Resource Group: network
>     inif (ocf::custom:ip.sh): Started ha-d2.tw.com
>     outif (ocf::custom:ip.sh): Started ha-d2.tw.com
>     dmz1 (ocf::custom:ip.sh): Started ha-d2.tw.com
> Master/Slave Set: DRBDMaster [DRBDSlave]
>     Masters: [ ha-d2.tw.com ]
>     Stopped: [ ha-d1.tw.com ]
> Resource Group: filesystem
>     DRBDFS (ocf::heartbeat:Filesystem): Started ha-d2.tw.com
> Resource Group: application
>     service_failover (ocf::custom:service_failover): Started ha-d2.tw.com
>
> Failed actions:
>     dmz1_monitor_7000 on ha-d1.tw.com 'not running' (7): call=156, status=complete, last-rc-change='Wed Mar 30 16:08:19 2016', queued=0ms, exec=0ms
>
> version: 8.4.5 (api:1/proto:86-101)
> srcversion: 315FB2BBD4B521D13C20BF4
> [154094.894524] drbd wwwdata: conn( Disconnecting -> StandAlone )
> [154094.894525] drbd wwwdata: receiver terminated
> [154094.894527] drbd wwwdata: Terminating drbd_r_wwwdata
> [154094.894559] block drbd1: disk( UpToDate -> Failed )
> [154094.894569] block drbd1: bitmap WRITE of 0 pages took 0 jiffies
> [154094.894571] block drbd1: 4 KB (1 bits) marked out-of-sync by on disk bit-map.
> [154094.894574] block drbd1: disk( Failed -> Diskless )
> [154094.894647] block drbd1: drbd_bm_resize called with capacity == 0
> [154094.894652] drbd wwwdata: Terminating drbd_w_wwwdata
>
> Standby node recovered, with DRBDSlave stopped (I want DRBDSlave started here):
> Wed Mar 30 16:13:01 UTC 2016
> Cluster name:
> Last updated: Wed Mar 30 16:13:01 2016
> Last change: Wed Mar 30 16:03:24 2016
> Stack: classic openais (with plugin)
> Current DC: ha-d1.tw.com - partition with quorum
> Version: 1.1.12-561c4cf
> 2 Nodes configured, 2 expected votes
> 7 Resources configured
>
> Online: [ ha-d1.tw.com ha-d2.tw.com ]
>
> Full list of resources:
>
> Resource Group: network
>     inif (ocf::custom:ip.sh): Started ha-d2.tw.com
>     outif (ocf::custom:ip.sh): Started ha-d2.tw.com
>     dmz1 (ocf::custom:ip.sh): Started ha-d2.tw.com
> Master/Slave Set: DRBDMaster [DRBDSlave]
>     Masters: [ ha-d2.tw.com ]
>     Stopped: [ ha-d1.tw.com ]
> Resource Group: filesystem
>     DRBDFS (ocf::heartbeat:Filesystem): Started ha-d2.tw.com
> Resource Group: application
>     service_failover (ocf::custom:service_failover): Started ha-d2.tw.com
>
> version: 8.4.5 (api:1/proto:86-101)
> srcversion: 315FB2BBD4B521D13C20BF4
> [154094.894574] block drbd1: disk( Failed -> Diskless )
> [154094.894647] block drbd1: drbd_bm_resize called with capacity == 0
> [154094.894652] drbd wwwdata: Terminating drbd_w_wwwdata
>
> --
> Sam Gardner
> Trustwave | SMART SECURITY ON DEMAND
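Before assuming a bug, it may be worth confirming on ha-d1 what the relevant timer is set to and whether the old failure is still recorded. A rough sketch using the standard Pacemaker CLI tools (resource and node names are taken from the status output above; flag spellings are from the 1.1-era tools, so double-check against your installed versions):

```shell
# failure-timeout expiry is only acted on when the policy engine runs,
# i.e. on a cluster event or on this timer (15 minutes by default):
crm_attribute --type crm_config --name cluster-recheck-interval --query

# Is the dmz1 failure still recorded against ha-d1 after recovery?
crm_failcount -G -r dmz1 -N ha-d1.tw.com

# Clearing the failure history by hand should force a new transition;
# if DRBDSlave starts after this, then it is the timeout expiry itself
# that is not triggering a recalculation:
crm_resource --cleanup --resource dmz1 --node ha-d1.tw.com

# Collect logs around the failure window for a bug report:
crm_report --from "2016-03-30 16:08:00" --to "2016-03-30 16:20:00" /tmp/drbd-standby
```

These commands need a live cluster to run against, so treat them as a checklist rather than a script.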
This might be a bug. A crm_report covering a few minutes around when the failure-timeout expires might help. Does the slave start after the next cluster-recheck-interval?

_______________________________________________
Users mailing list: Users@clusterlabs.org
http://clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org