>>> Ken Gaillot <kgail...@redhat.com> wrote on 07.04.2016 at 00:04 in
message <57058805.8050...@redhat.com>:
> On 03/30/2016 12:18 PM, Sam Gardner wrote:
>> I'll check about the cluster-recheck-interval. Attached is a crm_report.
>>
>> In the meantime, what all is performed on that interval? The Red Hat docs
>> say the following, which doesn't make much sense to me:
In my understanding, the cluster re-probes resources, and if there is a
mismatch between the actual and believed state, actions are triggered and
performed. This can be good, or can be bad: if you deliberately do not
monitor some resources, a cluster recheck will actually "monitor" them and
perform actions...

> Normally, the cluster only recalculates what actions need to be taken
> when an interesting event occurs -- node or resource failure,
> configuration change, node attribute change, etc.
>
> The cluster-recheck-interval allows that recalculation to happen
> regardless of (the lack of) events. For example, let's say you have
> rules that specify that certain constraints only apply between 9am and
> 5pm. If there are no events happening at 9am, the rules won't actually
> be noticed or take effect. So the cluster-recheck-interval is the
> granularity of such "time-based changes". A cluster-recheck-interval of
> 5m ensures the rules kick in no later than 9:05am.
>
> Looking at the crm_report:
>
> I see "Configuration ERRORs found during PE processing. Please run
> "crm_verify -L" to identify issues." The offending bit is described a
> little earlier: "error: RecurringOp: Invalid recurring action
> DRBDSlave-start-interval-30s wth name: 'start'". There was a discussion
> on the mailing list recently about this -- a recurring start action is
> meaningless.
>
> That constraint will be ignored. If you want to set on-fail=standby for
> DRBD starts, use an interval of 0.
>
> I'd recommend running "crm_verify -L" to see if there are any other
> issues, and take care of them. Once you have a clean crm_verify, run
> "cibadmin --upgrade" to upgrade the XML of your configuration to the
> latest schema. This is just good housekeeping when keeping an older
> configuration after pacemaker upgrades.
>
> I see "e1000: eth2 NIC Link is Down" shortly before the issue.
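[Editor's sketch of the interval-of-0 fix Ken describes: in CIB XML, the
start operation gets interval="0" with on-fail set there, instead of a
recurring interval. The op ids and the linbit provider are assumptions for
illustration; the thread itself uses a custom agent.]

```xml
<!-- Sketch only: a recurring interval is invalid on a "start" action, so
     the start op uses interval="0"; on-fail=standby belongs on that op.
     A recurring interval remains valid on the monitor op. -->
<primitive id="DRBDSlave" class="ocf" provider="linbit" type="drbd">
  <operations>
    <op id="DRBDSlave-start-0" name="start" interval="0" on-fail="standby"/>
    <op id="DRBDSlave-monitor-30s" name="monitor" interval="30s"/>
  </operations>
</primitive>
```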
> If you're
> using ifdown/ifup to test failure, be aware that corosync can't recover
> from that particular scenario (known issue, nontrivial to fix). It's
> recommended to simulate a network failure by blocking corosync traffic
> via the local firewall (both inbound and outbound). Or of course you can
> unplug a network cable.
>
> Are you limited to the "classic openais (with plugin)" cluster stack?
> Corosync 2 is preferred these days, and even corosync 1 + CMAN gets more
> testing than the old plugin.
>
> If it still happens after looking into those items, I'd need logs from
> both nodes from the failure time to a couple minutes after the
> unstandby. The other node will be the DC at this point and will have the
> more interesting bits.
>
>> Polling interval for time-based changes to options, resource parameters
>> and constraints. Allowed values: Zero disables polling, positive values
>> are an interval in seconds (unless other SI units are specified, such as
>> 5min).
>> --
>> Sam Gardner
>> Trustwave | SMART SECURITY ON DEMAND
>>
>>
>>
>> On 3/30/16, 11:46 AM, "Ken Gaillot" <kgail...@redhat.com> wrote:
>>
>>> On 03/30/2016 11:20 AM, Sam Gardner wrote:
>>>> I have configured some network resources to automatically standby their
>>>> node if the system detects a failure on them. However, the DRBD slave
>>>> that I have configured does not automatically restart after the node is
>>>> "unstandby-ed" after the failure-timeout expires.
>>>> Is there any way to make the "stopped" DRBDSlave resource automatically
>>>> start again once the node is recovered?
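[Editor's sketch of the firewall approach Ken recommends for simulating a
network failure. The ports are an assumption: corosync's defaults are UDP
5405 (mcastport) and 5404 (mcastport - 1); adjust to the mcastport in your
corosync.conf. Requires root.]

```shell
# Block corosync traffic in both directions to simulate a network failure
# (instead of ifdown, which corosync cannot recover from):
iptables -A INPUT  -p udp --dport 5404:5405 -j DROP
iptables -A OUTPUT -p udp --dport 5404:5405 -j DROP

# Restore connectivity afterwards by deleting the same rules:
iptables -D INPUT  -p udp --dport 5404:5405 -j DROP
iptables -D OUTPUT -p udp --dport 5404:5405 -j DROP
```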
>>>>
>>>> See the progression of events below:
>>>>
>>>> Running cluster:
>>>> Wed Mar 30 16:04:20 UTC 2016
>>>> Cluster name:
>>>> Last updated: Wed Mar 30 16:04:20 2016
>>>> Last change: Wed Mar 30 16:03:24 2016
>>>> Stack: classic openais (with plugin)
>>>> Current DC: ha-d1.tw.com - partition with quorum
>>>> Version: 1.1.12-561c4cf
>>>> 2 Nodes configured, 2 expected votes
>>>> 7 Resources configured
>>>>
>>>>
>>>> Online: [ ha-d1.tw.com ha-d2.tw.com ]
>>>>
>>>> Full list of resources:
>>>>
>>>> Resource Group: network
>>>>     inif   (ocf::custom:ip.sh): Started ha-d1.tw.com
>>>>     outif  (ocf::custom:ip.sh): Started ha-d1.tw.com
>>>>     dmz1   (ocf::custom:ip.sh): Started ha-d1.tw.com
>>>> Master/Slave Set: DRBDMaster [DRBDSlave]
>>>>     Masters: [ ha-d1.tw.com ]
>>>>     Slaves: [ ha-d2.tw.com ]
>>>> Resource Group: filesystem
>>>>     DRBDFS (ocf::heartbeat:Filesystem): Started ha-d1.tw.com
>>>> Resource Group: application
>>>>     service_failover (ocf::custom:service_failover): Started ha-d1.tw.com
>>>>
>>>>
>>>> version: 8.4.5 (api:1/proto:86-101)
>>>> srcversion: 315FB2BBD4B521D13C20BF4
>>>>
>>>> 1: cs:Connected ro:Primary/Secondary ds:UpToDate/UpToDate C r-----
>>>>     ns:4 nr:0 dw:4 dr:757 al:1 bm:0 lo:0 pe:0 ua:0 ap:0 ep:1 wo:f oos:0
>>>> [153766.565352] block drbd1: send bitmap stats [Bytes(packets)]: plain
>>>> 0(0), RLE 21(1), total 21; compression: 100.0%
>>>> [153766.568303] block drbd1: receive bitmap stats [Bytes(packets)]:
>>>> plain 0(0), RLE 21(1), total 21; compression: 100.0%
>>>> [153766.568316] block drbd1: helper command: /sbin/drbdadm
>>>> before-resync-source minor-1
>>>> [153766.568356] block drbd1: helper command: /sbin/drbdadm
>>>> before-resync-source minor-1 exit code 255 (0xfffffffe)
>>>> [153766.568363] block drbd1: conn( WFBitMapS -> SyncSource ) pdsk(
>>>> Consistent -> Inconsistent )
>>>> [153766.568374] block drbd1: Began resync as SyncSource (will sync 4 KB
>>>> [1 bits set]).
>>>> [153766.568444] block drbd1: updated sync UUID
>>>> B0DA745C79C56591:36E0631B6F022952:36DF631B6F022952:133127197CF097C6
>>>> [153766.577695] block drbd1: Resync done (total 1 sec; paused 0 sec; 4
>>>> K/sec)
>>>> [153766.577700] block drbd1: updated UUIDs
>>>> B0DA745C79C56591:0000000000000000:36E0631B6F022952:36DF631B6F022952
>>>> [153766.577705] block drbd1: conn( SyncSource -> Connected ) pdsk(
>>>> Inconsistent -> UpToDate )
>>>>
>>>> Failure detected:
>>>> Wed Mar 30 16:08:22 UTC 2016
>>>> Cluster name:
>>>> Last updated: Wed Mar 30 16:08:22 2016
>>>> Last change: Wed Mar 30 16:03:24 2016
>>>> Stack: classic openais (with plugin)
>>>> Current DC: ha-d1.tw.com - partition with quorum
>>>> Version: 1.1.12-561c4cf
>>>> 2 Nodes configured, 2 expected votes
>>>> 7 Resources configured
>>>>
>>>>
>>>> Node ha-d1.tw.com: standby (on-fail)
>>>> Online: [ ha-d2.tw.com ]
>>>>
>>>> Full list of resources:
>>>>
>>>> Resource Group: network
>>>>     inif   (ocf::custom:ip.sh): Started ha-d1.tw.com
>>>>     outif  (ocf::custom:ip.sh): Started ha-d1.tw.com
>>>>     dmz1   (ocf::custom:ip.sh): FAILED ha-d1.tw.com
>>>> Master/Slave Set: DRBDMaster [DRBDSlave]
>>>>     Masters: [ ha-d1.tw.com ]
>>>>     Slaves: [ ha-d2.tw.com ]
>>>> Resource Group: filesystem
>>>>     DRBDFS (ocf::heartbeat:Filesystem): Started ha-d1.tw.com
>>>> Resource Group: application
>>>>     service_failover (ocf::custom:service_failover): Started ha-d1.tw.com
>>>>
>>>> Failed actions:
>>>>     dmz1_monitor_7000 on ha-d1.tw.com 'not running' (7):
>>>>     call=156, status=complete, last-rc-change='Wed Mar 30 16:08:19 2016',
>>>>     queued=0ms, exec=0ms
>>>>
>>>>
>>>>
>>>> version: 8.4.5 (api:1/proto:86-101)
>>>> srcversion: 315FB2BBD4B521D13C20BF4
>>>>
>>>> 1: cs:Connected ro:Primary/Secondary ds:UpToDate/UpToDate C r-----
>>>>     ns:4 nr:0 dw:4 dr:765 al:1 bm:0 lo:0 pe:0 ua:0 ap:0 ep:1 wo:f oos:0
>>>> [153766.568356] block drbd1: helper command: /sbin/drbdadm
>>>> before-resync-source minor-1 exit code 255 (0xfffffffe)
>>>> [153766.568363] block drbd1: conn( WFBitMapS -> SyncSource ) pdsk(
>>>> Consistent -> Inconsistent )
>>>> [153766.568374] block drbd1: Began resync as SyncSource (will sync 4 KB
>>>> [1 bits set]).
>>>> [153766.568444] block drbd1: updated sync UUID
>>>> B0DA745C79C56591:36E0631B6F022952:36DF631B6F022952:133127197CF097C6
>>>> [153766.577695] block drbd1: Resync done (total 1 sec; paused 0 sec; 4
>>>> K/sec)
>>>> [153766.577700] block drbd1: updated UUIDs
>>>> B0DA745C79C56591:0000000000000000:36E0631B6F022952:36DF631B6F022952
>>>> [153766.577705] block drbd1: conn( SyncSource -> Connected ) pdsk(
>>>> Inconsistent -> UpToDate )
>>>> [154057.455270] e1000: eth2 NIC Link is Down
>>>> [154057.455451] e1000 0000:02:02.0 eth2: Reset adapter
>>>>
>>>> Failover complete:
>>>> Wed Mar 30 16:09:02 UTC 2016
>>>> Cluster name:
>>>> Last updated: Wed Mar 30 16:09:02 2016
>>>> Last change: Wed Mar 30 16:03:24 2016
>>>> Stack: classic openais (with plugin)
>>>> Current DC: ha-d1.tw.com - partition with quorum
>>>> Version: 1.1.12-561c4cf
>>>> 2 Nodes configured, 2 expected votes
>>>> 7 Resources configured
>>>>
>>>>
>>>> Node ha-d1.tw.com: standby (on-fail)
>>>> Online: [ ha-d2.tw.com ]
>>>>
>>>> Full list of resources:
>>>>
>>>> Resource Group: network
>>>>     inif   (ocf::custom:ip.sh): Started ha-d2.tw.com
>>>>     outif  (ocf::custom:ip.sh): Started ha-d2.tw.com
>>>>     dmz1   (ocf::custom:ip.sh): Started ha-d2.tw.com
>>>> Master/Slave Set: DRBDMaster [DRBDSlave]
>>>>     Masters: [ ha-d2.tw.com ]
>>>>     Stopped: [ ha-d1.tw.com ]
>>>> Resource Group: filesystem
>>>>     DRBDFS (ocf::heartbeat:Filesystem): Started ha-d2.tw.com
>>>> Resource Group: application
>>>>     service_failover (ocf::custom:service_failover): Started ha-d2.tw.com
>>>>
>>>> Failed actions:
>>>>     dmz1_monitor_7000 on ha-d1.tw.com 'not running' (7):
>>>>     call=156, status=complete, last-rc-change='Wed Mar 30 16:08:19 2016',
>>>>     queued=0ms, exec=0ms
>>>>
>>>>
>>>>
>>>> version: 8.4.5 (api:1/proto:86-101)
>>>> srcversion: 315FB2BBD4B521D13C20BF4
>>>> [154094.894524] drbd wwwdata: conn( Disconnecting -> StandAlone )
>>>> [154094.894525] drbd wwwdata: receiver terminated
>>>> [154094.894527] drbd wwwdata: Terminating drbd_r_wwwdata
>>>> [154094.894559] block drbd1: disk( UpToDate -> Failed )
>>>> [154094.894569] block drbd1: bitmap WRITE of 0 pages took 0 jiffies
>>>> [154094.894571] block drbd1: 4 KB (1 bits) marked out-of-sync by on
>>>> disk bit-map.
>>>> [154094.894574] block drbd1: disk( Failed -> Diskless )
>>>> [154094.894647] block drbd1: drbd_bm_resize called with capacity == 0
>>>> [154094.894652] drbd wwwdata: Terminating drbd_w_wwwdata
>>>>
>>>> Standby node recovered, with DRBDSlave stopped (I want DRBDSlave
>>>> started here):
>>>> Wed Mar 30 16:13:01 UTC 2016
>>>> Cluster name:
>>>> Last updated: Wed Mar 30 16:13:01 2016
>>>> Last change: Wed Mar 30 16:03:24 2016
>>>> Stack: classic openais (with plugin)
>>>> Current DC: ha-d1.tw.com - partition with quorum
>>>> Version: 1.1.12-561c4cf
>>>> 2 Nodes configured, 2 expected votes
>>>> 7 Resources configured
>>>>
>>>>
>>>> Online: [ ha-d1.tw.com ha-d2.tw.com ]
>>>>
>>>> Full list of resources:
>>>>
>>>> Resource Group: network
>>>>     inif   (ocf::custom:ip.sh): Started ha-d2.tw.com
>>>>     outif  (ocf::custom:ip.sh): Started ha-d2.tw.com
>>>>     dmz1   (ocf::custom:ip.sh): Started ha-d2.tw.com
>>>> Master/Slave Set: DRBDMaster [DRBDSlave]
>>>>     Masters: [ ha-d2.tw.com ]
>>>>     Stopped: [ ha-d1.tw.com ]
>>>> Resource Group: filesystem
>>>>     DRBDFS (ocf::heartbeat:Filesystem): Started ha-d2.tw.com
>>>> Resource Group: application
>>>>     service_failover (ocf::custom:service_failover): Started ha-d2.tw.com
>>>>
>>>>
>>>> version: 8.4.5 (api:1/proto:86-101)
>>>> srcversion: 315FB2BBD4B521D13C20BF4
>>>> [154094.894574] block drbd1: disk( Failed -> Diskless )
>>>> [154094.894647] block drbd1: drbd_bm_resize called with capacity == 0
>>>> [154094.894652] drbd wwwdata: Terminating drbd_w_wwwdata
>>>>
>>>> --
>>>> Sam Gardner
>>>> Trustwave | SMART SECURITY ON DEMAND
>>>
>>> This might be a bug. A crm_report covering a few minutes around when the
>>> failure expires might help.
>>>
>>> Does the slave start after the next cluster-recheck-interval?

_______________________________________________
Users mailing list: Users@clusterlabs.org
http://clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org