Hi!

Not saying that the use of start-delay on the monitor operations is a good thing. In most cases it would definitely be better to delay the return of start until a monitor would succeed. I've seen discussion about deprecating start-delay - I don't know the current state, though. But this case - if I got the use-case right - with a 10-minute delay might be a legitimate use of start-delay - if any exists at all ;-)
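If start-delay were used here, it would be set on the monitor operation. A hypothetical crm snippet (untested; the 600s value simply matches the 10-minute warm-up discussed in this thread, and the primitive mirrors the one posted below):

```shell
# Delay the first monitor by 10 minutes after start, so the service is
# not probed while it is still warming up. Only illustrative - as noted
# above, start-delay is generally discouraged:
primitive fm_mgt fm_mgt \
    op monitor interval=20s timeout=120s start-delay=600s \
    op stop interval=0 timeout=120s on-fail=restart \
    op start interval=0 timeout=120s on-fail=restart \
    meta target-role=Started
```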
Regards,
Klaus

On 11/04/2017 03:46 PM, lkxjtu wrote:
> >Another possibility would be to have the start return immediately, and
> >make the monitor artificially return success for the first 10 minutes
> >after starting. It's hacky, and it depends on your situation whether
> >the behavior is acceptable.
>
> I tried to put the sleep into the monitor function (I added a "sleep 60"
> at the monitor entry for debugging), and the start function returns
> immediately. I found an interesting thing: the first monitor after the
> start blocks the other resources too, but from the second monitor on it
> doesn't block the other resources! Is this normal?
>
> >My first thought on how to implement this
> >would be to have the start action set a private node attribute
> >(attrd_updater -p) with a timestamp. When the monitor runs, it could do
> >its usual check, and if it succeeds, remove that node attribute, but if
> >it fails, check the node attribute to see whether it's within the
> >desired delay.
>
> Does this mean that if it is within the desired delay, the monitor
> should return success even if the healthcheck failed? I think this can
> solve my problem, except for what "crm status" shows.
>
> At 2017-11-01 21:20:50, "Ken Gaillot" <kgail...@redhat.com> wrote:
> >On Sat, 2017-10-28 at 01:11 +0800, lkxjtu wrote:
> >>
> >> Thank you for your response! This means that there shouldn't be a
> >> long "sleep" in an ocf script.
> >> If my service takes 10 minutes from service start to a passing
> >> healthcheck, then what should I do?
> >
> >That is a tough situation with no great answer.
> >
> >You can leave it as it is, and live with the delay. Note that it only
> >happens if a resource fails after the slow resource has already begun
> >starting ... if they fail at the same time (as with a node failure),
> >the cluster will schedule recovery for both at the same time.
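For what it's worth, the timestamp approach quoted above might look roughly like this inside an OCF agent. This is an untested sketch: GRACE_SECS, the attribute name, and real_healthcheck are invented for the example, and the parsing of attrd_updater's query output may need adjusting for your Pacemaker version:

```shell
# Sketch: have the monitor pretend success for a grace period after start,
# using a private node attribute (attrd_updater -p) as the start stamp.
GRACE_SECS=600                    # 10-minute warm-up window (assumption)
STAMP_ATTR="fm_mgt_start_stamp"   # invented attribute name

record_start_time() {
    # called at the end of the start action
    attrd_updater -p -n "$STAMP_ATTR" -U "$(date +%s)"
}

get_stamp() {
    # query the attribute; prints the stored epoch seconds, if any
    attrd_updater -p -n "$STAMP_ATTR" -Q 2>/dev/null \
        | sed -n 's/.*value="\([0-9][0-9]*\)".*/\1/p'
}

within_grace_period() {
    local stamp
    stamp=$(get_stamp)
    [ -n "$stamp" ] || return 1
    [ $(( $(date +%s) - stamp )) -lt "$GRACE_SECS" ]
}

agent_monitor() {
    if real_healthcheck; then
        # healthy: drop the stamp so later failures are reported normally
        attrd_updater -p -n "$STAMP_ATTR" -D
        return 0                  # OCF_SUCCESS
    elif within_grace_period; then
        return 0                  # still warming up: report success
    else
        return 7                  # OCF_NOT_RUNNING
    fi
}
```

As Ken notes, "crm status" will show the resource as running during the grace period even if the healthcheck is failing - whether that is acceptable depends on the situation.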
> >
> >Another possibility would be to have the start return immediately, and
> >make the monitor artificially return success for the first 10 minutes
> >after starting. It's hacky, and it depends on your situation whether
> >the behavior is acceptable. My first thought on how to implement this
> >would be to have the start action set a private node attribute
> >(attrd_updater -p) with a timestamp. When the monitor runs, it could do
> >its usual check, and if it succeeds, remove that node attribute, but if
> >it fails, check the node attribute to see whether it's within the
> >desired delay.
> >
> >> Thank you very much!
> >>
> >> > Hi,
> >> > If I remember correctly, any pending actions from a previous
> >> > transition must be completed before a new transition can be
> >> > calculated. Otherwise, there's the possibility that the pending
> >> > action could change the state in a way that makes the second
> >> > transition's decisions harmful.
> >> > Theoretically (and ideally), pacemaker could figure out whether
> >> > some of the actions in the second transition would be needed
> >> > regardless of whether the pending actions succeeded or failed, but
> >> > in practice, that would be difficult to implement (and possibly
> >> > take more time to calculate than is desirable in a recovery
> >> > situation).
> >> >
> >> > On Fri, 2017-10-27 at 23:54 +0800, lkxjtu wrote:
> >> > > I have two clone resources in my corosync/pacemaker cluster.
> >> > > They are fm_mgt and logserver. Both of their RAs are ocf. fm_mgt
> >> > > takes 1 minute to start the service (the ocf start function runs
> >> > > for 1 minute).
> >> > > Configured as below:
> >> > > # crm configure show
> >> > > node 168002177: 192.168.2.177
> >> > > node 168002178: 192.168.2.178
> >> > > node 168002179: 192.168.2.179
> >> > > primitive fm_mgt fm_mgt \
> >> > >     op monitor interval=20s timeout=120s \
> >> > >     op stop interval=0 timeout=120s on-fail=restart \
> >> > >     op start interval=0 timeout=120s on-fail=restart \
> >> > >     meta target-role=Started
> >> > > primitive logserver logserver \
> >> > >     op monitor interval=20s timeout=120s \
> >> > >     op stop interval=0 timeout=120s on-fail=restart \
> >> > >     op start interval=0 timeout=120s on-fail=restart \
> >> > >     meta target-role=Started
> >> > > clone fm_mgt_replica fm_mgt
> >> > > clone logserver_replica logserver
> >> > > property cib-bootstrap-options: \
> >> > >     have-watchdog=false \
> >> > >     dc-version=1.1.13-10.el7-44eb2dd \
> >> > >     cluster-infrastructure=corosync \
> >> > >     stonith-enabled=false \
> >> > >     start-failure-is-fatal=false
> >> > > When I kill the fm_mgt service on one node, pacemaker
> >> > > immediately recovers it after the monitor fails. This looks
> >> > > perfectly normal. But during this 1 minute while fm_mgt is
> >> > > starting, if I kill the logserver service on any node, the
> >> > > monitor catches the failure normally too, but pacemaker does not
> >> > > restart it immediately; it waits for fm_mgt's start to finish.
> >> > > After fm_mgt finishes starting, pacemaker begins restarting
> >> > > logserver. It seems that there is some dependency between
> >> > > pacemaker resources.
> >> > > # crm status
> >> > > Last updated: Thu Oct 26 06:40:24 2017
> >> > > Last change: Thu Oct 26 06:36:33 2017 by root via crm_resource
> >> > >     on 192.168.2.177
> >> > > Stack: corosync
> >> > > Current DC: 192.168.2.179 (version 1.1.13-10.el7-44eb2dd) -
> >> > >     partition with quorum
> >> > > 3 nodes and 6 resources configured
> >> > > Online: [ 192.168.2.177 192.168.2.178 192.168.2.179 ]
> >> > > Full list of resources:
> >> > > Clone Set: logserver_replica [logserver]
> >> > >     logserver (ocf::heartbeat:logserver): FAILED 192.168.2.177
> >> > >     Started: [ 192.168.2.178 192.168.2.179 ]
> >> > > Clone Set: fm_mgt_replica [fm_mgt]
> >> > >     Started: [ 192.168.2.178 192.168.2.179 ]
> >> > >     Stopped: [ 192.168.2.177 ]
> >> > > I am very confused. Is there something wrong with the
> >> > > configuration? Thank you very much!
> >> > > James
> >> > > best regards
> >--
> >Ken Gaillot <kgail...@redhat.com>

_______________________________________________
Users mailing list: Users@clusterlabs.org
http://lists.clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org