>Also, I forgot about the undocumented/unsupported start-delay operation
>attribute, that you can put on the status operation to delay the first
>monitor. That may give you the behavior you want.

I have tried adding "start-delay=60s" to the monitor operation. The first monitor was indeed delayed by 60s, but during those 60s it blocked the other resources too! The result is the same as sleeping in the monitor. So I think the best method for me is to decide in the monitor function, based on a timestamp, whether it needs to return success. Thank you very much!
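In case it helps anyone else, here is a rough sketch of that timestamp approach inside an OCF agent. Everything here is illustrative: WARMUP_SECS, ATTR_NAME and the start_fm_mgt_daemon/fm_mgt_healthcheck helpers are made-up placeholders, not part of any shipped agent; only attrd_updater and the standard OCF return codes are real.

    # Illustrative OCF agent excerpt -- placeholder names, not a shipped agent.
    . ${OCF_FUNCTIONS_DIR:-${OCF_ROOT}/lib/heartbeat}/ocf-shellfuncs

    WARMUP_SECS=600                    # grace period after start (10 minutes)
    ATTR_NAME="fm_mgt-start-time"      # private node attribute holding the start timestamp

    fm_mgt_start() {
        start_fm_mgt_daemon            # placeholder: kick off the real service in the background
        # record when we started, as a private node attribute (-p keeps it out of the CIB)
        attrd_updater -p -n "$ATTR_NAME" -U "$(date +%s)"
        return $OCF_SUCCESS            # return immediately instead of waiting 10 minutes
    }

    fm_mgt_monitor() {
        if fm_mgt_healthcheck; then    # placeholder: the real health check
            # healthy: the grace period is no longer needed, drop the attribute
            attrd_updater -p -n "$ATTR_NAME" -D
            return $OCF_SUCCESS
        fi
        # healthcheck failed: are we still inside the warm-up window?
        start_ts=$(attrd_updater -p -n "$ATTR_NAME" -Q 2>/dev/null \
                   | sed -n 's/.*value="\([0-9]*\)".*/\1/p')
        if [ -n "$start_ts" ] && [ $(( $(date +%s) - start_ts )) -lt "$WARMUP_SECS" ]; then
            return $OCF_SUCCESS        # pretend healthy while the service is still warming up
        fi
        return $OCF_NOT_RUNNING
    }

As Ken notes in the quoted thread below, with this scheme "crm status" will show the resource as running during the warm-up window even if the service is not actually healthy yet.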
At 2017-11-06 21:53:53, "Ken Gaillot" <kgail...@redhat.com> wrote:

>On Sat, 2017-11-04 at 22:46 +0800, lkxjtu wrote:
>>
>> >Another possibility would be to have the start return immediately, and
>> >make the monitor artificially return success for the first 10 minutes
>> >after starting. It's hacky, and it depends on your situation whether
>> >the behavior is acceptable.
>>
>> I tried putting the sleep into the monitor function (I added a "sleep 60"
>> at the monitor entry for debugging) while the start function returns
>> immediately. I found something interesting: the first monitor after the
>> start blocks the other resources too, but from the second monitor on it
>> doesn't block them! Is this normal?
>
>Yes, the first result is for an unknown status, but after that, the
>cluster assumes the resource is OK unless/until the monitor says
>otherwise.
>
>However, I wasn't suggesting putting a sleep inside the monitor -- I
>was just thinking of having the monitor check the time, and if it's
>within 10 minutes of start, return success.
>
>> >My first thought on how to implement this
>> >would be to have the start action set a private node attribute
>> >(attrd_updater -p) with a timestamp. When the monitor runs, it could do
>> >its usual check, and if it succeeds, remove that node attribute, but if
>> >it fails, check the node attribute to see whether it's within the
>> >desired delay.
>>
>> This means that if it is within the desired delay, the monitor should
>> return success even if the healthcheck failed?
>> I think this can solve my problem, except for what "crm status" shows.
>
>Yes, that's what I had in mind. The status would show "running", which
>may or may not be what you want in this case.
>
>Also, I forgot about the undocumented/unsupported start-delay operation
>attribute, that you can put on the status operation to delay the first
>monitor. That may give you the behavior you want.
>
>> At 2017-11-01 21:20:50, "Ken Gaillot" <kgail...@redhat.com> wrote:
>> >On Sat, 2017-10-28 at 01:11 +0800, lkxjtu wrote:
>> >>
>> >> Thank you for your response! This means that there shouldn't be a
>> >> long "sleep" in the ocf script.
>> >> If my service takes 10 minutes from starting until the healthcheck
>> >> passes, then what should I do?
>> >
>> >That is a tough situation with no great answer.
>> >
>> >You can leave it as it is, and live with the delay. Note that it only
>> >happens if a resource fails after the slow resource has already begun
>> >starting ... if they fail at the same time (as with a node failure),
>> >the cluster will schedule recovery for both at the same time.
>> >
>> >Another possibility would be to have the start return immediately, and
>> >make the monitor artificially return success for the first 10 minutes
>> >after starting. It's hacky, and it depends on your situation whether
>> >the behavior is acceptable. My first thought on how to implement this
>> >would be to have the start action set a private node attribute
>> >(attrd_updater -p) with a timestamp. When the monitor runs, it could do
>> >its usual check, and if it succeeds, remove that node attribute, but if
>> >it fails, check the node attribute to see whether it's within the
>> >desired delay.
>> >
>> >> Thank you very much!
>> >>
>> >> > Hi,
>> >> > If I remember correctly, any pending actions from a previous transition
>> >> > must be completed before a new transition can be calculated.
>> >> > Otherwise, there's the possibility that the pending action could
>> >> > change the state in a way that makes the second transition's
>> >> > decisions harmful.
>> >> > Theoretically (and ideally), pacemaker could figure out whether some
>> >> > of the actions in the second transition would be needed regardless of
>> >> > whether the pending actions succeeded or failed, but in practice,
>> >> > that would be difficult to implement (and possibly take more time to
>> >> > calculate than is desirable in a recovery situation).
>> >>
>> >> > On Fri, 2017-10-27 at 23:54 +0800, lkxjtu wrote:
>> >> >
>> >> > I have two clone resources in my corosync/pacemaker cluster. They are
>> >> > fm_mgt and logserver. Both of their RAs are ocf. fm_mgt takes 1 minute
>> >> > to start the service (the ocf start function runs for 1 minute).
>> >> > Configured as below:
>> >> >
>> >> > # crm configure show
>> >> > node 168002177: 192.168.2.177
>> >> > node 168002178: 192.168.2.178
>> >> > node 168002179: 192.168.2.179
>> >> > primitive fm_mgt fm_mgt \
>> >> >         op monitor interval=20s timeout=120s \
>> >> >         op stop interval=0 timeout=120s on-fail=restart \
>> >> >         op start interval=0 timeout=120s on-fail=restart \
>> >> >         meta target-role=Started
>> >> > primitive logserver logserver \
>> >> >         op monitor interval=20s timeout=120s \
>> >> >         op stop interval=0 timeout=120s on-fail=restart \
>> >> >         op start interval=0 timeout=120s on-fail=restart \
>> >> >         meta target-role=Started
>> >> > clone fm_mgt_replica fm_mgt
>> >> > clone logserver_replica logserver
>> >> > property cib-bootstrap-options: \
>> >> >         have-watchdog=false \
>> >> >         dc-version=1.1.13-10.el7-44eb2dd \
>> >> >         cluster-infrastructure=corosync \
>> >> >         stonith-enabled=false \
>> >> >         start-failure-is-fatal=false
>> >> >
>> >> > When I kill the fm_mgt service on one node, pacemaker immediately
>> >> > recovers it after the monitor fails. This looks perfectly normal.
>> >> > But during this 1 minute while fm_mgt is starting, if I kill the
>> >> > logserver service on any node, the monitor catches the failure
>> >> > normally too, but pacemaker does not restart it immediately; it waits
>> >> > until fm_mgt has finished starting and only then begins restarting
>> >> > logserver. It seems that there is some dependency between pacemaker
>> >> > resources.
>> >> >
>> >> > # crm status
>> >> > Last updated: Thu Oct 26 06:40:24 2017
>> >> > Last change: Thu Oct 26 06:36:33 2017 by root via crm_resource on 192.168.2.177
>> >> > Stack: corosync
>> >> > Current DC: 192.168.2.179 (version 1.1.13-10.el7-44eb2dd) - partition with quorum
>> >> > 3 nodes and 6 resources configured
>> >> >
>> >> > Online: [ 192.168.2.177 192.168.2.178 192.168.2.179 ]
>> >> >
>> >> > Full list of resources:
>> >> >
>> >> >  Clone Set: logserver_replica [logserver]
>> >> >      logserver  (ocf::heartbeat:logserver):     FAILED 192.168.2.177
>> >> >      Started: [ 192.168.2.178 192.168.2.179 ]
>> >> >  Clone Set: fm_mgt_replica [fm_mgt]
>> >> >      Started: [ 192.168.2.178 192.168.2.179 ]
>> >> >      Stopped: [ 192.168.2.177 ]
>> >> >
>> >> > I am very confused. Is there something wrong with the configuration?
>> >> > Thank you very much!
>> >> >
>> >> > James
>> >> > best regards
>> >>
>> >--
>> >Ken Gaillot <kgail...@redhat.com>
>>
>--
>Ken Gaillot <kgail...@redhat.com>
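For the archives, the start-delay attempt mentioned at the top would look roughly like this in crmsh syntax, based on the fm_mgt primitive quoted above (an undocumented/unsupported attribute; 60s is just the value I tried):

    primitive fm_mgt fm_mgt \
            op monitor interval=20s timeout=120s start-delay=60s \
            op stop interval=0 timeout=120s on-fail=restart \
            op start interval=0 timeout=120s on-fail=restart \
            meta target-role=Started

As described above, the delayed first monitor is still a pending action, so it blocks recovery of the other resources just like a sleep in the monitor would, which is why I went with the timestamp check instead.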
_______________________________________________
Users mailing list: Users@clusterlabs.org
http://lists.clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org