On 04.01.2016 16:36, Ken Gaillot wrote:
> On 01/04/2016 09:25 AM, Bogdan Dobrelya wrote:
>> On 04.01.2016 15:50, Bogdan Dobrelya wrote:
>>> So far so bad.
>>> I made a dummy OCF script [0] to simulate an example
>>> promote/demote/notify failure mode for a multistate clone resource,
>>> very similar to the one I reported originally. The test to reproduce
>>> my case with the dummy is:
>>> - install the dummy OCF RA and create the dummy resource as the
>>>   README [0] says
>>> - watch a) the OCF logs from the dummy and b) the output of the
>>>   recurring commands:
>>>
>>> # while true; do date; ls /var/lib/heartbeat/trace_ra/dummy/ | tail -1; sleep 20; done &
>>> # crm_resource --resource p_dummy --list-operations
>>>
>>> At some point I noticed:
>>> - there are no more "OK" messages logged from the monitor actions,
>>> although according to the trace_ra dumps' timestamps, all monitors
>>> are still being invoked!
>>>
>>> - very strange results reported by:
>>> # crm_resource --resource p_dummy --list-operations
>>> p_dummy (ocf::dummy:dummy): FAILED : p_dummy_monitor_103000
>>> (node=node-1.test.domain.local, call=579, rc=1,
>>> last-rc-change=Mon Jan  4 14:33:07 2016, exec=62107ms): Timed Out
>>> or
>>> p_dummy (ocf::dummy:dummy): Started : p_dummy_monitor_103000
>>> (node=node-3.test.domain.local, call=-1, rc=1,
>>> last-rc-change=Mon Jan  4 14:43:58 2016, exec=0ms): Timed Out
>>>
>>> - according to the trace_ra dumps, recurring monitors are being
>>> invoked at intervals *much longer* than configured. For example,
>>> 7 minutes of "monitoring silence":
>>> Mon Jan  4 14:47:46 UTC 2016
>>> p_dummy.monitor.2016-01-04.14:40:52
>>> Mon Jan  4 14:48:06 UTC 2016
>>> p_dummy.monitor.2016-01-04.14:47:58
>>>
>>> Given all that, it is very likely there is some bug in the monitoring
>>> of multi-state clones in Pacemaker!
>>>
>>> [0] https://github.com/bogdando/dummy-ocf-ra
>>>
>>
>> Also note that lrmd spawns *many* monitors like:
>> root      6495  0.0  0.0  70268  1456 ?  Ss    2015  4:56  \_ /usr/lib/pacemaker/lrmd
>> root     31815  0.0  0.0   4440   780 ?  S    15:08  0:00  |   \_ /bin/sh /usr/lib/ocf/resource.d/dummy/dummy monitor
>> root     31908  0.0  0.0   4440   388 ?  S    15:08  0:00  |       \_ /bin/sh /usr/lib/ocf/resource.d/dummy/dummy monitor
>> root     31910  0.0  0.0   4440   384 ?  S    15:08  0:00  |       \_ /bin/sh /usr/lib/ocf/resource.d/dummy/dummy monitor
>> root     31915  0.0  0.0   4440   392 ?  S    15:08  0:00  |       \_ /bin/sh /usr/lib/ocf/resource.d/dummy/dummy monitor
>> ...
>
> At first glance, that looks like your monitor action is calling itself
> recursively, but I don't see how in your code.
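For reference, the dummy's monitor action boils down to roughly the
following (a simplified sketch, not the literal script from [0]; the
OCF_RESKEY_state marker-file check is my shorthand here). The point is
that the only ocf-shellfuncs call it makes is ocf_log():

    dummy_monitor() {
        # report running while the marker file exists, otherwise not
        # running; OCF_RESKEY_state stands in for the dummy's actual
        # state check
        if [ -f "${OCF_RESKEY_state}" ]; then
            ocf_log info "It's OK"
            return $OCF_SUCCESS
        fi
        return $OCF_NOT_RUNNING
    }
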
Yes, it should be a bug in ocf-shellfuncs's ocf_log(). If I replace it
in the dummy RA with:

    #. ${OCF_FUNCTIONS_DIR}/ocf-shellfuncs
    ocf_log() {
        logger $HA_LOGFACILITY -t $HA_LOGTAG "$@"
    }

there is no such issue anymore, and I see the "It's OK" log messages as
expected. Note, I used resource-agents 3.9.5+git+a626847-1 from [0].

[0] http://ftp.de.debian.org/debian/ experimental/main amd64 Packages

>> At some point, there were 1111 already. Then I unmanaged the p_dummy,
>> but it grew to 2403 after that. The number of running monitors may
>> grow or decrease as well.
>> Also, /var/lib/heartbeat/trace_ra/dummy/ is still being populated
>> with new p_dummy.monitor* files with recent timestamps. Why?..
>>
>> If I pkill -9 all dummy monitors, lrmd spawns another ~2000 almost
>> instantly :) Unless the node has become unresponsive at some point.
>> And after a restart by power off & on:
>> # crm_resource --resource p_dummy --list-operations
>> p_dummy (ocf::dummy:dummy): Started (unmanaged) : p_dummy_monitor_30000
>> (node=node-1.test.domain.local, call=679, rc=1,
>> last-rc-change=Mon Jan  4 15:04:25 2016, exec=66747ms): Timed Out
>> or
>> p_dummy (ocf::dummy:dummy): Stopped (unmanaged) : p_dummy_monitor_103000
>> (node=node-3.test.domain.local, call=142, rc=1,
>> last-rc-change=Mon Jan  4 15:14:59 2016, exec=65237ms): Timed Out
>>
>> And then lrmd repeats all of the fun again.

--
Best regards,
Bogdan Dobrelya,
Irc #bogdando
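P.S. For anyone reproducing this, a quick way to watch the monitor
process leak itself (a sketch; the pattern matches the RA path from the
ps output above):

    # count stray monitor processes every 10 seconds; with the stock
    # ocf_log() the count keeps growing, with the logger override it
    # stays at zero or one
    while true; do date; pgrep -cf 'resource.d/dummy/dummy monitor'; sleep 10; done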