** Description changed:

[Impact]

 * Cluster resource operation timeouts are not working correctly for
   systemd resources and should be. Timeouts are important so that the
   actions executed by pacemaker for the systemd resource in question do
   not wait forever to start (or stop) a service, allowing the policy
   engine to take the correct decisions (like trying to start the
   resource somewhere else).

[Test Case]

 * Configure a pacemaker cluster correctly and add the following
   resources:

# fencing
primitive fence-focal01 stonith:fence_virsh \
    params ipaddr=192.168.100.202 \
    secure=true plug=focal01 login=fenceuser \
    op monitor interval=30s
primitive fence-focal02 stonith:fence_virsh \
    params ipaddr=192.168.100.202 \
    secure=true plug=focal02 login=fenceuser \
    op monitor interval=30s
primitive fence-focal03 stonith:fence_virsh \
    params ipaddr=192.168.100.202 \
    secure=true plug=focal03 login=fenceuser \
    op monitor interval=30s

# resources
primitive virtual_ip IPaddr2 \
    params ip=10.250.92.90 nic=public01 \
    op monitor interval=5s
primitive webserver systemd:lighttpd \
    op monitor interval=5s \
    op start interval=0s timeout=2s \
    op stop interval=0s timeout=2s \
    meta migration-threshold=2

# resource group
group webserver_vip webserver virtual_ip \
    meta target-role=Stopped

# locations
location fence-focal01-location fence-focal01 -inf: focal01
location fence-focal02-location fence-focal02 -inf: focal02
location fence-focal03-location fence-focal03 -inf: focal03

# properties
property cib-bootstrap-options: \
    have-watchdog=false \
    dc-version=2.0.3-4b1f869f0f \
    cluster-infrastructure=corosync \
    stonith-enabled=on \
    stonith-action=reboot \
    no-quorum-policy=stop \
    cluster-name=focal

 * Try to stop the already started resource group: the "op stop
   timeout=2s" set for the systemd resource will not be enforced as 2
   seconds:

Failed Resource Actions:
  * webserver_stop_0 on focal03 'OCF_TIMEOUT' (198): call=29, status='Timed Out', exitreason='', last-rc-change='1970-01-01 00:01:57Z', queued=1828ms, exec=204557ms

 * Watch the cluster collapse
   (fencing nodes, trying to start resources, fencing nodes again, over
   and over). Increasing the timeout to 20s does not help:

Failed Resource Actions:
  * webserver_stop_0 on focal01 'OCF_TIMEOUT' (198): call=47, status='Timed Out', exitreason='', last-rc-change='1970-01-01 00:10:35Z', queued=20ms, exec=236013ms
  * webserver_start_0 on focal03 'OCF_TIMEOUT' (198): call=22, status='Timed Out', exitreason='', last-rc-change='1970-01-01 00:05:09Z', queued=33ms, exec=241831ms

and the systemd resource startup takes much less than 20 seconds.

[Regression Potential]

 * Debian was still using ftime() for pacemaker 2.0.3 and, because of
   deprecation warnings, wgrant changed that in pacemaker
   (2.0.3-3ubuntu2). This was "bad" because it made this issue appear
   (we started using clock_gettime(CLOCK_MONOTONIC) instead of
   ftime()), but it was also good, because a monotonic clock is
   required for pacemaker to support systemd resources (and this change
   enabled it).

 * So there is no easy path: either we disable clock_gettime() support
   by defining PCMK_TIME_EMERGENCY_CGT (like 2.0.3 does by default) and
   stick with broken systemd resources plus an FTBFS, or we fix the
   clock_gettime() support (with this patchset) that wgrant enabled in
   2.0.3.

   Now, on to the potential issues:

 * This patchset was not included in 2.0.3 because it was missed there
   as well (it is as if only half of the clock_gettime() fix was done
   before that release).

 * The number of patches is not small, but they are ALL related to the
   same thing: fixing the broken timeouts and reorganizing timing for
   resources. They also mostly touch the same file: execd_commands.c
   (plus configure.ac, to control the macros).
 * Timeouts are confirmed broken for systemd resources (as the test
   case shows). We could, perhaps, break OCF resources and/or fencing
   as well.

 * This change has been recommended by the upstream maintainer (via the
   two merges he pointed out in the upstream bug:
   https://bugs.clusterlabs.org/show_bug.cgi?id=5429).

[Other Info]

 * Original Description (from the reporter):

While working on pacemaker, I discovered an issue with timeouts:

haproxy_stop_0 on primary 'OCF_TIMEOUT' (198): call=583, status='Timed Out', exitreason='', last-rc-change='1970-01-04 17:21:18 -05:00', queued=44ms, exec=176272ms

This led me down the path of finding that setting a timeout unit value
was not doing anything:

primitive haproxy systemd:haproxy \
    op monitor interval=2s \
    op start interval=0s timeout=500s \
    op stop interval=0s timeout=500s \
    meta migration-threshold=2

primitive haproxy systemd:haproxy \
    op monitor interval=2s \
    op start interval=0s timeout=500 \
    op stop interval=0s timeout=500 \
    meta migration-threshold=2

The two configs above result in the same behavior; pacemaker/crm seems
to be ignoring the "s". I filed a bug with pacemaker itself
(https://bugs.clusterlabs.org/show_bug.cgi?id=5429), which led to the
following response, copied from the ticket:

<< Looking back on your irc chat, I see you have a version of Pacemaker
<< with a known bug:
<<
<< haproxy_stop_0 on primary 'OCF_TIMEOUT' (198): call=583, status='Timed Out', exitreason='', last-rc-change='1970-01-04 17:21:18 -05:00', queued=44ms, exec=176272ms
<<
<< The incorrect date is a result of bugs that occur in systemd
<< resources when Pacemaker 2.0.3 is built with the
<< -UPCMK_TIME_EMERGENCY_CGT C flag (which is not the default). I was
<< only aware of that being the case in one Fedora release. If those
<< are stock Ubuntu packages, please file an Ubuntu bug to make sure
<< they are aware of it.
<<
<< The underlying bugs are fixed as of the Pacemaker 2.0.4 release.
<< If anyone wants to backport specific commits instead, the github
<< pull requests #1992 and #1997 should take care of it.

It appears that the root cause of my issue with setting timeout values
with units ("600s") is a bug in the build process of the Ubuntu
pacemaker package.

1) lsb_release -d
   Description: Ubuntu 20.04 LTS

2) ii pacemaker 2.0.3-3ubuntu3 amd64 cluster resource manager

3) Setting "100s" in the timeout of a resource should result in a
   100-second timeout, not a 100-millisecond timeout.

4) The unit "s" in the setting is being ignored, forcing me to set the
   timeout to 10000 to get a 10-second timeout.
--
You received this bug notification because you are a member of Ubuntu
Bugs, which is subscribed to Ubuntu.
https://bugs.launchpad.net/bugs/1881762

Title:
  resource timeout not respecting units

To manage notifications about this bug go to:
https://bugs.launchpad.net/ubuntu/+source/pacemaker/+bug/1881762/+subscriptions

--
ubuntu-bugs mailing list
[email protected]
https://lists.ubuntu.com/mailman/listinfo/ubuntu-bugs
