** Description changed:

  [Impact]
  
-  * Cluster resource timeouts are not working and should be working.
- Timeouts are important in order for the actions (done by the resource)
- don't timeout before we're expecting (sometimes starting a resource can
- take more time than the default time because of configuration files, or
- cache to be loaded, etc).
+  * Cluster resource operation timeouts are not working correctly for
+ systemd resources and should be. Timeouts are important so that the
+ actions executed by pacemaker - for the systemd resource in question -
+ don't wait forever to start (or stop) a service, allowing the policy
+ engine to take the correct decisions (like trying to start the resource
+ somewhere else).
  
  [Test Case]
-  
-  * configure correctly a pacemaker cluster and add the following resources:
+ 
+  * correctly configure a pacemaker cluster and add the following
+ resources:
  
  # fencing
  
  primitive fence-focal01 stonith:fence_virsh \
-         params ipaddr=192.168.100.202 \
-         secure=true plug=focal01 login=fenceuser \
-         op monitor interval=30s
+         params ipaddr=192.168.100.202 \
+         secure=true plug=focal01 login=fenceuser \
+         op monitor interval=30s
  primitive fence-focal02 stonith:fence_virsh \
-         params ipaddr=192.168.100.202 \
-         secure=true plug=focal02 login=fenceuser \
-         op monitor interval=30s
+         params ipaddr=192.168.100.202 \
+         secure=true plug=focal02 login=fenceuser \
+         op monitor interval=30s
  primitive fence-focal03 stonith:fence_virsh \
-         params ipaddr=192.168.100.202 \
-         secure=true plug=focal03 login=fenceuser \
-         op monitor interval=30s
+         params ipaddr=192.168.100.202 \
+         secure=true plug=focal03 login=fenceuser \
+         op monitor interval=30s
  
  # resources
  
  primitive virtual_ip IPaddr2 \
-         params ip=10.250.92.90 nic=public01 \
-         op monitor interval=5s
+         params ip=10.250.92.90 nic=public01 \
+         op monitor interval=5s
  
  primitive webserver systemd:lighttpd \
-         op monitor interval=5s \
-         op start interval=0s timeout=2s \
-         op stop interval=0s timeout=2s \
-         meta migration-threshold=2
+         op monitor interval=5s \
+         op start interval=0s timeout=2s \
+         op stop interval=0s timeout=2s \
+         meta migration-threshold=2
  
  # resource group
  
  group webserver_vip webserver virtual_ip \
-         meta target-role=Stopped
+         meta target-role=Stopped
  
  # locations
  
  location fence-focal01-location fence-focal01 -inf: focal01
  location fence-focal02-location fence-focal02 -inf: focal02
  location fence-focal03-location fence-focal03 -inf: focal03
  
  # properties
  
  property cib-bootstrap-options: \
-         have-watchdog=false \
-         dc-version=2.0.3-4b1f869f0f \
-         cluster-infrastructure=corosync \
-         stonith-enabled=on \
-         stonith-action=reboot \
-         no-quorum-policy=stop \
-         cluster-name=focal
+         have-watchdog=false \
+         dc-version=2.0.3-4b1f869f0f \
+         cluster-infrastructure=corosync \
+         stonith-enabled=on \
+         stonith-action=reboot \
+         no-quorum-policy=stop \
+         cluster-name=focal
  
  * Trying to stop an already started resource group shows that the "op
  stop timeout=2s" set for the systemd resource is not accounted as 2
  seconds:
  
  Failed Resource Actions:
-   * webserver_stop_0 on focal03 'OCF_TIMEOUT' (198): call=29, status='Timed 
Out', exitreason='', last-rc-change='1970-01-01 00:01:57Z', queued=1828ms, 
exec=204557ms
+   * webserver_stop_0 on focal03 'OCF_TIMEOUT' (198): call=29, status='Timed 
Out', exitreason='', last-rc-change='1970-01-01 00:01:57Z', queued=1828ms, 
exec=204557ms
  
  * Watch the cluster collapse (fencing nodes, trying to start
  resources, fencing nodes again, over and over).
  
  Increasing timeout to 20s does not help:
  
  Failed Resource Actions:
-   * webserver_stop_0 on focal01 'OCF_TIMEOUT' (198): call=47, status='Timed 
Out', exitreason='', last-rc-change='1970-01-01 00:10:35Z', queued=20ms, 
exec=236013ms
-   * webserver_start_0 on focal03 'OCF_TIMEOUT' (198): call=22, status='Timed 
Out', exitreason='', last-rc-change='1970-01-01 00:05:09Z', queued=33ms, 
exec=241831ms
+   * webserver_stop_0 on focal01 'OCF_TIMEOUT' (198): call=47, status='Timed 
Out', exitreason='', last-rc-change='1970-01-01 00:10:35Z', queued=20ms, 
exec=236013ms
+   * webserver_start_0 on focal03 'OCF_TIMEOUT' (198): call=22, status='Timed 
Out', exitreason='', last-rc-change='1970-01-01 00:05:09Z', queued=33ms, 
exec=241831ms
  
  even though the systemd resource's startup takes much less than 20 seconds.
  
  [Regression Potential]
  
   * Debian was still using ftime() for pacemaker 2.0.3 and, because of
  deprecation warnings, wgrant changed it in pacemaker (2.0.3-3ubuntu2).

  This was "bad" because it made this issue appear (as we started using
  clock_gettime(CLOCK_MONOTONIC) instead of ftime()). But it was also
  good, because in order for pacemaker to support systemd resources a
  monotonic clock is required (and this change enabled it).
  
   * So, there is no easy path: it's either we disable clock_gettime()
  support by defining PCMK_TIME_EMERGENCY_CGT (like 2.0.3 does by
  default) - and stick with broken systemd resources + FTBFS - or we fix
  the clock_gettime() support (enabled by wgrant in 2.0.3) with this
  patchset.
  
  Now... to the potential issues:
  
   * This patchset was not included in 2.0.3 because it was missed as
  well (it is as if only "half" of the clock_gettime() fix was done
  before the release).
  
   * The number of patches is not small, but they are ALL related to the
  same thing: fixing timeouts not working and re-organizing timing for
  resources. They also mostly touch the same file: execd_commands.c (and
  configure.ac, to control macros).
  
   * Timeouts are confirmed broken for systemd resources (as the test
  case shows). We could, perhaps, break OCF resources and/or fencing as
  well.
  
   * This change has been recommended by the upstream maintainer (via
  the 2 merge requests he pointed out in the upstream bug:
  https://bugs.clusterlabs.org/show_bug.cgi?id=5429).
  
  [Other Info]
  
   * Original Description (from the reporter):
  
  While working on pacemaker, I discovered an issue with timeouts:
  
  haproxy_stop_0 on primary 'OCF_TIMEOUT' (198): call=583, status='Timed
  Out', exitreason='', last-rc-change='1970-01-04 17:21:18 -05:00',
  queued=44ms,      exec=176272ms
  
  this led me down the path of finding that setting a timeout unit value
  was not doing anything:
  
  primitive haproxy systemd:haproxy \
          op monitor interval=2s \
          op start interval=0s timeout=500s \
          op stop interval=0s timeout=500s \
          meta migration-threshold=2
  
  primitive haproxy systemd:haproxy \
          op monitor interval=2s \
          op start interval=0s timeout=500 \
          op stop interval=0s timeout=500 \
          meta migration-threshold=2
  
  the two configs above result in the same behavior; pacemaker/crm seems
  to be ignoring the "s"
  
  I filed a bug with pacemaker itself:
  https://bugs.clusterlabs.org/show_bug.cgi?id=5429
  
  but this led to the following response, copied from the ticket:
  
  <<Looking back on your irc chat, I see you have a version of Pacemaker
  with a known bug:
  
  <<haproxy_stop_0 on primary 'OCF_TIMEOUT' (198): call=583, status='Timed
  Out', exitreason='', last-rc-<<change='1970-01-04 17:21:18 -05:00',
  queued=44ms,      exec=176272ms
  
  <<The incorrect date is a result of bugs that occur in systemd resources
  when Pacemaker 2.0.3 is built <<with the -UPCMK_TIME_EMERGENCY_CGT C
  flag (which is not the default). I was only aware of that being the
  <<case in one Fedora release. If those are stock Ubuntu packages, please
  file an Ubuntu bug to make sure <<they are aware of it.
  
  <<The underlying bugs are fixed as of the Pacemaker 2.0.4 release. If
  anyone wants to backport specific <<commits instead, the github pull
  requests #1992 and #1997 should take care of it.
  
  It appears that the root cause of my issue with setting timeout values
  with units ("600s") is a bug in the build process of Ubuntu pacemaker:
  
  1) lsb_release -d Description:    Ubuntu 20.04 LTS
  2) ii  pacemaker                            2.0.3-3ubuntu3                    
amd64        cluster resource manager
  3) setting "100s" in the timeout of a resource should result in a 100
  second timeout, not a 100 millisecond timeout
  4) the setting's unit value "s" is being ignored, forcing me to set the
  timeout to 10000 to get a 10 second timeout

-- 
You received this bug notification because you are a member of Ubuntu
Bugs, which is subscribed to Ubuntu.
https://bugs.launchpad.net/bugs/1881762

Title:
  resource timeout not respecting units

To manage notifications about this bug go to:
https://bugs.launchpad.net/ubuntu/+source/pacemaker/+bug/1881762/+subscriptions
