** Description changed:

  [Impact]
- The Sharded OpWQ will opportunistically wait for more work when processing an 
empty queue. While waiting, the heartbeat timeout and suicide_grace values are 
modified. On Luminous, the `threadpool_default_timeout` grace is left applied 
and suicide_grace is left disabled. On later releases both the grace and 
suicide_grace are left disabled.
+ The Sharded OpWQ will opportunistically wait for more work when processing an 
empty queue. While waiting, the heartbeat timeout and suicide_grace values are 
modified. The `threadpool_default_timeout` grace is left applied and 
suicide_grace is disabled.
  
  After finding work, the original work queue grace/suicide_grace values
  are not re-applied. This can result in hung operations that do not
  trigger an OSD suicide recovery.
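  
  The sequence above can be modelled with a small, self-contained C++
  sketch. This is not Ceph code: the `Heartbeat` struct, the helper
  functions and the timeout values are all hypothetical and exist only
  to illustrate how an op picked up after an empty-queue wait runs
  with suicide_grace disabled.

```cpp
#include <cassert>

// Hypothetical model of a worker thread's heartbeat state (not Ceph's real type).
struct Heartbeat {
    int grace;          // seconds before the thread is reported unhealthy
    int suicide_grace;  // seconds before the OSD aborts; 0 means disabled
};

// Illustrative values only; a real cluster takes these from its configuration.
constexpr int kWqGrace = 15;
constexpr int kWqSuicideGrace = 150;
constexpr int kDefaultGrace = 60;  // stands in for threadpool_default_timeout

// Models one pass through the flow described above and returns the
// heartbeat state that is in effect while the dequeued op runs.
Heartbeat process_buggy(bool queue_was_empty) {
    Heartbeat hb{kWqGrace, kWqSuicideGrace};  // applied before entering _process()
    if (queue_was_empty) {
        hb = Heartbeat{kDefaultGrace, 0};  // relaxed while waiting for work
        // ... the worker sleeps here until work arrives ...
        // Bug being modelled: the work queue values are not re-applied.
    }
    return hb;
}

// True if an op stalled for `stalled_seconds` would trigger a suicide.
bool would_suicide(const Heartbeat& hb, int stalled_seconds) {
    return hb.suicide_grace > 0 && stalled_seconds >= hb.suicide_grace;
}

// Usage: an op hung for an hour is only caught on the non-empty path.
void demo_buggy() {
    assert(would_suicide(process_buggy(false), 3600));
    assert(!would_suicide(process_buggy(true), 3600));
}
```

  In this model an op that hangs after the wait is never caught,
  however long it stalls, which matches the hours-long hung ops
  described in this report.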
  
  The missing suicide recovery was observed on Luminous 12.2.11. The
  environment was consistently hitting a known authentication race
  condition (issue#37778 [0]) due to repeated OSD service restarts on a
  node exhibiting MCEs from a faulty DIMM.
  
  The auth race condition would stall PG operations. In some cases, the
  hung ops persisted for hours without triggering suicide recovery.
  
  [Test Case]
  - In-Progress -
  No reliable reproducer has been found yet. The fix is currently being
  tested by exercising I/O. Since the fix applies to all versions of
  Ceph, the plan is to let it bake in the latest release before
  considering a back-port.
  
  [Regression Potential]
  This fix improves suicide_grace coverage of the Sharded OpWQ.
  
  This change is made in a critical code path that drives client I/O. An
  OSD suicide will trigger a service restart and repeated restarts
  (flapping) will adversely impact cluster performance.
  
  The fix mitigates risk by keeping the applied suicide_grace value
  consistent with the value applied before entering
  `OSD::ShardedOpWQ::_process()`. The fix is also restricted to the empty
  queue edge-case that drops the suicide_grace timeout. The suicide_grace
  value is only re-applied when work is found after waiting on an empty
  queue.
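  
  A minimal sketch of that restriction, under stated assumptions: the
  `Heartbeat` struct and `process_fixed()` below are hypothetical
  names, not the actual patch. The sketch only shows that the values
  seen on entry are restored, and only on the path where work arrives
  after an empty-queue wait.

```cpp
#include <cassert>

// Hypothetical model of a worker thread's heartbeat state (not Ceph's real type).
struct Heartbeat {
    int grace;          // seconds before the thread is reported unhealthy
    int suicide_grace;  // seconds before the OSD aborts; 0 means disabled
};

// Returns the heartbeat state in effect when this pass ends.
// `on_entry` is the state applied before entering _process();
// `default_grace` stands in for threadpool_default_timeout.
Heartbeat process_fixed(const Heartbeat& on_entry, bool queue_was_empty,
                        bool work_arrived, int default_grace) {
    Heartbeat hb = on_entry;
    if (!queue_was_empty) {
        return hb;  // common path: the heartbeat is never touched
    }
    hb = Heartbeat{default_grace, 0};  // relaxed while waiting for work
    // ... the worker sleeps here until woken ...
    if (!work_arrived) {
        return hb;  // still idle: nothing to protect yet
    }
    hb = on_entry;  // work found after waiting: restore the entry values
    return hb;
}

// Usage with illustrative values (15 s grace, 150 s suicide_grace).
void demo_fixed() {
    Heartbeat hb = process_fixed(Heartbeat{15, 150}, true, true, 60);
    assert(hb.grace == 15 && hb.suicide_grace == 150);
}
```

  Because the restored values are copied from the entry state rather
  than computed afresh, the op runs under the same timeouts it would
  have had if the queue had never been empty.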
  
  - In-Progress -
  The fix needs to bake upstream in later releases before a back-port is
  considered.

-- 
You received this bug notification because you are a member of Ubuntu
Bugs, which is subscribed to Ubuntu.
https://bugs.launchpad.net/bugs/1840348

Title:
  Sharded OpWQ drops suicide_grace after waiting for work

To manage notifications about this bug go to:
https://bugs.launchpad.net/ubuntu/+source/ceph/+bug/1840348/+subscriptions
