** Description changed:

  [Impact]
  The Sharded OpWQ will opportunistically wait for more work when processing an empty queue. While waiting, the heartbeat timeout and suicide_grace values are modified. The `threadpool_default_timeout` grace is left applied and suicide_grace is disabled.

  After finding work, the original work queue grace/suicide_grace values are not re-applied. This can result in hung operations that do not trigger an OSD suicide recovery.

  The missing suicide recovery was observed on Luminous 12.2.11. The environment was consistently hitting a known authentication race condition (issue#37778 [0]) due to repeated OSD service restarts on a node exhibiting MCEs from a faulty DIMM.

  The auth race condition would stall pg operations. In some cases, the hung ops would persist for hours without suicide recovery.

  [Test Case]
- - In-Progress -
- Haven't landed on a reliable reproducer. Currently testing the fix by exercising I/O. Since the fix applies to all version of Ceph, the plan is to let this bake in the latest release before considering a back-port.
+ I have not identified a reliable reproducer. Currently testing the fix by exercising I/O.
+
+ Recommend letting this bake upstream before considering a back-port.

  [Regression Potential]
  This fix improves suicide_grace coverage of the Sharded OpWQ.

  This change is made in a critical code path that drives client I/O. An OSD suicide will trigger a service restart and repeated restarts (flapping) will adversely impact cluster performance.

  The fix mitigates risk by keeping the applied suicide_grace value consistent with the value applied before entering `OSD::ShardedOpWQ::_process()`. The fix is also restricted to the empty queue edge-case that drops the suicide_grace timeout. The suicide_grace value is only re-applied when work is found after waiting on an empty queue.

- In-Progress -
- The fix needs to bake upstream on later levels before back-port consideration.
+ Opened upstream tracker for issue#45076 [1] and fix pr#34575 [2]
+
+ [0] https://tracker.ceph.com/issues/37778
+ [1] https://tracker.ceph.com/issues/45076
+ [2] https://github.com/ceph/ceph/pull/34575
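
The empty-queue behavior described under [Impact] can be modeled with a small Python sketch. This is not the Ceph code: `Handle`, `reset_timeout`, and `process` are hypothetical names, the `reapply` flag stands in for the presence of the fix, and the numbers in the usage note are illustrative only.

```python
from dataclasses import dataclass


@dataclass
class Handle:
    """Hypothetical stand-in for a worker thread's heartbeat state."""
    grace: float          # seconds before the thread is reported unhealthy
    suicide_grace: float  # seconds before the daemon aborts; 0 disables it


def reset_timeout(handle, grace, suicide_grace):
    """Apply a new grace/suicide_grace pair to the handle."""
    handle.grace = grace
    handle.suicide_grace = suicide_grace


def process(handle, queue, wq_grace, wq_suicide_grace, default_grace, reapply):
    """Model one pass of a worker over its shard queue.

    The handle is assumed to enter with the work queue values applied.
    Returns the suicide_grace in effect while the dequeued op runs.
    """
    if not queue:
        # Empty queue: relax the grace and disable suicide_grace while waiting.
        reset_timeout(handle, default_grace, 0)
        queue.append("op")  # simulate work arriving during the wait
        if reapply:
            # Modeled fix: restore the work queue values once work is found.
            reset_timeout(handle, wq_grace, wq_suicide_grace)
    queue.pop(0)  # the op would be executed here
    return handle.suicide_grace
```

With `reapply=False`, `process(Handle(15, 150), [], 15, 150, 60, reapply=False)` returns 0, so an op that hangs after the wait is never covered by a suicide timeout. With `reapply=True` the same call returns 150. A non-empty queue never enters the wait branch and is unaffected either way.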
-- 
https://bugs.launchpad.net/bugs/1840348

Title:
  Sharded OpWQ drops suicide_grace after waiting for work
