[Bug 1840348] Re: Sharded OpWQ drops suicide_grace after waiting for work

2022-01-07 Thread Kellen Renshaw
@dbungert Apologies, I got sidetracked and this version has been
superseded. Can we rebase this debdiff on the 12.2.13-0ubuntu0.18.04.10
version?

-- 
You received this bug notification because you are a member of Ubuntu
Bugs, which is subscribed to Ubuntu.
https://bugs.launchpad.net/bugs/1840348

Title:
  Sharded OpWQ drops suicide_grace after waiting for work

To manage notifications about this bug go to:
https://bugs.launchpad.net/cloud-archive/+bug/1840348/+subscriptions


-- 
ubuntu-bugs mailing list
ubuntu-bugs@lists.ubuntu.com
https://lists.ubuntu.com/mailman/listinfo/ubuntu-bugs

[Bug 1840348] Re: Sharded OpWQ drops suicide_grace after waiting for work

2021-10-26 Thread Brian Murray
Hello Kellen, or anyone else affected,

Accepted ceph into bionic-proposed. The package will build now and be
available at
https://launchpad.net/ubuntu/+source/ceph/12.2.13-0ubuntu0.18.04.9 in a
few hours, and then in the -proposed repository.

Please help us by testing this new package.  See
https://wiki.ubuntu.com/Testing/EnableProposed for documentation on how
to enable and use -proposed.  Your feedback will aid us in getting this
update out to other Ubuntu users.

If this package fixes the bug for you, please add a comment to this bug,
mentioning the version of the package you tested and what testing has
been performed on the package, and change the tag from
verification-needed-bionic to verification-done-bionic. If it does not
fix the bug for you, please add a comment stating that, and change the
tag to verification-failed-bionic. In either case, without details of
your testing we will not be able to proceed.

Further information regarding the verification process can be found at
https://wiki.ubuntu.com/QATeam/PerformingSRUVerification .  Thank you in
advance for helping!

N.B. The updated package will be released to -updates after the bug(s)
fixed by this package have been verified and the package has been in
-proposed for a minimum of 7 days.

** Changed in: ceph (Ubuntu Bionic)
   Status: In Progress => Fix Committed

** Tags added: verification-needed verification-needed-bionic


[Bug 1840348] Re: Sharded OpWQ drops suicide_grace after waiting for work

2021-10-18 Thread Dan Streetman
Unsubscribing the Ubuntu Sponsors team, as sponsorship for this is being
requested from the OpenStack team.


[Bug 1840348] Re: Sharded OpWQ drops suicide_grace after waiting for work

2021-10-18 Thread Kellen Renshaw
@dbungert - Taking over this one for hillpd. We would like to pursue
this backport to Bionic/Queens (dropping Stein).

The backport was originally held back because there is no viable
reproducer. Since the change has been upstream since May 1, 2020, there
is now confidence that it does not introduce a regression.

** Changed in: cloud-archive/stein
   Status: In Progress => Won't Fix

** Changed in: ceph (Ubuntu Bionic)
 Assignee: Dan Hill (hillpd) => Kellen Renshaw (krenshaw)

** Changed in: cloud-archive/queens
 Assignee: Dan Hill (hillpd) => Kellen Renshaw (krenshaw)


[Bug 1840348] Re: Sharded OpWQ drops suicide_grace after waiting for work

2021-10-14 Thread Dan Bungert
@hillpd - should this still be in the sponsor queue?  Are we still
trying to SRU this to Bionic?


[Bug 1840348] Re: Sharded OpWQ drops suicide_grace after waiting for work

2020-12-17 Thread Dan Hill
** Changed in: cloud-archive/train
   Importance: Undecided => Medium

** Changed in: cloud-archive/stein
   Importance: Undecided => Medium

** Changed in: cloud-archive/queens
   Importance: Undecided => Medium

** Changed in: cloud-archive
   Importance: Undecided => Medium

** Changed in: cloud-archive/rocky
   Importance: Undecided => Medium


[Bug 1840348] Re: Sharded OpWQ drops suicide_grace after waiting for work

2020-12-08 Thread Billy Olsen
** Changed in: cloud-archive/rocky
   Status: Invalid => Won't Fix


[Bug 1840348] Re: Sharded OpWQ drops suicide_grace after waiting for work

2020-12-08 Thread Dan Hill
** Changed in: cloud-archive/train
   Status: New => Fix Released

** Changed in: cloud-archive/rocky
   Status: New => Invalid

** Changed in: cloud-archive/stein
   Status: New => In Progress

** Changed in: cloud-archive/train
 Assignee: (unassigned) => Dan Hill (hillpd)

** Changed in: cloud-archive/stein
 Assignee: (unassigned) => Dan Hill (hillpd)

** Changed in: cloud-archive/rocky
 Assignee: (unassigned) => Dan Hill (hillpd)

** Changed in: cloud-archive/queens
 Assignee: (unassigned) => Dan Hill (hillpd)

** Changed in: ceph (Ubuntu Bionic)
   Status: Confirmed => In Progress

** Changed in: cloud-archive/queens
   Status: New => In Progress

** Changed in: cloud-archive
   Status: New => Fix Released


[Bug 1840348] Re: Sharded OpWQ drops suicide_grace after waiting for work

2020-12-08 Thread Billy Olsen
** Also affects: cloud-archive/train
   Importance: Undecided
   Status: New

** Also affects: cloud-archive/rocky
   Importance: Undecided
   Status: New

** Also affects: cloud-archive/queens
   Importance: Undecided
   Status: New

** Also affects: cloud-archive/stein
   Importance: Undecided
   Status: New


[Bug 1840348] Re: Sharded OpWQ drops suicide_grace after waiting for work

2020-12-02 Thread Dan Hill
** Also affects: cloud-archive
   Importance: Undecided
   Status: New

** Changed in: cloud-archive
 Assignee: (unassigned) => Dan Hill (hillpd)


[Bug 1840348] Re: Sharded OpWQ drops suicide_grace after waiting for work

2020-11-30 Thread Mathew Hodson
** Changed in: ceph (Ubuntu)
   Status: Confirmed => Fix Released


[Bug 1840348] Re: Sharded OpWQ drops suicide_grace after waiting for work

2020-11-30 Thread Dan Hill
The SRUs for 15.2.5 [0] and 14.2.11 [1] have been released and contain a
fix for this issue.

We are currently evaluating the need for a fix in Luminous
(Bionic/Queens) and Mimic (Stein).

[0] https://bugs.launchpad.net/ubuntu/+source/ceph/+bug/1898200
[1] https://bugs.launchpad.net/cloud-archive/+bug/1891077


[Bug 1840348] Re: Sharded OpWQ drops suicide_grace after waiting for work

2020-11-30 Thread Dan Hill
** Changed in: ceph (Ubuntu Focal)
   Status: Confirmed => Fix Released


[Bug 1840348] Re: Sharded OpWQ drops suicide_grace after waiting for work

2020-08-18 Thread Brian Murray
The Eoan Ermine has reached end of life, so this bug will not be fixed
for that release.

** Changed in: ceph (Ubuntu Eoan)
   Status: Confirmed => Won't Fix


[Bug 1840348] Re: Sharded OpWQ drops suicide_grace after waiting for work

2020-04-15 Thread Dan Hill
** Description changed:

  [Impact]
  The Sharded OpWQ will opportunistically wait for more work when processing an
  empty queue. While waiting, the heartbeat timeout and suicide_grace values are
  modified. The `threadpool_default_timeout` grace is left applied and
  suicide_grace is disabled.
  
  After finding work, the original work queue grace/suicide_grace values
  are not re-applied. This can result in hung operations that do not
  trigger an OSD suicide recovery.
  
  The missing suicide recovery was observed on Luminous 12.2.11. The
  environment was consistently hitting a known authentication race
  condition (issue#37778 [0]) due to repeated OSD service restarts on a
  node exhibiting MCEs from a faulty DIMM.
  
  The auth race condition would stall pg operations. In some cases, the
  hung ops would persist for hours without suicide recovery.
  
  [Test Case]
- - In-Progress -
- Haven't landed on a reliable reproducer. Currently testing the fix by
- exercising I/O. Since the fix applies to all versions of Ceph, the plan is to
- let this bake in the latest release before considering a back-port.
+ I have not identified a reliable reproducer. Currently testing the fix by
+ exercising I/O.
+ 
+ Recommend letting this bake upstream before considering a back-port.
  
  [Regression Potential]
  This fix improves suicide_grace coverage of the Sharded OpWQ.
  
  This change is made in a critical code path that drives client I/O. An
  OSD suicide will trigger a service restart and repeated restarts
  (flapping) will adversely impact cluster performance.
  
  The fix mitigates risk by keeping the applied suicide_grace value
  consistent with the value applied before entering
  `OSD::ShardedOpWQ::_process()`. The fix is also restricted to the empty
  queue edge-case that drops the suicide_grace timeout. The suicide_grace
  value is only re-applied when work is found after waiting on an empty
  queue.
  
  - In-Progress -
- The fix needs to bake upstream on later levels before back-port consideration.
+ Opened upstream tracker for issue#45076 [1] and fix pr#34575 [2]
+ 
+ [0] https://tracker.ceph.com/issues/37778
+ [1] https://tracker.ceph.com/issues/45076
+ [2] https://github.com/ceph/ceph/pull/34575
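
The behaviour described under [Regression Potential] can be sketched as a
small model. This is not the Ceph source or the patch in pr#34575; the names
`Handle`, `reset_timeout`, `process_shard` and `wait_for_work` are
hypothetical, and the numeric values used with it are only illustrative. It
shows the intended shape of the fix: the relaxed values are applied while
waiting on an empty queue, and the work queue's own grace/suicide_grace are
re-applied only when work is found after that wait.

```python
from dataclasses import dataclass


@dataclass
class Handle:
    """Hypothetical per-thread heartbeat state; 0 means 'check disabled'."""
    grace_deadline: float = 0
    suicide_deadline: float = 0


def reset_timeout(h, now, grace, suicide_grace):
    """Arm the grace deadline; a suicide_grace of 0 disables the suicide check."""
    h.grace_deadline = now + grace
    h.suicide_deadline = now + suicide_grace if suicide_grace else 0


def process_shard(h, now, queue, wq_grace, wq_suicide_grace,
                  default_timeout, wait_for_work):
    """Simplified model of the empty-queue path with the fix applied.

    The caller is assumed to have applied wq_grace/wq_suicide_grace to `h`
    before calling, as the description says happens before entering
    OSD::ShardedOpWQ::_process().
    """
    if not queue:
        # Empty queue: relax the heartbeat while waiting for more work.
        reset_timeout(h, now, default_timeout, 0)
        now = wait_for_work(queue, now)  # hypothetical; returns the new time
        if not queue:
            return None  # still empty, nothing to protect
        # Work was found after waiting: restore the work queue's values.
        reset_timeout(h, now, wq_grace, wq_suicide_grace)
    return queue.pop(0)
```

If the queue already holds work on entry, the model leaves the caller's
values untouched, which matches the statement that the fix is restricted to
the empty-queue edge-case.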


[Bug 1840348] Re: Sharded OpWQ drops suicide_grace after waiting for work

2020-04-13 Thread Dan Hill
** Description changed:

  [Impact]
- The Sharded OpWQ will opportunistically wait for more work when processing an
- empty queue. While waiting, the heartbeat timeout and suicide_grace values are
- modified. On Luminous, the `threadpool_default_timeout` grace is left applied
- and suicide_grace is left disabled. On later releases both the grace and
- suicide_grace are left disabled.
+ The Sharded OpWQ will opportunistically wait for more work when processing an
+ empty queue. While waiting, the heartbeat timeout and suicide_grace values are
+ modified. The `threadpool_default_timeout` grace is left applied and
+ suicide_grace is disabled.
  
  After finding work, the original work queue grace/suicide_grace values
  are not re-applied. This can result in hung operations that do not
  trigger an OSD suicide recovery.
  
  The missing suicide recovery was observed on Luminous 12.2.11. The
  environment was consistently hitting a known authentication race
  condition (issue#37778 [0]) due to repeated OSD service restarts on a
  node exhibiting MCEs from a faulty DIMM.
  
  The auth race condition would stall pg operations. In some cases, the
  hung ops would persist for hours without suicide recovery.
  
  [Test Case]
  - In-Progress -
  Haven't landed on a reliable reproducer. Currently testing the fix by
  exercising I/O. Since the fix applies to all versions of Ceph, the plan is to
  let this bake in the latest release before considering a back-port.
  
  [Regression Potential]
  This fix improves suicide_grace coverage of the Sharded OpWQ.
  
  This change is made in a critical code path that drives client I/O. An
  OSD suicide will trigger a service restart and repeated restarts
  (flapping) will adversely impact cluster performance.
  
  The fix mitigates risk by keeping the applied suicide_grace value
  consistent with the value applied before entering
  `OSD::ShardedOpWQ::_process()`. The fix is also restricted to the empty
  queue edge-case that drops the suicide_grace timeout. The suicide_grace
  value is only re-applied when work is found after waiting on an empty
  queue.
  
  - In-Progress -
  The fix needs to bake upstream on later levels before back-port consideration.


[Bug 1840348] Re: Sharded OpWQ drops suicide_grace after waiting for work

2020-04-10 Thread Ubuntu Foundations Team Bug Bot
The attachment
"ceph_12.2.13-0ubuntu0.18.04.1+20200409sf00238701b1.debdiff" seems to be
a debdiff.  The ubuntu-sponsors team has been subscribed to the bug
report so that they can review and hopefully sponsor the debdiff.  If
the attachment isn't a patch, please remove the "patch" flag from the
attachment, remove the "patch" tag, and if you are a member of
~ubuntu-sponsors, unsubscribe the team.

[This is an automated message performed by a Launchpad user owned by
~brian-murray, for any issue please contact him.]

** Tags added: patch


[Bug 1840348] Re: Sharded OpWQ drops suicide_grace after waiting for work

2020-04-10 Thread Dan Hill
** Also affects: ceph (Ubuntu Focal)
   Importance: Medium
 Assignee: Dan Hill (hillpd)
   Status: Triaged

** Also affects: ceph (Ubuntu Bionic)
   Importance: Undecided
   Status: New

** Also affects: ceph (Ubuntu Eoan)
   Importance: Undecided
   Status: New

** Changed in: ceph (Ubuntu Bionic)
   Status: New => Confirmed

** Changed in: ceph (Ubuntu Bionic)
 Assignee: (unassigned) => Dan Hill (hillpd)

** Changed in: ceph (Ubuntu Eoan)
 Assignee: (unassigned) => Dan Hill (hillpd)

** Changed in: ceph (Ubuntu Bionic)
   Importance: Undecided => Medium

** Changed in: ceph (Ubuntu Eoan)
   Importance: Undecided => Medium

** Changed in: ceph (Ubuntu Eoan)
   Status: New => Confirmed

** Changed in: ceph (Ubuntu Focal)
   Status: Triaged => Confirmed

** Description changed:

  [Impact]
  The Sharded OpWQ will opportunistically wait for more work when processing an
  empty queue. While waiting, the heartbeat timeout and suicide_grace values are
  modified. On Luminous, the `threadpool_default_timeout` grace is left applied
  and suicide_grace is left disabled. On later releases both the grace and
  suicide_grace are left disabled.
  
  After finding work, the original work queue grace/suicide_grace values
  are not re-applied. This can result in hung operations that do not
  trigger an OSD suicide recovery.
  
  The missing suicide recovery was observed on Luminous 12.2.11. The
  environment was consistently hitting a known authentication race
  condition (issue#37778 [0]) due to repeated OSD service restarts on a
  node exhibiting MCEs from a faulty DIMM.
  
  The auth race condition would stall pg operations. In some cases, the
  hung ops would persist for hours without suicide recovery.
  
  [Test Case]
  - In-Progress -
  Haven't landed on a reliable reproducer. Currently testing the fix by
  exercising I/O. Since the fix applies to all versions of Ceph, the plan is to
  let this bake in the latest release before considering a back-port.
  
  [Regression Potential]
  This fix improves suicide_grace coverage of the Sharded OpWQ.
  
  This change is made in a critical code path that drives client I/O. An
  OSD suicide will trigger a service restart and repeated restarts
  (flapping) will adversely impact cluster performance.
  
  The fix mitigates risk by keeping the applied suicide_grace value
  consistent with the value applied before entering
  `OSD::ShardedOpWQ::_process()`. The fix is also restricted to the empty
  queue edge-case that drops the suicide_grace timeout. The suicide_grace
  value is only re-applied when work is found after waiting on an empty
  queue.
  
  - In-Progress -
- The fix will bake upstream on later levels before back-port consideration.
+ The fix needs to bake upstream on later levels before back-port consideration.


[Bug 1840348] Re: Sharded OpWQ drops suicide_grace after waiting for work

2020-04-10 Thread Dan Hill
Attaching the proposed fix for 12.2.13 that I am testing.

** Patch added: "ceph_12.2.13-0ubuntu0.18.04.1+20200409sf00238701b1.debdiff"
   https://bugs.launchpad.net/ubuntu/+source/ceph/+bug/1840348/+attachment/5351517/+files/ceph_12.2.13-0ubuntu0.18.04.1+20200409sf00238701b1.debdiff


[Bug 1840348] Re: Sharded OpWQ drops suicide_grace after waiting for work

2020-04-10 Thread Dan Hill
There are two edge-cases in 12.2.11 where a worker thread's suicide_grace
value gets dropped:
[0] In the ThreadPool context, ThreadPool::worker() drops suicide_grace
while waiting on an empty work queue.
[1] In the ShardedThreadPool context, OSD::ShardedOpWQ::_process() drops
suicide_grace while opportunistically waiting for more work (to prevent
additional lock contention).

The ThreadPool context always re-assigns suicide_grace before driving
any work. The ShardedThreadPool context does not follow this pattern.
After delaying to find additional work, the default sharded work queue
timeouts are not re-applied.

This oversight exists from Luminous onwards. Mimic and Nautilus have
each reworked the ShardedOpWQ code path, but neither addressed the
problem.

[0] https://github.com/ceph/ceph/blob/v12.2.11/src/common/WorkQueue.cc#L137
[1] https://github.com/ceph/ceph/blob/v12.2.11/src/osd/OSD.cc#L10476
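
The difference between the two contexts can be illustrated with a small
model. This is a sketch under stated assumptions, not the Ceph code linked
above: `Handle`, `reset_timeout`, `would_suicide` and the two `*_pickup`
functions are hypothetical names, and `DEFAULT_TIMEOUT` is an illustrative
stand-in for `threadpool_default_timeout`.

```python
DEFAULT_TIMEOUT = 60  # illustrative stand-in for threadpool_default_timeout


class Handle:
    """Hypothetical per-thread heartbeat state; 0 means 'check disabled'."""

    def __init__(self):
        self.grace_deadline = 0
        self.suicide_deadline = 0


def reset_timeout(h, now, grace, suicide_grace):
    """Arm the grace deadline; a suicide_grace of 0 disables the suicide check."""
    h.grace_deadline = now + grace
    h.suicide_deadline = now + suicide_grace if suicide_grace else 0


def would_suicide(h, now):
    """True if a heartbeat check at time `now` would trigger a suicide."""
    return h.suicide_deadline != 0 and now > h.suicide_deadline


def plain_pool_pickup(h, now, grace, suicide_grace, waited):
    """Case [0]: suicide_grace is dropped while idle, then re-assigned."""
    if waited:
        reset_timeout(h, now, DEFAULT_TIMEOUT, 0)  # idle on an empty queue
    reset_timeout(h, now, grace, suicide_grace)    # always, before driving work


def sharded_pickup_unpatched(h, now, grace, suicide_grace, waited):
    """Case [1]: nothing is re-assigned after the opportunistic wait."""
    reset_timeout(h, now, grace, suicide_grace)    # applied before processing
    if waited:
        reset_timeout(h, now, DEFAULT_TIMEOUT, 0)  # idle on an empty shard
    # Work starts here with whatever values are currently applied.
```

In this model, an operation that hangs after `sharded_pickup_unpatched(...,
waited=True)` never satisfies `would_suicide`, however long it stalls, while
the same hang after `plain_pool_pickup` does once the suicide deadline has
passed.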

** Description changed:

- Multiple incidents have been seen where ops were blocked for various
- reasons and the suicide_grace timeout was not observed, meaning that the
- OSD failed to suicide as expected.
+ [Impact]
+ The Sharded OpWQ will opportunistically wait for more work when processing an
+ empty queue. While waiting, the heartbeat timeout and suicide_grace values are
+ modified. On Luminous, the `threadpool_default_timeout` grace is left applied
+ and suicide_grace is left disabled. On later releases both the grace and
+ suicide_grace are left disabled. 
+ 
+ After finding work, the original work queue grace/suicide_grace values are
+ not re-applied. This can result in hung operations that do not trigger an OSD
+ suicide recovery.
+ 
+ The missing suicide recovery was observed on Luminous 12.2.11. The environment
+ was consistently hitting a known authentication race condition (issue#37778
+ [0]) due to repeated OSD service restarts on a node exhibiting MCEs from a
+ faulty DIMM. 
+ 
+ The auth race condition would stall pg operations. In some cases the hung ops
+ would persist for hours without suicide recovery.
+ 
+ [Test Case]
+ - In-Progress -
+ Haven't landed on a reliable reproducer. Currently testing the fix by
+ exercising I/O. Since the fix applies to all versions of Ceph, the plan is to
+ let this bake in the latest release before considering a back-port.
+ 
+ [Regression Potential]
+ This fix improves suicide_grace coverage of the Sharded OpWQ.
+ 
+ This change is made in a critical code path that drives client I/O. An OSD
+ suicide will trigger a service restart and repeated restarts (flapping) will
+ adversely impact cluster performance. 
+ 
+ The fix mitigates risk by keeping the applied suicide_grace value consistent
+ with the value applied before entering `OSD::ShardedOpWQ::_process()`. The fix
+ is also restricted to the empty queue edge-case that drops the suicide_grace
+ timeout. The suicide_grace value is only re-applied when work is found after
+ waiting on an empty queue. 
+ 
+ - In-Progress -
+ The fix will bake upstream on later levels before back-port consideration.

** Description changed:

  [Impact]
- The Sharded OpWQ will opportunistically wait for more work when processing an
- empty queue. While waiting, the heartbeat timeout and suicide_grace values are
- modified. On Luminous, the `threadpool_default_timeout` grace is left applied
- and suicide_grace is left disabled. On later releases both the grace and
- suicide_grace are left disabled. 
+ The Sharded OpWQ will opportunistically wait for more work when processing an
+ empty queue. While waiting, the heartbeat timeout and suicide_grace values are
+ modified. On Luminous, the `threadpool_default_timeout` grace is left applied
+ and suicide_grace is left disabled. On later releases both the grace and
+ suicide_grace are left disabled.
  
- After finding work, the original work queue grace/suicide_grace values are
- not re-applied. This can result in hung operations that do not trigger an OSD
- suicide recovery.
+ After finding work, the original work queue grace/suicide_grace values
+ are not re-applied. This can result in hung operations that do not
+ trigger an OSD suicide recovery.
  
- The missing suicide recovery was observed on Luminous 12.2.11. The environment
- was consistently hitting a known authentication race condition (issue#37778
- [0]) due to repeated OSD service restarts on a node exhibiting MCEs from a
- faulty DIMM. 
+ The missing suicide recovery was observed on Luminous 12.2.11. The
+ environment was consistently hitting a known authentication race
+ condition (issue#37778 [0]) due to repeated OSD service restarts on a
+ node exhibiting MCEs from a faulty DIMM.
  
- The auth race condition would stall pg operations. In some cases the hung ops
- would persist for hours without suicide recovery.
+ The auth race condition would stall pg operations. In some cases, the
+ hung ops would persist for hours without suicide recovery.
  
  [Test Case]
  - In-Progress -
- Haven't landed on a reliable reproducer. 

[Bug 1840348] Re: Sharded OpWQ drops suicide_grace after waiting for work

2020-04-10 Thread Dan Hill
** Summary changed:

- Ceph 12.2.11-0ubuntu0.18.04.2 doesn't honor suicide_grace
+ Sharded OpWQ drops suicide_grace after waiting for work
