[Bug 1906496] Re: [SRU] mgr can be very slow in a large ceph cluster

2021-01-19 Thread Corey Bryant
This bug was fixed in the package ceph - 12.2.13-0ubuntu0.18.04.6~cloud0
---

 ceph (12.2.13-0ubuntu0.18.04.6~cloud0) xenial-queens; urgency=medium
 .
   * New update for the Ubuntu Cloud Archive.
 .
 ceph (12.2.13-0ubuntu0.18.04.6) bionic; urgency=medium
 .
   * d/p/bug1906496.patch: disable network stats in
     dump_osd_stats (LP: #1906496)


** Changed in: cloud-archive/stein
   Status: Fix Committed => Fix Released

** Changed in: cloud-archive/queens
   Status: Fix Committed => Fix Released

-- 
You received this bug notification because you are a member of Ubuntu
Bugs, which is subscribed to Ubuntu.
https://bugs.launchpad.net/bugs/1906496

Title:
  [SRU] mgr can be very slow in a large ceph cluster

To manage notifications about this bug go to:
https://bugs.launchpad.net/cloud-archive/+bug/1906496/+subscriptions

-- 
ubuntu-bugs mailing list
ubuntu-bugs@lists.ubuntu.com
https://lists.ubuntu.com/mailman/listinfo/ubuntu-bugs

[Bug 1906496] Re: [SRU] mgr can be very slow in a large ceph cluster

2021-01-19 Thread Corey Bryant
This bug was fixed in the package ceph - 13.2.9-0ubuntu0.19.04.1~cloud3
---

 ceph (13.2.9-0ubuntu0.19.04.1~cloud3) bionic-stein; urgency=medium
 .
   * d/p/bug1906496.patch: disable network stats in
     dump_osd_stats (LP: #1906496)


[Bug 1906496] Re: [SRU] mgr can be very slow in a large ceph cluster

2021-01-18 Thread Launchpad Bug Tracker
This bug was fixed in the package ceph - 12.2.13-0ubuntu0.18.04.6

---
ceph (12.2.13-0ubuntu0.18.04.6) bionic; urgency=medium

  * d/p/bug1906496.patch: disable network stats in
    dump_osd_stats (LP: #1906496)

 -- Ponnuvel Palaniyappan   Mon, 07 Dec 2020
18:15:24 +

** Changed in: ceph (Ubuntu Bionic)
   Status: Fix Committed => Fix Released


[Bug 1906496] Re: [SRU] mgr can be very slow in a large ceph cluster

2021-01-16 Thread Mathew Hodson
** Tags removed: verification-needed-done
** Tags added: verification-bionic-done

** Tags removed: verification-bionic-done
** Tags added: verification-done-bionic


[Bug 1906496] Re: [SRU] mgr can be very slow in a large ceph cluster

2021-01-08 Thread Ponnuvel Palaniyappan
Queens verification of 12.2.13-0ubuntu0.18.04.6~cloud0

Ran the same tests on Queens - ceph-mgr was functional and responsive
under cluster load.


** Attachment added: "queens.sru"
   
https://bugs.launchpad.net/ubuntu/bionic/+source/ceph/+bug/1906496/+attachment/5450803/+files/queens.sru

** Tags removed: verification-needed verification-queens-needed
** Tags added: verification-done verification-queens-done


[Bug 1906496] Re: [SRU] mgr can be very slow in a large ceph cluster

2021-01-07 Thread Ponnuvel Palaniyappan
** Description changed:

  [Impact]
  Ceph upstream implemented a new feature [1] that checks and reports long
  network ping times between OSDs, but it introduced an issue where ceph-mgr
  can become very slow because it needs to dump all of the new OSD network
  ping stats [2] for some tasks. This is especially bad when the cluster has
  a large number of OSDs.

  These OSD network ping stats don't need to be exposed to the Python mgr
  modules, so dumping them only makes the mgr do more work than it needs to.
  This can make the mgr slow or even hang, and can keep the CPU usage of the
  mgr process constantly high. The fix is to disable the ping time dump for
  those mgr Python modules.

  This resulted in ceph-mgr not responding to commands and/or hanging (and
  having to be restarted) in clusters with many OSDs.

  [0] is the upstream bug. It was backported to Nautilus but rejected for
  Luminous and Mimic because they reached EOL upstream. But I want to
  backport it to these two releases in Ubuntu/UCA.

  The major fix from upstream is [3]; I also found an improvement commit [4]
  that was submitted later in another PR.

  [Test Case]
  Deploy a Ceph cluster (Luminous 12.2.13 or Mimic 13.2.9) with a large
  number of Ceph OSDs (600+). During normal operation of the cluster, as the
  ceph-mgr dumps the network ping stats regularly, the problem manifests.
  This is relatively hard to reproduce as the ceph-mgr may not always get
  overloaded and thus may not hang.

  A simpler version is to deploy a Ceph cluster with as many OSDs as the
  hardware/system setup allows (not necessarily 600+) and drive I/O on the
  cluster for some time (say, 60 minutes). Various queries can then be sent
  to the manager to verify that it still reports stats and doesn't get
  stuck (a sketch of such a check is included after the reference list
  below).

  [Regression Potential]
  The fix has been accepted upstream (the changes here are in "sync" with
  upstream to the extent these old releases match the latest source code)
  and has been confirmed to work, so the risk is minimal.

  At worst, this could affect modules that consume the stats from ceph-mgr
  (such as prometheus or other monitoring scripts/tools), making them less
  useful. But it still shouldn't cause any problems for the operation of
  the cluster itself.

  [Other Info]
  - In addition to the fix from [1], another commit [4] is also
  cherry-picked and backported here - this was also accepted upstream.

  - Since the ceph-mgr hangs when affected, this also impacts sosreport
  collection - commands time out as the mgr doesn't respond, so information
  gets truncated or not collected. This fix should help avoid that problem
  in sosreports.

  [0] https://tracker.ceph.com/issues/43364
  [1] https://github.com/ceph/ceph/pull/28755
  [2] https://github.com/ceph/ceph/pull/28755/files#diff-5498d83111f1210998ee186e98d5836d2bce9992be7648addc83f59e798cddd8L430
  [3] https://github.com/ceph/ceph/pull/32406
  [4] https://github.com/ceph/ceph/pull/32554/commits/1112584621016c4a8cac1bedb1a1b8b17c394f7f
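As a rough illustration of the simplified test case above (not an official
test script), the sketch below drives write load with rados bench and
periodically times a few queries while the load runs. It assumes it is run
on a cluster node with client.admin credentials, that a throwaway pool
named 'srutest' exists, and that the listed commands are reasonable
stand-ins for "queries sent to the manager" on these releases:

#!/usr/bin/env python3
# Rough SRU verification sketch: drive I/O and time mgr-backed queries.
# Assumptions: run with client.admin credentials, a pool named 'srutest'
# already exists, and a response slower than QUERY_TIMEOUT counts as the
# mgr being stuck.
import subprocess
import time

POOL = "srutest"        # hypothetical throwaway pool
RUN_SECONDS = 3600      # "drive I/O ... for some time (say, 60 minutes)"
QUERY_TIMEOUT = 10      # seconds before we consider the mgr stuck
QUERIES = [             # commands assumed to exercise ceph-mgr / its modules
    ["ceph", "pg", "dump", "--format", "json"],
    ["ceph", "osd", "perf"],
    ["ceph", "osd", "pool", "stats"],
    ["ceph", "-s"],     # general cluster status as a baseline
]

def main():
    # Background write load, roughly "drive I/O on the cluster".
    load = subprocess.Popen(
        ["rados", "bench", "-p", POOL, str(RUN_SECONDS), "write",
         "--no-cleanup"],
        stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL)
    try:
        deadline = time.time() + RUN_SECONDS
        while time.time() < deadline:
            for cmd in QUERIES:
                start = time.time()
                try:
                    subprocess.run(cmd, stdout=subprocess.DEVNULL,
                                   stderr=subprocess.DEVNULL,
                                   timeout=QUERY_TIMEOUT, check=False)
                    print("%-40s %6.2fs" % (" ".join(cmd),
                                            time.time() - start))
                except subprocess.TimeoutExpired:
                    print("%-40s TIMED OUT (>%ds)" % (" ".join(cmd),
                                                      QUERY_TIMEOUT))
            time.sleep(30)
    finally:
        load.terminate()

if __name__ == "__main__":
    main()

On a healthy mgr the queries should keep completing within a few seconds
for the whole run; repeated timeouts or steadily growing latencies are the
symptom this SRU addresses.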


[Bug 1906496] Re: [SRU] mgr can be very slow in a large ceph cluster

2021-01-07 Thread Ponnuvel Palaniyappan
** Changed in: ceph (Ubuntu Bionic)
 Assignee: (unassigned) => Ponnuvel Palaniyappan (pponnuvel)

** Changed in: cloud-archive/stein
 Assignee: (unassigned) => Ponnuvel Palaniyappan (pponnuvel)

** Changed in: cloud-archive/queens
 Assignee: (unassigned) => Ponnuvel Palaniyappan (pponnuvel)


[Bug 1906496] Re: [SRU] mgr can be very slow in a large ceph cluster

2021-01-07 Thread Ponnuvel Palaniyappan
Bionic verification of 12.2.13-0ubuntu0.18.04.6

** Attachment added: "bionic.sru"
   
https://bugs.launchpad.net/ubuntu/+source/ceph/+bug/1906496/+attachment/5450097/+files/bionic.sru

** Tags removed: verification-needed-bionic verification-stein-needed
** Tags added: verification-needed-done verification-stein-done


[Bug 1906496] Re: [SRU] mgr can be very slow in a large ceph cluster

2021-01-07 Thread Ponnuvel Palaniyappan
Stein (UCA) verification of 13.2.9-0ubuntu0.19.04.1~cloud3

** Attachment added: "stein.sru"
   
https://bugs.launchpad.net/ubuntu/+source/ceph/+bug/1906496/+attachment/5450096/+files/stein.sru


[Bug 1906496] Re: [SRU] mgr can be very slow in a large ceph cluster

2021-01-07 Thread Ponnuvel Palaniyappan
SRU tests:
Deployed a cluster with many OSDs using the new packages; I/O was driven
from a VM (both reads and writes). A number of mgr modules were enabled,
too. Under load, the cluster was functioning and the mgr was still
responding. Attaching some relevant info on the tests here.
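For anyone repeating this, here is a small companion sketch (not part of
the attached logs) of how the "mgr still responding, CPU not constantly
high" observation can be spot-checked while the load runs. It assumes
pidof finds the active ceph-mgr daemon on the node it runs on, and simply
samples the process CPU figure reported by ps:

#!/usr/bin/env python3
# Sample ceph-mgr CPU usage while the cluster is under load (sketch only).
import subprocess
import time

SAMPLES = 20      # number of samples to take
INTERVAL = 15     # seconds between samples

def mgr_cpu_percent():
    # pidof prints the PIDs of all ceph-mgr processes on this host.
    pids = subprocess.run(["pidof", "ceph-mgr"], capture_output=True,
                          text=True).stdout.split()
    if not pids:
        return None
    # %cpu from ps is the lifetime average (cputime/elapsed), which is
    # enough to spot a manager that has been busy-looping for a while.
    out = subprocess.run(["ps", "-o", "%cpu=", "-p", pids[0]],
                         capture_output=True, text=True).stdout.strip()
    return float(out) if out else None

for _ in range(SAMPLES):
    cpu = mgr_cpu_percent()
    if cpu is None:
        print("ceph-mgr: not running on this host")
    else:
        print("ceph-mgr CPU (ps %%cpu, lifetime average): %.1f%%" % cpu)
    time.sleep(INTERVAL)

With the fix applied, this figure should not sit persistently high the way
the bug describes, even with many OSDs and several mgr modules enabled.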


[Bug 1906496] Re: [SRU] mgr can be very slow in a large ceph cluster

2021-01-06 Thread Robie Basak
Hello dongdong, or anyone else affected,

Accepted ceph into bionic-proposed. The package will build now and be
available at
https://launchpad.net/ubuntu/+source/ceph/12.2.13-0ubuntu0.18.04.6 in a
few hours, and then in the -proposed repository.

Please help us by testing this new package.  See
https://wiki.ubuntu.com/Testing/EnableProposed for documentation on how
to enable and use -proposed.  Your feedback will aid us getting this
update out to other Ubuntu users.
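For convenience, here is a small sketch of what the EnableProposed page
describes for Bionic: add an apt source for bionic-proposed and pin it low
so packages are only pulled from that pocket when explicitly requested.
The file paths here are illustrative and the steps are the wiki's, not
something specific to this bug:

#!/usr/bin/env python3
# Enable the bionic-proposed pocket with a low apt pin (sketch, run as root).
# Afterwards: apt update && apt install <package>/bionic-proposed
SOURCES = "/etc/apt/sources.list.d/bionic-proposed.list"   # illustrative path
PREFS = "/etc/apt/preferences.d/proposed-updates"          # illustrative path

with open(SOURCES, "w") as f:
    f.write("deb http://archive.ubuntu.com/ubuntu/ "
            "bionic-proposed restricted main multiverse universe\n")

with open(PREFS, "w") as f:
    # Priority 400 keeps -proposed packages from being installed
    # automatically; they must be requested by name/release.
    f.write("Package: *\n"
            "Pin: release a=bionic-proposed\n"
            "Pin-Priority: 400\n")

print("Wrote %s and %s; now run: apt update" % (SOURCES, PREFS))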

If this package fixes the bug for you, please add a comment to this bug,
mentioning the version of the package you tested, what testing has been
performed on the package and change the tag from verification-needed-bionic
to verification-done-bionic. If it does not fix the bug for you, please add
a comment stating that, and change the tag to verification-failed-bionic.
In either case, without details of your testing we will not be able to
proceed.

Further information regarding the verification process can be found at
https://wiki.ubuntu.com/QATeam/PerformingSRUVerification .  Thank you in
advance for helping!

N.B. The updated package will be released to -updates after the bug(s)
fixed by this package have been verified and the package has been in
-proposed for a minimum of 7 days.

** Changed in: ceph (Ubuntu Bionic)
   Status: Triaged => Fix Committed

** Tags added: verification-needed verification-needed-bionic


[Bug 1906496] Re: [SRU] mgr can be very slow in a large ceph cluster

2020-12-14 Thread Ponnuvel Palaniyappan
** Description changed:

  [Impact]
- Ceph upstream implemented a new feature [1] that will check/report those long 
network ping times between osds, but it introduced an issue that ceph-mgr might 
be very slow because it needs to dump all the new osd network ping stats [2] 
for some tasks, this can be bad especially when the cluster has large number of 
osds.
+ Ceph upstream implemented a new feature [1] that will check/report those long 
network ping times between OSDs, but it introduced an issue that ceph-mgr might 
be very slow because it needs to dump all the new OSD network ping stats [2] 
for some tasks, this can be bad especially when the cluster has large number of 
OSDs.
  
- Since these kind osd network ping stats doesn't need to be exposed to
+ Since these kind OSD network ping stats doesn't need to be exposed to
  the python mgr module. so, it only makes the mgr doing more work than it
  needs to, it could cause the mgr slow or even hang and could cause the
- cpu usage of mgr process constantly high. The fix is to disable the ping
+ CPU usage of mgr process constantly high. The fix is to disable the ping
  time dump for those mgr python modules.
  
  This resulted in ceph-mgr not responding to commands and/or hanging (and
  had to be restarted) in clusters with a large number of OSDs.
  
- [0] is the upstreambug. It was backported to Nautilus but rejected for
+ [0] is the upstream bug. It was backported to Nautilus but rejected for
  Luminous and Mimic because they reached EOL in upstream. But I want to
  backport to these two releases Ubuntu/UCA.
  
  The major fix from upstream is here [3], and also I found an improvement
  commit [4] that submitted later in another PR.
  
  [Test Case]
  Deploy a Ceph cluster (Luminous 13.2.9 or Mimic 13.2.9) with large number of 
Ceph OSDs (600+). During normal operations of the cluster, as the ceph-mgr 
dumps the network ping stats regularly, this problem would manifest. This is 
relatively hard to reproduce as the ceph-mgr may not always get overloaded and 
thus not hang.
  
  A simpler version could be to deploy a Ceph cluster with as many OSDs as
  the hardware/system setup allows (not necessarily 600+) and drive I/O on
- the cluster for sometime. Then various queries could be sent to the
- manager to verify it does report and doesn't get stuck.
+ the cluster for sometime (say, 60 mins). Then various queries could be
+ sent to the manager to verify it does report and doesn't get stuck.
  
  [Regression Potential]
  Fix has been accepted upstream (the changes are here in "sync" with upstream 
to the extent these old releases match the latest source code) and have been 
confirmed to work. So the risk is minimal.
  
  At worst, this could affect modules that consume the stats from ceph-mgr
  (such as prometheus or other monitoring scripts/tools) and thus becomes
  less useful. But still shouldn't cause any problems to the operations of
  the cluster itself.
  
  [Other Info]
  - In addition to the fix from [1], another commit [4] is also cherry-picked 
and backported here - this was also accepted upstream.
  
  - Since the ceph-mgr hangs when affected, this also impact sosreport
  collection - commands time out as the mgr doesn't respond and thus info
  get truncated/not collected in that case. This fix should help avoid
  that problem in sosreports.
  
  [0] https://tracker.ceph.com/issues/43364
  [1] https://github.com/ceph/ceph/pull/28755
  [2] 
https://github.com/ceph/ceph/pull/28755/files#diff-5498d83111f1210998ee186e98d5836d2bce9992be7648addc83f59e798cddd8L430
  [3] https://github.com/ceph/ceph/pull/32406
  [4] 
https://github.com/ceph/ceph/pull/32554/commits/1112584621016c4a8cac1bedb1a1b8b17c394f7f


[Bug 1906496] Re: [SRU] mgr can be very slow in a large ceph cluster

2020-12-13 Thread Ponnuvel Palaniyappan
@Corey, yes, I am happy to do the SRU verification when the packages are
available. I've updated the [Test case] section to note a simplified,
functional test.

** Description changed:

- [Impact] 
+ [Impact]
  Ceph upstream implemented a new feature [1] that will check/report those long 
network ping times between osds, but it introduced an issue that ceph-mgr might 
be very slow because it needs to dump all the new osd network ping stats [2] 
for some tasks, this can be bad especially when the cluster has large number of 
osds.
  
  Since these kind osd network ping stats doesn't need to be exposed to
  the python mgr module. so, it only makes the mgr doing more work than it
  needs to, it could cause the mgr slow or even hang and could cause the
  cpu usage of mgr process constantly high. The fix is to disable the ping
  time dump for those mgr python modules.
  
  This resulted in ceph-mgr not responding to commands and/or hanging (and
  had to be restarted) in clusters with a large number of OSDs.
  
  [0] is the upstreambug. It was backported to Nautilus but rejected for
  Luminous and Mimic because they reached EOL in upstream. But I want to
  backport to these two releases Ubuntu/UCA.
  
  The major fix from upstream is here [3], and also I found an improvement
  commit [4] that submitted later in another PR.
  
  [Test Case]
  Deploy a Ceph cluster (Luminous 13.2.9 or Mimic 13.2.9) with large number of 
Ceph OSDs (600+). During normal operations of the cluster, as the ceph-mgr 
dumps the network ping stats regularly, this problem would manifest. This is 
relatively hard to reproduce as the ceph-mgr may not always get overloaded and 
thus not hang.
+ 
+ A simpler version could be to deploy a Ceph cluster with as many OSDs as
+ the hardware/system setup allows and drive I/O on the cluster for
+ sometime. Then various queries could be sent to the manager to verify it
+ does report and doesn't get stuck.
  
  [Regression Potential]
  Fix has been accepted upstream (the changes are here in "sync" with upstream 
to the extent these old releases match the latest source code) and have been 
confirmed to work. So the risk is minimal.
  
  At worst, this could affect modules that consume the stats from ceph-mgr
  (such as prometheus or other monitoring scripts/tools) and thus becomes
  less useful. But still shouldn't cause any problems to the operations of
  the cluster itself.
  
  [Other Info]
  - In addition to the fix from [1], another commit [4] is also cherry-picked 
and backported here - this was also accepted upstream.
  
  - Since the ceph-mgr hangs when affected, this also impact sosreport
  collection - commands time out as the mgr doesn't respond and thus info
  get truncated/not collected in that case. This fix should help avoid
  that problem in sosreports.
  
  [0] https://tracker.ceph.com/issues/43364
  [1] https://github.com/ceph/ceph/pull/28755
  [2] 
https://github.com/ceph/ceph/pull/28755/files#diff-5498d83111f1210998ee186e98d5836d2bce9992be7648addc83f59e798cddd8L430
  [3] https://github.com/ceph/ceph/pull/32406
  [4] 
https://github.com/ceph/ceph/pull/32554/commits/1112584621016c4a8cac1bedb1a1b8b17c394f7f

** Description changed:

  [Impact]
  Ceph upstream implemented a new feature [1] that will check/report those long 
network ping times between osds, but it introduced an issue that ceph-mgr might 
be very slow because it needs to dump all the new osd network ping stats [2] 
for some tasks, this can be bad especially when the cluster has large number of 
osds.
  
  Since these kind osd network ping stats doesn't need to be exposed to
  the python mgr module. so, it only makes the mgr doing more work than it
  needs to, it could cause the mgr slow or even hang and could cause the
  cpu usage of mgr process constantly high. The fix is to disable the ping
  time dump for those mgr python modules.
  
  This resulted in ceph-mgr not responding to commands and/or hanging (and
  had to be restarted) in clusters with a large number of OSDs.
  
  [0] is the upstreambug. It was backported to Nautilus but rejected for
  Luminous and Mimic because they reached EOL in upstream. But I want to
  backport to these two releases Ubuntu/UCA.
  
  The major fix from upstream is here [3], and also I found an improvement
  commit [4] that submitted later in another PR.
  
  [Test Case]
  Deploy a Ceph cluster (Luminous 13.2.9 or Mimic 13.2.9) with large number of 
Ceph OSDs (600+). During normal operations of the cluster, as the ceph-mgr 
dumps the network ping stats regularly, this problem would manifest. This is 
relatively hard to reproduce as the ceph-mgr may not always get overloaded and 
thus not hang.
  
  A simpler version could be to deploy a Ceph cluster with as many OSDs as
- the hardware/system setup allows and drive I/O on the cluster for
- sometime. Then various queries could be sent to the manager to verify it
- does report and doesn't get stuck.
+ the hardware/system setup allows (not necessarily 600+) and drive I/O on
+ the cluster for sometime. Then various queries could be sent to the
+ manager to verify it does report and doesn't get stuck.

[Bug 1906496] Re: [SRU] mgr can be very slow in a large ceph cluster

2020-12-11 Thread Corey Bryant
Thanks for the patches, Ponnuvel.

New versions of ceph with the sponsored changes to fix this bug have
been uploaded to stein-staging and the bionic unapproved queue.

@Ponnuvel, would you mind helping to test this once it is available in
proposed? I think the [Test Case] section above needs updating; it can
probably be something simpler that verifies the fix is working as
designed.

Thanks,
Corey


[Bug 1906496] Re: [SRU] mgr can be very slow in a large ceph cluster

2020-12-11 Thread Corey Bryant
** Also affects: ceph (Ubuntu Groovy)
   Importance: Undecided
   Status: New

** Also affects: ceph (Ubuntu Hirsute)
   Importance: High
 Assignee: Ponnuvel Palaniyappan (pponnuvel)
   Status: Fix Released

** Also affects: ceph (Ubuntu Focal)
   Importance: Undecided
   Status: New

** Changed in: ceph (Ubuntu Groovy)
   Status: New => Fix Released

** Changed in: ceph (Ubuntu Focal)
   Status: New => Fix Released

** Also affects: cloud-archive/stein
   Importance: Undecided
   Status: New

** Also affects: cloud-archive/victoria
   Importance: Undecided
   Status: New

** Also affects: cloud-archive/ussuri
   Importance: Undecided
   Status: New

** Also affects: cloud-archive/queens
   Importance: Undecided
   Status: New

** Also affects: cloud-archive/train
   Importance: Undecided
   Status: New

** No longer affects: cloud-archive/victoria

** Changed in: cloud-archive/ussuri
   Status: New => Fix Released

** Changed in: cloud-archive/train
   Status: New => Fix Released

** Changed in: cloud-archive/stein
   Importance: Undecided => High

** Changed in: cloud-archive/stein
   Status: New => Triaged

** Changed in: cloud-archive/queens
   Importance: Undecided => High

** Changed in: cloud-archive/queens
   Status: New => Triaged

** Changed in: cloud-archive
   Status: New => Fix Released

** Changed in: ceph (Ubuntu Bionic)
   Importance: Medium => High

** Changed in: ceph (Ubuntu Bionic)
   Status: New => Triaged


[Bug 1906496] Re: [SRU] mgr can be very slow in a large ceph cluster

2020-12-10 Thread Mathew Hodson
** Also affects: ceph (Ubuntu Bionic)
   Importance: Undecided
   Status: New

** Changed in: ceph (Ubuntu)
   Importance: Undecided => High

** Changed in: ceph (Ubuntu Bionic)
   Importance: Undecided => Medium

** Also affects: cloud-archive
   Importance: Undecided
   Status: New

** Changed in: ceph (Ubuntu)
   Status: In Progress => Fix Released


[Bug 1906496] Re: [SRU] mgr can be very slow in a large ceph cluster

2020-12-10 Thread Ponnuvel Palaniyappan
** Summary changed:

- mgr can be very slow in a large ceph cluster
+ [SRU] mgr can be very slow in a large ceph cluster

** Description changed:

- upstream implemented a new feature [1] that will check/report those long
- network ping times between osds, but it introduced an issue that ceph-
- mgr might be very slow because it needs to dump all the new osd network
- ping stats [2] for some tasks, this can be bad especially when the
- cluster has large number of osds.
+ [Impact] 
+ Ceph upstream implemented a new feature [1] that will check/report those long 
network ping times between osds, but it introduced an issue that ceph-mgr might 
be very slow because it needs to dump all the new osd network ping stats [2] 
for some tasks, this can be bad especially when the cluster has large number of 
osds.
  
- Since these kind osd network ping stats doesn't need to be exposed to the 
python mgr module.
- so, it only makes the mgr doing more work than it needs to, it could cause 
the mgr slow or even hang and could cause the cpu usage of mgr process 
constantly high. the fix is to disable the ping time dump for those mgr python 
modules.
+ Since these kind osd network ping stats doesn't need to be exposed to
+ the python mgr module. so, it only makes the mgr doing more work than it
+ needs to, it could cause the mgr slow or even hang and could cause the
+ cpu usage of mgr process constantly high. The fix is to disable the ping
+ time dump for those mgr python modules.
+ 
+ This resulted in ceph-mgr not responding to commands and/or hanging (and
+ had to be restarted) in clusters with a large number of OSDs.
+ 
+ [0] is the upstreambug. It was backported to Nautilus but rejected for
+ Luminous and Mimic because they reached EOL in upstream. But I want to
+ backport to these two releases Ubuntu/UCA.
  
  The major fix from upstream is here [3], and also I found an improvement
  commit [4] that submitted later in another PR.
  
- We need to backport them to bionic Luminous and Mimic(Stein), Nautilus
- and Octopus have the fix
+ [Test Case]
+ Deploy a Ceph cluster (Luminous 13.2.9 or Mimic 13.2.9) with large number of 
Ceph OSDs (600+). During normal operations of the cluster, as the ceph-mgr 
dumps the network ping stats regularly, this problem would manifest. This is 
relatively hard to reproduce as the ceph-mgr may not always get overloaded and 
thus not hang.
  
+ [Regression Potential]
+ Fix has been accepted upstream (the changes are here in "sync" with upstream 
to the extent these old releases match the latest source code) and have been 
confirmed to work. So the risk is minimal.
+ 
+ At worst, this could affect modules that consume the stats from ceph-mgr
+ (such as prometheus or other monitoring scripts/tools) and thus becomes
+ less useful. But still shouldn't cause any problems to the operations of
+ the cluster itself.
+ 
+ [Other Info]
+ - In addition to the fix from [1], another commit [4] is also cherry-picked 
and backported here - this was also accepted upstream.
+ 
+ - Since the ceph-mgr hangs when affected, this also impact sosreport
+ collection - commands time out as the mgr doesn't respond and thus info
+ get truncated/not collected in that case. This fix should help avoid
+ that problem in sosreports.
+ 
+ [0] https://tracker.ceph.com/issues/43364
  [1] https://github.com/ceph/ceph/pull/28755
  [2] 
https://github.com/ceph/ceph/pull/28755/files#diff-5498d83111f1210998ee186e98d5836d2bce9992be7648addc83f59e798cddd8L430
  [3] https://github.com/ceph/ceph/pull/32406
  [4] 
https://github.com/ceph/ceph/pull/32554/commits/1112584621016c4a8cac1bedb1a1b8b17c394f7f
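
To make the [Impact] above a little more concrete, here is a hypothetical,
stripped-down mgr module sketch showing the consumer path the dump feeds:
Python mgr modules ask the mgr for OSD stats, and on the affected releases
that payload also carried every OSD's network ping entries. The 'osd_stats'
data name, the shape of the returned dict, and the serve()/shutdown() hooks
are assumptions based on the upstream mgr module interface and may differ
between these releases:

# Hypothetical, minimal ceph-mgr Python module sketch (not part of the fix),
# illustrating the consumer side whose input the backported patch shrinks.
import json
import threading

from mgr_module import MgrModule


class Module(MgrModule):
    def __init__(self, *args, **kwargs):
        super(Module, self).__init__(*args, **kwargs)
        self._event = threading.Event()

    def serve(self):
        while not self._event.is_set():
            # On affected clusters this dump is what grows with the number
            # of OSDs (and their ping-time entries), keeping the mgr busy.
            stats = self.get('osd_stats')
            self.log.info("osd_stats payload: ~%d bytes for %d OSDs",
                          len(json.dumps(stats)),
                          len(stats.get('osd_stats', [])))
            self._event.wait(60)

    def shutdown(self):
        self._event.set()

The backported patch keeps that payload small by leaving the per-OSD ping
times out of what gets handed to such modules.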

-- 
You received this bug notification because you are a member of Ubuntu
Bugs, which is subscribed to Ubuntu.
https://bugs.launchpad.net/bugs/1906496

Title:
  [SRU] mgr can be very slow in a large ceph cluster

To manage notifications about this bug go to:
https://bugs.launchpad.net/ubuntu/+source/ceph/+bug/1906496/+subscriptions

-- 
ubuntu-bugs mailing list
ubuntu-bugs@lists.ubuntu.com
https://lists.ubuntu.com/mailman/listinfo/ubuntu-bugs