Public bug reported:

[Impact]

The radosgw beast frontend in ceph nautilus may hit coroutine stack
corruption at startup or while handling requests.

This is usually observed right at the startup of the ceph-radosgw systemd unit,
sometimes a minute later.
But it can occur at any time while handling requests, depending on the
coroutine/request's function path and stack size.

The symptoms are usually a crash whose stack trace lists TCMalloc
(de)allocations/releases to the central cache;
rarer signs are huge allocations in the _terabytes_ range (a pointer to the
stack used as the allocation size)
and stack traces whose function return addresses (RIP) are actually pointers
to stack addresses.
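
As an illustration only (the log line below is made up to model the symptom;
on a real node one would filter the radosgw journal, e.g.
`journalctl -u ceph-radosgw@rgw.$(hostname).service`), the terabyte-range
allocation symptom can be spotted like this:

```shell
# Illustrative TCMalloc log line, modeled on the symptom described above:
# a pointer to the stack misused as an allocation size (terabytes).
log='tcmalloc: large alloc 94896481206272 bytes == (nil) @'

# Allocation sizes of 13+ digits (roughly terabytes and up) are a red flag
# for the beast coroutine stack corruption.
echo "$log" | grep -Eo 'large alloc [0-9]{13,} bytes'
```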

This is not widely hit in Ubuntu, as most deployments use the ceph-radosgw
charm, which hardcodes 'civetweb'
as the rgw frontend and is _not_ affected; custom/cephadm deployments that
choose 'beast' might hit this.

  @ charm-ceph-radosgw/templates/ceph.conf
        rgw frontends = civetweb port={{ port }}

Let's report this LP bug for documentation and tracking purposes until
the Ubuntu Cloud Archive (UCA) gets the fixes.

[Fix]

This has been reported by an Ubuntu Advantage user, and by another user in
ceph tracker #47910 [1].
It had already been reported and fixed in Octopus [2] (confirmed by the UA
user, who is no longer affected).

The Nautilus backport has recently been merged [3, 4] and should be
available in v14.2.19.
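
As a quick sketch (the installed version string below is an example; on a
real node it would come from `ceph version` or `radosgw --version`), one can
check whether an installed nautilus build predates the v14.2.19 fix with a
`sort -V` version comparison:

```shell
# Example installed version; replace with the real one from `radosgw --version`.
installed="14.2.16"
fixed="14.2.19"

# sort -V orders version strings numerically; if the installed version sorts
# first and differs from the fixed one, it is older and still affected.
if [ "$(printf '%s\n' "$fixed" "$installed" | sort -V | head -n1)" = "$installed" ] \
   && [ "$installed" != "$fixed" ]; then
  echo "affected (fix not yet included)"
else
  echo "has the fix"
fi
```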

[Test Case]

The conditions that trigger the bug aren't fully clear, but they appear
related to EC pools with very large buckets,
and of course require the beast rgw frontend to be enabled (civetweb is not
affected).
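
A minimal ceph.conf sketch of the affected configuration (the section name
and port are illustrative; the EC pool / large-bucket conditions come from
the cluster's data layout, not from this file):

```ini
[client.rgw.myhost]
# beast is the affected frontend; civetweb is not
rgw frontends = beast port=8080
```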

[Where problems could occur]

The fixes are restricted to the beast frontend, specifically to the coroutines
used to handle requests.
So problems would likely be seen only in request handling with the beast
frontend.
Workarounds thus include switching back to the civetweb frontend.
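
As a sketch of that workaround (section name and port are illustrative,
mirroring the charm template quoted in [Impact]):

```ini
[client.rgw.myhost]
# switch from beast back to civetweb until the fix lands
rgw frontends = civetweb port=80
```

followed by a restart of the ceph-radosgw systemd unit.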

The patches change core/base parts of the RGW beast frontend code, but they
have been in place since the Octopus release.
The other reporter in the ceph tracker has been using the patches for weeks
with no regressions;
the ceph tests have passed, and likely serious issues would be caught by
upstream ceph CI.

[1] https://tracker.ceph.com/issues/47910 report tracker (nautilus)
[2] https://tracker.ceph.com/issues/43739 master tracker (octopus)
[3] https://tracker.ceph.com/issues/43921 backport tracker (nautilus)
[4] https://github.com/ceph/ceph/pull/39947 github PR

** Affects: cloud-archive
     Importance: Undecided
         Status: Fix Released

** Affects: cloud-archive/train
     Importance: Undecided
         Status: Confirmed

** Affects: ceph (Ubuntu)
     Importance: Undecided
         Status: Fix Released

** Changed in: ceph (Ubuntu)
       Status: New => Fix Released

** Also affects: cloud-archive
   Importance: Undecided
       Status: New

** Also affects: cloud-archive/train
   Importance: Undecided
       Status: New

** Changed in: cloud-archive
       Status: New => Fix Released

** Changed in: cloud-archive/train
       Status: New => Confirmed

** Summary changed:

- nautilus: ceph radosgw beast frontend might hit coroutine stack corruption
+ nautilus: ceph radosgw beast frontend coroutine stack corruption

https://bugs.launchpad.net/bugs/1921749