coredump #6
The RIP address saved on the stack is actually not an instruction address but a
stack address (corrupted stack).
GDB worked hard to unwind the stack and produced a very long trace that might
not be correct.
Oct 23 16:41:27 HOSTNAME radosgw[3572]: *** Caught signal (Segmentation fault) **
Oct 23 16:41:27 HOSTNAME radosgw[3572]: in thread 7f826a190700 thread_name:radosgw
Oct 23 16:41:27 HOSTNAME radosgw[3572]: ceph version 14.2.11 (f7fdb2f52131f54b891a2ec99d8205561242cdaf) nautilus (stable)
Oct 23 16:41:27 HOSTNAME radosgw[3572]: 1: (()+0x128a0) [0x7f82987d68a0]
Oct 23 16:41:27 HOSTNAME radosgw[3572]: 2: [0x56275c77ac20]
Oct 23 16:41:27 HOSTNAME radosgw[3572]: 2020-10-23 16:41:27.433 7f826a190700 -1 *** Caught signal (Segmentation fault) **
The RIP falls within the stack address range of the previous frame:
(gdb) frame 5
#5 0x00007f829a26e83d in AsyncConnection::send_message
(this=0x56275c71ad00, m=0x56275c76f600) at
./src/msg/async/AsyncConnection.cc:548
(gdb) info reg $rbp
rbp 0x56275c76f600 0x56275c76f600
(gdb) info reg $rsp
rsp 0x56275c782480 0x56275c782480
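One way to double-check this is to ask GDB whether the bogus return address
resolves to any symbol, and which memory mapping it belongs to. An illustrative
session (standard GDB commands; 'info proc mappings' needs a core that carries
the NT_FILE note, and its output is elided here):

(gdb) frame 4
#4  0x000056275c77ac20 in ?? ()
(gdb) info symbol 0x56275c77ac20
No symbol matches 0x56275c77ac20.
(gdb) info proc mappings
[...]

If the address falls in an anonymous (stack) mapping rather than an executable
text mapping, the saved return address was overwritten with a stack pointer.
The full backtrace GDB recovered is below; given the corruption, the unwind
from frame #4 onward may be unreliable: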
#0 raise (sig=sig@entry=11) at ../sysdeps/unix/sysv/linux/raise.c:51
#1 0x0000562758ea16b0 in reraise_fatal (signum=11) at
./src/global/signal_handler.cc:81
#2 handle_fatal_signal (signum=11) at
./src/global/signal_handler.cc:326
#3 <signal handler called>
#4 0x000056275c77ac20 in ?? ()
#5 0x00007f829a26e83d in AsyncConnection::send_message
(this=0x56275c71ad00, m=0x56275c76f600) at
./src/msg/async/AsyncConnection.cc:548
#6 0x00007f82a32ee135 in Objecter::_send_op
(this=this@entry=0x56275bc35080, op=op@entry=0x56275c76f000) at
./src/osdc/Objecter.cc:3274
#7 0x00007f82a32f0433 in Objecter::_op_submit
(this=this@entry=0x56275bc35080, op=op@entry=0x56275c76f000, sul=...,
ptid=ptid@entry=0x56275c7827e8) at ./src/osdc/Objecter.cc:2456
#8 0x00007f82a32fb43d in Objecter::_op_submit_with_budget
(this=this@entry=0x56275bc35080, op=op@entry=0x56275c76f000, sul=...,
ptid=ptid@entry=0x56275c7827e8, ctx_budget=ctx_budget@entry=0x0) at
./src/osdc/Objecter.cc:2284
#9 0x00007f82a32fb680 in Objecter::op_submit (this=0x56275bc35080,
op=0x56275c76f000, ptid=0x56275c7827e8, ctx_budget=0x0) at
./src/osdc/Objecter.cc:2251
#10 0x00007f82a32c2e3a in librados::IoCtxImpl::operate_read
(this=0x56275bd83ee0, oid=..., o=0x56275bdaa180, pbl=pbl@entry=0x0,
flags=flags@entry=0) at ./src/librados/IoCtxImpl.cc:725
#11 0x00007f82a32987ec in librados::v14_2_0::IoCtx::operate
(this=this@entry=0x56275c782c98, oid=..., o=o@entry=0x56275c782b80,
pbl=pbl@entry=0x0) at ./src/librados/librados_cxx.cc:1423
#12 0x0000562759244dbc in rgw_rados_operate (ioctx=..., oid=...,
op=op@entry=0x56275c782b80, pbl=pbl@entry=0x0, y=...) at
./src/rgw/rgw_tools.cc:218
#13 0x00005627592b7fbf in RGWSI_RADOS::Obj::operate
(this=this@entry=0x56275c782c10, op=op@entry=0x56275c782b80, pbl=pbl@entry=0x0,
y=...) at ./src/rgw/services/svc_rados.cc:96
#14 0x0000562758f33a22 in RGWSI_SysObj_Core::read
(this=this@entry=0x56275afe7540, obj_ctx=..., read_state=...,
objv_tracker=objv_tracker@entry=0x56275c783948, obj=...,
bl=bl@entry=0x56275c783460, ofs=0, end=-1, attrs=0x56275c782dc0,
raw_attrs=true, cache_info=0x56275c783680) at
./src/rgw/services/svc_sys_obj_core.cc:222
#15 0x00005627592bbe7d in RGWSI_SysObj_Cache::read
(this=this@entry=0x56275afe7540, obj_ctx=..., read_state=...,
objv_tracker=0x56275c783948, obj=..., obl=obl@entry=0x56275c783460, ofs=0,
end=-1, attrs=0x56275c783b90, raw_attrs=false, cache_info=0x56275c783680,
refresh_version=...) at ./src/rgw/services/svc_sys_obj_cache.cc:147
#16 0x0000562758f2fbcb in RGWSI_SysObj::Obj::ROp::read
(this=this@entry=0x56275c783260, ofs=ofs@entry=0, end=end@entry=-1,
bl=bl@entry=0x56275c783460) at ./src/rgw/services/svc_sys_obj.cc:47
#17 0x00005627592426c3 in RGWSI_SysObj::Obj::ROp::read
(pbl=0x56275c783460, this=0x56275c783260) at ./src/rgw/services/svc_sys_obj.h:99
#18 rgw_get_system_obj (rgwstore=rgwstore@entry=0x56275b073800,
obj_ctx=..., pool=..., key=..., bl=...,
objv_tracker=objv_tracker@entry=0x56275c783948, pmtime=0x56275c783b88,
pattrs=0x56275c783b90, cache_info=0x56275c783680, refresh_version=...) at
./src/rgw/rgw_tools.cc:156
#19 0x0000562759198257 in RGWRados::get_bucket_instance_from_oid
(this=this@entry=0x56275b073800, obj_ctx=..., oid=..., info=...,
pmtime=pmtime@entry=0x56275c783b88, pattrs=pattrs@entry=0x56275c783b90,
cache_info=0x56275c783680, refresh_version=...) at ./src/rgw/rgw_rados.cc:8250
#20 0x000056275919b5ad in RGWRados::_get_bucket_info
(this=0x56275b073800, obj_ctx=..., tenant=..., bucket_name=..., info=...,
pmtime=pmtime@entry=0x0, pattrs=0x56275c785fb8, refresh_version=...) at
./src/rgw/rgw_rados.cc:8405
#21 0x000056275919c03b in RGWRados::get_bucket_info (this=<optimized
out>, obj_ctx=..., tenant=..., bucket_name=..., info=...,
pmtime=pmtime@entry=0x0, pattrs=0x56275c785fb8) at ./src/rgw/rgw_rados.cc:8443
#22 0x000056275914e3a6 in RGWCreateBucket::execute (this=<optimized
out>) at ./src/rgw/rgw_op.cc:3078
#23 0x0000562758ec0290 in rgw_process_authenticated
(handler=handler@entry=0x56275c736d00, op=@0x56275c785090: 0x56275bd71000,
req=req@entry=0x56275c786930, s=s@entry=0x56275c785620,
skip_retarget=skip_retarget@entry=false) at ./src/rgw/rgw_process.cc:161
#24 0x0000562758ec2af8 in process_request (store=0x56275b073800,
rest=0x7ffc1d0cd300, req=0x56275c786930, frontend_prefix=...,
auth_registry=..., client_io=client_io@entry=0x56275c7869c0, olog=0x0,
yield=..., scheduler=0x56275c379a78, http_ret=0x0) at
./src/rgw/rgw_process.cc:278
#25 0x0000562758e18660 in (anonymous
namespace)::handle_connection<boost::asio::basic_stream_socket<boost::asio::ip::tcp>
> (context=..., env=..., stream=..., buffer=..., pause_mutex=...,
scheduler=<optimized out>, ec=..., yield=..., is_ssl=false) at
./src/rgw/rgw_asio_frontend.cc:167
#26 0x0000562758e1986d in (anonymous
namespace)::AsioFrontend::<lambda(boost::asio::yield_context)>::operator()
(yield=..., __closure=0x56275bff8798) at ./src/rgw/rgw_asio_frontend.cc:638
#27
boost::asio::detail::coro_entry_point<boost::asio::executor_binder<void (*)(),
boost::asio::strand<boost::asio::io_context::executor_type> >, (anonymous
namespace)::AsioFrontend::accept((anonymous
namespace)::AsioFrontend::Listener&,
boost::system::error_code)::<lambda(boost::asio::yield_context)> >::operator()
(ca=..., this=<optimized out>) at
./obj-x86_64-linux-gnu/boost/include/boost/asio/impl/spawn.hpp:337
#28
boost::coroutines::detail::push_coroutine_object<boost::coroutines::pull_coroutine<void>,
void, boost::asio::detail::coro_entry_point<boost::asio::executor_binder<void
(*)(), boost::asio::strand<boost::asio::io_context::executor_type> >,
(anonymous namespace)::AsioFrontend::accept((anonymous
namespace)::AsioFrontend::Listener&,
boost::system::error_code)::<lambda(boost::asio::yield_context)> >&,
boost::coroutines::basic_standard_stack_allocator<boost::coroutines::stack_traits>
>::run (this=0x56275c787f60) at
./obj-x86_64-linux-gnu/boost/include/boost/coroutine/detail/push_coroutine_object.hpp:302
#29
boost::coroutines::detail::trampoline_push_void<boost::coroutines::detail::push_coroutine_object<boost::coroutines::pull_coroutine<void>,
void, boost::asio::detail::coro_entry_point<boost::asio::executor_binder<void
(*)(), boost::asio::strand<boost::asio::io_context::executor_type> >,
(anonymous namespace)::AsioFrontend::accept((anonymous
namespace)::AsioFrontend::Listener&,
boost::system::error_code)::<lambda(boost::asio::yield_context)> >&,
boost::coroutines::basic_standard_stack_allocator<boost::coroutines::stack_traits>
> >(boost::context::detail::transfer_t) (t=...) at
./obj-x86_64-linux-gnu/boost/include/boost/coroutine/detail/trampoline_push.hpp:70
#30 0x0000562759355d5f in make_fcontext ()
#31 0x000056275979acd0 in vtable for
boost::coroutines::detail::push_coroutine_object<boost::coroutines::pull_coroutine<void>,
void, boost::asio::detail::coro_entry_point<boost::asio::executor_binder<void
(*)(), boost::asio::strand<boost::asio::io_context::executor_type> >,
(anonymous namespace)::AsioFrontend::accept((anonymous
namespace)::AsioFrontend::Listener&,
boost::system::error_code)::{lambda(boost::asio::basic_yield_context<boost::asio::executor_binder<void
(*)(), boost::asio::executor> >)#4}>&,
boost::coroutines::basic_standard_stack_allocator<boost::coroutines::stack_traits>
> ()
#32 0x0000000000000026 in ?? ()
#33 0x0000000000000000 in ?? ()
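Frames #27-#30 (boost::asio::detail::coro_entry_point, the boost::coroutines
push_coroutine machinery, and make_fcontext) show the request was running on a
Boost coroutine stack, which is consistent with the coroutine stack corruption
described below. To relate the corrupt addresses to that stack, the frame
bounds can be inspected, e.g. (illustrative; output depends on GDB version and
core contents):

(gdb) frame 27
(gdb) info frame
(gdb) info proc mappings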
** Description changed:
[Impact]
The radosgw beast frontend in ceph Nautilus might hit coroutine stack
corruption on startup and while handling requests.
This is usually observed right at the startup of the ceph-radosgw systemd
unit, sometimes a minute later, but it might occur at any time while handling
requests, depending on the coroutine/request's function path and stack size.
The symptoms are usually a crash whose stack trace lists TCMalloc
(de)allocate/release-to-central-cache calls; less common signs are huge
allocation attempts in the _terabytes_ range (a stack pointer used as an
allocation size) and stack traces showing function return addresses (RIP) that
are actually pointers to a stack address.
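For triage, these symptoms usually show up in the unit's journal; an
illustrative check (the unit name varies per deployment):

$ journalctl -u 'ceph-radosgw@*' | grep -E 'Caught signal|large alloc'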
+
+ This is not widely hit in Ubuntu as most deployments use the ceph-radosgw charm that hardcodes 'civetweb'
+ as rgw frontend, which is _not_ affected; custom/cephadm deployments that choose 'beast' might hit this.
+
+ @ charm-ceph-radosgw/templates/ceph.conf
+ rgw frontends = civetweb port={{ port }}
Let's report this LP bug for documentation and tracking purposes until the
Ubuntu Cloud Archive (UCA) gets the fixes.
[Fix]
This was reported by an Ubuntu Advantage user, and by another user in ceph
tracker #47910 [1].
The issue had previously been reported and fixed in Octopus [2] (confirmed by
the UA user, who is no longer affected).
The Nautilus backport has recently been merged [3, 4] and should be available
in v14.2.19.
[Test Case]
The conditions that trigger the bug aren't clear, but they are apparently
related to EC pools with very large buckets, and of course require the beast
radosgw frontend to be enabled (civetweb is not affected); a quick exposure
check is sketched below.
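A quick way to check whether a deployment is exposed is to look at the
configured rgw frontend (paths and output are illustrative):

$ grep -ri 'rgw frontends' /etc/ceph/
/etc/ceph/ceph.conf:rgw frontends = beast port=80

Deployments showing 'beast' here are potentially affected; 'civetweb' is not.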
- This is not widely hit in Ubuntu as most deployments use the ceph-radosgw charm that hardcodes 'civetweb'
- as rgw frontend, which is _not_ affected; custom/cephadm deployments that choose 'beast' might hit this.
-
- @ charm-ceph-radosgw/templates/ceph.conf
- rgw frontends = civetweb port={{ port }}
-
[Where problems could occur]
The fixes are restricted to the beast frontend, specifically to the coroutines
used to handle requests, so problems would probably be seen only in request
handling with the beast frontend.
Workarounds thus include switching back to the civetweb frontend.
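A minimal sketch of that workaround, assuming a typical ceph.conf layout
(section name and port are illustrative):

@ /etc/ceph/ceph.conf
[client.rgw.<id>]
rgw frontends = civetweb port=7480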
The changes touch core/base parts of the RGW beast frontend code, but they
have been in place since the Octopus release.
The other user/reporter in the ceph tracker has been running the patches for
weeks with no regressions; the ceph tests have passed, and serious issues
would likely be caught by upstream ceph CI.
-
[1] https://tracker.ceph.com/issues/47910 report tracker (nautilus)
[2] https://tracker.ceph.com/issues/43739 master tracker (octopus)
[3] https://tracker.ceph.com/issues/43921 backport tracker (nautilus)
[4] https://github.com/ceph/ceph/pull/39947 github PR
--
https://bugs.launchpad.net/bugs/1921749
Title:
nautilus: ceph radosgw beast frontend coroutine stack corruption