Hi all,
I am looking for guidance on a VPP 21.06 issue where the *main thread gets
stuck forever in the binary API / main-thread RPC path* on one VM, while an
equivalent healthy VM is idle in normal epoll wait.
Version / setup
- VPP: 21.06.0-21~gf3966f893
- socksvr { default }
- CLI socket: /run/vpp/cli.sock
- API socket: /run/vpp/api.sock
Timeline
- Existing VPP instance was forcefully killed (status=9/KILL) after a
keepalived restart.
- systemd restarted VPP successfully.
- New VPP instance read and executed startup config successfully
(interfaces, SR policy/steer, and thousands (~8k) of static routes).
- Later, the VM entered a stuck / high-CPU state.
GDB Backtrace of the Main thread
On the bad VM, live GDB on the main thread shows:
(gdb) thread 1
[Switching to thread 1 (Thread 0x7f43ace357c0 (LWP 2476029))]
#0 0x00007f43aa1cf6bb in sched_yield () from /lib64/libc.so.6
(gdb) bt
#0 0x00007f43aa1cf6bb in sched_yield () from /lib64/libc.so.6
#1 0x00007f43aa885378 in mspace_malloc () from
/lib64/libvppinfra.so.21.06.0
#2 0x00007f43aa886e07 in mspace_get_aligned () from
/lib64/libvppinfra.so.21.06.0
#3 0x00007f43aa843f86 in vec_resize_allocate_memory () from
/lib64/libvppinfra.so.21.06.0
#4 0x00007f43aa858d42 in serialize_vector_write () from
/lib64/libvppinfra.so.21.06.0
#5 0x00007f43aa864628 in serialize_write_not_inline () from
/lib64/libvppinfra.so.21.06.0
#6 0x00007f43aca08d5a in serialize_stream_read_write (flags=2, n_bytes=4,
s=0x7f43aa05fc10, header=0x7f43aa05fbb0)
at
/usr/src/debug/vpp-21.06.0-21~gf3966f893.x86_64/src/vppinfra/serialize.h:140
#7 serialize_get (n_bytes=4, m=0x7f43aa05fbb0) at
/usr/src/debug/vpp-21.06.0-21~gf3966f893.x86_64/src/vppinfra/serialize.h:180
#8 serialize_integer (n_bytes=4, x=<optimized out>, m=0x7f43aa05fbb0) at
/usr/src/debug/vpp-21.06.0-21~gf3966f893.x86_64/src/vppinfra/serialize.h:187
#9 vl_api_serialize_message_table (am=0x7f43acc17f40 <api_global_main>,
vector=vector@entry=0x0)
at
/usr/src/debug/vpp-21.06.0-21~gf3966f893.x86_64/src/vlibapi/api_shared.c:213
#10 0x00007f43aca091e5 in vl_msg_api_trace_save (am=am@entry=0x7f43acc17f40
<api_global_main>, which=which@entry=VL_API_TRACE_RX, fp=fp@entry
=0x55733bc948c0)
at
/usr/src/debug/vpp-21.06.0-21~gf3966f893.x86_64/src/vlibapi/api_shared.c:262
#11 0x00007f43aca0b6a9 in vl_msg_api_post_mortem_dump () at
/usr/src/debug/vpp-21.06.0-21~gf3966f893.x86_64/src/vlibapi/api_shared.c:990
#12 0x0000557339ff00d9 in os_panic () at
/usr/src/debug/vpp-21.06.0-21~gf3966f893.x86_64/src/vpp/vnet/main.c:405
#13 0x00007f43aa8848b5 in mspace_free () from /lib64/libvppinfra.so.21.06.0
#14 0x00007f43ac9ef801 in clib_mem_free (p=0x130179bb8) at
/usr/src/debug/vpp-21.06.0-21~gf3966f893.x86_64/src/vppinfra/mem.h:321
#15 vl_msg_api_free_w_region (vlib_rp=0x130021000, a=0x130179bc8) at
/usr/src/debug/vpp-21.06.0-21~gf3966f893.x86_64/src/vlibmemory/memory_shared.c:306
#16 0x00007f43ac9ef837 in vl_msg_api_free (a=<optimized out>) at
/usr/src/debug/vpp-21.06.0-21~gf3966f893.x86_64/src/vlibmemory/memory_shared.c:314
#17 0x00007f43aca09998 in vl_msg_api_handler_with_vm_node
(am=am@entry=0x7f43acc17f40
<api_global_main>, vlib_rp=0x130021000, the_msg=<optimized out>,
vm=vm@entry=0x7f436a000680, node=node@entry=0x7f436b3901c0,
is_private=is_private@entry=0 '\000')
at
/usr/src/debug/vpp-21.06.0-21~gf3966f893.x86_64/src/vlibapi/api_shared.c:665
#18 0x00007f43ac9ee376 in vl_mem_api_handle_rpc (vm=vm@entry=0x7f436a000680,
node=node@entry=0x7f436b3901c0)
at
/usr/src/debug/vpp-21.06.0-21~gf3966f893.x86_64/src/vlibmemory/memory_api.c:746
#19 0x00007f43aca00623 in vl_api_clnt_process (vm=<optimized out>,
node=0x7f436b3901c0, f=<optimized out>)
at
/usr/src/debug/vpp-21.06.0-21~gf3966f893.x86_64/src/vlibmemory/vlib_api.c:338
#20 0x00007f43ab326b36 in vlib_process_bootstrap (_a=<optimized out>) at
/usr/src/debug/vpp-21.06.0-21~gf3966f893.x86_64/src/vlib/main.c:1284
#21 0x00007f43aa88b358 in clib_calljmp () at
/usr/src/debug/vpp-21.06.0-21~gf3966f893.x86_64/src/vppinfra/longjmp.S:123
#22 0x00007f435b46adc0 in ?? ()
#23 0x00007f43ab329ee4 in vlib_process_startup (f=0x0, p=0x7f436b3901c0,
vm=0x7f436a000680) at
/usr/src/debug/vpp-21.06.0-21~gf3966f893.x86_64/src/vppinfra/types.h:133
#24 dispatch_process (vm=<optimized out>, p=0x7f436b3901c0,
last_time_stamp=<optimized out>, f=0x0) at
/usr/src/debug/vpp-21.06.0-21~gf3966f893.x86_64/src/vlib/main.c:1365
#25 0x0000000000000000 in ?? ()
Inside vl_mem_api_handle_rpc(), i=1079, I inspected:
vm->processing_rpc_requests[1079]
and decoded it as vl_api_rpc_call_t.
The embedded callback/function pointer resolves to:
ip_neighbor_learn
src/vnet/ip-neighbor/ip_neighbor.c:7
So at least one queued RPC being processed by the stuck main thread is *dynamic
neighbor learning*.
Other observations
- vppctl over CLI socket still works.
- Binary API behavior is unhealthy:
- shared-memory vpp_api_test can hit memclnt_create_reply timeout
- vpp_api_test socket-name /run/vpp/api.sock can hang
- vppctl show runtime
Thread 0 vpp_main (lcore 1)
Name State Calls Vectors
Suspends Clocks Vectors/Call
api-rx-from-ring any wait 0 0
29106 1.25e5 0.00
events pending,
Question
Has anyone seen a *21.06-era* issue where the main thread gets stuck in:
vl_api_clnt_process
-> vl_mem_api_handle_rpc
-> vl_msg_api_free
with queued RPC work resolving to *ip_neighbor_learn()*?
Thanks.
-=-=-=-=-=-=-=-=-=-=-=-
Links: You receive all messages sent to this group.
View/Reply Online (#27056): https://lists.fd.io/g/vpp-dev/message/27056
Mute This Topic: https://lists.fd.io/mt/119735075/21656
Group Owner: [email protected]
Unsubscribe: https://lists.fd.io/g/vpp-dev/leave/14379924/21656/631435203/xyzzy
[[email protected]]
-=-=-=-=-=-=-=-=-=-=-=-