Hi all,

I am looking for guidance on a VPP 21.06 issue where the *main thread gets
stuck forever in the binary API / main-thread RPC path* on one VM, while an
equivalent healthy VM is idle in normal epoll wait.
Version / setup

   - VPP: 21.06.0-21~gf3966f893
   - socksvr { default }
   - CLI socket: /run/vpp/cli.sock
   - API socket: /run/vpp/api.sock

Timeline

   - Existing VPP instance was forcefully killed (status=9/KILL) after a
   keepalived restart.
   - systemd restarted VPP successfully.
   - New VPP instance read and executed startup config successfully
   (interfaces, SR policy/steer, and thousands (~8k) of static routes).
   - Later, the VM entered a stuck / high-CPU state.

GDB Backtrace of the Main thread
On the bad VM, live GDB on the main thread shows:
(gdb) thread 1
[Switching to thread 1 (Thread 0x7f43ace357c0 (LWP 2476029))]
#0  0x00007f43aa1cf6bb in sched_yield () from /lib64/libc.so.6
(gdb) bt
#0  0x00007f43aa1cf6bb in sched_yield () from /lib64/libc.so.6
#1  0x00007f43aa885378 in mspace_malloc () from
/lib64/libvppinfra.so.21.06.0
#2  0x00007f43aa886e07 in mspace_get_aligned () from
/lib64/libvppinfra.so.21.06.0
#3  0x00007f43aa843f86 in vec_resize_allocate_memory () from
/lib64/libvppinfra.so.21.06.0
#4  0x00007f43aa858d42 in serialize_vector_write () from
/lib64/libvppinfra.so.21.06.0
#5  0x00007f43aa864628 in serialize_write_not_inline () from
/lib64/libvppinfra.so.21.06.0
#6  0x00007f43aca08d5a in serialize_stream_read_write (flags=2, n_bytes=4,
s=0x7f43aa05fc10, header=0x7f43aa05fbb0)
    at
/usr/src/debug/vpp-21.06.0-21~gf3966f893.x86_64/src/vppinfra/serialize.h:140
#7  serialize_get (n_bytes=4, m=0x7f43aa05fbb0) at
/usr/src/debug/vpp-21.06.0-21~gf3966f893.x86_64/src/vppinfra/serialize.h:180
#8  serialize_integer (n_bytes=4, x=<optimized out>, m=0x7f43aa05fbb0) at
/usr/src/debug/vpp-21.06.0-21~gf3966f893.x86_64/src/vppinfra/serialize.h:187
#9  vl_api_serialize_message_table (am=0x7f43acc17f40 <api_global_main>,
vector=vector@entry=0x0)
    at
/usr/src/debug/vpp-21.06.0-21~gf3966f893.x86_64/src/vlibapi/api_shared.c:213
#10 0x00007f43aca091e5 in vl_msg_api_trace_save (am=am@entry=0x7f43acc17f40
<api_global_main>, which=which@entry=VL_API_TRACE_RX, fp=fp@entry
=0x55733bc948c0)
    at
/usr/src/debug/vpp-21.06.0-21~gf3966f893.x86_64/src/vlibapi/api_shared.c:262
#11 0x00007f43aca0b6a9 in vl_msg_api_post_mortem_dump () at
/usr/src/debug/vpp-21.06.0-21~gf3966f893.x86_64/src/vlibapi/api_shared.c:990
#12 0x0000557339ff00d9 in os_panic () at
/usr/src/debug/vpp-21.06.0-21~gf3966f893.x86_64/src/vpp/vnet/main.c:405
#13 0x00007f43aa8848b5 in mspace_free () from /lib64/libvppinfra.so.21.06.0
#14 0x00007f43ac9ef801 in clib_mem_free (p=0x130179bb8) at
/usr/src/debug/vpp-21.06.0-21~gf3966f893.x86_64/src/vppinfra/mem.h:321
#15 vl_msg_api_free_w_region (vlib_rp=0x130021000, a=0x130179bc8) at
/usr/src/debug/vpp-21.06.0-21~gf3966f893.x86_64/src/vlibmemory/memory_shared.c:306
#16 0x00007f43ac9ef837 in vl_msg_api_free (a=<optimized out>) at
/usr/src/debug/vpp-21.06.0-21~gf3966f893.x86_64/src/vlibmemory/memory_shared.c:314
#17 0x00007f43aca09998 in vl_msg_api_handler_with_vm_node
(am=am@entry=0x7f43acc17f40
<api_global_main>, vlib_rp=0x130021000, the_msg=<optimized out>,
    vm=vm@entry=0x7f436a000680, node=node@entry=0x7f436b3901c0,
is_private=is_private@entry=0 '\000')
    at
/usr/src/debug/vpp-21.06.0-21~gf3966f893.x86_64/src/vlibapi/api_shared.c:665
#18 0x00007f43ac9ee376 in vl_mem_api_handle_rpc (vm=vm@entry=0x7f436a000680,
node=node@entry=0x7f436b3901c0)
    at
/usr/src/debug/vpp-21.06.0-21~gf3966f893.x86_64/src/vlibmemory/memory_api.c:746
#19 0x00007f43aca00623 in vl_api_clnt_process (vm=<optimized out>,
node=0x7f436b3901c0, f=<optimized out>)
    at
/usr/src/debug/vpp-21.06.0-21~gf3966f893.x86_64/src/vlibmemory/vlib_api.c:338
#20 0x00007f43ab326b36 in vlib_process_bootstrap (_a=<optimized out>) at
/usr/src/debug/vpp-21.06.0-21~gf3966f893.x86_64/src/vlib/main.c:1284
#21 0x00007f43aa88b358 in clib_calljmp () at
/usr/src/debug/vpp-21.06.0-21~gf3966f893.x86_64/src/vppinfra/longjmp.S:123
#22 0x00007f435b46adc0 in ?? ()
#23 0x00007f43ab329ee4 in vlib_process_startup (f=0x0, p=0x7f436b3901c0,
vm=0x7f436a000680) at
/usr/src/debug/vpp-21.06.0-21~gf3966f893.x86_64/src/vppinfra/types.h:133
#24 dispatch_process (vm=<optimized out>, p=0x7f436b3901c0,
last_time_stamp=<optimized out>, f=0x0) at
/usr/src/debug/vpp-21.06.0-21~gf3966f893.x86_64/src/vlib/main.c:1365
#25 0x0000000000000000 in ?? ()


Inside vl_mem_api_handle_rpc(), i=1079, I inspected:
vm->processing_rpc_requests[1079]
and decoded it as vl_api_rpc_call_t.

The embedded callback/function pointer resolves to:
ip_neighbor_learn

src/vnet/ip-neighbor/ip_neighbor.c:7

So at least one queued RPC being processed by the stuck main thread is *dynamic
neighbor learning*.

Other observations

   - vppctl over CLI socket still works.
   - Binary API behavior is unhealthy:
      - shared-memory vpp_api_test can hit memclnt_create_reply timeout
      - vpp_api_test socket-name /run/vpp/api.sock can hang
   - vppctl show runtime

Thread 0 vpp_main (lcore 1)



Name                     State         Calls          Vectors
 Suspends         Clocks       Vectors/Call

 api-rx-from-ring     any wait       0               0
29106              1.25e5            0.00

 events pending,

Question

Has anyone seen a *21.06-era* issue where the main thread gets stuck in:
vl_api_clnt_process
-> vl_mem_api_handle_rpc
 -> vl_msg_api_free

with queued RPC work resolving to *ip_neighbor_learn()*?

Thanks.
-=-=-=-=-=-=-=-=-=-=-=-
Links: You receive all messages sent to this group.
View/Reply Online (#27056): https://lists.fd.io/g/vpp-dev/message/27056
Mute This Topic: https://lists.fd.io/mt/119735075/21656
Group Owner: [email protected]
Unsubscribe: https://lists.fd.io/g/vpp-dev/leave/14379924/21656/631435203/xyzzy 
[[email protected]]
-=-=-=-=-=-=-=-=-=-=-=-

Reply via email to