It looks like the root-cause is a corrupted heap. See also 
mspace_free()->check_top_chunk()->do_check_top_chunk(). One of the assertions 
is failing.

 

Once the heap is pickled, all bets are off in terms of getting a useful API 
trace. 

 

Since you have a couple of (possibly) useful post-mortem dumps, the first step 
would be to figure out why printing the api traces causes a NULL pointer 
dereference.

 

It should be comparatively simple to work out the NULL pointer dereference in 
the api trace printer. You can binary search for the offending message, walk 
through vl_msg_print_trace(...) and figure out what’s going on. It’s likely 
that the message in question simply doesn’t have a print function. See 
src/vlibmemory/vlib_api_cli.c:api_trace_command_fn() for details. In 
particular, “first” and “last” describe which messages should be printed (or 
replayed).

 

HTH... Dave

 

From: vpp-dev@lists.fd.io <vpp-dev@lists.fd.io> On Behalf Of Pim van Pelt via 
lists.fd.io
Sent: Sunday, January 8, 2023 1:24 PM
To: vpp-dev <vpp-dev@lists.fd.io>
Subject: [vpp-dev] Possible VPP deadlock

 

Hoi,

 

I've had a few instances of a recent VPP hanging - API and CLI go unresponsive, 
forwarding stops (at least, I think), but the worker threads are still 
consuming CPU.

Attaching GDB, I see the main thread is doing the following:

(gdb) bt

#0  0x00007f5f6f8f271b in sched_yield () at 
../sysdeps/unix/syscall-template.S:78

#1  0x00007f5f6fb3df8b in spin_acquire_lock (sl=<optimized out>) at 
/home/pim/src/vpp/src/vppinfra/dlmalloc.c:468

#2  mspace_malloc (msp=0x130048040, bytes=72) at 
/home/pim/src/vpp/src/vppinfra/dlmalloc.c:4351

#3  0x00007f5f6fb66f81 in mspace_memalign (msp=0x130048040, 
alignment=<optimized out>, bytes=72) at 
/home/pim/src/vpp/src/vppinfra/dlmalloc.c:4667

#4  clib_mem_heap_alloc_inline (heap=<optimized out>, size=72, align=<optimized 
out>, os_out_of_memory_on_failure=1) at 
/home/pim/src/vpp/src/vppinfra/mem_dlmalloc.c:608

#5  clib_mem_heap_alloc_aligned (heap=<optimized out>, size=72, align=8) at 
/home/pim/src/vpp/src/vppinfra/mem_dlmalloc.c:664

#6  0x00007f5f6fba5157 in _vec_alloc_internal (n_elts=64, attr=<optimized out>) 
at /home/pim/src/vpp/src/vppinfra/vec.c:35

#7  0x00007f5f6fb848c8 in _vec_resize (vp=<optimized out>, n_add=64, hdr_sz=0, 
align=8, elt_sz=<optimized out>) at /home/pim/src/vpp/src/vppinfra/vec.h:256

#8  serialize_vector_write (m=<optimized out>, s=0x7f5f0dbfebc0) at 
/home/pim/src/vpp/src/vppinfra/serialize.c:908

#9  0x00007f5f6fb843c1 in serialize_write_not_inline (m=0x7f5f0dbfeb60, 
s=<optimized out>, n_bytes_to_write=4, flags=<optimized out>) at 
/home/pim/src/vpp/src/vppinfra/serialize.c:734

#10 0x00007f5f6fe5a053 in serialize_stream_read_write (header=0x7f5f0dbfeb60, 
s=<optimized out>, n_bytes=4, flags=2) at 
/home/pim/src/vpp/src/vppinfra/serialize.h:140

#11 serialize_get (m=0x7f5f0dbfeb60, n_bytes=4) at 
/home/pim/src/vpp/src/vppinfra/serialize.h:180

#12 serialize_integer (m=0x7f5f0dbfeb60, x=<optimized out>, n_bytes=4) at 
/home/pim/src/vpp/src/vppinfra/serialize.h:187

#13 vl_api_serialize_message_table (am=0x7f5f6fe66258 <api_global_main>, 
vector=<optimized out>) at /home/pim/src/vpp/src/vlibapi/api_shared.c:210

#14 0x00007f5f6fe5a715 in vl_msg_api_trace_save (am=0x130048040, 
which=<optimized out>, fp=0x13f0690, is_json=27 '\033') at 
/home/pim/src/vpp/src/vlibapi/api_shared.c:410

#15 0x00007f5f6fe5c0ea in vl_msg_api_post_mortem_dump () at 
/home/pim/src/vpp/src/vlibapi/api_shared.c:880

#16 0x00000000004068c6 in os_panic () at 
/home/pim/src/vpp/src/vpp/vnet/main.c:415

#17 0x00007f5f6fb3feed in mspace_free (msp=0x130048040, mem=<optimized out>) at 
/home/pim/src/vpp/src/vppinfra/dlmalloc.c:2954

#18 0x00007f5f6fb6bf8c in clib_mem_heap_free (heap=0x0, p=<optimized out>) at 
/home/pim/src/vpp/src/vppinfra/mem_dlmalloc.c:768

#19 clib_mem_free (p=<optimized out>) at 
/home/pim/src/vpp/src/vppinfra/mem_dlmalloc.c:774

#20 0x00007f5f2fa32b40 in ?? ()

#21 0x00007f5f3302f848 in ?? ()

#22 0x0000000000000000 in ?? ()

 

When I kill VPP, sometimes an api_post_mortem is emitted (although most of the 
time they are empty), but subsequently trying to dump it, makes VPP crash -

 

-rw------- 1 ipng ipng 35437 Jan  8 19:08 api_post_mortem.76724

-rw------- 1 ipng ipng 35368 Jan  8 19:08 api_post_mortem.76842

-rw------- 1 ipng ipng     0 Jan  8 19:08 api_post_mortem.76978

-rw------- 1 ipng ipng     0 Jan  8 19:08 api_post_mortem.84008

 

#0  0x0000000000000000 in ?? ()

#1  0x00007ffff7fada5f in vl_msg_print_trace (msg=0x7fff9db73bd8 "", 
ctx=0x7fff53b62ca0) at /home/pim/src/vpp/src/vlibmemory/vlib_api_cli.c:693

#2  0x00007ffff66a55bb in vl_msg_traverse_trace (tp=0x7fff9b4e7998, 
fn=0x7ffff7fad790 <vl_msg_print_trace>, ctx=0x7fff53b62ca0)

    at /home/pim/src/vpp/src/vlibapi/api_shared.c:321

#3  0x00007ffff7fab854 in api_trace_command_fn (vm=0x7fff96000700, 
input=0x7fff53b62f30, cmd=<optimized out>)

    at /home/pim/src/vpp/src/vlibmemory/vlib_api_cli.c:727

#4  0x00007ffff647fdad in vlib_cli_dispatch_sub_commands (vm=0x7fff96000700, 
cm=<optimized out>, input=0x7fff53b62f30, 

    parent_command_index=<optimized out>) at 
/home/pim/src/vpp/src/vlib/cli.c:650

#5  0x00007ffff647fb91 in vlib_cli_dispatch_sub_commands (vm=0x7fff96000700, 
cm=<optimized out>, input=0x7fff53b62f30, 

    parent_command_index=<optimized out>) at 
/home/pim/src/vpp/src/vlib/cli.c:607

#6  0x00007ffff647f0cd in vlib_cli_input (vm=0x7fff96000700, 
input=0x7fff53b62f30, function=<optimized out>, function_arg=<optimized out>)

    at /home/pim/src/vpp/src/vlib/cli.c:753

#7  0x00007ffff64fd5c7 in unix_cli_process_input (cm=<optimized out>, 
cli_file_index=0) at /home/pim/src/vpp/src/vlib/unix/cli.c:2616

#8  unix_cli_process (vm=<optimized out>, rt=0x7fff9b69bdc0, f=<optimized out>) 
at /home/pim/src/vpp/src/vlib/unix/cli.c:2745

#9  0x00007ffff64a7837 in vlib_process_bootstrap (_a=<optimized out>) at 
/home/pim/src/vpp/src/vlib/main.c:1221

#10 0x00007ffff63f9d94 in clib_calljmp () at 
/home/pim/src/vpp/src/vppinfra/longjmp.S:123

#11 0x00007fff94700b00 in ?? ()

#12 0x00007ffff649f3d0 in vlib_process_startup (vm=0x7fff96000700, 
p=0x7fff9b69bdc0, f=0x0) at /home/pim/src/vpp/src/vlib/main.c:1246

#13 dispatch_process (vm=0x7fff96000700, p=0x7fff9b69bdc0, f=0x0, 
last_time_stamp=<optimized out>) at /home/pim/src/vpp/src/vlib/main.c:1302

#14 0x0000000000000000 in ?? ()

 

Has anybody else seen API calls seemingly hang the VPP instance? Is there an 
alternative way to pry loose the information in api_post_mortem.* files ? Or 
any other clues where to narrow down the issue?

It's a rare issue (running a dozen or so instances with 6mo+ of uptime, and one 
of them had this hang/crash a few times in a row this week).

 

-- 

Pim van Pelt <p...@ipng.nl <mailto:p...@ipng.nl> > 
PBVP1-RIPE - http://www.ipng.nl/

-=-=-=-=-=-=-=-=-=-=-=-
Links: You receive all messages sent to this group.
View/Reply Online (#22433): https://lists.fd.io/g/vpp-dev/message/22433
Mute This Topic: https://lists.fd.io/mt/96136395/21656
Group Owner: vpp-dev+ow...@lists.fd.io
Unsubscribe: https://lists.fd.io/g/vpp-dev/leave/1480452/21656/631435203/xyzzy 
[arch...@mail-archive.com]
-=-=-=-=-=-=-=-=-=-=-=-

Reply via email to