Agreed that upgrading to a newer code base is prudent, and yes, we are in the 
midst of doing so. However, I did not see any obvious changes in this area, 
so I am somewhat pessimistic that the upgrade alone will be the fix. Perhaps I 
missed a subtle improvement in this area that folks could point me at to ease 
my concerns?

Regarding the second paragraph's comments/questions: yes, it is a single 
worker, and I too would like to know why the main thread did not just move on 
instead of throwing the os_panic().

-Mike

From: v...@barachs.net <v...@barachs.net>
Sent: Thursday, June 24, 2021 5:46 AM
To: Bly, Mike <m...@ciena.com>; vpp-dev@lists.fd.io
Subject: [**EXTERNAL**] RE: [vpp-dev] #vpp #vnet os_panic for failed barrier 
timeout

Given the reported MTBF of 9 months and nearly 2-year-old software, switching 
to 21.01 [and then to 21.06 when released] seems like the only sensible next 
step.

From the gdb info provided, it looks like there is one worker thread. Is that 
correct? If so, the "workers_at_barrier" count seems correct, so why wouldn't 
the main thread have moved on instead of spinning, waiting for something which 
already happened?

D.


From: vpp-dev@lists.fd.io <vpp-dev@lists.fd.io> On Behalf Of Bly, Mike via 
lists.fd.io
Sent: Wednesday, June 23, 2021 10:59 AM
To: vpp-dev@lists.fd.io
Subject: [vpp-dev] #vpp #vnet os_panic for failed barrier timeout

We are looking for advice on whether anyone is looking at this os_panic() for 
a barrier timeout. We see many instances of this type of main-thread 
back-trace in the forum. For this incident: referencing the sw_interface_dump 
API, we created a lighter oper-get call to simply fetch link state vs. all of 
the extensive information the dump command fetches for each interface. At the 
time we added our new oper-get function, we overlooked the "is_mp_safe" 
enablement that sw_interface_dump has, and as such did NOT set it for our new 
oper-get. The end result is a fairly light API that requires barrier support. 
When this issue occurred, the configuration was using a single separate worker 
thread, so the API is waiting for a barrier count of 1. Interestingly, the BT 
analysis shows the count value was met, which implies some deeper issue. Why 
did a single worker, with at most 10s of packets per second of workload at the 
time, fail to stall at a barrier within the allotted one-second timeout? And, 
even more fun to answer: why did we even reach the os_panic call, when the BT 
shows the worker was stalled at the barrier? Please refer to the GDB analysis 
at the bottom of this email.
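
For context, here is the main-thread side of the barrier logic as we read it 
in 19.08's src/vlib/threads.c; this is a trimmed paraphrase, not a verbatim 
copy. The main thread raises wait_at_barrier, then spins until 
workers_at_barrier reaches the worker count, and calls os_panic() if that 
takes longer than BARRIER_SYNC_TIMEOUT (1.0 second). Note that the now and 
vm->barrier_epoch values in the gdb output below differ by almost exactly 
1.0 s, which matches that timeout:

#define BARRIER_SYNC_TIMEOUT (1.0)	/* seconds */

void
vlib_worker_thread_barrier_sync_int (vlib_main_t * vm, const char *func_name)
{
  f64 now, deadline;
  u32 count = vec_len (vlib_mains) - 1;	/* number of workers; 1 in our case */

  /* ... recursion and barrier hold-down handling elided ... */

  now = vlib_time_now (vm);
  vm->barrier_epoch = now;	/* the 261984.756... value in the gdb output */
  deadline = now + BARRIER_SYNC_TIMEOUT;

  *vlib_worker_threads->wait_at_barrier = 1;
  while (*vlib_worker_threads->workers_at_barrier != count)
    {
      if ((now = vlib_time_now (vm)) > deadline)
	{
	  fformat (stderr, "%s: worker thread deadlock\n", __FUNCTION__);
	  os_panic ();	/* threads.c:1476, frame #8 of the Thread 1 BT below */
	}
    }
  /* ... barrier released once the handler's work is done ... */
}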

This code is based on 19.08. We are in the process of upgrading to 21.01, but 
in reviewing the forum posts, this type of BT is seen across many versions. 
This is an extremely rare event: we had one occurrence in September of last 
year that we could not reproduce, and then just had a second occurrence this 
week. As such, we are not able to reproduce this on demand, let alone in stock 
VPP code, given this is a new API.

While we could simply enable is_mp_safe, as is done for sw_interface_dump, to 
avoid the issue, we are troubled at not being able to explain why the os_panic 
occurred in the first place. As such, we are hoping someone might be able to 
provide guidance here on next steps. What additional details from the 
core file can we provide?
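
For reference, the workaround we are describing is a one-liner in the API 
hookup, following the pattern 19.08 uses for sw_interface_dump in 
src/vnet/interface_api.c. VL_API_OPER_GET below is a placeholder for our 
private message id, not a stock VPP symbol:

static clib_error_t *
interface_api_hookup (vlib_main_t * vm)
{
  api_main_t *am = &api_main;

  /* ... message handler registration elided ... */

  /* Stock enablement we modeled the new call on: */
  am->is_mp_safe[VL_API_SW_INTERFACE_DUMP] = 1;

  /* The line we overlooked for our new call (placeholder id): */
  am->is_mp_safe[VL_API_OPER_GET] = 1;

  return 0;
}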


Thread 1 backtrace

#0 __GI_raise (sig=sig@entry=6) at 
/usr/src/debug/glibc/2.30-r0/git/sysdeps/unix/sysv/linux/raise.c:50
#1 0x0000003cb8425548 in __GI_abort () at 
/usr/src/debug/glibc/2.30-r0/git/stdlib/abort.c:79
#2 0x00000000004075da in os_exit () at 
/usr/src/debug/vpp/19.08+gitAUTOINC+6641eb3e8f-r0/git/src/vpp/vnet/main.c:379
#3 0x00007ff1f5740794 in unix_signal_handler (signum=<optimized out>, 
si=<optimized out>, uc=<optimized out>)
at 
/usr/src/debug/vpp/19.08+gitAUTOINC+6641eb3e8f-r0/git/src/vlib/unix/main.c:183
#4 <signal handler called>
#5 __GI_raise (sig=sig@entry=6) at 
/usr/src/debug/glibc/2.30-r0/git/sysdeps/unix/sysv/linux/raise.c:50
#6 0x0000003cb8425548 in __GI_abort () at 
/usr/src/debug/glibc/2.30-r0/git/stdlib/abort.c:79
#7 0x0000000000407583 in os_panic () at 
/usr/src/debug/vpp/19.08+gitAUTOINC+6641eb3e8f-r0/git/src/vpp/vnet/main.c:355
#8 0x00007ff1f5728643 in vlib_worker_thread_barrier_sync_int (vm=0x7ff1f575ba40 
<vlib_global_main>, func_name=<optimized out>)
at /usr/src/debug/vpp/19.08+gitAUTOINC+6641eb3e8f-r0/git/src/vlib/threads.c:1476
#9 0x00007ff1f62c6d56 in vl_msg_api_handler_with_vm_node 
(am=am@entry=0x7ff1f62d8d40 <api_main>, the_msg=0x1300ba738,
vm=vm@entry=0x7ff1f575ba40 <vlib_global_main>, node=node@entry=0x7ff1b588c000)
at 
/usr/src/debug/vpp/19.08+gitAUTOINC+6641eb3e8f-r0/git/src/vlibapi/api_shared.c:583
#10 0x00007ff1f62b1237 in void_mem_api_handle_msg_i (am=<optimized out>, 
q=<optimized out>, node=0x7ff1b588c000,
vm=0x7ff1f575ba40 <vlib_global_main>) at 
/usr/src/debug/vpp/19.08+gitAUTOINC+6641eb3e8f-r0/git/src/vlibmemory/memory_api.c:712
#11 vl_mem_api_handle_msg_main (vm=vm@entry=0x7ff1f575ba40 <vlib_global_main>, 
node=node@entry=0x7ff1b588c000)
at 
/usr/src/debug/vpp/19.08+gitAUTOINC+6641eb3e8f-r0/git/src/vlibmemory/memory_api.c:722
#12 0x00007ff1f62be713 in vl_api_clnt_process (f=<optimized out>, 
node=<optimized out>, vm=<optimized out>)
at 
/usr/src/debug/vpp/19.08+gitAUTOINC+6641eb3e8f-r0/git/src/vlibmemory/vlib_api.c:326
#13 vl_api_clnt_process (vm=<optimized out>, node=<optimized out>, f=<optimized 
out>)
at 
/usr/src/debug/vpp/19.08+gitAUTOINC+6641eb3e8f-r0/git/src/vlibmemory/vlib_api.c:252
#14 0x00007ff1f56f90b7 in vlib_process_bootstrap (_a=<optimized out>)
at /usr/src/debug/vpp/19.08+gitAUTOINC+6641eb3e8f-r0/git/src/vlib/main.c:1468
#15 0x00007ff1f561f220 in clib_calljmp () at 
/usr/src/debug/vpp/19.08+gitAUTOINC+6641eb3e8f-r0/git/src/vppinfra/longjmp.S:123
#16 0x00007ff1b5e39db0 in ?? ()
#17 0x00007ff1f56fc669 in vlib_process_startup (f=0x0, p=0x7ff1b588c000, 
vm=0x7ff1f575ba40 <vlib_global_main>)
at 
/usr/src/debug/vpp/19.08+gitAUTOINC+6641eb3e8f-r0/git/src/vppinfra/types.h:133

Thread 3 backtrace

(gdb) thr 3
[Switching to thread 3 (LWP 440)]
#0 vlib_worker_thread_barrier_check () at 
/usr/src/debug/vpp/19.08+gitAUTOINC+6641eb3e8f-r0/git/src/vlib/threads.h:426
426 ;
(gdb) bt
#0 vlib_worker_thread_barrier_check () at 
/usr/src/debug/vpp/19.08+gitAUTOINC+6641eb3e8f-r0/git/src/vlib/threads.h:426
#1 vlib_main_or_worker_loop (is_main=0, vm=0x7ff1b6a5e0c0) at 
/usr/src/debug/vpp/19.08+gitAUTOINC+6641eb3e8f-r0/git/src/vlib/main.c:1744
#2 vlib_worker_loop (vm=0x7ff1b6a5e0c0) at 
/usr/src/debug/vpp/19.08+gitAUTOINC+6641eb3e8f-r0/git/src/vlib/main.c:1934
#3 0x00007ff1f561f220 in clib_calljmp () at 
/usr/src/debug/vpp/19.08+gitAUTOINC+6641eb3e8f-r0/git/src/vppinfra/longjmp.S:123
#4 0x00007ff1a72abd70 in ?? ()
#5 0x00007ff1b41ba5d6 in eal_thread_loop (arg=<optimized out>)
at 
/usr/src/debug/dpdk/19.11+gitAUTOINC+f3773c9b7b-r0/git/lib/librte_eal/linux/eal/eal_thread.c:153
#6 0x0000000000000000 in ?? ()

BT information regarding OS_PANIC by main thread (#1):
#2  0x00007ff1f5728643 in vlib_worker_thread_barrier_sync_int 
(vm=0x7ff1f575ba40 <vlib_global_main>, func_name=<optimized out>)
    at 
/usr/src/debug/vpp/19.08+gitAUTOINC+6641eb3e8f-r0/git/src/vlib/threads.c:1476
1476              os_panic ();
(gdb) thread
[Current thread is 1 (LWP 438)]
(gdb) p now
$4 = 261985.75692258833
(gdb) p deadline
$5 = <optimized out>
(gdb) p vm->barrier_epoch
$8 = 261984.75683731204
(gdb) p count
$9 = 1
(gdb) p *vlib_worker_threads->workers_at_barrier
$10 = 1
(gdb) p *vlib_worker_threads->wait_at_barrier
$11 = 1
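
For completeness, here is the worker-side check that Thread 3 is parked in 
(src/vlib/threads.h, again a trimmed paraphrase, not verbatim): the worker 
atomically bumps workers_at_barrier, then spins on wait_at_barrier; the bare 
';' at threads.h:426 in the backtrace is that empty spin loop. One possible, 
unconfirmed reading of the counters above is that the worker only reached the 
increment after the main thread's deadline check had already fired, so the 
count looks satisfied by the time the core was written:

static inline void
vlib_worker_thread_barrier_check (void)
{
  if (PREDICT_FALSE (*vlib_worker_threads->wait_at_barrier))
    {
      /* Announce arrival; this is what made *workers_at_barrier == 1 */
      clib_atomic_fetch_add (vlib_worker_threads->workers_at_barrier, 1);

      /* Spin until the main thread reopens the barrier; the lone ';'
         here is threads.h:426, frame #0 of the Thread 3 backtrace */
      while (*vlib_worker_threads->wait_at_barrier)
	;

      clib_atomic_fetch_add (vlib_worker_threads->workers_at_barrier, -1);
    }
}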

-Mike