So I think the condition mentioned above is a rare but valid case: a scheduled ctrl process node adds a packet (pending frame) to a node, and that packet refers to an interface that is about to be deleted. The interface is then deleted in the unix_epoll_input PRE_INPUT node, which handles API input, and the next round of graph scheduling triggers various assert failures.
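For context, this is roughly how such a pending frame gets created in the first place. A minimal sketch of a process node handing one packet straight to ip4-lookup (the helper and variable names are hypothetical; the frame calls are the standard vlib API):

    #include <vlib/vlib.h>
    #include <vnet/vnet.h>

    /* Hypothetical process-node helper: enqueue one packet directly to
     * ip4-lookup.  The frame lands on nm->pending_frames and is only
     * dispatched later, which opens the window described above. */
    static void
    enqueue_pkt_to_ip4_lookup (vlib_main_t * vm, u32 buffer_index,
                               u32 rx_sw_if_index)
    {
      u32 node_index = vlib_get_node_by_name (vm, (u8 *) "ip4-lookup")->index;
      vlib_frame_t *f = vlib_get_frame_to_node (vm, node_index);
      u32 *to_next = vlib_frame_vector_args (f);
      vlib_buffer_t *b = vlib_get_buffer (vm, buffer_index);

      /* ip4-lookup resolves the FIB from the RX interface; if that
       * interface is deleted before the pending frame is dispatched,
       * the lookup operates on stale state. */
      vnet_buffer (b)->sw_if_index[VLIB_RX] = rx_sw_if_index;

      to_next[0] = buffer_index;
      f->n_vectors = 1;
      vlib_put_frame_to_node (vm, node_index, f);
    }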
Adding the following code right after process dispatch in the main loop fixes the crash:

    {
      /* Ctrl nodes may have added work to the pending vector too.
         Process the pending vector until there is nothing left.
         All pending vectors will be processed from input -> output. */
      for (i = 0; i < _vec_len (nm->pending_frames); i++)
        cpu_time_now = dispatch_pending_node (vm, i, cpu_time_now);
      /* Reset pending vector for next iteration. */
      vec_set_len (nm->pending_frames, 0);

      if (is_main)
        {
          /* We also need a barrier here so that workers pick up any
             packets handed off to them. */
          vlib_worker_thread_barrier_sync (vm);
          vlib_worker_thread_barrier_release (vm);
        }
    }

Zhang Dongya via lists.fd.io <fortitude.zhang=gmail....@lists.fd.io> wrote on Wed, Dec 14, 2022 at 11:52:

> Hi list,
>
> During the test, when the l3 sub-interface is deleted, I got a new abort in the interface-drop node; it seems the packet references a deleted interface.
>
>> #0  __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:50
>> #1  0x00007face8d17859 in __GI_abort () at abort.c:79
>> #2  0x0000000000407397 in os_exit (code=1) at /home/fortitude/glx/vpp/src/vpp/vnet/main.c:440
>> #3  0x00007face922dd57 in unix_signal_handler (signum=6, si=0x7faca2891170, uc=0x7faca2891040) at /home/fortitude/glx/vpp/src/vlib/unix/main.c:188
>> #4  <signal handler called>
>> #5  __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:50
>> #6  0x00007face8d17859 in __GI_abort () at abort.c:79
>> #7  0x0000000000407333 in os_panic () at /home/fortitude/glx/vpp/src/vpp/vnet/main.c:416
>> #8  0x00007face9067039 in debugger () at /home/fortitude/glx/vpp/src/vppinfra/error.c:84
>> #9  0x00007face9066dfa in _clib_error (how_to_die=2, function_name=0x0, line_number=0, fmt=0x7face9f7a208 "%s:%d (%s) assertion `%s' fails") at /home/fortitude/glx/vpp/src/vppinfra/error.c:143
>> #10 0x00007face9b28358 in vnet_get_sw_interface (vnm=0x7facea243f38 <vnet_main>, sw_if_index=14) at /home/fortitude/glx/vpp/src/vnet/interface_funcs.h:60
>> #11 0x00007face9b2a4ba in interface_drop_punt (vm=0x7facac8e5b00, node=0x7faca95c8840, frame=0x7facc2004a40, disposition=VNET_ERROR_DISPOSITION_DROP) at /home/fortitude/glx/vpp/src/vnet/interface_output.c:1061
>> #12 0x00007face9b29a96 in interface_drop_fn_hsw (vm=0x7facac8e5b00, node=0x7faca95c8840, frame=0x7facc2004a40) at /home/fortitude/glx/vpp/src/vnet/interface_output.c:1215
>> #13 0x00007face91cd50d in dispatch_node (vm=0x7facac8e5b00, node=0x7faca95c8840, type=VLIB_NODE_TYPE_INTERNAL, dispatch_state=VLIB_NODE_STATE_POLLING, frame=0x7facc2004a40, last_time_stamp=404307411779413) at /home/fortitude/glx/vpp/src/vlib/main.c:961
>> #14 0x00007face91cdfb0 in dispatch_pending_node (vm=0x7facac8e5b00, pending_frame_index=3, last_time_stamp=404307411779413) at /home/fortitude/glx/vpp/src/vlib/main.c:1120
>> #15 0x00007face91c921f in vlib_main_or_worker_loop (vm=0x7facac8e5b00, is_main=0) at /home/fortitude/glx/vpp/src/vlib/main.c:1589
>> #16 0x00007face91c8947 in vlib_worker_loop (vm=0x7facac8e5b00) at /home/fortitude/glx/vpp/src/vlib/main.c:1723
>> #17 0x00007face92080a4 in vlib_worker_thread_fn (arg=0x7facaa227d00) at /home/fortitude/glx/vpp/src/vlib/threads.c:1579
>> #18 0x00007face9203195 in vlib_worker_thread_bootstrap_fn (arg=0x7facaa227d00) at /home/fortitude/glx/vpp/src/vlib/threads.c:418
>> #19 0x00007face9121609 in start_thread (arg=<optimized out>) at pthread_create.c:477
>> #20 0x00007face8e14133 in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95
>
> From the first mail, I want to know whether this sequence can happen or not:
>
> 1. My process node adds a pkt directly to ip4-lookup using put_frame_to_node, with the rx interface set to the l3 sub-interface created earlier.
>
> 2. My control plane agent (using govpp) deletes the l3 sub-interface (this should be handled in the vpp api-process node).
>
> 3. vpp schedules the pending nodes. Since the rx interface is deleted, vpp can't get a valid fib index; there is no check in the following ip4_fib_forwarding_lookup, so it crashes with an abort.
>
> I don't think an api barrier in step 2 can solve this, since the pkt is already in the pending frame.
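The failure in step 3 comes from the FIB index being derived from the RX interface. A simplified sketch of that dependency (not the verbatim ip4-lookup source; ip4_main.fib_index_by_sw_if_index is the per-interface FIB mapping):

    #include <vlib/vlib.h>
    #include <vnet/vnet.h>
    #include <vnet/ip/ip4.h>

    /* Simplified sketch of how ip4-lookup derives the FIB index from
     * the buffer's RX interface (not the verbatim VPP source). */
    static u32
    fib_index_from_rx_interface (vlib_buffer_t * b)
    {
      u32 sw_if_index = vnet_buffer (b)->sw_if_index[VLIB_RX];

      /* Once the interface is deleted, this slot no longer holds a
       * valid FIB index, and the stale sw_if_index later asserts in
       * vnet_get_sw_interface() (frame #10 in the backtrace above). */
      return vec_elt (ip4_main.fib_index_by_sw_if_index, sw_if_index);
    }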
> Zhang Dongya via lists.fd.io <fortitude.zhang=gmail....@lists.fd.io> wrote on Thu, Dec 8, 2022 at 00:17:
>
>> The crash has not shown up anymore.
>>
>> Does this fix make any sense? If it does, I will submit a patch later.
>>
>> Zhang Dongya via lists.fd.io <fortitude.zhang=gmail....@lists.fd.io> wrote on Tue, Nov 29, 2022 at 22:51:
>>
>>> Hi Ben,
>>>
>>> In the beginning I also thought it was a barrier issue; however, that turned out not to be the case.
>>>
>>> The pkt whose sw_if_index[VLIB_RX] is set to the to-be-deleted interface is actually put into the ip4-lookup node by my process node; the process node adds pkts in a timer-driven way.
>>>
>>> Since the pkt is added by my process node, I think it is not affected by the worker barrier. In my case the sub-interface is deleted by API, which is processed in the linux_epoll_input PRE_INPUT node. Consider the following sequence:
>>>
>>> 1. My process node adds a pkt to the ip4-lookup node, and the pkt refers to a valid sw if index.
>>>
>>> 2. linux_epoll_input processes an API request to delete the above sw if index.
>>>
>>> 3. vpp schedules the ip4-lookup node, which then crashes because the sw if index has been deleted and ip4-lookup can't use sw_if_index[VLIB_RX], whose fib mapping is now ~0, to get a valid fib index.
>>>
>>> Some existing code sends packets this way (ikev2_send_ike and others), so I don't think it is feasible to update the pending frames when the interface is deleted.
>>>
>>> Benoit Ganne (bganne) via lists.fd.io <bganne=cisco....@lists.fd.io> wrote on Tue, Nov 29, 2022 at 22:22:
>>>
>>>> Hi Zhang,
>>>>
>>>> I'd expect the interface deletion to happen under the worker barrier. VPP workers should drain all their in-flight packets before entering the barrier, so it should not be possible for the interface to disappear between your node and ip4-lookup. Or am I missing something?
>>>> What I have seen happening is that you have some data structure where you keep the interface index used by your node, and this data is not updated when the interface is removed.
>>>> Regarding your proposal, I suspect an issue could be when we reuse the sw_if_index: if you delete a sw_interface and then add a new one, chances are you'll be reusing the same index, but the fib_index might be different.
>>>>
>>>> Best
>>>> ben
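The proposal Ben is cautioning about is the one quoted below: mapping a deleted interface's fib index to the default FIB. A hypothetical sketch of that change, not a reviewed patch:

    #include <vnet/ip/ip4.h>

    /* Hypothetical sketch of the change proposed below for
     * ip4_sw_interface_add_del(): on delete, point the interface at
     * FIB 0 (the default FIB) instead of ~0, mirroring interface
     * creation.  As Ben notes, this is fragile if the sw_if_index is
     * later reused with a different fib_index. */
    static void
    on_ip4_sw_interface_del (u32 sw_if_index)
    {
      ip4_main_t *im = &ip4_main;

      vec_validate (im->fib_index_by_sw_if_index, sw_if_index);
      im->fib_index_by_sw_if_index[sw_if_index] = 0; /* default FIB, not ~0 */
    }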
>>>> > -----Original Message-----
>>>> > From: vpp-dev@lists.fd.io <vpp-dev@lists.fd.io> On Behalf Of Zhang Dongya
>>>> > Sent: Tuesday, November 29, 2022 3:45
>>>> > To: vpp-dev@lists.fd.io
>>>> > Subject: Re: [vpp-dev] possible use deleted sw if index in ip4-lookup and cause crash
>>>> >
>>>> > I have found a solution, and it can solve the crash issue.
>>>> >
>>>> > In ip4_sw_interface_add_del, which is a callback for interface deletion, we may set the fib index of the removed interface to 0 (the default fib) instead of ~0. This is the same behavior as interface creation.
>>>> >
>>>> > Zhang Dongya via lists.fd.io <fortitude.zhang=gmail....@lists.fd.io> wrote on Mon, Nov 28, 2022 at 19:41:
>>>> >
>>>> >     Hi list,
>>>> >
>>>> >     Recently I encountered a vpp crash with my plugin enabled. After some investigation I found it may be related to an l3 sub-interface being deleted while my process node adds work to the ip4-lookup node.
>>>> >
>>>> >     Intuitively I thought it might be related to barrier usage, so I tried to fix it by adding checks in my process node to guard against the case where the l3 sub-interface is deleted; however, the crash still occurred.
>>>> >
>>>> >     Finally I think it is related to a pattern like this:
>>>> >
>>>> >     1. My process node adds a pkt directly to ip4-lookup using put_frame_to_node, with the rx interface set to the l3 sub-interface created earlier.
>>>> >
>>>> >     2. My control plane agent (using govpp) deletes the l3 sub-interface (this should be handled in the vpp api-process node).
>>>> >
>>>> >     3. vpp schedules the pending nodes. Since the rx interface is deleted, vpp can't get a valid fib index; there is no check in the following ip4_fib_forwarding_lookup, so it crashes with an abort.
>>>> >
>>>> >     I think vpp may schedule my process node (timeout driven) and the api-process node one after the other, and then schedule the pending nodes.
>>>> >
>>>> >     Should I add a check in ip4-lookup, or is there a better way to send pkts from a ctrl process?
>>>> >
>>>> >     Thanks a lot.
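On the question of adding a check in ip4-lookup: one possible shape for such a guard, sketched under the assumption that the node validates the RX interface before trusting it (vnet_sw_interface_is_valid() is the existing pool check from interface_funcs.h; the helper name is made up):

    #include <vnet/vnet.h>

    /* Hypothetical guard a node could run on a buffer that sat in a
     * pending frame: reject RX interfaces that no longer exist in the
     * sw_interfaces pool, so the packet can be dropped instead of
     * reaching an invalid FIB index. */
    static int
    rx_interface_still_valid (vlib_buffer_t * b)
    {
      vnet_main_t *vnm = vnet_get_main ();
      u32 sw_if_index = vnet_buffer (b)->sw_if_index[VLIB_RX];

      return vnet_sw_interface_is_valid (vnm, sw_if_index);
    }

Note this does not help once the index has been reused by a new interface, which is Ben's point above.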