Re: [vpp-dev] [VCL] hoststack app crash with invalid memfd segment address

2019-11-15 Thread Florin Coras
Hi Hanlin,

Just to make sure, are you running master or some older VPP?

Regarding the issue you could be hitting, described below: here’s [1] a patch that I 
have not yet pushed for merging because it leads to API changes for applications 
that directly use the session layer’s application interface instead of VCL. I 
haven’t tested it extensively, but the goal is to signal segment 
allocation/deallocation over the mq instead of the binary API.
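
To picture the intent of [1], here is a purely conceptual sketch (the type and event
names are made up and are not the actual session-layer or VCL API): when segment
add/del notifications travel on the same mq as session events, a single consumer
necessarily maps a segment before it can see a connected event whose fifos live
inside it.

/* Conceptual sketch only -- hypothetical names, not VPP's API. */
typedef enum
{
  EVT_SEGMENT_ADD,         /* map the memfd segment before anything uses it */
  EVT_SEGMENT_DEL,         /* unmap it; no later event may reference it */
  EVT_SESSION_CONNECTED,   /* fifos live inside a previously added segment */
} app_evt_type_t;

typedef struct
{
  app_evt_type_t type;
  union
  {
    int segment_fd;        /* for EVT_SEGMENT_ADD / EVT_SEGMENT_DEL */
    void *rx_fifo;         /* for EVT_SESSION_CONNECTED */
  };
} app_evt_t;

static void
handle_mq_event (app_evt_t *e)
{
  switch (e->type)
    {
    case EVT_SEGMENT_ADD:
      /* mmap the segment here, before any fifo in it can be touched */
      break;
    case EVT_SEGMENT_DEL:
      /* munmap it; any fifo pointer into this segment is now invalid */
      break;
    case EVT_SESSION_CONNECTED:
      /* the segment holding e->rx_fifo was announced earlier on this
         same queue, so it is already mapped in this process */
      break;
    }
}

Because both notification types are consumed in order from one queue, the
"dealloc handled on one thread, connect handled on another" window described
below should go away.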

Finally, I’ve never tested LDP with Envoy, so I’m not sure whether that works 
properly. There’s ongoing work to integrate Envoy with VCL, so you may want to get 
in touch with the authors. 

Regards,
Florin

[1] https://gerrit.fd.io/r/c/vpp/+/21497

> On Nov 15, 2019, at 2:26 AM, wanghanlin  wrote:
> 
> hi ALL,
> I accidentally got the following crash stack when I used VCL with hoststack and 
> memfd. But the corresponding "invalid" rx_fifo address (0x2f42e2480) is valid in 
> the VPP process and can also be found in /proc/map. That is, the shared memfd 
> segment memory is not consistent between the hoststack app and VPP.
> Generally, VPP allocates/deallocates the memfd segment and then notifies the 
> hoststack app to attach/detach. But what happens if, just after VPP deallocates a 
> memfd segment and notifies the hoststack app, VPP immediately allocates the same 
> memfd segment again because a session is connected? Because the hoststack app 
> processes the dealloc message and the connected message on different threads, 
> rx_thread_fn may have just detached the memfd segment without attaching the same 
> (re-added) segment, and then, unfortunately, a worker thread gets the connected 
> message.
> 
> These are just my guesses; maybe I misunderstand.
> 
> (gdb) bt
> #0  0x7f7cde21ffbf in raise () from /lib/x86_64-linux-gnu/libpthread.so.0
> #1  0x01190a64 in Envoy::SignalAction::sigHandler (sig=11, 
> info=<optimized out>, context=<optimized out>) at 
> source/common/signal/signal_action.cc:73 
> #2  <signal handler called>
> #3  0x7f7cddc2e85e in vcl_session_connected_handler (wrk=0x7f7ccd4bad00, 
> mp=0x224052f4a) at /home/wanghanlin/vpp-new/src/vcl/vppcom.c:471
> #4  0x7f7cddc37fec in vcl_epoll_wait_handle_mq_event (wrk=0x7f7ccd4bad00, 
> e=0x224052f48, events=0x395000c, num_ev=0x7f7cca49e5e8)
> at /home/wanghanlin/vpp-new/src/vcl/vppcom.c:2658
> #5  0x7f7cddc3860d in vcl_epoll_wait_handle_mq (wrk=0x7f7ccd4bad00, 
> mq=0x224042480, events=0x395000c, maxevents=63, wait_for_time=0, 
> num_ev=0x7f7cca49e5e8)
> at /home/wanghanlin/vpp-new/src/vcl/vppcom.c:2762
> #6  0x7f7cddc38c74 in vppcom_epoll_wait_eventfd (wrk=0x7f7ccd4bad00, 
> events=0x395000c, maxevents=63, n_evts=0, wait_for_time=0)
> at /home/wanghanlin/vpp-new/src/vcl/vppcom.c:2823
> #7  0x7f7cddc393a0 in vppcom_epoll_wait (vep_handle=33554435, 
> events=0x395000c, maxevents=63, wait_for_time=0) at 
> /home/wanghanlin/vpp-new/src/vcl/vppcom.c:2880
> #8  0x7f7cddc5d659 in vls_epoll_wait (ep_vlsh=3, events=0x395000c, 
> maxevents=63, wait_for_time=0) at 
> /home/wanghanlin/vpp-new/src/vcl/vcl_locked.c:895
> #9  0x7f7cdeb4c252 in ldp_epoll_pwait (epfd=67, events=0x395, 
> maxevents=64, timeout=32, sigmask=0x0) at 
> /home/wanghanlin/vpp-new/src/vcl/ldp.c:2334
> #10 0x7f7cdeb4c334 in epoll_wait (epfd=67, events=0x395, 
> maxevents=64, timeout=32) at /home/wanghanlin/vpp-new/src/vcl/ldp.c:2389
> #11 0x00fc9458 in epoll_dispatch ()
> #12 0x00fc363c in event_base_loop ()
> #13 0x00c09b1c in Envoy::Server::WorkerImpl::threadRoutine 
> (this=0x357d8c0, guard_dog=...) at source/server/worker_impl.cc:104 
> 
> #14 0x01193485 in std::function<void ()>::operator()() const 
> (this=0x7f7ccd4b8544)
> at 
> /usr/lib/gcc/x86_64-linux-gnu/7.4.0/../../../../include/c++/7.4.0/bits/std_function.h:706
> #15 Envoy::Thread::ThreadImplPosix::ThreadImplPosix(std::function<void 
> ()>)::$_0::operator()(void*) const (this=<optimized out>, arg=0x2f42e2480)
> at source/common/common/posix/thread_impl.cc:33 
> 
> #16 Envoy::Thread::ThreadImplPosix::ThreadImplPosix(std::function<void 
> ()>)::$_0::__invoke(void*) (arg=0x2f42e2480) at 
> source/common/common/posix/thread_impl.cc:32 
> #17 0x7f7cde2164a4 in start_thread () from 
> /lib/x86_64-linux-gnu/libpthread.so.0
> #18 0x7f7cddf58d0f in clone () from /lib/x86_64-linux-gnu/libc.so.6
> (gdb) f 3
> #3  0x7f7cddc2e85e in vcl_session_connected_handler (wrk=0x7f7ccd4bad00, 
> mp=0x224052f4a) at /home/wanghanlin/vpp-new/src/vcl/vppcom.c:471
> 471   rx_fifo->client_session_index = session_index;
> (gdb) p rx_fifo
> $1 = (svm_fifo_t *) 0x2f42e2480
> (gdb) p *rx_fifo
> Cannot access memory at address 0x2f42e2480
> (gdb)
> 
> 
> Regards,
> Hanlin
> 
> wanghanlin
> wanghan...@corp.netease.com
> 

Re: [vpp-dev] NAT worker HANDOFF but no HANDED-OFF -- no worker picks up the handed-off work

2019-11-15 Thread Elias Rudberg
Hi Andrew,

Thanks, that looks promising. The issue 
https://jira.fd.io/browse/VPP-1734 that the fix refers to seems like it
could be the same issue we are seeing.

We have just restarted vpp with the fix, it will be interesting to see
if it helps. Thanks again for your help!

/ Elias


On Fri, 2019-11-15 at 11:26 +0100, Andrew Yourtchenko wrote:
> Hi Elias,
> 
> Could you give a shot running a build with 
> https://gerrit.fd.io/r/#/c/vpp/+/23461/ in ?
> 
> I cherry-picked it from master today but it is not in 19.08 branch
> yet.
> 
> --a
> 
> > On 15 Nov 2019, at 11:05, Elias Rudberg wrote:
> > 
> > We are using VPP 19.08 for NAT (nat44) and are struggling with the
> > following problem: it first works seemingly fine for a while, like
> > several days or weeks, but then suddenly VPP stops forwarding traffic.
> > Even ping to the "outside" IP address fails.
> > 
> > The VPP process is still running so we try to investigate further using
> > vppctl, enabling packet trace as follows:
> > 
> > clear trace
> > trace add rdma-input 5
> > 
> > then doing ping to "outside" and then "show trace".
> > 
> > To see the normal behavior we have compared to another server running
> > VPP without the strange problem happening; there we can see that the
> > normal behavior is that one worker starts processing the packet and
> > then does NAT44_OUT2IN_WORKER_HANDOFF after which another worker takes
> > over: "handoff_trace" and then "HANDED-OFF: from thread..." and then
> > that worker continues processing the packet.
> > So the relevant parts of the trace look like this (abbreviated to show
> > only node names and handoff info) for a case when thread 8 hands off
> > work to thread 3:
> > 
> > --- Start of thread 3 vpp_wk_2 ---
> > Packet 1
> > 
> > 08:15:10:781992: handoff_trace
> >  HANDED-OFF: from thread 8 trace index 0
> > 08:15:10:781992: nat44-out2in
> > 08:15:10:782008: ip4-lookup
> > 08:15:10:782009: ip4-local
> > 08:15:10:782010: ip4-icmp-input
> > 08:15:10:782011: ip4-icmp-echo-request
> > 08:15:10:782011: ip4-load-balance
> > 08:15:10:782013: ip4-rewrite
> > 08:15:10:782014: BondEthernet0-output
> > 
> > --- Start of thread 8 vpp_wk_7 ---
> > Packet 1
> > 
> > 08:15:10:781986: rdma-input
> > 08:15:10:781988: bond-input
> > 08:15:10:781989: ethernet-input
> > 08:15:10:781989: ip4-input
> > 08:15:10:781990: nat44-out2in-worker-handoff
> >  NAT44_OUT2IN_WORKER_HANDOFF : next-worker 3 trace index 0
> > 
> > The above is what it looks like normally. The problem is that
> > sometimes, for some reason, the handoff stops working so that we only
> > get the initial processing by a worker and that worker saying
> > NAT44_OUT2IN_WORKER_HANDOFF but the other worker does not pick up the
> > work, it is seemingly ignored.
> > 
> > Here is what it looks like then, when the problem has happened, thread
> > 7 trying to handoff to thread 3:
> > 
> > --- Start of thread 3 vpp_wk_2 ---
> > No packets in trace buffer
> > 
> > --- Start of thread 7 vpp_wk_6 ---
> > Packet 1
> > 
> > 08:38:41:904654: rdma-input
> > 08:38:41:904656: bond-input
> > 08:38:41:904658: ethernet-input
> > 08:38:41:904660: ip4-input
> > 08:38:41:904663: nat44-out2in-worker-handoff
> >  NAT44_OUT2IN_WORKER_HANDOFF : next-worker 3 trace index 0
> > 
> > So, work is also in this case handed off to thread 3 but thread 3 does
> > not pick it up. There is no "HANDED-OFF" message in the trace at all,
> > not for any worker. It seems like the handed-off work was ignored. Then
> > of course it is understandable that the ping does not work and packet
> > forwarding does not work, the question is: why does that hand-off
> > procedure fail?
> > 
> > Are there some known reasons that can cause this behavior?
> > 
> > When there is a NAT44_OUT2IN_WORKER_HANDOFF message in the packet
> > trace, should there always be a corresponding "HANDED-OFF" message for
> > another thread picking it up?
> > 
> > One more question related to the above: sometimes when looking at trace
> > for ICMP packets to investigate this problem we have seen a worker
> > apparently handing off work to itself, which seems strange. Example:
> > 
> > --- Start of thread 3 vpp_wk_2 ---
> > Packet 1
> > 
> > 08:31:23:871274: rdma-input
> > 08:31:23:871279: bond-input
> > 08:31:23:871282: ethernet-input
> > 08:31:23:871285: ip4-input
> > 08:31:23:871289: nat44-out2in-worker-handoff
> >  NAT44_OUT2IN_WORKER_HANDOFF : next-worker 3 trace index 0
> > 
> > If the purpose of "handoff" is to let another thread take over, then
> > this seems strange by itself (even without considering that there is no
> > "HANDED-OFF" for any thread): why is thread 3 trying to handoff work to
> > itself? Does that indicate something wrong or are there legitimate
> > cases where a 

[vpp-dev] Coverity run FAILED as of 2019-11-15 14:14:06 UTC

2019-11-15 Thread Noreply Jenkins
Coverity run failed today.

The current number of outstanding issues is 5.
Newly detected: 0
Eliminated: 0
More details can be found at https://scan.coverity.com/projects/fd-io-vpp/view_defects


Re: [vpp-dev] Crash in vlib_worker_thread_barrier_sync_int

2019-11-15 Thread Dave Barach via Lists.Fd.Io
See 
https://fd.io/docs/vpp/master/troubleshooting/reportingissues/reportingissues.html, specifically:

“Before you press the Jira button to create a bug report - or email 
vpp-dev@lists.fd.io - please ask yourself whether there’s enough information 
for someone else to understand and to reproduce the issue given a reasonable 
amount of effort.”

In this case, you’ve made up a root cause – possibly correct, possibly not – 
with no supporting data. At a minimum, please send gdb backtraces from all 
threads, version info, and so on.
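
For reference, that minimum set of data can be gathered along these lines (the
binary and core-file paths are placeholders; on a live process, attach with
gdb -p <pid> instead):

$ vppctl show version
$ gdb /usr/bin/vpp /path/to/core
(gdb) thread apply all bt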

D.

From: vpp-dev@lists.fd.io  On Behalf Of Satya Murthy
Sent: Friday, November 15, 2019 6:31 AM
To: vpp-dev@lists.fd.io
Subject: [vpp-dev] Crash in vlib_worker_thread_barrier_sync_int

Hi ,

We are seeing a crash in the vlib_worker_thread_barrier_sync_int() function as soon 
as we send a CLI command to VPP.
We send the CLI command to VPP via a script, which may not be waiting long enough 
for VPP initialization to settle.

I see that the crash is happening in the following piece of code.
  while (*vlib_worker_threads->workers_at_barrier != count)
    {
      if ((now = vlib_time_now (vm)) > deadline)
        {
          fformat (stderr, "%s: worker thread deadlock\n", __FUNCTION__);
          os_panic ();    <=== Here
        }
    }

From what I see in the code, not all worker threads got a chance to increment 
workers_at_barrier, so the condition is not met within the deadline (1 sec), and 
hence the crash. Is my understanding correct?

If so, are we sending the CLI command too early to VPP (before all workers are 
initialized), and is that why it crashes?

Any inputs on this would really help us.

--
Thanks & Regards,
Murthy


[vpp-dev] Crash in vlib_worker_thread_barrier_sync_int

2019-11-15 Thread Satya Murthy
Hi ,

We are seeing a crash in the vlib_worker_thread_barrier_sync_int() function as soon 
as we send a CLI command to VPP.
We send the CLI command to VPP via a script, which may not be waiting long enough 
for VPP initialization to settle.

I see that the crash is happening in the following piece of code.

  while (*vlib_worker_threads->workers_at_barrier != count)
    {
      if ((now = vlib_time_now (vm)) > deadline)
        {
          fformat (stderr, "%s: worker thread deadlock\n", __FUNCTION__);
          os_panic ();    <=== Here
        }
    }

From what I see in the code, not all worker threads got a chance to increment 
workers_at_barrier, so the condition is not met within the deadline (1 sec), and 
hence the crash. Is my understanding correct?

If so, are we sending the CLI command too early to VPP (before all workers are 
initialized), and is that why it crashes?
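
As an illustration of the pattern in question, here is a simplified, generic 
sketch (made-up names, not VPP's actual threads.c): the main thread spins on a 
shared counter with a roughly one-second deadline, and a worker that has not yet 
entered its dispatch loop never increments that counter, so the deadline expires 
and the panic fires.

#include <stdatomic.h>
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

static atomic_int workers_at_barrier;
static atomic_int barrier_requested;

static double
now_sec (void)
{
  struct timespec ts;
  clock_gettime (CLOCK_MONOTONIC, &ts);
  return ts.tv_sec + ts.tv_nsec * 1e-9;
}

/* Main thread, e.g. while handling a CLI command. */
static void
barrier_sync (int n_workers)
{
  double deadline = now_sec () + 1.0;   /* ~1 second, as in the real code */
  atomic_store (&barrier_requested, 1);
  while (atomic_load (&workers_at_barrier) != n_workers)
    {
      if (now_sec () > deadline)
        {
          fprintf (stderr, "worker thread deadlock\n");
          abort ();   /* plays the role of os_panic () above */
        }
    }
}

/* A worker only reaches this check once its dispatch loop is running. */
static void
worker_poll_barrier (void)
{
  if (atomic_load (&barrier_requested))
    atomic_fetch_add (&workers_at_barrier, 1);
}

If the script fires before every worker has reached its equivalent of 
worker_poll_barrier(), the counter never reaches n_workers and the deadline 
path is taken.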

Any inputs on this would really help us.

--
Thanks & Regards,
Murthy


Re: [vpp-dev] NAT worker HANDOFF but no HANDED-OFF -- no worker picks up the handed-off work

2019-11-15 Thread Andrew Yourtchenko
Hi Elias,

Could you give a shot running a build with 
https://gerrit.fd.io/r/#/c/vpp/+/23461/ in ?

I cherry-picked it from master today but it is not in 19.08 branch yet.

--a

> On 15 Nov 2019, at 11:05, Elias Rudberg  wrote:
> 
> We are using VPP 19.08 for NAT (nat44) and are struggling with the
> following problem: it first works seemingly fine for a while, like
> several days or weeks, but then suddenly VPP stops forwarding traffic.
> Even ping to the "outside" IP address fails.
> 
> The VPP process is still running so we try to investigate further using
> vppctl, enabling packet trace as follows:
> 
> clear trace
> trace add rdma-input 5
> 
> then doing ping to "outside" and then "show trace".
> 
> To see the normal behavior we have compared to another server running
> VPP without the strange problem happening; there we can see that the
> normal behavior is that one worker starts processing the packet and
> then does NAT44_OUT2IN_WORKER_HANDOFF after which another worker takes
> over: "handoff_trace" and then "HANDED-OFF: from thread..." and then
> that worker continues processing the packet.
> So the relevant parts of the trace look like this (abbreviated to show
> only node names and handoff info) for a case when thread 8 hands off
> work to thread 3:
> 
> --- Start of thread 3 vpp_wk_2 ---
> Packet 1
> 
> 08:15:10:781992: handoff_trace
>  HANDED-OFF: from thread 8 trace index 0
> 08:15:10:781992: nat44-out2in
> 08:15:10:782008: ip4-lookup
> 08:15:10:782009: ip4-local
> 08:15:10:782010: ip4-icmp-input
> 08:15:10:782011: ip4-icmp-echo-request
> 08:15:10:782011: ip4-load-balance
> 08:15:10:782013: ip4-rewrite
> 08:15:10:782014: BondEthernet0-output
> 
> --- Start of thread 8 vpp_wk_7 ---
> Packet 1
> 
> 08:15:10:781986: rdma-input
> 08:15:10:781988: bond-input
> 08:15:10:781989: ethernet-input
> 08:15:10:781989: ip4-input
> 08:15:10:781990: nat44-out2in-worker-handoff
>  NAT44_OUT2IN_WORKER_HANDOFF : next-worker 3 trace index 0
> 
> The above is what it looks like normally. The problem is that
> sometimes, for some reason, the handoff stops working so that we only
> get the initial processing by a worker and that worker saying
> NAT44_OUT2IN_WORKER_HANDOFF but the other worker does not pick up the
> work, it is seemingly ignored.
> 
> Here is what it looks like then, when the problem has happened, thread
> 7 trying to handoff to thread 3:
> 
> --- Start of thread 3 vpp_wk_2 ---
> No packets in trace buffer
> 
> --- Start of thread 7 vpp_wk_6 ---
> Packet 1
> 
> 08:38:41:904654: rdma-input
> 08:38:41:904656: bond-input
> 08:38:41:904658: ethernet-input
> 08:38:41:904660: ip4-input
> 08:38:41:904663: nat44-out2in-worker-handoff
>  NAT44_OUT2IN_WORKER_HANDOFF : next-worker 3 trace index 0
> 
> So, work is also in this case handed off to thread 3 but thread 3 does
> not pick it up. There is no "HANDED-OFF" message in the trace at all,
> not for any worker. It seems like the handed-off work was ignored. Then
> of course it is understandable that the ping does not work and packet
> forwarding does not work, the question is: why does that hand-off
> procedure fail?
> 
> Are there some known reasons that can cause this behavior?
> 
> When there is a NAT44_OUT2IN_WORKER_HANDOFF message in the packet
> trace, should there always be a corresponding "HANDED-OFF" message for
> another thread picking it up?
> 
> One more question related to the above: sometimes when looking at trace
> for ICMP packets to investigate this problem we have seen a worker
> apparently handing off work to itself, which seems strange. Example:
> 
> --- Start of thread 3 vpp_wk_2 ---
> Packet 1
> 
> 08:31:23:871274: rdma-input
> 08:31:23:871279: bond-input
> 08:31:23:871282: ethernet-input
> 08:31:23:871285: ip4-input
> 08:31:23:871289: nat44-out2in-worker-handoff
>  NAT44_OUT2IN_WORKER_HANDOFF : next-worker 3 trace index 0
> 
> If the purpose of "handoff" is to let another thread take over, then
> this seems strange by itself (even without considering that there is no
> "HANDED-OFF" for any thread): why is thread 3 trying to handoff work to
> itself? Does that indicate something wrong or are there legitimate
> cases where a thread "hands off" something to itself?
> 
> We have encountered this problem several times but unfortunately we
> have not yet found a way to reproduce it in a lab environment, we do
> not know exactly what triggers the problem. Previous times, when we
> have restarted vpp it starts working normally again.
> 
> Any input on this or ideas for how to troubleshoot further would be
> much appreciated.
> 
> Best regards,
> Elias

[vpp-dev] [VCL] hoststack app crash with invalid memfd segment address

2019-11-15 Thread wanghanlin
hi ALL,
I accidentally got the following crash stack when I used VCL with hoststack and memfd. But the corresponding "invalid" rx_fifo address (0x2f42e2480) is valid in the VPP process and can also be found in /proc/map. That is, the shared memfd segment memory is not consistent between the hoststack app and VPP.
Generally, VPP allocates/deallocates the memfd segment and then notifies the hoststack app to attach/detach. But what happens if, just after VPP deallocates a memfd segment and notifies the hoststack app, VPP immediately allocates the same memfd segment again because a session is connected? Because the hoststack app processes the dealloc message and the connected message on different threads, rx_thread_fn may have just detached the memfd segment without attaching the same (re-added) segment, and then, unfortunately, a worker thread gets the connected message.

These are just my guesses; maybe I misunderstand.

(gdb) bt
#0  0x7f7cde21ffbf in raise () from /lib/x86_64-linux-gnu/libpthread.so.0
#1  0x01190a64 in Envoy::SignalAction::sigHandler (sig=11, info=<optimized out>, context=<optimized out>) at source/common/signal/signal_action.cc:73
#2  <signal handler called>
#3  0x7f7cddc2e85e in vcl_session_connected_handler (wrk=0x7f7ccd4bad00, mp=0x224052f4a) at /home/wanghanlin/vpp-new/src/vcl/vppcom.c:471
#4  0x7f7cddc37fec in vcl_epoll_wait_handle_mq_event (wrk=0x7f7ccd4bad00, e=0x224052f48, events=0x395000c, num_ev=0x7f7cca49e5e8)
    at /home/wanghanlin/vpp-new/src/vcl/vppcom.c:2658
#5  0x7f7cddc3860d in vcl_epoll_wait_handle_mq (wrk=0x7f7ccd4bad00, mq=0x224042480, events=0x395000c, maxevents=63, wait_for_time=0, num_ev=0x7f7cca49e5e8)
    at /home/wanghanlin/vpp-new/src/vcl/vppcom.c:2762
#6  0x7f7cddc38c74 in vppcom_epoll_wait_eventfd (wrk=0x7f7ccd4bad00, events=0x395000c, maxevents=63, n_evts=0, wait_for_time=0)
    at /home/wanghanlin/vpp-new/src/vcl/vppcom.c:2823
#7  0x7f7cddc393a0 in vppcom_epoll_wait (vep_handle=33554435, events=0x395000c, maxevents=63, wait_for_time=0) at /home/wanghanlin/vpp-new/src/vcl/vppcom.c:2880
#8  0x7f7cddc5d659 in vls_epoll_wait (ep_vlsh=3, events=0x395000c, maxevents=63, wait_for_time=0) at /home/wanghanlin/vpp-new/src/vcl/vcl_locked.c:895
#9  0x7f7cdeb4c252 in ldp_epoll_pwait (epfd=67, events=0x395, maxevents=64, timeout=32, sigmask=0x0) at /home/wanghanlin/vpp-new/src/vcl/ldp.c:2334
#10 0x7f7cdeb4c334 in epoll_wait (epfd=67, events=0x395, maxevents=64, timeout=32) at /home/wanghanlin/vpp-new/src/vcl/ldp.c:2389
#11 0x00fc9458 in epoll_dispatch ()
#12 0x00fc363c in event_base_loop ()
#13 0x00c09b1c in Envoy::Server::WorkerImpl::threadRoutine (this=0x357d8c0, guard_dog=...) at source/server/worker_impl.cc:104
#14 0x01193485 in std::function<void ()>::operator()() const (this=0x7f7ccd4b8544)
    at /usr/lib/gcc/x86_64-linux-gnu/7.4.0/../../../../include/c++/7.4.0/bits/std_function.h:706
#15 Envoy::Thread::ThreadImplPosix::ThreadImplPosix(std::function<void ()>)::$_0::operator()(void*) const (this=<optimized out>, arg=0x2f42e2480)
    at source/common/common/posix/thread_impl.cc:33
#16 Envoy::Thread::ThreadImplPosix::ThreadImplPosix(std::function<void ()>)::$_0::__invoke(void*) (arg=0x2f42e2480) at source/common/common/posix/thread_impl.cc:32
#17 0x7f7cde2164a4 in start_thread () from /lib/x86_64-linux-gnu/libpthread.so.0
#18 0x7f7cddf58d0f in clone () from /lib/x86_64-linux-gnu/libc.so.6
(gdb) f 3
#3  0x7f7cddc2e85e in vcl_session_connected_handler (wrk=0x7f7ccd4bad00, mp=0x224052f4a) at /home/wanghanlin/vpp-new/src/vcl/vppcom.c:471
471       rx_fifo->client_session_index = session_index;
(gdb) p rx_fifo
$1 = (svm_fifo_t *) 0x2f42e2480
(gdb) p *rx_fifo
Cannot access memory at address 0x2f42e2480
(gdb)

Regards,
Hanlin
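
To make the suspected interleaving described above concrete, here is the guess 
restated as a timeline (illustrative only; whether this is what actually happens 
inside VPP is exactly the open question):

/* Suspected interleaving (illustration, not confirmed VPP behavior):
 *
 *   VPP                                hoststack app (VCL)
 *   ---                                -------------------
 *   1. dealloc memfd segment S
 *   2. notify app: "segment S del"
 *                                      3. rx_thread_fn: detach/munmap S
 *   4. alloc a new segment at the
 *      same base address for a newly
 *      connected session
 *   5. notify app: "connected", with
 *      rx_fifo pointing into the
 *      re-allocated segment
 *                                      6. worker thread dereferences rx_fifo
 *                                         -> SIGSEGV, because the re-added
 *                                         segment was never mapped again
 *
 * Step 6 matches the crash above: the address is valid in the VPP process,
 * which mapped the new segment, but not in the app, which so far only saw
 * the delete notification.
 */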
wanghanlin
wanghan...@corp.netease.com


[vpp-dev] NAT worker HANDOFF but no HANDED-OFF -- no worker picks up the handed-off work

2019-11-15 Thread Elias Rudberg
We are using VPP 19.08 for NAT (nat44) and are struggling with the
following problem: it first works seemingly fine for a while, like
several days or weeks, but then suddenly VPP stops forwarding traffic.
Even ping to the "outside" IP address fails.

The VPP process is still running so we try to investigate further using
vppctl, enabling packet trace as follows:

clear trace
trace add rdma-input 5

then doing ping to "outside" and then "show trace".

To see the normal behavior we have compared to another server running
VPP without the strange problem happening; there we can see that the
normal behavior is that one worker starts processing the packet and
then does NAT44_OUT2IN_WORKER_HANDOFF after which another worker takes
over: "handoff_trace" and then "HANDED-OFF: from thread..." and then
that worker continues processing the packet.
So the relevant parts of the trace look like this (abbreviated to show
only node names and handoff info) for a case when thread 8 hands off
work to thread 3:

--- Start of thread 3 vpp_wk_2 ---
Packet 1

08:15:10:781992: handoff_trace
  HANDED-OFF: from thread 8 trace index 0
08:15:10:781992: nat44-out2in
08:15:10:782008: ip4-lookup
08:15:10:782009: ip4-local
08:15:10:782010: ip4-icmp-input
08:15:10:782011: ip4-icmp-echo-request
08:15:10:782011: ip4-load-balance
08:15:10:782013: ip4-rewrite
08:15:10:782014: BondEthernet0-output

--- Start of thread 8 vpp_wk_7 ---
Packet 1

08:15:10:781986: rdma-input
08:15:10:781988: bond-input
08:15:10:781989: ethernet-input
08:15:10:781989: ip4-input
08:15:10:781990: nat44-out2in-worker-handoff
  NAT44_OUT2IN_WORKER_HANDOFF : next-worker 3 trace index 0

The above is what it looks like normally. The problem is that
sometimes, for some reason, the handoff stops working so that we only
get the initial processing by a worker and that worker saying
NAT44_OUT2IN_WORKER_HANDOFF but the other worker does not pick up the
work, it is seemingly ignored.

Here is what it looks like then, when the problem has happened, thread
7 trying to handoff to thread 3:

--- Start of thread 3 vpp_wk_2 ---
No packets in trace buffer

--- Start of thread 7 vpp_wk_6 ---
Packet 1

08:38:41:904654: rdma-input
08:38:41:904656: bond-input
08:38:41:904658: ethernet-input
08:38:41:904660: ip4-input
08:38:41:904663: nat44-out2in-worker-handoff
  NAT44_OUT2IN_WORKER_HANDOFF : next-worker 3 trace index 0

So, work is also in this case handed off to thread 3 but thread 3 does
not pick it up. There is no "HANDED-OFF" message in the trace at all,
not for any worker. It seems like the handed-off work was ignored. Then
of course it is understandable that the ping does not work and packet
forwarding does not work, the question is: why does that hand-off
procedure fail?

Are there some known reasons that can cause this behavior?

When there is a NAT44_OUT2IN_WORKER_HANDOFF message in the packet
trace, should there always be a corresponding "HANDED-OFF" message for
another thread picking it up?
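
For intuition about what the trace implies, here is a minimal, generic sketch of 
the handoff pattern (made-up names, not VPP's actual frame-queue code): the 
producing worker only enqueues a buffer index into the target worker's queue, and 
the "HANDED-OFF" trace entry can only come from the consuming side, so if the 
target never dequeues the entry, or the enqueue is dropped (for example on a full 
or mismanaged queue), the packet disappears exactly as observed here.

#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

#define RING_SIZE 64    /* power of two */

typedef struct
{
  uint32_t slots[RING_SIZE];    /* buffer indices handed to the target worker */
  atomic_uint head;             /* written by the producing worker */
  atomic_uint tail;             /* written by the consuming worker */
} handoff_ring_t;

/* Producer side: conceptually what the *-worker-handoff node does. */
static bool
handoff_enqueue (handoff_ring_t *r, uint32_t buffer_index)
{
  unsigned head = atomic_load_explicit (&r->head, memory_order_relaxed);
  unsigned tail = atomic_load_explicit (&r->tail, memory_order_acquire);
  if (head - tail == RING_SIZE)
    return false;               /* queue full: handoff lost unless retried */
  r->slots[head % RING_SIZE] = buffer_index;
  atomic_store_explicit (&r->head, head + 1, memory_order_release);
  return true;
}

/* Consumer side: only this path can produce a "HANDED-OFF" trace entry. */
static bool
handoff_dequeue (handoff_ring_t *r, uint32_t *buffer_index)
{
  unsigned tail = atomic_load_explicit (&r->tail, memory_order_relaxed);
  unsigned head = atomic_load_explicit (&r->head, memory_order_acquire);
  if (tail == head)
    return false;               /* nothing handed off */
  *buffer_index = r->slots[tail % RING_SIZE];
  atomic_store_explicit (&r->tail, tail + 1, memory_order_release);
  return true;
}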

One more question related to the above: sometimes when looking at trace
for ICMP packets to investigate this problem we have seen a worker
apparently handing off work to itself, which seems strange. Example:

--- Start of thread 3 vpp_wk_2 ---
Packet 1

08:31:23:871274: rdma-input
08:31:23:871279: bond-input
08:31:23:871282: ethernet-input
08:31:23:871285: ip4-input
08:31:23:871289: nat44-out2in-worker-handoff
  NAT44_OUT2IN_WORKER_HANDOFF : next-worker 3 trace index 0

If the purpose of "handoff" is to let another thread take over, then
this seems strange by itself (even without considering that there is no
"HANDED-OFF" for any thread): why is thread 3 trying to handoff work to
itself? Does that indicate something wrong or are there legitimate
cases where a thread "hands off" something to itself?

We have encountered this problem several times but unfortunately we
have not yet found a way to reproduce it in a lab environment, we do
not know exactly what triggers the problem. Previous times, when we
have restarted vpp it starts working normally again.

Any input on this or ideas for how to troubleshoot further would be
much appreciated.

Best regards,
Elias