Re: [**EXTERNAL**] RE: [vpp-dev] #vpp #vnet os_panic for failed barrier timeout

2021-06-30 Thread Bly, Mike via lists.fd.io
Agreed that upgrading to a newer code base is prudent, and yes, we are in the midst of doing so. However, I did not see any obvious changes in this area, so I am a bit pessimistic about the upgrade being the fix. Perhaps I missed a subtle improvement in this area that folks could point me at to ease my concerns?

Regarding the 2nd-paragraph comments/questions: yes, it is a single worker, and I too would like to know why the main thread did not just move on instead of throwing the os_panic().

-Mike

From: v...@barachs.net 
Sent: Thursday, June 24, 2021 5:46 AM
To: Bly, Mike ; vpp-dev@lists.fd.io
Subject: [**EXTERNAL**] RE: [vpp-dev] #vpp #vnet os_panic for failed barrier 
timeout

Given the reported MTBF of 9 months and nearly 2-year-old software, switching 
to 21.01 [and then to 21.06 when released] seems like the only sensible next 
step.

From the gdb info provided, it looks like there is one worker thread. Is that correct? If so, the "workers_at_barrier" count seems correct, so why wouldn't the main thread have moved on instead of spinning waiting for something which already happened?

D.


From: vpp-dev@lists.fd.io <vpp-dev@lists.fd.io> On Behalf Of Bly, Mike via lists.fd.io
Sent: Wednesday, June 23, 2021 10:59 AM
To: vpp-dev@lists.fd.io
Subject: [vpp-dev] #vpp #vnet os_panic for failed barrier timeout

We are looking for advice on whether anyone is looking at this os_panic() for a failed barrier timeout. We see many instances of this type of main-thread back-trace in the forum. For this incident: referencing the sw_interface_dump API, we created a lighter oper-get call that simply fetches link state rather than all of the extensive information the dump command fetches for each interface. At the time we added our new oper-get function, we overlooked the "is_mp_safe" enablement that exists for the dump and as such did NOT set it for our new oper-get. The end result is a fairly light API that requires barrier support. When this issue occurred the configuration was using a single separate worker thread, so the API waits for a barrier count of 1. Interestingly, the BT analysis shows the count value was met, which implies some deeper issue. Why did a single worker, with at most tens of packets per second of workload at the time, fail to reach the barrier within the allotted one-second timeout? And, even more fun to answer, why did we even reach the os_panic call at all, when the BT shows the worker was in fact stalled at the barrier? Please refer to the GDB analysis at the bottom of this email.

This code is based on 19.08. We are in the process of upgrading to 21.01, but 
in review of the forum posts, this type of BT is seen across many versions. 
This is an extremely rare event. We had one occurrence in September of last 
year that we could not reproduce and then just had a second occurrence this 
week. As such, we are not able to reproduce this on demand, let alone in stock 
VPP code given this is a new API.

While we could simply enable is_mp_safe as done for sw_interface_dump to avoid the issue, we are troubled by not being able to explain why the os_panic
occurred in the first place. As such, we are hoping someone might be able to 
provide guidance here on next steps. What additional details from the core-file 
can we provide?


Thread 1 backtrace

#0 __GI_raise (sig=sig@entry=6) at 
/usr/src/debug/glibc/2.30-r0/git/sysdeps/unix/sysv/linux/raise.c:50
#1 0x003cb8425548 in __GI_abort () at 
/usr/src/debug/glibc/2.30-r0/git/stdlib/abort.c:79
#2 0x004075da in os_exit () at 
/usr/src/debug/vpp/19.08+gitAUTOINC+6641eb3e8f-r0/git/src/vpp/vnet/main.c:379
#3 0x7ff1f5740794 in unix_signal_handler (signum=, 
si=, uc=)
at 
/usr/src/debug/vpp/19.08+gitAUTOINC+6641eb3e8f-r0/git/src/vlib/unix/main.c:183
#4 <signal handler called>
#5 __GI_raise (sig=sig@entry=6) at 
/usr/src/debug/glibc/2.30-r0/git/sysdeps/unix/sysv/linux/raise.c:50
#6 0x003cb8425548 in __GI_abort () at 
/usr/src/debug/glibc/2.30-r0/git/stdlib/abort.c:79
#7 0x00407583 in os_panic () at 
/usr/src/debug/vpp/19.08+gitAUTOINC+6641eb3e8f-r0/git/src/vpp/vnet/main.c:355
#8 0x7ff1f5728643 in vlib_worker_thread_barrier_sync_int (vm=0x7ff1f575ba40 
, func_name=)
at /usr/src/debug/vpp/19.08+gitAUTOINC+6641eb3e8f-r0/git/src/vlib/threads.c:1476
#9 0x7ff1f62c6d56 in vl_msg_api_handler_with_vm_node 
(am=am@entry=0x7ff1f62d8d40 , the_msg=0x1300ba738,
vm=vm@entry=0x7ff1f575ba40 , node=node@entry=0x7ff1b588c000)
at 
/usr/src/debug/vpp/19.08+gitAUTOINC+6641eb3e8f-r0/git/src/vlibapi/api_shared.c:583
#10 0x7ff1f62b1237 in void_mem_api_handle_msg_i (am=, 
q=, node=0x7ff1b588c000,
vm=0x7ff1f575ba40 ) at 
/usr/src/debug/vpp/19.08+gitAUTOINC+6641eb3e8f-r0/git/src/vlibmemory/memory_api.c:712
#11 vl_mem_api_handle_msg_main (vm=vm@entry=0x7ff1f575ba40 , 
node=node@entry=0x7ff1b588c000)
at 
/usr/src/debug/vpp/19.08+gitAUTOINC+6641eb3e8f-r0/git/src/vlibmemory/memory_api.c:722
#12 0x

[vpp-dev] #vpp #vnet os_panic for failed barrier timeout

2021-06-23 Thread Bly, Mike via lists.fd.io
We are looking for advice on whether anyone is looking at this os_panic() for a failed barrier timeout. We see many instances of this type of main-thread back-trace in the forum. For this incident: referencing the sw_interface_dump API, we created a lighter oper-get call that simply fetches link state rather than all of the extensive information the dump command fetches for each interface. At the time we added our new oper-get function, we overlooked the "is_mp_safe" enablement that exists for the dump and as such did NOT set it for our new oper-get. The end result is a fairly light API that requires barrier support. When this issue occurred the configuration was using a single separate worker thread, so the API waits for a barrier count of 1. Interestingly, the BT analysis shows the count value was met, which implies some deeper issue. Why did a single worker, with at most tens of packets per second of workload at the time, fail to reach the barrier within the allotted one-second timeout? And, even more fun to answer, why did we even reach the os_panic call at all, when the BT shows the worker was in fact stalled at the barrier? Please refer to the GDB analysis at the bottom of this email.
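
For reference, the main-thread side of this barrier is essentially a spin on the workers_at_barrier counter with a one-second deadline, and the deadline path is the only way os_panic() is reached from there. A lightly paraphrased sketch, based on the 19.08-era src/vlib/threads.c (simplified, not a verbatim copy):

/* Paraphrased sketch of vlib_worker_thread_barrier_sync_int(), 19.08-era,
 * with the recursion/elog bookkeeping omitted. */
f64 now = vlib_time_now (vm);
f64 deadline = now + BARRIER_SYNC_TIMEOUT;   /* BARRIER_SYNC_TIMEOUT is 1.0 second */
u32 count = vec_len (vlib_mains) - 1;        /* 1 worker in our configuration */

*vlib_worker_threads->wait_at_barrier = 1;
while (*vlib_worker_threads->workers_at_barrier != count)
  {
    if ((now = vlib_time_now (vm)) > deadline)
      {
        fformat (stderr, "%s: worker thread deadlock\n", __FUNCTION__);
        os_panic ();   /* the abort seen in frame #7 of the back-trace */
      }
  }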

This code is based on 19.08. We are in the process of upgrading to 21.01, but 
in review of the forum posts, this type of BT is seen across many versions. 
This is an extremely rare event. We had one occurrence in September of last 
year that we could not reproduce and then just had a second occurrence this 
week. As such, we are not able to reproduce this on demand, let alone in stock 
VPP code given this is a new API.

While we could simply enable is_mp_safe as done for sw_interface_dump to avoid the issue, we are troubled by not being able to explain why the os_panic
occurred in the first place. As such, we are hoping someone might be able to 
provide guidance here on next steps. What additional details from the core-file 
can we provide?
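
For completeness, marking a message as mp-safe is a one-line change in the API hookup code. A minimal sketch, assuming a hypothetical VL_API_MY_OPER_GET message id for our new call (the sw_interface_dump line mirrors what stock VPP does in src/vnet/interface_api.c; for a plugin message the plugin's message-id base must be added to the index):

static clib_error_t *
my_api_hookup (vlib_main_t * vm)
{
  api_main_t *am = &api_main;

  /* stock VPP marks the dump handler as mp-safe this way */
  am->is_mp_safe[VL_API_SW_INTERFACE_DUMP] = 1;

  /* doing the same for our lighter oper-get would avoid the barrier entirely */
  am->is_mp_safe[VL_API_MY_OPER_GET] = 1;

  return 0;
}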


Thread 1 backtrace

#0 __GI_raise (sig=sig@entry=6) at 
/usr/src/debug/glibc/2.30-r0/git/sysdeps/unix/sysv/linux/raise.c:50
#1 0x003cb8425548 in __GI_abort () at 
/usr/src/debug/glibc/2.30-r0/git/stdlib/abort.c:79
#2 0x004075da in os_exit () at 
/usr/src/debug/vpp/19.08+gitAUTOINC+6641eb3e8f-r0/git/src/vpp/vnet/main.c:379
#3 0x7ff1f5740794 in unix_signal_handler (signum=, 
si=, uc=)
at 
/usr/src/debug/vpp/19.08+gitAUTOINC+6641eb3e8f-r0/git/src/vlib/unix/main.c:183
#4 <signal handler called>
#5 __GI_raise (sig=sig@entry=6) at 
/usr/src/debug/glibc/2.30-r0/git/sysdeps/unix/sysv/linux/raise.c:50
#6 0x003cb8425548 in __GI_abort () at 
/usr/src/debug/glibc/2.30-r0/git/stdlib/abort.c:79
#7 0x00407583 in os_panic () at 
/usr/src/debug/vpp/19.08+gitAUTOINC+6641eb3e8f-r0/git/src/vpp/vnet/main.c:355
#8 0x7ff1f5728643 in vlib_worker_thread_barrier_sync_int (vm=0x7ff1f575ba40 
, func_name=)
at /usr/src/debug/vpp/19.08+gitAUTOINC+6641eb3e8f-r0/git/src/vlib/threads.c:1476
#9 0x7ff1f62c6d56 in vl_msg_api_handler_with_vm_node 
(am=am@entry=0x7ff1f62d8d40 , the_msg=0x1300ba738,
vm=vm@entry=0x7ff1f575ba40 , node=node@entry=0x7ff1b588c000)
at 
/usr/src/debug/vpp/19.08+gitAUTOINC+6641eb3e8f-r0/git/src/vlibapi/api_shared.c:583
#10 0x7ff1f62b1237 in void_mem_api_handle_msg_i (am=, 
q=, node=0x7ff1b588c000,
vm=0x7ff1f575ba40 ) at 
/usr/src/debug/vpp/19.08+gitAUTOINC+6641eb3e8f-r0/git/src/vlibmemory/memory_api.c:712
#11 vl_mem_api_handle_msg_main (vm=vm@entry=0x7ff1f575ba40 , 
node=node@entry=0x7ff1b588c000)
at 
/usr/src/debug/vpp/19.08+gitAUTOINC+6641eb3e8f-r0/git/src/vlibmemory/memory_api.c:722
#12 0x7ff1f62be713 in vl_api_clnt_process (f=, 
node=, vm=)
at 
/usr/src/debug/vpp/19.08+gitAUTOINC+6641eb3e8f-r0/git/src/vlibmemory/vlib_api.c:326
#13 vl_api_clnt_process (vm=, node=, f=)
at 
/usr/src/debug/vpp/19.08+gitAUTOINC+6641eb3e8f-r0/git/src/vlibmemory/vlib_api.c:252
#14 0x7ff1f56f90b7 in vlib_process_bootstrap (_a=)
at /usr/src/debug/vpp/19.08+gitAUTOINC+6641eb3e8f-r0/git/src/vlib/main.c:1468
#15 0x7ff1f561f220 in clib_calljmp () at 
/usr/src/debug/vpp/19.08+gitAUTOINC+6641eb3e8f-r0/git/src/vppinfra/longjmp.S:123
#16 0x7ff1b5e39db0 in ?? ()
#17 0x7ff1f56fc669 in vlib_process_startup (f=0x0, p=0x7ff1b588c000, 
vm=0x7ff1f575ba40 )
at 
/usr/src/debug/vpp/19.08+gitAUTOINC+6641eb3e8f-r0/git/src/vppinfra/types.h:133

Thread 3 backtrace

(gdb) thr 3
[Switching to thread 3 (LWP 440)]
#0 vlib_worker_thread_barrier_check () at 
/usr/src/debug/vpp/19.08+gitAUTOINC+6641eb3e8f-r0/git/src/vlib/threads.h:426
426 ;
(gdb) bt
#0 vlib_worker_thread_barrier_check () at 
/usr/src/debug/vpp/19.08+gitAUTOINC+6641eb3e8f-r0/git/src/vlib/threads.h:426
#1 vlib_main_or_worker_loop (is_main=0, vm=0x7ff1b6a5e0c0) at 
/usr/src/debug/vpp/19.08+gitAUTOINC+6641eb3e8f-r0/git/src/vlib/main.c:1744
#2 vlib_worker_loop (vm=0x7ff1b6a5e0c0) at 
/usr/src/debug/vpp/19.08+gitAUTOINC+6641eb3e8f-r0/git/src/vlib/main.c:1934
#3 0x7ff1f561f220 in clib_calljmp () at 
/usr/src/debug/vpp/19.08+gitAUTOINC+6641eb3e8f-r0/git/src/vppinfra/longjmp.S:123
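
The "426 ;" shown by gdb is the empty spin statement in the worker-side barrier check. A paraphrased sketch, based on the 19.08-era src/vlib/threads.h (timing and elog bookkeeping omitted):

/* Paraphrased sketch of vlib_worker_thread_barrier_check(), 19.08-era. */
static inline void
vlib_worker_thread_barrier_check (void)
{
  if (PREDICT_FALSE (*vlib_worker_threads->wait_at_barrier))
    {
      clib_atomic_fetch_add (vlib_worker_threads->workers_at_barrier, 1);

      while (*vlib_worker_threads->wait_at_barrier)
        ;   /* <-- threads.h:426, where thread 3 is parked */

      clib_atomic_fetch_add (vlib_worker_threads->workers_at_barrier, -1);
    }
}

With thread 3 parked on that spin, workers_at_barrier should equal 1, which is what makes the timeout in the main thread so puzzling.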

Re: [**EXTERNAL**] Re: [vpp-dev] #vpp #vnet apparent buffer prefetch issue - seeing "l3 mac mismatch" discards

2020-06-05 Thread Bly, Mike via lists.fd.io
Yeah, it did not cherry-pick cleanly for me either, as it needs a line-number change.

-Mike

From: Andrew Yourtchenko
Sent: Friday, June 05, 2020 10:58 AM
To: ayour...@gmail.com
Cc: Bly, Mike ; vpp-dev@lists.fd.io
Subject: [**EXTERNAL**] Re: [vpp-dev] #vpp #vnet apparent buffer prefetch issue 
- seeing "l3 mac mismatch" discards

Ah sorry, didn’t see your reply. I’ll take a look. It doesn’t appear to cherry-pick cleanly to stable/1908, so it probably was among those 500 :(
--a


On 5 Jun 2020, at 19:54, Andrew Yourtchenko via lists.fd.io <ayourtch=gmail@lists.fd.io> wrote:
Hi Mike,

Any chance you might do a “git bisect” on this? It should take about 8-9 iterations, I suppose...

You could use the v20.01-rc0 label as the “git bisect bad” start and v20.09-rc0 as the “git bisect good” start.

I am currently working on integrating a bunch of “fix” commits that happened between 19.08.1 and the latest, but only about 1/3 of them (about 250) apply automatically, so I am analyzing the other 500. It is taking some time, so if you could bisect, then I could just push that specific fix into stable/1908 and get you going.

As for the tests: this type of testing won’t be covered in “make test”, which is the per-patch test. But maybe there is a simpler way to catch it, so knowing which commit fixed it would surely help...

--a


On 4 Jun 2020, at 22:16, Bly, Mike via lists.fd.io <mbly=ciena@lists.fd.io> wrote:

Hello,

We are observing a small percentage of frames being discarded in a simple 2-port L2 xconnect setup when a constant, single-frame (full duplex) traffic profile is offered to the system. The frames are being discarded due to failed VLAN classification even though all offered frames carry the same VLAN, i.e. sending two sets of 1B of the same frame in two directions (A <-> B) and seeing x% discarded due to seemingly random VLAN classification issues.

We did not see this issue in v18.07.1. At the start of the year we upgraded to 
19.08 and started seeing this issue during scale testing. We have been trying 
to root cause it and are at a point where we need some assistance. Moving from 
our integrated VPP solution to using stock VPP built in an Ubuntu container, we 
have found this issue to be present in all releases between 19.08 – 20.01, but 
appears fixed in 20.05. We are not in a position where we can immediately 
upgrade to v20.05, so we need a solution for the v19.08 code base, based on key 
changes v20.01 -> v20.05. As such, we are looking for guidance on potentially 
relevant changes made between v20.01 and v20.05.

VPP configuration used:
create sub-interfaces TenGigabitEthernet19/0/0 100 dot1q 100
create sub-interfaces TenGigabitEthernet19/0/1 100 dot1q 100
set interface state TenGigabitEthernet19/0/0 up
set interface state TenGigabitEthernet19/0/0.100 up
set interface state TenGigabitEthernet19/0/1 up
set interface state TenGigabitEthernet19/0/1.100 up
set interface l2 xconnect TenGigabitEthernet19/0/0.100 
TenGigabitEthernet19/0/1.100
set interface l2 xconnect TenGigabitEthernet19/0/1.100 
TenGigabitEthernet19/0/0.100

Traffic/setup:

  *   Two traffic generator connections to 10G physical NICs, each connection 
having a single traffic stream, where all frames are the same
  *   No NIC offloading being used, no RSS, single worker thread separate from 
master
  *   64B frames with fixed/cross-matching unicast L2 MAC addresses, non-IP 
Etype, incrementing payload
  *   1 billion frames full duplex, offered at max “lossless” throughput, e.g. approx. 36% of 10Gb/s for v20.05
     *   “lossless” is maximum throughput allowed without observing “show interface” -> “rx-miss” statistics

Resulting statistics:

Working v18.07.1 with proper/expected “error” statistics:
vpp# show version
vpp v18.07.1

vpp# show errors
   Count                    Node                  Reason
        20                l2-output               L2 output packets
        20                l2-input                L2 input packets

Non-Working v20.01 with unexpected “error” statistics:
vpp# show version
vpp v20.01-release

vpp# show errors
   Count                    Node                  Reason
    174332                l2-output               L2 output packets
    174332                l2-input                L2 input packets
     25668             ethernet-input             l3 mac mismatch  <-- we should NOT be seeing these

Working v20.05 with proper/expected “error” statistics:
vpp# show version
vpp v20.05-release

vpp# show errors
   Count                    Node                  Reason
        20                l2-output               L2 output packets
        20                l2-input                L2 input packets

Issue found:

In eth_input_process_frame(), calls to eth_input_get_etype_and_tags() sometimes fail to properly parse/store the “etype” and/or “tag” values, which later results in failed VLAN classification and resultant “l3 mac mismatch” discards due to the parent interface being in L3 mode.
Re: [**EXTERNAL**] Re: [vpp-dev] #vpp #vnet apparent buffer prefetch issue - seeing "l3 mac mismatch" discards

2020-06-05 Thread Bly, Mike via lists.fd.io
Damjan,

Yes, after posting yesterday, I worked through a git bisect sequence and found that the following single commit is what fixed it in v20.05. I cherry-picked this back to v19.08.2 and confirmed it fixes that release as well.

Please cherry-pick the following to the 2019 LTS accordingly. Feel free to use the information in my original email to drive the bug process, or LMK if you’d prefer I drive this.

cbe36e47b5647becc6c03a08e2745ee5ead5de20

We will be applying this locally for our solution as well.

Best Regards,
Mike

From: Damjan Marion 
Sent: Friday, June 05, 2020 8:51 AM
To: Bly, Mike 
Cc: vpp-dev@lists.fd.io
Subject: [**EXTERNAL**] Re: [vpp-dev] #vpp #vnet apparent buffer prefetch issue 
- seeing "l3 mac mismatch" discards


Have you tried to use "git bisect" to find which patch fixes this issue?

—
Damjan



On 4 Jun 2020, at 22:15, Bly, Mike via lists.fd.io <mbly=ciena@lists.fd.io> wrote:

Hello,

We are observing a small percentage of frames being discarded in a simple 2-port L2 xconnect setup when a constant, single-frame (full duplex) traffic profile is offered to the system. The frames are being discarded due to failed VLAN classification even though all offered frames carry the same VLAN, i.e. sending two sets of 1B of the same frame in two directions (A <-> B) and seeing x% discarded due to seemingly random VLAN classification issues.

We did not see this issue in v18.07.1. At the start of the year we upgraded to 
19.08 and started seeing this issue during scale testing. We have been trying 
to root cause it and are at a point where we need some assistance. Moving from 
our integrated VPP solution to using stock VPP built in an Ubuntu container, we 
have found this issue to be present in all releases between 19.08 – 20.01, but 
appears fixed in 20.05. We are not in a position where we can immediately 
upgrade to v20.05, so we need a solution for the v19.08 code base, based on key 
changes v20.01 -> v20.05. As such, we are looking for guidance on potentially 
relevant changes made between v20.01 and v20.05.

VPP configuration used:
create sub-interfaces TenGigabitEthernet19/0/0 100 dot1q 100
create sub-interfaces TenGigabitEthernet19/0/1 100 dot1q 100
set interface state TenGigabitEthernet19/0/0 up
set interface state TenGigabitEthernet19/0/0.100 up
set interface state TenGigabitEthernet19/0/1 up
set interface state TenGigabitEthernet19/0/1.100 up
set interface l2 xconnect TenGigabitEthernet19/0/0.100 
TenGigabitEthernet19/0/1.100
set interface l2 xconnect TenGigabitEthernet19/0/1.100 
TenGigabitEthernet19/0/0.100

Traffic/setup:

  *   Two traffic generator connections to 10G physical NICs, each connection having a single traffic stream, where all frames are the same
  *   No NIC offloading being used, no RSS, single worker thread separate from master
  *   64B frames with fixed/cross-matching unicast L2 MAC addresses, non-IP Etype, incrementing payload
  *   1 billion frames full duplex, offered at max “lossless” throughput, e.g. approx. 36% of 10Gb/s for v20.05
     *   “lossless” is maximum throughput allowed without observing “show interface” -> “rx-miss” statistics

Resulting statistics:

Working v18.07.1 with proper/expected “error” statistics:
vpp# show version
vpp v18.07.1

vpp# show errors
   Count                    Node                  Reason
        20                l2-output               L2 output packets
        20                l2-input                L2 input packets

Non-Working v20.01 with unexpected “error” statistics:
vpp# show version
vpp v20.01-release

vpp# show errors
   Count                    Node                  Reason
    174332                l2-output               L2 output packets
    174332                l2-input                L2 input packets
     25668             ethernet-input             l3 mac mismatch  <-- we should NOT be seeing these

Working v20.05 with proper/expected “error” statistics:
vpp# show version
vpp v20.05-release

vpp# show errors
   Count                    Node                  Reason
        20                l2-output               L2 output packets
        20                l2-input                L2 input packets

Issue found:

In eth_input_process_frame(), calls to eth_input_get_etype_and_tags() sometimes fail to properly parse/store the “etype” and/or “tag” values, which later results in failed VLAN classification and resultant “l3 mac mismatch” discards due to the parent interface being in L3 mode.

Here is a sample debug profiling of the discards. We implement some 
down-n-dirty debug statistics as shown:

  *   bad_l3_frm_offset[256] is showing which frame in the “n_left” sequence of a given batch was discarded
  *   bad_l3_batch_size[256] is showing the size of each batch of frames being processed when a discard occurs

[vpp-dev] #vpp #vnet apparent buffer prefetch issue - seeing "l3 mac mismatch" discards

2020-06-04 Thread Bly, Mike via lists.fd.io
Hello,

We are observing a small percentage of frames being discarded in a simple 2-port L2 xconnect setup when a constant, single-frame (full duplex) traffic profile is offered to the system. The frames are being discarded due to failed VLAN classification even though all offered frames carry the same VLAN, i.e. sending two sets of 1B of the same frame in two directions (A <-> B) and seeing x% discarded due to seemingly random VLAN classification issues.

We did not see this issue in v18.07.1. At the start of the year we upgraded to 
19.08 and started seeing this issue during scale testing. We have been trying 
to root cause it and are at a point where we need some assistance. Moving from 
our integrated VPP solution to using stock VPP built in an Ubuntu container, we 
have found this issue to be present in all releases between 19.08 - 20.01, but 
appears fixed in 20.05. We are not in a position where we can immediately 
upgrade to v20.05, so we need a solution for the v19.08 code base, based on key 
changes v20.01 -> v20.05. As such, we are looking for guidance on potentially 
relevant changes made between v20.01 and v20.05.

VPP configuration used:
create sub-interfaces TenGigabitEthernet19/0/0 100 dot1q 100
create sub-interfaces TenGigabitEthernet19/0/1 100 dot1q 100
set interface state TenGigabitEthernet19/0/0 up
set interface state TenGigabitEthernet19/0/0.100 up
set interface state TenGigabitEthernet19/0/1 up
set interface state TenGigabitEthernet19/0/1.100 up
set interface l2 xconnect TenGigabitEthernet19/0/0.100 
TenGigabitEthernet19/0/1.100
set interface l2 xconnect TenGigabitEthernet19/0/1.100 
TenGigabitEthernet19/0/0.100

Traffic/setup:

  *   Two traffic generator connections to 10G physical NICs, each connection 
having a single traffic stream, where all frames are the same
  *   No NIC offloading being used, no RSS, single worker thread separate from 
master
  *   64B frames with fixed/cross-matching unicast L2 MAC addresses, non-IP 
Etype, incrementing payload
  *   1 billion frames full duplex, offered at max "lossless" throughput, e.g. approx. 36% of 10Gb/s for v20.05
     *   "lossless" is maximum throughput allowed without observing "show interface" -> "rx-miss" statistics

Resulting statistics:

Working v18.07.1 with proper/expected "error" statistics:
vpp# show version
vpp v18.07.1

vpp# show errors
   Count                    Node                  Reason
        20                l2-output               L2 output packets
        20                l2-input                L2 input packets

Non-Working v20.01 with unexpected "error" statistics:
vpp# show version
vpp v20.01-release

vpp# show errors
   Count                    Node                  Reason
    174332                l2-output               L2 output packets
    174332                l2-input                L2 input packets
     25668             ethernet-input             l3 mac mismatch  <-- we should NOT be seeing these

Working v20.05 with proper/expected "error" statistics:
vpp# show version
vpp v20.05-release

vpp# show errors
   Count                    Node                  Reason
        20                l2-output               L2 output packets
        20                l2-input                L2 input packets

Issue found:

In eth_input_process_frame(), calls to eth_input_get_etype_and_tags() sometimes fail to properly parse/store the "etype" and/or "tag" values, which later results in failed VLAN classification and resultant "l3 mac mismatch" discards due to the parent interface being in L3 mode.

Here is a sample debug profiling of the discards. We implement some 
down-n-dirty debug statistics as shown:

  *   bad_l3_frm_offset[256] is showing which frame in the "n_left" sequence of a given batch was discarded
  *   bad_l3_batch_size[256] is showing the size of each batch of frames being processed when a discard occurs

(gdb) p bad_l3_frm_offset
$1 = {1078, 1078, 1078, 1078, 0 , 383, 383, 383, 383, 0 
}

(gdb) p bad_l3_batch_size
$2 = {0 , 1424, 0, 0, 1356, 3064}
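
A minimal sketch of how counters shaped like these can be maintained (hypothetical instrumentation in the spirit of the description above; the array names mirror the gdb output, everything else is ours):

/* Hypothetical down-n-dirty instrumentation: record one entry per discard. */
static u32 bad_l3_frm_offset[256];  /* n_left position of the discarded frame */
static u32 bad_l3_batch_size[256];  /* batch size when the discard happened */
static u32 bad_l3_events;

static inline void
record_bad_l3 (u32 frame_offset, u32 batch_size)
{
  u32 i = bad_l3_events++ & 0xff;   /* wrap after 256 recorded events */
  bad_l3_frm_offset[i] = frame_offset;
  bad_l3_batch_size[i] = batch_size;
}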

I did manage to find the following thread, which seems possibly related to our issue: https://lists.fd.io/g/vpp-dev/message/15488. Sharing just in case it is in fact relevant.

Finally, are the VPP performance regression tests monitoring/checking "show errors" content? We are looking to understand how this may have gone unnoticed between the v18.07.1 and v20.05 release efforts, given the simplicity of the configuration and test stimulus.

-Mike