Maybe it is this one?  https://gerrit.fd.io/r/c/vpp/+/26961   -John

From: vpp-dev@lists.fd.io <vpp-dev@lists.fd.io> On Behalf Of Damjan Marion via lists.fd.io
Sent: Friday, June 05, 2020 11:51 AM
To: m...@ciena.com
Cc: vpp-dev@lists.fd.io
Subject: Re: [vpp-dev] #vpp #vnet apparent buffer prefetch issue - seeing "l3 mac mismatch" discards


Have you tried to use "git bisect" to find which patch fixes this issue?

—
Damjan



On 4 Jun 2020, at 22:15, Bly, Mike via lists.fd.io <mbly=ciena....@lists.fd.io> wrote:

Hello,

We are observing a small percentage of frames being discarded in a simple two-port L2 xconnect setup when a constant, identical-frame, full-duplex traffic profile is offered to the system. The frames are discarded due to failed VLAN classification even though every offered frame carries the same VLAN, i.e. we send two sets of 1B copies of the same frame in two directions (A <-> B) and see x% discarded due to seemingly random VLAN classification failures.

We did not see this issue in v18.07.1. At the start of the year we upgraded to 19.08 and started seeing this issue during scale testing. We have been trying to root-cause it and are at a point where we need some assistance. Moving from our integrated VPP solution to stock VPP built in an Ubuntu container, we have found the issue to be present in all releases from 19.08 through 20.01, but it appears fixed in 20.05. We are not in a position to immediately upgrade to v20.05, so we need a solution for the v19.08 code base based on the key changes between v20.01 and v20.05. As such, we are looking for guidance on potentially relevant changes made between v20.01 and v20.05.

VPP configuration used:
create sub-interfaces TenGigabitEthernet19/0/0 100 dot1q 100
create sub-interfaces TenGigabitEthernet19/0/1 100 dot1q 100
set interface state TenGigabitEthernet19/0/0 up
set interface state TenGigabitEthernet19/0/0.100 up
set interface state TenGigabitEthernet19/0/1 up
set interface state TenGigabitEthernet19/0/1.100 up
set interface l2 xconnect TenGigabitEthernet19/0/0.100 TenGigabitEthernet19/0/1.100
set interface l2 xconnect TenGigabitEthernet19/0/1.100 TenGigabitEthernet19/0/0.100

Traffic/setup:

·        Two traffic generator connections to 10G physical NICs, each connection carrying a single traffic stream in which all frames are identical

·        No NIC offloading in use, no RSS, single worker thread separate from the master

·        64B frames with fixed, cross-matching unicast L2 MAC addresses, a non-IP Etype, and an incrementing payload

·        1 billion frames full duplex, offered at the maximum “lossless” throughput, e.g. approx. 36% of 10Gb/s for v20.05

o   “lossless” is the maximum throughput achievable without observing “rx-miss” counters in “show interface”

Resulting statistics:

Working v18.07.1 with proper/expected “error” statistics:
vpp# show version
vpp v18.07.1

vpp# show errors
   Count                    Node                  Reason
2000000000                l2-output               L2 output packets
2000000000                l2-input                L2 input packets

Non-Working v20.01 with unexpected “error” statistics:
vpp# show version
vpp v20.01-release

vpp# show errors
   Count                    Node                  Reason
1999974332                l2-output               L2 output packets
1999974332                l2-input                L2 input packets
     25668             ethernet-input             l3 mac mismatch         <-- we should NOT be seeing these

Working v20.05 with proper/expected “error” statistics:
vpp# show version
vpp v20.05-release

vpp# show errors
   Count                    Node                  Reason
2000000000                l2-output               L2 output packets
2000000000                l2-input                L2 input packets

Issue found:

In eth_input_process_frame(), calls to eth_input_get_etype_and_tags() sometimes fail to properly parse/store the “etype” and/or “tag” values. VLAN classification then fails for those frames later on, and because the parent interface is in L3 mode they are discarded as “l3 mac mismatch”.
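To make the failure mode concrete, below is a simplified, self-contained C sketch of the per-frame decision we believe is being taken. This is not the actual VPP ethernet-input code; parse_etype_and_tags(), subif_exists() and the interface-MAC argument are placeholder names for illustration only.

#include <string.h>

typedef struct
{
  unsigned short etype; /* outer ethertype parsed from the frame */
  unsigned short tag0;  /* first VLAN id, 0 if no tag was recognized */
} parse_result_t;

/* Placeholder parse: reads the ethertype and VLAN id at fixed offsets.
 * Per the report, the real eth_input_get_etype_and_tags() sometimes fails
 * to store these values correctly for a few frames in a batch. */
static parse_result_t
parse_etype_and_tags (const unsigned char *f)
{
  parse_result_t r;
  r.etype = (unsigned short) ((f[12] << 8) | f[13]);
  r.tag0 = (unsigned short) (((f[14] << 8) | f[15]) & 0x0fff);
  return r;
}

/* Placeholder: in this setup only the dot1q 100 subinterfaces exist. */
static int
subif_exists (unsigned short vlan)
{
  return vlan == 100;
}

enum { ACCEPT_SUBIF = 0, DROP_L3_MAC_MISMATCH = 1 };

static int
classify_frame (const unsigned char *frame, const unsigned char if_mac[6])
{
  parse_result_t r = parse_etype_and_tags (frame);

  /* Expected path: tag 100 is recognized, the dot1q 100 subinterface
   * matches, and the frame is cross-connected by l2-input/l2-output. */
  if (r.etype == 0x8100 && subif_exists (r.tag0))
    return ACCEPT_SUBIF;

  /* Failure path: if etype/tag were not parsed/stored correctly, the frame
   * falls back to the parent interface, which is in L3 mode, so its
   * destination MAC is compared to the interface MAC and any mismatch is
   * dropped and counted as "l3 mac mismatch". */
  if (memcmp (frame, if_mac, 6) != 0)
    return DROP_L3_MAC_MISMATCH;

  return ACCEPT_SUBIF;
}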

Here is a sample debug profile of the discards. We implemented some down-n-dirty debug counters as shown below:

·        bad_l3_frm_offset[256] records at which position in the “n_left” sequence of a given batch a frame was discarded

·        bad_l3_batch_size[256] records the size of the batch of frames being processed when a discard occurs

(gdb) p bad_l3_frm_offset
$1 = {1078, 1078, 1078, 1078, 0 <repeats 12 times>, 383, 383, 383, 383, 0 
<repeats 236 times>}

(gdb) p bad_l3_batch_size
$2 = {0 <repeats 251 times>, 1424, 0, 0, 1356, 3064}
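For reference, a rough sketch of how such counters could be wired into the drop path is below. This is illustrative only; the record_bad_l3_frame() helper, its arguments, and the exact increment site are assumptions rather than our exact debug patch.

/* Down-n-dirty per-discard histograms, sketched from the description above.
 * Illustrative only: the helper name, arguments and call site are assumed. */
static unsigned long bad_l3_frm_offset[256]; /* indexed by n_left when the bad frame is seen */
static unsigned long bad_l3_batch_size[256]; /* indexed by the size of the batch being processed */

static void
record_bad_l3_frame (unsigned int n_left, unsigned int batch_size)
{
  /* clamp indices so oversized values cannot run past the arrays */
  bad_l3_frm_offset[n_left < 256 ? n_left : 255] += 1;
  bad_l3_batch_size[batch_size < 256 ? batch_size : 255] += 1;
}

/* Called from the point where a frame is about to be counted as
 * "l3 mac mismatch", e.g. record_bad_l3_frame (n_left, n_packets); */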

I did manage to find the following thread, which may be related to our issue: https://lists.fd.io/g/vpp-dev/message/15488. Sharing it just in case it is in fact relevant.

Finally, do the VPP performance regression tests monitor/check “show errors” output? We are trying to understand how this could have gone unnoticed between the v18.07.1 and v20.05 release efforts, given the simplicity of the configuration and test stimulus.

-Mike
