Thanks for the update, John. I'll pass this along to our test team. Not sure when we can schedule a retest, but when we do, I'll provide our results.
Thanks again,
Billy

On Tue, Sep 5, 2017 at 10:10 AM, John Lo (loj) <l...@cisco.com> wrote:

> Hi Billy,
>
> I submitted fixes for VPP-963, now merged in both 17.07 and master/17.10,
> that I believe should address the NDR/PDR performance issue with the 10K
> and 1M flow cases. The regression was caused by a bug fix in the L2
> learning path to update stale time stamps and sequence numbers of MAC
> entries in the L2FIB. Because the time stamp is in units of minutes,
> whenever the clock hits the minute mark there can be a prolonged burst of
> MAC updates, affecting forwarding performance when a large number of MACs
> in the L2 FIB need updates. My fix smooths out the update burst to reduce
> the impact. I believe you should now find the 17.07 or 17.10 performance
> for 10K and 1M flows slightly lower but fairly close to the level of
> 17.04, instead of somewhere between 1/3 and 1/2 of that of 17.04 as you
> measured before.
>
> I also doubled the memory size of the L2FIB table to fit 4M MACs and set
> the learn limit to 4M entries. During my testing, I found the L2FIB would
> run out of memory at around 2.8M MACs with the previous memory size.
>
> Regards,
> John
>
> *From:* vpp-dev-boun...@lists.fd.io [mailto:vpp-dev-boun...@lists.fd.io]
> *On Behalf Of* Billy McFall
> *Sent:* Monday, August 28, 2017 12:47 PM
> *To:* Maciek Konstantynowicz (mkonstan) <mkons...@cisco.com>
> *Cc:* csit-...@lists.fd.io; vpp-dev <vpp-dev@lists.fd.io>
> *Subject:* Re: [vpp-dev] VPP Performance drop from 17.04 to 17.07
>
> On Mon, Aug 28, 2017 at 8:53 AM, Maciek Konstantynowicz (mkonstan)
> <mkons...@cisco.com> wrote:
>
> > + csit-dev
> >
> > Billy,
> >
> > Per last week's CSIT project call, from the CSIT perspective we
> > classified your reported issue as a test coverage escape.
> >
> > Summary
> > =======
> > CSIT test coverage got fixed, see more detail below.
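John's description of the minute-granularity time stamp burst (earlier in the quoted thread) can be illustrated with a small simulation. This is a hypothetical sketch of the general idea only, not VPP's actual l2fib code; the function names and the per-interval budget are made up:

```python
def burst_updates(n_entries, last_stamp_min, now_min):
    """With minute-granularity time stamps, all learned entries share the
    same stamp, so the first traffic after a minute rollover finds every
    entry stale at once - the update burst John describes."""
    return n_entries if now_min != last_stamp_min else 0

def smoothed_updates(n_entries, budget_per_interval):
    """The smoothing idea: update at most `budget_per_interval` stale
    entries per processing interval, bounding the worst-case interval cost.
    Returns the number of updates performed in each interval."""
    schedule = []
    remaining = n_entries
    while remaining > 0:
        step = min(budget_per_interval, remaining)
        schedule.append(step)
        remaining -= step
    return schedule

# With 1M learned MACs, a minute rollover means 1M stale-entry updates in
# one burst; a (hypothetical) budget of 4096 per interval bounds each
# interval while the same total work completes over ~245 intervals.
burst = burst_updates(1_000_000, 41, 42)
sched = smoothed_updates(1_000_000, 4096)
print(burst, max(sched), len(sched))
```

The same total number of updates happens either way; the difference is whether they all land on the packets that arrive just after the minute boundary.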
> > The CSIT tests uncovered a regression for L2BD with MAC learning with a
> > higher total number of MACs in the L2FIB, >>10k MACs, for multi-threaded
> > configurations. Single-threaded configurations seem not to be impacted.
> >
> > Billy, Karl, can you confirm this aligns with your findings?
>
> When you say "multi-threaded configuration", I assume you mean multiple
> worker threads? Karl's tests had 4 workers, one for each NIC (physical
> and vhost-user). He only tested multi-threaded, so we cannot confirm that
> single-threaded configurations are not impacted.
>
> Our numbers are a little different from yours, but we are both seeing
> drops between releases. We had a bigger drop-off with 10k flows, but the
> results seem to be similar with the million-flow tests.
>
> I was a little disappointed the MAC limit change by John Lo on 8/23
> didn't improve the master numbers some.
>
> Thanks for all the hard work and adding these additional test cases.
>
> Billy
>
> > More detail
> > ===========
> > MAC scale tests have now been added to the L2BD and L2BD+vhost CSIT
> > suites, as a simple extension to the existing L2 testing suites. Some
> > known issues with the TG prevented CSIT from adding those tests in the
> > past, but now that the TG issues have been addressed, the tests could
> > be added swiftly. The complete list of added tests is in [1] - thanks
> > to Peter Mikus for great work there!
> >
> > Results from running those tests multiple times within the FD.io CSIT
> > lab infra can be glanced over by checking the dedicated test trigger
> > commits [2][3][4] and the summary graphs in the linked xls [5]. The
> > results confirm there is a regression in the VPP l2fib code affecting
> > all scaled-up MAC tests in multi-thread configuration. Single-thread
> > configurations seem not to be impacted.
> > The tests in commit [1] are not merged yet, as they're waiting for the
> > TG/TRex team to fix a TRex issue with mis-calculating the Ethernet FCS
> > with a large number of L2 MAC flows (>10k MAC flows). The issue is
> > tracked by [6]; TRex v2.29 with the fix has an ETA of w/e 1-Sep, i.e.
> > this week. The reported CSIT test results use Ethernet frames with UDP
> > headers, which masks the TRex issue.
> >
> > We have also vpp git bisected the problem between v17.04 (good) and
> > v17.07 (bad) in a separate IXIA-based lab in SJC, and found the culprit
> > vpp patch [7]. Awaiting a fix from vpp-dev; jira ticket raised [8].
> >
> > Many thanks for reporting this regression and working with CSIT to
> > plug this hole in testing.
> >
> > -Maciek
> >
> > [1] CSIT-786 L2FIB scale testing [https://gerrit.fd.io/r/#/c/8145/
> > ge8145] [https://jira.fd.io/browse/CSIT-786 CSIT-786];
> > L2FIB scale testing for 10k, 100k, 1M FIB entries
> > ./l2:
> > 10ge2p1x520-eth-l2bdscale10kmaclrn-ndrpdrdisc.robot
> > 10ge2p1x520-eth-l2bdscale100kmaclrn-ndrpdrdisc.robot
> > 10ge2p1x520-eth-l2bdscale1mmaclrn-ndrpdrdisc.robot
> > 10ge2p1x520-eth-l2bdscale10kmaclrn-eth-2vhostvr1024-1vm-cfsrr1-ndrpdrdisc
> > 10ge2p1x520-eth-l2bdscale100kmaclrn-eth-2vhostvr1024-1vm-cfsrr1-ndrpdrdisc
> > 10ge2p1x520-eth-l2bdscale1mmaclrn-eth-2vhostvr1024-1vm-cfsrr1-ndrpdrdisc
> > [2] VPP master branch [https://gerrit.fd.io/r/#/c/8173/ ge8173];
> > [3] VPP stable/1707 [https://gerrit.fd.io/r/#/c/8167/ ge8167];
> > [4] VPP stable/1704 [https://gerrit.fd.io/r/#/c/8172/ ge8172];
> > [5] CSIT-794 VPP v17.07 L2BD yields lower NDR and PDR performance vs.
> > v17.04, 20170825_l2fib_regression_10k_100k_1M.xlsx,
> > [https://jira.fd.io/browse/CSIT-794 CSIT-794];
> > [6] TRex v2.28 Ethernet FCS mis-calculation issue
> > [https://jira.fd.io/browse/CSIT-793 CSIT-793];
> > [7] commit 25ff2ea3a31e422094f6d91eab46222a29a77c4b;
> > [8] VPP v17.07 L2BD NDR and PDR multi-thread performance broken
> > [https://jira.fd.io/browse/VPP-963 VPP-963];
> >
> > On 14 Aug 2017, at 23:40, Billy McFall <bmcf...@redhat.com> wrote:
> >
> > In the last VPP call, I reported that some internal Red Hat performance
> > testing was showing a significant drop in performance between releases
> > 17.04 and 17.07. This was with l2-bridge testing - PVP - 0.002% drop
> > rate:
> >
> > VPP-17.04: 256 Flow 7.8 MP/s   10k Flow 7.3 MP/s   1m Flow 5.2 MP/s
> > VPP-17.07: 256 Flow 7.7 MP/s   10k Flow 2.7 MP/s   1m Flow 1.8 MP/s
> >
> > The performance team re-ran some of the tests for me with some
> > additional data collected. It looks like the size of the L2 FIB table
> > was reduced in 17.07. Below are the numbers of entries in the MAC table
> > after the tests are run:
> >
> > 17.04:
> > show l2fib
> > 4000008 l2fib entries
> >
> > 17.07:
> > show l2fib
> > 1067053 l2fib entries with 1048576 learned (or non-static) entries
> >
> > This caused more packets to be flooded (see output of 'show node
> > counters' below). I looked but couldn't find anything. Is the size of
> > the L2 FIB table configurable?
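On the configurability question: VPP's startup configuration gained an `l2fib` section for sizing the L2 FIB hash table. The stanza below is a sketch only; the parameter names are from later releases' startup.conf documentation, the values are illustrative, and support may vary by release, so verify against your version's docs:

```
l2fib {
  # Heap size for the L2 FIB hash table (illustrative value)
  table-size 512M
  # Number of hash buckets; more buckets reduce collisions at MAC scale
  num-buckets 1048576
}
```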
> > Thanks,
> > Billy McFall
> >
> > 17.04:
> >
> > show node counters
> >      Count          Node            Reason
> >        :
> >  313035313        l2-input        L2 input packets
> >     555726        l2-flood        L2 flood packets
> >        :
> >  310115490        l2-input        L2 input packets
> >     824859        l2-flood        L2 flood packets
> >        :
> >  313508376        l2-input        L2 input packets
> >    1041961        l2-flood        L2 flood packets
> >        :
> >  313691024        l2-input        L2 input packets
> >     698968        l2-flood        L2 flood packets
> >
> > 17.07:
> >
> > show node counters
> >      Count          Node            Reason
> >        :
> >   97810569        l2-input        L2 input packets
> >   72557612        l2-flood        L2 flood packets
> >        :
> >   97830674        l2-input        L2 input packets
> >   72478802        l2-flood        L2 flood packets
> >        :
> >   97714888        l2-input        L2 input packets
> >   71655987        l2-flood        L2 flood packets
> >        :
> >   97710374        l2-input        L2 input packets
> >   70058006        l2-flood        L2 flood packets
> >
> > --
> > *Billy McFall*
> > SDN Group
> > Office of Technology
> > *Red Hat*

--
*Billy McFall*
SDN Group
Office of Technology
*Red Hat*
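The counter dumps above quantify the regression directly: the flood share of input traffic jumps from a fraction of a percent on 17.04 to roughly three quarters on 17.07, consistent with learned MACs no longer fitting in the L2FIB. A quick check of the quoted figures:

```python
# Per-worker (l2-input, l2-flood) packet counts taken from the
# 'show node counters' output quoted above.
counters = {
    "17.04": [(313035313, 555726), (310115490, 824859),
              (313508376, 1041961), (313691024, 698968)],
    "17.07": [(97810569, 72557612), (97830674, 72478802),
              (97714888, 71655987), (97710374, 70058006)],
}

def flood_ratio(pairs):
    """Fraction of L2 input packets that were flooded rather than
    forwarded to a learned MAC."""
    total_in = sum(i for i, _ in pairs)
    total_flood = sum(f for _, f in pairs)
    return total_flood / total_in

for release, pairs in counters.items():
    print(release, f"{flood_ratio(pairs):.1%}")
# Roughly 0.2% of input flooded on 17.04 vs ~73% on 17.07.
```

The ~73% flood ratio on 17.07 lines up with the `show l2fib` output: only 1048576 (2^20) of the 4M offered MACs were learned, so most destinations were unknown and flooded.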
_______________________________________________ vpp-dev mailing list vpp-dev@lists.fd.io https://lists.fd.io/mailman/listinfo/vpp-dev