Hi Xuo,

That’s great news! :-) I am happy about the timing of all of this - I literally stumbled upon it just a couple of weeks ago :-) 🎉
We are still reviewing/discussing some potential tweaks to this patch (thanks to Neale and Florin, who are helping!): even though it sits in a place that was previously just “sleep for 10ms”, it is still a codepath that gets a lot of hits on every instance of VPP, so the more reviews and scrutiny it gets, the better. So I’d say add yourself to the CC on that change in gerrit and help discuss/test the possible modifications to it, if there are any...

Background: the basic problem the patch aims to solve is that while VPP is sleeping in epoll in kernel land during relatively idle times, it knows nothing about what has happened in shared memory for the entire 10ms of that sleep - which is an eternity when you do a lot of API transfers over shared memory. This is how you get down to ~300-600 API request-response cycles per second, from several hundred thousand. But you can’t avoid sleeping in epoll either, since then you would just be burning CPU cycles. So a lot of the code in that block is heuristics that infer when we can expect more work in the near future... predicting the future, even the next 10ms, is tricky! :-) The existing code does a very good job of guessing, except in this particular case. For example, you may notice that if you run the exact same example *while* sending a lot of traffic through, it will take noticeably less time to run.

But if all of the above doesn’t sound interesting to you and you just want to move on with whatever task you wanted to accomplish: another approach on your side is to try using the Unix socket transport for the API. It won’t have the same problem, because an API message sent onto the Unix socket will immediately wake up VPP, so your API exchanges per second will stay quick even when VPP is idle.
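If you go the Unix socket route, the transport has to be enabled on the VPP side first. A minimal startup.conf fragment would look something like this (the socket path shown is the usual default - check your deployment):

```
socksvr {
  socket-name /run/vpp/api.sock
}
```

On the Python side, recent vpp_papi versions can then be pointed at that socket path instead of the shared-memory transport (I believe via the server_address argument to VPPApiClient, but double-check against the vpp_papi shipped in your tree).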
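To get an intuition for why the 10ms sleep dominates, here is a toy model in Python (nothing to do with the actual VPP code - function names and numbers are made up for illustration). One “server” thread only checks a shared queue every 10ms, like the epoll sleep; the other is woken immediately by a socket write, like the Unix socket transport:

```python
# Toy model of the two transports discussed above (hypothetical names,
# NOT the real VPP code): a polled shared queue vs. a wake-on-write socket.
import queue
import socket
import threading
import time

def polled_roundtrips(n, poll_interval=0.010):
    """Request/response via a queue the server only checks every 10ms."""
    req, resp = queue.Queue(), queue.Queue()
    stop = threading.Event()

    def server():
        while not stop.is_set():
            try:
                resp.put(req.get_nowait())  # answer instantly once seen
            except queue.Empty:
                time.sleep(poll_interval)   # the "epoll sleep": blind for 10ms

    threading.Thread(target=server, daemon=True).start()
    start = time.monotonic()
    for i in range(n):
        req.put(i)
        resp.get()
    elapsed = time.monotonic() - start
    stop.set()
    return n / elapsed

def socket_roundtrips(n):
    """Same exchange over a socketpair: a write wakes the server immediately."""
    a, b = socket.socketpair()

    def server():
        for _ in range(n):
            b.sendall(b.recv(1))  # echo: woken as soon as data arrives

    t = threading.Thread(target=server, daemon=True)
    t.start()
    start = time.monotonic()
    for _ in range(n):
        a.sendall(b"x")
        a.recv(1)
    elapsed = time.monotonic() - start
    t.join()
    a.close()
    b.close()
    return n / elapsed

if __name__ == "__main__":
    print(f"polled: ~{polled_roundtrips(50):.0f} req/s")
    print(f"socket: ~{socket_roundtrips(5000):.0f} req/s")
```

On my mental model of this, the polled variant is capped at roughly 1/poll_interval round trips per second, while the socket variant runs orders of magnitude faster - the same shape of gap as the numbers above (the real VPP heuristics do better than this naive loop, hence ~300-600 rather than ~100).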
You can compare some of the behaviors also using my work-in-progress Rust code here, which includes a cli_inband benchmark: https://github.com/ayourtch/vpp-api-transport

I haven’t gotten into any optimizations yet on the Rust side, but it shows that in the Unix socket case VPP won’t be the bottleneck for you - the python interpreter speed will be.

Hope this helps either way! :)

(Side note: it’s odd that with single-thread VPP you are getting good performance; I think I had it slow even then when I tested from Rust... let me see if I can make a Rust benchmark that does the same as your example and then explore it a bit more... hopefully this weekend. Will reply-all on this thread when I have something of note to say...)

--a

> On 6 Mar 2021, at 06:17, Xuo Guoto <[email protected]> wrote:
>
> Hi Andrew,
>
> It does make things way faster. Now I am getting time in the range of 2.5
> secs. Is this patch ready for prime time?
>
> X.
>
> ‐‐‐‐‐‐‐ Original Message ‐‐‐‐‐‐‐
> On Tuesday, March 2, 2021 7:35 PM, Andrew Yourtchenko <[email protected]>
> wrote:
>
>> Hi Xuo,
>>
>> I’ve seen a maybe related problem recently - could you try an image with
>> https://gerrit.fd.io/r/c/vpp/+/31368 in it and see if you still have a
>> similarly large difference or does it make things faster for you ?
>>
>> --a
>>
>>> On 25 Feb 2021, at 16:20, Xuo Guoto via lists.fd.io
>>> <[email protected]> wrote:
>>>
>>> Hi List,
>>>
>>> We have been using policer_add_del and classify_add_del_session in single
>>> threaded VPP (ie one main thread only) and both API were giving decent
>>> performance, but after switching to multi thread VPP the performance seems
>>> be drastically less.
>>>
>>> To test this out a small test program was written which will add 10,000
>>> policer and classify table entries and measure the speed.
>>>
>>> In single threaded VPP the program took 2.19 sec while with 1 main and 2
>>> worker threads it took 115.89 sec.
>>> The tests were conducted without any
>>> traffic flowing through VPP.
>>>
>>> The python test program too is attached for reference.
>>>
>>> Platform and version are:
>>>
>>> vpp# sh version
>>> vpp v21.01.0-1~gfa065f96d built by root on ubuntu20-04 at
>>> 2021-02-24T09:00:32
>>>
>>> vpp# sh cpu
>>> Model name:               Intel(R) Xeon(R) Silver 4208 CPU @ 2.10GHz
>>> Microarch model (family): [0x6] Skylake ([0x55] Skylake X/SP) stepping 0x7
>>> Flags:                    sse3 pclmulqdq ssse3 sse41 sse42 avx rdrand avx2
>>>                           pqm pqe avx512f rdseed aes avx512_vnni invariant_tsc
>>> Base frequency:           2.09 GHz
>>> vpp#
>>>
>>> vpp# sh thread
>>> ID  Name      Type     LWP   Sched Policy (Priority)  lcore  Core  Socket State
>>> 0   vpp_main           4230  other (0)                1      7     0
>>> 1   vpp_wk_0  workers  4243  other (0)                2      1     0
>>> 2   vpp_wk_1  workers  4244  other (0)                3      6     0
>>> vpp#
>>>
>>> corresponding classify table : classify table mask l3 ip4 src miss-next
>>> drop memory-size 800M
>>>
>>> Is this behavior expected? Can some thing be done to achieve performance
>>> similar to single threaded VPP while running VPP with multiple threads?
>>>
>>> X.
>>> <2101_api_test.py>
-=-=-=-=-=-=-=-=-=-=-=-
Links: You receive all messages sent to this group.

View/Reply Online (#18875): https://lists.fd.io/g/vpp-dev/message/18875
Mute This Topic: https://lists.fd.io/mt/80903834/21656
Group Owner: [email protected]
Unsubscribe: https://lists.fd.io/g/vpp-dev/unsub [[email protected]]
-=-=-=-=-=-=-=-=-=-=-=-
