Hi Xuo,

That’s great news! :-) I’m happy about the timing of all of this - I literally 
stumbled upon it just a couple of weeks ago :-) 🎉

We are still reviewing/discussing some potential tweaks to this patch (thanks 
to Neale and Florin, who are helping!). Although it sits in a place that was 
previously just “sleep for 10 ms”, it’s still a codepath that gets a lot of 
hits on every instance of VPP, so the more reviews and scrutiny it gets, the 
better.

So I’d say add yourself to the CC on that change in Gerrit and help 
discuss/test the possible modifications to it, if there are any...

Background: the basic problem the patch aims to solve is that while VPP is 
sleeping in epoll in kernel land during relatively idle times, it doesn’t know 
anything about what happened in shared memory for the whole 10 ms of that 
epoll sleep - which is an eternity when you do a lot of API transfers over 
shared memory. This is how you get down to ~300-600 API request-response 
cycles per second, from several hundred thousand.
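To make the effect concrete, here is a toy model in stdlib Python (my own sketch, not VPP code): one "server" thread only checks its input queue every 10 ms, like VPP waking from epoll on a timeout without seeing the shared-memory writes in between; the other blocks on the queue and wakes immediately, like VPP being kicked awake by a socket write.

```python
import queue
import threading
import time

POLL_MS = 0.010  # the 10 ms epoll timeout from the discussion above
N_REQS = 30

def polling_server(inbox, outbox, stop):
    # Only looks at the "shared memory" inbox once per poll interval.
    while not stop.is_set():
        time.sleep(POLL_MS)
        while True:
            try:
                req = inbox.get_nowait()
            except queue.Empty:
                break
            outbox.put(req)  # echo the request back as the "response"

def eventdriven_server(inbox, outbox, stop):
    # Blocks on the queue and wakes immediately when a request arrives.
    while not stop.is_set():
        try:
            req = inbox.get(timeout=0.1)
        except queue.Empty:
            continue
        outbox.put(req)

def bench(server):
    inbox, outbox, stop = queue.Queue(), queue.Queue(), threading.Event()
    t = threading.Thread(target=server, args=(inbox, outbox, stop), daemon=True)
    t.start()
    t0 = time.monotonic()
    for i in range(N_REQS):  # strictly serialized request/response cycles
        inbox.put(i)
        outbox.get()
    elapsed = time.monotonic() - t0
    stop.set()
    return N_REQS / elapsed

poll_rps = bench(polling_server)
event_rps = bench(eventdriven_server)
print(f"polling every 10 ms: {poll_rps:8.0f} req/s")
print(f"event-driven:        {event_rps:8.0f} req/s")
```

Each serialized request/response cycle against the polling server pays for (most of) a 10 ms sleep, capping it around ~100 req/s regardless of how fast the actual work is; the event-driven variant is limited only by wakeup latency.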

But you can’t avoid sleeping in epoll either, since then you would just be 
burning CPU cycles. So a lot of the code in that block is heuristics that 
infer when we can expect more work in the near future... predicting the 
future, even the next 10 ms, is tricky! :-) The existing code does a very good 
job of guessing, except in this particular case.
For example, you may notice that if you run the exact same example *while* 
sending a lot of traffic through, it will take noticeably less time to run.
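To illustrate the kind of guessing involved, here is a toy adaptive-sleep heuristic (my own sketch, not VPP’s actual logic): spin when work was just found, since a burst likely has more behind it, and back off toward the 10 ms ceiling as things stay idle.

```python
MAX_SLEEP = 0.010  # the 10 ms epoll timeout ceiling

def next_sleep(found_work: bool, current: float) -> float:
    """Pick the next sleep duration from what just happened."""
    if found_work:
        return 0.0  # a burst is likely: spin, don't sleep
    # Idle: double the sleep (plus a seed so 0.0 can grow), capped at 10 ms.
    return min(MAX_SLEEP, current * 2 + 0.000125)

# Walk through an idle stretch: the sleep ramps from 0 up to the 10 ms cap.
sleep, ramp = 0.0, []
for _ in range(8):
    sleep = next_sleep(False, sleep)
    ramp.append(sleep)
print(ramp)
```

The trade-off is exactly the one described above: spinning keeps latency low during bursts, while the ramp keeps CPU use near zero during genuinely idle periods - and a guess that stays at the 10 ms end when work is actually arriving via shared memory is what makes the slow case slow.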

But if all of the above doesn’t sound interesting to you and you just want to 
move on with whatever task you wanted to accomplish:

Another approach on your side is to try using the Unix socket transport for 
the API. It won’t have the same problem, because an API message sent onto the 
Unix socket immediately wakes up VPP, so your API exchanges per second will 
still be quick even when VPP is idle.
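The immediate-wakeup property is easy to see with nothing but the stdlib (again a toy sketch, not VPP code): a thread blocked in select() on one end of a Unix socket pair wakes as soon as a byte lands, with no 10 ms timeout to wait out.

```python
import select
import socket
import threading
import time

# One end plays the API client, the other plays VPP blocked in epoll/select
# on the API socket.
a, b = socket.socketpair(socket.AF_UNIX, socket.SOCK_STREAM)
woke_at = []

def server():
    # Block until data is readable, like VPP waiting on the API socket.
    select.select([b], [], [])
    woke_at.append(time.monotonic())
    b.recv(64)  # drain the "API message"

t = threading.Thread(target=server)
t.start()
time.sleep(0.05)          # let the server thread reach select()
t0 = time.monotonic()
a.send(b"request")        # the API message
t.join()
latency = woke_at[0] - t0
print(f"wakeup latency: {latency * 1e6:.0f} us")

a.close()
b.close()
```

In vpp_papi terms this is the difference between the shared-memory transport and connecting over the API socket (typically /run/vpp/api.sock) - check your vpp_papi version for the exact connection parameters, since they have varied across releases.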

You can compare some of the behaviors also using my work in progress Rust code 
here, which includes a cli_inband benchmark:

https://github.com/ayourtch/vpp-api-transport

I haven’t gotten into any optimizations on the Rust side yet, but it shows 
that in the Unix socket case VPP won’t be the bottleneck for you - the Python 
interpreter speed will be.

Hope this helps either way! :)

(Side note: it’s odd that you are getting good performance with single-thread 
VPP - I think I had it slow even then when I tested from Rust... let me see if 
I can make a Rust benchmark that does the same as your example and then 
explore it a bit more... hopefully this weekend. I will reply-all on this 
thread when I have something of note to say...)

--a

> On 6 Mar 2021, at 06:17, Xuo Guoto <[email protected]> wrote:
> 
> Hi Andrew,
> 
> It does make things way faster. Now I am getting time in the range of 2.5 
> secs. Is this patch ready for prime time? 
> 
> X.
> 
> ‐‐‐‐‐‐‐ Original Message ‐‐‐‐‐‐‐
> On Tuesday, March 2, 2021 7:35 PM, Andrew Yourtchenko <[email protected]> 
> wrote:
> 
>> Hi Xuo,
>> 
>> I’ve seen a maybe related problem recently - could you try an image with 
>> https://gerrit.fd.io/r/c/vpp/+/31368 in it and see if you still have a 
>> similarly large difference or does it make things faster for you ?
>> 
>> --a
>> 
>>> On 25 Feb 2021, at 16:20, Xuo Guoto via lists.fd.io 
>>> <[email protected]> wrote:
>>> 
>>> Hi List,
>>> 
>>> We have been using policer_add_del and classify_add_del_session in single 
>>> threaded VPP (ie one main thread only) and both API were giving decent 
>>> performance, but after switching to multi thread VPP the performance seems 
>>> be drastically less.
>>> 
>>> To test this out a small test program was written which will add 10,000 
>>> policer and classify table entries and measure the speed.
>>> 
>>> In single threaded VPP the program took 2.19 sec while with 1 main and 2 
>>> worker threads it took 115.89 sec. The tests were conducted without any 
>>> traffic flowing through VPP.
>>> 
>>> The python test program too is attached for reference.
>>> 
>>> Platform and version are:
>>> 
>>> vpp# sh version
>>> vpp v21.01.0-1~gfa065f96d built by root on ubuntu20-04 at 
>>> 2021-02-24T09:00:32
>>> 
>>> vpp# sh cpu    
>>> Model name:               Intel(R) Xeon(R) Silver 4208 CPU @ 2.10GHz
>>> Microarch model (family): [0x6] Skylake ([0x55] Skylake X/SP) stepping 0x7
>>> Flags:                    sse3 pclmulqdq ssse3 sse41 sse42 avx rdrand avx2 
>>> pqm pqe avx512f rdseed aes avx512_vnni invariant_tsc
>>> Base frequency:           2.09 GHz
>>> vpp#
>>> 
>>> vpp# sh thread
>>> ID     Name                Type        LWP     Sched Policy (Priority)  
>>> lcore  Core   Socket State    
>>> 0      vpp_main                        4230    other (0)                1   
>>>    7      0     
>>> 1      vpp_wk_0            workers     4243    other (0)                2   
>>>    1      0     
>>> 2      vpp_wk_1            workers     4244    other (0)                3   
>>>    6      0     
>>> vpp#
>>> 
>>> corresponding classify table : classify table mask l3 ip4 src miss-next 
>>> drop memory-size 800M
>>> 
>>> Is this behavior expected? Can some thing be done to achieve performance 
>>> similar to single threaded VPP while running VPP with multiple threads?
>>> 
>>> X.
>>> <2101_api_test.py>
-=-=-=-=-=-=-=-=-=-=-=-
Links: You receive all messages sent to this group.
View/Reply Online (#18875): https://lists.fd.io/g/vpp-dev/message/18875
Mute This Topic: https://lists.fd.io/mt/80903834/21656
Group Owner: [email protected]
Unsubscribe: https://lists.fd.io/g/vpp-dev/unsub [[email protected]]
-=-=-=-=-=-=-=-=-=-=-=-