FWIW, I too have hit this issue. Basically VPP is designed to process a packet 
from rx to tx in the same thread. When downstream nodes run slower, the 
upstream rx node doesn't run, so the vector size in each frame naturally 
increases, and then the downstream nodes can benefit from "V" (i.e., processing 
multiple packets in one go).

This back-pressure from downstream does not occur when you hand-off from a fast 
thread to a slower thread, so you end up with many single packet frames and 
fill your hand-off queue.

The quick fix one tries then is to increase the queue size; however, this is 
not a great solution because you are still not taking advantage of the "V" in 
VPP. To really fit this back into the original design, one needs to somehow 
still be creating larger vectors in the hand-off frames.
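To put numbers on that (the 64-entry queue and 255-packet maximum vector come up later in this thread; the helper name is made up for illustration):

```c
/* The handoff queue counts frames, not packets, so its capacity in
   packets depends entirely on how full each frame is.  The helper name
   is made up for illustration; it is not a VPP API. */
static unsigned
queue_capacity_pkts (unsigned queue_nelts, unsigned pkts_per_frame)
{
  return queue_nelts * pkts_per_frame;
}
```

queue_capacity_pkts (64, 1) is only 64 packets, so a mostly idle producer congests the queue almost immediately, while queue_capacity_pkts (64, 255) is ~16k packets. Raising the queue size to, say, 2048 makes drops rarer, but every element is still a 1-packet frame, so the consumer never gets a big vector to process.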

TBH I think the right solution here is to not hand off frames, but instead 
switch to packet queues; on the handed-off side, the frames would then be 
constructed from the packet queues (basically creating another polling input 
node, but on the new thread).
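A minimal sketch of that idea, with every name invented (none of these are VPP APIs): the producer enqueues individual packets into a per-thread ring, and a polling input node on the destination thread drains up to a full vector per dispatch:

```c
/* Hypothetical single-producer/single-consumer packet ring; every name
   here is invented for illustration and is not a VPP API. */
enum { RING_SIZE = 4096, MAX_VECTOR = 255 };

typedef struct
{
  unsigned buf[RING_SIZE];
  unsigned head; /* written only by the producing thread */
  unsigned tail; /* written only by the consuming thread */
} pkt_ring_t;

/* Producer side: enqueue one packet index; a full ring means the caller
   drops the packet (the equivalent of today's congestion drop). */
static int
ring_push (pkt_ring_t *r, unsigned pkt)
{
  if (r->head - r->tail == RING_SIZE)
    return -1;
  r->buf[r->head % RING_SIZE] = pkt;
  r->head++;
  return 0;
}

/* Consumer side, the body of a polling input node: drain up to one full
   vector per dispatch, so any backlog turns into a few large vectors
   instead of many 1-packet frames. */
static unsigned
ring_poll (pkt_ring_t *r, unsigned *vec, unsigned max)
{
  unsigned n = 0;
  while (n < max && r->tail != r->head)
    {
      vec[n++] = r->buf[r->tail % RING_SIZE];
      r->tail++;
    }
  return n;
}
```

A real version would need release/acquire ordering on head/tail between threads, and ring_poll would feed the drained buffer indices into a normal frame; the point is only that re-vectorization happens on the consumer side, so a backlog of 300 single packets comes out as one 255-packet vector plus one 45-packet vector.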

Thanks,
Chris.

> On Nov 13, 2020, at 12:21 PM, Marcos - Mgiga <mar...@mgiga.com.br> wrote:
> 
> Understood. And what path did you take in order to analyse and monitor vector 
> rates ? Is there some specific command or log ?
> 
> Thanks
> 
> Marcos
> 
> -----Original Message-----
> From: vpp-dev@lists.fd.io <vpp-dev@lists.fd.io> On behalf of ksekera via []
> Sent: Friday, November 13, 2020 14:02
> To: Marcos - Mgiga <mar...@mgiga.com.br>
> Cc: Elias Rudberg <elias.rudb...@bahnhof.net>; vpp-dev@lists.fd.io
> Subject: Re: RES: [vpp-dev] Increasing NAT worker handoff frame queue size 
> NAT_FQ_NELTS to avoid congestion drops?
> 
> Not completely idle, more like medium load. Vector rates at which I saw 
> congestion drops were roughly 40 for the thread doing no work (just handoffs - 
> I hardcoded it this way for test purposes), and roughly 100 for the thread 
> picking up the packets and doing NAT.
> 
> What got me into infra investigation was the fact that once I was hitting 
> vector rates around 255, I did see packet drops, but no congestion drops.
> 
> HTH,
> Klement
> 
>> On 13 Nov 2020, at 17:51, Marcos - Mgiga <mar...@mgiga.com.br> wrote:
>> 
>> So you mean that this situation (congestion drops) is more likely to occur 
>> when the system is generally idle than when it is processing a large amount 
>> of traffic?
>> 
>> Best Regards
>> 
>> Marcos
>> 
>> -----Original Message-----
>> From: vpp-dev@lists.fd.io <vpp-dev@lists.fd.io> On behalf of Klement 
>> Sekera via lists.fd.io
>> Sent: Friday, November 13, 2020 12:15
>> To: Elias Rudberg <elias.rudb...@bahnhof.net>
>> Cc: vpp-dev@lists.fd.io
>> Subject: Re: [vpp-dev] Increasing NAT worker handoff frame queue size 
>> NAT_FQ_NELTS to avoid congestion drops?
>> 
>> Hi Elias,
>> 
>> I’ve already debugged this and came to the conclusion that it’s the infra 
>> which is the weak link. I was seeing congestion drops at mild load, but not 
>> at full load. Issue is that with handoff, there is uneven workload. For 
>> simplicity’s sake, just consider thread 1 handing off all the traffic to 
>> thread 2. What happens is that for thread 1 the job is much easier: it just 
>> does some ip4 parsing and then hands the packet to thread 2, which actually 
>> does the heavy lifting of hash inserts/lookups/translation etc. A 64-element 
>> queue can hold 64 frames: one extreme is 64 1-packet frames, totalling 64 
>> packets; the other extreme is 64 255-packet frames, totalling ~16k packets. 
>> What happens
>> is this: thread 1 is mostly idle and just picking a few packets from NIC and 
>> every one of these small frames creates an entry in the handoff queue. Now 
>> thread 2 picks one element from the handoff queue and deals with it before 
>> picking another one. If the queue has only 3-packet or 10-packet elements, 
>> then thread 2 can never really get into what VPP excels in - bulk processing.
>> 
>> Q: Why doesn’t it pick as many packets as possible from the handoff queue?
>> A: It’s not implemented.
>> 
>> I already wrote a patch for it, which made all congestion drops which I saw 
>> (in above synthetic test case) disappear. Mentioned patch 
>> https://gerrit.fd.io/r/c/vpp/+/28980 is sitting in gerrit.
>> 
>> Would you like to give it a try and see if it helps your issue? We
>> shouldn’t need big queues under mild loads anyway …
>> 
>> Regards,
>> Klement
>> 
>>> On 13 Nov 2020, at 16:03, Elias Rudberg <elias.rudb...@bahnhof.net> wrote:
>>> 
>>> Hello VPP experts,
>>> 
>>> We are using VPP for NAT44 and we get some "congestion drops", in a
>>> situation where we think VPP is far from overloaded in general. Then
>>> we started to investigate if it would help to use a larger handoff
>>> frame queue size. In theory at least, allowing a longer queue could
>>> help avoiding drops in case of short spikes of traffic, or if it
>>> happens that some worker thread is temporarily busy for whatever
>>> reason.
>>> 
>>> The NAT worker handoff frame queue size is hard-coded in the
>>> NAT_FQ_NELTS macro in src/plugins/nat/nat.h where the current value
>>> is 64. The idea is that putting a larger value there could help.
>>> 
>>> We have run some tests where we changed the NAT_FQ_NELTS value from
>>> 64 to a range of other values, each time rebuilding VPP and running
>>> an identical test, a test case that to some extent tries to mimic 
>>> our real traffic, although of course it is simplified. The test runs 
>>> many iperf3 tests simultaneously using TCP, combined with some UDP 
>>> traffic chosen to trigger VPP to create more new sessions (to make 
>>> the NAT "slowpath" happen more often).
>>> 
>>> The following NAT_FQ_NELTS values were tested:
>>> 16
>>> 32
>>> 64  <-- current value
>>> 128
>>> 256
>>> 512
>>> 1024
>>> 2048  <-- best performance in our tests
>>> 4096
>>> 8192
>>> 16384
>>> 32768
>>> 65536
>>> 131072
>>> 
>>> In those tests, performance was very bad for the smallest
>>> NAT_FQ_NELTS values of 16 and 32, while values larger than 64 gave
>>> improved performance. The best results in terms of throughput were
>>> seen for NAT_FQ_NELTS=2048. For even larger values than that, we got
>>> reduced performance compared to the 2048 case.
>>> 
>>> The tests were done for VPP 20.05 running on a Ubuntu 18.04 server
>>> with a 12-core Intel Xeon CPU and two Mellanox mlx5 network cards.
>>> The number of NAT threads was 8 in some of the tests and 4 in some of
>>> the tests.
>>> 
>>> According to these tests, the effect of changing NAT_FQ_NELTS can be
>>> quite large. For example, for one test case chosen such that
>>> congestion drops were a significant problem, the throughput increased
>>> from about 43 to 90 Gbit/second with the amount of congestion drops
>>> per second reduced to about one third. In another kind of test,
>>> throughput increased by about 20% with congestion drops reduced to
>>> zero. Of course such results depend a lot on how the tests are
>>> constructed. But anyway, it seems clear that the choice of
>>> NAT_FQ_NELTS value can be important and that increasing it would be
>>> good, at least for the kind of usage we have tested now.
>>> 
>>> Based on the above, we are considering changing NAT_FQ_NELTS from 64
>>> to a larger value and start trying that in our production environment
>>> (so far we have only tried it in a test environment).
>>> 
>>> Were there specific reasons for setting NAT_FQ_NELTS to 64?
>>> 
>>> Are there some potential drawbacks or dangers of changing it to a
>>> larger value?
>>> 
>>> Would you consider changing to a larger value in the official VPP
>>> code?
>>> 
>>> Best regards,
>>> Elias
>>> 
>>> 
>>> 
>>> 
>> 
>> 
> 
> 
> 
> 
> 

