> On 16 Oct 2023, at 14:48, Ilya Maximets wrote:
>
> On 10/6/23 20:10, Алексей Кашавкин via discuss wrote:
>> Hello!
>>
>> I am using OVS with DPDK in OpenStack. This is RDO+TripleO deployment with
>> the Train release. I am trying to measure the performance of the DPDK
>> compute node. I have created two VMs [1], one as a DUT with DPDK and one as
>> a traffic generator with SR-IOV [2]. Both of them are using Pktgen.
>>
>> Here is what happens: for the first 3-4 minutes the DUT receives
>> 2.6Gbit [3], after which the speed always drops to 400Mbit [4]. During
>> this time the output of `pmd-rxq-show` always shows load on only one of
>> the interfaces in the bond [5]. Occasionally, after the active interface
>> flaps, the DUT speed rises to 5Gbit and `pmd-rxq-show` starts showing
>> load on both interfaces [6], but again after 3-4 minutes the speed drops,
>> this time to 700Mbit, while I keep seeing the same load on both bond
>> interfaces in `pmd-rxq-show`. The logs show nothing except flapping [7]
>> of the bond interfaces, and the flapping has no effect on the speed drop
>> after 3-4 minutes of testing.
>>
>> After the speed drop, if I briefly send traffic from the DUT itself
>> toward the traffic generator [8] and then stop, the DUT speed recovers
>> to 2.6Gbit with traffic going through one interface, or 5Gbit with
>> traffic going through two interfaces, but again only for 3-4 minutes.
>> If I run the test with the traffic generator limited to 2.5Gbit or
>> 1Gbit, the speed at the DUT still drops after 4-5 minutes. I have set
>> logging to debug for bond, dpdk, netdev_dpdk and dpif_netdev, but saw
>> nothing that clarifies what is going on; it is also unclear why traffic
>> sometimes starts going through both bond interfaces after the active
>> interface flaps, as this happens rarely, not in every test.
>
> Since the rate is restored after you send some traffic in the backward
> direction, I'd say you have MAC learning somewhere on the path and
> it is expiring. For example, if you use the NORMAL action in one
> of the bridges, then once the MAC expires, the bridge starts flooding
> packets to all ports of the bridge, which is very slow. You can look
> at the datapath flow dump to confirm which actions are being executed
> on your packets: `ovs-appctl dpctl/dump-flows`.
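The difference between a learned unicast path and a flood is visible in the
`actions:` part of each dump-flows line: a learned flow outputs to a single
port, a flooded one lists every port of the bridge. A minimal sketch of
checking that (the flow line below is illustrative, not output from this
system):

```shell
# Illustrative datapath flow line (not from this system). A flooded flow
# lists every bridge port after "actions:"; a learned one lists just one.
flow='recirc_id(0),in_port(2),eth(src=fa:16:3e:00:00:01,dst=fa:16:3e:00:00:02),eth_type(0x0800) actions:1,3,4,5'

# Strip everything up to "actions:" and count the comma-separated ports.
actions=${flow##*actions:}
nports=$(printf '%s\n' "$actions" | awk -F',' '{print NF}')
echo "$nports"  # more than 1 output port suggests flooding
```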
>
> In general, you should continuously send some traffic in the backward
> direction so that learned MAC addresses do not expire. I'm not sure
> whether Pktgen does that these days, but it wasn't a very robust piece
> of software in the past.
Yes, that is exactly what is happening. I noticed that the DUT's MAC entry
expires in the bridge FDB table, and the DUT's MAC entry also expires in the
IP fabric. If both of these tables are cleared, performance drops; the speed
does not drop as long as at least one of the tables still holds the MAC entry.
Now, when performance drops, I check the FDB tables, and if the MAC is indeed
missing from them, I send a single ping packet from the DUT VM (with pktgen)
toward the traffic generator, after which the MAC is learned again and the
speed is restored.
Thank you, Ilya.
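For reference, the OVS side of this can be inspected and tuned directly; the
bridge name br-int is an assumption here (typical for a TripleO deployment),
and the aging value is just an example:

```shell
# Show learned MACs on the bridge together with their age in seconds;
# an entry close to the aging limit is about to expire and cause flooding.
ovs-appctl fdb/show br-int

# The default MAC aging time is 300 seconds; raising it (value in seconds)
# lets entries survive longer between reverse packets.
ovs-vsctl set Bridge br-int other_config:mac-aging-time=3600
```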
>>
>> [4] The flapping of the interface through which traffic goes to the DUT
>> VM is probably caused by it carrying the load alone in the bond, so that
>> no LACP PDU packets get through to or from it. The log shows that it is
>> down for 30 seconds because the LACP rate is set to slow mode.
>
> Dropped LACP packets can indeed cause bond flapping. The only way to
> fix that in older versions of OVS is to reduce the load. With OVS 3.2
> you may try the experimental 'rx-steering' configuration, which was
> designed exactly for this scenario and should ensure that LACP PDUs
> are not dropped.
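For reference, rx-steering is set per DPDK interface; a sketch, assuming an
interface named dpdk0 (the option itself is documented for OVS 3.2):

```shell
# Steer LACP control packets to a dedicated rx queue so they are not
# dropped when the data queues are overloaded (OVS 3.2+, experimental).
ovs-vsctl set Interface dpdk0 other_config:rx-steering=rss+lacp

# Revert to the default behavior (plain RSS):
ovs-vsctl remove Interface dpdk0 other_config rx-steering
```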
>
> Also, balancing depends on packet hashes, so you need to send many
> different traffic flows in order to get consistent balancing.
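To illustrate why a single flow cannot balance: the bond member is chosen by
hashing packet header fields, so one 5-tuple always lands on the same member,
while many distinct flows spread across members. A toy sketch (the hash here
is cksum's CRC, purely illustrative, not the hash OVS actually uses):

```shell
# Toy flow hash: map a 4-tuple onto one of two bond members.
bond_member() {
  printf '%s' "$1,$2,$3,$4" | cksum | awk '{print $1 % 2}'
}

# One flow: every packet hashes to the same member.
single=$(for i in $(seq 1 20); do
  bond_member 10.0.0.1 10.0.0.2 1234 5678
done | sort -u | wc -l | tr -d ' ')
echo "members used by one flow: $single"  # 1 (always the same member)

# 100 flows (varying source port): the load spreads over both members.
many=$(for p in $(seq 1000 1099); do
  bond_member 10.0.0.1 10.0.0.2 "$p" 5678
done | sort -u | wc -l | tr -d ' ')
echo "members used by 100 flows: $many"
```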
>
>>
>> I have tried the DUT on different OSes, with different versions of DPDK
>> and Pktgen, but the same thing always happens: after 3-4 minutes the
>> speed drops. Only the DPDK compute node itself I left unchanged. The
>> compute node has an Intel E810 network card with 25Gbit ports and an
>> Intel Xeon Gold 6230R CPU. The PMD threads use cores 11, 21, 63, 73 on
>> NUMA 0 and 36, 44, 88, 96 on NUMA 1.
>
> All in all, 2.6Gbps seems like a low number for the type of system
> you have. You might have some other configuration issues.
This figure is probably related to the 64-byte TCP packet size: the traffic
generator sends 64-byte frames.
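For a sense of scale, assuming the Gbit figures count bits on the wire: a
minimum-size Ethernet frame costs 64 + 8 (preamble/SFD) + 12 (inter-frame
gap) = 84 bytes, i.e. 672 bits per frame, so 2.6Gbit/s corresponds to
roughly 3.9 Mpps, while 25Gbit/s line rate would be about 37.2 Mpps:

```shell
# Packet rate in Mpps for a given link speed (Gbit/s) at 64B frames:
# 64B frame + 8B preamble/SFD + 12B inter-frame gap = 672 bits on the wire.
mpps() { awk -v g="$1" 'BEGIN { printf "%.1f\n", g * 1e9 / 672 / 1e6 }'; }

mpps 2.6    # ~3.9 Mpps at the observed rate
mpps 25.0   # ~37.2 Mpps at line rate
```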