Hi Harry,
Thanks for the reply; please see my responses inline.
> On Aug 7, 2018, at 4:34 AM, Van Haaren, Harry
> wrote:
>
> Hi Tony,
>
>> -----Original Message-----
>> From: users [mailto:users-boun...@dpdk.org] On Behalf Of Anthony Hart
>> Sent: Sunday, August 5, 2018 8:03 PM
>> To: users@dpdk.org
>> Subject: [dpdk-users] eventdev performance
>>
>> I’ve been doing some performance measurements with the eventdev_pipeline
>> example application (to see how the eventdev library performs - dpdk 18.05)
>> and I’m looking for some help in determining where the bottlenecks are in my
>> testing.
>
> If you have the "perf top" tool available, it is very useful in printing
> statistics
> of where CPU cycles are spent during runtime. I use it regularly to identify
> bottlenecks in the code for specific lcores.
Yes, I have perf; if there is something you’d like to see, I can post it.
>
>
>> I have 1 Rx, 1 Tx, 1 Scheduler and N worker cores (1 sw_event0 device). In
>> this configuration performance tops out with 3 workers (6 cores total) and
>> adding more workers actually causes a reduction in throughput. In my setup
>> this is about 12Mpps. The same setup running testpmd will reach >25Mpps
>> using only 1 core.
>
> Raw forwarding of a packet is less work than forwarding and load-balancing
> across multiple cores. More work means more CPU cycles spent per packet,
> hence less mpps.
ok.
>
>
>> This is the eventdev command line.
>> eventdev_pipeline -l 0,1-6 -w 02:00.0 --vdev event_sw0 -- -r2 -t4 -e8 -w70 -s1 -n0 -c128 -W0 -D
>
> The -W0 indicates to perform zero cycles of work on each worker core.
> This makes each of the 3 worker cores very fast in returning work to the
> scheduler core, and puts extra pressure on the scheduler. Note that in a
> real-world use-case you presumably want to do work on each of the worker
> cores, so the command above (while valid for understanding how it works,
> and performance of certain things) is not expected to be used in production.
>
> I'm not sure how familiar you are with CPU caches, but it is worth
> understanding
> that reading this "locally" from L1 or L2 cache is very fast compared to
> communicating with another core.
>
> Given that with -W0 the worker cores are very fast, the scheduler can rarely
> read data locally - it always has to communicate with other cores.
>
> Try adding -W1000 or so, and perhaps 4 worker cores. The 1000 cycles of work
> per event mimic doing actual work on each event.
Adding work with -W reduces performance.
I modified eventdev_pipeline to print the contents of
rte_event_eth_rx_adapter_stats for the device; in particular, I print the
rx_enq_retry and rx_poll_count values for the receive thread. Once I reach
a load level where packets are dropped, I see that the number of retries equals
or exceeds the poll count (as I increase the load, the retries exceed the poll
count).
I think this indicates that the scheduler is not keeping up. That could be (I
assume) because the workers are not consuming fast enough. However, if I
increase the number of workers, then the ratio of retries to poll count (in the RX
thread) goes up; for example, adding 4 more workers brings the retries:poll
ratio to 5:1.
This seems to indicate that the scheduler is the bottleneck?
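In code terms, the check amounts to something like the following (a hedged sketch: the struct is a local stand-in mirroring the rx_poll_count and rx_enq_retry fields of DPDK's rte_event_eth_rx_adapter_stats, not the DPDK definition itself):

```c
#include <stdint.h>

/* Stand-in mirroring two fields of rte_event_eth_rx_adapter_stats. */
struct rx_adapter_stats {
    uint64_t rx_poll_count; /* number of RX adapter polls */
    uint64_t rx_enq_retry;  /* enqueues toward the scheduler that had to retry */
};

/* Nonzero when retries meet or exceed polls, i.e. the RX core spends at
 * least as much effort retrying enqueues as polling the NIC -- a sign it
 * is blocked waiting on the scheduler rather than starved of packets. */
static int scheduler_backpressured(const struct rx_adapter_stats *s)
{
    return s->rx_enq_retry >= s->rx_poll_count;
}
```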
>
>
>> This is the testpmd command line.
>> testpmd -w 02:00.0 -l 0,1 -- -i --nb-cores=1 --numa --rxq 1 --txq 1 --port-topology=loop
>>
>>
>> I’m guessing that its either the RX or Sched that’s the bottleneck in my
>> eventdev_pipeline setup.
>
> Given that you state testpmd is capable of forwarding at >25 mpps on your
> platform it is safe to rule out RX, since testpmd is performing the RX in
> that forwarding workload.
>
> Which leaves the scheduler - and indeed the scheduler is probably what is
> the limiting factor in this case.
yes seems so.
>
>
>> So I first tried to use 2 cores for RX (-r6), but performance went down. It
>> seems that configuring 2 RX cores still only sets up 1 h/w receive ring, and
>> access to that one ring is alternated between the two cores? So that
>> doesn’t help.
>
> Correct - it is invalid to use two CPU cores on a single RX queue without
> some form of serialization (otherwise it causes race-conditions). The
> eventdev_pipeline sample app helpfully provides that - but there is a
> performance
> impact on doing so. Using two RX threads on a single RX queue is generally
> not recommended.
>
>
>> Next, I could use 2 scheduler cores, but how does that work, do they again
>> alternate? In any case throughput is reduced by 50% in that test.
>
> Yes, for the same reason. The event_sw0 PMD does not allow multiple threads
> to run it at the same time, and hence the serialization is in place to ensure
> that the results are valid.
>
>
>> thanks for any insights,
>> tony
>
> Try the suggestion above of adding work to the worker cores - this should
> "balance out" the current scheduling bottleneck.