Re: [dpdk-users] eventdev performance

2018-08-15 Thread Van Haaren, Harry
> From: Anthony Hart [mailto:ah...@domainhart.com]
> Sent: Thursday, August 9, 2018 4:56 PM
> To: Van Haaren, Harry 
> Cc: users@dpdk.org
> Subject: Re: [dpdk-users] eventdev performance
> 
> Hi Harry,
> Thanks for the reply, please see responses inline
> 
> > On Aug 7, 2018, at 4:34 AM, Van Haaren, Harry 
> wrote:
> >
> > Hi Tony,
> >
> >> -Original Message-
> >> From: users [mailto:users-boun...@dpdk.org] On Behalf Of Anthony Hart
> >> Sent: Sunday, August 5, 2018 8:03 PM
> >> To: users@dpdk.org
> >> Subject: [dpdk-users] eventdev performance
> >>
> >> I’ve been doing some performance measurements with the eventdev_pipeline
> >> example application (to see how the eventdev library performs - dpdk 18.05)
> >> and I’m looking for some help in determining where the bottlenecks are in
> >> my testing.
> >
> > If you have the "perf top" tool available, it is very useful in printing
> > statistics of where CPU cycles are spent during runtime. I use it regularly
> > to identify bottlenecks in the code for specific lcores.
> 
> Yes I have perf if there is something you’d like to see I can post it.

I'll check the rest of your email first. Generally I use perf to check whether
the cycles being spent on each core are where I expect them to be. In this case,
it might be useful to look at the scheduler core and see where it is spending
its time.
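
As a complement to perf, the sw PMD also exposes counters through the eventdev
xstats API, which can hint at what the scheduler is doing. A rough sketch of
dumping the device-level xstats for a given dev_id (based on the 18.05 eventdev
API; not taken from the sample app):

#include <stdio.h>
#include <inttypes.h>
#include <rte_eventdev.h>

/* Sketch only: print all device-level xstats of an eventdev.  For the sw
 * PMD these include scheduler call and credit counters, which complement
 * what perf top shows on the scheduler lcore. */
static void
dump_eventdev_xstats(uint8_t dev_id)
{
        struct rte_event_dev_xstats_name names[512];
        unsigned int ids[512];
        uint64_t values[512];
        int i, n;

        n = rte_event_dev_xstats_names_get(dev_id, RTE_EVENT_DEV_XSTATS_DEVICE,
                        0, names, ids, 512);
        if (n <= 0 || n > 512)
                return;

        if (rte_event_dev_xstats_get(dev_id, RTE_EVENT_DEV_XSTATS_DEVICE,
                        0, ids, values, n) != n)
                return;

        for (i = 0; i < n; i++)
                printf("%s: %" PRIu64 "\n", names[i].name, values[i]);
}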


> >> I have 1 Rx, 1 Tx, 1 Scheduler and N worker cores (1 sw_event0 device).
> >> In this configuration performance tops out with 3 workers (6 cores total)
> >> and adding more workers actually causes a reduction in throughput.  In my
> >> setup this is about 12Mpps.  The same setup running testpmd will
> >> reach >25Mpps using only 1 core.
> >
> > Raw forwarding of a packet is less work than forwarding and load-balancing
> > across multiple cores. More work means more CPU cycles spent per packet,
> > hence less mpps.
> 
> ok.
> 
> >
> >
> >> This is the eventdev command line.
> >> eventdev_pipeline -l 0,1-6 -w:02:00.0 --vdev event_sw0 -- -r2 -t4 -e8
> >> -w70 -s1 -n0 -c128 -W0 -D
> >
> > The -W0 indicates to perform zero cycles of work on each worker core.
> > This makes each of the 3 worker cores very fast in returning work to the
> > scheduler core, and puts extra pressure on the scheduler. Note that in a
> > real-world use-case you presumably want to do work on each of the worker
> > cores, so the command above (while valid for understanding how it works,
> > and performance of certain things) is not expected to be used in production.
> >
> > I'm not sure how familiar you are with CPU caches, but it is worth
> > understanding that reading this "locally" from L1 or L2 cache is very fast
> > compared to communicating with another core.
> >
> > Given that with -W0 the worker cores are very fast, the scheduler can rarely
> > read data locally - it always has to communicate with other cores.
> >
> > Try adding -W1000 or so, and perhaps 4 worker cores. The 1000 cycles of work
> > per event mimic doing actual work on each event.
> 
> Adding work with -W reduces performance.

OK - that means that the worker cores are at least part of the bottleneck.
If they were very idle, adding some work to them would not have changed
the performance.
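
For context, the -W option essentially makes each worker burn N TSC cycles per
event between dequeue and enqueue. A simplified sketch of that pattern (my own
paraphrase, not the sample app's exact code; dev_id, port_id and work_cycles
are placeholders, and selecting the next queue for the forwarded event is
omitted):

#include <rte_eventdev.h>
#include <rte_cycles.h>
#include <rte_pause.h>

/* Simplified worker loop: burn work_cycles TSC cycles per event to stand
 * in for real processing, roughly what -W N simulates. */
static void
worker_loop(uint8_t dev_id, uint8_t port_id, uint64_t work_cycles)
{
        struct rte_event ev[32];
        uint16_t i, nb;

        for (;;) {
                nb = rte_event_dequeue_burst(dev_id, port_id, ev, 32, 0);
                for (i = 0; i < nb; i++) {
                        uint64_t start = rte_rdtsc();

                        while (rte_rdtsc() - start < work_cycles)
                                rte_pause();        /* simulated work */
                        ev[i].op = RTE_EVENT_OP_FORWARD;
                }
                if (nb > 0)
                        rte_event_enqueue_burst(dev_id, port_id, ev, nb);
        }
}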

> I modified eventdev_pipeline to print the contents of
> rte_event_eth_rx_adapter_stats for the device.  In particular I print the
> rx_enq_retry and rx_poll_count values for the receive thread.  Once I get
> to a load level where packets are dropped I see that the number of retries
> equals or exceeds the poll count (as I increase the load, the retries exceed
> the poll count).
>
> I think this indicates that the Scheduler is not keeping up.  That could be
> (I assume) because the workers are not consuming fast enough.  However if I
> increase the number of workers then the ratio of retry to poll_count (in the
> rx thread) goes up; for example, after adding 4 more workers the retries:poll
> ratio becomes 5:1.
>
> Seems like this is indicating that the Scheduler is the bottleneck?

So I gather you have prototyped the pipeline you want to run with the
eventdev_pipeline sample app? Would you share the command line you are using
with it, and I can try to reproduce and understand.

One of the easiest mistakes to make (and one I make regularly :) is letting the
RX/TX/Sched cores overlap, which causes excessive work to be performed on one
thread and reduces overall performance.
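
For reference, the counters Tony prints are available through the public
adapter stats call, so no internal structures need to be modified. A minimal
sketch (assuming the RX adapter id is 0):

#include <stdio.h>
#include <inttypes.h>
#include <rte_event_eth_rx_adapter.h>

/* Sketch: read the RX adapter counters discussed above (adapter id 0
 * assumed).  A retry count at or above the poll count means the adapter's
 * enqueues into the eventdev are being back-pressured. */
static void
print_rx_adapter_stats(void)
{
        struct rte_event_eth_rx_adapter_stats stats;

        if (rte_event_eth_rx_adapter_stats_get(0, &stats) == 0)
                printf("rx_poll_count=%" PRIu64 " rx_enq_retry=%" PRIu64 "\n",
                       stats.rx_poll_count, stats.rx_enq_retry);
}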


> >> This is the tested command line.

Re: [dpdk-users] eventdev performance

2018-08-09 Thread Anthony Hart
Hi Harry,
Thanks for the reply, please see responses inline

> On Aug 7, 2018, at 4:34 AM, Van Haaren, Harry  
> wrote:
> 
> Hi Tony,
> 
>> -Original Message-
>> From: users [mailto:users-boun...@dpdk.org] On Behalf Of Anthony Hart
>> Sent: Sunday, August 5, 2018 8:03 PM
>> To: users@dpdk.org
>> Subject: [dpdk-users] eventdev performance
>> 
>> I’ve been doing some performance measurements with the eventdev_pipeline
>> example application (to see how the eventdev library performs - dpdk 18.05)
>> and I’m looking for some help in determining where the bottlenecks are in my
>> testing.
> 
> If you have the "perf top" tool available, it is very useful in printing 
> statistics
> of where CPU cycles are spent during runtime. I use it regularly to identify
> bottlenecks in the code for specific lcores.

Yes I have perf if there is something you’d like to see I can post it.  

> 
> 
>> I have 1 Rx, 1 Tx, 1 Scheduler and N worker cores (1 sw_event0 device).   In
>> this configuration performance tops out with 3 workers (6 cores total) and
>> adding more workers actually causes a reduction in throughput.   In my setup
>> this is about 12Mpps.   The same setup running testpmd will reach >25Mpps
>> using only 1 core.
> 
> Raw forwarding of a packet is less work than forwarding and load-balancing
> across multiple cores. More work means more CPU cycles spent per packet, 
> hence less mpps.

ok.  

> 
> 
>> This is the eventdev command line.
>> eventdev_pipeline -l 0,1-6 -w:02:00.0 --vdev event_sw0 -- -r2 -t4 -e8
>> -w70 -s1 -n0 -c128 -W0 -D
> 
> The -W0 indicates to perform zero cycles of work on each worker core.
> This makes each of the 3 worker cores very fast in returning work to the
> scheduler core, and puts extra pressure on the scheduler. Note that in a
> real-world use-case you presumably want to do work on each of the worker
> cores, so the command above (while valid for understanding how it works,
> and performance of certain things) is not expected to be used in production.
> 
> I'm not sure how familiar you are with CPU caches, but it is worth 
> understanding
> that reading this "locally" from L1 or L2 cache is very fast compared to
> communicating with another core.
> 
> Given that with -W0 the worker cores are very fast, the scheduler can rarely
> read data locally - it always has to communicate with other cores.
> 
> Try adding -W1000 or so, and perhaps 4 worker cores. The 1000 cycles of work
> per event mimic doing actual work on each event. 

Adding work with -W reduces performance.

I modified eventdev_pipeline to print the contents of
rte_event_eth_rx_adapter_stats for the device.  In particular I print the
rx_enq_retry and rx_poll_count values for the receive thread.  Once I get to
a load level where packets are dropped I see that the number of retries equals
or exceeds the poll count (as I increase the load, the retries exceed the poll
count).

I think this indicates that the Scheduler is not keeping up.  That could be (I
assume) because the workers are not consuming fast enough.  However if I
increase the number of workers then the ratio of retry to poll_count (in the rx
thread) goes up; for example, after adding 4 more workers the retries:poll ratio
becomes 5:1.

Seems like this is indicating that the Scheduler is the bottleneck?


> 
> 
>> This is the tested command line.
>> testpmd -w:02:00.0 -l 0,1 -- -i --nb-core 1 --numa --rxq 1 --txq 1
>> --port-topology=loop
>> 
>> 
>> I’m guessing that its either the RX or Sched that’s the bottleneck in my
>> eventdev_pipeline setup.
> 
> Given that you state testpmd is capable of forwarding at >25 mpps on your
> platform it is safe to rule out RX, since testpmd is performing the RX in
> that forwarding workload.
> 
> Which leaves the scheduler - and indeed the scheduler is probably what is
> the limiting factor in this case.

yes seems so.

> 
> 
>> So I first tried to use 2 cores for RX (-r6), performance went down.   It
>> seems that configuring 2 RX cores still only sets up 1 h/w receive ring and
>> access to that one ring is alternated between the two cores?So that
>> doesn’t help.
> 
> Correct - it is invalid to use two CPU cores on a single RX queue without
> some form of serialization (otherwise it causes race-conditions). The
> eventdev_pipeline sample app helpfully provides that - but there is a 
> performance
> impact on doing so. Using two RX threads on a single RX queue is generally
> not recommended.
> 
> 
>> Next, I could use 2 scheduler cores,  but how does that work, do they again
>> alternate?   In any case throughput is reduced by 50% in that test.
> 
> Yes, for the same reason. The event_sw0 PMD does not allow multiple threads
> to run it at the same time, and hence the serialization is in place to ensure
> that the results are valid.
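
For what it's worth, this single-writer constraint is also visible in how the
sw PMD's scheduling logic is hooked up: it is exposed as a DPDK service, and
one way to run it is to map it to a single dedicated service lcore. A rough
sketch of that setup (service-core API as of 18.05; not necessarily how the
sample app does it, and the lcore id passed in is arbitrary):

#include <rte_eventdev.h>
#include <rte_service.h>

/* Sketch: run the sw eventdev's scheduling service on one service lcore.
 * The service is not multi-thread safe, so mapping it to more lcores only
 * serializes the calls rather than parallelizing the scheduler. */
static int
setup_sched_service(uint8_t dev_id, uint32_t sched_lcore)
{
        uint32_t service_id;

        if (rte_event_dev_service_id_get(dev_id, &service_id) != 0)
                return -1;  /* this PMD does not expose a scheduling service */

        rte_service_lcore_add(sched_lcore);
        rte_service_map_lcore_set(service_id, sched_lcore, 1);
        rte_service_runstate_set(service_id, 1);
        rte_service_lcore_start(sched_lcore);
        return 0;
}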
> 
> 
>> thanks for any insights,
>> tony
> 
> Try the suggestion above of adding work to the worker cores - this should
> "balance out" the current scheduling bottleneck,