Re: [PATCH RFC 11/11] net/mlx5e: XDP TX xmit more

2016-09-08 Thread Eric Dumazet
On Thu, 2016-09-08 at 11:48 -0700, Rick Jones wrote:

> With small packets and the "default" ring size for this NIC/driver 
> combination, is the BQL large enough that the ring fills before one hits 
> the BQL?

It depends on how TX completion (the NAPI handler) is implemented in the
driver: that is, how many packets can be dequeued by each invocation.

Drivers have a lot of variation there.
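
For reference, the variation Eric is pointing at is the per-invocation
TX-completion budget in the driver's NAPI poll. A minimal, generic sketch
(illustrative only; everything prefixed my_ is a placeholder, not mlx5e code,
and only netdev_tx_completed_queue() is the real BQL API):

#include <linux/netdevice.h>

/* Generic TX-completion loop as many drivers structure it (sketch only).
 * The per-invocation budget is the driver-specific knob in question.
 */
static bool my_poll_tx_cq(struct my_tx_ring *ring, int budget)
{
        unsigned int pkts = 0, bytes = 0;

        while (pkts < budget && my_tx_desc_done(ring)) {
                bytes += my_free_tx_desc(ring);   /* reclaim one descriptor */
                pkts++;
        }

        /* Report completions to BQL; this is what releases queue space. */
        netdev_tx_completed_queue(ring->txq, pkts, bytes);

        return pkts == budget;   /* true => more work, stay scheduled */
}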





Re: [PATCH RFC 11/11] net/mlx5e: XDP TX xmit more

2016-09-08 Thread Rick Jones

On 09/08/2016 11:16 AM, Tom Herbert wrote:

On Thu, Sep 8, 2016 at 10:19 AM, Jesper Dangaard Brouer
 wrote:

On Thu, 8 Sep 2016 09:26:03 -0700
Tom Herbert  wrote:

Shouldn't qdisc bulk size be based on the BQL limit? What is the
simple algorithm to apply to in-flight packets?


Maybe the algorithm is not so simple, and we likely also have to take
BQL bytes into account.

The reason for wanting packets-in-flight is because we are attacking a
transaction cost.  The tailptr/doorbell cost around 70ns.  (Based on
data in this patch desc, 4.9Mpps -> 7.5Mpps (1/4.90-1/7.5)*1000 =
70.74). The 10G wirespeed small packets budget is 67.2ns, this with
fixed overhead per packet of 70ns we can never reach 10G wirespeed.


But you should be able to do this with BQL and it is more accurate.
BQL tells how many bytes need to be sent and that can be used to
create a bulk of packets to send with one doorbell.


With small packets and the "default" ring size for this NIC/driver
combination, is the BQL limit large enough that the ring fills before one
hits the limit?


rick jones



Re: [PATCH RFC 11/11] net/mlx5e: XDP TX xmit more

2016-09-08 Thread Tom Herbert
On Thu, Sep 8, 2016 at 10:19 AM, Jesper Dangaard Brouer
 wrote:
> On Thu, 8 Sep 2016 09:26:03 -0700
> Tom Herbert  wrote:
>
>> On Wed, Sep 7, 2016 at 10:11 PM, Jesper Dangaard Brouer
>>  wrote:
>> >
>> > On Wed, 7 Sep 2016 20:21:24 -0700 Tom Herbert  wrote:
>> >
>> >> On Wed, Sep 7, 2016 at 7:58 PM, John Fastabend  
>> >> wrote:
>> >> > On 16-09-07 11:22 AM, Jesper Dangaard Brouer wrote:
>> >> >>
>> >> >> On Wed, 7 Sep 2016 19:57:19 +0300 Saeed Mahameed 
>> >> >>  wrote:
>> >> >>> On Wed, Sep 7, 2016 at 6:32 PM, Eric Dumazet  
>> >> >>> wrote:
>> >>  On Wed, 2016-09-07 at 18:08 +0300, Saeed Mahameed wrote:
>> >> > On Wed, Sep 7, 2016 at 5:41 PM, Eric Dumazet 
>> >> >  wrote:
>> >> >> On Wed, 2016-09-07 at 15:42 +0300, Saeed Mahameed wrote:
>> >> >> [...]
>> >> 
>> >>  Only if a qdisc is present and pressure is high enough.
>> >> 
>> >>  But in a forwarding setup, we likely receive at a lower rate than the
>> >>  NIC can transmit.
>> >> >>
>> >> >> Yes, I can confirm this happens in my experiments.
>> >> >>
>> >> 
>> >> >>>
>> >> >>> Jesper has a similar Idea to make the qdisc think it is under
>> >> >>> pressure, when the device TX ring is idle most of the time, i think
>> >> >>> his idea can come in handy here. I am not fully involved in the
>> >> >>> details, maybe he can elaborate more.
>> >> >>>
>> >> >>> But if it works, it will be transparent to napi, and xmit more will
>> >> >>> happen by design.
>> >> >>
>> >> >> Yes. I have some ideas around getting more bulking going from the qdisc
>> >> >> layer, by having the drivers provide some feedback to the qdisc layer
>> >> >> indicating xmit_more should be possible.  This will be a topic at the
>> >> >> Network Performance Workshop[1] at NetDev 1.2, I have will hopefully
>> >> >> challenge people to come up with a good solution ;-)
>> >> >>
>> >> >
>> >> > One thing I've noticed but haven't yet actually analyzed much is if
>> >> > I shrink the nic descriptor ring size to only be slightly larger than
>> >> > the qdisc layer bulking size I get more bulking and better perf numbers.
>> >> > At least on microbenchmarks. The reason being the nic pushes back more
>> >> > on the qdisc. So maybe a case for making the ring size in the NIC some
>> >> > factor of the expected number of queues feeding the descriptor ring.
>> >> >
>> >
>> > I've also played with shrink the NIC descriptor ring size, it works,
>> > but it is an ugly hack to get NIC pushes backs, and I foresee it will
>> > hurt normal use-cases. (There are other reasons for shrinking the ring
>> > size like cache usage, but that is unrelated to this).
>> >
>> >
>> >> BQL is not helping with that?
>> >
>> > Exactly. But the BQL _byte_ limit is not what is needed, what we need
>> > to know is the _packets_ currently "in-flight".  Which Tom already have
>> > a patch for :-)  Once we have that the algorithm is simple.
>> >
>> > Qdisc dequeue look at BQL pkts-in-flight, if driver have "enough"
>> > packets in-flight, the qdisc start it's bulk dequeue building phase,
>> > before calling the driver. The allowed max qdisc bulk size should
>> > likely be related to pkts-in-flight.
>> >
>> Sorry, I'm still missing it. The point of BQL is that we minimize the
>> amount of data (and hence number of packets) that needs to be queued
>> in the device in order to prevent the link from going idle while there
>> are outstanding packets to be sent. The algorithm is based on counting
>> bytes not packets because bytes are roughly an equal cost unit of
>> work. So if we've queued 100K of bytes on the queue we know how long
>> that takes around 80 usecs @10G, but if we count packets then we
>> really don't know much about that. 100 packets enqueued could
>> represent 6400 bytes or 6400K worth of data so time to transmit is
>> anywhere from 5usecs to 5msecs
>>
>> Shouldn't qdisc bulk size be based on the BQL limit? What is the
>> simple algorithm to apply to in-flight packets?
>
> Maybe the algorithm is not so simple, and we likely also have to take
> BQL bytes into account.
>
> The reason for wanting packets-in-flight is because we are attacking a
> transaction cost.  The tailptr/doorbell cost around 70ns.  (Based on
> data in this patch desc, 4.9Mpps -> 7.5Mpps (1/4.90-1/7.5)*1000 =
> 70.74). The 10G wirespeed small packets budget is 67.2ns, this with
> fixed overhead per packet of 70ns we can never reach 10G wirespeed.
>
But you should be able to do this with BQL, and it is more accurate.
BQL tells you how many bytes need to be sent, and that can be used to
build a bulk of packets to send with one doorbell.
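
A rough sketch of that suggestion (illustrative only; dql_avail() and
netdev_tx_sent_queue() are the real BQL helpers, everything prefixed my_ is
a placeholder, and this is not how mlx5e works today):

#include <linux/netdevice.h>
#include <linux/dynamic_queue_limits.h>

static void my_bulk_xmit(struct my_tx_ring *ring, struct netdev_queue *txq)
{
        struct sk_buff *skb;

        /* Keep filling descriptors while BQL still has byte budget. */
        while (dql_avail(&txq->dql) > 0 && (skb = my_dequeue_next(ring))) {
                my_post_tx_desc(ring, skb);           /* no doorbell yet       */
                netdev_tx_sent_queue(txq, skb->len);  /* BQL byte accounting   */
        }

        my_ring_doorbell(ring);                       /* one doorbell per bulk */
}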

> The idea/algo is trying to predict the future.  If we see a given/high
> packet rate, which equals a high transaction cost, then lets try not
> calling the driver, and instead backlog the packet in 

Re: [PATCH RFC 11/11] net/mlx5e: XDP TX xmit more

2016-09-08 Thread Jesper Dangaard Brouer
On Thu, 8 Sep 2016 09:26:03 -0700
Tom Herbert  wrote:

> On Wed, Sep 7, 2016 at 10:11 PM, Jesper Dangaard Brouer
>  wrote:
> >
> > On Wed, 7 Sep 2016 20:21:24 -0700 Tom Herbert  wrote:
> >  
> >> On Wed, Sep 7, 2016 at 7:58 PM, John Fastabend  
> >> wrote:  
> >> > On 16-09-07 11:22 AM, Jesper Dangaard Brouer wrote:  
> >> >>
> >> >> On Wed, 7 Sep 2016 19:57:19 +0300 Saeed Mahameed 
> >> >>  wrote:  
> >> >>> On Wed, Sep 7, 2016 at 6:32 PM, Eric Dumazet  
> >> >>> wrote:  
> >>  On Wed, 2016-09-07 at 18:08 +0300, Saeed Mahameed wrote:  
> >> > On Wed, Sep 7, 2016 at 5:41 PM, Eric Dumazet 
> >> >  wrote:  
> >> >> On Wed, 2016-09-07 at 15:42 +0300, Saeed Mahameed wrote:  
> >> >> [...]  
> >> 
> >>  Only if a qdisc is present and pressure is high enough.
> >> 
> >>  But in a forwarding setup, we likely receive at a lower rate than the
> >>  NIC can transmit.  
> >> >>
> >> >> Yes, I can confirm this happens in my experiments.
> >> >>  
> >>   
> >> >>>
> >> >>> Jesper has a similar Idea to make the qdisc think it is under
> >> >>> pressure, when the device TX ring is idle most of the time, i think
> >> >>> his idea can come in handy here. I am not fully involved in the
> >> >>> details, maybe he can elaborate more.
> >> >>>
> >> >>> But if it works, it will be transparent to napi, and xmit more will
> >> >>> happen by design.  
> >> >>
> >> >> Yes. I have some ideas around getting more bulking going from the qdisc
> >> >> layer, by having the drivers provide some feedback to the qdisc layer
> >> >> indicating xmit_more should be possible.  This will be a topic at the
> >> >> Network Performance Workshop[1] at NetDev 1.2, I have will hopefully
> >> >> challenge people to come up with a good solution ;-)
> >> >>  
> >> >
> >> > One thing I've noticed but haven't yet actually analyzed much is if
> >> > I shrink the nic descriptor ring size to only be slightly larger than
> >> > the qdisc layer bulking size I get more bulking and better perf numbers.
> >> > At least on microbenchmarks. The reason being the nic pushes back more
> >> > on the qdisc. So maybe a case for making the ring size in the NIC some
> >> > factor of the expected number of queues feeding the descriptor ring.
> >> >  
> >
> > I've also played with shrink the NIC descriptor ring size, it works,
> > but it is an ugly hack to get NIC pushes backs, and I foresee it will
> > hurt normal use-cases. (There are other reasons for shrinking the ring
> > size like cache usage, but that is unrelated to this).
> >
> >  
> >> BQL is not helping with that?  
> >
> > Exactly. But the BQL _byte_ limit is not what is needed, what we need
> > to know is the _packets_ currently "in-flight".  Which Tom already have
> > a patch for :-)  Once we have that the algorithm is simple.
> >
> > Qdisc dequeue look at BQL pkts-in-flight, if driver have "enough"
> > packets in-flight, the qdisc start it's bulk dequeue building phase,
> > before calling the driver. The allowed max qdisc bulk size should
> > likely be related to pkts-in-flight.
> >  
> Sorry, I'm still missing it. The point of BQL is that we minimize the
> amount of data (and hence number of packets) that needs to be queued
> in the device in order to prevent the link from going idle while there
> are outstanding packets to be sent. The algorithm is based on counting
> bytes not packets because bytes are roughly an equal cost unit of
> work. So if we've queued 100K of bytes on the queue we know how long
> that takes around 80 usecs @10G, but if we count packets then we
> really don't know much about that. 100 packets enqueued could
> represent 6400 bytes or 6400K worth of data so time to transmit is
> anywhere from 5usecs to 5msecs
> 
> Shouldn't qdisc bulk size be based on the BQL limit? What is the
> simple algorithm to apply to in-flight packets?

Maybe the algorithm is not so simple, and we likely also have to take
BQL bytes into account.

The reason for wanting packets-in-flight is that we are attacking a
transaction cost.  The tailptr/doorbell costs around 70ns.  (Based on the
data in this patch description: 4.9Mpps -> 7.5Mpps, (1/4.90 - 1/7.5) * 1000 =
70.74ns.)  The 10G wirespeed small-packet budget is 67.2ns per packet, so
with a fixed per-packet overhead of 70ns we can never reach 10G wirespeed.
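
For reference, the arithmetic above as a runnable userspace calculation
(assuming the usual 14.88Mpps figure for 64B packets at 10G wirespeed):

#include <stdio.h>

int main(void)
{
        double t_single = 1000.0 / 4.90;   /* ns/pkt, doorbell per packet  (~204.1) */
        double t_bulk   = 1000.0 / 7.50;   /* ns/pkt, doorbell amortized   (~133.3) */
        double t_budget = 1000.0 / 14.88;  /* ns/pkt at 10G wirespeed, 64B (~67.2)  */

        printf("doorbell overhead per packet ~ %.2f ns\n", t_single - t_bulk);
        printf("10G 64B wirespeed budget     = %.2f ns\n", t_budget);
        return 0;
}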

The idea/algo is trying to predict the future.  If we see a given/high
packet rate, which equals a high transaction cost, then let's try not
calling the driver, and instead backlog the packet in the qdisc,
speculatively hoping the current rate continues.  This will in effect
allow bulking and amortize the 70ns transaction cost over N packets.

Instead of tracking a rate of packets or doorbells per sec, I will let
BQLs packet-in-flight tell me when the driver sees a rate high enough
that the drivers (DMA-TX completion) consider 

Re: [PATCH RFC 11/11] net/mlx5e: XDP TX xmit more

2016-09-08 Thread Tom Herbert
On Wed, Sep 7, 2016 at 10:11 PM, Jesper Dangaard Brouer
 wrote:
>
> On Wed, 7 Sep 2016 20:21:24 -0700 Tom Herbert  wrote:
>
>> On Wed, Sep 7, 2016 at 7:58 PM, John Fastabend  
>> wrote:
>> > On 16-09-07 11:22 AM, Jesper Dangaard Brouer wrote:
>> >>
>> >> On Wed, 7 Sep 2016 19:57:19 +0300 Saeed Mahameed 
>> >>  wrote:
>> >>> On Wed, Sep 7, 2016 at 6:32 PM, Eric Dumazet  
>> >>> wrote:
>>  On Wed, 2016-09-07 at 18:08 +0300, Saeed Mahameed wrote:
>> > On Wed, Sep 7, 2016 at 5:41 PM, Eric Dumazet  
>> > wrote:
>> >> On Wed, 2016-09-07 at 15:42 +0300, Saeed Mahameed wrote:
>> >> [...]
>> 
>>  Only if a qdisc is present and pressure is high enough.
>> 
>>  But in a forwarding setup, we likely receive at a lower rate than the
>>  NIC can transmit.
>> >>
>> >> Yes, I can confirm this happens in my experiments.
>> >>
>> 
>> >>>
>> >>> Jesper has a similar Idea to make the qdisc think it is under
>> >>> pressure, when the device TX ring is idle most of the time, i think
>> >>> his idea can come in handy here. I am not fully involved in the
>> >>> details, maybe he can elaborate more.
>> >>>
>> >>> But if it works, it will be transparent to napi, and xmit more will
>> >>> happen by design.
>> >>
>> >> Yes. I have some ideas around getting more bulking going from the qdisc
>> >> layer, by having the drivers provide some feedback to the qdisc layer
>> >> indicating xmit_more should be possible.  This will be a topic at the
>> >> Network Performance Workshop[1] at NetDev 1.2, I have will hopefully
>> >> challenge people to come up with a good solution ;-)
>> >>
>> >
>> > One thing I've noticed but haven't yet actually analyzed much is if
>> > I shrink the nic descriptor ring size to only be slightly larger than
>> > the qdisc layer bulking size I get more bulking and better perf numbers.
>> > At least on microbenchmarks. The reason being the nic pushes back more
>> > on the qdisc. So maybe a case for making the ring size in the NIC some
>> > factor of the expected number of queues feeding the descriptor ring.
>> >
>
> I've also played with shrink the NIC descriptor ring size, it works,
> but it is an ugly hack to get NIC pushes backs, and I foresee it will
> hurt normal use-cases. (There are other reasons for shrinking the ring
> size like cache usage, but that is unrelated to this).
>
>
>> BQL is not helping with that?
>
> Exactly. But the BQL _byte_ limit is not what is needed, what we need
> to know is the _packets_ currently "in-flight".  Which Tom already have
> a patch for :-)  Once we have that the algorithm is simple.
>
> Qdisc dequeue look at BQL pkts-in-flight, if driver have "enough"
> packets in-flight, the qdisc start it's bulk dequeue building phase,
> before calling the driver. The allowed max qdisc bulk size should
> likely be related to pkts-in-flight.
>
Sorry, I'm still missing it. The point of BQL is that we minimize the
amount of data (and hence the number of packets) that needs to be queued
in the device in order to prevent the link from going idle while there
are outstanding packets to be sent. The algorithm is based on counting
bytes, not packets, because bytes are roughly an equal-cost unit of
work. So if we've queued 100KB on the queue, we know how long that
takes: around 80 usecs @10G. But if we count packets, we really don't
know much: 100 packets enqueued could represent 6400 bytes or 6400KB
worth of data, so the time to transmit is anywhere from ~5 usecs to
~5 msecs.
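
For reference, those numbers as a runnable userspace calculation (assuming
64B and 64KB packets for the two extremes):

#include <stdio.h>

int main(void)
{
        double link = 10e9;                              /* 10G in bits/sec      */
        double t_100KB   = 100e3 * 8 / link * 1e6;       /* ~80 usec             */
        double t_100x64B = 100 * 64 * 8 / link * 1e6;    /* ~5.12 usec           */
        double t_100x64K = 100 * 64e3 * 8 / link * 1e6;  /* ~5120 usec (~5 msec) */

        printf("100KB queued at 10G: %.1f usec\n", t_100KB);
        printf("100 x 64B  at 10G:   %.2f usec\n", t_100x64B);
        printf("100 x 64KB at 10G:   %.1f usec\n", t_100x64K);
        return 0;
}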

Shouldn't qdisc bulk size be based on the BQL limit? What is the
simple algorithm to apply to in-flight packets?

Tom

> --
> Best regards,
>   Jesper Dangaard Brouer
>   MSc.CS, Principal Kernel Engineer at Red Hat
>   Author of http://www.iptv-analyzer.org
>   LinkedIn: http://www.linkedin.com/in/brouer


Re: [PATCH RFC 11/11] net/mlx5e: XDP TX xmit more

2016-09-07 Thread Jesper Dangaard Brouer

On Wed, 7 Sep 2016 20:21:24 -0700 Tom Herbert  wrote:

> On Wed, Sep 7, 2016 at 7:58 PM, John Fastabend  
> wrote:
> > On 16-09-07 11:22 AM, Jesper Dangaard Brouer wrote:  
> >>
> >> On Wed, 7 Sep 2016 19:57:19 +0300 Saeed Mahameed 
> >>  wrote:  
> >>> On Wed, Sep 7, 2016 at 6:32 PM, Eric Dumazet  
> >>> wrote:  
>  On Wed, 2016-09-07 at 18:08 +0300, Saeed Mahameed wrote:  
> > On Wed, Sep 7, 2016 at 5:41 PM, Eric Dumazet  
> > wrote:  
> >> On Wed, 2016-09-07 at 15:42 +0300, Saeed Mahameed wrote:  
> >> [...]  
> 
>  Only if a qdisc is present and pressure is high enough.
> 
>  But in a forwarding setup, we likely receive at a lower rate than the
>  NIC can transmit.  
> >>
> >> Yes, I can confirm this happens in my experiments.
> >>  
>   
> >>>
> >>> Jesper has a similar Idea to make the qdisc think it is under
> >>> pressure, when the device TX ring is idle most of the time, i think
> >>> his idea can come in handy here. I am not fully involved in the
> >>> details, maybe he can elaborate more.
> >>>
> >>> But if it works, it will be transparent to napi, and xmit more will
> >>> happen by design.  
> >>
> >> Yes. I have some ideas around getting more bulking going from the qdisc
> >> layer, by having the drivers provide some feedback to the qdisc layer
> >> indicating xmit_more should be possible.  This will be a topic at the
> >> Network Performance Workshop[1] at NetDev 1.2, I have will hopefully
> >> challenge people to come up with a good solution ;-)
> >>  
> >
> > One thing I've noticed but haven't yet actually analyzed much is if
> > I shrink the nic descriptor ring size to only be slightly larger than
> > the qdisc layer bulking size I get more bulking and better perf numbers.
> > At least on microbenchmarks. The reason being the nic pushes back more
> > on the qdisc. So maybe a case for making the ring size in the NIC some
> > factor of the expected number of queues feeding the descriptor ring.
> >  

I've also played with shrinking the NIC descriptor ring size; it works,
but it is an ugly hack to get the NIC to push back, and I foresee it will
hurt normal use-cases. (There are other reasons for shrinking the ring
size, like cache usage, but that is unrelated to this.)

 
> BQL is not helping with that?

Exactly. But the BQL _byte_ limit is not what is needed; what we need
to know is the number of _packets_ currently "in-flight", which Tom
already has a patch for :-)  Once we have that, the algorithm is simple.

The qdisc dequeue looks at BQL pkts-in-flight; if the driver has "enough"
packets in-flight, the qdisc starts its bulk-dequeue building phase
before calling the driver. The allowed max qdisc bulk size should
likely be related to pkts-in-flight.
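
A rough sketch of the gating described here (purely illustrative; the
bql_pkts_in_flight() helper stands in for the packet counter from Tom's
patch, which is not part of BQL today, and MY_BULK_THRESHOLD is a made-up
knob):

static int my_qdisc_bulk_dequeue(struct Qdisc *q, struct netdev_queue *txq,
                                 struct sk_buff **bulk, int max)
{
        int in_flight = bql_pkts_in_flight(txq);   /* hypothetical helper */
        int n;

        bulk[0] = q->dequeue(q);
        if (!bulk[0])
                return 0;
        n = 1;

        /* Only build a bulk when the driver already has enough packets
         * in flight to keep the link busy while we spend time dequeuing. */
        if (in_flight >= MY_BULK_THRESHOLD) {
                while (n < max && n < in_flight && (bulk[n] = q->dequeue(q)))
                        n++;
        }

        return n;   /* caller sends the bulk with one doorbell / xmit_more */
}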

-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  Author of http://www.iptv-analyzer.org
  LinkedIn: http://www.linkedin.com/in/brouer


Re: [PATCH RFC 11/11] net/mlx5e: XDP TX xmit more

2016-09-07 Thread Tom Herbert
On Wed, Sep 7, 2016 at 7:58 PM, John Fastabend  wrote:
> On 16-09-07 11:22 AM, Jesper Dangaard Brouer wrote:
>>
>> On Wed, 7 Sep 2016 19:57:19 +0300 Saeed Mahameed  
>> wrote:
>>> On Wed, Sep 7, 2016 at 6:32 PM, Eric Dumazet  wrote:
 On Wed, 2016-09-07 at 18:08 +0300, Saeed Mahameed wrote:
> On Wed, Sep 7, 2016 at 5:41 PM, Eric Dumazet  
> wrote:
>> On Wed, 2016-09-07 at 15:42 +0300, Saeed Mahameed wrote:
>> [...]

 Only if a qdisc is present and pressure is high enough.

 But in a forwarding setup, we likely receive at a lower rate than the
 NIC can transmit.
>>
>> Yes, I can confirm this happens in my experiments.
>>

>>>
>>> Jesper has a similar Idea to make the qdisc think it is under
>>> pressure, when the device TX ring is idle most of the time, i think
>>> his idea can come in handy here. I am not fully involved in the
>>> details, maybe he can elaborate more.
>>>
>>> But if it works, it will be transparent to napi, and xmit more will
>>> happen by design.
>>
>> Yes. I have some ideas around getting more bulking going from the qdisc
>> layer, by having the drivers provide some feedback to the qdisc layer
>> indicating xmit_more should be possible.  This will be a topic at the
>> Network Performance Workshop[1] at NetDev 1.2, I have will hopefully
>> challenge people to come up with a good solution ;-)
>>
>
> One thing I've noticed but haven't yet actually analyzed much is if
> I shrink the nic descriptor ring size to only be slightly larger than
> the qdisc layer bulking size I get more bulking and better perf numbers.
> At least on microbenchmarks. The reason being the nic pushes back more
> on the qdisc. So maybe a case for making the ring size in the NIC some
> factor of the expected number of queues feeding the descriptor ring.
>

BQL is not helping with that?

Tom

> .John


Re: [PATCH RFC 11/11] net/mlx5e: XDP TX xmit more

2016-09-07 Thread John Fastabend
On 16-09-07 11:22 AM, Jesper Dangaard Brouer wrote:
> 
> On Wed, 7 Sep 2016 19:57:19 +0300 Saeed Mahameed  
> wrote:
>> On Wed, Sep 7, 2016 at 6:32 PM, Eric Dumazet  wrote:
>>> On Wed, 2016-09-07 at 18:08 +0300, Saeed Mahameed wrote:  
 On Wed, Sep 7, 2016 at 5:41 PM, Eric Dumazet  
 wrote:  
> On Wed, 2016-09-07 at 15:42 +0300, Saeed Mahameed wrote:  
> [...]
>>>
>>> Only if a qdisc is present and pressure is high enough.
>>>
>>> But in a forwarding setup, we likely receive at a lower rate than the
>>> NIC can transmit.
> 
> Yes, I can confirm this happens in my experiments.
> 
>>>  
>>
>> Jesper has a similar Idea to make the qdisc think it is under
>> pressure, when the device TX ring is idle most of the time, i think
>> his idea can come in handy here. I am not fully involved in the
>> details, maybe he can elaborate more.
>>
>> But if it works, it will be transparent to napi, and xmit more will
>> happen by design.
> 
> Yes. I have some ideas around getting more bulking going from the qdisc
> layer, by having the drivers provide some feedback to the qdisc layer
> indicating xmit_more should be possible.  This will be a topic at the
> Network Performance Workshop[1] at NetDev 1.2, I have will hopefully
> challenge people to come up with a good solution ;-)
> 

One thing I've noticed, but haven't yet actually analyzed much, is that if
I shrink the NIC descriptor ring size to be only slightly larger than the
qdisc-layer bulking size, I get more bulking and better perf numbers, at
least in microbenchmarks. The reason is that the NIC pushes back more on
the qdisc. So maybe there is a case for making the ring size in the NIC
some factor of the expected number of queues feeding the descriptor ring.

.John


Re: [PATCH RFC 11/11] net/mlx5e: XDP TX xmit more

2016-09-07 Thread Saeed Mahameed
On Wed, Sep 7, 2016 at 9:19 PM, Eric Dumazet  wrote:
> On Wed, 2016-09-07 at 19:57 +0300, Saeed Mahameed wrote:
>
>> Jesper has a similar Idea to make the qdisc think it is under
>> pressure, when the device
>> TX ring is idle most of the time, i think his idea can come in handy here.
>> I am not fully involved in the details, maybe he can elaborate more.
>>
>> But if it works, it will be transparent to napi, and xmit more will
>> happen by design.
>
> I do not think qdisc is relevant here.
>
> Right now, skb->xmit_more is set only by qdisc layer (and pktgen tool),
> because only this layer can know if more packets are to come.
>
>
> What I am saying is that regardless of skb->xmit_more being set or not,
> (for example if no qdisc is even used)
> a NAPI driver can arm a bit asking the doorbell being sent at the end of
> NAPI.
>
> I am not saying this must be done, only that the idea could be extended
> to non XDP world, if we care enough.
>

Yes, and I am just trying to suggest ideas that do not require
communication between RX (NAPI) and TX.

The problem here is the synchronization (ringing the TX doorbell from RX),
which is not as simple as an atomic operation for some drivers.

How about RX bulking? It could also help here: in the forwarding case,
the forwarding path would be able to process a bulk of RX SKBs and
bulk-xmit the portion of SKBs that will be forwarded.

As Jesper suggested, let's talk at NetDev 1.2 in Jesper's session (if you
are joining, of course).

Thanks
Saeed.


Re: [PATCH RFC 11/11] net/mlx5e: XDP TX xmit more

2016-09-07 Thread Jesper Dangaard Brouer

On Wed, 7 Sep 2016 19:57:19 +0300 Saeed Mahameed  
wrote:
> On Wed, Sep 7, 2016 at 6:32 PM, Eric Dumazet  wrote:
> > On Wed, 2016-09-07 at 18:08 +0300, Saeed Mahameed wrote:  
> >> On Wed, Sep 7, 2016 at 5:41 PM, Eric Dumazet  
> >> wrote:  
> >> > On Wed, 2016-09-07 at 15:42 +0300, Saeed Mahameed wrote:  
[...]
> >
> > Only if a qdisc is present and pressure is high enough.
> >
> > But in a forwarding setup, we likely receive at a lower rate than the
> > NIC can transmit.

Yes, I can confirm this happens in my experiments.

> >  
> 
> Jesper has a similar Idea to make the qdisc think it is under
> pressure, when the device TX ring is idle most of the time, i think
> his idea can come in handy here. I am not fully involved in the
> details, maybe he can elaborate more.
> 
> But if it works, it will be transparent to napi, and xmit more will
> happen by design.

Yes. I have some ideas around getting more bulking going from the qdisc
layer, by having the drivers provide some feedback to the qdisc layer
indicating xmit_more should be possible.  This will be a topic at the
Network Performance Workshop[1] at NetDev 1.2, where I will hopefully
challenge people to come up with a good solution ;-)

-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  Author of http://www.iptv-analyzer.org
  LinkedIn: http://www.linkedin.com/in/brouer

[1] http://netdevconf.org/1.2/session.html?jesper-performance-workshop


Re: [PATCH RFC 11/11] net/mlx5e: XDP TX xmit more

2016-09-07 Thread Eric Dumazet
On Wed, 2016-09-07 at 19:57 +0300, Saeed Mahameed wrote:

> Jesper has a similar Idea to make the qdisc think it is under
> pressure, when the device
> TX ring is idle most of the time, i think his idea can come in handy here.
> I am not fully involved in the details, maybe he can elaborate more.
> 
> But if it works, it will be transparent to napi, and xmit more will
> happen by design.

I do not think qdisc is relevant here.

Right now, skb->xmit_more is set only by the qdisc layer (and the pktgen
tool), because only this layer can know if more packets are to come.

What I am saying is that, regardless of skb->xmit_more being set or not
(for example if no qdisc is even used), a NAPI driver can arm a bit asking
for the doorbell to be sent at the end of NAPI.

I am not saying this must be done, only that the idea could be extended
to non XDP world, if we care enough.


Re: [PATCH RFC 11/11] net/mlx5e: XDP TX xmit more

2016-09-07 Thread Saeed Mahameed
On Wed, Sep 7, 2016 at 6:32 PM, Eric Dumazet  wrote:
> On Wed, 2016-09-07 at 18:08 +0300, Saeed Mahameed wrote:
>> On Wed, Sep 7, 2016 at 5:41 PM, Eric Dumazet  wrote:
>> > On Wed, 2016-09-07 at 15:42 +0300, Saeed Mahameed wrote:
>> >> Previously we rang XDP SQ doorbell on every forwarded XDP packet.
>> >>
>> >> Here we introduce a xmit more like mechanism that will queue up more
>> >> than one packet into SQ (up to RX napi budget) w/o notifying the hardware.
>> >>
>> >> Once RX napi budget is consumed and we exit napi RX loop, we will
>> >> flush (doorbell) all XDP looped packets in case there are such.
>> >
>> > Why is this idea depends on XDP ?
>> >
>> > It looks like we could apply it to any driver having one IRQ servicing
>> > one RX and one TX, without XDP being involved.
>> >
>>
>> Yes but it is more complicated than XDP case, where the RX ring posts
>> the TX descriptors and once done
>> the RX ring hits the doorbell once for all the TX descriptors it
>> posted, and it is the only possible place to hit a doorbell
>> for XDP TX ring.
>>
>> For regular TX and RX ring sharing the same IRQ, there is no such
>> simple connection between them, and hitting a doorbell
>> from RX ring napi would race with xmit ndo function of the TX ring.
>>
>> How do you synchronize in such case ?
>> isn't the existing xmit more mechanism sufficient enough ?
>
> Only if a qdisc is present and pressure is high enough.
>
> But in a forwarding setup, we likely receive at a lower rate than the
> NIC can transmit.
>

Jesper has a similar idea: make the qdisc think it is under pressure when
the device TX ring is idle most of the time. I think his idea can come in
handy here. I am not fully involved in the details; maybe he can elaborate
more.

But if it works, it will be transparent to NAPI, and xmit_more will
happen by design.

> A simple cmpxchg could be used to synchronize the thing, if we really
> cared about doorbell cost. (Ie if the cost of this cmpxchg() is way
> smaller than doorbell one)
>


Re: [PATCH RFC 11/11] net/mlx5e: XDP TX xmit more

2016-09-07 Thread Eric Dumazet
On Wed, 2016-09-07 at 18:08 +0300, Saeed Mahameed wrote:
> On Wed, Sep 7, 2016 at 5:41 PM, Eric Dumazet  wrote:
> > On Wed, 2016-09-07 at 15:42 +0300, Saeed Mahameed wrote:
> >> Previously we rang XDP SQ doorbell on every forwarded XDP packet.
> >>
> >> Here we introduce a xmit more like mechanism that will queue up more
> >> than one packet into SQ (up to RX napi budget) w/o notifying the hardware.
> >>
> >> Once RX napi budget is consumed and we exit napi RX loop, we will
> >> flush (doorbell) all XDP looped packets in case there are such.
> >
> > Why is this idea depends on XDP ?
> >
> > It looks like we could apply it to any driver having one IRQ servicing
> > one RX and one TX, without XDP being involved.
> >
> 
> Yes but it is more complicated than XDP case, where the RX ring posts
> the TX descriptors and once done
> the RX ring hits the doorbell once for all the TX descriptors it
> posted, and it is the only possible place to hit a doorbell
> for XDP TX ring.
> 
> For regular TX and RX ring sharing the same IRQ, there is no such
> simple connection between them, and hitting a doorbell
> from RX ring napi would race with xmit ndo function of the TX ring.
> 
> How do you synchronize in such case ?
> isn't the existing xmit more mechanism sufficient enough ?

Only if a qdisc is present and pressure is high enough.

But in a forwarding setup, we likely receive at a lower rate than the
NIC can transmit.

A simple cmpxchg could be used to synchronize the thing, if we really
cared about doorbell cost. (I.e., if the cost of this cmpxchg() is way
smaller than the doorbell's.)
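
A minimal sketch of that kind of synchronization (illustrative only; the
db_pending flag and the my_ names are made up, not mlx5e code):

static void my_request_doorbell(struct my_sq *sq)
{
        /* Any producer may request a doorbell; one ring is enough. */
        atomic_set(&sq->db_pending, 1);
}

static void my_flush_doorbell(struct my_sq *sq)
{
        /* Claim the pending request; the 1 -> 0 winner rings exactly once. */
        if (atomic_cmpxchg(&sq->db_pending, 1, 0) == 1)
                my_ring_doorbell(sq);
}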


Re: [PATCH RFC 11/11] net/mlx5e: XDP TX xmit more

2016-09-07 Thread Saeed Mahameed
On Wed, Sep 7, 2016 at 5:41 PM, Eric Dumazet  wrote:
> On Wed, 2016-09-07 at 15:42 +0300, Saeed Mahameed wrote:
>> Previously we rang XDP SQ doorbell on every forwarded XDP packet.
>>
>> Here we introduce a xmit more like mechanism that will queue up more
>> than one packet into SQ (up to RX napi budget) w/o notifying the hardware.
>>
>> Once RX napi budget is consumed and we exit napi RX loop, we will
>> flush (doorbell) all XDP looped packets in case there are such.
>
> Why is this idea depends on XDP ?
>
> It looks like we could apply it to any driver having one IRQ servicing
> one RX and one TX, without XDP being involved.
>

Yes, but it is more complicated than the XDP case, where the RX ring posts
the TX descriptors and, once done, hits the doorbell once for all the TX
descriptors it posted; that is the only possible place to hit a doorbell
for the XDP TX ring.

For a regular TX and RX ring sharing the same IRQ, there is no such simple
connection between them, and hitting a doorbell from RX ring NAPI would
race with the xmit ndo function of the TX ring.

How do you synchronize in such a case?
Isn't the existing xmit_more mechanism sufficient? Maybe we can have a
fence from the NAPI RX function that holds the xmit queue until done and
then flushes the TX queue, setting the right xmit_more flags, without the
need to explicitly intervene in the TX flow (hitting the doorbell).
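
A condensed sketch of the XDP-TX batching pattern described above (the
shape of it only; all my_ names are placeholders, not the real mlx5e
symbols):

static int my_napi_poll_rx(struct my_rq *rq, int budget)
{
        bool xdp_tx_posted = false;
        int work = 0;

        while (work < budget && my_rx_desc_ready(rq)) {
                /* May post an XDP TX descriptor but never rings the bell. */
                if (my_handle_rx_and_maybe_xdp_tx(rq))
                        xdp_tx_posted = true;
                work++;
        }

        if (xdp_tx_posted)
                my_xdp_sq_ring_doorbell(rq->xdp_sq);   /* once per NAPI run */

        return work;
}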


Re: [PATCH RFC 11/11] net/mlx5e: XDP TX xmit more

2016-09-07 Thread Eric Dumazet
On Wed, 2016-09-07 at 15:42 +0300, Saeed Mahameed wrote:
> Previously we rang XDP SQ doorbell on every forwarded XDP packet.
> 
> Here we introduce a xmit more like mechanism that will queue up more
> than one packet into SQ (up to RX napi budget) w/o notifying the hardware.
> 
> Once RX napi budget is consumed and we exit napi RX loop, we will
> flush (doorbell) all XDP looped packets in case there are such.

Why does this idea depend on XDP?

It looks like we could apply it to any driver having one IRQ servicing
one RX and one TX, without XDP being involved.





Re: [PATCH RFC 11/11] net/mlx5e: XDP TX xmit more

2016-09-07 Thread John Fastabend
On 16-09-07 05:42 AM, Saeed Mahameed wrote:
> Previously we rang XDP SQ doorbell on every forwarded XDP packet.
> 
> Here we introduce a xmit more like mechanism that will queue up more
> than one packet into SQ (up to RX napi budget) w/o notifying the hardware.
> 
> Once RX napi budget is consumed and we exit napi RX loop, we will
> flush (doorbell) all XDP looped packets in case there are such.
> 
> XDP forward packet rate:
> 
> Comparing XDP with and w/o xmit more (bulk transmit):
> 
> Streams    XDP TX      XDP TX (xmit more)
> -------------------------------------------
> 1          4.90Mpps    7.50Mpps
> 2          9.50Mpps    14.8Mpps
> 4          16.5Mpps    25.1Mpps
> 8          21.5Mpps    27.5Mpps*
> 16         24.1Mpps    27.5Mpps*
> 

Hi Saeed,

How many cores are you using with these numbers? Just a single
core, or are the streams being RSS'd across cores somehow?

> *It seems we hit a wall of 27.5Mpps, for 8 and 16 streams,
> we will be working on the analysis and will publish the conclusions
> later.
> 

Thanks,
John