Re: [PATCH RFC 11/11] net/mlx5e: XDP TX xmit more
On Thu, 2016-09-08 at 11:48 -0700, Rick Jones wrote: > With small packets and the "default" ring size for this NIC/driver > combination, is the BQL large enough that the ring fills before one hits > the BQL? It depends on how TX completion (NAPI handler) is implemented in the driver. Say how many packets can be dequeued by each invocation. Drivers have a lot of variations there.
Re: [PATCH RFC 11/11] net/mlx5e: XDP TX xmit more
On 09/08/2016 11:16 AM, Tom Herbert wrote: On Thu, Sep 8, 2016 at 10:19 AM, Jesper Dangaard Brouerwrote: On Thu, 8 Sep 2016 09:26:03 -0700 Tom Herbert wrote: Shouldn't qdisc bulk size be based on the BQL limit? What is the simple algorithm to apply to in-flight packets? Maybe the algorithm is not so simple, and we likely also have to take BQL bytes into account. The reason for wanting packets-in-flight is because we are attacking a transaction cost. The tailptr/doorbell cost around 70ns. (Based on data in this patch desc, 4.9Mpps -> 7.5Mpps (1/4.90-1/7.5)*1000 = 70.74). The 10G wirespeed small packets budget is 67.2ns, this with fixed overhead per packet of 70ns we can never reach 10G wirespeed. But you should be able to do this with BQL and it is more accurate. BQL tells how many bytes need to be sent and that can be used to create a bulk of packets to send with one doorbell. With small packets and the "default" ring size for this NIC/driver combination, is the BQL large enough that the ring fills before one hits the BQL? rick jones
Re: [PATCH RFC 11/11] net/mlx5e: XDP TX xmit more
On Thu, Sep 8, 2016 at 10:19 AM, Jesper Dangaard Brouerwrote: > On Thu, 8 Sep 2016 09:26:03 -0700 > Tom Herbert wrote: > >> On Wed, Sep 7, 2016 at 10:11 PM, Jesper Dangaard Brouer >> wrote: >> > >> > On Wed, 7 Sep 2016 20:21:24 -0700 Tom Herbert wrote: >> > >> >> On Wed, Sep 7, 2016 at 7:58 PM, John Fastabend >> >> wrote: >> >> > On 16-09-07 11:22 AM, Jesper Dangaard Brouer wrote: >> >> >> >> >> >> On Wed, 7 Sep 2016 19:57:19 +0300 Saeed Mahameed >> >> >> wrote: >> >> >>> On Wed, Sep 7, 2016 at 6:32 PM, Eric Dumazet >> >> >>> wrote: >> >> On Wed, 2016-09-07 at 18:08 +0300, Saeed Mahameed wrote: >> >> > On Wed, Sep 7, 2016 at 5:41 PM, Eric Dumazet >> >> > wrote: >> >> >> On Wed, 2016-09-07 at 15:42 +0300, Saeed Mahameed wrote: >> >> >> [...] >> >> >> >> Only if a qdisc is present and pressure is high enough. >> >> >> >> But in a forwarding setup, we likely receive at a lower rate than the >> >> NIC can transmit. >> >> >> >> >> >> Yes, I can confirm this happens in my experiments. >> >> >> >> >> >> >> >>> >> >> >>> Jesper has a similar Idea to make the qdisc think it is under >> >> >>> pressure, when the device TX ring is idle most of the time, i think >> >> >>> his idea can come in handy here. I am not fully involved in the >> >> >>> details, maybe he can elaborate more. >> >> >>> >> >> >>> But if it works, it will be transparent to napi, and xmit more will >> >> >>> happen by design. >> >> >> >> >> >> Yes. I have some ideas around getting more bulking going from the qdisc >> >> >> layer, by having the drivers provide some feedback to the qdisc layer >> >> >> indicating xmit_more should be possible. This will be a topic at the >> >> >> Network Performance Workshop[1] at NetDev 1.2, I have will hopefully >> >> >> challenge people to come up with a good solution ;-) >> >> >> >> >> > >> >> > One thing I've noticed but haven't yet actually analyzed much is if >> >> > I shrink the nic descriptor ring size to only be slightly larger than >> >> > the qdisc layer bulking size I get more bulking and better perf numbers. >> >> > At least on microbenchmarks. The reason being the nic pushes back more >> >> > on the qdisc. So maybe a case for making the ring size in the NIC some >> >> > factor of the expected number of queues feeding the descriptor ring. >> >> > >> > >> > I've also played with shrink the NIC descriptor ring size, it works, >> > but it is an ugly hack to get NIC pushes backs, and I foresee it will >> > hurt normal use-cases. (There are other reasons for shrinking the ring >> > size like cache usage, but that is unrelated to this). >> > >> > >> >> BQL is not helping with that? >> > >> > Exactly. But the BQL _byte_ limit is not what is needed, what we need >> > to know is the _packets_ currently "in-flight". Which Tom already have >> > a patch for :-) Once we have that the algorithm is simple. >> > >> > Qdisc dequeue look at BQL pkts-in-flight, if driver have "enough" >> > packets in-flight, the qdisc start it's bulk dequeue building phase, >> > before calling the driver. The allowed max qdisc bulk size should >> > likely be related to pkts-in-flight. >> > >> Sorry, I'm still missing it. The point of BQL is that we minimize the >> amount of data (and hence number of packets) that needs to be queued >> in the device in order to prevent the link from going idle while there >> are outstanding packets to be sent. The algorithm is based on counting >> bytes not packets because bytes are roughly an equal cost unit of >> work. So if we've queued 100K of bytes on the queue we know how long >> that takes around 80 usecs @10G, but if we count packets then we >> really don't know much about that. 100 packets enqueued could >> represent 6400 bytes or 6400K worth of data so time to transmit is >> anywhere from 5usecs to 5msecs >> >> Shouldn't qdisc bulk size be based on the BQL limit? What is the >> simple algorithm to apply to in-flight packets? > > Maybe the algorithm is not so simple, and we likely also have to take > BQL bytes into account. > > The reason for wanting packets-in-flight is because we are attacking a > transaction cost. The tailptr/doorbell cost around 70ns. (Based on > data in this patch desc, 4.9Mpps -> 7.5Mpps (1/4.90-1/7.5)*1000 = > 70.74). The 10G wirespeed small packets budget is 67.2ns, this with > fixed overhead per packet of 70ns we can never reach 10G wirespeed. > But you should be able to do this with BQL and it is more accurate. BQL tells how many bytes need to be sent and that can be used to create a bulk of packets to send with one doorbell. > The idea/algo is trying to predict the future. If we see a given/high > packet rate, which equals a high transaction cost, then lets try not > calling the driver, and instead backlog the packet in
Re: [PATCH RFC 11/11] net/mlx5e: XDP TX xmit more
On Thu, 8 Sep 2016 09:26:03 -0700 Tom Herbertwrote: > On Wed, Sep 7, 2016 at 10:11 PM, Jesper Dangaard Brouer > wrote: > > > > On Wed, 7 Sep 2016 20:21:24 -0700 Tom Herbert wrote: > > > >> On Wed, Sep 7, 2016 at 7:58 PM, John Fastabend > >> wrote: > >> > On 16-09-07 11:22 AM, Jesper Dangaard Brouer wrote: > >> >> > >> >> On Wed, 7 Sep 2016 19:57:19 +0300 Saeed Mahameed > >> >> wrote: > >> >>> On Wed, Sep 7, 2016 at 6:32 PM, Eric Dumazet > >> >>> wrote: > >> On Wed, 2016-09-07 at 18:08 +0300, Saeed Mahameed wrote: > >> > On Wed, Sep 7, 2016 at 5:41 PM, Eric Dumazet > >> > wrote: > >> >> On Wed, 2016-09-07 at 15:42 +0300, Saeed Mahameed wrote: > >> >> [...] > >> > >> Only if a qdisc is present and pressure is high enough. > >> > >> But in a forwarding setup, we likely receive at a lower rate than the > >> NIC can transmit. > >> >> > >> >> Yes, I can confirm this happens in my experiments. > >> >> > >> > >> >>> > >> >>> Jesper has a similar Idea to make the qdisc think it is under > >> >>> pressure, when the device TX ring is idle most of the time, i think > >> >>> his idea can come in handy here. I am not fully involved in the > >> >>> details, maybe he can elaborate more. > >> >>> > >> >>> But if it works, it will be transparent to napi, and xmit more will > >> >>> happen by design. > >> >> > >> >> Yes. I have some ideas around getting more bulking going from the qdisc > >> >> layer, by having the drivers provide some feedback to the qdisc layer > >> >> indicating xmit_more should be possible. This will be a topic at the > >> >> Network Performance Workshop[1] at NetDev 1.2, I have will hopefully > >> >> challenge people to come up with a good solution ;-) > >> >> > >> > > >> > One thing I've noticed but haven't yet actually analyzed much is if > >> > I shrink the nic descriptor ring size to only be slightly larger than > >> > the qdisc layer bulking size I get more bulking and better perf numbers. > >> > At least on microbenchmarks. The reason being the nic pushes back more > >> > on the qdisc. So maybe a case for making the ring size in the NIC some > >> > factor of the expected number of queues feeding the descriptor ring. > >> > > > > > I've also played with shrink the NIC descriptor ring size, it works, > > but it is an ugly hack to get NIC pushes backs, and I foresee it will > > hurt normal use-cases. (There are other reasons for shrinking the ring > > size like cache usage, but that is unrelated to this). > > > > > >> BQL is not helping with that? > > > > Exactly. But the BQL _byte_ limit is not what is needed, what we need > > to know is the _packets_ currently "in-flight". Which Tom already have > > a patch for :-) Once we have that the algorithm is simple. > > > > Qdisc dequeue look at BQL pkts-in-flight, if driver have "enough" > > packets in-flight, the qdisc start it's bulk dequeue building phase, > > before calling the driver. The allowed max qdisc bulk size should > > likely be related to pkts-in-flight. > > > Sorry, I'm still missing it. The point of BQL is that we minimize the > amount of data (and hence number of packets) that needs to be queued > in the device in order to prevent the link from going idle while there > are outstanding packets to be sent. The algorithm is based on counting > bytes not packets because bytes are roughly an equal cost unit of > work. So if we've queued 100K of bytes on the queue we know how long > that takes around 80 usecs @10G, but if we count packets then we > really don't know much about that. 100 packets enqueued could > represent 6400 bytes or 6400K worth of data so time to transmit is > anywhere from 5usecs to 5msecs > > Shouldn't qdisc bulk size be based on the BQL limit? What is the > simple algorithm to apply to in-flight packets? Maybe the algorithm is not so simple, and we likely also have to take BQL bytes into account. The reason for wanting packets-in-flight is because we are attacking a transaction cost. The tailptr/doorbell cost around 70ns. (Based on data in this patch desc, 4.9Mpps -> 7.5Mpps (1/4.90-1/7.5)*1000 = 70.74). The 10G wirespeed small packets budget is 67.2ns, this with fixed overhead per packet of 70ns we can never reach 10G wirespeed. The idea/algo is trying to predict the future. If we see a given/high packet rate, which equals a high transaction cost, then lets try not calling the driver, and instead backlog the packet in the qdisc, speculatively hoping the current rate continues. This will in effect allow bulking and amortize the 70ns transaction cost over N packets. Instead of tracking a rate of packets or doorbells per sec, I will let BQLs packet-in-flight tell me when the driver sees a rate high enough that the drivers (DMA-TX completion) consider
Re: [PATCH RFC 11/11] net/mlx5e: XDP TX xmit more
On Wed, Sep 7, 2016 at 10:11 PM, Jesper Dangaard Brouerwrote: > > On Wed, 7 Sep 2016 20:21:24 -0700 Tom Herbert wrote: > >> On Wed, Sep 7, 2016 at 7:58 PM, John Fastabend >> wrote: >> > On 16-09-07 11:22 AM, Jesper Dangaard Brouer wrote: >> >> >> >> On Wed, 7 Sep 2016 19:57:19 +0300 Saeed Mahameed >> >> wrote: >> >>> On Wed, Sep 7, 2016 at 6:32 PM, Eric Dumazet >> >>> wrote: >> On Wed, 2016-09-07 at 18:08 +0300, Saeed Mahameed wrote: >> > On Wed, Sep 7, 2016 at 5:41 PM, Eric Dumazet >> > wrote: >> >> On Wed, 2016-09-07 at 15:42 +0300, Saeed Mahameed wrote: >> >> [...] >> >> Only if a qdisc is present and pressure is high enough. >> >> But in a forwarding setup, we likely receive at a lower rate than the >> NIC can transmit. >> >> >> >> Yes, I can confirm this happens in my experiments. >> >> >> >> >>> >> >>> Jesper has a similar Idea to make the qdisc think it is under >> >>> pressure, when the device TX ring is idle most of the time, i think >> >>> his idea can come in handy here. I am not fully involved in the >> >>> details, maybe he can elaborate more. >> >>> >> >>> But if it works, it will be transparent to napi, and xmit more will >> >>> happen by design. >> >> >> >> Yes. I have some ideas around getting more bulking going from the qdisc >> >> layer, by having the drivers provide some feedback to the qdisc layer >> >> indicating xmit_more should be possible. This will be a topic at the >> >> Network Performance Workshop[1] at NetDev 1.2, I have will hopefully >> >> challenge people to come up with a good solution ;-) >> >> >> > >> > One thing I've noticed but haven't yet actually analyzed much is if >> > I shrink the nic descriptor ring size to only be slightly larger than >> > the qdisc layer bulking size I get more bulking and better perf numbers. >> > At least on microbenchmarks. The reason being the nic pushes back more >> > on the qdisc. So maybe a case for making the ring size in the NIC some >> > factor of the expected number of queues feeding the descriptor ring. >> > > > I've also played with shrink the NIC descriptor ring size, it works, > but it is an ugly hack to get NIC pushes backs, and I foresee it will > hurt normal use-cases. (There are other reasons for shrinking the ring > size like cache usage, but that is unrelated to this). > > >> BQL is not helping with that? > > Exactly. But the BQL _byte_ limit is not what is needed, what we need > to know is the _packets_ currently "in-flight". Which Tom already have > a patch for :-) Once we have that the algorithm is simple. > > Qdisc dequeue look at BQL pkts-in-flight, if driver have "enough" > packets in-flight, the qdisc start it's bulk dequeue building phase, > before calling the driver. The allowed max qdisc bulk size should > likely be related to pkts-in-flight. > Sorry, I'm still missing it. The point of BQL is that we minimize the amount of data (and hence number of packets) that needs to be queued in the device in order to prevent the link from going idle while there are outstanding packets to be sent. The algorithm is based on counting bytes not packets because bytes are roughly an equal cost unit of work. So if we've queued 100K of bytes on the queue we know how long that takes around 80 usecs @10G, but if we count packets then we really don't know much about that. 100 packets enqueued could represent 6400 bytes or 6400K worth of data so time to transmit is anywhere from 5usecs to 5msecs Shouldn't qdisc bulk size be based on the BQL limit? What is the simple algorithm to apply to in-flight packets? Tom > -- > Best regards, > Jesper Dangaard Brouer > MSc.CS, Principal Kernel Engineer at Red Hat > Author of http://www.iptv-analyzer.org > LinkedIn: http://www.linkedin.com/in/brouer
Re: [PATCH RFC 11/11] net/mlx5e: XDP TX xmit more
On Wed, 7 Sep 2016 20:21:24 -0700 Tom Herbertwrote: > On Wed, Sep 7, 2016 at 7:58 PM, John Fastabend > wrote: > > On 16-09-07 11:22 AM, Jesper Dangaard Brouer wrote: > >> > >> On Wed, 7 Sep 2016 19:57:19 +0300 Saeed Mahameed > >> wrote: > >>> On Wed, Sep 7, 2016 at 6:32 PM, Eric Dumazet > >>> wrote: > On Wed, 2016-09-07 at 18:08 +0300, Saeed Mahameed wrote: > > On Wed, Sep 7, 2016 at 5:41 PM, Eric Dumazet > > wrote: > >> On Wed, 2016-09-07 at 15:42 +0300, Saeed Mahameed wrote: > >> [...] > > Only if a qdisc is present and pressure is high enough. > > But in a forwarding setup, we likely receive at a lower rate than the > NIC can transmit. > >> > >> Yes, I can confirm this happens in my experiments. > >> > > >>> > >>> Jesper has a similar Idea to make the qdisc think it is under > >>> pressure, when the device TX ring is idle most of the time, i think > >>> his idea can come in handy here. I am not fully involved in the > >>> details, maybe he can elaborate more. > >>> > >>> But if it works, it will be transparent to napi, and xmit more will > >>> happen by design. > >> > >> Yes. I have some ideas around getting more bulking going from the qdisc > >> layer, by having the drivers provide some feedback to the qdisc layer > >> indicating xmit_more should be possible. This will be a topic at the > >> Network Performance Workshop[1] at NetDev 1.2, I have will hopefully > >> challenge people to come up with a good solution ;-) > >> > > > > One thing I've noticed but haven't yet actually analyzed much is if > > I shrink the nic descriptor ring size to only be slightly larger than > > the qdisc layer bulking size I get more bulking and better perf numbers. > > At least on microbenchmarks. The reason being the nic pushes back more > > on the qdisc. So maybe a case for making the ring size in the NIC some > > factor of the expected number of queues feeding the descriptor ring. > > I've also played with shrink the NIC descriptor ring size, it works, but it is an ugly hack to get NIC pushes backs, and I foresee it will hurt normal use-cases. (There are other reasons for shrinking the ring size like cache usage, but that is unrelated to this). > BQL is not helping with that? Exactly. But the BQL _byte_ limit is not what is needed, what we need to know is the _packets_ currently "in-flight". Which Tom already have a patch for :-) Once we have that the algorithm is simple. Qdisc dequeue look at BQL pkts-in-flight, if driver have "enough" packets in-flight, the qdisc start it's bulk dequeue building phase, before calling the driver. The allowed max qdisc bulk size should likely be related to pkts-in-flight. -- Best regards, Jesper Dangaard Brouer MSc.CS, Principal Kernel Engineer at Red Hat Author of http://www.iptv-analyzer.org LinkedIn: http://www.linkedin.com/in/brouer
Re: [PATCH RFC 11/11] net/mlx5e: XDP TX xmit more
On Wed, Sep 7, 2016 at 7:58 PM, John Fastabendwrote: > On 16-09-07 11:22 AM, Jesper Dangaard Brouer wrote: >> >> On Wed, 7 Sep 2016 19:57:19 +0300 Saeed Mahameed >> wrote: >>> On Wed, Sep 7, 2016 at 6:32 PM, Eric Dumazet wrote: On Wed, 2016-09-07 at 18:08 +0300, Saeed Mahameed wrote: > On Wed, Sep 7, 2016 at 5:41 PM, Eric Dumazet > wrote: >> On Wed, 2016-09-07 at 15:42 +0300, Saeed Mahameed wrote: >> [...] Only if a qdisc is present and pressure is high enough. But in a forwarding setup, we likely receive at a lower rate than the NIC can transmit. >> >> Yes, I can confirm this happens in my experiments. >> >>> >>> Jesper has a similar Idea to make the qdisc think it is under >>> pressure, when the device TX ring is idle most of the time, i think >>> his idea can come in handy here. I am not fully involved in the >>> details, maybe he can elaborate more. >>> >>> But if it works, it will be transparent to napi, and xmit more will >>> happen by design. >> >> Yes. I have some ideas around getting more bulking going from the qdisc >> layer, by having the drivers provide some feedback to the qdisc layer >> indicating xmit_more should be possible. This will be a topic at the >> Network Performance Workshop[1] at NetDev 1.2, I have will hopefully >> challenge people to come up with a good solution ;-) >> > > One thing I've noticed but haven't yet actually analyzed much is if > I shrink the nic descriptor ring size to only be slightly larger than > the qdisc layer bulking size I get more bulking and better perf numbers. > At least on microbenchmarks. The reason being the nic pushes back more > on the qdisc. So maybe a case for making the ring size in the NIC some > factor of the expected number of queues feeding the descriptor ring. > BQL is not helping with that? Tom > .John
Re: [PATCH RFC 11/11] net/mlx5e: XDP TX xmit more
On 16-09-07 11:22 AM, Jesper Dangaard Brouer wrote: > > On Wed, 7 Sep 2016 19:57:19 +0300 Saeed Mahameed> wrote: >> On Wed, Sep 7, 2016 at 6:32 PM, Eric Dumazet wrote: >>> On Wed, 2016-09-07 at 18:08 +0300, Saeed Mahameed wrote: On Wed, Sep 7, 2016 at 5:41 PM, Eric Dumazet wrote: > On Wed, 2016-09-07 at 15:42 +0300, Saeed Mahameed wrote: > [...] >>> >>> Only if a qdisc is present and pressure is high enough. >>> >>> But in a forwarding setup, we likely receive at a lower rate than the >>> NIC can transmit. > > Yes, I can confirm this happens in my experiments. > >>> >> >> Jesper has a similar Idea to make the qdisc think it is under >> pressure, when the device TX ring is idle most of the time, i think >> his idea can come in handy here. I am not fully involved in the >> details, maybe he can elaborate more. >> >> But if it works, it will be transparent to napi, and xmit more will >> happen by design. > > Yes. I have some ideas around getting more bulking going from the qdisc > layer, by having the drivers provide some feedback to the qdisc layer > indicating xmit_more should be possible. This will be a topic at the > Network Performance Workshop[1] at NetDev 1.2, I have will hopefully > challenge people to come up with a good solution ;-) > One thing I've noticed but haven't yet actually analyzed much is if I shrink the nic descriptor ring size to only be slightly larger than the qdisc layer bulking size I get more bulking and better perf numbers. At least on microbenchmarks. The reason being the nic pushes back more on the qdisc. So maybe a case for making the ring size in the NIC some factor of the expected number of queues feeding the descriptor ring. .John
Re: [PATCH RFC 11/11] net/mlx5e: XDP TX xmit more
On Wed, Sep 7, 2016 at 9:19 PM, Eric Dumazetwrote: > On Wed, 2016-09-07 at 19:57 +0300, Saeed Mahameed wrote: > >> Jesper has a similar Idea to make the qdisc think it is under >> pressure, when the device >> TX ring is idle most of the time, i think his idea can come in handy here. >> I am not fully involved in the details, maybe he can elaborate more. >> >> But if it works, it will be transparent to napi, and xmit more will >> happen by design. > > I do not think qdisc is relevant here. > > Right now, skb->xmit_more is set only by qdisc layer (and pktgen tool), > because only this layer can know if more packets are to come. > > > What I am saying is that regardless of skb->xmit_more being set or not, > (for example if no qdisc is even used) > a NAPI driver can arm a bit asking the doorbell being sent at the end of > NAPI. > > I am not saying this must be done, only that the idea could be extended > to non XDP world, if we care enough. > Yes, and i am just trying to suggest Ideas that do not require communication between RX (NAPI) and TX. The problem here is the synchronization (TX doorbell from RX) which is not as simple as atomic operation for some drivers. How about RX bulking ? it also can help here, since for the forwarding case, the forwarding path will be able to process bulk of RX SKBs and can bulk xmit the portion of SKBs that will be forwarded. As Jesper suggested, Let's talk in Netdev1.2 at jesper's session. ( if you are joining of course). Thanks Saeed.
Re: [PATCH RFC 11/11] net/mlx5e: XDP TX xmit more
On Wed, 7 Sep 2016 19:57:19 +0300 Saeed Mahameedwrote: > On Wed, Sep 7, 2016 at 6:32 PM, Eric Dumazet wrote: > > On Wed, 2016-09-07 at 18:08 +0300, Saeed Mahameed wrote: > >> On Wed, Sep 7, 2016 at 5:41 PM, Eric Dumazet > >> wrote: > >> > On Wed, 2016-09-07 at 15:42 +0300, Saeed Mahameed wrote: [...] > > > > Only if a qdisc is present and pressure is high enough. > > > > But in a forwarding setup, we likely receive at a lower rate than the > > NIC can transmit. Yes, I can confirm this happens in my experiments. > > > > Jesper has a similar Idea to make the qdisc think it is under > pressure, when the device TX ring is idle most of the time, i think > his idea can come in handy here. I am not fully involved in the > details, maybe he can elaborate more. > > But if it works, it will be transparent to napi, and xmit more will > happen by design. Yes. I have some ideas around getting more bulking going from the qdisc layer, by having the drivers provide some feedback to the qdisc layer indicating xmit_more should be possible. This will be a topic at the Network Performance Workshop[1] at NetDev 1.2, I have will hopefully challenge people to come up with a good solution ;-) -- Best regards, Jesper Dangaard Brouer MSc.CS, Principal Kernel Engineer at Red Hat Author of http://www.iptv-analyzer.org LinkedIn: http://www.linkedin.com/in/brouer [1] http://netdevconf.org/1.2/session.html?jesper-performance-workshop
Re: [PATCH RFC 11/11] net/mlx5e: XDP TX xmit more
On Wed, 2016-09-07 at 19:57 +0300, Saeed Mahameed wrote: > Jesper has a similar Idea to make the qdisc think it is under > pressure, when the device > TX ring is idle most of the time, i think his idea can come in handy here. > I am not fully involved in the details, maybe he can elaborate more. > > But if it works, it will be transparent to napi, and xmit more will > happen by design. I do not think qdisc is relevant here. Right now, skb->xmit_more is set only by qdisc layer (and pktgen tool), because only this layer can know if more packets are to come. What I am saying is that regardless of skb->xmit_more being set or not, (for example if no qdisc is even used) a NAPI driver can arm a bit asking the doorbell being sent at the end of NAPI. I am not saying this must be done, only that the idea could be extended to non XDP world, if we care enough.
Re: [PATCH RFC 11/11] net/mlx5e: XDP TX xmit more
On Wed, Sep 7, 2016 at 6:32 PM, Eric Dumazetwrote: > On Wed, 2016-09-07 at 18:08 +0300, Saeed Mahameed wrote: >> On Wed, Sep 7, 2016 at 5:41 PM, Eric Dumazet wrote: >> > On Wed, 2016-09-07 at 15:42 +0300, Saeed Mahameed wrote: >> >> Previously we rang XDP SQ doorbell on every forwarded XDP packet. >> >> >> >> Here we introduce a xmit more like mechanism that will queue up more >> >> than one packet into SQ (up to RX napi budget) w/o notifying the hardware. >> >> >> >> Once RX napi budget is consumed and we exit napi RX loop, we will >> >> flush (doorbell) all XDP looped packets in case there are such. >> > >> > Why is this idea depends on XDP ? >> > >> > It looks like we could apply it to any driver having one IRQ servicing >> > one RX and one TX, without XDP being involved. >> > >> >> Yes but it is more complicated than XDP case, where the RX ring posts >> the TX descriptors and once done >> the RX ring hits the doorbell once for all the TX descriptors it >> posted, and it is the only possible place to hit a doorbell >> for XDP TX ring. >> >> For regular TX and RX ring sharing the same IRQ, there is no such >> simple connection between them, and hitting a doorbell >> from RX ring napi would race with xmit ndo function of the TX ring. >> >> How do you synchronize in such case ? >> isn't the existing xmit more mechanism sufficient enough ? > > Only if a qdisc is present and pressure is high enough. > > But in a forwarding setup, we likely receive at a lower rate than the > NIC can transmit. > Jesper has a similar Idea to make the qdisc think it is under pressure, when the device TX ring is idle most of the time, i think his idea can come in handy here. I am not fully involved in the details, maybe he can elaborate more. But if it works, it will be transparent to napi, and xmit more will happen by design. > A simple cmpxchg could be used to synchronize the thing, if we really > cared about doorbell cost. (Ie if the cost of this cmpxchg() is way > smaller than doorbell one) >
Re: [PATCH RFC 11/11] net/mlx5e: XDP TX xmit more
On Wed, 2016-09-07 at 18:08 +0300, Saeed Mahameed wrote: > On Wed, Sep 7, 2016 at 5:41 PM, Eric Dumazetwrote: > > On Wed, 2016-09-07 at 15:42 +0300, Saeed Mahameed wrote: > >> Previously we rang XDP SQ doorbell on every forwarded XDP packet. > >> > >> Here we introduce a xmit more like mechanism that will queue up more > >> than one packet into SQ (up to RX napi budget) w/o notifying the hardware. > >> > >> Once RX napi budget is consumed and we exit napi RX loop, we will > >> flush (doorbell) all XDP looped packets in case there are such. > > > > Why is this idea depends on XDP ? > > > > It looks like we could apply it to any driver having one IRQ servicing > > one RX and one TX, without XDP being involved. > > > > Yes but it is more complicated than XDP case, where the RX ring posts > the TX descriptors and once done > the RX ring hits the doorbell once for all the TX descriptors it > posted, and it is the only possible place to hit a doorbell > for XDP TX ring. > > For regular TX and RX ring sharing the same IRQ, there is no such > simple connection between them, and hitting a doorbell > from RX ring napi would race with xmit ndo function of the TX ring. > > How do you synchronize in such case ? > isn't the existing xmit more mechanism sufficient enough ? Only if a qdisc is present and pressure is high enough. But in a forwarding setup, we likely receive at a lower rate than the NIC can transmit. A simple cmpxchg could be used to synchronize the thing, if we really cared about doorbell cost. (Ie if the cost of this cmpxchg() is way smaller than doorbell one)
Re: [PATCH RFC 11/11] net/mlx5e: XDP TX xmit more
On Wed, Sep 7, 2016 at 5:41 PM, Eric Dumazetwrote: > On Wed, 2016-09-07 at 15:42 +0300, Saeed Mahameed wrote: >> Previously we rang XDP SQ doorbell on every forwarded XDP packet. >> >> Here we introduce a xmit more like mechanism that will queue up more >> than one packet into SQ (up to RX napi budget) w/o notifying the hardware. >> >> Once RX napi budget is consumed and we exit napi RX loop, we will >> flush (doorbell) all XDP looped packets in case there are such. > > Why is this idea depends on XDP ? > > It looks like we could apply it to any driver having one IRQ servicing > one RX and one TX, without XDP being involved. > Yes but it is more complicated than XDP case, where the RX ring posts the TX descriptors and once done the RX ring hits the doorbell once for all the TX descriptors it posted, and it is the only possible place to hit a doorbell for XDP TX ring. For regular TX and RX ring sharing the same IRQ, there is no such simple connection between them, and hitting a doorbell from RX ring napi would race with xmit ndo function of the TX ring. How do you synchronize in such case ? isn't the existing xmit more mechanism sufficient enough ? maybe we can have a fence from napi RX function that will hold the xmit queue until done and then flush the TX queue with the setting the right xmit more flags, without the need of explicitly intervening with TX flow (hitting the doorbell).
Re: [PATCH RFC 11/11] net/mlx5e: XDP TX xmit more
On Wed, 2016-09-07 at 15:42 +0300, Saeed Mahameed wrote: > Previously we rang XDP SQ doorbell on every forwarded XDP packet. > > Here we introduce a xmit more like mechanism that will queue up more > than one packet into SQ (up to RX napi budget) w/o notifying the hardware. > > Once RX napi budget is consumed and we exit napi RX loop, we will > flush (doorbell) all XDP looped packets in case there are such. Why is this idea depends on XDP ? It looks like we could apply it to any driver having one IRQ servicing one RX and one TX, without XDP being involved.
Re: [PATCH RFC 11/11] net/mlx5e: XDP TX xmit more
On 16-09-07 05:42 AM, Saeed Mahameed wrote: > Previously we rang XDP SQ doorbell on every forwarded XDP packet. > > Here we introduce a xmit more like mechanism that will queue up more > than one packet into SQ (up to RX napi budget) w/o notifying the hardware. > > Once RX napi budget is consumed and we exit napi RX loop, we will > flush (doorbell) all XDP looped packets in case there are such. > > XDP forward packet rate: > > Comparing XDP with and w/o xmit more (bulk transmit): > > Streams XDP TX XDP TX (xmit more) > --- > 1 4.90Mpps 7.50Mpps > 2 9.50Mpps 14.8Mpps > 4 16.5Mpps 25.1Mpps > 8 21.5Mpps 27.5Mpps* > 16 24.1Mpps 27.5Mpps* > Hi Saeed, How many cores are you using with these numbers? Just a single core? Or are streams being RSS'd across cores somehow. > *It seems we hit a wall of 27.5Mpps, for 8 and 16 streams, > we will be working on the analysis and will publish the conclusions > later. > Thanks, John