Re: [RFC PATCH 06/13] xen-netback: copy buffer on xenvif_start_xmit()
On Fri, May 22, 2015 at 10:26:48AM +, Joao Martins wrote:
[...]
> > >  	return IRQ_HANDLED;
> > >  }
> > > @@ -168,8 +169,12 @@ static int xenvif_start_xmit(struct sk_buff *skb, struct net_device *dev)
> > >  	cb = XENVIF_RX_CB(skb);
> > >  	cb->expires = jiffies + vif->drain_timeout;
> > > 
> > > -	xenvif_rx_queue_tail(queue, skb);
> > > -	xenvif_kick_thread(queue);
> > > +	if (!queue->vif->persistent_grants) {
> > > +		xenvif_rx_queue_tail(queue, skb);
> > > +		xenvif_kick_thread(queue);
> > > +	} else if (xenvif_rx_map(queue, skb)) {
> > > +		return NETDEV_TX_BUSY;
> > > +	}
> > 
> > We now have two different functions for guest RX, one is xenvif_rx_map,
> > the other is xenvif_rx_action. They look very similar. Can we only have
> > one?
> 
> I think I can merge this into xenvif_rx_action, and I notice that the
> stall detection is missing; I will also add that. Perhaps I could also
> disable the RX kthread, since it doesn't get used with persistent
> grants?

Disabling that kthread is fine. But we do need to make sure we can do
the same things in start_xmit as we do in the kthread, i.e. what
context start_xmit runs in and what the restrictions are.

Wei.
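For context on Wei's question: the following is general driver-API background, not anything taken from this series. .ndo_start_xmit handlers are invoked in softirq/atomic context (bottom halves disabled, the device's TX queue lock held), so a guest-RX copy moved inline there must never sleep. A sketch of the constraints, using hypothetical helper names (example_ring_has_room, example_copy_to_ring), might look like:

```
/*
 * Illustrative sketch only, not code from this series.
 *
 * .ndo_start_xmit runs with bottom halves disabled and HARD_TX_LOCK
 * held, i.e. atomic context.  So an inline guest-RX path must:
 *  - never sleep (no mutexes, no GFP_KERNEL allocations);
 *  - back-pressure by returning NETDEV_TX_BUSY instead of waiting
 *    for ring slots, unlike the kthread, which can schedule() until
 *    space becomes available.
 */
static netdev_tx_t example_start_xmit(struct sk_buff *skb,
				      struct net_device *dev)
{
	struct example_priv *p = netdev_priv(dev);

	if (!example_ring_has_room(p))	/* hypothetical helper */
		return NETDEV_TX_BUSY;	/* qdisc will requeue the skb */

	example_copy_to_ring(p, skb);	/* must complete without sleeping */
	dev_kfree_skb_any(skb);		/* safe to call in atomic context */
	return NETDEV_TX_OK;
}
```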
Re: [RFC PATCH 06/13] xen-netback: copy buffer on xenvif_start_xmit()
On 19 May 2015, at 17:35, Wei Liu <wei.l...@citrix.com> wrote:
> On Tue, May 12, 2015 at 07:18:30PM +0200, Joao Martins wrote:
> > By introducing persistent grants we decrease the copy cost in the RX
> > thread, yet this leads to a throughput decrease of 20%.
> > [... rest of commit message, lock_stat figures and diff trimmed ...]
> > @@ -168,8 +169,12 @@ static int xenvif_start_xmit(struct sk_buff *skb, struct net_device *dev)
> >  	cb = XENVIF_RX_CB(skb);
> >  	cb->expires = jiffies + vif->drain_timeout;
> > 
> > -	xenvif_rx_queue_tail(queue, skb);
> > -	xenvif_kick_thread(queue);
> > +	if (!queue->vif->persistent_grants) {
> > +		xenvif_rx_queue_tail(queue, skb);
> > +		xenvif_kick_thread(queue);
> > +	} else if (xenvif_rx_map(queue, skb)) {
> > +		return NETDEV_TX_BUSY;
> > +	}
> 
> We now have two different functions for guest RX, one is xenvif_rx_map,
> the other is xenvif_rx_action. They look very similar. Can we only have
> one?
Re: [RFC PATCH 06/13] xen-netback: copy buffer on xenvif_start_xmit()
On Tue, May 12, 2015 at 07:18:30PM +0200, Joao Martins wrote:
> [... commit message, lock_stat figures and earlier hunks trimmed ...]
> @@ -168,8 +169,12 @@ static int xenvif_start_xmit(struct sk_buff *skb, struct net_device *dev)
>  	cb = XENVIF_RX_CB(skb);
>  	cb->expires = jiffies + vif->drain_timeout;
> 
> -	xenvif_rx_queue_tail(queue, skb);
> -	xenvif_kick_thread(queue);
> +	if (!queue->vif->persistent_grants) {
> +		xenvif_rx_queue_tail(queue, skb);
> +		xenvif_kick_thread(queue);
> +	} else if (xenvif_rx_map(queue, skb)) {
> +		return NETDEV_TX_BUSY;
> +	}

We now have two different functions for guest RX, one is xenvif_rx_map,
the other is xenvif_rx_action. They look very similar. Can we only have
one?
[RFC PATCH 06/13] xen-netback: copy buffer on xenvif_start_xmit()
By introducing persistent grants we decrease the copy cost in the RX
thread, yet this leads to a throughput decrease of 20%. It is observed
that the rx_queue stays mostly at 10% of its capacity, as opposed to
full capacity when using grant copy. A finer measurement with lock_stat
(below, with pkt_size 64, burst 1) shows much higher wait queue
contention on queue->wq, which hints that the RX kthread waits/wakes up
more often than it actually does work.

Without persistent grants:

class name    con-bounces  contentions  waittime-min  waittime-max  waittime-total  waittime-avg  acq-bounces  acquisitions  holdtime-min  holdtime-max  holdtime-total  holdtime-avg
------------------------------------------------------------------------------------------------
queue->wq:            792          792          0.36         24.36         1140.30          1.44         4208       1002671          0.00         46.75       538164.02          0.54
             ---------
             queue->wq          326  [8115949f] __wake_up+0x2f/0x80
             queue->wq          410  [811592bf] finish_wait+0x4f/0xa0
             queue->wq           56  [811593eb] prepare_to_wait+0x2b/0xb0
             ---------
             queue->wq          202  [811593eb] prepare_to_wait+0x2b/0xb0
             queue->wq          467  [8115949f] __wake_up+0x2f/0x80
             queue->wq          123  [811592bf] finish_wait+0x4f/0xa0

With persistent grants:

queue->wq:          61834        61836          0.32         30.12        99710.27          1.61       241400       1125308          0.00         75.61      1106578.82          0.98
             ---------
             queue->wq         5079  [8115949f] __wake_up+0x2f/0x80
             queue->wq        56280  [811592bf] finish_wait+0x4f/0xa0
             queue->wq          479  [811593eb] prepare_to_wait+0x2b/0xb0
             ---------
             queue->wq         1005  [811592bf] finish_wait+0x4f/0xa0
             queue->wq        56761  [8115949f] __wake_up+0x2f/0x80
             queue->wq         4072  [811593eb] prepare_to_wait+0x2b/0xb0

Also, with persistent grants we don't require batching grant copy ops
(besides the initial copy+map), which makes me believe that deferring
the skb to the RX kthread just adds unnecessary overhead (for this
particular case). This patch proposes copying the buffer on
xenvif_start_xmit(), which lets us both remove the contention on
queue->wq and the lock on rx_queue.

Here, an alternative to the xenvif_rx_action routine is added, namely
xenvif_rx_map(), which maps and copies the buffer to the guest. It is
only used with persistent grants, since it would otherwise mean a
hypercall per packet.

Improvements are up to a factor of 2.14 with a single queue, taking us
from 1.04 Mpps to 1.7 Mpps (burst 1, pkt_size 64) and from 1.5 to 2.6
Mpps (burst 2, pkt_size 64), compared to using the kthread. The maximum
with grant copy is 1.2 Mpps, irrespective of the burst. All of this was
measured on an Intel(R) Xeon(R) CPU E5-1650 v2 @ 3.50GHz.

Signed-off-by: Joao Martins <joao.mart...@neclab.eu>
---
 drivers/net/xen-netback/common.h    |  2 ++
 drivers/net/xen-netback/interface.c | 11 +---
 drivers/net/xen-netback/netback.c   | 52 +
 3 files changed, 51 insertions(+), 14 deletions(-)

diff --git a/drivers/net/xen-netback/common.h b/drivers/net/xen-netback/common.h
index 23deb6a..f3ece12 100644
--- a/drivers/net/xen-netback/common.h
+++ b/drivers/net/xen-netback/common.h
@@ -363,6 +363,8 @@ void xenvif_kick_thread(struct xenvif_queue *queue);
 
 int xenvif_dealloc_kthread(void *data);
 
+int xenvif_rx_map(struct xenvif_queue *queue, struct sk_buff *skb);
+
 void xenvif_rx_queue_tail(struct xenvif_queue *queue, struct sk_buff *skb);
 
 /* Determine whether the needed number of slots (req) are available,
diff --git a/drivers/net/xen-netback/interface.c b/drivers/net/xen-netback/interface.c
index 1103568..dfe2b7b 100644
--- a/drivers/net/xen-netback/interface.c
+++ b/drivers/net/xen-netback/interface.c
@@ -109,7 +109,8 @@ static irqreturn_t xenvif_rx_interrupt(int irq, void *dev_id)
 {
 	struct xenvif_queue *queue = dev_id;
 
-	xenvif_kick_thread(queue);
+	if (!queue->vif->persistent_grants)
+		xenvif_kick_thread(queue);
 
 	return IRQ_HANDLED;
 }
@@ -168,8 +169,12 @@ static int xenvif_start_xmit(struct sk_buff *skb, struct net_device *dev)
 	cb = XENVIF_RX_CB(skb);
 	cb->expires = jiffies + vif->drain_timeout;
 
-	xenvif_rx_queue_tail(queue, skb);
-	xenvif_kick_thread(queue);
+	if (!queue->vif->persistent_grants) {
+		xenvif_rx_queue_tail(queue, skb);
+		xenvif_kick_thread(queue);
+	} else if (xenvif_rx_map(queue, skb)) {
+		return NETDEV_TX_BUSY;
+	}
 
 	return NETDEV_TX_OK;
 
diff --git a/drivers/net/xen-netback/netback.c b/drivers/net/xen-netback/netback.c
index c4f57d7..228df92 100644
--- a/drivers/net/xen-netback/netback.c
+++ b/drivers/net/xen-netback/netback.c
@@ -883,9 +883,48 @@ static bool xenvif_rx_submit(struct xenvif_queue *queue,
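The netback.c hunk that adds xenvif_rx_map() is cut off in this archived copy. As a rough, hypothetical illustration of the approach the commit message describes (a plain memcpy into an already-mapped persistent grant instead of batched grant-copy hypercalls), the function might look along these lines; the helpers persistent_gnt_to_virt(), xenvif_rx_ring_slots_available() and xenvif_rx_make_response() are assumed names, not taken from the series:

```
/*
 * Hedged sketch of the idea, not the actual patch body.  With
 * persistent grants the guest's RX buffers stay mapped in the backend,
 * so xenvif_start_xmit() can copy directly into the mapped page and
 * push a response on the ring, with no hypercall on the data path.
 * For simplicity this assumes the whole skb fits in a single slot.
 */
int xenvif_rx_map(struct xenvif_queue *queue, struct sk_buff *skb)
{
	RING_IDX idx = queue->rx.req_cons;
	struct xen_netif_rx_request *req;
	void *dst;
	int notify;

	if (!xenvif_rx_ring_slots_available(queue))	/* assumed check */
		return -EBUSY;		/* caller returns NETDEV_TX_BUSY */

	req = RING_GET_REQUEST(&queue->rx, idx);
	dst = persistent_gnt_to_virt(queue, req->gref);	/* assumed lookup */

	skb_copy_bits(skb, 0, dst, skb->len);	/* memcpy, no hypercall */
	queue->rx.req_cons = idx + 1;

	xenvif_rx_make_response(queue, req->id, skb->len); /* assumed */
	RING_PUSH_RESPONSES_AND_CHECK_NOTIFY(&queue->rx, notify);
	if (notify)
		notify_remote_via_irq(queue->rx_irq);

	dev_kfree_skb_any(skb);
	return 0;
}
```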