Re: [PATCH net-next v3 0/5] virtio-net tx napi

2017-04-24 Thread Michael S. Tsirkin
On Mon, Apr 24, 2017 at 01:49:25PM -0400, Willem de Bruijn wrote:
> From: Willem de Bruijn 
> 
> Add napi for virtio-net transmit completion processing.


Acked-by: Michael S. Tsirkin 

> Changes:
>   v2 -> v3:
> - convert __netif_tx_trylock to __netif_tx_lock on tx napi poll
>   ensure that the handler always cleans, to avoid deadlock
> - unconditionally clean in start_xmit
>   avoid adding an unnecessary "if (use_napi)" branch
> - remove virtqueue_disable_cb in patch 5/5
>   a noop in the common event_idx based loop
> - document affinity_hint_set constraint
> 
>   v1 -> v2:
> - disable by default
> - disable unless affinity_hint_set
>   because cache misses add up to a third higher cycle cost,
> e.g., in TCP_RR tests. This is not limited to the patch
> that enables tx completion cleaning in rx napi.
> - use trylock to avoid contention between tx and rx napi
> - keep interrupts masked during xmit_more (new patch 5/5)
>   this improves cycles especially for multi UDP_STREAM, which
> does not benefit from cleaning tx completions on rx napi.
> - move free_old_xmit_skbs (new patch 3/5)
>   to avoid forward declaration
> 
> not changed:
> - deduplicate virtnet_poll_tx and virtnet_poll_txclean
>   they look similar, but differ too much to make it
>   worthwhile.
> - delay netif_wake_subqueue for more than 2 + MAX_SKB_FRAGS
>   evaluated, but made no difference
> - patch 1/5
> 
>   RFC -> v1:
> - dropped vhost interrupt moderation patch:
>   not needed and likely expensive at light load
> - remove tx napi weight
> - always clean all tx completions
> - use boolean to toggle tx-napi, instead
> - only clean tx in rx if tx-napi is enabled
> - then clean tx before rx
> - fix: add missing braces in virtnet_freeze_down
> - testing: add 4KB TCP_RR + UDP test results
> 
> Based on previous patchsets by Jason Wang:
> 
>   [RFC V7 PATCH 0/7] enable tx interrupts for virtio-net
>   http://lkml.iu.edu/hypermail/linux/kernel/1505.3/00245.html
> 
> 
> Before commit b0c39dbdc204 ("virtio_net: don't free buffers in xmit
> ring") the virtio-net driver would free transmitted packets on
> transmission of new packets in ndo_start_xmit and, to catch the edge
> case when no new packet is sent, also in a timer at 10HZ.
> 
> A timer can cause long stalls. VIRTIO_F_NOTIFY_ON_EMPTY avoids stalls
> due to low free descriptor count. It does not address stalls due to a
> low socket SO_SNDBUF. Increasing timer frequency decreases that stall
> time, but increases interrupt rate and, thus, cycle count.
> 
> Currently, with no timer, packets are freed only at ndo_start_xmit.
> Latency of consume_skb is now unbounded. To avoid a deadlock if a socket
> reaches its SO_SNDBUF limit, packets are orphaned on tx. This breaks TCP
> small queues.
> 
> Reenable TCP small queues by removing the orphan. Instead of using a
> timer, convert the driver to regular tx napi. This does not have the
> unresolved stall issue and does not have any frequency to tune.
> 
> By keeping interrupts enabled by default, napi increases tx
> interrupt rate. VIRTIO_F_EVENT_IDX avoids sending an interrupt if
> one is already unacknowledged, which makes this more feasible today.
> Combine that with an optimization that brings interrupt rate
> back in line with the existing version for most workloads:
> 
> Tx completion cleaning on rx interrupts elides most explicit tx
> interrupts by relying on the fact that many rx interrupts fire.
> 
> Tested by running {1, 10, 100} {TCP, UDP} STREAM, RR, 4K_RR benchmarks
> from a guest to a server on the host, on an x86_64 Haswell. The guest
> runs 4 vCPUs pinned to 4 cores. vhost and the test server are
> pinned to a core each.
> 
> All results are the median of 5 runs, with variance well < 10%.
> Used neper (github.com/google/neper) as the test process.
> 
> Napi increases single stream throughput, but increases cycle cost.
> The optimizations bring this down. The previous patchset saw a
> regression with UDP_STREAM, which does not benefit from cleaning tx
> interrupts in rx napi. This regression is now gone for 10x, 100x.
> The remaining difference is higher 1x TCP_STREAM and lower 1x UDP_STREAM.
> 
> The latest results are with process, rx napi and tx napi affine to
> the same core. All numbers are lower than the previous patchset.
> 
> 
>              upstream     napi
> TCP_STREAM:
> 1x:
>   Mbps          27816    39805
>   Gcycles         274      285
> 
> 10x:
>   Mbps          42947    42531
>   Gcycles         300      296
> 
> 100x:
>   Mbps          31830    28042
>   Gcycles         279      269
> 
> TCP_RR Latency (us):
> 1x:
>   p50              21       21
>   p99              27       27
>   Gcycles         180      167
> 
> 10x:
>   p50              40       39
>   p99              52       52
>   Gcycles 

Re: [PATCH net-next v2 2/5] virtio-net: transmit napi

2017-04-24 Thread Willem de Bruijn
On Mon, Apr 24, 2017 at 1:14 PM, Michael S. Tsirkin  wrote:
> On Mon, Apr 24, 2017 at 01:05:45PM -0400, Willem de Bruijn wrote:
>> On Mon, Apr 24, 2017 at 12:40 PM, Michael S. Tsirkin  wrote:
>> > On Fri, Apr 21, 2017 at 10:50:12AM -0400, Willem de Bruijn wrote:
>> >> >>> Maybe I was wrong, but according to Michael's comment it looks like
>> >> >>> he wants to check affinity_hint_set just for speculative tx polling
>> >> >>> on rx napi instead of disabling it altogether.
>> >> >>>
>> >> >>> And I'm not convinced this is really needed: the driver only provides
>> >> >>> an affinity hint instead of affinity, so it's not guaranteed that tx
>> >> >>> and rx interrupts are on the same vcpus.
>> >> >>
>> >> >> You're right. I made the restriction broader than the request, to
>> >> >> really err on the side of caution for the initial merge of napi tx.
>> >> >> And enabling the optimization is always a win over keeping it off,
>> >> >> even without irq affinity.
>> >> >>
>> >> >> The cycle cost is significant without affinity regardless of whether
>> >> >> the optimization is used.
>> >> >
>> >> >
>> >> > Yes, I noticed this in the past too.
>> >> >
>> >> >> Though this is not limited to napi-tx, it is more
>> >> >> pronounced in that mode than without napi.
>> >> >>
>> >> >> 1x TCP_RR for affinity configuration {process, rx_irq, tx_irq}:
>> >> >>
>> >> >> upstream:
>> >> >>
>> >> >> 1,1,1: 28985 Mbps, 278 Gcyc
>> >> >> 1,0,2: 30067 Mbps, 402 Gcyc
>> >> >>
>> >> >> napi tx:
>> >> >>
>> >> >> 1,1,1: 34492 Mbps, 269 Gcyc
>> >> >> 1,0,2: 36527 Mbps, 537 Gcyc (!)
>> >> >> 1,0,1: 36269 Mbps, 394 Gcyc
>> >> >> 1,0,0: 34674 Mbps, 402 Gcyc
>> >> >>
>> >> >> This is a particularly strong example. It is also representative
>> >> >> of most RR tests. It is less pronounced in other streaming tests.
>> >> >> 10x TCP_RR, for instance:
>> >> >>
>> >> >> upstream:
>> >> >>
>> >> >> 1,1,1: 42267 Mbps, 301 Gcyc
>> >> >> 1,0,2: 40663 Mbps, 445 Gcyc
>> >> >>
>> >> >> napi tx:
>> >> >>
>> >> >> 1,1,1: 42420 Mbps, 303 Gcyc
>> >> >> 1,0,2:  42267 Mbps, 431 Gcyc
>> >> >>
>> >> >> These numbers were obtained with the virtqueue_enable_cb_delayed
>> >> >> optimization after xmit_skb, btw. It turns out that moving that before
>> >> >> increases 1x TCP_RR further to ~39 Gbps, at the cost of reducing
>> >> >> 100x TCP_RR a bit.
>> >> >
>> >> >
>> >> > I see, so I think we can leave the affinity hint optimization/check for
>> >> > future investigation:
>> >> >
>> >> > - to avoid endless optimization (e.g. we may want to share a single
>> >> > vector/napi for tx/rx queue pairs in the future) for this series.
>> >> > - tx napi is disabled by default which means we can do optimization on 
>> >> > top.
>> >>
>> >> Okay. I'll drop the vi->affinity_hint_set from the patch set for now.
>> >
>> > I kind of like it, let's be conservative. But I'd prefer a comment
>> > near it explaining why it's there.
>>
>> I don't feel strongly. Was minutes away from sending a v3 with this
>> code reverted, but I'll reinstate it and add a comment. Other planned
>> changes based on Jason's feedback to v2:
>>
>>   v2 -> v3:
>> - convert __netif_tx_trylock to __netif_tx_lock on tx napi poll
>>   ensure that the handler always cleans, to avoid deadlock
>> - unconditionally clean in start_xmit
>>   avoid adding an unnecessary "if (use_napi)" branch
>> - remove virtqueue_disable_cb in patch 5/5
>>   a noop in the common event_idx based loop
>
> Makes sense, thanks!

Great. Sent that, thanks.

The actual diff to v2 is quite small:

diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
index b107ae011632..003143835766 100644
--- a/drivers/net/virtio_net.c
+++ b/drivers/net/virtio_net.c
@@ -986,6 +986,9 @@ static void virtnet_napi_tx_enable(struct virtnet_info *vi,
if (!napi->weight)
return;

+   /* Tx napi touches cachelines on the cpu handling tx interrupts. Only
+* enable the feature if this is likely affine with the transmit path.
+*/
if (!vi->affinity_hint_set) {
napi->weight = 0;
return;
@@ -1131,10 +1134,9 @@ static int virtnet_poll_tx(struct napi_struct *napi, int budget)
struct virtnet_info *vi = sq->vq->vdev->priv;
struct netdev_queue *txq = netdev_get_tx_queue(vi->dev, vq2txq(sq->vq));

-   if (__netif_tx_trylock(txq)) {
-   free_old_xmit_skbs(sq);
-   __netif_tx_unlock(txq);
-   }
+   __netif_tx_lock(txq, raw_smp_processor_id());
+   free_old_xmit_skbs(sq);
+   __netif_tx_unlock(txq);

virtqueue_napi_complete(napi, sq->vq, 0);

@@ -1196,14 +1198,10 @@ static netdev_tx_t start_xmit(struct sk_buff *skb, struct net_device *dev)
bool use_napi = sq->napi.weight;

/* Free up any pending old buffers before queueing new ones. */
-   if (use_napi) {
-   if 

[PATCH net-next v3 4/5] virtio-net: clean tx descriptors from rx napi

2017-04-24 Thread Willem de Bruijn
From: Willem de Bruijn 

Amortize the cost of virtual interrupts by doing both rx and tx work
on reception of a receive interrupt if tx napi is enabled. With
VIRTIO_F_EVENT_IDX, this suppresses most explicit tx completion
interrupts for bidirectional workloads.

Signed-off-by: Willem de Bruijn 
---
 drivers/net/virtio_net.c | 21 +
 1 file changed, 21 insertions(+)

diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
index 4ec79e5d7a86..9dd978f34c1f 100644
--- a/drivers/net/virtio_net.c
+++ b/drivers/net/virtio_net.c
@@ -1075,12 +1075,33 @@ static void free_old_xmit_skbs(struct send_queue *sq)
u64_stats_update_end(&sq->tx_syncp);
 }
 
+static void virtnet_poll_cleantx(struct receive_queue *rq)
+{
+   struct virtnet_info *vi = rq->vq->vdev->priv;
+   unsigned int index = vq2rxq(rq->vq);
+   struct send_queue *sq = &vi->sq[index];
+   struct netdev_queue *txq = netdev_get_tx_queue(vi->dev, index);
+
+   if (!sq->napi.weight)
+   return;
+
+   if (__netif_tx_trylock(txq)) {
+   free_old_xmit_skbs(sq);
+   __netif_tx_unlock(txq);
+   }
+
+   if (sq->vq->num_free >= 2 + MAX_SKB_FRAGS)
+   netif_tx_wake_queue(txq);
+}
+
 static int virtnet_poll(struct napi_struct *napi, int budget)
 {
struct receive_queue *rq =
container_of(napi, struct receive_queue, napi);
unsigned int received;
 
+   virtnet_poll_cleantx(rq);
+
received = virtnet_receive(rq, budget);
 
/* Out of packets? */
-- 
2.12.2.816.g281164-goog

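(A note on the wake threshold in virtnet_poll_cleantx above: 2 +
MAX_SKB_FRAGS matches the stop condition in the transmit path, since a
worst-case skb needs descriptors for MAX_SKB_FRAGS fragments plus the
linear data and the virtio-net header. The queue is therefore woken as
soon as one more worst-case packet fits.)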


[PATCH net-next v3 5/5] virtio-net: keep tx interrupts disabled unless kick

2017-04-24 Thread Willem de Bruijn
From: Willem de Bruijn 

Tx napi mode increases the rate of transmit interrupts. Suppress some
by masking interrupts while more packets are expected. The interrupts
will be reenabled before the last packet is sent.

This optimization reduces the throughput drop with tx napi for
unidirectional flows such as UDP_STREAM that do not benefit from
cleaning tx completions in the receive napi handler.

Signed-off-by: Willem de Bruijn 
---
 drivers/net/virtio_net.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
index 9dd978f34c1f..003143835766 100644
--- a/drivers/net/virtio_net.c
+++ b/drivers/net/virtio_net.c
@@ -1200,6 +1200,9 @@ static netdev_tx_t start_xmit(struct sk_buff *skb, struct net_device *dev)
/* Free up any pending old buffers before queueing new ones. */
free_old_xmit_skbs(sq);
 
+   if (use_napi && kick)
+   virtqueue_enable_cb_delayed(sq->vq);
+
/* timestamp packet in software */
skb_tx_timestamp(skb);
 
-- 
2.12.2.816.g281164-goog



[PATCH net-next v3 2/5] virtio-net: transmit napi

2017-04-24 Thread Willem de Bruijn
From: Willem de Bruijn 

Convert virtio-net to a standard napi tx completion path. This enables
better TCP pacing using TCP small queues and increases single stream
throughput.

The virtio-net driver currently cleans tx descriptors on transmission
of new packets in ndo_start_xmit. Latency depends on new traffic, so it
is unbounded. To avoid deadlock when a socket reaches its snd limit,
packets are orphaned on transmission. This breaks socket backpressure,
including TSQ.

Napi increases the number of interrupts generated compared to the
current model, which keeps interrupts disabled as long as the ring
has enough free descriptors. Keep tx napi optional and disabled for
now. Follow-on patches will reduce the interrupt cost.

Signed-off-by: Willem de Bruijn 
Signed-off-by: Jason Wang 
---
 drivers/net/virtio_net.c | 76 ++--
 1 file changed, 67 insertions(+), 9 deletions(-)

diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
index b9c1df29892c..356d18481ee4 100644
--- a/drivers/net/virtio_net.c
+++ b/drivers/net/virtio_net.c
@@ -33,9 +33,10 @@
 static int napi_weight = NAPI_POLL_WEIGHT;
 module_param(napi_weight, int, 0444);
 
-static bool csum = true, gso = true;
+static bool csum = true, gso = true, napi_tx;
 module_param(csum, bool, 0444);
 module_param(gso, bool, 0444);
+module_param(napi_tx, bool, 0644);
 
 /* FIXME: MTU in config. */
 #define GOOD_PACKET_LEN (ETH_HLEN + VLAN_HLEN + ETH_DATA_LEN)
@@ -86,6 +87,8 @@ struct send_queue {
 
/* Name of the send queue: output.$index */
char name[40];
+
+   struct napi_struct napi;
 };
 
 /* Internal representation of a receive virtqueue */
@@ -262,12 +265,16 @@ static void virtqueue_napi_complete(struct napi_struct *napi,
 static void skb_xmit_done(struct virtqueue *vq)
 {
struct virtnet_info *vi = vq->vdev->priv;
+   struct napi_struct *napi = &vi->sq[vq2txq(vq)].napi;
 
/* Suppress further interrupts. */
virtqueue_disable_cb(vq);
 
-   /* We were probably waiting for more output buffers. */
-   netif_wake_subqueue(vi->dev, vq2txq(vq));
+   if (napi->weight)
+   virtqueue_napi_schedule(napi, vq);
+   else
+   /* We were probably waiting for more output buffers. */
+   netif_wake_subqueue(vi->dev, vq2txq(vq));
 }
 
 static unsigned int mergeable_ctx_to_buf_truesize(unsigned long mrg_ctx)
@@ -972,6 +979,24 @@ static void virtnet_napi_enable(struct virtqueue *vq, struct napi_struct *napi)
local_bh_enable();
 }
 
+static void virtnet_napi_tx_enable(struct virtnet_info *vi,
+  struct virtqueue *vq,
+  struct napi_struct *napi)
+{
+   if (!napi->weight)
+   return;
+
+   /* Tx napi touches cachelines on the cpu handling tx interrupts. Only
+* enable the feature if this is likely affine with the transmit path.
+*/
+   if (!vi->affinity_hint_set) {
+   napi->weight = 0;
+   return;
+   }
+
+   return virtnet_napi_enable(vq, napi);
+}
+
 static void refill_work(struct work_struct *work)
 {
struct virtnet_info *vi =
@@ -1046,6 +1071,7 @@ static int virtnet_open(struct net_device *dev)
if (!try_fill_recv(vi, &vi->rq[i], GFP_KERNEL))
schedule_delayed_work(&vi->refill, 0);
virtnet_napi_enable(vi->rq[i].vq, &vi->rq[i].napi);
+   virtnet_napi_tx_enable(vi, vi->sq[i].vq, &vi->sq[i].napi);
}
 
return 0;
@@ -1081,6 +1107,24 @@ static void free_old_xmit_skbs(struct send_queue *sq)
u64_stats_update_end(&sq->tx_syncp);
 }
 
+static int virtnet_poll_tx(struct napi_struct *napi, int budget)
+{
+   struct send_queue *sq = container_of(napi, struct send_queue, napi);
+   struct virtnet_info *vi = sq->vq->vdev->priv;
+   struct netdev_queue *txq = netdev_get_tx_queue(vi->dev, vq2txq(sq->vq));
+
+   __netif_tx_lock(txq, raw_smp_processor_id());
+   free_old_xmit_skbs(sq);
+   __netif_tx_unlock(txq);
+
+   virtqueue_napi_complete(napi, sq->vq, 0);
+
+   if (sq->vq->num_free >= 2 + MAX_SKB_FRAGS)
+   netif_tx_wake_queue(txq);
+
+   return 0;
+}
+
 static int xmit_skb(struct send_queue *sq, struct sk_buff *skb)
 {
struct virtio_net_hdr_mrg_rxbuf *hdr;
@@ -1130,6 +1174,7 @@ static netdev_tx_t start_xmit(struct sk_buff *skb, struct net_device *dev)
int err;
struct netdev_queue *txq = netdev_get_tx_queue(dev, qnum);
bool kick = !skb->xmit_more;
+   bool use_napi = sq->napi.weight;
 
/* Free up any pending old buffers before queueing new ones. */
free_old_xmit_skbs(sq);
@@ -1152,8 +1197,10 @@ static netdev_tx_t start_xmit(struct sk_buff *skb, struct net_device *dev)
}
 
/* Don't wait up for transmitted skbs to be freed. */
-   

[PATCH net-next v3 3/5] virtio-net: move free_old_xmit_skbs

2017-04-24 Thread Willem de Bruijn
From: Willem de Bruijn 

An upcoming patch will call free_old_xmit_skbs indirectly from
virtnet_poll. Move the function above this to avoid having to
introduce a forward declaration.

This is a pure move: no code changes.

Signed-off-by: Willem de Bruijn 
---
 drivers/net/virtio_net.c | 60 
 1 file changed, 30 insertions(+), 30 deletions(-)

diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
index 356d18481ee4..4ec79e5d7a86 100644
--- a/drivers/net/virtio_net.c
+++ b/drivers/net/virtio_net.c
@@ -1045,6 +1045,36 @@ static int virtnet_receive(struct receive_queue *rq, int budget)
return received;
 }
 
+static void free_old_xmit_skbs(struct send_queue *sq)
+{
+   struct sk_buff *skb;
+   unsigned int len;
+   struct virtnet_info *vi = sq->vq->vdev->priv;
+   struct virtnet_stats *stats = this_cpu_ptr(vi->stats);
+   unsigned int packets = 0;
+   unsigned int bytes = 0;
+
+   while ((skb = virtqueue_get_buf(sq->vq, &len)) != NULL) {
+   pr_debug("Sent skb %p\n", skb);
+
+   bytes += skb->len;
+   packets++;
+
+   dev_kfree_skb_any(skb);
+   }
+
+   /* Avoid overhead when no packets have been processed
+* happens when called speculatively from start_xmit.
+*/
+   if (!packets)
+   return;
+
+   u64_stats_update_begin(&sq->tx_syncp);
+   stats->tx_bytes += bytes;
+   stats->tx_packets += packets;
+   u64_stats_update_end(&sq->tx_syncp);
+}
+
 static int virtnet_poll(struct napi_struct *napi, int budget)
 {
struct receive_queue *rq =
@@ -1077,36 +1107,6 @@ static int virtnet_open(struct net_device *dev)
return 0;
 }
 
-static void free_old_xmit_skbs(struct send_queue *sq)
-{
-   struct sk_buff *skb;
-   unsigned int len;
-   struct virtnet_info *vi = sq->vq->vdev->priv;
-   struct virtnet_stats *stats = this_cpu_ptr(vi->stats);
-   unsigned int packets = 0;
-   unsigned int bytes = 0;
-
-   while ((skb = virtqueue_get_buf(sq->vq, &len)) != NULL) {
-   pr_debug("Sent skb %p\n", skb);
-
-   bytes += skb->len;
-   packets++;
-
-   dev_kfree_skb_any(skb);
-   }
-
-   /* Avoid overhead when no packets have been processed
-* happens when called speculatively from start_xmit.
-*/
-   if (!packets)
-   return;
-
-   u64_stats_update_begin(&sq->tx_syncp);
-   stats->tx_bytes += bytes;
-   stats->tx_packets += packets;
-   u64_stats_update_end(&sq->tx_syncp);
-}
-
 static int virtnet_poll_tx(struct napi_struct *napi, int budget)
 {
struct send_queue *sq = container_of(napi, struct send_queue, napi);
-- 
2.12.2.816.g281164-goog



[PATCH net-next v3 1/5] virtio-net: napi helper functions

2017-04-24 Thread Willem de Bruijn
From: Willem de Bruijn 

Prepare virtio-net for tx napi by converting existing napi code to
use helper functions. This also deduplicates some logic.

Signed-off-by: Willem de Bruijn 
Signed-off-by: Jason Wang 
---
 drivers/net/virtio_net.c | 65 ++--
 1 file changed, 35 insertions(+), 30 deletions(-)

diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
index 666ada6130ab..b9c1df29892c 100644
--- a/drivers/net/virtio_net.c
+++ b/drivers/net/virtio_net.c
@@ -239,6 +239,26 @@ static struct page *get_a_page(struct receive_queue *rq, gfp_t gfp_mask)
return p;
 }
 
+static void virtqueue_napi_schedule(struct napi_struct *napi,
+   struct virtqueue *vq)
+{
+   if (napi_schedule_prep(napi)) {
+   virtqueue_disable_cb(vq);
+   __napi_schedule(napi);
+   }
+}
+
+static void virtqueue_napi_complete(struct napi_struct *napi,
+   struct virtqueue *vq, int processed)
+{
+   int opaque;
+
+   opaque = virtqueue_enable_cb_prepare(vq);
+   if (napi_complete_done(napi, processed) &&
+   unlikely(virtqueue_poll(vq, opaque)))
+   virtqueue_napi_schedule(napi, vq);
+}
+
 static void skb_xmit_done(struct virtqueue *vq)
 {
struct virtnet_info *vi = vq->vdev->priv;
@@ -936,27 +956,20 @@ static void skb_recv_done(struct virtqueue *rvq)
struct virtnet_info *vi = rvq->vdev->priv;
struct receive_queue *rq = &vi->rq[vq2rxq(rvq)];
 
-   /* Schedule NAPI, Suppress further interrupts if successful. */
-   if (napi_schedule_prep(&rq->napi)) {
-   virtqueue_disable_cb(rvq);
-   __napi_schedule(&rq->napi);
-   }
+   virtqueue_napi_schedule(&rq->napi, rvq);
 }
 
-static void virtnet_napi_enable(struct receive_queue *rq)
+static void virtnet_napi_enable(struct virtqueue *vq, struct napi_struct *napi)
 {
-   napi_enable(&rq->napi);
+   napi_enable(napi);
 
/* If all buffers were filled by other side before we napi_enabled, we
-* won't get another interrupt, so process any outstanding packets
-* now.  virtnet_poll wants re-enable the queue, so we disable here.
-* We synchronize against interrupts via NAPI_STATE_SCHED */
-   if (napi_schedule_prep(&rq->napi)) {
-   virtqueue_disable_cb(rq->vq);
-   local_bh_disable();
-   __napi_schedule(&rq->napi);
-   local_bh_enable();
-   }
+* won't get another interrupt, so process any outstanding packets now.
+* Call local_bh_enable after to trigger softIRQ processing.
+*/
+   local_bh_disable();
+   virtqueue_napi_schedule(napi, vq);
+   local_bh_enable();
 }
 
 static void refill_work(struct work_struct *work)
@@ -971,7 +984,7 @@ static void refill_work(struct work_struct *work)
 
napi_disable(&rq->napi);
still_empty = !try_fill_recv(vi, rq, GFP_KERNEL);
-   virtnet_napi_enable(rq);
+   virtnet_napi_enable(rq->vq, &rq->napi);
 
/* In theory, this can happen: if we don't get any buffers in
 * we will *never* try to fill again.
@@ -1011,21 +1024,13 @@ static int virtnet_poll(struct napi_struct *napi, int budget)
 {
struct receive_queue *rq =
container_of(napi, struct receive_queue, napi);
-   unsigned int r, received;
+   unsigned int received;
 
received = virtnet_receive(rq, budget);
 
/* Out of packets? */
-   if (received < budget) {
-   r = virtqueue_enable_cb_prepare(rq->vq);
-   if (napi_complete_done(napi, received)) {
-   if (unlikely(virtqueue_poll(rq->vq, r)) &&
-   napi_schedule_prep(napi)) {
-   virtqueue_disable_cb(rq->vq);
-   __napi_schedule(napi);
-   }
-   }
-   }
+   if (received < budget)
+   virtqueue_napi_complete(napi, rq->vq, received);
 
return received;
 }
@@ -1040,7 +1045,7 @@ static int virtnet_open(struct net_device *dev)
/* Make sure we have some buffers: if oom use wq. */
if (!try_fill_recv(vi, &vi->rq[i], GFP_KERNEL))
schedule_delayed_work(&vi->refill, 0);
-   virtnet_napi_enable(&vi->rq[i]);
+   virtnet_napi_enable(vi->rq[i].vq, &vi->rq[i].napi);
}
 
return 0;
@@ -1747,7 +1752,7 @@ static int virtnet_restore_up(struct virtio_device *vdev)
schedule_delayed_work(&vi->refill, 0);
 
for (i = 0; i < vi->max_queue_pairs; i++)
-   virtnet_napi_enable(&vi->rq[i]);
+   virtnet_napi_enable(vi->rq[i].vq, &vi->rq[i].napi);
}
 
netif_device_attach(vi->dev);
-- 
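
(On the ordering in virtqueue_napi_complete above: callbacks are re-armed
with virtqueue_enable_cb_prepare before napi_complete_done, and
virtqueue_poll then checks whether new buffers arrived in that window. If
so, napi is rescheduled rather than relying on an interrupt that may never
fire.)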

Re: [PATCH net-next v2 2/5] virtio-net: transmit napi

2017-04-24 Thread Willem de Bruijn
On Mon, Apr 24, 2017 at 12:40 PM, Michael S. Tsirkin  wrote:
> On Fri, Apr 21, 2017 at 10:50:12AM -0400, Willem de Bruijn wrote:
>> >>> Maybe I was wrong, but according to Michael's comment it looks like he
>> >>> wants to check affinity_hint_set just for speculative tx polling on rx
>> >>> napi instead of disabling it altogether.
>> >>>
>> >>> And I'm not convinced this is really needed: the driver only provides an
>> >>> affinity hint instead of affinity, so it's not guaranteed that tx and rx
>> >>> interrupts are on the same vcpus.
>> >>
>> >> You're right. I made the restriction broader than the request, to really
>> >> err on the side of caution for the initial merge of napi tx. And enabling
>> >> the optimization is always a win over keeping it off, even without irq
>> >> affinity.
>> >>
>> >> The cycle cost is significant without affinity regardless of whether the
>> >> optimization is used.
>> >
>> > Yes, I noticed this in the past too.
>> >
>> >> Though this is not limited to napi-tx, it is more
>> >> pronounced in that mode than without napi.
>> >>
>> >> 1x TCP_RR for affinity configuration {process, rx_irq, tx_irq}:
>> >>
>> >> upstream:
>> >>
>> >> 1,1,1: 28985 Mbps, 278 Gcyc
>> >> 1,0,2: 30067 Mbps, 402 Gcyc
>> >>
>> >> napi tx:
>> >>
>> >> 1,1,1: 34492 Mbps, 269 Gcyc
>> >> 1,0,2: 36527 Mbps, 537 Gcyc (!)
>> >> 1,0,1: 36269 Mbps, 394 Gcyc
>> >> 1,0,0: 34674 Mbps, 402 Gcyc
>> >>
>> >> This is a particularly strong example. It is also representative
>> >> of most RR tests. It is less pronounced in other streaming tests.
>> >> 10x TCP_RR, for instance:
>> >>
>> >> upstream:
>> >>
>> >> 1,1,1: 42267 Mbps, 301 Gcyc
>> >> 1,0,2: 40663 Mbps, 445 Gcyc
>> >>
>> >> napi tx:
>> >>
>> >> 1,1,1: 42420 Mbps, 303 Gcyc
>> >> 1,0,2:  42267 Mbps, 431 Gcyc
>> >>
>> >> These numbers were obtained with the virtqueue_enable_cb_delayed
>> >> optimization after xmit_skb, btw. It turns out that moving that before
>> >> increases 1x TCP_RR further to ~39 Gbps, at the cost of reducing
>> >> 100x TCP_RR a bit.
>> >
>> >
>> > I see, so I think we can leave the affinity hint optimization/check for
>> > future investigation:
>> >
>> > - to avoid endless optimization (e.g. we may want to share a single
>> > vector/napi for tx/rx queue pairs in the future) for this series.
>> > - tx napi is disabled by default which means we can do optimization on top.
>>
>> Okay. I'll drop the vi->affinity_hint_set from the patch set for now.
>
> I kind of like it, let's be conservative. But I'd prefer a comment
> near it explaining why it's there.

I don't feel strongly. Was minutes away from sending a v3 with this
code reverted, but I'll reinstate it and add a comment. Other planned
changes based on Jason's feedback to v2:

  v2 -> v3:
- convert __netif_tx_trylock to __netif_tx_lock on tx napi poll
  ensure that the handler always cleans, to avoid deadlock
- unconditionally clean in start_xmit
  avoid adding an unnecessary "if (use_napi)" branch
- remove virtqueue_disable_cb in patch 5/5
  a noop in the common event_idx based loop


Re: [PATCH net-next v2 5/5] virtio-net: keep tx interrupts disabled unless kick

2017-04-24 Thread Willem de Bruijn
On Thu, Apr 20, 2017 at 10:03 AM, Willem de Bruijn
 wrote:
>>> -   if (!use_napi)
>>> +   if (use_napi) {
>>> +   if (kick)
>>> +   virtqueue_enable_cb_delayed(sq->vq);
>>> +   else
>>> +   virtqueue_disable_cb(sq->vq);
>>
>>
>> Since virtqueue_disable_cb() does nothing for event idx, I wonder whether
>> just calling virtqueue_enable_cb_delayed() is ok here.
>
> Good point.
>
>> Btw, it does not disable the interrupt at all. I proposed a patch in the
>> past which can do more than this:
>>
>> https://patchwork.kernel.org/patch/6472601/
>
> Interesting. Yes, let me evaluate that variant.

In initial tests I don't see a significant change, but we can look
into this more closely as a follow-on patch.


Re: [PATCH net-next v2 2/5] virtio-net: transmit napi

2017-04-24 Thread Willem de Bruijn
On Thu, Apr 20, 2017 at 12:02 PM, Willem de Bruijn
 wrote:
>>>   static int xmit_skb(struct send_queue *sq, struct sk_buff *skb)
>>>   {
>>> struct virtio_net_hdr_mrg_rxbuf *hdr;
>>> @@ -1130,9 +1172,11 @@ static netdev_tx_t start_xmit(struct sk_buff *skb,
>>> struct net_device *dev)
>>> int err;
>>> struct netdev_queue *txq = netdev_get_tx_queue(dev, qnum);
>>> bool kick = !skb->xmit_more;
>>> +   bool use_napi = sq->napi.weight;
>>> /* Free up any pending old buffers before queueing new ones. */
>>> -   free_old_xmit_skbs(sq);
>>> +   if (!use_napi)
>>> +   free_old_xmit_skbs(sq);
>>
>>
>> I'm not sure this is best or even correct. Considering we clean xmit packets
>> speculatively in virtnet_poll_tx(), we need to call free_old_xmit_skbs()
>> unconditionally. This can also help to reduce the possibility of napi
>> rescheduling in virtnet_poll_tx().
>
> Because of the use of trylock there. Absolutely, thanks! Perhaps I should
> only use trylock in the opportunistic clean path from the rx softirq and
> full locking in the tx softirq.
>
> I previously observed that cleaning here would, counterintuitively,
> reduce efficiency. It reverted the improvements of cleaning transmit
> completions from the receive softirq. Going through my data, I did
> not observe this regression anymore on the latest patchset.
>
> Let me test again, with and without the new
> virtqueue_enable_cb_delayed patch. Perhaps that made a
> difference.

Neither cleaning in start_xmit nor converting the napi tx trylock to lock
shows a significant impact on loadtests, whether cpu affine or not.

I'll make both changes, as the first reduces patch size and code complexity
and the second is a more obviously correct codepath than with trylock.

To be clear, the variant called from the rx napi handler will still
opportunistically use trylock.


Re: [PATCH net-next v2 2/5] virtio-net: transmit napi

2017-04-24 Thread Willem de Bruijn
>>> Maybe I was wrong, but according to Michael's comment it looks like he
>>> wants to check affinity_hint_set just for speculative tx polling on rx
>>> napi instead of disabling it altogether.
>>>
>>> And I'm not convinced this is really needed: the driver only provides an
>>> affinity hint instead of affinity, so it's not guaranteed that tx and rx
>>> interrupts are on the same vcpus.
>>
>> You're right. I made the restriction broader than the request, to really
>> err on the side of caution for the initial merge of napi tx. And enabling
>> the optimization is always a win over keeping it off, even without irq
>> affinity.
>>
>> The cycle cost is significant without affinity regardless of whether the
>> optimization is used.
>
>
> Yes, I noticed this in the past too.
>
>> Though this is not limited to napi-tx, it is more
>> pronounced in that mode than without napi.
>>
>> 1x TCP_RR for affinity configuration {process, rx_irq, tx_irq}:
>>
>> upstream:
>>
>> 1,1,1: 28985 Mbps, 278 Gcyc
>> 1,0,2: 30067 Mbps, 402 Gcyc
>>
>> napi tx:
>>
>> 1,1,1: 34492 Mbps, 269 Gcyc
>> 1,0,2: 36527 Mbps, 537 Gcyc (!)
>> 1,0,1: 36269 Mbps, 394 Gcyc
>> 1,0,0: 34674 Mbps, 402 Gcyc
>>
>> This is a particularly strong example. It is also representative
>> of most RR tests. It is less pronounced in other streaming tests.
>> 10x TCP_RR, for instance:
>>
>> upstream:
>>
>> 1,1,1: 42267 Mbps, 301 Gcyc
>> 1,0,2: 40663 Mbps, 445 Gcyc
>>
>> napi tx:
>>
>> 1,1,1: 42420 Mbps, 303 Gcyc
>> 1,0,2:  42267 Mbps, 431 Gcyc
>>
>> These numbers were obtained with the virtqueue_enable_cb_delayed
>> optimization after xmit_skb, btw. It turns out that moving that before
>> increases 1x TCP_RR further to ~39 Gbps, at the cost of reducing
>> 100x TCP_RR a bit.
>
>
> I see, so I think we can leave the affinity hint optimization/check for
> future investigation:
>
> - to avoid endless optimization (e.g. we may want to share a single
> vector/napi for tx/rx queue pairs in the future) for this series.
> - tx napi is disabled by default which means we can do optimization on top.

Okay. I'll drop the vi->affinity_hint_set from the patch set for now.


[PATCH 2] net: virtio_net: use new api ethtool_{get|set}_link_ksettings

2017-04-24 Thread Philippe Reynes
The ethtool api {get|set}_settings is deprecated.
We move this driver to the new api {get|set}_link_ksettings.

Signed-off-by: Philippe Reynes 
---
Changelog:
v2:
- remove comment about the missing hardware,
  I've tested this change with qemu

 drivers/net/virtio_net.c |   50 +++--
 1 files changed, 30 insertions(+), 20 deletions(-)

diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
index ea9890d..b0d241d 100644
--- a/drivers/net/virtio_net.c
+++ b/drivers/net/virtio_net.c
@@ -1636,47 +1636,57 @@ static void virtnet_get_channels(struct net_device *dev,
 }
 
 /* Check if the user is trying to change anything besides speed/duplex */
-static bool virtnet_validate_ethtool_cmd(const struct ethtool_cmd *cmd)
+static bool
+virtnet_validate_ethtool_cmd(const struct ethtool_link_ksettings *cmd)
 {
-   struct ethtool_cmd diff1 = *cmd;
-   struct ethtool_cmd diff2 = {};
+   struct ethtool_link_ksettings diff1 = *cmd;
+   struct ethtool_link_ksettings diff2 = {};
 
/* cmd is always set so we need to clear it, validate the port type
 * and also without autonegotiation we can ignore advertising
 */
-   ethtool_cmd_speed_set(&diff1, 0);
-   diff2.port = PORT_OTHER;
-   diff1.advertising = 0;
-   diff1.duplex = 0;
-   diff1.cmd = 0;
+   diff1.base.speed = 0;
+   diff2.base.port = PORT_OTHER;
+   ethtool_link_ksettings_zero_link_mode(&diff1, advertising);
+   diff1.base.duplex = 0;
+   diff1.base.cmd = 0;
+   diff1.base.link_mode_masks_nwords = 0;
 
-   return !memcmp(&diff1, &diff2, sizeof(diff1));
+   return !memcmp(&diff1.base, &diff2.base, sizeof(diff1.base)) &&
+   bitmap_empty(diff1.link_modes.supported,
+__ETHTOOL_LINK_MODE_MASK_NBITS) &&
+   bitmap_empty(diff1.link_modes.advertising,
+__ETHTOOL_LINK_MODE_MASK_NBITS) &&
+   bitmap_empty(diff1.link_modes.lp_advertising,
+__ETHTOOL_LINK_MODE_MASK_NBITS);
 }
 
-static int virtnet_set_settings(struct net_device *dev, struct ethtool_cmd *cmd)
+static int virtnet_set_link_ksettings(struct net_device *dev,
+ const struct ethtool_link_ksettings *cmd)
 {
struct virtnet_info *vi = netdev_priv(dev);
u32 speed;
 
-   speed = ethtool_cmd_speed(cmd);
+   speed = cmd->base.speed;
/* don't allow custom speed and duplex */
if (!ethtool_validate_speed(speed) ||
-   !ethtool_validate_duplex(cmd->duplex) ||
+   !ethtool_validate_duplex(cmd->base.duplex) ||
!virtnet_validate_ethtool_cmd(cmd))
return -EINVAL;
vi->speed = speed;
-   vi->duplex = cmd->duplex;
+   vi->duplex = cmd->base.duplex;
 
return 0;
 }
 
-static int virtnet_get_settings(struct net_device *dev, struct ethtool_cmd *cmd)
+static int virtnet_get_link_ksettings(struct net_device *dev,
+ struct ethtool_link_ksettings *cmd)
 {
struct virtnet_info *vi = netdev_priv(dev);
 
-   ethtool_cmd_speed_set(cmd, vi->speed);
-   cmd->duplex = vi->duplex;
-   cmd->port = PORT_OTHER;
+   cmd->base.speed = vi->speed;
+   cmd->base.duplex = vi->duplex;
+   cmd->base.port = PORT_OTHER;
 
return 0;
 }
@@ -1696,8 +1706,8 @@ static void virtnet_init_settings(struct net_device *dev)
.set_channels = virtnet_set_channels,
.get_channels = virtnet_get_channels,
.get_ts_info = ethtool_op_get_ts_info,
-   .get_settings = virtnet_get_settings,
-   .set_settings = virtnet_set_settings,
+   .get_link_ksettings = virtnet_get_link_ksettings,
+   .set_link_ksettings = virtnet_set_link_ksettings,
 };
 
 static void virtnet_freeze_down(struct virtio_device *vdev)
-- 
1.7.4.4



[PATCH] net: virtio_net: use new api ethtool_{get|set}_link_ksettings

2017-04-24 Thread Philippe Reynes
The ethtool api {get|set}_settings is deprecated.
We move this driver to the new api {get|set}_link_ksettings.

As I don't have the hardware, I'd be very pleased if
someone could test this patch.

Signed-off-by: Philippe Reynes 
---
 drivers/net/virtio_net.c |   50 +++--
 1 files changed, 30 insertions(+), 20 deletions(-)

diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
index ea9890d..b0d241d 100644
--- a/drivers/net/virtio_net.c
+++ b/drivers/net/virtio_net.c
@@ -1636,47 +1636,57 @@ static void virtnet_get_channels(struct net_device *dev,
 }
 
 /* Check if the user is trying to change anything besides speed/duplex */
-static bool virtnet_validate_ethtool_cmd(const struct ethtool_cmd *cmd)
+static bool
+virtnet_validate_ethtool_cmd(const struct ethtool_link_ksettings *cmd)
 {
-   struct ethtool_cmd diff1 = *cmd;
-   struct ethtool_cmd diff2 = {};
+   struct ethtool_link_ksettings diff1 = *cmd;
+   struct ethtool_link_ksettings diff2 = {};
 
/* cmd is always set so we need to clear it, validate the port type
 * and also without autonegotiation we can ignore advertising
 */
-   ethtool_cmd_speed_set(&diff1, 0);
-   diff2.port = PORT_OTHER;
-   diff1.advertising = 0;
-   diff1.duplex = 0;
-   diff1.cmd = 0;
+   diff1.base.speed = 0;
+   diff2.base.port = PORT_OTHER;
+   ethtool_link_ksettings_zero_link_mode(&diff1, advertising);
+   diff1.base.duplex = 0;
+   diff1.base.cmd = 0;
+   diff1.base.link_mode_masks_nwords = 0;
 
-   return !memcmp(&diff1, &diff2, sizeof(diff1));
+   return !memcmp(&diff1.base, &diff2.base, sizeof(diff1.base)) &&
+   bitmap_empty(diff1.link_modes.supported,
+__ETHTOOL_LINK_MODE_MASK_NBITS) &&
+   bitmap_empty(diff1.link_modes.advertising,
+__ETHTOOL_LINK_MODE_MASK_NBITS) &&
+   bitmap_empty(diff1.link_modes.lp_advertising,
+__ETHTOOL_LINK_MODE_MASK_NBITS);
 }
 
-static int virtnet_set_settings(struct net_device *dev, struct ethtool_cmd *cmd)
+static int virtnet_set_link_ksettings(struct net_device *dev,
+ const struct ethtool_link_ksettings *cmd)
 {
struct virtnet_info *vi = netdev_priv(dev);
u32 speed;
 
-   speed = ethtool_cmd_speed(cmd);
+   speed = cmd->base.speed;
/* don't allow custom speed and duplex */
if (!ethtool_validate_speed(speed) ||
-   !ethtool_validate_duplex(cmd->duplex) ||
+   !ethtool_validate_duplex(cmd->base.duplex) ||
!virtnet_validate_ethtool_cmd(cmd))
return -EINVAL;
vi->speed = speed;
-   vi->duplex = cmd->duplex;
+   vi->duplex = cmd->base.duplex;
 
return 0;
 }
 
-static int virtnet_get_settings(struct net_device *dev, struct ethtool_cmd *cmd)
+static int virtnet_get_link_ksettings(struct net_device *dev,
+ struct ethtool_link_ksettings *cmd)
 {
struct virtnet_info *vi = netdev_priv(dev);
 
-   ethtool_cmd_speed_set(cmd, vi->speed);
-   cmd->duplex = vi->duplex;
-   cmd->port = PORT_OTHER;
+   cmd->base.speed = vi->speed;
+   cmd->base.duplex = vi->duplex;
+   cmd->base.port = PORT_OTHER;
 
return 0;
 }
@@ -1696,8 +1706,8 @@ static void virtnet_init_settings(struct net_device *dev)
.set_channels = virtnet_set_channels,
.get_channels = virtnet_get_channels,
.get_ts_info = ethtool_op_get_ts_info,
-   .get_settings = virtnet_get_settings,
-   .set_settings = virtnet_set_settings,
+   .get_link_ksettings = virtnet_get_link_ksettings,
+   .set_link_ksettings = virtnet_set_link_ksettings,
 };
 
 static void virtnet_freeze_down(struct virtio_device *vdev)
-- 
1.7.4.4



Re: [PATCH net-next v2 2/5] virtio-net: transmit napi

2017-04-24 Thread Willem de Bruijn
>>   static int xmit_skb(struct send_queue *sq, struct sk_buff *skb)
>>   {
>> struct virtio_net_hdr_mrg_rxbuf *hdr;
>> @@ -1130,9 +1172,11 @@ static netdev_tx_t start_xmit(struct sk_buff *skb,
>> struct net_device *dev)
>> int err;
>> struct netdev_queue *txq = netdev_get_tx_queue(dev, qnum);
>> bool kick = !skb->xmit_more;
>> +   bool use_napi = sq->napi.weight;
>> /* Free up any pending old buffers before queueing new ones. */
>> -   free_old_xmit_skbs(sq);
>> +   if (!use_napi)
>> +   free_old_xmit_skbs(sq);
>
>
> I'm not sure this is best or even correct. Considering we clean xmit packets
> speculatively in virtnet_poll_tx(), we need to call free_old_xmit_skbs()
> unconditionally. This can also help to reduce the possibility of napi
> rescheduling in virtnet_poll_tx().

Because of the use of trylock there. Absolutely, thanks! Perhaps I should
only use trylock in the opportunistic clean path from the rx softirq and
full locking in the tx softirq.

I previously observed that cleaning here would, counterintuitively,
reduce efficiency. It reverted the improvements of cleaning transmit
completions from the receive softirq. Going through my data, I did
not observe this regression anymore on the latest patchset.

Let me test again, with and without the new
virtqueue_enable_cb_delayed patch. Perhaps that made a
difference.


Re: [PATCH net-next v2 5/5] virtio-net: keep tx interrupts disabled unless kick

2017-04-24 Thread Willem de Bruijn
>> -   if (!use_napi)
>> +   if (use_napi) {
>> +   if (kick)
>> +   virtqueue_enable_cb_delayed(sq->vq);
>> +   else
>> +   virtqueue_disable_cb(sq->vq);
>
>
> Since virtqueue_disable_cb() does nothing for event idx, I wonder whether
> just calling virtqueue_enable_cb_delayed() is ok here.

Good point.

> Btw, it does not disable the interrupt at all. I proposed a patch in the
> past which can do more than this:
>
> https://patchwork.kernel.org/patch/6472601/

Interesting. Yes, let me evaluate that variant.

Thanks for reviewing,

  Willem


[PATCH net-next v2 5/5] virtio-net: keep tx interrupts disabled unless kick

2017-04-24 Thread Willem de Bruijn
From: Willem de Bruijn 

Tx napi mode increases the rate of transmit interrupts. Suppress some
by masking interrupts while more packets are expected. The interrupts
will be reenabled before the last packet is sent.

This optimization reduces the throughput drop with tx napi for
unidirectional flows such as UDP_STREAM that do not benefit from
cleaning tx completions in the receive napi handler.

Signed-off-by: Willem de Bruijn 
---
 drivers/net/virtio_net.c | 8 +++-
 1 file changed, 7 insertions(+), 1 deletion(-)

diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
index b14c82ce0032..b107ae011632 100644
--- a/drivers/net/virtio_net.c
+++ b/drivers/net/virtio_net.c
@@ -1196,8 +1196,14 @@ static netdev_tx_t start_xmit(struct sk_buff *skb, struct net_device *dev)
bool use_napi = sq->napi.weight;
 
/* Free up any pending old buffers before queueing new ones. */
-   if (!use_napi)
+   if (use_napi) {
+   if (kick)
+   virtqueue_enable_cb_delayed(sq->vq);
+   else
+   virtqueue_disable_cb(sq->vq);
+   } else {
free_old_xmit_skbs(sq);
+   }
 
/* timestamp packet in software */
skb_tx_timestamp(skb);
-- 
2.12.2.816.g281164-goog



[PATCH net-next v2 4/5] virtio-net: clean tx descriptors from rx napi

2017-04-24 Thread Willem de Bruijn
From: Willem de Bruijn 

Amortize the cost of virtual interrupts by doing both rx and tx work
on reception of a receive interrupt if tx napi is enabled. With
VIRTIO_F_EVENT_IDX, this suppresses most explicit tx completion
interrupts for bidirectional workloads.

Signed-off-by: Willem de Bruijn 
---
 drivers/net/virtio_net.c | 21 +
 1 file changed, 21 insertions(+)

diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
index 81a98de864e6..b14c82ce0032 100644
--- a/drivers/net/virtio_net.c
+++ b/drivers/net/virtio_net.c
@@ -1072,12 +1072,33 @@ static void free_old_xmit_skbs(struct send_queue *sq)
u64_stats_update_end(&sq->tx_syncp);
 }
 
+static void virtnet_poll_cleantx(struct receive_queue *rq)
+{
+   struct virtnet_info *vi = rq->vq->vdev->priv;
+   unsigned int index = vq2rxq(rq->vq);
+   struct send_queue *sq = &vi->sq[index];
+   struct netdev_queue *txq = netdev_get_tx_queue(vi->dev, index);
+
+   if (!sq->napi.weight)
+   return;
+
+   if (__netif_tx_trylock(txq)) {
+   free_old_xmit_skbs(sq);
+   __netif_tx_unlock(txq);
+   }
+
+   if (sq->vq->num_free >= 2 + MAX_SKB_FRAGS)
+   netif_tx_wake_queue(txq);
+}
+
 static int virtnet_poll(struct napi_struct *napi, int budget)
 {
struct receive_queue *rq =
container_of(napi, struct receive_queue, napi);
unsigned int received;
 
+   virtnet_poll_cleantx(rq);
+
received = virtnet_receive(rq, budget);
 
/* Out of packets? */
-- 
2.12.2.816.g281164-goog



Re: [PATCH net-next v2 2/5] virtio-net: transmit napi

2017-04-24 Thread Willem de Bruijn
On Thu, Apr 20, 2017 at 2:27 AM, Jason Wang  wrote:
>
>
> On 2017-04-19 04:21, Willem de Bruijn wrote:
>>
>> +static void virtnet_napi_tx_enable(struct virtnet_info *vi,
>> +  struct virtqueue *vq,
>> +  struct napi_struct *napi)
>> +{
>> +   if (!napi->weight)
>> +   return;
>> +
>> +   if (!vi->affinity_hint_set) {
>> +   napi->weight = 0;
>> +   return;
>> +   }
>> +
>> +   return virtnet_napi_enable(vq, napi);
>> +}
>> +
>>   static void refill_work(struct work_struct *work)
>
>
> Maybe I was wrong, but according to Michael's comment it looks like he wants
> to check affinity_hint_set just for speculative tx polling on rx napi instead
> of disabling it altogether.
>
> And I'm not convinced this is really needed: the driver only provides an
> affinity hint instead of affinity, so it's not guaranteed that tx and rx
> interrupts are on the same vcpus.

You're right. I made the restriction broader than the request, to really err
on the side of caution for the initial merge of napi tx. And enabling
the optimization is always a win over keeping it off, even without irq
affinity.

The cycle cost is significant without affinity regardless of whether the
optimization is used. Though this is not limited to napi-tx, it is more
pronounced in that mode than without napi.

1x TCP_RR for affinity configuration {process, rx_irq, tx_irq}:

upstream:

1,1,1: 28985 Mbps, 278 Gcyc
1,0,2: 30067 Mbps, 402 Gcyc

napi tx:

1,1,1: 34492 Mbps, 269 Gcyc
1,0,2: 36527 Mbps, 537 Gcyc (!)
1,0,1: 36269 Mbps, 394 Gcyc
1,0,0: 34674 Mbps, 402 Gcyc

This is a particularly strong example. It is also representative
of most RR tests. It is less pronounced in other streaming tests.
10x TCP_RR, for instance:

upstream:

1,1,1: 42267 Mbps, 301 Gcyc
1,0,2: 40663 Mbps, 445 Gcyc

napi tx:

1,1,1: 42420 Mbps, 303 Gcyc
1,0,2:  42267 Mbps, 431 Gcyc

These numbers were obtained with the virtqueue_enable_cb_delayed
optimization after xmit_skb, btw. It turns out that moving that before
increases 1x TCP_RR further to ~39 Gbps, at the cost of reducing
100x TCP_RR a bit.

[PATCH net-next v2 3/5] virtio-net: move free_old_xmit_skbs

2017-04-24 Thread Willem de Bruijn
From: Willem de Bruijn 

An upcoming patch will call free_old_xmit_skbs indirectly from
virtnet_poll. Move the function above this to avoid having to
introduce a forward declaration.

This is a pure move: no code changes.

Signed-off-by: Willem de Bruijn 
---
 drivers/net/virtio_net.c | 60 
 1 file changed, 30 insertions(+), 30 deletions(-)

diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
index c173e85dc7b8..81a98de864e6 100644
--- a/drivers/net/virtio_net.c
+++ b/drivers/net/virtio_net.c
@@ -1042,6 +1042,36 @@ static int virtnet_receive(struct receive_queue *rq, int budget)
return received;
 }
 
+static void free_old_xmit_skbs(struct send_queue *sq)
+{
+   struct sk_buff *skb;
+   unsigned int len;
+   struct virtnet_info *vi = sq->vq->vdev->priv;
+   struct virtnet_stats *stats = this_cpu_ptr(vi->stats);
+   unsigned int packets = 0;
+   unsigned int bytes = 0;
+
+   while ((skb = virtqueue_get_buf(sq->vq, &len)) != NULL) {
+   pr_debug("Sent skb %p\n", skb);
+
+   bytes += skb->len;
+   packets++;
+
+   dev_kfree_skb_any(skb);
+   }
+
+   /* Avoid overhead when no packets have been processed
+* happens when called speculatively from start_xmit.
+*/
+   if (!packets)
+   return;
+
+   u64_stats_update_begin(&sq->tx_syncp);
+   stats->tx_bytes += bytes;
+   stats->tx_packets += packets;
+   u64_stats_update_end(&sq->tx_syncp);
+}
+
 static int virtnet_poll(struct napi_struct *napi, int budget)
 {
struct receive_queue *rq =
@@ -1074,36 +1104,6 @@ static int virtnet_open(struct net_device *dev)
return 0;
 }
 
-static void free_old_xmit_skbs(struct send_queue *sq)
-{
-   struct sk_buff *skb;
-   unsigned int len;
-   struct virtnet_info *vi = sq->vq->vdev->priv;
-   struct virtnet_stats *stats = this_cpu_ptr(vi->stats);
-   unsigned int packets = 0;
-   unsigned int bytes = 0;
-
-   while ((skb = virtqueue_get_buf(sq->vq, &len)) != NULL) {
-   pr_debug("Sent skb %p\n", skb);
-
-   bytes += skb->len;
-   packets++;
-
-   dev_kfree_skb_any(skb);
-   }
-
-   /* Avoid overhead when no packets have been processed
-* happens when called speculatively from start_xmit.
-*/
-   if (!packets)
-   return;
-
-   u64_stats_update_begin(&sq->tx_syncp);
-   stats->tx_bytes += bytes;
-   stats->tx_packets += packets;
-   u64_stats_update_end(&sq->tx_syncp);
-}
-
 static int virtnet_poll_tx(struct napi_struct *napi, int budget)
 {
struct send_queue *sq = container_of(napi, struct send_queue, napi);
-- 
2.12.2.816.g281164-goog



[PATCH net-next v2 1/5] virtio-net: napi helper functions

2017-04-24 Thread Willem de Bruijn
From: Willem de Bruijn 

Prepare virtio-net for tx napi by converting existing napi code to
use helper functions. This also deduplicates some logic.

Signed-off-by: Willem de Bruijn 
Signed-off-by: Jason Wang 
---
 drivers/net/virtio_net.c | 65 ++--
 1 file changed, 35 insertions(+), 30 deletions(-)

diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
index 666ada6130ab..b9c1df29892c 100644
--- a/drivers/net/virtio_net.c
+++ b/drivers/net/virtio_net.c
@@ -239,6 +239,26 @@ static struct page *get_a_page(struct receive_queue *rq, gfp_t gfp_mask)
return p;
 }
 
+static void virtqueue_napi_schedule(struct napi_struct *napi,
+   struct virtqueue *vq)
+{
+   if (napi_schedule_prep(napi)) {
+   virtqueue_disable_cb(vq);
+   __napi_schedule(napi);
+   }
+}
+
+static void virtqueue_napi_complete(struct napi_struct *napi,
+   struct virtqueue *vq, int processed)
+{
+   int opaque;
+
+   opaque = virtqueue_enable_cb_prepare(vq);
+   if (napi_complete_done(napi, processed) &&
+   unlikely(virtqueue_poll(vq, opaque)))
+   virtqueue_napi_schedule(napi, vq);
+}
+
 static void skb_xmit_done(struct virtqueue *vq)
 {
struct virtnet_info *vi = vq->vdev->priv;
@@ -936,27 +956,20 @@ static void skb_recv_done(struct virtqueue *rvq)
struct virtnet_info *vi = rvq->vdev->priv;
struct receive_queue *rq = &vi->rq[vq2rxq(rvq)];
 
-   /* Schedule NAPI, Suppress further interrupts if successful. */
-   if (napi_schedule_prep(&rq->napi)) {
-   virtqueue_disable_cb(rvq);
-   __napi_schedule(&rq->napi);
-   }
+   virtqueue_napi_schedule(&rq->napi, rvq);
 }
 
-static void virtnet_napi_enable(struct receive_queue *rq)
+static void virtnet_napi_enable(struct virtqueue *vq, struct napi_struct *napi)
 {
-   napi_enable(&rq->napi);
+   napi_enable(napi);
 
/* If all buffers were filled by other side before we napi_enabled, we
-* won't get another interrupt, so process any outstanding packets
-* now.  virtnet_poll wants re-enable the queue, so we disable here.
-* We synchronize against interrupts via NAPI_STATE_SCHED */
-   if (napi_schedule_prep(&rq->napi)) {
-   virtqueue_disable_cb(rq->vq);
-   local_bh_disable();
-   __napi_schedule(&rq->napi);
-   local_bh_enable();
-   }
+* won't get another interrupt, so process any outstanding packets now.
+* Call local_bh_enable after to trigger softIRQ processing.
+*/
+   local_bh_disable();
+   virtqueue_napi_schedule(napi, vq);
+   local_bh_enable();
 }
 
 static void refill_work(struct work_struct *work)
@@ -971,7 +984,7 @@ static void refill_work(struct work_struct *work)
 
napi_disable(&rq->napi);
still_empty = !try_fill_recv(vi, rq, GFP_KERNEL);
-   virtnet_napi_enable(rq);
+   virtnet_napi_enable(rq->vq, &rq->napi);
 
/* In theory, this can happen: if we don't get any buffers in
 * we will *never* try to fill again.
@@ -1011,21 +1024,13 @@ static int virtnet_poll(struct napi_struct *napi, int budget)
 {
struct receive_queue *rq =
container_of(napi, struct receive_queue, napi);
-   unsigned int r, received;
+   unsigned int received;
 
received = virtnet_receive(rq, budget);
 
/* Out of packets? */
-   if (received < budget) {
-   r = virtqueue_enable_cb_prepare(rq->vq);
-   if (napi_complete_done(napi, received)) {
-   if (unlikely(virtqueue_poll(rq->vq, r)) &&
-   napi_schedule_prep(napi)) {
-   virtqueue_disable_cb(rq->vq);
-   __napi_schedule(napi);
-   }
-   }
-   }
+   if (received < budget)
+   virtqueue_napi_complete(napi, rq->vq, received);
 
return received;
 }
@@ -1040,7 +1045,7 @@ static int virtnet_open(struct net_device *dev)
/* Make sure we have some buffers: if oom use wq. */
if (!try_fill_recv(vi, &vi->rq[i], GFP_KERNEL))
schedule_delayed_work(&vi->refill, 0);
-   virtnet_napi_enable(&vi->rq[i]);
+   virtnet_napi_enable(vi->rq[i].vq, &vi->rq[i].napi);
}
 
return 0;
@@ -1747,7 +1752,7 @@ static int virtnet_restore_up(struct virtio_device *vdev)
schedule_delayed_work(&vi->refill, 0);
 
for (i = 0; i < vi->max_queue_pairs; i++)
-   virtnet_napi_enable(&vi->rq[i]);
+   virtnet_napi_enable(vi->rq[i].vq, &vi->rq[i].napi);
}
 
netif_device_attach(vi->dev);
-- 

[PATCH net-next v2 0/5] virtio-net tx napi

2017-04-24 Thread Willem de Bruijn
From: Willem de Bruijn 

Add napi for virtio-net transmit completion processing.

Changes:
  v1 -> v2:
- disable by default
- disable unless affinity_hint_set
  because cache misses add up to a third higher cycle cost,
  e.g., in TCP_RR tests. This is not limited to the patch
  that enables tx completion cleaning in rx napi.
- use trylock to avoid contention between tx and rx napi
- keep interrupts masked during xmit_more (new patch 5/5)
  this improves cycles especially for multi UDP_STREAM, which
  does not benefit from cleaning tx completions on rx napi.
- move free_old_xmit_skbs (new patch 3/5)
  to avoid forward declaration

not changed:
- deduplicate virtnet_poll_tx and virtnet_poll_txclean
  they look similar, but differ too much to make it
  worthwhile.
- delay netif_wake_subqueue for more than 2 + MAX_SKB_FRAGS
  evaluated, but made no difference
- patch 1/5

  RFC -> v1:
- dropped vhost interrupt moderation patch:
  not needed and likely expensive at light load
- remove tx napi weight
- always clean all tx completions
- use boolean to toggle tx-napi, instead
- only clean tx in rx if tx-napi is enabled
- then clean tx before rx
- fix: add missing braces in virtnet_freeze_down
- testing: add 4KB TCP_RR + UDP test results

Based on previous patchsets by Jason Wang:

  [RFC V7 PATCH 0/7] enable tx interrupts for virtio-net
  http://lkml.iu.edu/hypermail/linux/kernel/1505.3/00245.html


Before commit b0c39dbdc204 ("virtio_net: don't free buffers in xmit
ring") the virtio-net driver would free transmitted packets on
transmission of new packets in ndo_start_xmit and, to catch the edge
case when no new packet is sent, also in a timer at 10HZ.

A timer can cause long stalls. VIRTIO_F_NOTIFY_ON_EMPTY avoids stalls
due to low free descriptor count. It does not address stalls due to a
low socket SO_SNDBUF. Increasing timer frequency decreases that stall
time, but increases interrupt rate and, thus, cycle count.

Currently, with no timer, packets are freed only at ndo_start_xmit.
Latency of consume_skb is now unbounded. To avoid a deadlock if a socket
reaches its SO_SNDBUF limit, packets are orphaned on tx. This breaks TCP
small queues.
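
For context, the orphan in question is this sequence in the driver's
start_xmit (a sketch of the pre-series code, shown here for illustration;
the comment is paraphrased):

	/* Don't wait up for transmitted skbs to be freed: dropping the
	 * socket reference here uncharges sk_wmem_alloc at queue time
	 * rather than at completion, which is what defeats TSQ and
	 * socket backpressure.
	 */
	skb_orphan(skb);
	nf_reset(skb);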

Reenable TCP small queues by removing the orphan. Instead of using a
timer, convert the driver to regular tx napi. This does not have the
unresolved stall issue and does not have any frequency to tune.
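
(The toggle is the napi_tx module parameter added in patch 2/5;
module_param(napi_tx, bool, 0644) lets it be chosen at load time, e.g.
modprobe virtio_net napi_tx=1, and the 0644 mode also exposes it under
/sys/module/virtio_net/parameters/.)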

By keeping interrupts enabled by default, napi increases tx
interrupt rate. VIRTIO_F_EVENT_IDX avoids sending an interrupt if
one is already unacknowledged, which makes this more feasible today.
Combine that with an optimization that brings interrupt rate
back in line with the existing version for most workloads:

Tx completion cleaning on rx interrupts elides most explicit tx
interrupts by relying on the fact that many rx interrupts fire.

Tested by running {1, 10, 100} {TCP, UDP} STREAM, RR, 4K_RR benchmarks
from a guest to a server on the host, on an x86_64 Haswell. The guest
runs 4 vCPUs pinned to 4 cores. vhost and the test server are
pinned to a core each.

All results are the median of 5 runs, with variance well < 10%.
Used neper (github.com/google/neper) as test process.

Napi increases single stream throughput, but increases cycle cost.
The optimizations bring this down. The previous patchset saw a
regression with UDP_STREAM, which does not benefit from cleaning tx
interrupts in rx napi. This regression is now gone for 10x, 100x.
Remaining difference is higher 1x TCP_STREAM, lower 1x UDP_STREAM.


The latest results are with process, rx napi and tx napi affine to
the same core. All numbers are lower than the previous patchset.


              upstream      napi
TCP_STREAM:
1x:
  Mbps           28985     34492
  Gcycles          278       269

10x:
  Mbps           42267     42420
  Gcycles          301       303

100x:
  Mbps           29710     31131
  Gcycles          286       282

TCP_RR Latency (us):
1x:
  p50               22        21
  p99               27        26
  Gcycles          176       171

10x:
  p50               40        40
  p99               51        51
  Gcycles          215       211

100x:
  p50              272       235
  p99              414       360
  Gcycles          228       233

TCP_RR 4K:
1x:
  p50               27        29
  p99               34        36
  Gcycles          184       170

10x:
  p50               67        66
  p99               82        82
  Gcycles          217       218

100x:
  p50              435       485
  p99              813       626
  Gcycles          242       226

UDP_STREAM:
1x:
  Mbps           29802     25600
  Gcycles          305       295

10x:
  Mbps           29901     29684
  Gcycles          287       295

100x:
  Mbps           29952     29863
  Gcycles          336       325

UDP_RR:
1x:
  p50               18        18
  p99               23        22
  Gcycles          183       181


[PATCH net-next v2 2/5] virtio-net: transmit napi

2017-04-24 Thread Willem de Bruijn
From: Willem de Bruijn 

Convert virtio-net to a standard napi tx completion path. This enables
better TCP pacing using TCP small queues and increases single stream
throughput.

The virtio-net driver currently cleans tx descriptors on transmission
of new packets in ndo_start_xmit. Latency depends on new traffic, so
is unbounded. To avoid deadlock when a socket reaches its snd limit,
packets are orphaned on transmission. This breaks socket backpressure,
including TSQ.

Napi increases the number of interrupts generated compared to the
current model, which keeps interrupts disabled as long as the ring
has enough free descriptors. Keep tx napi optional and disabled for
now. Follow-on patches will reduce the interrupt cost.

Signed-off-by: Willem de Bruijn 
Signed-off-by: Jason Wang 
---
 drivers/net/virtio_net.c | 77 +---
 1 file changed, 67 insertions(+), 10 deletions(-)

diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
index b9c1df29892c..c173e85dc7b8 100644
--- a/drivers/net/virtio_net.c
+++ b/drivers/net/virtio_net.c
@@ -33,9 +33,10 @@
 static int napi_weight = NAPI_POLL_WEIGHT;
 module_param(napi_weight, int, 0444);
 
-static bool csum = true, gso = true;
+static bool csum = true, gso = true, napi_tx;
 module_param(csum, bool, 0444);
 module_param(gso, bool, 0444);
+module_param(napi_tx, bool, 0644);
 
 /* FIXME: MTU in config. */
 #define GOOD_PACKET_LEN (ETH_HLEN + VLAN_HLEN + ETH_DATA_LEN)
@@ -86,6 +87,8 @@ struct send_queue {
 
/* Name of the send queue: output.$index */
char name[40];
+
+   struct napi_struct napi;
 };
 
 /* Internal representation of a receive virtqueue */
@@ -262,12 +265,16 @@ static void virtqueue_napi_complete(struct napi_struct *napi,
 static void skb_xmit_done(struct virtqueue *vq)
 {
struct virtnet_info *vi = vq->vdev->priv;
struct napi_struct *napi = &vi->sq[vq2txq(vq)].napi;
 
/* Suppress further interrupts. */
virtqueue_disable_cb(vq);
 
-   /* We were probably waiting for more output buffers. */
-   netif_wake_subqueue(vi->dev, vq2txq(vq));
+   if (napi->weight)
+   virtqueue_napi_schedule(napi, vq);
+   else
+   /* We were probably waiting for more output buffers. */
+   netif_wake_subqueue(vi->dev, vq2txq(vq));
 }
 
 static unsigned int mergeable_ctx_to_buf_truesize(unsigned long mrg_ctx)
@@ -972,6 +979,21 @@ static void virtnet_napi_enable(struct virtqueue *vq, struct napi_struct *napi)
local_bh_enable();
 }
 
+static void virtnet_napi_tx_enable(struct virtnet_info *vi,
+  struct virtqueue *vq,
+  struct napi_struct *napi)
+{
+   if (!napi->weight)
+   return;
+
+   if (!vi->affinity_hint_set) {
+   napi->weight = 0;
+   return;
+   }
+
+   return virtnet_napi_enable(vq, napi);
+}
+
 static void refill_work(struct work_struct *work)
 {
struct virtnet_info *vi =
@@ -1046,6 +1068,7 @@ static int virtnet_open(struct net_device *dev)
if (!try_fill_recv(vi, &vi->rq[i], GFP_KERNEL))
schedule_delayed_work(&vi->refill, 0);
virtnet_napi_enable(vi->rq[i].vq, &vi->rq[i].napi);
+   virtnet_napi_tx_enable(vi, vi->sq[i].vq, &vi->sq[i].napi);
}
 
return 0;
@@ -1081,6 +1104,25 @@ static void free_old_xmit_skbs(struct send_queue *sq)
u64_stats_update_end(&stats->tx_syncp);
 }
 
+static int virtnet_poll_tx(struct napi_struct *napi, int budget)
+{
+   struct send_queue *sq = container_of(napi, struct send_queue, napi);
+   struct virtnet_info *vi = sq->vq->vdev->priv;
+   struct netdev_queue *txq = netdev_get_tx_queue(vi->dev, vq2txq(sq->vq));
+
+   if (__netif_tx_trylock(txq)) {
+   free_old_xmit_skbs(sq);
+   __netif_tx_unlock(txq);
+   }
+
+   virtqueue_napi_complete(napi, sq->vq, 0);
+
+   if (sq->vq->num_free >= 2 + MAX_SKB_FRAGS)
+   netif_tx_wake_queue(txq);
+
+   return 0;
+}
+
 static int xmit_skb(struct send_queue *sq, struct sk_buff *skb)
 {
struct virtio_net_hdr_mrg_rxbuf *hdr;
@@ -1130,9 +1172,11 @@ static netdev_tx_t start_xmit(struct sk_buff *skb, struct net_device *dev)
int err;
struct netdev_queue *txq = netdev_get_tx_queue(dev, qnum);
bool kick = !skb->xmit_more;
+   bool use_napi = sq->napi.weight;
 
/* Free up any pending old buffers before queueing new ones. */
-   free_old_xmit_skbs(sq);
+   if (!use_napi)
+   free_old_xmit_skbs(sq);
 
/* timestamp packet in software */
skb_tx_timestamp(skb);
@@ -1152,8 +1196,10 @@ static netdev_tx_t start_xmit(struct sk_buff *skb, struct net_device *dev)
}
 
/* Don't wait up for transmitted skbs to be freed. */
-   skb_orphan(skb);
-   nf_reset(skb);
+   if (!use_napi) {
+   skb_orphan(skb);
+   nf_reset(skb);
+   }

[PATCH] tools/virtio: fix build breakage

2017-04-24 Thread Sekhar Nori
Commit db932ced55cf ("virtio: add context flag to find vqs")
added a new 'context' flag to vring_new_virtqueue(), but the
corresponding API in tools/virtio/ is not updated causing
build errors due to conflicting declarations.

Bring code in tools/virtio in sync with that in kernel.

I have used 'false' for the value of the new boolean 'context'
flag as that seems to be the best way to preserve existing
behavior.

Tested with:

$ make -C tools/virtio clean all ARCH=x86

Fixes: db932ced55cf ("virtio: add context flag to find vqs")
Signed-off-by: Sekhar Nori 
---
I am not really a virtio user so I have not tested
these apart from the build test. Applies on linux-next
of April 11 2017

 tools/virtio/linux/virtio.h | 1 +
 tools/virtio/virtio_test.c  | 2 +-
 tools/virtio/vringh_test.c  | 7 ++++---
 3 files changed, 6 insertions(+), 4 deletions(-)

diff --git a/tools/virtio/linux/virtio.h b/tools/virtio/linux/virtio.h
index 9377c8b4ac16..d8f534025b7f 100644
--- a/tools/virtio/linux/virtio.h
+++ b/tools/virtio/linux/virtio.h
@@ -57,6 +57,7 @@ struct virtqueue *vring_new_virtqueue(unsigned int index,
  unsigned int vring_align,
  struct virtio_device *vdev,
  bool weak_barriers,
+ bool ctx,
  void *pages,
  bool (*notify)(struct virtqueue *vq),
  void (*callback)(struct virtqueue *vq),
diff --git a/tools/virtio/virtio_test.c b/tools/virtio/virtio_test.c
index e0445898f08f..76d6f583c249 100644
--- a/tools/virtio/virtio_test.c
+++ b/tools/virtio/virtio_test.c
@@ -100,7 +100,7 @@ static void vq_info_add(struct vdev_info *dev, int num)
vring_init(&info->vring, num, info->ring, 4096);
info->vq = vring_new_virtqueue(info->idx,
   info->vring.num, 4096, &dev->vdev,
-  true, info->ring,
+  true, false, info->ring,
   vq_notify, vq_callback, "test");
assert(info->vq);
info->vq->priv = info;
diff --git a/tools/virtio/vringh_test.c b/tools/virtio/vringh_test.c
index 5f94f5105678..9476c616d064 100644
--- a/tools/virtio/vringh_test.c
+++ b/tools/virtio/vringh_test.c
@@ -314,7 +314,8 @@ static int parallel_test(u64 features,
err(1, "Could not set affinity to cpu %u", first_cpu);
 
vq = vring_new_virtqueue(0, RINGSIZE, ALIGN, &gvdev.vdev, true,
-guest_map, fast_vringh ? no_notify_host
+false, guest_map,
+fast_vringh ? no_notify_host
 : parallel_notify_host,
 never_callback_guest, "guest vq");
 
@@ -479,7 +480,7 @@ int main(int argc, char *argv[])
memset(__user_addr_min, 0, vring_size(RINGSIZE, ALIGN));
 
/* Set up guest side. */
-   vq = vring_new_virtqueue(0, RINGSIZE, ALIGN, &vdev, true,
+   vq = vring_new_virtqueue(0, RINGSIZE, ALIGN, &vdev, true, false,
 __user_addr_min,
 never_notify_host, never_callback_guest,
 "guest vq");
@@ -663,7 +664,7 @@ int main(int argc, char *argv[])
/* Force creation of direct, which we modify. */
__virtio_clear_bit(&vdev, VIRTIO_RING_F_INDIRECT_DESC);
vq = vring_new_virtqueue(0, RINGSIZE, ALIGN, &vdev, true,
-__user_addr_min,
+false, __user_addr_min,
 never_notify_host,
 never_callback_guest,
 "guest vq");
-- 
2.9.0



Need information on type 2 IOMMU

2017-04-24 Thread valmiki

Hi All,

We have drivers/vfio/vfio_iommu_type1.c. What is a type1 IOMMU? Is the
term defined with respect to the vfio layer?

Is there a type2 IOMMU w.r.t. vfio? If so, what is it?

Regards,
Valmiki


Re: [PATCH net-next 3/3] virtio-net: clean tx descriptors from rx napi

2017-04-24 Thread Willem de Bruijn
On Fri, Apr 7, 2017 at 5:10 PM, Michael S. Tsirkin  wrote:
> On Fri, Apr 07, 2017 at 04:59:58PM -0400, Willem de Bruijn wrote:
>> On Fri, Apr 7, 2017 at 3:28 PM, Michael S. Tsirkin  wrote:
>> > On Mon, Apr 03, 2017 at 01:02:13AM -0400, Willem de Bruijn wrote:
>> >> On Sun, Apr 2, 2017 at 10:47 PM, Michael S. Tsirkin  
>> >> wrote:
>> >> > On Sun, Apr 02, 2017 at 04:10:12PM -0400, Willem de Bruijn wrote:
>> >> >> From: Willem de Bruijn 
>> >> >>
>> >> >> Amortize the cost of virtual interrupts by doing both rx and tx work
>> >> >> on reception of a receive interrupt if tx napi is enabled. With
>> >> >> VIRTIO_F_EVENT_IDX, this suppresses most explicit tx completion
>> >> >> interrupts for bidirectional workloads.
>> >> >>
>> >> >> Signed-off-by: Willem de Bruijn 
>> >
>> > This is a popular approach, but I think this will only work well if tx
>> > and rx interrupts are processed on the same CPU and if tx queue is per
>> > cpu.  If they target different CPUs or if tx queue is used from multiple
>> > CPUs they will conflict on the shared locks.
>>
>> Yes. As a result of this discussion I started running a few vcpu affinity 
>> tests.
>>
>> > This can even change dynamically as CPUs/queues are reconfigured.
>> > How about adding a flag and skipping the tx poll if there's no match?
>>
>> I suspect that even with the cache invalidations this optimization
>> will be an improvement over handling all tx interrupts in the tx napi
>> handler. I will get the datapoint for that.
>>
>> That said, we can make this conditional. What flag exactly do you
>> propose? Compare raw_smp_processor_id() in the rx softint with one
>> previously stored in the napi tx callback?
>
> I'm not sure. Another idea is to check vi->affinity_hint_set.
> If set we know rq and sq are on the same CPU.

I was not aware of that flag, thanks. Yes, that looks like it should work.


Re: [PATCH net-next 3/3] virtio-net: clean tx descriptors from rx napi

2017-04-24 Thread Willem de Bruijn
On Fri, Apr 7, 2017 at 3:28 PM, Michael S. Tsirkin  wrote:
> On Mon, Apr 03, 2017 at 01:02:13AM -0400, Willem de Bruijn wrote:
>> On Sun, Apr 2, 2017 at 10:47 PM, Michael S. Tsirkin  wrote:
>> > On Sun, Apr 02, 2017 at 04:10:12PM -0400, Willem de Bruijn wrote:
>> >> From: Willem de Bruijn 
>> >>
>> >> Amortize the cost of virtual interrupts by doing both rx and tx work
>> >> on reception of a receive interrupt if tx napi is enabled. With
>> >> VIRTIO_F_EVENT_IDX, this suppresses most explicit tx completion
>> >> interrupts for bidirectional workloads.
>> >>
>> >> Signed-off-by: Willem de Bruijn 
>
> This is a popular approach, but I think this will only work well if tx
> and rx interrupts are processed on the same CPU and if tx queue is per
> cpu.  If they target different CPUs or if tx queue is used from multiple
> CPUs they will conflict on the shared locks.

Yes. As a result of this discussion I started running a few vcpu affinity tests.

The data is not complete. In particular, I don't have the data yet to
compare having tx and rx irq on the same cpu (0,0) vs on different
(0,2) for this patchset. Which is the relevant data to your point.

Initial results for unmodified upstream driver at {1, 10, 100}x
TCP_STREAM, for irq cpu affinity (rx,tx). Process is always pinned to cpu
1. This is a 4 vcpu system pinned by the host to 4 cores on the same
socket. The previously reported results were obtained with txq, rxq
and process on different vcpus (0,2). Running all on the same vcpu
lowers the cycle count considerably:

irq 0,0
1    throughput_Mbps=29767.14  391,488,924,526 cycles
10   throughput_Mbps=40808.64  424,530,251,896 cycles
100  throughput_Mbps=33475.13  414,622,071,167 cycles

irq 0,2
1    throughput_Mbps=30176.05  395,673,200,747 cycles
10   throughput_Mbps=40729.26  433,948,374,991 cycles
100  throughput_Mbps=33758.68  436,291,949,393 cycles

irq 1,1
1    throughput_Mbps=26635.20  269,071,002,844 cycles
10   throughput_Mbps=42385.05  299,945,944,516 cycles
100  throughput_Mbps=33580.98  283,272,895,507 cycles

With this patch set applied, cpu (1,1)

1    throughput_Mbps=34980.76  276,504,805,414 cycles
10   throughput_Mbps=42519.92  298,105,889,785 cycles
100  throughput_Mbps=35268.86  296,670,598,712 cycles

I will need to get data for (0,2) vs (0,0).

> This can even change dynamically as CPUs/queues are reconfigured.
> How about adding a flag and skipping the tx poll if there's no match?

I suspect that even with the cache invalidations this optimization
will be an improvement over handling all tx interrupts in the tx napi
handler. I will get the datapoint for that.

That said, we can make this conditional. What flag exactly do you
propose? Compare raw_smp_processor_id() in the rx softint with one
previously stored in the napi tx callback?
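
For reference, a minimal sketch of that cpu-match idea. The last_tx_cpu
field and its placement are hypothetical, not from the posted patches;
the surrounding functions are the ones from patches 2/3 and 3/3:

struct send_queue {
    /* ... existing fields from virtio_net.c ... */
    int last_tx_cpu;    /* hypothetical: cpu that last ran tx napi */
};

static int virtnet_poll_tx(struct napi_struct *napi, int budget)
{
    struct send_queue *sq = container_of(napi, struct send_queue, napi);

    sq->last_tx_cpu = raw_smp_processor_id();
    /* ... clean tx descriptors as in patch 2/3 ... */
    return 0;
}

static void virtnet_poll_cleantx(struct receive_queue *rq)
{
    struct virtnet_info *vi = rq->vq->vdev->priv;
    struct send_queue *sq = &vi->sq[vq2rxq(rq->vq)];

    if (!sq->napi.weight)
        return;

    /* rx softirq on another cpu: leave the work to the tx handler,
     * so the tx lock and sq cachelines are not bounced between cores.
     */
    if (sq->last_tx_cpu != raw_smp_processor_id())
        return;

    /* ... lock txq and free_old_xmit_skbs() as in patch 3/3 ... */
}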


Re: [PATCH net-next 3/3] virtio-net: clean tx descriptors from rx napi

2017-04-24 Thread Willem de Bruijn
On Sun, Apr 2, 2017 at 10:47 PM, Michael S. Tsirkin  wrote:
> On Sun, Apr 02, 2017 at 04:10:12PM -0400, Willem de Bruijn wrote:
>> From: Willem de Bruijn 
>>
>> Amortize the cost of virtual interrupts by doing both rx and tx work
>> on reception of a receive interrupt if tx napi is enabled. With
>> VIRTIO_F_EVENT_IDX, this suppresses most explicit tx completion
>> interrupts for bidirectional workloads.
>>
>> Signed-off-by: Willem de Bruijn 
>> ---
>>  drivers/net/virtio_net.c | 22 ++
>>  1 file changed, 22 insertions(+)
>>
>> diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
>> index 95d938e82080..af830eb212bf 100644
>> --- a/drivers/net/virtio_net.c
>> +++ b/drivers/net/virtio_net.c
>> @@ -1030,12 +1030,34 @@ static int virtnet_receive(struct receive_queue *rq, int budget)
>>   return received;
>>  }
>>
>> +static void free_old_xmit_skbs(struct send_queue *sq);
>> +
>
> Could you pls re-arrange code to avoid forward declarations?

Okay. I'll do the move in a separate patch to simplify review.

>> +static void virtnet_poll_cleantx(struct receive_queue *rq)
>> +{
>> + struct virtnet_info *vi = rq->vq->vdev->priv;
>> + unsigned int index = vq2rxq(rq->vq);
>> + struct send_queue *sq = &vi->sq[index];
>> + struct netdev_queue *txq = netdev_get_tx_queue(vi->dev, index);
>> +
>> + if (!sq->napi.weight)
>> + return;
>> +
>> + __netif_tx_lock(txq, smp_processor_id());
>> + free_old_xmit_skbs(sq);
>> + __netif_tx_unlock(txq);
>> +
>> + if (sq->vq->num_free >= 2 + MAX_SKB_FRAGS)
>> + netif_wake_subqueue(vi->dev, vq2txq(sq->vq));
>> +}
>> +
>
> Looks very similar to virtnet_poll_tx.
>
> I think this might be waking the tx queue too early, so
> it will tend to stay almost full for long periods of time.
> Why not defer wakeup until queue is at least half empty?

I'll test that. Delaying wake-up longer than necessary can cause
queue build up at the qdisc and higher tail latency, I imagine. But
it may reduce the number of __netif_schedule calls.

> I wonder whether it's worth it to handle very short queues
> correctly - they previously made very slow progress,
> now they are never woken up.
>
> I'm a bit concerned about the cost of these wakeups
> and locking. I note that this wake is called basically
> every time queue is not full.
>
> Would it make sense to limit the amount of tx polling?
> Maybe use trylock to reduce the conflict with xmit?

Yes, that sounds good. I did test that previously and saw no
difference then. But when multiple cpus contend for a single
txq it should help.
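
For reference, a sketch of the deferred wake-up being discussed,
assuming a half-ring threshold; the helper name and the exact threshold
are illustrative, not from the patches:

/* Wake the queue only once at least half the ring is free, instead of
 * as soon as one more packet fits: fewer __netif_schedule calls, at
 * the risk of qdisc build-up and higher tail latency.
 */
static void virtnet_tx_wake_threshold(struct virtnet_info *vi,
                                      struct send_queue *sq)
{
    unsigned int half = virtqueue_get_vring_size(sq->vq) / 2;

    if (sq->vq->num_free >= max_t(unsigned int, half, 2 + MAX_SKB_FRAGS))
        netif_wake_subqueue(vi->dev, vq2txq(sq->vq));
}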


[PATCH net-next 3/3] virtio-net: clean tx descriptors from rx napi

2017-04-24 Thread Willem de Bruijn
From: Willem de Bruijn 

Amortize the cost of virtual interrupts by doing both rx and tx work
on reception of a receive interrupt if tx napi is enabled. With
VIRTIO_F_EVENT_IDX, this suppresses most explicit tx completion
interrupts for bidirectional workloads.

Signed-off-by: Willem de Bruijn 
---
 drivers/net/virtio_net.c | 22 ++
 1 file changed, 22 insertions(+)

diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
index 95d938e82080..af830eb212bf 100644
--- a/drivers/net/virtio_net.c
+++ b/drivers/net/virtio_net.c
@@ -1030,12 +1030,34 @@ static int virtnet_receive(struct receive_queue *rq, int budget)
return received;
 }
 
+static void free_old_xmit_skbs(struct send_queue *sq);
+
+static void virtnet_poll_cleantx(struct receive_queue *rq)
+{
+   struct virtnet_info *vi = rq->vq->vdev->priv;
+   unsigned int index = vq2rxq(rq->vq);
+   struct send_queue *sq = &vi->sq[index];
+   struct netdev_queue *txq = netdev_get_tx_queue(vi->dev, index);
+
+   if (!sq->napi.weight)
+   return;
+
+   __netif_tx_lock(txq, smp_processor_id());
+   free_old_xmit_skbs(sq);
+   __netif_tx_unlock(txq);
+
+   if (sq->vq->num_free >= 2 + MAX_SKB_FRAGS)
+   netif_wake_subqueue(vi->dev, vq2txq(sq->vq));
+}
+
 static int virtnet_poll(struct napi_struct *napi, int budget)
 {
struct receive_queue *rq =
container_of(napi, struct receive_queue, napi);
unsigned int received;
 
+   virtnet_poll_cleantx(rq);
+
received = virtnet_receive(rq, budget);
 
/* Out of packets? */
-- 
2.12.2.564.g063fe858b8-goog



Re: [PATCH net-next 2/3] virtio-net: transmit napi

2017-04-24 Thread Willem de Bruijn
On Sun, Apr 2, 2017 at 10:30 PM, Michael S. Tsirkin  wrote:
> On Sun, Apr 02, 2017 at 04:10:11PM -0400, Willem de Bruijn wrote:
>> From: Willem de Bruijn 
>>
>> Convert virtio-net to a standard napi tx completion path. This enables
>> better TCP pacing using TCP small queues and increases single stream
>> throughput.
>>
>> The virtio-net driver currently cleans tx descriptors on transmission
>> of new packets in ndo_start_xmit. Latency depends on new traffic, so
>> is unbounded. To avoid deadlock when a socket reaches its snd limit,
>> packets are orphaned on tranmission. This breaks socket backpressure,
>> including TSQ.
>>
>> Napi increases the number of interrupts generated compared to the
>> current model, which keeps interrupts disabled as long as the ring
>> has enough free descriptors. Keep tx napi optional for now. Follow-on
>> patches will reduce the interrupt cost.
>>
>> Signed-off-by: Willem de Bruijn 
>> Signed-off-by: Jason Wang 
>> ---
>>  drivers/net/virtio_net.c | 63 
>> 
>>  1 file changed, 53 insertions(+), 10 deletions(-)
>>
>> diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
>> index 6aac0ad0d9b2..95d938e82080 100644
>> --- a/drivers/net/virtio_net.c
>> +++ b/drivers/net/virtio_net.c
>> @@ -33,9 +33,10 @@
>>  static int napi_weight = NAPI_POLL_WEIGHT;
>>  module_param(napi_weight, int, 0444);
>>
>> -static bool csum = true, gso = true;
>> +static bool csum = true, gso = true, napi_tx = true;
>>  module_param(csum, bool, 0444);
>>  module_param(gso, bool, 0444);
>> +module_param(napi_tx, bool, 0644);
>>
>>  /* FIXME: MTU in config. */
>>  #define GOOD_PACKET_LEN (ETH_HLEN + VLAN_HLEN + ETH_DATA_LEN)
>
> Off by default seems safer until we can find better ways
> to reduce the overhead, esp for UDP.

Okay, I'll change that. I don't have an immediate idea for that
unidirectional case.


[PATCH net-next 0/3] virtio-net tx napi

2017-04-24 Thread Willem de Bruijn
From: Willem de Bruijn 

Add napi for virtio-net transmit completion processing.

Based on previous patchsets by Jason Wang:

  [RFC V7 PATCH 0/7] enable tx interrupts for virtio-net
  http://lkml.iu.edu/hypermail/linux/kernel/1505.3/00245.html

Changes:
  RFC -> v1:
- dropped vhost interrupt moderation patch:
  not needed and likely expensive at light load
- remove tx napi weight
  - always clean all tx completions
  - use boolean to toggle tx-napi, instead
- only clean tx in rx if tx-napi is enabled
  - then clean tx before rx
- fix: add missing braces in virtnet_freeze_down
- testing: add 4KB TCP_RR + UDP test results


Before commit b0c39dbdc204 ("virtio_net: don't free buffers in xmit
ring") the virtio-net driver would free transmitted packets on
transmission of new packets in ndo_start_xmit and, to catch the edge
case when no new packet is sent, also in a timer at 10HZ.

A timer can cause long stalls. VIRTIO_F_NOTIFY_ON_EMPTY avoids stalls
due to low free descriptor count. It does not address stalls due to
low socket SO_SNDBUF. Increasing timer frequency decreases that stall
time, but increases interrupt rate and, thus, cycle count.

Currently, with no timer, packets are freed only at ndo_start_xmit.
Latency of consume_skb is now unbounded. To avoid a deadlock if a sock
reaches SO_SNDBUF, packets are orphaned on tx. This breaks TCP small
queues.

Reenable TCP small queues by removing the orphan. Instead of using a
timer, convert the driver to regular tx napi. This does not have the
unresolved stall issue and does not have any frequency to tune.

By keeping interrupts enabled by default, napi increases tx
interrupt rate. VIRTIO_F_EVENT_IDX avoids sending an interrupt if
one is already unacknowledged, so makes this more feasible today.
Combine that with an optimization that brings interrupt rate
back in line with the existing version for most workloads:

Tx completion cleaning on rx interrupts elides most explicit tx
interrupts by relying on the fact that many rx interrupts fire.

Tested by running {1, 10, 100} {TCP, UDP} STREAM, RR, 4K_RR benchmarks
from a guest to a server on the host, on an x86_64 Haswell. The guest
runs 4 vCPUs pinned to 4 cores. vhost and the test server are
pinned to a core each.

All results are the median of 5 runs, with variance well < 10%.
Used neper (github.com/google/neper) as test process.

Napi increases single stream throughput, but increases cycle cost.
Processing completions on rx interrupts brings this down, especially
for bi-directional workloads. UDP_STREAM is unidirectional and
continues to see a ~10% lower throughput.

Numbers for the optimization patch alone are not shown; it made no
significant difference versus upstream.

              upstream      napi    +at-rx
TCP_STREAM:
1x:
  Mbps           30537     37666     37910
  Gcycles          400       540       405

10x:
  Mbps           41012     39954     40245
  Gcycles          434       546       421

100x:
  Mbps           34088     34172     34245
  Gcycles          435       546       418

TCP_RR Latency (us):
1x:
  p50               24        24        21
  p99               27        27        27
  Gcycles          299       432       308

10x:
  p50               31        31        41
  p99               40        46        52
  Gcycles          346       428       322

100x:
  p50              155       151       310
  p99              334       329       362
  Gcycles          336       421       308

TCP_RR 4K:
1x:
  p50               30        30        27
  p99               34        33        34
  Gcycles          307       437       305

10x:
  p50               63        67        65
  p99               76        77        87
  Gcycles          334       425       315

100x:
  p50              421       497       511
  p99              510       571       773
  Gcycles          350       430       321

UDP_STREAM:
1x:
  Mbps           29802     26360     26608
  Gcycles          305       363       362

10x:
  Mbps           29901     26801     27078
  Gcycles          287       363       360

100x:
  Mbps           29952     26822     27054
  Gcycles          336       351       354

UDP_RR:
1x:
  p50               24        21        19
  p99               27        24        23
  Gcycles          299       431       309

10x:
  p50               31        27        35
  p99               40        35        54
  Gcycles          346       421       325

100x:
  p50              155       153       240
  p99              334       323       462
  Gcycles          336       421       311

UDP_RR 4K:
1x:
  p50               24        25        23
  p99               27        28        30
  Gcycles          299       435       321

10x:
  p50               31        35        48
  p99               40        54        66
  Gcycles          346       451       308

100x:
  p50              155       210       307
  p99              334       451       519
  Gcycles          336       440       297

Note that GSO is enabled, so 4K RR still translates to one packet per request.

Re: [PATCH] virtio_console: fix uninitialized variable use

2017-04-24 Thread Mike Galbraith
On Wed, 2017-03-29 at 23:27 +0300, Michael S. Tsirkin wrote:

> Hi Mike
>   if you like, pls send me your Signed-off-by and I'll
>   change the patch to make you an author.

Nah, it's perfect as it is.  While I was pretty darn sure it was
generic, I intentionally posted it as diagnostic info.

>   More importantly, pls let me know whether this helps
>   resolve the issues about hybernation for you!

Well, it prevents kaboom with VIRTIO_CONSOLE_F_MULTIPORT turned off,
but no, it does nada wrt the warnings.

-Mike


[PATCH net-next 1/3] virtio-net: napi helper functions

2017-04-24 Thread Willem de Bruijn
From: Willem de Bruijn 

Prepare virtio-net for tx napi by converting existing napi code to
use helper functions. This also deduplicates some logic.

Signed-off-by: Willem de Bruijn 
Signed-off-by: Jason Wang 
---
 drivers/net/virtio_net.c | 65 ++--
 1 file changed, 35 insertions(+), 30 deletions(-)

diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
index b0d241d110ec..6aac0ad0d9b2 100644
--- a/drivers/net/virtio_net.c
+++ b/drivers/net/virtio_net.c
@@ -239,6 +239,26 @@ static struct page *get_a_page(struct receive_queue *rq, gfp_t gfp_mask)
return p;
 }
 
+static void virtqueue_napi_schedule(struct napi_struct *napi,
+   struct virtqueue *vq)
+{
+   if (napi_schedule_prep(napi)) {
+   virtqueue_disable_cb(vq);
+   __napi_schedule(napi);
+   }
+}
+
+static void virtqueue_napi_complete(struct napi_struct *napi,
+   struct virtqueue *vq, int processed)
+{
+   int opaque;
+
+   opaque = virtqueue_enable_cb_prepare(vq);
+   if (napi_complete_done(napi, processed) &&
+   unlikely(virtqueue_poll(vq, opaque)))
+   virtqueue_napi_schedule(napi, vq);
+}
+
 static void skb_xmit_done(struct virtqueue *vq)
 {
struct virtnet_info *vi = vq->vdev->priv;
@@ -936,27 +956,20 @@ static void skb_recv_done(struct virtqueue *rvq)
struct virtnet_info *vi = rvq->vdev->priv;
struct receive_queue *rq = &vi->rq[vq2rxq(rvq)];
 
-   /* Schedule NAPI, Suppress further interrupts if successful. */
-   if (napi_schedule_prep(&rq->napi)) {
-   virtqueue_disable_cb(rvq);
-   __napi_schedule(&rq->napi);
-   }
+   virtqueue_napi_schedule(&rq->napi, rvq);
 }
 
-static void virtnet_napi_enable(struct receive_queue *rq)
+static void virtnet_napi_enable(struct virtqueue *vq, struct napi_struct *napi)
 {
-   napi_enable(&rq->napi);
+   napi_enable(napi);
 
/* If all buffers were filled by other side before we napi_enabled, we
-* won't get another interrupt, so process any outstanding packets
-* now.  virtnet_poll wants re-enable the queue, so we disable here.
-* We synchronize against interrupts via NAPI_STATE_SCHED */
-   if (napi_schedule_prep(&rq->napi)) {
-   virtqueue_disable_cb(rq->vq);
-   local_bh_disable();
-   __napi_schedule(&rq->napi);
-   local_bh_enable();
-   }
+* won't get another interrupt, so process any outstanding packets now.
+* Call local_bh_enable after to trigger softIRQ processing.
+*/
+   local_bh_disable();
+   virtqueue_napi_schedule(napi, vq);
+   local_bh_enable();
 }
 
 static void refill_work(struct work_struct *work)
@@ -971,7 +984,7 @@ static void refill_work(struct work_struct *work)
 
napi_disable(&rq->napi);
still_empty = !try_fill_recv(vi, rq, GFP_KERNEL);
-   virtnet_napi_enable(rq);
+   virtnet_napi_enable(rq->vq, &rq->napi);
 
/* In theory, this can happen: if we don't get any buffers in
 * we will *never* try to fill again.
@@ -1011,21 +1024,13 @@ static int virtnet_poll(struct napi_struct *napi, int budget)
 {
struct receive_queue *rq =
container_of(napi, struct receive_queue, napi);
-   unsigned int r, received;
+   unsigned int received;
 
received = virtnet_receive(rq, budget);
 
/* Out of packets? */
-   if (received < budget) {
-   r = virtqueue_enable_cb_prepare(rq->vq);
-   if (napi_complete_done(napi, received)) {
-   if (unlikely(virtqueue_poll(rq->vq, r)) &&
-   napi_schedule_prep(napi)) {
-   virtqueue_disable_cb(rq->vq);
-   __napi_schedule(napi);
-   }
-   }
-   }
+   if (received < budget)
+   virtqueue_napi_complete(napi, rq->vq, received);
 
return received;
 }
@@ -1040,7 +1045,7 @@ static int virtnet_open(struct net_device *dev)
/* Make sure we have some buffers: if oom use wq. */
if (!try_fill_recv(vi, &vi->rq[i], GFP_KERNEL))
schedule_delayed_work(&vi->refill, 0);
-   virtnet_napi_enable(&vi->rq[i]);
+   virtnet_napi_enable(vi->rq[i].vq, &vi->rq[i].napi);
}
 
return 0;
@@ -1747,7 +1752,7 @@ static int virtnet_restore_up(struct virtio_device *vdev)
schedule_delayed_work(&vi->refill, 0);

for (i = 0; i < vi->max_queue_pairs; i++)
-   virtnet_napi_enable(&vi->rq[i]);
+   virtnet_napi_enable(vi->rq[i].vq, &vi->rq[i].napi);
}
 
netif_device_attach(vi->dev);
-- 

Re: [REGRESSION] 07ec51480b5e ("virtio_pci: use shared interrupts for virtqueues") causes crashes in guest

2017-04-24 Thread Thorsten Leemhuis
On 23.03.2017 17:41, Michael S. Tsirkin wrote:
> On Thu, Mar 23, 2017 at 03:19:07PM +, Richard W.M. Jones wrote:
>> On Thu, Mar 23, 2017 at 01:13:50PM +0800, Jason Wang wrote:
>>> From 312859b596e83a2164a8430343d31fce2a5ad808 Mon Sep 17 00:00:00 2001
>>> From: Jason Wang 
>>> Date: Thu, 23 Mar 2017 13:07:16 +0800
>>> Subject: [PATCH] virtio_pci: fix out of bound access for msix_names
>>> Signed-off-by: Jason Wang 
>> I tested this, and it does appear to fix the crashes in
>> vp_modern_find_vqs.  Therefore:
>> Tested-by: Richard W.M. Jones 
> I've queued the fix, thanks everyone!

Thx. Feel free to add

Tested-by: Thorsten Leemhuis 

as a kernel with that patch works well in my reboot tests so far.
 Ciao, Thorsten


[PATCH net-next 2/3] virtio-net: transmit napi

2017-04-24 Thread Willem de Bruijn
From: Willem de Bruijn 

Convert virtio-net to a standard napi tx completion path. This enables
better TCP pacing using TCP small queues and increases single stream
throughput.

The virtio-net driver currently cleans tx descriptors on transmission
of new packets in ndo_start_xmit. Latency depends on new traffic, so
is unbounded. To avoid deadlock when a socket reaches its snd limit,
packets are orphaned on tranmission. This breaks socket backpressure,
including TSQ.

Napi increases the number of interrupts generated compared to the
current model, which keeps interrupts disabled as long as the ring
has enough free descriptors. Keep tx napi optional for now. Follow-on
patches will reduce the interrupt cost.

Signed-off-by: Willem de Bruijn 
Signed-off-by: Jason Wang 
---
 drivers/net/virtio_net.c | 63 
 1 file changed, 53 insertions(+), 10 deletions(-)

diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
index 6aac0ad0d9b2..95d938e82080 100644
--- a/drivers/net/virtio_net.c
+++ b/drivers/net/virtio_net.c
@@ -33,9 +33,10 @@
 static int napi_weight = NAPI_POLL_WEIGHT;
 module_param(napi_weight, int, 0444);
 
-static bool csum = true, gso = true;
+static bool csum = true, gso = true, napi_tx = true;
 module_param(csum, bool, 0444);
 module_param(gso, bool, 0444);
+module_param(napi_tx, bool, 0644);
 
 /* FIXME: MTU in config. */
 #define GOOD_PACKET_LEN (ETH_HLEN + VLAN_HLEN + ETH_DATA_LEN)
@@ -86,6 +87,8 @@ struct send_queue {
 
/* Name of the send queue: output.$index */
char name[40];
+
+   struct napi_struct napi;
 };
 
 /* Internal representation of a receive virtqueue */
@@ -262,12 +265,16 @@ static void virtqueue_napi_complete(struct napi_struct *napi,
 static void skb_xmit_done(struct virtqueue *vq)
 {
struct virtnet_info *vi = vq->vdev->priv;
struct napi_struct *napi = &vi->sq[vq2txq(vq)].napi;
 
/* Suppress further interrupts. */
virtqueue_disable_cb(vq);
 
-   /* We were probably waiting for more output buffers. */
-   netif_wake_subqueue(vi->dev, vq2txq(vq));
+   if (napi->weight)
+   virtqueue_napi_schedule(napi, vq);
+   else
+   /* We were probably waiting for more output buffers. */
+   netif_wake_subqueue(vi->dev, vq2txq(vq));
 }
 
 static unsigned int mergeable_ctx_to_buf_truesize(unsigned long mrg_ctx)
@@ -961,6 +968,9 @@ static void skb_recv_done(struct virtqueue *rvq)
 
 static void virtnet_napi_enable(struct virtqueue *vq, struct napi_struct *napi)
 {
+   if (!napi->weight)
+   return;
+
napi_enable(napi);
 
/* If all buffers were filled by other side before we napi_enabled, we
@@ -1046,6 +1056,7 @@ static int virtnet_open(struct net_device *dev)
if (!try_fill_recv(vi, &vi->rq[i], GFP_KERNEL))
schedule_delayed_work(&vi->refill, 0);
virtnet_napi_enable(vi->rq[i].vq, &vi->rq[i].napi);
+   virtnet_napi_enable(vi->sq[i].vq, &vi->sq[i].napi);
}
 
return 0;
@@ -1081,6 +1092,24 @@ static void free_old_xmit_skbs(struct send_queue *sq)
u64_stats_update_end(&stats->tx_syncp);
 }
 
+static int virtnet_poll_tx(struct napi_struct *napi, int budget)
+{
+   struct send_queue *sq = container_of(napi, struct send_queue, napi);
+   struct virtnet_info *vi = sq->vq->vdev->priv;
+   struct netdev_queue *txq = netdev_get_tx_queue(vi->dev, vq2txq(sq->vq));
+
+   __netif_tx_lock(txq, smp_processor_id());
+   free_old_xmit_skbs(sq);
+   __netif_tx_unlock(txq);
+
+   virtqueue_napi_complete(napi, sq->vq, 0);
+
+   if (sq->vq->num_free >= 2 + MAX_SKB_FRAGS)
+   netif_wake_subqueue(vi->dev, vq2txq(sq->vq));
+
+   return 0;
+}
+
 static int xmit_skb(struct send_queue *sq, struct sk_buff *skb)
 {
struct virtio_net_hdr_mrg_rxbuf *hdr;
@@ -1130,9 +1159,11 @@ static netdev_tx_t start_xmit(struct sk_buff *skb, struct net_device *dev)
int err;
struct netdev_queue *txq = netdev_get_tx_queue(dev, qnum);
bool kick = !skb->xmit_more;
+   bool use_napi = sq->napi.weight;
 
/* Free up any pending old buffers before queueing new ones. */
-   free_old_xmit_skbs(sq);
+   if (!use_napi)
+   free_old_xmit_skbs(sq);
 
/* timestamp packet in software */
skb_tx_timestamp(skb);
@@ -1152,8 +1183,10 @@ static netdev_tx_t start_xmit(struct sk_buff *skb, struct net_device *dev)
}
 
/* Don't wait up for transmitted skbs to be freed. */
-   skb_orphan(skb);
-   nf_reset(skb);
+   if (!use_napi) {
+   skb_orphan(skb);
+   nf_reset(skb);
+   }
 
/* If running out of space, stop queue to avoid getting packets that we
 * are then unable to transmit.
@@ -1167,7 +1200,8 @@ static netdev_tx_t start_xmit(struct sk_buff *skb, struct net_device *dev)

Re: [PATCH net-next v2 2/5] virtio-net: transmit napi

2017-04-24 Thread Michael S. Tsirkin
On Mon, Apr 24, 2017 at 01:05:45PM -0400, Willem de Bruijn wrote:
> On Mon, Apr 24, 2017 at 12:40 PM, Michael S. Tsirkin  wrote:
> > On Fri, Apr 21, 2017 at 10:50:12AM -0400, Willem de Bruijn wrote:
> >> >>> Maybe I was wrong, but according to Michael's comment it looks like he
> >> >>> want
> >> >>> check affinity_hint_set just for speculative tx polling on rx napi
> >> >>> instead
> >> >>> of disabling it at all.
> >> >>>
> >> >>> And I'm not convinced this is really needed, driver only provide 
> >> >>> affinity
> >> >>> hint instead of affinity, so it's not guaranteed that tx and rx 
> >> >>> interrupt
> >> >>> are in the same vcpus.
> >> >>
> >> >> You're right. I made the restriction broader than the request, to really
> >> >> err
> >> >> on the side of caution for the initial merge of napi tx. And enabling
> >> >> the optimization is always a win over keeping it off, even without irq
> >> >> affinity.
> >> >>
> >> >> The cycle cost is significant without affinity regardless of whether the
> >> >> optimization is used.
> >> >
> >> >
> >> > Yes, I noticed this in the past too.
> >> >
> >> >> Though this is not limited to napi-tx, it is more
> >> >> pronounced in that mode than without napi.
> >> >>
> >> >> 1x TCP_RR for affinity configuration {process, rx_irq, tx_irq}:
> >> >>
> >> >> upstream:
> >> >>
> >> >> 1,1,1: 28985 Mbps, 278 Gcyc
> >> >> 1,0,2: 30067 Mbps, 402 Gcyc
> >> >>
> >> >> napi tx:
> >> >>
> >> >> 1,1,1: 34492 Mbps, 269 Gcyc
> >> >> 1,0,2: 36527 Mbps, 537 Gcyc (!)
> >> >> 1,0,1: 36269 Mbps, 394 Gcyc
> >> >> 1,0,0: 34674 Mbps, 402 Gcyc
> >> >>
> >> >> This is a particularly strong example. It is also representative
> >> >> of most RR tests. It is less pronounced in other streaming tests.
> >> >> 10x TCP_RR, for instance:
> >> >>
> >> >> upstream:
> >> >>
> >> >> 1,1,1: 42267 Mbps, 301 Gcyc
> >> >> 1,0,2: 40663 Mbps, 445 Gcyc
> >> >>
> >> >> napi tx:
> >> >>
> >> >> 1,1,1: 42420 Mbps, 303 Gcyc
> >> >> 1,0,2:  42267 Mbps, 431 Gcyc
> >> >>
> >> >> These numbers were obtained with the virtqueue_enable_cb_delayed
> >> >> optimization after xmit_skb, btw. It turns out that moving that before
> >> >> increases 1x TCP_RR further to ~39 Gbps, at the cost of reducing
> >> >> 100x TCP_RR a bit.
> >> >
> >> >
> >> > I see, so I think we can leave the affinity hint optimization/check for
> >> > future investigation:
> >> >
> >> > - to avoid endless optimization (e.g we may want to share a single
> >> > vector/napi for tx/rx queue pairs in the future) for this series.
> >> > - tx napi is disabled by default which means we can do optimization on 
> >> > top.
> >>
> >> Okay. I'll drop the vi->affinity_hint_set from the patch set for now.
> >
> > I kind of like it, let's be conservative. But I'd prefer a comment
> > near it explaining why it's there.
> 
> I don't feel strongly. Was minutes away from sending a v3 with this
> code reverted, but I'll reinstate it and add a comment. Other planned
> changes based on Jason's feedback to v2:
> 
>   v2 -> v3:
> - convert __netif_tx_trylock to __netif_tx_lock on tx napi poll
>   ensure that the handler always cleans, to avoid deadlock
> - unconditionally clean in start_xmit
>   avoid adding an unnecessary "if (use_napi)" branch
> - remove virtqueue_disable_cb in patch 5/5
>   a noop in the common event_idx based loop

Makes sense, thanks!

-- 
MST
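
For reference, the virtqueue_enable_cb_delayed placement quoted in this
thread corresponds roughly to this shape in start_xmit (a sketch only;
surrounding logic elided):

static netdev_tx_t start_xmit(struct sk_buff *skb, struct net_device *dev)
{
    struct virtnet_info *vi = netdev_priv(dev);
    struct send_queue *sq = &vi->sq[skb_get_queue_mapping(skb)];
    bool kick = !skb->xmit_more;
    bool use_napi = sq->napi.weight;
    int err;

    /* ... cleaning and validation elided ... */

    err = xmit_skb(sq, skb);

    /* Re-arm the completion callback only on the last packet of a
     * burst; the delayed variant sets the event index further out.
     * Re-arming after xmit_skb matches the numbers quoted above;
     * moving it before raised 1x TCP_RR at some cost to 100x TCP_RR.
     */
    if (use_napi && kick)
        virtqueue_enable_cb_delayed(sq->vq);

    /* ... stop-queue check and virtqueue_kick elided ... */
    return NETDEV_TX_OK;
}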


Re: [PATCH RFC (resend) net-next 0/6] virtio-net: Add support for virtio-net header extensions

2017-04-24 Thread Michael S. Tsirkin
On Thu, Apr 20, 2017 at 11:34:57AM -0400, Vlad Yasevich wrote:
> > - For 1.1, do we really want something like vnet header? AFAIK, it was not 
> > used by modern
> > NICs, is this better to pack all meta-data into descriptor itself? This may 
> > need a some
> > changes in tun/macvtap, but looks more PCIE friendly.
> 
> That would really be ideal and I've looked at this.

We already have at least 16 unused bits in the used ring
(head is 16 bits; we are using 32 for it).
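
For reference, the used ring element in question, from
include/uapi/linux/virtio_ring.h:

/* A descriptor head index fits in 16 bits (ring sizes are at most
 * 2^15), so the high half of 'id' is the spare space mentioned above
 * that a future layout could reclaim for per-buffer metadata.
 */
struct vring_used_elem {
    /* Index of start of used descriptor chain. */
    __virtio32 id;
    /* Total length of the descriptor chain which was used (written to) */
    __virtio32 len;
};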

-- 
MST


Re: [PATCH net-next v2 2/5] virtio-net: transmit napi

2017-04-24 Thread Michael S. Tsirkin
On Fri, Apr 21, 2017 at 10:50:12AM -0400, Willem de Bruijn wrote:
> >>> Maybe I was wrong, but according to Michael's comment it looks like he
> >>> want
> >>> check affinity_hint_set just for speculative tx polling on rx napi
> >>> instead
> >>> of disabling it at all.
> >>>
> >>> And I'm not convinced this is really needed, driver only provide affinity
> >>> hint instead of affinity, so it's not guaranteed that tx and rx interrupt
> >>> are in the same vcpus.
> >>
> >> You're right. I made the restriction broader than the request, to really
> >> err
> >> on the side of caution for the initial merge of napi tx. And enabling
> >> the optimization is always a win over keeping it off, even without irq
> >> affinity.
> >>
> >> The cycle cost is significant without affinity regardless of whether the
> >> optimization is used.
> >
> >
> > Yes, I noticed this in the past too.
> >
> >> Though this is not limited to napi-tx, it is more
> >> pronounced in that mode than without napi.
> >>
> >> 1x TCP_RR for affinity configuration {process, rx_irq, tx_irq}:
> >>
> >> upstream:
> >>
> >> 1,1,1: 28985 Mbps, 278 Gcyc
> >> 1,0,2: 30067 Mbps, 402 Gcyc
> >>
> >> napi tx:
> >>
> >> 1,1,1: 34492 Mbps, 269 Gcyc
> >> 1,0,2: 36527 Mbps, 537 Gcyc (!)
> >> 1,0,1: 36269 Mbps, 394 Gcyc
> >> 1,0,0: 34674 Mbps, 402 Gcyc
> >>
> >> This is a particularly strong example. It is also representative
> >> of most RR tests. It is less pronounced in other streaming tests.
> >> 10x TCP_RR, for instance:
> >>
> >> upstream:
> >>
> >> 1,1,1: 42267 Mbps, 301 Gcyc
> >> 1,0,2: 40663 Mbps, 445 Gcyc
> >>
> >> napi tx:
> >>
> >> 1,1,1: 42420 Mbps, 303 Gcyc
> >> 1,0,2:  42267 Mbps, 431 Gcyc
> >>
> >> These numbers were obtained with the virtqueue_enable_cb_delayed
> >> optimization after xmit_skb, btw. It turns out that moving that before
> >> increases 1x TCP_RR further to ~39 Gbps, at the cost of reducing
> >> 100x TCP_RR a bit.
> >
> >
> > I see, so I think we can leave the affinity hint optimization/check for
> > future investigation:
> >
> > - to avoid endless optimization (e.g we may want to share a single
> > vector/napi for tx/rx queue pairs in the future) for this series.
> > - tx napi is disabled by default which means we can do optimization on top.
> 
> Okay. I'll drop the vi->affinity_hint_set from the patch set for now.

I kind of like it, let's be conservative. But I'd prefer a comment
near it explaining why it's there.

-- 
MST
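
For reference, the guard from patch 2/5 extended with such a comment
might read as follows (a sketch; the comment wording is ours, not from
the posted series):

static void virtnet_napi_tx_enable(struct virtnet_info *vi,
                                   struct virtqueue *vq,
                                   struct napi_struct *napi)
{
    if (!napi->weight)
        return;

    /* Tx napi touches cachelines shared with rx napi. Unless irq
     * affinity steers tx and rx interrupts onto the same cpu, the
     * extra cache misses cost up to a third more cycles (e.g., in
     * TCP_RR), so keep tx napi disabled when no affinity hint is set.
     */
    if (!vi->affinity_hint_set) {
        napi->weight = 0;
        return;
    }

    return virtnet_napi_enable(vq, napi);
}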


Re: [RFC 1/3] virtio-iommu: firmware description of the virtual topology

2017-04-24 Thread Jean-Philippe Brucker
On 21/04/17 09:43, Tian, Kevin wrote:
>> From: Jean-Philippe Brucker [mailto:jean-philippe.bruc...@arm.com]
>> Sent: Wednesday, April 19, 2017 2:41 AM
>>
>> On 18/04/17 10:51, Tian, Kevin wrote:
 From: Jean-Philippe Brucker
 Sent: Saturday, April 8, 2017 3:18 AM

 Unlike other virtio devices, the virtio-iommu doesn't work independently,
 it is linked to other virtual or assigned devices. So before jumping into
 device operations, we need to define a way for the guest to discover the
 virtual IOMMU and the devices it translates.

 The host must describe the relation between IOMMU and devices to the
 guest
 using either device-tree or ACPI. The virtual IOMMU identifies each
>>>
>>> Do you plan to support both device tree and ACPI?
>>
>> Yes, with ACPI the topology would be described using IORT nodes. I didn't
>> include an example in my driver because DT is sufficient for a prototype
>> and is readily available (both in Linux and kvmtool), whereas IORT would
>> be quite easy to reuse in Linux, but isn't present in kvmtool at the
>> moment. However, both interfaces have to be supported for the virtio-
>> iommu
>> to be portable.
> 
> 'portable' means whether guest enables ACPI?

Sorry, "supported" isn't the right term for what I meant. It is for
firmware interface to accommodate devices, not the other way around, so
firmware consideration is outside the scope of the virtio-iommu
specification and virtio-iommu itself doesn't need to "support" any interface.

For the purpose of this particular document however, both popular firmware
interfaces (ACPI and DT) must be taken into account. Those are the two
interfaces I know about, there might be others. But I figure that a VMM
implementing a virtual IOMMU is complex enough to be able to also
implement one of these two interfaces, so talking about DT and ACPI should
fit all use cases. It also provides two examples for other firmware
interfaces that wish to describe the IOMMU topology.

 virtual device with a 32-bit ID, that we will call "Device ID" in this
 document. Device IDs are not necessarily unique system-wide, but they
>> may
 not overlap within a single virtual IOMMU. Device ID of passed-through
 devices do not need to match IDs seen by the physical IOMMU.

 The virtual IOMMU uses virtio-mmio transport exclusively, not virtio-pci,
 because with PCI the IOMMU interface would itself be an endpoint, and
 existing firmware interfaces don't allow to describe IOMMU<->master
 relations between PCI endpoints.
>>>
>>> I'm not familiar with virtio-mmio mechanism. Curious how devices in
>>> virtio-mmio are enumerated today? Could we use that mechanism to
>>> identify vIOMMUs and then invent a purely para-virtualized method to
>>> enumerate devices behind each vIOMMU?
>>
>> Using DT, virtio-mmio devices are described with "virtio-mmio" compatible
>> node, and with ACPI they use _HID LNRO0005. Since the host already
>> describes available devices to a guest using a firmware interface, I think
>> we should reuse the tools provided by that interface for describing
>> relations between DMA masters and IOMMU.
> 
> OK, I didn't realize virtio-mmio is defined to rely on DT for enumeration.

Not necessarily DT, you can have virtio-mmio devices in ACPI namespace as
well. Qemu has an example of LNRO0005 with ACPI.

>>> Asking this is because each vendor has its own enumeration methods.
>>> ARM has device tree and ACPI IORT. AMD has ACPI IVRS and device
>>> tree (same format as ARM?). Intel has APCI DMAR and sub-tables. Your
>>> current proposal looks following ARM definitions which I'm not sure
>>> extensible enough to cover features defined only in other vendors'
>>> structures.
>>
>> ACPI IORT can be extended to incorporate para-virtualized IOMMUs,
>> regardless of the underlying architecture. It isn't defined solely for the
>> ARM SMMU, but serves a more general purpose of describing a map of
>> device
>> identifiers communicated from one components to another. Both DMAR and
>> IVRS have such description (respectively DRHD and IVHD), but they are
>> designed for a specific IOMMU, whereas IORT could host other kinds.
> 
> I'll take a look at IORT definition. DRHD includes information more
> than device mapping.

I guess that most information provided by DMAR and others are
IOMMU-specific and the equivalent for virtio-iommu would fit in virtio
config space. But describing device mapping relative to IOMMUs is the same
problem for all systems. Doing it with a virtio-iommu probing mechanism
would require to reinvent a way to identify devices every time a host
wants to add support for a new bus (RID for PCI, base address for MMIO,
others in the future), when firmwares would have to provide this
information anyway for bare metal.

>> It seems that all we really need is an interface that says "there is a
>> virtio-iommu at address X, here are the devices it translates and their
>> corresponding IDs", 

Re: [RFC 3/3] virtio-iommu: future work

2017-04-24 Thread Jean-Philippe Brucker
On 21/04/17 09:31, Tian, Kevin wrote:
>> From: Jean-Philippe Brucker
>> Sent: Saturday, April 8, 2017 3:18 AM
>>
>> Here I propose a few ideas for extensions and optimizations. This is all
>> very exploratory, feel free to correct mistakes and suggest more things.
> 
> [...]
>>
>>   II. Page table sharing
>>   ==
>>
>>   1. Sharing IOMMU page tables
>>   
>>
>> VIRTIO_IOMMU_F_PT_SHARING
>>
>> This is independent of the nested mode described in I.2, but relies on a
>> similar feature in the physical IOMMU: having two stages of page tables,
>> one for the host and one for the guest.
>>
>> When this is supported, the guest can manage its own s1 page directory, to
>> avoid sending MAP/UNMAP requests. Feature
>> VIRTIO_IOMMU_F_PT_SHARING allows
>> a driver to give a page directory pointer (pgd) to the host and send
>> invalidations when removing or changing a mapping. In this mode, three
>> requests are used: probe, attach and invalidate. An address space cannot
>> be using the MAP/UNMAP interface and PT_SHARING at the same time.
>>
>> Device and driver first need to negotiate which page table format they
>> will be using. This depends on the physical IOMMU, so the request contains
>> a negotiation part to probe the device capabilities.
>>
>> (1) Driver attaches devices to address spaces as usual, but a flag
>> VIRTIO_IOMMU_ATTACH_F_PRIVATE (working title) tells the device not to
>> create page tables for use with the MAP/UNMAP API. The driver intends
>> to manage the address space itself.
>>
>> (2) Driver sends a PROBE_TABLE request. It sets len > 0 with the size of
>> pg_format array.
>>
>>  VIRTIO_IOMMU_T_PROBE_TABLE
>>
>>  struct virtio_iommu_req_probe_table {
>>  le32address_space;
>>  le32flags;
>>  le32len;
>>
>>  le32nr_contexts;
>>  struct {
>>  le32model;
>>  u8  format[64];
>>  } pg_format[len];
>>  };
>>
>> Introducing a probe request is more flexible than advertising those
>> features in virtio config, because capabilities are dynamic, and depend on
>> which devices are attached to an address space. Within a single address
>> space, devices may support different numbers of contexts (PASIDs), and
>> some may not support recoverable faults.
>>
>> (3) Device responds success with all page table formats implemented by the
>> physical IOMMU in pg_format. 'model' 0 is invalid, so driver can
>> initialize the array to 0 and deduce from there which entries have
>> been filled by the device.
>>
>> Using a probe method seems preferable over trying to attach every possible
>> format until one sticks. For instance, with an ARM guest running on an x86
>> host, PROBE_TABLE would return the Intel IOMMU page table format, and
>> the
>> guest could use that page table code to handle its mappings, hidden behind
>> the IOMMU API. This requires that the page-table code is reasonably
>> abstracted from the architecture, as is done with drivers/iommu/io-pgtable
>> (an x86 guest could use any format implement by io-pgtable for example.)
> 
> So essentially you need to modify all existing IOMMU drivers to support page
> table sharing in pvIOMMU. After abstraction is done the core pvIOMMU files 
> can be kept vendor agnostic. But if we talk about the whole pvIOMMU 
> module, it actually includes vendor specific logic thus unlike typical 
> para-virtualized virtio drivers being completely vendor agnostic. Is this 
> understanding accurate?

Yes, although kernel modules would be separate. For Linux on ARM we
already have the page-table logic abstracted in iommu/io-pgtable module,
because multiple IOMMUs share the same PT formats (SMMUv2, SMMUv3, Renesas
IPMMU, Qcom MSM, Mediatek). It offers a simple interface:

* When attaching devices to an IOMMU domain, the IOMMU driver registers
its page table format and provides invalidation callbacks.

* On iommu_map/unmap, the IOMMU driver calls into io_pgtable_ops, which
provide map, unmap and iova_to_phys functions.

* Page table operations call back into the driver via iommu_gather_ops
when they need to invalidate TLB entries.

Currently only the few flavors of ARM PT formats are implemented, but
other page table formats could be added if they fit this model.
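
As a concrete illustration of that contract, a sketch of how an IOMMU
driver hooks into io-pgtable (the my_* names are hypothetical
placeholders; the API is as in drivers/iommu/io-pgtable.h at the time
of writing):

/* Invalidation callbacks that the page-table code invokes on the
 * driver when TLB entries must be flushed.
 */
static const struct iommu_gather_ops my_gather_ops = {
    .tlb_flush_all = my_tlb_flush_all,
    .tlb_add_flush = my_tlb_add_flush,
    .tlb_sync      = my_tlb_sync,
};

static int my_domain_init(struct my_domain *dom, struct device *iommu_dev)
{
    struct io_pgtable_cfg cfg = {
        .pgsize_bitmap = SZ_4K | SZ_2M | SZ_1G,
        .ias           = 48,            /* input (IOVA) address bits */
        .oas           = 48,            /* output (PA) address bits */
        .tlb           = &my_gather_ops,
        .iommu_dev     = iommu_dev,
    };

    /* Register the page-table format; ARM_64_LPAE_S1 is just one
     * possibility, chosen here for illustration.
     */
    dom->pgtbl_ops = alloc_io_pgtable_ops(ARM_64_LPAE_S1, &cfg, dom);
    return dom->pgtbl_ops ? 0 : -ENOMEM;
}

/* iommu_map/unmap then reduce to calls into io_pgtable_ops: */
static int my_map(struct my_domain *dom, unsigned long iova,
                  phys_addr_t pa, size_t size, int prot)
{
    return dom->pgtbl_ops->map(dom->pgtbl_ops, iova, pa, size, prot);
}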

> It also means in the host-side pIOMMU driver needs to propagate all
> supported formats through VFIO to Qemu vIOMMU, meaning
> such format definitions need be consistently agreed across all those 
> components.

Yes, that's the icky part. We need to define a format that every OS and
hypervisor implementing virtio-iommu can understand (similarly to the
PASID table sharing interface that Yi L is working on for VFIO, although
that one is contained in Linux UAPI and doesn't require other OSes to know
about it).

>>   2. Sharing MMU page tables
>>   --
>>
>> The guest can share process 

Re: [RFC 2/3] virtio-iommu: device probing and operations

2017-04-24 Thread Jean-Philippe Brucker
On 21/04/17 10:02, Tian, Kevin wrote:
>> From: Jean-Philippe Brucker [mailto:jean-philippe.bruc...@arm.com]
>> Sent: Wednesday, April 19, 2017 2:46 AM
>>
>> On 18/04/17 11:26, Tian, Kevin wrote:
 From: Jean-Philippe Brucker
 Sent: Saturday, April 8, 2017 3:18 AM

>>> [...]
   II. Feature bits
   

 VIRTIO_IOMMU_F_INPUT_RANGE (0)
  Available range of virtual addresses is described in input_range
>>>
>>> Usually only the maximum supported address bits are important.
>>> Curious do you see such situation where low end of the address
>>> space is not usable (since you have both start/end defined later)?
>>
>> A start address would allow to provide something resembling a GART to the
>> guest: an IOMMU with one address space (ioasid_bits=0) and a small IOVA
>> aperture. I'm not sure how useful that would be in practice.
> 
> Intel VT-d has no such limitation, as far as I can tell. :-)
> 
>>
>> On a related note, the virtio-iommu itself doesn't provide a
>> per-address-space aperture as it stands. For example, attaching a device
>> to an address space might restrict the available IOVA range for the whole
>> AS if that device cannot write to high memory (above 32-bit). If the guest
>> attempts to map an IOVA outside this window into the device's address
>> space, it should expect the MAP request to fail. And when attaching, if
>> the address space already has mappings outside this window, then ATTACH
>> should fail.
>>
>> This too seems to be something that ought to be communicated by firmware,
>> but bits are missing (I can't find anything equivalent to DT's dma-ranges
>> for PCI root bridges in ACPI tables, for example). In addition VFIO
>> doesn't communicate any DMA mask for devices, and doesn't check them
>> itself. I guess that the host could find out the DMA mask of devices one
>> way or another, but it is tricky to enforce, so I didn't make this a hard
>> requirement. Although I should probably add a few words about it.
> 
> If there is no such communication on bare metal, then same for pvIOMMU.
> 
>>
>>> [...]
   1. Attach device
   

 struct virtio_iommu_req_attach {
le32address_space;
le32device;
le32flags/reserved;
 };

 Attach a device to an address space. 'address_space' is an identifier
 unique to the guest. If the address space doesn't exist in the IOMMU
>>>
>>> Based on your description this address space ID is per operation right?
>>> MAP/UNMAP and page-table sharing should have different ID spaces...
>>
>> I think it's simpler if we keep a single IOASID space per virtio-iommu
>> device, because the maximum number of address spaces (described by
>> ioasid_bits) might be a restriction of the pIOMMU. For page-table sharing
>> you still need to define which devices will share a page directory using
>> ATTACH requests, though that interface is not set in stone.
> 
> got you. yes VM is supposed to consume fewer IOASIDs than physically
> available. It doesn’t hurt to have one IOASID space for both IOVA
> map/unmap usages (one IOASID per device) and SVM usages (multiple
> IOASIDs per device). The former is digested by software and the latter
> will be bound to hardware.
> 

Hmm, I'm using address space indexed by IOASID for "classic" IOMMU, and
then contexts indexed by PASID when talking about SVM. So in my mind an
address space can have multiple sub-address-spaces (contexts). Number of
IOASIDs is a limitation of the pIOMMU, and number of PASIDs is a
limitation of the device. Therefore attaching devices to address spaces
would update the number of available contexts in that address space. The
terminology is not ideal, and I'd be happy to change it for something more
clear.
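
For concreteness, the hierarchy described here could be pictured as
follows (an illustrative data model only, not a proposed interface):

struct viommu_address_space {
    u32              ioasid;      /* bounded by ioasid_bits (pIOMMU) */
    struct list_head devices;     /* endpoints ATTACHed to it */
    struct list_head contexts;    /* per-PASID sub-address-spaces */
};

struct viommu_context {
    u32 pasid;                    /* bounded by the device (SVM) */
    struct viommu_address_space *parent;
};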

Thanks,
Jean-Philippe

[PATCH 6/6] drm: fourcc byteorder: virtio restrict to XRGB8888

2017-04-24 Thread Gerd Hoffmann
While wading through the code I've noticed we have a little issue in
virtio:  We attach a format to the bo when it is created
(DRM_IOCTL_MODE_CREATE_DUMB), not when we map it as framebuffer
(DRM_IOCTL_MODE_ADDFB).

Easy way out:  support a single format only.
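
To see why, a hedged userspace sketch using libdrm (the fd, sizes and
format are made up for illustration, and error handling is omitted):
DRM_IOCTL_MODE_CREATE_DUMB only carries a bpp, and a concrete fourcc
appears for the first time when the buffer becomes a framebuffer (the
fourcc-based ADDFB2 variant is shown below), so the driver has to pick a
resource format at create time without knowing it:

    #include <stdint.h>
    #include <sys/ioctl.h>
    #include <xf86drm.h>
    #include <xf86drmMode.h>
    #include <drm/drm_fourcc.h>

    static void create_and_add_fb(int fd)   /* fd: open DRM device node */
    {
        struct drm_mode_create_dumb creq = {
            .width = 1024, .height = 768, .bpp = 32, /* no fourcc here */
        };
        uint32_t fb_id;

        ioctl(fd, DRM_IOCTL_MODE_CREATE_DUMB, &creq);

        /* Only now does userspace name an actual format: */
        drmModeAddFB2(fd, creq.width, creq.height, DRM_FORMAT_XRGB8888,
                      (uint32_t[4]){ creq.handle },
                      (uint32_t[4]){ creq.pitch },
                      (uint32_t[4]){ 0 }, &fb_id, 0);
    }

With a single supported format there is nothing left to guess at create
time.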

Signed-off-by: Gerd Hoffmann 
---
 drivers/gpu/drm/virtio/virtgpu_gem.c   | 5 +++-
 drivers/gpu/drm/virtio/virtgpu_plane.c | 9 +--------
 2 files changed, 5 insertions(+), 9 deletions(-)

diff --git a/drivers/gpu/drm/virtio/virtgpu_gem.c 
b/drivers/gpu/drm/virtio/virtgpu_gem.c
index 4f2c2dc731..b09e5e5ae4 100644
--- a/drivers/gpu/drm/virtio/virtgpu_gem.c
+++ b/drivers/gpu/drm/virtio/virtgpu_gem.c
@@ -90,7 +90,10 @@ int virtio_gpu_mode_dumb_create(struct drm_file *file_priv,
uint32_t resid;
uint32_t format;
 
-   pitch = args->width * ((args->bpp + 1) / 8);
+   if (args->bpp != 32)
+   return -EINVAL;
+
+   pitch = args->width * 4;
args->size = pitch * args->height;
args->size = ALIGN(args->size, PAGE_SIZE);
 
diff --git a/drivers/gpu/drm/virtio/virtgpu_plane.c 
b/drivers/gpu/drm/virtio/virtgpu_plane.c
index f40ffc9a70..3a4498a223 100644
--- a/drivers/gpu/drm/virtio/virtgpu_plane.c
+++ b/drivers/gpu/drm/virtio/virtgpu_plane.c
@@ -28,14 +28,7 @@
 #include 
 
 static const uint32_t virtio_gpu_formats[] = {
-   DRM_FORMAT_XRGB8888,
-   DRM_FORMAT_ARGB8888,
-   DRM_FORMAT_BGRX8888,
-   DRM_FORMAT_BGRA8888,
-   DRM_FORMAT_RGBX8888,
-   DRM_FORMAT_RGBA8888,
-   DRM_FORMAT_XBGR8888,
-   DRM_FORMAT_ABGR8888,
+   DRM_FORMAT_CPU_XRGB8888,
 };
 
 static const uint32_t virtio_gpu_cursor_formats[] = {
-- 
2.9.3



[PATCH 5/6] drm: fourcc byteorder: adapt virtio to drm_mode_legacy_fb_format update

2017-04-24 Thread Gerd Hoffmann
Signed-off-by: Gerd Hoffmann 
---
 drivers/gpu/drm/virtio/virtgpu_gem.c   |  2 +-
 drivers/gpu/drm/virtio/virtgpu_plane.c | 31 -------------------------------
 2 files changed, 1 insertion(+), 32 deletions(-)

diff --git a/drivers/gpu/drm/virtio/virtgpu_gem.c 
b/drivers/gpu/drm/virtio/virtgpu_gem.c
index cc025d8fbe..4f2c2dc731 100644
--- a/drivers/gpu/drm/virtio/virtgpu_gem.c
+++ b/drivers/gpu/drm/virtio/virtgpu_gem.c
@@ -99,7 +99,7 @@ int virtio_gpu_mode_dumb_create(struct drm_file *file_priv,
if (ret)
goto fail;
 
-   format = virtio_gpu_translate_format(DRM_FORMAT_XRGB8888);
+   format = virtio_gpu_translate_format(DRM_FORMAT_CPU_XRGB8888);
	virtio_gpu_resource_id_get(vgdev, &resid);
virtio_gpu_cmd_create_resource(vgdev, resid, format,
   args->width, args->height);
diff --git a/drivers/gpu/drm/virtio/virtgpu_plane.c 
b/drivers/gpu/drm/virtio/virtgpu_plane.c
index adcdbd0abe..f40ffc9a70 100644
--- a/drivers/gpu/drm/virtio/virtgpu_plane.c
+++ b/drivers/gpu/drm/virtio/virtgpu_plane.c
@@ -39,11 +39,7 @@ static const uint32_t virtio_gpu_formats[] = {
 };
 
 static const uint32_t virtio_gpu_cursor_formats[] = {
-#ifdef __BIG_ENDIAN
-   DRM_FORMAT_BGRA8888,
-#else
	DRM_FORMAT_ARGB8888,
-#endif
 };
 
 uint32_t virtio_gpu_translate_format(uint32_t drm_fourcc)
@@ -51,32 +47,6 @@ uint32_t virtio_gpu_translate_format(uint32_t drm_fourcc)
uint32_t format;
 
switch (drm_fourcc) {
-#ifdef __BIG_ENDIAN
-   case DRM_FORMAT_XRGB8888:
-   format = VIRTIO_GPU_FORMAT_X8R8G8B8_UNORM;
-   break;
-   case DRM_FORMAT_ARGB8888:
-   format = VIRTIO_GPU_FORMAT_A8R8G8B8_UNORM;
-   break;
-   case DRM_FORMAT_BGRX8888:
-   format = VIRTIO_GPU_FORMAT_B8G8R8X8_UNORM;
-   break;
-   case DRM_FORMAT_BGRA8888:
-   format = VIRTIO_GPU_FORMAT_B8G8R8A8_UNORM;
-   break;
-   case DRM_FORMAT_RGBX8888:
-   format = VIRTIO_GPU_FORMAT_R8G8B8X8_UNORM;
-   break;
-   case DRM_FORMAT_RGBA8888:
-   format = VIRTIO_GPU_FORMAT_R8G8B8A8_UNORM;
-   break;
-   case DRM_FORMAT_XBGR8888:
-   format = VIRTIO_GPU_FORMAT_X8B8G8R8_UNORM;
-   break;
-   case DRM_FORMAT_ABGR8888:
-   format = VIRTIO_GPU_FORMAT_A8B8G8R8_UNORM;
-   break;
-#else
	case DRM_FORMAT_XRGB8888:
	format = VIRTIO_GPU_FORMAT_B8G8R8X8_UNORM;
	break;
@@ -101,7 +71,6 @@ uint32_t virtio_gpu_translate_format(uint32_t drm_fourcc)
	case DRM_FORMAT_ABGR8888:
format = VIRTIO_GPU_FORMAT_R8G8B8A8_UNORM;
break;
-#endif
default:
/*
 * This should not happen, we handle everything listed
-- 
2.9.3



[PATCH 4/6] drm: fourcc byteorder: adapt bochs-drm to drm_mode_legacy_fb_format update

2017-04-24 Thread Gerd Hoffmann
Signed-off-by: Gerd Hoffmann 
---
 drivers/gpu/drm/bochs/bochs_mm.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/gpu/drm/bochs/bochs_mm.c b/drivers/gpu/drm/bochs/bochs_mm.c
index 857755ac2d..781d35bdff 100644
--- a/drivers/gpu/drm/bochs/bochs_mm.c
+++ b/drivers/gpu/drm/bochs/bochs_mm.c
@@ -508,7 +508,7 @@ bochs_user_framebuffer_create(struct drm_device *dev,
   (mode_cmd->pixel_format >> 16) & 0xff,
   (mode_cmd->pixel_format >> 24) & 0xff);
 
-   if (mode_cmd->pixel_format != DRM_FORMAT_XRGB8888)
+   if (mode_cmd->pixel_format != DRM_FORMAT_CPU_XRGB8888)
return ERR_PTR(-ENOENT);
 
obj = drm_gem_object_lookup(filp, mode_cmd->handles[0]);
-- 
2.9.3
