Re: [PATCH][net-next] tun: remove unnecessary check in tun_flow_update

2018-12-06 Thread Jason Wang



On 2018/12/6 at 4:28 PM, Li RongQing wrote:

The caller has guaranteed that rxhash is not zero.

Signed-off-by: Li RongQing 
---
  drivers/net/tun.c | 5 +
  1 file changed, 1 insertion(+), 4 deletions(-)

diff --git a/drivers/net/tun.c b/drivers/net/tun.c
index d0745dc81976..6760b86547df 100644
--- a/drivers/net/tun.c
+++ b/drivers/net/tun.c
@@ -529,10 +529,7 @@ static void tun_flow_update(struct tun_struct *tun, u32 rxhash,
unsigned long delay = tun->ageing_time;
u16 queue_index = tfile->queue_index;
  
-	if (!rxhash)
-		return;
-	else
-		head = &tun->flows[tun_hashfn(rxhash)];
+	head = &tun->flows[tun_hashfn(rxhash)];
  
  	rcu_read_lock();
  



Acked-by: Jason Wang 
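For context, the guarantee comes from the call site in tun_get_user(), which only computes rxhash when no steering program is attached and only calls tun_flow_update() when a hash was actually produced. A paraphrased sketch of that call site (it may not match the exact source):

	rcu_read_lock();
	if (!rcu_dereference(tun->steering_prog))
		rxhash = __skb_get_hash_symmetric(skb);
	rcu_read_unlock();

	/* ... packet is handed to the stack ... */

	if (rxhash)
		tun_flow_update(tun, rxhash, tfile);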




Re: [PATCH][net-next] tun: align write-heavy flow entry members to a cache line

2018-12-06 Thread Jason Wang



On 2018/12/6 at 4:08 PM, Li RongQing wrote:

The tun flow entry 'updated' field is written on every received packet.
Thus if a flow is receiving packets through a particular flow entry, it
will cause false sharing with all the other CPUs that have looked it up,
so move it into its own cache line.

Also update the 'queue_index' and 'updated' fields only when they
actually change, to reduce the cache false sharing further.

Signed-off-by: Zhang Yu 
Signed-off-by: Wang Li 
Signed-off-by: Li RongQing 
---
  drivers/net/tun.c | 8 +---
  1 file changed, 5 insertions(+), 3 deletions(-)

diff --git a/drivers/net/tun.c b/drivers/net/tun.c
index 835c73f42ae7..d0745dc81976 100644
--- a/drivers/net/tun.c
+++ b/drivers/net/tun.c
@@ -201,7 +201,7 @@ struct tun_flow_entry {
u32 rxhash;
u32 rps_rxhash;
int queue_index;
-   unsigned long updated;
+   unsigned long updated ____cacheline_aligned_in_smp;
  };
  
  #define TUN_NUM_FLOW_ENTRIES 1024

@@ -539,8 +539,10 @@ static void tun_flow_update(struct tun_struct *tun, u32 rxhash,
e = tun_flow_find(head, rxhash);
if (likely(e)) {
/* TODO: keep queueing to old queue until it's empty? */
-   e->queue_index = queue_index;
-   e->updated = jiffies;
+   if (e->queue_index != queue_index)
+   e->queue_index = queue_index;
+   if (e->updated != jiffies)
+   e->updated = jiffies;



Acked-by: Jason Wang 

Btw, do you have perf numbers for this?

Thanks



sock_rps_record_flow_hash(e->rps_rxhash);
} else {
spin_lock_bh(&tun->lock);
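To see the cache effect this patch targets, here is a small self-contained userspace illustration (a sketch; the struct, names and iteration count are made up for the demo and are not taken from tun.c). Two threads each increment their own counter; when the counters share a cache line, every store by one thread steals the line from the other, and padding the second counter onto its own cache line removes the false sharing:

/* build: gcc -O2 -pthread false_sharing_demo.c   (add -DALIGNED to pad) */
#include <pthread.h>
#include <stdio.h>

#define ITERS 100000000UL

struct counters {
	volatile unsigned long a;		/* written by thread 1 */
#ifdef ALIGNED
	volatile unsigned long b __attribute__((aligned(64)));
#else
	volatile unsigned long b;		/* shares a's cache line */
#endif
};

static struct counters c;

static void *bump_a(void *unused)
{
	for (unsigned long i = 0; i < ITERS; i++)
		c.a++;
	return NULL;
}

static void *bump_b(void *unused)
{
	for (unsigned long i = 0; i < ITERS; i++)
		c.b++;
	return NULL;
}

int main(void)
{
	pthread_t t1, t2;

	pthread_create(&t1, NULL, bump_a, NULL);
	pthread_create(&t2, NULL, bump_b, NULL);
	pthread_join(t1, NULL);
	pthread_join(t2, NULL);

	/* compare wall-clock time of the plain and -DALIGNED builds */
	printf("a=%lu b=%lu\n", c.a, c.b);
	return 0;
}

The kernel's ____cacheline_aligned_in_smp in the hunk above plays the role of the aligned attribute here.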


Re: [PATCH net] tun: remove skb access after netif_receive_skb

2018-12-03 Thread Jason Wang



On 2018/11/30 at 1:29 PM, Toshiaki Makita wrote:

On 2018/11/30 11:40, Jason Wang wrote:

On 2018/11/30 at 10:30 AM, Prashant Bhole wrote:

In tun.c skb->len was accessed while doing stats accounting after a
call to netif_receive_skb. We cannot access the skb after this call
because the buffer may be dropped.

The fix for this bug would be to store skb->len in a local variable and
then use it after netif_receive_skb(). IMO using the xdp data size for
accounting bytes is better because the input to tun_xdp_one() is an
xdp_buff.

Hence this patch:
- fixes a bug by removing skb access after netif_receive_skb()
- uses xdp data size for accounting bytes

...

Fixes: 043d222f93ab ("tuntap: accept an array of XDP buffs through sendmsg()")
Reviewed-by: Toshiaki Makita 
Signed-off-by: Prashant Bhole 
---
   drivers/net/tun.c | 3 ++-
   1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/drivers/net/tun.c b/drivers/net/tun.c
index e244f5d7512a..6e388792c0a8 100644
--- a/drivers/net/tun.c
+++ b/drivers/net/tun.c
@@ -2385,6 +2385,7 @@ static int tun_xdp_one(struct tun_struct *tun,
  struct tun_file *tfile,
  struct xdp_buff *xdp, int *flush)
   {
+    unsigned int datasize = xdp->data_end - xdp->data;
   struct tun_xdp_hdr *hdr = xdp->data_hard_start;
   struct virtio_net_hdr *gso = &hdr->gso;
   struct tun_pcpu_stats *stats;
@@ -2461,7 +2462,7 @@ static int tun_xdp_one(struct tun_struct *tun,
   stats = get_cpu_ptr(tun->pcpu_stats);
   u64_stats_update_begin(&stats->syncp);
   stats->rx_packets++;
-    stats->rx_bytes += skb->len;
+    stats->rx_bytes += datasize;
   u64_stats_update_end(&stats->syncp);
   put_cpu_ptr(stats);
   


Good catch, but you probably need to calculate the datasize after XDP
processing since it may modify the packet length.

(+CC David Ahern who may be interested in this area.)

I'd rather think we should calculate it before XDP.
I checked several drivers behavior. mlx5, bnxt and qede use hardware
counters for rx bytes, which means the size is calculated before XDP.
nfp calculates it in software but before XDP. On the other hand, Intel
drivers use skb->len. So currently bytes counters do not look
consistent, but I think calculation before XDP is more common.



Ok, I'm fine with either. Considering this fixes a real issue, we need to
let it go into -net as soon as possible. We can change the behavior of
the counter in the future.


So

Acked-by: Jason Wang 

According to patchwork status, you probably need to repost.

Thanks
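To make the constraint concrete, here is a minimal sketch of the pattern being enforced (a hypothetical helper, not code from the driver): whatever length the stats need must be captured before the skb is handed to the stack, because netif_receive_skb() may free it.

static void tun_rx_account(struct tun_pcpu_stats *stats, struct sk_buff *skb)
{
	unsigned int len = skb->len;	/* snapshot before the skb is consumed */

	netif_receive_skb(skb);		/* the skb must not be touched after this */

	u64_stats_update_begin(&stats->syncp);
	stats->rx_packets++;
	stats->rx_bytes += len;		/* safe: uses the saved length */
	u64_stats_update_end(&stats->syncp);
}

The patch above instead derives the length from the xdp_buff (xdp->data_end - xdp->data) at function entry, which also gives the "count before XDP" behaviour discussed in this thread.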






Re: [PATCH net] tun: remove skb access after netif_receive_skb

2018-11-29 Thread Jason Wang



On 2018/11/30 at 10:30 AM, Prashant Bhole wrote:

In tun.c skb->len was accessed while doing stats accounting after a
call to netif_receive_skb. We cannot access the skb after this call
because the buffer may be dropped.

The fix for this bug would be to store skb->len in a local variable and
then use it after netif_receive_skb(). IMO using the xdp data size for
accounting bytes is better because the input to tun_xdp_one() is an
xdp_buff.

Hence this patch:
- fixes a bug by removing skb access after netif_receive_skb()
- uses xdp data size for accounting bytes

[613.019057] BUG: KASAN: use-after-free in tun_sendmsg+0x77c/0xc50 [tun]
[613.021062] Read of size 4 at addr 8881da9ab7c0 by task vhost-1115/1155
[613.023073]
[613.024003] CPU: 0 PID: 1155 Comm: vhost-1115 Not tainted 4.20.0-rc3-vm+ #232
[613.026029] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 
1.10.2-1ubuntu1 04/01/2014
[613.029116] Call Trace:
[613.031145]  dump_stack+0x5b/0x90
[613.032219]  print_address_description+0x6c/0x23c
[613.034156]  ? tun_sendmsg+0x77c/0xc50 [tun]
[613.036141]  kasan_report.cold.5+0x241/0x308
[613.038125]  tun_sendmsg+0x77c/0xc50 [tun]
[613.040109]  ? tun_get_user+0x1960/0x1960 [tun]
[613.042094]  ? __isolate_free_page+0x270/0x270
[613.045173]  vhost_tx_batch.isra.14+0xeb/0x1f0 [vhost_net]
[613.047127]  ? peek_head_len.part.13+0x90/0x90 [vhost_net]
[613.049096]  ? get_tx_bufs+0x5a/0x2c0 [vhost_net]
[613.051106]  ? vhost_enable_notify+0x2d8/0x420 [vhost]
[613.053139]  handle_tx_copy+0x2d0/0x8f0 [vhost_net]
[613.053139]  ? vhost_net_buf_peek+0x340/0x340 [vhost_net]
[613.053139]  ? __mutex_lock+0x8d9/0xb30
[613.053139]  ? finish_task_switch+0x8f/0x3f0
[613.053139]  ? handle_tx+0x32/0x120 [vhost_net]
[613.053139]  ? mutex_trylock+0x110/0x110
[613.053139]  ? finish_task_switch+0xcf/0x3f0
[613.053139]  ? finish_task_switch+0x240/0x3f0
[613.053139]  ? __switch_to_asm+0x34/0x70
[613.053139]  ? __switch_to_asm+0x40/0x70
[613.053139]  ? __schedule+0x506/0xf10
[613.053139]  handle_tx+0xc7/0x120 [vhost_net]
[613.053139]  vhost_worker+0x166/0x200 [vhost]
[613.053139]  ? vhost_dev_init+0x580/0x580 [vhost]
[613.053139]  ? __kthread_parkme+0x77/0x90
[613.053139]  ? vhost_dev_init+0x580/0x580 [vhost]
[613.053139]  kthread+0x1b1/0x1d0
[613.053139]  ? kthread_park+0xb0/0xb0
[613.053139]  ret_from_fork+0x35/0x40
[613.088705]
[613.088705] Allocated by task 1155:
[613.088705]  kasan_kmalloc+0xbf/0xe0
[613.088705]  kmem_cache_alloc+0xdc/0x220
[613.088705]  __build_skb+0x2a/0x160
[613.088705]  build_skb+0x14/0xc0
[613.088705]  tun_sendmsg+0x4f0/0xc50 [tun]
[613.088705]  vhost_tx_batch.isra.14+0xeb/0x1f0 [vhost_net]
[613.088705]  handle_tx_copy+0x2d0/0x8f0 [vhost_net]
[613.088705]  handle_tx+0xc7/0x120 [vhost_net]
[613.088705]  vhost_worker+0x166/0x200 [vhost]
[613.088705]  kthread+0x1b1/0x1d0
[613.088705]  ret_from_fork+0x35/0x40
[613.088705]
[613.088705] Freed by task 1155:
[613.088705]  __kasan_slab_free+0x12e/0x180
[613.088705]  kmem_cache_free+0xa0/0x230
[613.088705]  ip6_mc_input+0x40f/0x5a0
[613.088705]  ipv6_rcv+0xc9/0x1e0
[613.088705]  __netif_receive_skb_one_core+0xc1/0x100
[613.088705]  netif_receive_skb_internal+0xc4/0x270
[613.088705]  br_pass_frame_up+0x2b9/0x2e0
[613.088705]  br_handle_frame_finish+0x2fb/0x7a0
[613.088705]  br_handle_frame+0x30f/0x6c0
[613.088705]  __netif_receive_skb_core+0x61a/0x15b0
[613.088705]  __netif_receive_skb_one_core+0x8e/0x100
[613.088705]  netif_receive_skb_internal+0xc4/0x270
[613.088705]  tun_sendmsg+0x738/0xc50 [tun]
[613.088705]  vhost_tx_batch.isra.14+0xeb/0x1f0 [vhost_net]
[613.088705]  handle_tx_copy+0x2d0/0x8f0 [vhost_net]
[613.088705]  handle_tx+0xc7/0x120 [vhost_net]
[613.088705]  vhost_worker+0x166/0x200 [vhost]
[613.088705]  kthread+0x1b1/0x1d0
[613.088705]  ret_from_fork+0x35/0x40
[613.088705]
[613.088705] The buggy address belongs to the object at 8881da9ab740
[613.088705]  which belongs to the cache skbuff_head_cache of size 232

Fixes: 043d222f93ab ("tuntap: accept an array of XDP buffs through sendmsg()")
Reviewed-by: Toshiaki Makita 
Signed-off-by: Prashant Bhole 
---
  drivers/net/tun.c | 3 ++-
  1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/drivers/net/tun.c b/drivers/net/tun.c
index e244f5d7512a..6e388792c0a8 100644
--- a/drivers/net/tun.c
+++ b/drivers/net/tun.c
@@ -2385,6 +2385,7 @@ static int tun_xdp_one(struct tun_struct *tun,
   struct tun_file *tfile,
   struct xdp_buff *xdp, int *flush)
  {
+   unsigned int datasize = xdp->data_end - xdp->data;
struct tun_xdp_hdr *hdr = xdp->data_hard_start;
	struct virtio_net_hdr *gso = &hdr->gso;
struct tun_pcpu_stats *stats;
@@ -2461,7 +2462,7 @@ static int tun_xdp_one(struct tun_struct *tun,
stats = get_cpu_ptr(tun->pcpu_stats);
	u64_stats_update_begin(&stats->syncp);
stats->rx_packets++;
-   stats->rx_bytes += skb->len;
+   stats->rx_bytes += datasize;
	u64_stats_update_end(&stats->syncp);
put_cpu_ptr(stats);
  




Re: consistency for statistics with XDP mode

2018-11-27 Thread Jason Wang



On 2018/11/27 at 3:04 PM, Toshiaki Makita wrote:

On 2018/11/26 10:37, Toshiaki Makita wrote:

On 2018/11/23 1:43, David Ahern wrote:

On 11/21/18 5:53 PM, Toshiaki Makita wrote:

We really need consistency in the counters and at a minimum, users
should be able to track packet and byte counters for both Rx and Tx
including XDP.

It seems to me the Rx and Tx packet, byte and dropped counters returned
for the standard device stats (/proc/net/dev, ip -s li show, ...) should
include all packets managed by the driver regardless of whether they are
forwarded / dropped in XDP or go up the Linux stack. This also aligns

Agreed. When I introduced virtio_net XDP counters, I just forgot to
update tx packets/bytes counters on ndo_xdp_xmit. Probably I thought it
is handled by free_old_xmit_skbs.

Do you have some time to look at adding the Tx counters to virtio_net?

hoping I can make some time within a couple of days.

Hmm... It looks like free_old_xmit_skbs() calls dev_consume_skb_any()
for xdp_frame when napi_tx is enabled. I will fix this beforehand.



Good catch. But the fix may require some thought. E.g. one idea is to not
call free_old_xmit_skbs() for the XDP TX ring?


Thanks




Re: [PATCH][net-next] vhost:net: allocate 32KB memory instead of 32K pages when page frag refill

2018-11-22 Thread Jason Wang



On 2018/11/23 at 10:04 AM, Li RongQing wrote:

When the page frag refills, 32K pages (128MB of memory) are requested;
this hardly ever succeeds when the system is under memory stress.



Looking at get_order(), it seems we get 3 from get_order(32768), since it
takes the size of the block in bytes.


/**
 * get_order - Determine the allocation order of a memory size
 * @size: The size for which to get the order
...

#define get_order(n)    \
(   \
    __builtin_constant_p(n) ? ( \
    ((n) == 0UL) ? BITS_PER_LONG - PAGE_SHIFT : \
    (((n) < (1UL << PAGE_SHIFT)) ? 0 :  \
 ilog2((n) - 1) - PAGE_SHIFT + 1)   \

  ^^^

    ) : \
    __get_order(n)  \
)
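A quick userspace re-check of that order math (a sketch; it assumes 4KB pages, i.e. PAGE_SHIFT == 12, which is not stated in the thread):

#include <stdio.h>

#define PAGE_SHIFT 12	/* assumed 4KB pages */

/* mirrors the kernel's __get_order() for a byte size */
static int get_order(unsigned long size)
{
	int order = 0;

	size = (size - 1) >> PAGE_SHIFT;
	while (size) {
		order++;
		size >>= 1;
	}
	return order;
}

int main(void)
{
	int order = get_order(32768);

	/* prints: order=3 pages=8 bytes=32768 -> a 32KB refill */
	printf("order=%d pages=%d bytes=%lu\n",
	       order, 1 << order, (unsigned long)(1 << order) << PAGE_SHIFT);
	return 0;
}

So get_order(32768) already yields an order-3 (8 page, 32KB) allocation; the 128MB figure in the commit message corresponds to allocating 32768 pages, which is not what get_order(32768) requests.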




Such a large memory size will also cause the reference bias to underflow
and corrupt the page refcount, since the reference bias will be
decremented to a negative value before the allocated memory is used up.



Do you have a reproducer for this issue?

Thanks




So 32KB of memory is a safe choice; meanwhile, remove an unnecessary
check.

Fixes: e4dab1e6ea64 ("vhost_net: mitigate page reference counting during page frag refill")
Signed-off-by: Zhang Yu 
Signed-off-by: Li RongQing 
---
  drivers/vhost/net.c | 22 +++---
  1 file changed, 11 insertions(+), 11 deletions(-)

diff --git a/drivers/vhost/net.c b/drivers/vhost/net.c
index d919284f103b..b933a4a8e4ba 100644
--- a/drivers/vhost/net.c
+++ b/drivers/vhost/net.c
@@ -641,7 +641,7 @@ static bool tx_can_batch(struct vhost_virtqueue *vq, size_t total_len)
   !vhost_vq_avail_empty(vq->dev, vq);
  }
  
-#define SKB_FRAG_PAGE_ORDER get_order(32768)

+#define SKB_FRAG_PAGE_ORDER	3
  
  static bool vhost_net_page_frag_refill(struct vhost_net *net, unsigned int sz,

   struct page_frag *pfrag, gfp_t gfp)
@@ -654,17 +654,17 @@ static bool vhost_net_page_frag_refill(struct vhost_net *net, unsigned int sz,
  
  	pfrag->offset = 0;

net->refcnt_bias = 0;
-   if (SKB_FRAG_PAGE_ORDER) {
-   /* Avoid direct reclaim but allow kswapd to wake */
-   pfrag->page = alloc_pages((gfp & ~__GFP_DIRECT_RECLAIM) |
- __GFP_COMP | __GFP_NOWARN |
- __GFP_NORETRY,
- SKB_FRAG_PAGE_ORDER);
-   if (likely(pfrag->page)) {
-   pfrag->size = PAGE_SIZE << SKB_FRAG_PAGE_ORDER;
-   goto done;
-   }
+
+   /* Avoid direct reclaim but allow kswapd to wake */
+   pfrag->page = alloc_pages((gfp & ~__GFP_DIRECT_RECLAIM) |
+ __GFP_COMP | __GFP_NOWARN |
+ __GFP_NORETRY,
+ SKB_FRAG_PAGE_ORDER);
+   if (likely(pfrag->page)) {
+   pfrag->size = PAGE_SIZE << SKB_FRAG_PAGE_ORDER;
+   goto done;
}
+
pfrag->page = alloc_page(gfp);
if (likely(pfrag->page)) {
pfrag->size = PAGE_SIZE;


Re: [PATCH] [PATCH net-next] tun: fix multiqueue rx

2018-11-15 Thread Jason Wang



On 2018/11/16 at 3:00 PM, Matthew Cover wrote:

When writing packets to a descriptor associated with a combined queue, the
packets should end up on that queue.

Before this change all packets written to any descriptor associated with a
tap interface end up on rx-0, even when the descriptor is associated with a
different queue.

The rx traffic can be generated by either of the following.
   1. a simple tap program which spins up multiple queues and writes packets
  to each of the file descriptors
   2. tx from a qemu vm with a tap multiqueue netdev

The queue for rx traffic can be observed by either of the following (done
on the hypervisor in the qemu case).
   1. a simple netmap program which opens and reads from per-queue
  descriptors
   2. configuring RPS and doing per-cpu captures with rxtxcpu

Alternatively, if you printk() the return value of skb_get_rx_queue() just
before each instance of netif_receive_skb() in tun.c, you will get 65535
for every skb.

Calling skb_record_rx_queue() to set the rx queue to the queue_index fixes
the association between descriptor and rx queue.

Signed-off-by: Matthew Cover 
---
  drivers/net/tun.c | 7 ++-
  1 file changed, 6 insertions(+), 1 deletion(-)

diff --git a/drivers/net/tun.c b/drivers/net/tun.c
index a65779c6d72f..ce8620f3ea5e 100644
--- a/drivers/net/tun.c
+++ b/drivers/net/tun.c
@@ -1536,6 +1536,7 @@ static void tun_rx_batched(struct tun_struct *tun, struct tun_file *tfile,
  
  	if (!rx_batched || (!more && skb_queue_empty(queue))) {

local_bh_disable();
+   skb_record_rx_queue(skb, tfile->queue_index);
netif_receive_skb(skb);
local_bh_enable();
return;
@@ -1555,8 +1556,11 @@ static void tun_rx_batched(struct tun_struct *tun, struct tun_file *tfile,
struct sk_buff *nskb;
  
  		local_bh_disable();

-   while ((nskb = __skb_dequeue(&process_queue)))
+   while ((nskb = __skb_dequeue(&process_queue))) {
+   skb_record_rx_queue(nskb, tfile->queue_index);
netif_receive_skb(nskb);
+   }
+   skb_record_rx_queue(skb, tfile->queue_index);
netif_receive_skb(skb);
local_bh_enable();
}
@@ -2452,6 +2456,7 @@ static int tun_xdp_one(struct tun_struct *tun,
!tfile->detached)
rxhash = __skb_get_hash_symmetric(skb);
  
+	skb_record_rx_queue(skb, tfile->queue_index);

netif_receive_skb(skb);
  
  	stats = get_cpu_ptr(tun->pcpu_stats);



Acked-by: Jason Wang 




Re: [PATCH] [PATCH net-next] tun: fix multiqueue rx

2018-11-15 Thread Jason Wang



On 2018/11/16 at 12:10 PM, Matthew Cover wrote:

When writing packets to a descriptor associated with a combined queue, the
packets should end up on that queue.

Before this change all packets written to any descriptor associated with a
tap interface end up on rx-0, even when the descriptor is associated with a
different queue.

The rx traffic can be generated by either of the following.
   1. a simple tap program which spins up multiple queues and writes packets
  to each of the file descriptors
   2. tx from a qemu vm with a tap multiqueue netdev

The queue for rx traffic can be observed by either of the following (done
on the hypervisor in the qemu case).
   1. a simple netmap program which opens and reads from per-queue
  descriptors
   2. configuring RPS and doing per-cpu captures with rxtxcpu

Alternatively, if you printk() the return value of skb_get_rx_queue() just
before each instance of netif_receive_skb() in tun.c, you will get 65535
for every skb.

Calling skb_record_rx_queue() to set the rx queue to the queue_index fixes
the association between descriptor and rx queue.

Signed-off-by: Matthew Cover 
---
  drivers/net/tun.c | 6 +-
  1 file changed, 5 insertions(+), 1 deletion(-)

diff --git a/drivers/net/tun.c b/drivers/net/tun.c
index a65779c6d72f..4e306ff3501c 100644
--- a/drivers/net/tun.c
+++ b/drivers/net/tun.c
@@ -1536,6 +1536,7 @@ static void tun_rx_batched(struct tun_struct *tun, struct tun_file *tfile,
  
  	if (!rx_batched || (!more && skb_queue_empty(queue))) {

local_bh_disable();
+   skb_record_rx_queue(skb, tfile->queue_index);
netif_receive_skb(skb);
local_bh_enable();
return;
@@ -1555,8 +1556,11 @@ static void tun_rx_batched(struct tun_struct *tun, struct tun_file *tfile,
struct sk_buff *nskb;
  
  		local_bh_disable();

-   while ((nskb = __skb_dequeue(&process_queue)))
+   while ((nskb = __skb_dequeue(&process_queue))) {
+   skb_record_rx_queue(nskb, tfile->queue_index);
netif_receive_skb(nskb);
+   }
+   skb_record_rx_queue(skb, tfile->queue_index);
netif_receive_skb(skb);
local_bh_enable();
}



Thanks for the fix. Actually, there's another path which needs to be
fixed as well in tun_xdp_one(). This path is used by vhost to pass a
batch of packets.
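For concreteness, the batched vhost path would get the same one-line treatment just before its netif_receive_skb() call; this matches the tun_xdp_one() hunk in the revised patch quoted earlier in this digest:

	skb_record_rx_queue(skb, tfile->queue_index);
	netif_receive_skb(skb);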




Re: [PATCH net-next] tun: compute the RFS hash only if needed.

2018-11-07 Thread Jason Wang



On 2018/11/7 at 5:34 PM, Paolo Abeni wrote:

The tun XDP sendmsg code path unconditionally computes the symmetric
hash of each packet for RFS's sake, even when we could skip it, e.g.
when the device has a single queue.

This change adds the check already in place for the skb sendmsg path
to avoid unneeded hashing.

The above gives a small, but measurable, performance gain for the VM
xmit path when zerocopy is not enabled.

Signed-off-by: Paolo Abeni 
---
  drivers/net/tun.c | 3 ++-
  1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/drivers/net/tun.c b/drivers/net/tun.c
index 060135ceaf0e..a65779c6d72f 100644
--- a/drivers/net/tun.c
+++ b/drivers/net/tun.c
@@ -2448,7 +2448,8 @@ static int tun_xdp_one(struct tun_struct *tun,
goto out;
}
  
-	if (!rcu_dereference(tun->steering_prog))

+   if (!rcu_dereference(tun->steering_prog) && tun->numqueues > 1 &&
+   !tfile->detached)
rxhash = __skb_get_hash_symmetric(skb);
  
  	netif_receive_skb(skb);



Acked-by: Jason Wang 

Thanks



Re: [PATCH] net/packet: support vhost mrg_rxbuf

2018-10-28 Thread Jason Wang



On 2018/10/27 at 8:04 PM, Jianfeng Tan wrote:

Previously, the virtio net header size was hardcoded to 10, which makes
the mrg_rxbuf feature unavailable.

We redefine the PACKET_VNET_HDR socket option, which used to treat the
user input as a boolean, to take an int: 12 selects a 12-byte header,
while 0, 10, or anything else is treated as the legacy 10-byte header.

There is one case which is treated differently: if the user input is
12, the header size used to be 10, but now it is 12.

Signed-off-by: Jianfeng Tan 



This should go to net-next, which is currently closed. You may consider
re-submitting when it opens again.
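A hypothetical userspace sketch of the proposed semantics (PACKET_VNET_HDR itself exists today as a boolean option; the int-valued header-length behaviour below is only what the commit message above proposes, and the helper name is made up):

#include <sys/socket.h>
#include <linux/if_packet.h>

/* hdr_len: 10 keeps struct virtio_net_hdr, 12 selects the
 * virtio_net_hdr_mrg_rxbuf-sized header this patch proposes */
static int packet_set_vnet_hdr_len(int fd, int hdr_len)
{
	return setsockopt(fd, SOL_PACKET, PACKET_VNET_HDR,
			  &hdr_len, sizeof(hdr_len));
}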




---
  net/packet/af_packet.c | 97 ++
  net/packet/diag.c  |  2 +-
  net/packet/internal.h  |  2 +-
  3 files changed, 63 insertions(+), 38 deletions(-)

diff --git a/net/packet/af_packet.c b/net/packet/af_packet.c
index ec3095f13aae..1bd7f4cdcc80 100644
--- a/net/packet/af_packet.c
+++ b/net/packet/af_packet.c
@@ -1999,18 +1999,24 @@ static unsigned int run_filter(struct sk_buff *skb,
  }
  
  static int packet_rcv_vnet(struct msghdr *msg, const struct sk_buff *skb,

-  size_t *len)
+  size_t *len, int vnet_hdr_len)
  {
+   int res;
struct virtio_net_hdr vnet_hdr;
  
-	if (*len < sizeof(vnet_hdr))

+   if (*len < vnet_hdr_len)
return -EINVAL;
-   *len -= sizeof(vnet_hdr);
+   *len -= vnet_hdr_len;
  
  	if (virtio_net_hdr_from_skb(skb, &vnet_hdr, vio_le(), true, 0))

return -EINVAL;
  
-	return memcpy_to_msg(msg, (void *)&vnet_hdr, sizeof(vnet_hdr));

+   res = memcpy_to_msg(msg, (void *)&vnet_hdr, sizeof(vnet_hdr));
+   if (res == 0)
+   iov_iter_advance(&msg->msg_iter,
+vnet_hdr_len - sizeof(vnet_hdr));
+
+   return res;
  }
  
  /*

@@ -2206,11 +2212,13 @@ static int tpacket_rcv(struct sk_buff *skb, struct net_device *dev,
  po->tp_reserve;
} else {
unsigned int maclen = skb_network_offset(skb);
+   int vnet_hdr_sz = READ_ONCE(po->vnet_hdr_sz);
+
netoff = TPACKET_ALIGN(po->tp_hdrlen +
   (maclen < 16 ? 16 : maclen)) +
   po->tp_reserve;
-   if (po->has_vnet_hdr) {
-   netoff += sizeof(struct virtio_net_hdr);
+   if (vnet_hdr_sz) {
+   netoff += vnet_hdr_sz;
do_vnet = true;
}
macoff = netoff - maclen;
@@ -2429,19 +2437,6 @@ static int __packet_snd_vnet_parse(struct virtio_net_hdr *vnet_hdr, size_t len)
return 0;
  }
  
-static int packet_snd_vnet_parse(struct msghdr *msg, size_t *len,

-struct virtio_net_hdr *vnet_hdr)
-{
-   if (*len < sizeof(*vnet_hdr))
-   return -EINVAL;
-   *len -= sizeof(*vnet_hdr);
-
-   if (!copy_from_iter_full(vnet_hdr, sizeof(*vnet_hdr), &msg->msg_iter))
-   return -EFAULT;
-
-   return __packet_snd_vnet_parse(vnet_hdr, *len);
-}
-
  static int tpacket_fill_skb(struct packet_sock *po, struct sk_buff *skb,
void *frame, struct net_device *dev, void *data, int tp_len,
__be16 proto, unsigned char *addr, int hlen, int copylen,
@@ -2609,6 +2604,7 @@ static int tpacket_snd(struct packet_sock *po, struct msghdr *msg)
int len_sum = 0;
int status = TP_STATUS_AVAILABLE;
int hlen, tlen, copylen = 0;
+   int vnet_hdr_sz;
  
  	mutex_lock(&po->pg_vec_lock);
  
@@ -2648,7 +2644,8 @@ static int tpacket_snd(struct packet_sock *po, struct msghdr *msg)

size_max = po->tx_ring.frame_size
- (po->tp_hdrlen - sizeof(struct sockaddr_ll));
  
-	if ((size_max > dev->mtu + reserve + VLAN_HLEN) && !po->has_vnet_hdr)

+   vnet_hdr_sz = READ_ONCE(po->vnet_hdr_sz);
+   if ((size_max > dev->mtu + reserve + VLAN_HLEN) && !vnet_hdr_sz)
size_max = dev->mtu + reserve + VLAN_HLEN;
  
  	do {

@@ -2668,10 +2665,10 @@ static int tpacket_snd(struct packet_sock *po, struct msghdr *msg)
status = TP_STATUS_SEND_REQUEST;
hlen = LL_RESERVED_SPACE(dev);
tlen = dev->needed_tailroom;
-   if (po->has_vnet_hdr) {
+   if (vnet_hdr_sz) {
vnet_hdr = data;
-   data += sizeof(*vnet_hdr);
-   tp_len -= sizeof(*vnet_hdr);
+   data += vnet_hdr_sz;
+   tp_len -= vnet_hdr_sz;
if (tp_len < 0 ||
__packet_snd_vnet_parse(vnet_hdr, tp_len)) {
tp_len = -EINVAL;
@@ -2696,7 +2693,7 @@ static int tpacket_snd(struct packet_sock *po, struct msghdr *msg)
  addr, hlen, copylen, &sockc);
if (likely(tp_len >= 0) &&
tp_len > dev->mtu + reserve &&
-  

Re: [PATCH] net/packet: fix packet drop as of virtio gso

2018-10-28 Thread Jason Wang



On 2018/10/28 at 7:42 AM, Jianfeng Tan wrote:


On 10/8/2018 11:14 AM, Jason Wang wrote:



On 2018/09/29 23:41, Jianfeng Tan wrote:

When we use a raw socket as the vhost backend, a packet from virtio with
GSO offloading information cannot be sent out, failing later validation
in the xmit path, because we did not set the correct skb->protocol,
which is further used for looking up the GSO function.


Hi:

May I ask the reason for using a raw socket for vhost? It has not been a
common setup and has received little attention in the past few years. And
it is slow since it lacks some recent improvements. Can it be replaced
with e.g. macvtap?


Hi Jason,

Apologies for the late response. We are in a container environment, in
which veth is mostly used. Neither tap nor macvtap can be put into an
isolated netns.



I think it can? See 17af2bce88d31e65ed73d638bb752d2e13c66ced.


Another option could be macvlan as the backend of vhost, which is not
supported either. So unfortunately, improving the raw socket path is the
only choice, I suppose.



Btw, you can set up macvtap on top of veth. Does this help?

Thanks




Thanks,
Jianfeng


Re: [PATCH] virtio_net: enable tx after resuming from suspend

2018-10-11 Thread Jason Wang




On 2018/10/11 18:22, ake wrote:


On 2018/10/11 18:44, Jason Wang wrote:


On 2018/10/11 15:51, Ake Koomsin wrote:

commit 713a98d90c5e ("virtio-net: serialize tx routine during reset")
disabled the virtio tx before going to suspend to avoid a use after free.
However, after resuming, it causes the virtio_net device to lose its
network connectivity.

To solve the issue, we need to enable tx after resuming.

Fixes commit 713a98d90c5e ("virtio-net: serialize tx routine during
reset")
Signed-off-by: Ake Koomsin 
---
   drivers/net/virtio_net.c | 1 +
   1 file changed, 1 insertion(+)

diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
index dab504ec5e50..3453d80f5f81 100644
--- a/drivers/net/virtio_net.c
+++ b/drivers/net/virtio_net.c
@@ -2256,6 +2256,7 @@ static int virtnet_restore_up(struct virtio_device *vdev)
   }
     netif_device_attach(vi->dev);
+    netif_start_queue(vi->dev);

I believe this is duplicated with netif_tx_wake_all_queues() in
netif_device_attach() above?

Thank you for your review.

If both netif_tx_wake_all_queues() and netif_start_queue() result in
clearing __QUEUE_STATE_DRV_XOFF, then is it possible that some
condition in netif_device_attach() is not satisfied?


Yes, maybe. One case I can see now is when the device is down; in this
case netif_device_attach() won't try to wake up the queue.



  Without
netif_start_queue(), the virtio_net device does not resume properly
after waking up.


How do you trigger the issue? Just do suspend/resume?



Is it better to report this as a bug first?


Nope, you're very welcome to post patch directly.


If I am to do more
investigation, what areas should I look into?


As you've figured out, you can start with why netif_tx_wake_all_queues()
was not executed?


(Btw, does the issue disappear if you move netif_tx_disable() under the 
check of netif_running() in virtnet_freeze_down()?)


Thanks
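A sketch of that parenthetical suggestion, with netif_tx_disable() moved under the netif_running() check; the surrounding function body is paraphrased from the 4.19-era driver and may not match the exact source:

static void virtnet_freeze_down(struct virtio_device *vdev)
{
	struct virtnet_info *vi = vdev->priv;
	int i;

	/* Make sure no work handler is accessing the device */
	flush_work(&vi->config_work);

	netif_device_detach(vi->dev);
	cancel_delayed_work_sync(&vi->refill);

	if (netif_running(vi->dev)) {
		netif_tx_disable(vi->dev);	/* moved inside the check */
		for (i = 0; i < vi->max_queue_pairs; i++) {
			napi_disable(&vi->rq[i].napi);
			virtnet_napi_tx_disable(&vi->sq[i].napi);
		}
	}
}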



Best Regards
Ake Koomsin





Re: [PATCH net-next V3] virtio_net: ethtool tx napi configuration

2018-10-11 Thread Jason Wang




On 2018/10/11 13:34, David Miller wrote:

From: Jason Wang 
Date: Tue,  9 Oct 2018 10:06:26 +0800


Implement ethtool .set_coalesce (-C) and .get_coalesce (-c) handlers.
Interrupt moderation is currently not supported, so these accept and
display the default settings of 0 usec and 1 frame.

Toggle tx napi through setting tx-frames. So as to not interfere
with possible future interrupt moderation, value 1 means tx napi while
value 0 means not.

Only allow the switching when device is down for simplicity.

Link: https://patchwork.ozlabs.org/patch/948149/
Suggested-by: Jason Wang 
Signed-off-by: Willem de Bruijn 
Signed-off-by: Jason Wang 
---
Changes from V2:
- only allow the switching when device is down
- remove unnecessary global variable and initialization
Changes from V1:
- try to synchronize with datapath to allow changing mode when
   interface is up.
- use tx-frames 0 as to disable tx napi while tx-frames 1 to enable tx napi

Applied, with...


+   bool running = netif_running(dev);

this unused variable removed.


My bad, thanks for the fixup.


Re: [PATCH] virtio_net: enable tx after resuming from suspend

2018-10-11 Thread Jason Wang




On 2018/10/11 15:51, Ake Koomsin wrote:

commit 713a98d90c5e ("virtio-net: serialize tx routine during reset")
disabled the virtio tx before going to suspend to avoid a use after free.
However, after resuming, it causes the virtio_net device to lose its
network connectivity.

To solve the issue, we need to enable tx after resuming.

Fixes commit 713a98d90c5e ("virtio-net: serialize tx routine during reset")
Signed-off-by: Ake Koomsin 
---
  drivers/net/virtio_net.c | 1 +
  1 file changed, 1 insertion(+)

diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
index dab504ec5e50..3453d80f5f81 100644
--- a/drivers/net/virtio_net.c
+++ b/drivers/net/virtio_net.c
@@ -2256,6 +2256,7 @@ static int virtnet_restore_up(struct virtio_device *vdev)
}
  
  	netif_device_attach(vi->dev);

+   netif_start_queue(vi->dev);


I believe this is duplicated with netif_tx_wake_all_queues() in 
netif_device_attach() above?


Thanks


return err;
  }
  




Re: [PATCH] net/packet: fix packet drop as of virtio gso

2018-10-07 Thread Jason Wang




On 2018/09/29 23:41, Jianfeng Tan wrote:

When we use a raw socket as the vhost backend, a packet from virtio with
GSO offloading information cannot be sent out, failing later validation
in the xmit path, because we did not set the correct skb->protocol,
which is further used for looking up the GSO function.


Hi:

May I ask the reason for using a raw socket for vhost? It has not been a
common setup and has received little attention in the past few years. And
it is slow since it lacks some recent improvements. Can it be replaced
with e.g. macvtap?


Thanks



To fix this, we set this field according to the virtio header information.

Fixes: e858fae2b0b8f4 ("virtio_net: use common code for virtio_net_hdr and skb GSO conversion")

Cc: sta...@vger.kernel.org
Signed-off-by: Jianfeng Tan 
---
  include/linux/virtio_net.h | 18 ++
  net/packet/af_packet.c | 11 +++
  2 files changed, 25 insertions(+), 4 deletions(-)

diff --git a/include/linux/virtio_net.h b/include/linux/virtio_net.h
index 9397628a1967..cb462f9ab7dd 100644
--- a/include/linux/virtio_net.h
+++ b/include/linux/virtio_net.h
@@ -5,6 +5,24 @@
  #include <linux/if_vlan.h>
  #include <uapi/linux/virtio_net.h>
  
+static inline int virtio_net_hdr_set_proto(struct sk_buff *skb,

+  const struct virtio_net_hdr *hdr)
+{
+   switch (hdr->gso_type & ~VIRTIO_NET_HDR_GSO_ECN) {
+   case VIRTIO_NET_HDR_GSO_TCPV4:
+   case VIRTIO_NET_HDR_GSO_UDP:
+   skb->protocol = cpu_to_be16(ETH_P_IP);
+   break;
+   case VIRTIO_NET_HDR_GSO_TCPV6:
+   skb->protocol = cpu_to_be16(ETH_P_IPV6);
+   break;
+   default:
+   return -EINVAL;
+   }
+
+   return 0;
+}
+
  static inline int virtio_net_hdr_to_skb(struct sk_buff *skb,
const struct virtio_net_hdr *hdr,
bool little_endian)
diff --git a/net/packet/af_packet.c b/net/packet/af_packet.c
index 75c92a87e7b2..d6e94dc7e290 100644
--- a/net/packet/af_packet.c
+++ b/net/packet/af_packet.c
@@ -2715,10 +2715,12 @@ static int tpacket_snd(struct packet_sock *po, struct msghdr *msg)
}
}
  
-		if (po->has_vnet_hdr && virtio_net_hdr_to_skb(skb, vnet_hdr,

- vio_le())) {
-   tp_len = -EINVAL;
-   goto tpacket_error;
+   if (po->has_vnet_hdr) {
+   if (virtio_net_hdr_to_skb(skb, vnet_hdr, vio_le())) {
+   tp_len = -EINVAL;
+   goto tpacket_error;
+   }
+   virtio_net_hdr_set_proto(skb, vnet_hdr);
}
  
  		skb->destructor = tpacket_destruct_skb;

@@ -2915,6 +2917,7 @@ static int packet_snd(struct socket *sock, struct msghdr *msg, size_t len)
if (err)
goto out_free;
len += sizeof(vnet_hdr);
+   virtio_net_hdr_set_proto(skb, &vnet_hdr);
}
  
  	skb_probe_transport_header(skb, reserve);




Re: [PATCH net-next] virtio_net: ethtool tx napi configuration

2018-09-27 Thread Jason Wang




On 2018/09/27 21:53, Willem de Bruijn wrote:

On Thu, Sep 27, 2018 at 4:51 AM Jason Wang  wrote:



On 2018/09/14 12:46, Willem de Bruijn wrote:

I'm not sure I get this. If we don't enable tx napi, we tend to delay the
TX interrupt if we find the ring is about to be full, to avoid an
interrupt storm, so we're probably ok in this case.

I'm only concerned about the transition state when converting from
napi to no-napi when the queue is stopped and tx interrupt disabled.

With napi mode the interrupt is only disabled if napi is scheduled,
in which case it will eventually reenable the interrupt. But when
switching to no-napi mode in this state no progress will be made.

But it seems this cannot happen. When converting to no-napi
mode, set_coalesce waits for napi to complete in napi_disable.
So the interrupt should always start enabled when transitioning
into no-napi mode.

An update: I hit a hang in napi_disable(). But it's hard to reproduce. I
tend to choose an easy way like V1 that only allows the switching when
the device is down.

I agree.


I will post the patch after a vacation. (Or you can post it if it is
urgent for you.)

If you have time to review and add your signed-off-by, I can post it.
It's a pretty small diff at this point.

But no rush, we can also wait until after your vacation.


Then let me post it after the vacation.



I also need to look at a patch to toggle LRO using ethtool, btw.


Interesting, we've already done something similar for XDP. The
GUEST_TSO_XXX part may need some private flags, I believe.


Thanks


Re: [PATCH net V2] vhost-vsock: fix use after free

2018-09-27 Thread Jason Wang




On 2018/09/28 01:04, Michael S. Tsirkin wrote:

On Thu, Sep 27, 2018 at 08:22:04PM +0800, Jason Wang wrote:

The access of vsock is not protected by vhost_vsock_lock. This may
lead to use after free since vhost_vsock_dev_release() may free the
pointer at the same time.

Fix this by holding the lock during the access.

Reported-by: syzbot+e3e074963495f92a8...@syzkaller.appspotmail.com
Fixes: 16320f363ae1 ("vhost-vsock: add pkt cancel capability")
Fixes: 433fc58e6bf2 ("VSOCK: Introduce vhost_vsock.ko")
Cc: Stefan Hajnoczi
Signed-off-by: Jason Wang

Wow is that really the best we can do?


For net/stable, probably yes.


  A global lock on a data path
operation?


It's already there, and the patch only increases the critical section.


  Granted use after free is nasty but Stefan said he sees
a way to fix it using a per socket refcount. He's on vacation
until Oct 4 though ...



Stefan has acked the patch, so I think it's ok? We can do the
optimization for -next on top.


Thanks


Re: [PATCH net] vhost-vsock: fix use after free

2018-09-27 Thread Jason Wang




On 2018/09/27 17:52, Sergei Shtylyov wrote:

Hello!

On 9/27/2018 11:43 AM, Jason Wang wrote:

   Just a couple of typos...


The access of vsock is not protected by vhost_vsock_lock. This may
lead use after free since vhost_vsock_dev_release() may free the


  Lead to use.


pointer at the same time.

Fix this by holding the lock during the acess.


   Access.


Reported-by: syzbot+e3e074963495f92a8...@syzkaller.appspotmail.com
Fixes: 16320f363ae1 ("vhost-vsock: add pkt cancel capability")
Fixes: 433fc58e6bf2 ("VSOCK: Introduce vhost_vsock.ko")
Cc: Stefan Hajnoczi 
Signed-off-by: Jason Wang 

[...]

MBR, Sergei


Let me post V2.

Thanks



[PATCH net V2] vhost-vsock: fix use after free

2018-09-27 Thread Jason Wang
The access of vsock is not protected by vhost_vsock_lock. This may
lead to use after free since vhost_vsock_dev_release() may free the
pointer at the same time.

Fix this by holding the lock during the access.

Reported-by: syzbot+e3e074963495f92a8...@syzkaller.appspotmail.com
Fixes: 16320f363ae1 ("vhost-vsock: add pkt cancel capability")
Fixes: 433fc58e6bf2 ("VSOCK: Introduce vhost_vsock.ko")
Cc: Stefan Hajnoczi 
Signed-off-by: Jason Wang 
---
- V2: fix typos
- The patch is needed for -stable.
---
 drivers/vhost/vsock.c | 26 +++---
 1 file changed, 19 insertions(+), 7 deletions(-)

diff --git a/drivers/vhost/vsock.c b/drivers/vhost/vsock.c
index 34bc3ab40c6d..7d0b292867fd 100644
--- a/drivers/vhost/vsock.c
+++ b/drivers/vhost/vsock.c
@@ -210,21 +210,27 @@ vhost_transport_send_pkt(struct virtio_vsock_pkt *pkt)
struct vhost_vsock *vsock;
int len = pkt->len;
 
+   spin_lock_bh(&vhost_vsock_lock);
+
/* Find the vhost_vsock according to guest context id  */
-   vsock = vhost_vsock_get(le64_to_cpu(pkt->hdr.dst_cid));
+   vsock = __vhost_vsock_get(le64_to_cpu(pkt->hdr.dst_cid));
if (!vsock) {
virtio_transport_free_pkt(pkt);
+   spin_unlock_bh(&vhost_vsock_lock);
return -ENODEV;
}
 
if (pkt->reply)
		atomic_inc(&vsock->queued_replies);
 
-   spin_lock_bh(&vsock->send_pkt_list_lock);
+   spin_lock(&vsock->send_pkt_list_lock);
	list_add_tail(&pkt->list, &vsock->send_pkt_list);
-   spin_unlock_bh(&vsock->send_pkt_list_lock);
+   spin_unlock(&vsock->send_pkt_list_lock);
 
	vhost_work_queue(&vsock->dev, &vsock->send_pkt_work);
+
+   spin_unlock_bh(&vhost_vsock_lock);
+
return len;
 }
 
@@ -236,18 +242,22 @@ vhost_transport_cancel_pkt(struct vsock_sock *vsk)
int cnt = 0;
LIST_HEAD(freeme);
 
+   spin_lock_bh(&vhost_vsock_lock);
+
/* Find the vhost_vsock according to guest context id  */
-   vsock = vhost_vsock_get(vsk->remote_addr.svm_cid);
-   if (!vsock)
+   vsock = __vhost_vsock_get(vsk->remote_addr.svm_cid);
+   if (!vsock) {
+   spin_unlock_bh(&vhost_vsock_lock);
return -ENODEV;
+   }
 
-   spin_lock_bh(&vsock->send_pkt_list_lock);
+   spin_lock(&vsock->send_pkt_list_lock);
	list_for_each_entry_safe(pkt, n, &vsock->send_pkt_list, list) {
		if (pkt->vsk != vsk)
			continue;
		list_move(&pkt->list, &freeme);
	}
-   spin_unlock_bh(&vsock->send_pkt_list_lock);
+   spin_unlock(&vsock->send_pkt_list_lock);
 
	list_for_each_entry_safe(pkt, n, &freeme, list) {
		if (pkt->reply)
@@ -265,6 +275,8 @@ vhost_transport_cancel_pkt(struct vsock_sock *vsk)
		vhost_poll_queue(&tx_vq->poll);
	}
 
+   spin_unlock_bh(&vhost_vsock_lock);
+
return 0;
 }
 
-- 
2.17.1



Re: [PATCH net-next] virtio_net: ethtool tx napi configuration

2018-09-27 Thread Jason Wang




On 2018/09/14 12:46, Willem de Bruijn wrote:

I'm not sure I get this. If we don't enable tx napi, we tend to delay the
TX interrupt if we find the ring is about to be full, to avoid an
interrupt storm, so we're probably ok in this case.

I'm only concerned about the transition state when converting from
napi to no-napi when the queue is stopped and tx interrupt disabled.

With napi mode the interrupt is only disabled if napi is scheduled,
in which case it will eventually reenable the interrupt. But when
switching to no-napi mode in this state no progress will be made.

But it seems this cannot happen. When converting to no-napi
mode, set_coalesce waits for napi to complete in napi_disable.
So the interrupt should always start enabled when transitioning
into no-napi mode.


An update: I hit a hang in napi_disable(). But it's hard to reproduce. I
tend to choose an easy way like V1 that only allows the switching when
the device is down.


I will post the patch after a vacation. (Or you can post it if it is
urgent for you.)


Thanks


Re: [PATCH net-next] virtio_net: ethtool tx napi configuration

2018-09-14 Thread Jason Wang




On 2018/09/14 12:46, Willem de Bruijn wrote:

On Thu, Sep 13, 2018 at 11:53 PM Jason Wang  wrote:



On 2018/09/14 11:40, Willem de Bruijn wrote:

On Thu, Sep 13, 2018 at 11:27 PM Jason Wang  wrote:


On 2018/09/13 22:58, Willem de Bruijn wrote:

On Thu, Sep 13, 2018 at 5:02 AM Jason Wang  wrote:

On 2018/09/13 07:27, Willem de Bruijn wrote:

On Wed, Sep 12, 2018 at 3:11 PM Willem de Bruijn
 wrote:

On Wed, Sep 12, 2018 at 2:16 PM Florian Fainelli  wrote:

On 9/12/2018 11:07 AM, Willem de Bruijn wrote:

On Wed, Sep 12, 2018 at 1:42 PM Florian Fainelli  wrote:

On 9/9/2018 3:44 PM, Willem de Bruijn wrote:

From: Willem de Bruijn 

Implement ethtool .set_coalesce (-C) and .get_coalesce (-c) handlers.
Interrupt moderation is currently not supported, so these accept and
display the default settings of 0 usec and 1 frame.

Toggle tx napi through a bit in tx-frames. So as to not interfere
with possible future interrupt moderation, use bit 10, well outside
the reasonable range of real interrupt moderation values.

Changes are not atomic. The tx IRQ, napi BH and transmit path must
be quiesced when switching modes. Only allow changing this setting
when the device is down.

Humm, would not a private ethtool flag to switch TX NAPI on/off be more
appropriate rather than use the coalescing configuration API here?

What do you mean by private ethtool flag? A new field in ethtool
--features (-k)?

I meant using ethtool_drvinfo::n_priv_flags, ETH_SS_PRIV_FLAGS and then
ETHTOOL_GFPFLAGS and ETHTOOL_SPFLAGS to control the toggling of that
private flag. mlx5 has a number of privates flags for instance.

Interesting, thanks! I was not at all aware of those ethtool flags.
Am having a look. It definitely looks promising.

Okay, I made that change. That is indeed much cleaner, thanks.
Let me send the patch, initially as RFC.

I've observed one issue where if we toggle the flag before bringing
up the device, it hits a kernel BUG at include/linux/netdevice.h:515

BUG_ON(!test_bit(NAPI_STATE_SCHED, &n->state));

This reminds me that we need to check netif_running() before trying to
enable and disable tx napi in ethtool_set_coalesce().

The first iteration of my patch checked IFF_UP and effectively
only allowed the change when not running. What do you mean
by need to check?

I mean if device is not up, there's no need to toggle napi state and tx
lock.


And to respond to the other follow-up notes at once:


Consider we may have interrupt moderation in the future, I tend to use
set_coalesce. Otherwise we may need two steps to enable moderation:

- tx-napi on
- set_coalesce

FWIW, I don't care strongly whether we do this through coalesce or priv_flags.

Ok.

Since you prefer coalesce, let's go with that (and a revision of your
latest patch).

Good to know this.


+ if (!napi_weight)
+ virtqueue_enable_cb(vi->sq[i].vq);

I don't get why we need to disable enable cb here.

To avoid entering no-napi mode with too few descriptors to
make progress and no way to get out of that state. This is a
pretty crude attempt at handling that, admittedly.

But in this case, we will call enable_cb_delayed() and we will finally
get an interrupt?

Right. It's a bit of a roundabout way to ensure that
netif_tx_wake_queue and thus eventually free_old_xmit_skbs are called.
It might make more sense to just wake the device without going through
an interrupt.

I'm not sure I get this. If we don't enable tx napi, we tend to delay the
TX interrupt if we find the ring is about to be full, to avoid an
interrupt storm, so we're probably ok in this case.

I'm only concerned about the transition state when converting from
napi to no-napi when the queue is stopped and tx interrupt disabled.

With napi mode the interrupt is only disabled if napi is scheduled,
in which case it will eventually reenable the interrupt. But when
switching to no-napi mode in this state no progress will be made.

But it seems this cannot happen. When converting to no-napi
mode, set_coalesce waits for napi to complete in napi_disable.
So the interrupt should always start enabled when transitioning
into no-napi mode.


Yes, I see.

Thanks


Re: [PATCH net-next] virtio_net: ethtool tx napi configuration

2018-09-13 Thread Jason Wang




On 2018/09/14 11:40, Willem de Bruijn wrote:

On Thu, Sep 13, 2018 at 11:27 PM Jason Wang  wrote:



On 2018/09/13 22:58, Willem de Bruijn wrote:

On Thu, Sep 13, 2018 at 5:02 AM Jason Wang  wrote:


On 2018/09/13 07:27, Willem de Bruijn wrote:

On Wed, Sep 12, 2018 at 3:11 PM Willem de Bruijn
 wrote:

On Wed, Sep 12, 2018 at 2:16 PM Florian Fainelli  wrote:

On 9/12/2018 11:07 AM, Willem de Bruijn wrote:

On Wed, Sep 12, 2018 at 1:42 PM Florian Fainelli  wrote:

On 9/9/2018 3:44 PM, Willem de Bruijn wrote:

From: Willem de Bruijn 

Implement ethtool .set_coalesce (-C) and .get_coalesce (-c) handlers.
Interrupt moderation is currently not supported, so these accept and
display the default settings of 0 usec and 1 frame.

Toggle tx napi through a bit in tx-frames. So as to not interfere
with possible future interrupt moderation, use bit 10, well outside
the reasonable range of real interrupt moderation values.

Changes are not atomic. The tx IRQ, napi BH and transmit path must
be quiesced when switching modes. Only allow changing this setting
when the device is down.

Humm, would not a private ethtool flag to switch TX NAPI on/off be more
appropriate rather than use the coalescing configuration API here?

What do you mean by private ethtool flag? A new field in ethtool
--features (-k)?

I meant using ethtool_drvinfo::n_priv_flags, ETH_SS_PRIV_FLAGS and then
ETHTOOL_GFPFLAGS and ETHTOOL_SPFLAGS to control the toggling of that
private flag. mlx5 has a number of privates flags for instance.

Interesting, thanks! I was not at all aware of those ethtool flags.
Am having a look. It definitely looks promising.

Okay, I made that change. That is indeed much cleaner, thanks.
Let me send the patch, initially as RFC.

I've observed one issue where if we toggle the flag before bringing
up the device, it hits a kernel BUG at include/linux/netdevice.h:515

   BUG_ON(!test_bit(NAPI_STATE_SCHED, &n->state));

This reminds me that we need to check netif_running() before trying to
enable and disable tx napi in ethtool_set_coalesce().

The first iteration of my patch checked IFF_UP and effectively
only allowed the change when not running. What do you mean
by need to check?

I mean if device is not up, there's no need to toggle napi state and tx
lock.


And to respond to the other follow-up notes at once:


Consider we may have interrupt moderation in the future, I tend to use
set_coalesce. Otherwise we may need two steps to enable moderation:

- tx-napi on
- set_coalesce

FWIW, I don't care strongly whether we do this through coalesce or priv_flags.

Ok.

Since you prefer coalesce, let's go with that (and a revision of your
latest patch).


Good to know this.


+ if (!napi_weight)
+ virtqueue_enable_cb(vi->sq[i].vq);

I don't get why we need to disable enable cb here.

To avoid entering no-napi mode with too few descriptors to
make progress and no way to get out of that state. This is a
pretty crude attempt at handling that, admittedly.

But in this case, we will call enable_cb_delayed() and we will finally
get an interrupt?

Right. It's a bit of a roundabout way to ensure that
netif_tx_wake_queue and thus eventually free_old_xmit_skbs are called.
It might make more sense to just wake the device without going through
an interrupt.


I'm not sure I get this. If we don't enable tx napi, we tend to delay the
TX interrupt if we find the ring is about to be full, to avoid an
interrupt storm, so we're probably ok in this case.


Thanks


Re: [PATCH net-next] virtio_net: ethtool tx napi configuration

2018-09-13 Thread Jason Wang




On 2018/09/13 22:58, Willem de Bruijn wrote:

On Thu, Sep 13, 2018 at 5:02 AM Jason Wang  wrote:



On 2018/09/13 07:27, Willem de Bruijn wrote:

On Wed, Sep 12, 2018 at 3:11 PM Willem de Bruijn
 wrote:

On Wed, Sep 12, 2018 at 2:16 PM Florian Fainelli  wrote:


On 9/12/2018 11:07 AM, Willem de Bruijn wrote:

On Wed, Sep 12, 2018 at 1:42 PM Florian Fainelli  wrote:


On 9/9/2018 3:44 PM, Willem de Bruijn wrote:

From: Willem de Bruijn 

Implement ethtool .set_coalesce (-C) and .get_coalesce (-c) handlers.
Interrupt moderation is currently not supported, so these accept and
display the default settings of 0 usec and 1 frame.

Toggle tx napi through a bit in tx-frames. So as to not interfere
with possible future interrupt moderation, use bit 10, well outside
the reasonable range of real interrupt moderation values.

Changes are not atomic. The tx IRQ, napi BH and transmit path must
be quiesced when switching modes. Only allow changing this setting
when the device is down.

Humm, would not a private ethtool flag to switch TX NAPI on/off be more
appropriate rather than use the coalescing configuration API here?

What do you mean by private ethtool flag? A new field in ethtool
--features (-k)?

I meant using ethtool_drvinfo::n_priv_flags, ETH_SS_PRIV_FLAGS and then
ETHTOOL_GFPFLAGS and ETHTOOL_SPFLAGS to control the toggling of that
private flag. mlx5 has a number of privates flags for instance.

Interesting, thanks! I was not at all aware of those ethtool flags.
Am having a look. It definitely looks promising.

Okay, I made that change. That is indeed much cleaner, thanks.
Let me send the patch, initially as RFC.

I've observed one issue where if we toggle the flag before bringing
up the device, it hits a kernel BUG at include/linux/netdevice.h:515

  BUG_ON(!test_bit(NAPI_STATE_SCHED, &n->state));

This reminds me that we need to check netif_running() before trying to
enable and disable tx napi in ethtool_set_coalesce().

The first iteration of my patch checked IFF_UP and effectively
only allowed the change when not running. What do you mean
by need to check?


I mean if device is not up, there's no need to toggle napi state and tx 
lock.




And to respond to the other follow-up notes at once:


Consider we may have interrupt moderation in the future, I tend to use
set_coalesce. Otherwise we may need two steps to enable moderation:

- tx-napi on
- set_coalesce

FWIW, I don't care strongly whether we do this through coalesce or priv_flags.


Ok.


+ if (!napi_weight)
+ virtqueue_enable_cb(vi->sq[i].vq);

I don't get why we need to disable enable cb here.

To avoid entering no-napi mode with too few descriptors to
make progress and no way to get out of that state. This is a
pretty crude attempt at handling that, admittedly.


But in this case, we will call enable_cb_delayed() and we will finally
get an interrupt?


Thanks


Re: [PATCH net-next RFC] virtio_net: ethtool tx napi configuration

2018-09-13 Thread Jason Wang




On 2018/09/13 07:29, Willem de Bruijn wrote:

From: Willem de Bruijn 

Implement ethtool .set_priv_flags and .get_priv_flags handlers
and use ethtool private flags to toggle transmit napi:

   ethtool --set-priv-flags eth0 tx-napi on
   ethtool --show-priv-flags eth0

Link: https://patchwork.ozlabs.org/patch/948149/
Suggested-by: Jason Wang 
Suggested-by: Florian Fainelli 
Signed-off-by: Willem de Bruijn 
---
  drivers/net/virtio_net.c | 49 
  1 file changed, 49 insertions(+)

diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
index 765920905226..9ca7e0a0f0d9 100644
--- a/drivers/net/virtio_net.c
+++ b/drivers/net/virtio_net.c
@@ -73,6 +73,10 @@ static const unsigned long guest_offloads[] = {
VIRTIO_NET_F_GUEST_UFO
  };
  
+static const char virtnet_ethtool_priv_flags[][ETH_GSTRING_LEN] = {

+   "tx-napi",
+};
+
  struct virtnet_stat_desc {
char desc[ETH_GSTRING_LEN];
size_t offset;
@@ -2059,6 +2063,9 @@ static void virtnet_get_strings(struct net_device *dev, u32 stringset, u8 *data)
}
}
break;
+   case ETH_SS_PRIV_FLAGS:
+   memcpy(data, virtnet_ethtool_priv_flags,
+  sizeof(virtnet_ethtool_priv_flags));
}
  }
  
@@ -2070,6 +2077,9 @@ static int virtnet_get_sset_count(struct net_device *dev, int sset)

case ETH_SS_STATS:
return vi->curr_queue_pairs * (VIRTNET_RQ_STATS_LEN +
   VIRTNET_SQ_STATS_LEN);
+   case ETH_SS_PRIV_FLAGS:
+   return ARRAY_SIZE(virtnet_ethtool_priv_flags);
+
default:
return -EOPNOTSUPP;
}
@@ -2181,6 +2191,43 @@ static int virtnet_get_link_ksettings(struct net_device *dev,
return 0;
  }
  
+static int virtnet_set_priv_flags(struct net_device *dev, u32 priv_flags)

+{
+   struct virtnet_info *vi = netdev_priv(dev);
+   int i, napi_weight;
+
+   napi_weight = priv_flags & 0x1 ? NAPI_POLL_WEIGHT : 0;
+
+   if (napi_weight ^ vi->sq[0].napi.weight) {
+   for (i = 0; i < vi->max_queue_pairs; i++) {
+   struct netdev_queue *txq =
+   netdev_get_tx_queue(vi->dev, i);
+
+   virtnet_napi_tx_disable(&vi->sq[i].napi);
+   __netif_tx_lock_bh(txq);
+   vi->sq[i].napi.weight = napi_weight;
+   if (!napi_weight)
+   virtqueue_enable_cb(vi->sq[i].vq);


I don't get why we need to disable enable cb here.

Thanks


+   __netif_tx_unlock_bh(txq);
+   virtnet_napi_tx_enable(vi, vi->sq[i].vq,
+  &vi->sq[i].napi);
+   }
+   }
+
+   return 0;
+}
+
+static u32 virtnet_get_priv_flags(struct net_device *dev)
+{
+   struct virtnet_info *vi = netdev_priv(dev);
+   int priv_flags = 0;
+
+   if (vi->sq[0].napi.weight)
+   priv_flags |= 0x1;
+
+   return priv_flags;
+}
+
  static void virtnet_init_settings(struct net_device *dev)
  {
struct virtnet_info *vi = netdev_priv(dev);
@@ -2219,6 +2266,8 @@ static const struct ethtool_ops virtnet_ethtool_ops = {
.get_ts_info = ethtool_op_get_ts_info,
.get_link_ksettings = virtnet_get_link_ksettings,
.set_link_ksettings = virtnet_set_link_ksettings,
+   .set_priv_flags = virtnet_set_priv_flags,
+   .get_priv_flags = virtnet_get_priv_flags,
  };
  
  static void virtnet_freeze_down(struct virtio_device *vdev)




Re: [PATCH net-next] virtio_net: ethtool tx napi configuration

2018-09-13 Thread Jason Wang




On 2018/09/12 21:43, Willem de Bruijn wrote:

On Tue, Sep 11, 2018 at 11:35 PM Jason Wang  wrote:



On 2018/09/11 09:14, Willem de Bruijn wrote:

I cook a fixup, and it looks works in my setup:

diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
index b320b6b14749..9181c3f2f832 100644
--- a/drivers/net/virtio_net.c
+++ b/drivers/net/virtio_net.c
@@ -2204,10 +2204,17 @@ static int virtnet_set_coalesce(struct net_device *dev,
return -EINVAL;

if (napi_weight ^ vi->sq[0].napi.weight) {
-   if (dev->flags & IFF_UP)
-   return -EBUSY;
-   for (i = 0; i < vi->max_queue_pairs; i++)
+   for (i = 0; i < vi->max_queue_pairs; i++) {
+   struct netdev_queue *txq =
+  netdev_get_tx_queue(vi->dev, i);
+
+ virtnet_napi_tx_disable(&vi->sq[i].napi);
+   __netif_tx_lock_bh(txq);
vi->sq[i].napi.weight = napi_weight;
+   __netif_tx_unlock_bh(txq);
+   virtnet_napi_tx_enable(vi, vi->sq[i].vq,
+ &vi->sq[i].napi);
+   }
}

return 0;

Thanks! It passes my simple stress test, too, which consists of two
concurrent loops, one toggling the ethtool option, another running
TCP_RR.


The only remaining case is the speculative tx polling in RX NAPI. I think
we don't need to care about this case since it is not a must for
correctness.

As long as the txq lock is held that will be a noop, anyway. The other
concurrent action is skb_xmit_done. It looks correct to me, but need
to think about it a bit. The tricky transition is coming out of napi without
having >= 2 + MAX_SKB_FRAGS clean descriptors. If the queue is
stopped it may deadlock transmission in no-napi mode.

Yes, maybe we can enable tx queue when napi weight is zero in
virtnet_poll_tx().

Yes, that precaution should resolve that edge case.


I've done a stress test and it passes. The test contains:

- vm with 2 queues
- a bash script to enable and disable tx napi
- two netperf UDP_STREAM sessions to send small packets

Great. That matches my results. Do you want to send the v2?


Some mails were blocked so I did not receive some replies in time. So I
posted a V2 (but as you've pointed out, it's buggy).


Thanks



Re: [PATCH net-next] virtio_net: ethtool tx napi configuration

2018-09-13 Thread Jason Wang




On 2018/09/13 03:11, Willem de Bruijn wrote:

On Wed, Sep 12, 2018 at 2:16 PM Florian Fainelli  wrote:



On 9/12/2018 11:07 AM, Willem de Bruijn wrote:

On Wed, Sep 12, 2018 at 1:42 PM Florian Fainelli  wrote:



On 9/9/2018 3:44 PM, Willem de Bruijn wrote:

From: Willem de Bruijn 

Implement ethtool .set_coalesce (-C) and .get_coalesce (-c) handlers.
Interrupt moderation is currently not supported, so these accept and
display the default settings of 0 usec and 1 frame.

Toggle tx napi through a bit in tx-frames. So as to not interfere
with possible future interrupt moderation, use bit 10, well outside
the reasonable range of real interrupt moderation values.

Changes are not atomic. The tx IRQ, napi BH and transmit path must
be quiesced when switching modes. Only allow changing this setting
when the device is down.

Humm, would not a private ethtool flag to switch TX NAPI on/off be more
appropriate rather than use the coalescing configuration API here?

What do you mean by private ethtool flag? A new field in ethtool
--features (-k)?

I meant using ethtool_drvinfo::n_priv_flags, ETH_SS_PRIV_FLAGS and then
ETHTOOL_GFPFLAGS and ETHTOOL_SPFLAGS to control the toggling of that
private flag. mlx5 has a number of privates flags for instance.

Interesting, thanks! I was not at all aware of those ethtool flags.
Am having a look. It definitely looks promising.


Configurable napi-tx is not a common feature across devices. We really
want virtio-net to also just convert to napi-tx as default, but need a
way to gradually convert with application opt-out if some workloads
see regressions.

The rationale makes sense, no questions about it.


There is prior art in interpreting coalesce values as
more than a direct mapping to usec. The e1000 is one example.


Looked at both e1000 and e1000e and they both have a similar programming
of the HW's interrupt target rate register, which is relevant to
interrupt coalescing. What part of these drivers do you see as doing
something not quite coalescing related?

It's all coalescing related, for sure. e1000_set_coalesce just does not
translate the tx-usecs values into microsecond latency directly.

It modifies both the interrupt throttle rate adapter->itr and interrupt mode
adapter->itr_setting, which are initially set in e1000_check_options from
module param InterruptThrottleRate.

Value 0 disables interrupt moderation. 1 and 3 program a dynamic mode.
2 is an illegal value as is 5..9. 10..1 converts from usec to interrupt
rate/sec.

I took tx-napi to be a similar interrupt related option as, say, dynamic
conservative mode interrupt moderation.


Considering that we may have interrupt moderation in the future, I tend to
prefer set_coalesce. Otherwise we would need two steps to enable moderation:


- tx-napi on
- set_coalesce

Thanks



Re: [PATCH net-next] virtio_net: ethtool tx napi configuration

2018-09-13 Thread Jason Wang




On 2018年09月13日 07:27, Willem de Bruijn wrote:

On Wed, Sep 12, 2018 at 3:11 PM Willem de Bruijn
 wrote:

On Wed, Sep 12, 2018 at 2:16 PM Florian Fainelli  wrote:



On 9/12/2018 11:07 AM, Willem de Bruijn wrote:

On Wed, Sep 12, 2018 at 1:42 PM Florian Fainelli  wrote:



On 9/9/2018 3:44 PM, Willem de Bruijn wrote:

From: Willem de Bruijn 

Implement ethtool .set_coalesce (-C) and .get_coalesce (-c) handlers.
Interrupt moderation is currently not supported, so these accept and
display the default settings of 0 usec and 1 frame.

Toggle tx napi through a bit in tx-frames. So as to not interfere
with possible future interrupt moderation, use bit 10, well outside
the reasonable range of real interrupt moderation values.

Changes are not atomic. The tx IRQ, napi BH and transmit path must
be quiesced when switching modes. Only allow changing this setting
when the device is down.

Humm, would not a private ethtool flag to switch TX NAPI on/off be more
appropriate rather than use the coalescing configuration API here?

What do you mean by private ethtool flag? A new field in ethtool
--features (-k)?

I meant using ethtool_drvinfo::n_priv_flags, ETH_SS_PRIV_FLAGS and then
ETHTOOL_GFPFLAGS and ETHTOOL_SPFLAGS to control the toggling of that
private flag. mlx5 has a number of privates flags for instance.

Interesting, thanks! I was not at all aware of those ethtool flags.
Am having a look. It definitely looks promising.

Okay, I made that change. That is indeed much cleaner, thanks.
Let me send the patch, initially as RFC.

I've observed one issue where if we toggle the flag before bringing
up the device, it hits a kernel BUG at include/linux/netdevice.h:515

 BUG_ON(!test_bit(NAPI_STATE_SCHED, >state));


This reminds me that we need to check netif_running() before trying to 
enable and disable tx napi in ethtool_set_coalesce().
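
Something along these lines, as a sketch (only the netif_running() split is
the point here; error handling elided):

	if (napi_weight ^ vi->sq[0].napi.weight) {
		if (!netif_running(dev)) {
			/* Device is down: no napi is scheduled, just flip
			 * the weight without touching napi state. */
			for (i = 0; i < vi->max_queue_pairs; i++)
				vi->sq[i].napi.weight = napi_weight;
		} else {
			/* Device is up: quiesce each txq around the change. */
			for (i = 0; i < vi->max_queue_pairs; i++) {
				struct netdev_queue *txq =
					netdev_get_tx_queue(vi->dev, i);

				virtnet_napi_tx_disable(&vi->sq[i].napi);
				__netif_tx_lock_bh(txq);
				vi->sq[i].napi.weight = napi_weight;
				__netif_tx_unlock_bh(txq);
				virtnet_napi_tx_enable(vi, vi->sq[i].vq,
						       &vi->sq[i].napi);
			}
		}
	}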


Thanks


Re: [PATCH net-next] virtio_net: ethtool tx napi configuration

2018-09-11 Thread Jason Wang




On 2018年09月11日 09:14, Willem de Bruijn wrote:

I cooked a fixup, and it looks like it works in my setup:

diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
index b320b6b14749..9181c3f2f832 100644
--- a/drivers/net/virtio_net.c
+++ b/drivers/net/virtio_net.c
@@ -2204,10 +2204,17 @@ static int virtnet_set_coalesce(struct
net_device *dev,
   return -EINVAL;

   if (napi_weight ^ vi->sq[0].napi.weight) {
-   if (dev->flags & IFF_UP)
-   return -EBUSY;
-   for (i = 0; i < vi->max_queue_pairs; i++)
+   for (i = 0; i < vi->max_queue_pairs; i++) {
+   struct netdev_queue *txq =
+  netdev_get_tx_queue(vi->dev, i);
+
+ virtnet_napi_tx_disable(&vi->sq[i].napi);
+   __netif_tx_lock_bh(txq);
   vi->sq[i].napi.weight = napi_weight;
+   __netif_tx_unlock_bh(txq);
+   virtnet_napi_tx_enable(vi, vi->sq[i].vq,
+ &vi->sq[i].napi);
+   }
   }

   return 0;

Thanks! It passes my simple stress test, too. Which consists of two
concurrent loops, one toggling the ethtool option, another running
TCP_RR.


The only remaining case is the speculative tx polling in RX NAPI. I think we
don't need to care about this case since it is not a must for correctness.

As long as the txq lock is held that will be a noop, anyway. The other
concurrent action is skb_xmit_done. It looks correct to me, but need
to think about it a bit. The tricky transition is coming out of napi without
having >= 2 + MAX_SKB_FRAGS clean descriptors. If the queue is
stopped it may deadlock transmission in no-napi mode.

Yes, maybe we can enable the tx queue when the napi weight is zero in
virtnet_poll_tx().

Yes, that precaution should resolve that edge case.



I've done a stress test and it passes. The test contains:

- vm with 2 queues
- a bash script to enable and disable tx napi
- two netperf UDP_STREAM sessions to send small packets

Thanks



Re: [PATCH net-next] virtio_net: ethtool tx napi configuration

2018-09-10 Thread Jason Wang




On 2018年09月10日 21:35, Willem de Bruijn wrote:

On Mon, Sep 10, 2018 at 2:01 AM Jason Wang  wrote:



On 2018年09月10日 06:44, Willem de Bruijn wrote:

From: Willem de Bruijn 

Implement ethtool .set_coalesce (-C) and .get_coalesce (-c) handlers.
Interrupt moderation is currently not supported, so these accept and
display the default settings of 0 usec and 1 frame.

Toggle tx napi through a bit in tx-frames. So as to not interfere
with possible future interrupt moderation, use bit 10, well outside
the reasonable range of real interrupt moderation values.

Changes are not atomic. The tx IRQ, napi BH and transmit path must
be quiesced when switching modes. Only allow changing this setting
when the device is down.

I cooked a fixup, and it looks like it works in my setup:

diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
index b320b6b14749..9181c3f2f832 100644
--- a/drivers/net/virtio_net.c
+++ b/drivers/net/virtio_net.c
@@ -2204,10 +2204,17 @@ static int virtnet_set_coalesce(struct
net_device *dev,
  return -EINVAL;

  if (napi_weight ^ vi->sq[0].napi.weight) {
-   if (dev->flags & IFF_UP)
-   return -EBUSY;
-   for (i = 0; i < vi->max_queue_pairs; i++)
+   for (i = 0; i < vi->max_queue_pairs; i++) {
+   struct netdev_queue *txq =
+  netdev_get_tx_queue(vi->dev, i);
+
+ virtnet_napi_tx_disable(&vi->sq[i].napi);
+   __netif_tx_lock_bh(txq);
  vi->sq[i].napi.weight = napi_weight;
+   __netif_tx_unlock_bh(txq);
+   virtnet_napi_tx_enable(vi, vi->sq[i].vq,
+ &vi->sq[i].napi);
+   }
  }

  return 0;

Thanks! It passes my simple stress test, too. Which consists of two
concurrent loops, one toggling the ethtool option, another running
TCP_RR.


The only remaining case is the speculative tx polling in RX NAPI. I think we
don't need to care about this case since it is not a must for correctness.

As long as the txq lock is held that will be a noop, anyway. The other
concurrent action is skb_xmit_done. It looks correct to me, but need
to think about it a bit. The tricky transition is coming out of napi without
having >= 2 + MAX_SKB_FRAGS clean descriptors. If the queue is
stopped it may deadlock transmission in no-napi mode.


Yes, maybe we can enable the tx queue when the napi weight is zero in
virtnet_poll_tx(). Let me do more stress testing on this.





Link: https://patchwork.ozlabs.org/patch/948149/
Suggested-by: Jason Wang 
Signed-off-by: Willem de Bruijn 
---
   drivers/net/virtio_net.c | 52 
   1 file changed, 52 insertions(+)

diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
index 765920905226..b320b6b14749 100644
--- a/drivers/net/virtio_net.c
+++ b/drivers/net/virtio_net.c
@@ -66,6 +66,8 @@ DECLARE_EWMA(pkt_len, 0, 64)

   #define VIRTNET_DRIVER_VERSION "1.0.0"

+static const u32 ethtool_coalesce_napi_mask = (1UL << 10);
+
   static const unsigned long guest_offloads[] = {
   VIRTIO_NET_F_GUEST_TSO4,
   VIRTIO_NET_F_GUEST_TSO6,
@@ -2181,6 +2183,54 @@ static int virtnet_get_link_ksettings(struct net_device 
*dev,
   return 0;
   }

+static int virtnet_set_coalesce(struct net_device *dev,
+ struct ethtool_coalesce *ec)
+{
+ const struct ethtool_coalesce ec_default = {
+ .cmd = ETHTOOL_SCOALESCE,
+ .rx_max_coalesced_frames = 1,

I think the rx part is not necessary.

The definition of ethtool_coalesce has:

"* It is illegal to set both usecs and max_frames to zero as this
  * would cause interrupts to never be generated.  To disable
  * coalescing, set usecs = 0 and max_frames = 1."

I'd rather not diverge from this prescribed behavior unless there's
a strong reason.


I get this.



On the related point in the other thread:


Rethinking this, how about something like:

- UINT_MAX: no tx interrupt
- other value: tx interrupt with possible interrupt moderation

Okay, that will be simpler to configure.


I wonder whether we can reuse the ethtool_coalesce definition, e.g.
usecs == 0 && max_frames == 0 means no interrupt? Just a thought, I'm OK with either.
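
A sketch of that interpretation inside .set_coalesce (field names from
struct ethtool_coalesce; not a settled interface):

	/* usecs == 0 && max_frames == 0: no tx interrupt, i.e. napi-tx off.
	 * Anything else keeps tx interrupts (moderation can come later). */
	if (ec->tx_coalesce_usecs == 0 && ec->tx_max_coalesced_frames == 0)
		napi_weight = 0;
	else
		napi_weight = NAPI_POLL_WEIGHT;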


Thanks


Re: [PATCH net-next] virtio_net: ethtool tx napi configuration

2018-09-10 Thread Jason Wang




On 2018年09月10日 06:44, Willem de Bruijn wrote:

From: Willem de Bruijn 

Implement ethtool .set_coalesce (-C) and .get_coalesce (-c) handlers.
Interrupt moderation is currently not supported, so these accept and
display the default settings of 0 usec and 1 frame.

Toggle tx napi through a bit in tx-frames. So as to not interfere
with possible future interrupt moderation, use bit 10, well outside
the reasonable range of real interrupt moderation values.

Changes are not atomic. The tx IRQ, napi BH and transmit path must
be quiesced when switching modes. Only allow changing this setting
when the device is down.


I cooked a fixup, and it looks like it works in my setup:

diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
index b320b6b14749..9181c3f2f832 100644
--- a/drivers/net/virtio_net.c
+++ b/drivers/net/virtio_net.c
@@ -2204,10 +2204,17 @@ static int virtnet_set_coalesce(struct 
net_device *dev,

    return -EINVAL;

    if (napi_weight ^ vi->sq[0].napi.weight) {
-   if (dev->flags & IFF_UP)
-   return -EBUSY;
-   for (i = 0; i < vi->max_queue_pairs; i++)
+   for (i = 0; i < vi->max_queue_pairs; i++) {
+   struct netdev_queue *txq =
+  netdev_get_tx_queue(vi->dev, i);
+
+ virtnet_napi_tx_disable(&vi->sq[i].napi);
+   __netif_tx_lock_bh(txq);
    vi->sq[i].napi.weight = napi_weight;
+   __netif_tx_unlock_bh(txq);
+   virtnet_napi_tx_enable(vi, vi->sq[i].vq,
+ &vi->sq[i].napi);
+   }
    }

    return 0;

The only remaining case is the speculative tx polling in RX NAPI. I think we
don't need to care about this case since it is not a must for correctness.




Link: https://patchwork.ozlabs.org/patch/948149/
Suggested-by: Jason Wang 
Signed-off-by: Willem de Bruijn 
---
  drivers/net/virtio_net.c | 52 
  1 file changed, 52 insertions(+)

diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
index 765920905226..b320b6b14749 100644
--- a/drivers/net/virtio_net.c
+++ b/drivers/net/virtio_net.c
@@ -66,6 +66,8 @@ DECLARE_EWMA(pkt_len, 0, 64)
  
  #define VIRTNET_DRIVER_VERSION "1.0.0"
  
+static const u32 ethtool_coalesce_napi_mask = (1UL << 10);

+
  static const unsigned long guest_offloads[] = {
VIRTIO_NET_F_GUEST_TSO4,
VIRTIO_NET_F_GUEST_TSO6,
@@ -2181,6 +2183,54 @@ static int virtnet_get_link_ksettings(struct net_device 
*dev,
return 0;
  }
  
+static int virtnet_set_coalesce(struct net_device *dev,

+   struct ethtool_coalesce *ec)
+{
+   const struct ethtool_coalesce ec_default = {
+   .cmd = ETHTOOL_SCOALESCE,
+   .rx_max_coalesced_frames = 1,


I think the rx part is not necessary.

Thanks


+   .tx_max_coalesced_frames = 1,
+   };
+   struct virtnet_info *vi = netdev_priv(dev);
+   int i, napi_weight = 0;
+
+   if (ec->tx_max_coalesced_frames & ethtool_coalesce_napi_mask) {
+   ec->tx_max_coalesced_frames &= ~ethtool_coalesce_napi_mask;
+   napi_weight = NAPI_POLL_WEIGHT;
+   }
+
+   /* disallow changes to fields not explicitly tested above */
+   if (memcmp(ec, &ec_default, sizeof(ec_default)))
+   return -EINVAL;
+
+   if (napi_weight ^ vi->sq[0].napi.weight) {
+   if (dev->flags & IFF_UP)
+   return -EBUSY;
+   for (i = 0; i < vi->max_queue_pairs; i++)
+   vi->sq[i].napi.weight = napi_weight;
+   }
+
+   return 0;
+}
+
+static int virtnet_get_coalesce(struct net_device *dev,
+   struct ethtool_coalesce *ec)
+{
+   const struct ethtool_coalesce ec_default = {
+   .cmd = ETHTOOL_GCOALESCE,
+   .rx_max_coalesced_frames = 1,
+   .tx_max_coalesced_frames = 1,
+   };
+   struct virtnet_info *vi = netdev_priv(dev);
+
+   memcpy(ec, &ec_default, sizeof(ec_default));
+
+   if (vi->sq[0].napi.weight)
+   ec->tx_max_coalesced_frames |= ethtool_coalesce_napi_mask;
+
+   return 0;
+}
+
  static void virtnet_init_settings(struct net_device *dev)
  {
struct virtnet_info *vi = netdev_priv(dev);
@@ -2219,6 +2269,8 @@ static const struct ethtool_ops virtnet_ethtool_ops = {
.get_ts_info = ethtool_op_get_ts_info,
.get_link_ksettings = virtnet_get_link_ksettings,
.set_link_ksettings = virtnet_set_link_ksettings,
+   .set_coalesce = virtnet_set_coalesce,
+   .get_coalesce = virtnet_get_coalesce,
  };
  
  static void virtnet_freeze_down(struct virtio_device *vdev)




Re: [PATCH net-next] virtio_net: force_napi_tx module param.

2018-09-09 Thread Jason Wang




On 2018年09月10日 07:07, Willem de Bruijn wrote:

On Wed, Aug 29, 2018 at 9:01 AM Willem de Bruijn
 wrote:

On Wed, Aug 29, 2018 at 3:56 AM Jason Wang  wrote:



On 2018年08月29日 03:57, Willem de Bruijn wrote:

On Mon, Jul 30, 2018 at 2:06 AM Jason Wang  wrote:


On 2018年07月25日 08:17, Jon Olson wrote:

On Tue, Jul 24, 2018 at 3:46 PM Michael S. Tsirkin  wrote:

On Tue, Jul 24, 2018 at 06:31:54PM -0400, Willem de Bruijn wrote:

On Tue, Jul 24, 2018 at 6:23 PM Michael S. Tsirkin  wrote:

On Tue, Jul 24, 2018 at 04:52:53PM -0400, Willem de Bruijn wrote:

From the above linked patch, I understand that there are yet
other special cases in production, such as a hard cap on #tx queues to
32 regardless of number of vcpus.

I don't think upstream kernels have this limit - we can
now use vmalloc for higher number of queues.

Yes. that patch* mentioned it as a google compute engine imposed
limit. It is exactly such cloud provider imposed rules that I'm
concerned about working around in upstream drivers.

* for reference, I mean https://patchwork.ozlabs.org/patch/725249/

Yea. Why does GCE do it btw?

There are a few reasons for the limit, some historical, some current.

Historically we did this because of a kernel limit on the number of
TAP queues (in Montreal I thought this limit was 32). To my chagrin,
the limit upstream at the time we did it was actually eight. We had
increased the limit from eight to 32 internally, and it appears in
upstream it has subsequently increased upstream to 256. We no longer
use TAP for networking, so that constraint no longer applies for us,
but when looking at removing/raising the limit we discovered no
workloads that clearly benefited from lifting it, and it also placed
more pressure on our virtual networking stack particularly on the Tx
side. We left it as-is.

In terms of current reasons there are really two. One is memory usage.
As you know, virtio-net uses rx/tx pairs, so there's an expectation
that the guest will have an Rx queue for every Tx queue. We run our
individual virtqueues fairly deep (4096 entries) to give guests a wide
time window for re-posting Rx buffers and avoiding starvation on
packet delivery. Filling an Rx vring with max-sized mergeable buffers
(4096 bytes) is 16MB of GFP_ATOMIC allocations. At 32 queues this can
be up to 512MB of memory posted for network buffers. Scaling this to
the largest VM GCE offers today (160 VCPUs -- n1-ultramem-160) keeping
all of the Rx rings full would (in the large average Rx packet size
case) consume up to 2.5 GB(!) of guest RAM. Now, those VMs have 3.8T
of RAM available, but I don't believe we've observed a situation where
they would have benefited from having 2.5 gigs of buffers posted for
incoming network traffic :)
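
(For reference, the arithmetic behind those figures: 4096 descriptors *
4096-byte buffers = 16 MiB per Rx ring; 32 rings * 16 MiB = 512 MiB; and
160 rings * 16 MiB = 2560 MiB, i.e. roughly 2.5 GiB.)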

We can work to have async txq and rxq instead of pairs if there's a
strong requirement.


The second reason is interrupt related -- as I mentioned above, we
have found no workloads that clearly benefit from so many queues, but
we have found workloads that degrade. In particular workloads that do
a lot of small packet processing but which aren't extremely latency
sensitive can achieve higher PPS by taking fewer interrupt across
fewer VCPUs due to better batching (this also incurs higher latency,
but at the limit the "busy" cores end up suppressing most interrupts
and spending most of their cycles farming out work). Memcache is a
good example here, particularly if the latency targets for request
completion are in the ~milliseconds range (rather than the
microseconds we typically strive for with TCP_RR-style workloads).

All of that said, we haven't been forthcoming with data (and
unfortunately I don't have it handy in a useful form, otherwise I'd
simply post it here), so I understand the hesitation to simply run
with napi_tx across the board. As Willem said, this patch seemed like
the least disruptive way to allow us to continue down the road of
"universal" NAPI Tx and to hopefully get data across enough workloads
(with VMs small, large, and absurdly large :) to present a compelling
argument in one direction or another. As far as I know there aren't
currently any NAPI related ethtool commands (based on a quick perusal
of ethtool.h)

As I suggested before, maybe we can (ab)use tx-frames-irq.

I forgot to respond to this originally, but I agree.

How about something like the snippet below. It would be simpler to
reason about if only allow switching while the device is down, but
napi does not strictly require that.

+static int virtnet_set_coalesce(struct net_device *dev,
+   struct ethtool_coalesce *ec)
+{
+   const u32 tx_coalesce_napi_mask = (1 << 16);
+   const struct ethtool_coalesce ec_default = {
+   .cmd = ETHTOOL_SCOALESCE,
+   .rx_max_coalesced_frames = 1,
+   .tx_max_coalesced_frames = 1,
+   };
+   struct virtnet_info *vi = netdev_priv(dev);
+   int napi_weight = 0;
+   bool running;
+ 

Re: [PATCH net-next] virtio_net: force_napi_tx module param.

2018-08-29 Thread Jason Wang




On 2018年08月29日 03:57, Willem de Bruijn wrote:

On Mon, Jul 30, 2018 at 2:06 AM Jason Wang  wrote:



On 2018年07月25日 08:17, Jon Olson wrote:

On Tue, Jul 24, 2018 at 3:46 PM Michael S. Tsirkin  wrote:

On Tue, Jul 24, 2018 at 06:31:54PM -0400, Willem de Bruijn wrote:

On Tue, Jul 24, 2018 at 6:23 PM Michael S. Tsirkin  wrote:

On Tue, Jul 24, 2018 at 04:52:53PM -0400, Willem de Bruijn wrote:

From the above linked patch, I understand that there are yet
other special cases in production, such as a hard cap on #tx queues to
32 regardless of number of vcpus.

I don't think upstream kernels have this limit - we can
now use vmalloc for higher number of queues.

Yes. that patch* mentioned it as a google compute engine imposed
limit. It is exactly such cloud provider imposed rules that I'm
concerned about working around in upstream drivers.

* for reference, I mean https://patchwork.ozlabs.org/patch/725249/

Yea. Why does GCE do it btw?

There are a few reasons for the limit, some historical, some current.

Historically we did this because of a kernel limit on the number of
TAP queues (in Montreal I thought this limit was 32). To my chagrin,
the limit upstream at the time we did it was actually eight. We had
increased the limit from eight to 32 internally, and it appears in
upstream it has subsequently increased upstream to 256. We no longer
use TAP for networking, so that constraint no longer applies for us,
but when looking at removing/raising the limit we discovered no
workloads that clearly benefited from lifting it, and it also placed
more pressure on our virtual networking stack particularly on the Tx
side. We left it as-is.

In terms of current reasons there are really two. One is memory usage.
As you know, virtio-net uses rx/tx pairs, so there's an expectation
that the guest will have an Rx queue for every Tx queue. We run our
individual virtqueues fairly deep (4096 entries) to give guests a wide
time window for re-posting Rx buffers and avoiding starvation on
packet delivery. Filling an Rx vring with max-sized mergeable buffers
(4096 bytes) is 16MB of GFP_ATOMIC allocations. At 32 queues this can
be up to 512MB of memory posted for network buffers. Scaling this to
the largest VM GCE offers today (160 VCPUs -- n1-ultramem-160) keeping
all of the Rx rings full would (in the large average Rx packet size
case) consume up to 2.5 GB(!) of guest RAM. Now, those VMs have 3.8T
of RAM available, but I don't believe we've observed a situation where
they would have benefited from having 2.5 gigs of buffers posted for
incoming network traffic :)

We can work to have async txq and rxq instead of pairs if there's a
strong requirement.


The second reason is interrupt related -- as I mentioned above, we
have found no workloads that clearly benefit from so many queues, but
we have found workloads that degrade. In particular workloads that do
a lot of small packet processing but which aren't extremely latency
sensitive can achieve higher PPS by taking fewer interrupt across
fewer VCPUs due to better batching (this also incurs higher latency,
but at the limit the "busy" cores end up suppressing most interrupts
and spending most of their cycles farming out work). Memcache is a
good example here, particularly if the latency targets for request
completion are in the ~milliseconds range (rather than the
microseconds we typically strive for with TCP_RR-style workloads).

All of that said, we haven't been forthcoming with data (and
unfortunately I don't have it handy in a useful form, otherwise I'd
simply post it here), so I understand the hesitation to simply run
with napi_tx across the board. As Willem said, this patch seemed like
the least disruptive way to allow us to continue down the road of
"universal" NAPI Tx and to hopefully get data across enough workloads
(with VMs small, large, and absurdly large :) to present a compelling
argument in one direction or another. As far as I know there aren't
currently any NAPI related ethtool commands (based on a quick perusal
of ethtool.h)

As I suggested before, maybe we can (ab)use tx-frames-irq.

I forgot to respond to this originally, but I agree.

How about something like the snippet below. It would be simpler to
reason about if only allow switching while the device is down, but
napi does not strictly require that.

+static int virtnet_set_coalesce(struct net_device *dev,
+   struct ethtool_coalesce *ec)
+{
+   const u32 tx_coalesce_napi_mask = (1 << 16);
+   const struct ethtool_coalesce ec_default = {
+   .cmd = ETHTOOL_SCOALESCE,
+   .rx_max_coalesced_frames = 1,
+   .tx_max_coalesced_frames = 1,
+   };
+   struct virtnet_info *vi = netdev_priv(dev);
+   int napi_weight = 0;
+   bool running;
+   int i;
+
+   if (ec->tx_max_coalesced_frames & tx_coalesce_napi_mask) {
+   ec->tx_max_coalesced_frames &= ~tx_coal

Re: [PATCH net-next v2] net: allow to call netif_reset_xps_queues() under cpus_read_lock

2018-08-09 Thread Jason Wang




On 2018年08月09日 11:07, Andrei Vagin wrote:

From: Andrei Vagin 

The definition of static_key_slow_inc() has cpus_read_lock in place. In the
virtio_net driver, XPS queues are initialized after setting the queue:cpu
affinity in virtnet_set_affinity() which is already protected within
cpus_read_lock. Lockdep prints a warning when we are trying to acquire
cpus_read_lock when it is already held.

This patch adds an ability to call __netif_set_xps_queue under
cpus_read_lock().


WARNING: possible recursive locking detected
4.18.0-rc3-next-20180703+ #1 Not tainted

swapper/0/1 is trying to acquire lock:
cf973d46 (cpu_hotplug_lock.rw_sem){}, at: 
static_key_slow_inc+0xe/0x20

but task is already holding lock:
cf973d46 (cpu_hotplug_lock.rw_sem){}, at: init_vqs+0x513/0x5a0

other info that might help us debug this:
  Possible unsafe locking scenario:

CPU0

   lock(cpu_hotplug_lock.rw_sem);
   lock(cpu_hotplug_lock.rw_sem);

  *** DEADLOCK ***

  May be due to missing lock nesting notation

3 locks held by swapper/0/1:
  #0: 244bc7da (>mutex){}, at: __driver_attach+0x5a/0x110
  #1: cf973d46 (cpu_hotplug_lock.rw_sem){}, at: init_vqs+0x513/0x5a0
  #2: 5cd8463f (xps_map_mutex){+.+.}, at: 
__netif_set_xps_queue+0x8d/0xc60

v2: move cpus_read_lock() out of __netif_set_xps_queue()

Cc: "Nambiar, Amritha" 
Cc: "Michael S. Tsirkin" 
Cc: Jason Wang 
Fixes: 8af2c06ff4b1 ("net-sysfs: Add interface for Rx queue(s) map per Tx 
queue")

Signed-off-by: Andrei Vagin 
---
  drivers/net/virtio_net.c |  4 +++-
  net/core/dev.c   | 20 +++-
  net/core/net-sysfs.c |  4 
  3 files changed, 22 insertions(+), 6 deletions(-)

diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
index 62311dde6e71..39a7f4452587 100644
--- a/drivers/net/virtio_net.c
+++ b/drivers/net/virtio_net.c
@@ -1903,9 +1903,11 @@ static void virtnet_set_affinity(struct virtnet_info *vi)
  
  	i = 0;

for_each_online_cpu(cpu) {
+   const unsigned long *mask = cpumask_bits(cpumask_of(cpu));
+
virtqueue_set_affinity(vi->rq[i].vq, cpu);
virtqueue_set_affinity(vi->sq[i].vq, cpu);
-   netif_set_xps_queue(vi->dev, cpumask_of(cpu), i);
+   __netif_set_xps_queue(vi->dev, mask, i, false);
i++;
}
  
diff --git a/net/core/dev.c b/net/core/dev.c

index f68122f0ab02..325fc5088370 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -2176,6 +2176,7 @@ static void netif_reset_xps_queues(struct net_device 
*dev, u16 offset,
	if (!static_key_false(&xps_needed))
return;
  
+	cpus_read_lock();

	mutex_lock(&xps_map_mutex);
  
	if (static_key_false(&xps_rxqs_needed)) {

@@ -2199,10 +2200,11 @@ static void netif_reset_xps_queues(struct net_device 
*dev, u16 offset,
  
  out_no_maps:

	if (static_key_enabled(&xps_rxqs_needed))
-   static_key_slow_dec(&xps_rxqs_needed);
+   static_key_slow_dec_cpuslocked(&xps_rxqs_needed);

-	static_key_slow_dec(&xps_needed);

+   static_key_slow_dec_cpuslocked(&xps_needed);
	mutex_unlock(&xps_map_mutex);
+   cpus_read_unlock();
  }
  
  static void netif_reset_xps_queues_gt(struct net_device *dev, u16 index)

@@ -2250,6 +2252,7 @@ static struct xps_map *expand_xps_map(struct xps_map 
*map, int attr_index,
return new_map;
  }
  
+/* Must be called under cpus_read_lock */

  int __netif_set_xps_queue(struct net_device *dev, const unsigned long *mask,
  u16 index, bool is_rxqs_map)
  {
@@ -2317,9 +2320,9 @@ int __netif_set_xps_queue(struct net_device *dev, const 
unsigned long *mask,
if (!new_dev_maps)
goto out_no_new_maps;
  
-	static_key_slow_inc(&xps_needed);

+   static_key_slow_inc_cpuslocked(&xps_needed);
	if (is_rxqs_map)
-   static_key_slow_inc(&xps_rxqs_needed);
+   static_key_slow_inc_cpuslocked(&xps_rxqs_needed);
  
  	for (j = -1; j = netif_attrmask_next(j, possible_mask, nr_ids),

 j < nr_ids;) {
@@ -2448,11 +2451,18 @@ int __netif_set_xps_queue(struct net_device *dev, const 
unsigned long *mask,
kfree(new_dev_maps);
return -ENOMEM;
  }
+EXPORT_SYMBOL_GPL(__netif_set_xps_queue);
  
  int netif_set_xps_queue(struct net_device *dev, const struct cpumask *mask,

u16 index)
  {
-   return __netif_set_xps_queue(dev, cpumask_bits(mask), index, false);
+   int ret;
+
+   cpus_read_lock();
+   ret =  __netif_set_xps_queue(dev, cpumask_bits(mask), index, false);
+   cpus_read_unlock();
+
+   return ret;
  }
  EXPORT_SYMBOL(netif_set_xps_queue);
  
diff --git a/net/core/net-sysfs.c b/net/core/net-sysfs.c

index 0a95bcf64cdc..bd67c4d0fcfd 100644
--- a/net/core/net-sysfs.c
+++ b/net/core/net-sysfs.c
@@ -26,6 +26,7 

Re: [PATCH net-next] virtio_net: force_napi_tx module param.

2018-07-30 Thread Jason Wang




On 2018年07月31日 03:51, Stephen Hemminger wrote:

On Sun, 29 Jul 2018 09:00:27 -0700 (PDT)
David Miller  wrote:


From: Caleb Raitto 
Date: Mon, 23 Jul 2018 16:11:19 -0700


From: Caleb Raitto 

The driver disables tx napi if it's not certain that completions will
be processed affine with tx service.

Its heuristic doesn't account for some scenarios where it is, such as
when the queue pair count matches the core but not hyperthread count.

Allow userspace to override the heuristic. This is an alternative
solution to that in the linked patch. That added more logic in the
kernel for these cases, but the agreement was that this was better left
to user control.

Do not expand the existing napi_tx variable to a ternary value,
because doing so can break user applications that expect
boolean ('Y'/'N') instead of integer output. Add a new param instead.

Link: https://patchwork.ozlabs.org/patch/725249/
Acked-by: Willem de Bruijn 
Acked-by: Jon Olson 
Signed-off-by: Caleb Raitto 

So I looked into the history surrounding these issues.

First of all, it's always ends up turning out crummy when drivers start
to set affinities themselves.  The worst possible case is to do it
_conditionally_, and that is exactly what virtio_net is doing.

 From the user's perspective, this provides a really bad experience.

So if I have a 32-queue device and there are 32 cpus, you'll do all
the affinity settings, stopping Irqbalanced from doing anything
right?

So if I add one more cpu, you'll say "oops, no idea what to do in
this situation" and not touch the affinities at all?

That makes no sense at all.

If the driver is going to set affinities at all, OWN that decision
and set it all the time to something reasonable.

Or accept that you shouldn't be touching this stuff in the first place
and leave the affinities alone.

Right now we're kinda in a situation where the driver has been setting
affinities in the ncpus==nqueues cases for some time, so we can't stop
doing it.

Which means we have to set them in all cases to make the user
experience sane again.

I looked at the linked to patch again:

https://patchwork.ozlabs.org/patch/725249/

And I think the strategy should be made more generic, to get rid of
the hyperthreading assumptions.  I also agree that the "assign
to first N cpus" logic doesn't make much sense either.

Just distribute across the available cpus evenly, and be done with it.
If you have 64 cpus and 32 queues, this assigns queues to every other
cpu.

Then we don't need this weird new module parameter.
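
A sketch of that even spreading for virtnet_set_affinity() (illustrative
only; the stride handling and helpers may differ in the real patch):

	int stride = max_t(int, num_online_cpus() / vi->curr_queue_pairs, 1);
	int i, j, cpu = cpumask_first(cpu_online_mask);

	for (i = 0; i < vi->curr_queue_pairs; i++) {
		virtqueue_set_affinity(vi->rq[i].vq, cpu);
		virtqueue_set_affinity(vi->sq[i].vq, cpu);
		netif_set_xps_queue(vi->dev, cpumask_of(cpu), i);

		/* advance 'stride' online cpus, wrapping around */
		for (j = 0; j < stride; j++) {
			cpu = cpumask_next(cpu, cpu_online_mask);
			if (cpu >= nr_cpu_ids)
				cpu = cpumask_first(cpu_online_mask);
		}
	}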

I wonder if it would be possible to give irqbalanced hints
with irq_set_affinity_hint instead of doing direct affinity setting?


We do use hints instead of affinity directly.

See vp_set_vq_affinity(), which did:

    if (vp_dev->msix_enabled) {
        mask = vp_dev->msix_affinity_masks[info->msix_vector];
        irq = pci_irq_vector(vp_dev->pci_dev, info->msix_vector);
        if (cpu == -1)
            irq_set_affinity_hint(irq, NULL);
        else {
            cpumask_clear(mask);
            cpumask_set_cpu(cpu, mask);
            irq_set_affinity_hint(irq, mask);
        }
    }

Thanks.



Re: [PATCH net-next] virtio_net: force_napi_tx module param.

2018-07-30 Thread Jason Wang




On 2018年07月25日 08:17, Jon Olson wrote:

On Tue, Jul 24, 2018 at 3:46 PM Michael S. Tsirkin  wrote:

On Tue, Jul 24, 2018 at 06:31:54PM -0400, Willem de Bruijn wrote:

On Tue, Jul 24, 2018 at 6:23 PM Michael S. Tsirkin  wrote:

On Tue, Jul 24, 2018 at 04:52:53PM -0400, Willem de Bruijn wrote:

From the above linked patch, I understand that there are yet
other special cases in production, such as a hard cap on #tx queues to
32 regardless of number of vcpus.

I don't think upstream kernels have this limit - we can
now use vmalloc for higher number of queues.

Yes. that patch* mentioned it as a google compute engine imposed
limit. It is exactly such cloud provider imposed rules that I'm
concerned about working around in upstream drivers.

* for reference, I mean https://patchwork.ozlabs.org/patch/725249/

Yea. Why does GCE do it btw?

There are a few reasons for the limit, some historical, some current.

Historically we did this because of a kernel limit on the number of
TAP queues (in Montreal I thought this limit was 32). To my chagrin,
the limit upstream at the time we did it was actually eight. We had
increased the limit from eight to 32 internally, and it appears in
upstream it has subsequently increased upstream to 256. We no longer
use TAP for networking, so that constraint no longer applies for us,
but when looking at removing/raising the limit we discovered no
workloads that clearly benefited from lifting it, and it also placed
more pressure on our virtual networking stack particularly on the Tx
side. We left it as-is.

In terms of current reasons there are really two. One is memory usage.
As you know, virtio-net uses rx/tx pairs, so there's an expectation
that the guest will have an Rx queue for every Tx queue. We run our
individual virtqueues fairly deep (4096 entries) to give guests a wide
time window for re-posting Rx buffers and avoiding starvation on
packet delivery. Filling an Rx vring with max-sized mergeable buffers
(4096 bytes) is 16MB of GFP_ATOMIC allocations. At 32 queues this can
be up to 512MB of memory posted for network buffers. Scaling this to
the largest VM GCE offers today (160 VCPUs -- n1-ultramem-160) keeping
all of the Rx rings full would (in the large average Rx packet size
case) consume up to 2.5 GB(!) of guest RAM. Now, those VMs have 3.8T
of RAM available, but I don't believe we've observed a situation where
they would have benefited from having 2.5 gigs of buffers posted for
incoming network traffic :)


We can work to have async txq and rxq instead of pairs if there's a
strong requirement.




The second reason is interrupt related -- as I mentioned above, we
have found no workloads that clearly benefit from so many queues, but
we have found workloads that degrade. In particular workloads that do
a lot of small packet processing but which aren't extremely latency
sensitive can achieve higher PPS by taking fewer interrupt across
fewer VCPUs due to better batching (this also incurs higher latency,
but at the limit the "busy" cores end up suppressing most interrupts
and spending most of their cycles farming out work). Memcache is a
good example here, particularly if the latency targets for request
completion are in the ~milliseconds range (rather than the
microseconds we typically strive for with TCP_RR-style workloads).

All of that said, we haven't been forthcoming with data (and
unfortunately I don't have it handy in a useful form, otherwise I'd
simply post it here), so I understand the hesitation to simply run
with napi_tx across the board. As Willem said, this patch seemed like
the least disruptive way to allow us to continue down the road of
"universal" NAPI Tx and to hopefully get data across enough workloads
(with VMs small, large, and absurdly large :) to present a compelling
argument in one direction or another. As far as I know there aren't
currently any NAPI related ethtool commands (based on a quick perusal
of ethtool.h)


As I suggested before, maybe we can (ab)use tx-frames-irq.

Thanks


-- it seems like it would be fairly involved/heavyweight
to plumb one solely for this unless NAPI Tx is something many users
will want to tune (and for which other drivers would support tuning).

--
Jon Olson




Re: [PATCH net] tun: Fix use-after-free on XDP_TX

2018-07-12 Thread Jason Wang




On 2018年07月13日 12:24, Toshiaki Makita wrote:

On XDP_TX we need to free up the frame only when tun_xdp_tx() returns a
negative value. A positive value indicates that the packet is
successfully enqueued to the ptr_ring, so freeing the page causes
use-after-free.

Fixes: 735fc4054b3a ("xdp: change ndo_xdp_xmit API to support bulking")
Signed-off-by: Toshiaki Makita 
---
  drivers/net/tun.c | 2 +-
  1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/net/tun.c b/drivers/net/tun.c
index a192a01..f5727ba 100644
--- a/drivers/net/tun.c
+++ b/drivers/net/tun.c
@@ -1688,7 +1688,7 @@ static struct sk_buff *tun_build_skb(struct tun_struct 
*tun,
case XDP_TX:
get_page(alloc_frag->page);
alloc_frag->offset += buflen;
-   if (tun_xdp_tx(tun->dev, &xdp))
+   if (tun_xdp_tx(tun->dev, &xdp) < 0)
goto err_redirect;
rcu_read_unlock();
local_bh_enable();


Acked-by: Jason Wang 


Re: [PATCH vhost] vhost_net: Fix too many vring kick on busypoll

2018-07-01 Thread Jason Wang




On 2018年07月02日 10:52, Toshiaki Makita wrote:

On 2018/07/02 11:41, Jason Wang wrote:

On 2018年06月30日 00:38, Michael S. Tsirkin wrote:

On Fri, Jun 29, 2018 at 05:09:50PM +0900, Toshiaki Makita wrote:

Under heavy load vhost busypoll may run without suppressing
notification. For example tx zerocopy callback can push tx work while
handle_tx() is running, then busyloop exits due to vhost_has_work()
condition and enables notification but immediately reenters handle_tx()
because the pushed work was tx. In this case handle_tx() tries to
disable notification again, but when using event_idx it by design
cannot. Then busyloop will run without suppressing notification.
Another example is the case where handle_tx() tries to enable
notification but avail idx is advanced so disables it again. This case
also lead to the same situation with event_idx.

The problem is that once we enter this situation busyloop does not work
under heavy load for considerable amount of time, because notification
is likely to happen during busyloop and handle_tx() immediately enables
notification after notification happens. Specifically busyloop detects
notification by vhost_has_work() and then handle_tx() calls
vhost_enable_notify().

I'd like to understand the problem a bit better.
Why does this happen?
Doesn't this only happen if ring is empty?


My understanding is:

vhost_zerocopy_callback() tries to poll the vhost virtqueue. This will cause
the busy loop in vhost_net_tx_get_vq_desc() to exit because
vhost_has_work() returns true. Then handle_tx() tends to enable
notification. Then the guest may kick us even if handle_tx() calls
vhost_disable_notify(), which in fact does nothing for event idx.

Yes.


Maybe we can try to call vhost_zerocopy_signal_used() if we find
there are pending used entries from zerocopy instead.

Note that even when zerocopy is disabled the problem happens as I wrote.
When vhost_enable_notify() detects avail_idx advanced it tries to
disable notification again but it fails.



Yes, and the main reason is need_resched() and rx work. (polling RX will 
be addressed by Tonghao's patch I think).


Thanks


Re: [PATCH vhost] vhost_net: Fix too many vring kick on busypoll

2018-07-01 Thread Jason Wang




On 2018年07月02日 10:45, Toshiaki Makita wrote:

Hi Jason,

On 2018/06/29 18:30, Jason Wang wrote:

On 2018年06月29日 16:09, Toshiaki Makita wrote:

...

To fix this, poll the work instead of enabling notification when
busypoll is interrupted by something. IMHO signal_pending() and
vhost_has_work() are kind of interruptions rather than signals to
completely cancel the busypoll, so let's run busypoll after the
necessary work is done. In order to avoid too long busyloop due to
interruption, save the endtime in vq field and use it when reentering
the work function.

I think we don't care about a long busyloop unless e.g. tx can starve rx?

I just want to keep it user-controllable. Unless we memorize it, busypoll
can run unexpectedly long.


I think the total amount of time for busy polling is bounded. If I'm
wrong, it should be a bug somewhere.





I observed this problem makes tx performance poor but does not rx. I
guess it is because rx notification from socket is not that heavy so
did not impact the performance even when we failed to mask the
notification.

I think the reason is:

For tx, we may only process used ring under handle_tx(). Busy polling
code can not recognize this case.
For rx, we call peek_head_len() after exiting busy loop, so if the work
is for rx, it can be processed in handle_rx() immediately.

Sorry but I don't see the difference. Tx busypoll calls
vhost_get_vq_desc() immediately after busypoll exits in
vhost_net_tx_get_vq_desc().


Yes, so the problem in the zerocopy callback case is that we don't poll the
used ring (done_idx != upend_idx). Maybe we can try to add this and avoid the
check of vhost_has_work().
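
Roughly, the idea would be something like this in the tx busy loop (the
placement and condition are a sketch, not a tested patch):

	/* Drain zerocopy completions inside the busy loop instead of letting
	 * the queued work break us out via vhost_has_work(). */
	if (nvq->done_idx != nvq->upend_idx)
		vhost_zerocopy_signal_used(net, vq);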




The scenario I described above (in 2nd paragraph) is a bit simplified.
To be clearer what I think is happening is:

1. handle_tx() runs busypoll with notification enabled (due to zerocopy
callback or something).


I think it is because we enable notification when we exit handle_tx().


2. Guest produces a packet in vring.
3. handle_tx() busypoll detects the produced packet and get it.
4. While vhost is processing the packet, guest kicks vring because it
produced the packet. Vring notification is disabled automatically by
event_idx and tx work is queued.
5. After processing the packet vhost enters busyloop again, but detects
tx work and exits busyloop immediately. Then handle_tx() exits after
enabling the notification.
6. Because the queued work was tx, handle_tx() is called immediately
again. handle_tx() runs busypoll with notification enabled.
7. Repeat 2-6.


Exactly what I understand.




Anyway for consistency I fixed rx routine as well as tx.

Performance numbers:

- Bulk transfer from guest to external physical server.
  [Guest]->vhost_net->tap--(XDP_REDIRECT)-->i40e --(wire)--> [Server]

Just to confirm: in this case, since zerocopy is enabled, we in fact
use the generic XDP datapath?

For some reason zerocopy was not applied for most packets, so in most
cases driver XDP was used. I was going to dig into it but not yet.


Right, just to confirm this. This is expected.

In tuntap, we do native XDP only for small and non-zerocopy packets; see
tun_can_build_skb(). The reason is that XDP may adjust the packet header,
which is not supported by zerocopy. We can only use generic XDP for zerocopy
in this case.
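
For context, the gatekeeping in tun_can_build_skb() at that time looked
roughly like this (paraphrased from drivers/net/tun.c):

static bool tun_can_build_skb(struct tun_struct *tun, struct tun_file *tfile,
			      int len, int noblock, bool zerocopy)
{
	if ((tun->flags & TUN_TYPE_MASK) != IFF_TAP)
		return false;
	if (tfile->socket.sk->sk_sndbuf != INT_MAX)
		return false;
	if (!noblock)
		return false;
	if (zerocopy)		/* zerocopy pages must not be touched by XDP */
		return false;
	if (SKB_DATA_ALIGN(len + TUN_RX_PAD) +
	    SKB_DATA_ALIGN(sizeof(struct skb_shared_info)) > PAGE_SIZE)
		return false;	/* too big for the single-page build_skb path */

	return true;
}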





- Set 10us busypoll.
- Guest disables checksum and TSO because of host XDP.
- Measured single flow Mbps by netperf, and kicks by perf kvm stat
    (EPT_MISCONFIG event).

  Before  After
    Mbps  kicks/s  Mbps  kicks/s
UDP_STREAM 1472byte  247758 27
  Send   3645.37    6958.10
  Recv   3588.56    6958.10
    1byte    9865 37
  Send  4.34   5.43
  Recv  4.17   5.26
TCP_STREAM 8801.03    45794   9592.77 2884

Impressive numbers.


Signed-off-by: Toshiaki Makita 
---

...

+static bool vhost_busy_poll_interrupted(struct vhost_dev *dev)
+{
+    return unlikely(signal_pending(current)) || vhost_has_work(dev);
   }
     static void vhost_net_disable_vq(struct vhost_net *n,
@@ -437,10 +438,21 @@ static int vhost_net_tx_get_vq_desc(struct
vhost_net *net,
     if (r == vq->num && vq->busyloop_timeout) {
   preempt_disable();
-    endtime = busy_clock() + vq->busyloop_timeout;
-    while (vhost_can_busy_poll(vq->dev, endtime) &&
-   vhost_vq_avail_empty(vq->dev, vq))
+    if (vq->busyloop_endtime) {
+    endtime = vq->busyloop_endtime;
+    vq->busyloop_endtime = 0;

Looks like endtime may be before the current time, so we skip the busy loop
in this case.

Sure, I'll add a condition.


+    } else {
+    endtime = busy_clock() + vq->busyloop_timeout;
+    }
+    while (vhost_can_busy_poll(endt

Re: [PATCH 3/3] virtio_net: split XDP_TX kick and XDP_REDIRECT map flushing

2018-06-27 Thread Jason Wang
 send_queue *sq;
unsigned int received, qp;
-   bool xdp_xmit = false;
+   unsigned int xdp_xmit = 0;
  
  	virtnet_poll_cleantx(rq);
  
@@ -1331,12 +1337,14 @@ static int virtnet_poll(struct napi_struct *napi, int budget)

if (received < budget)
virtqueue_napi_complete(napi, rq->vq, received);
  
-	if (xdp_xmit) {

+   if (xdp_xmit & VIRTIO_XDP_REDIR)
+   xdp_do_flush_map();
+
+   if (xdp_xmit & VIRTIO_XDP_TX) {
qp = vi->curr_queue_pairs - vi->xdp_queue_pairs +
 smp_processor_id();
	sq = &vi->sq[qp];
virtqueue_kick(sq->vq);
-       xdp_do_flush_map();
}
  
  	return received;




Acked-by: Jason Wang 

Thanks


[PATCH net] vhost_net: flush batched heads before trying to busy polling

2018-05-29 Thread Jason Wang
After commit e2b3b35eb989 ("vhost_net: batch used ring update in rx"),
we tend to batch updates of used heads. But we don't flush batched
heads before trying to do busy polling; this will cause vhost to wait
for guest TX, which in turn waits for the used RX. Fix by flushing
batched heads before the busy loop.

1 byte TCP_RR performance recovers from 13107.83 to 50402.65.

Fixes: e2b3b35eb989 ("vhost_net: batch used ring update in rx")
Signed-off-by: Jason Wang 
---
 drivers/vhost/net.c | 37 -
 1 file changed, 24 insertions(+), 13 deletions(-)

diff --git a/drivers/vhost/net.c b/drivers/vhost/net.c
index 986058a..eeaf673 100644
--- a/drivers/vhost/net.c
+++ b/drivers/vhost/net.c
@@ -105,7 +105,9 @@ struct vhost_net_virtqueue {
/* vhost zerocopy support fields below: */
/* last used idx for outstanding DMA zerocopy buffers */
int upend_idx;
-   /* first used idx for DMA done zerocopy buffers */
+   /* For TX, first used idx for DMA done zerocopy buffers
+* For RX, number of batched heads
+*/
int done_idx;
/* an array of userspace buffers info */
struct ubuf_info *ubuf_info;
@@ -626,6 +628,18 @@ static int sk_has_rx_data(struct sock *sk)
	return skb_queue_empty(&sk->sk_receive_queue);
 }
 
+static void vhost_rx_signal_used(struct vhost_net_virtqueue *nvq)
+{
+   struct vhost_virtqueue *vq = &nvq->vq;
+   struct vhost_dev *dev = vq->dev;
+
+   if (!nvq->done_idx)
+   return;
+
+   vhost_add_used_and_signal_n(dev, vq, vq->heads, nvq->done_idx);
+   nvq->done_idx = 0;
+}
+
 static int vhost_net_rx_peek_head_len(struct vhost_net *net, struct sock *sk)
 {
	struct vhost_net_virtqueue *rvq = &net->vqs[VHOST_NET_VQ_RX];
@@ -635,6 +649,8 @@ static int vhost_net_rx_peek_head_len(struct vhost_net 
*net, struct sock *sk)
int len = peek_head_len(rvq, sk);
 
if (!len && vq->busyloop_timeout) {
+   /* Flush batched heads first */
+   vhost_rx_signal_used(rvq);
/* Both tx vq and rx socket were polled here */
mutex_lock_nested(>mutex, 1);
vhost_disable_notify(>dev, vq);
@@ -762,7 +778,7 @@ static void handle_rx(struct vhost_net *net)
};
size_t total_len = 0;
int err, mergeable;
-   s16 headcount, nheads = 0;
+   s16 headcount;
size_t vhost_hlen, sock_hlen;
size_t vhost_len, sock_len;
struct socket *sock;
@@ -790,8 +806,8 @@ static void handle_rx(struct vhost_net *net)
while ((sock_len = vhost_net_rx_peek_head_len(net, sock->sk))) {
sock_len += sock_hlen;
vhost_len = sock_len + vhost_hlen;
-   headcount = get_rx_bufs(vq, vq->heads + nheads, vhost_len,
-   &in, vq_log, &log,
+   headcount = get_rx_bufs(vq, vq->heads + nvq->done_idx,
+   vhost_len, &in, vq_log, &log,
likely(mergeable) ? UIO_MAXIOV : 1);
/* On error, stop handling until the next kick. */
if (unlikely(headcount < 0))
@@ -862,12 +878,9 @@ static void handle_rx(struct vhost_net *net)
vhost_discard_vq_desc(vq, headcount);
goto out;
}
-   nheads += headcount;
-   if (nheads > VHOST_RX_BATCH) {
-   vhost_add_used_and_signal_n(&net->dev, vq, vq->heads,
-   nheads);
-   nheads = 0;
-   }
+   nvq->done_idx += headcount;
+   if (nvq->done_idx > VHOST_RX_BATCH)
+   vhost_rx_signal_used(nvq);
if (unlikely(vq_log))
vhost_log_write(vq, vq_log, log, vhost_len);
total_len += vhost_len;
@@ -878,9 +891,7 @@ static void handle_rx(struct vhost_net *net)
}
vhost_net_enable_vq(net, vq);
 out:
-   if (nheads)
-   vhost_add_used_and_signal_n(&net->dev, vq, vq->heads,
-   nheads);
+   vhost_rx_signal_used(nvq);
	mutex_unlock(&vq->mutex);
 }
 
-- 
2.7.4



Re: [RFC v5 3/5] virtio_ring: add packed ring support

2018-05-29 Thread Jason Wang




On 2018年05月29日 13:11, Tiwei Bie wrote:

On Tue, May 29, 2018 at 11:18:57AM +0800, Jason Wang wrote:

On 2018年05月22日 16:16, Tiwei Bie wrote:

[...]

+static void detach_buf_packed(struct vring_virtqueue *vq,
+ unsigned int id, void **ctx)
+{
+   struct vring_desc_state_packed *state;
+   struct vring_packed_desc *desc;
+   unsigned int i;
+   int *next;
+
+   /* Clear data ptr. */
+   vq->desc_state_packed[id].data = NULL;
+
+   next = &id;
+   for (i = 0; i < vq->desc_state_packed[id].num; i++) {
+   state = &vq->desc_state_packed[*next];
+   vring_unmap_state_packed(vq, state);
+   next = &vq->desc_state_packed[*next].next;

I had a short discussion with Michael. It looks like the only things we care
about here are the DMA address and length.

The length tracked by desc_state_packed[] is also necessary
when INDIRECT is used and the buffers are writable.


So a thought is to avoid using desc_state_packed[] if vring_use_dma_api() is
false? We would probably need another id allocator, or just use
desc_state_packed for the free id list.

I think it's a good suggestion. Thanks!

I won't track the addr/len/flags for non-indirect descs
when vring_use_dma_api() returns false.


+   }
+
+   vq->vq.num_free += vq->desc_state_packed[id].num;
+   *next = vq->free_head;

Using a pointer seems inelegant and unnecessary here. You can just track
next as an integer, I think?

It's possible to use integer, but will need one more var
to track `prev` (desc_state_packed[prev].next == next in
this case).


Yes, please do this.
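
A possible shape of that change, reusing the field names from the patch
(sketch only):

	/* Walk the chain with plain indices; remember the tail so it can be
	 * linked back onto the free list without a pointer-to-next. */
	unsigned int prev, curr = id;

	for (i = 0; i < vq->desc_state_packed[id].num; i++) {
		vring_unmap_state_packed(vq, &vq->desc_state_packed[curr]);
		prev = curr;
		curr = vq->desc_state_packed[curr].next;
	}

	vq->vq.num_free += vq->desc_state_packed[id].num;
	vq->desc_state_packed[prev].next = vq->free_head;
	vq->free_head = id;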




+   vq->free_head = id;
+
+   if (vq->indirect) {
+   u32 len;
+
+   /* Free the indirect table, if any, now that it's unmapped. */
+   desc = vq->desc_state_packed[id].indir_desc;
+   if (!desc)
+   return;
+
+   len = vq->desc_state_packed[id].len;
+   for (i = 0; i < len / sizeof(struct vring_packed_desc); i++)
+   vring_unmap_desc_packed(vq, &desc[i]);
+
+   kfree(desc);
+   vq->desc_state_packed[id].indir_desc = NULL;
+   } else if (ctx) {
+   *ctx = vq->desc_state_packed[id].indir_desc;
+   }
   }
   static inline bool more_used_packed(const struct vring_virtqueue *vq)
   {
-   return false;
+   u16 last_used, flags;
+   u8 avail, used;
+
+   last_used = vq->last_used_idx;
+   flags = virtio16_to_cpu(vq->vq.vdev,
+   vq->vring_packed.desc[last_used].flags);
+   avail = !!(flags & VRING_DESC_F_AVAIL(1));
+   used = !!(flags & VRING_DESC_F_USED(1));
+
+   return avail == used && used == vq->used_wrap_counter;

The spec does not check avail == used? So there's probably a bug in one of
the two places.

Under what condition would avail != used but used == vq_used_wrap_counter? A
buggy device, or a device that sets the two bits in two writes?

Currently, spec doesn't forbid devices to set the status
bits as: avail != used but used == vq_used_wrap_counter.
So I think driver cannot assume this won't happen.


Right, but I could not find a situation where a device needs to do something
like this. We should either forbid this in the spec or change the
example code in the spec.


Let's see how Michael think about this.

Thanks



Re: [RFC v5 2/5] virtio_ring: support creating packed ring

2018-05-29 Thread Jason Wang




On 2018年05月29日 13:24, Tiwei Bie wrote:

On Tue, May 29, 2018 at 10:49:11AM +0800, Jason Wang wrote:

On 2018年05月22日 16:16, Tiwei Bie wrote:

This commit introduces the support for creating packed ring.
All split ring specific functions are added _split suffix.
Some necessary stubs for packed ring are also added.

Signed-off-by: Tiwei Bie 
---
   drivers/virtio/virtio_ring.c | 801 +++
   include/linux/virtio_ring.h  |   8 +-
   2 files changed, 546 insertions(+), 263 deletions(-)

diff --git a/drivers/virtio/virtio_ring.c b/drivers/virtio/virtio_ring.c
index 71458f493cf8..f5ef5f42a7cf 100644
--- a/drivers/virtio/virtio_ring.c
+++ b/drivers/virtio/virtio_ring.c
@@ -61,11 +61,15 @@ struct vring_desc_state {
struct vring_desc *indir_desc;  /* Indirect descriptor, if any. */
   };
+struct vring_desc_state_packed {
+   int next;   /* The next desc state. */
+};
+
   struct vring_virtqueue {
struct virtqueue vq;
-   /* Actual memory layout for this queue */
-   struct vring vring;
+   /* Is this a packed ring? */
+   bool packed;
/* Can we use weak barriers? */
bool weak_barriers;
@@ -87,11 +91,39 @@ struct vring_virtqueue {
/* Last used index we've seen. */
u16 last_used_idx;
-   /* Last written value to avail->flags */
-   u16 avail_flags_shadow;
+   union {
+   /* Available for split ring */
+   struct {
+   /* Actual memory layout for this queue. */
+   struct vring vring;
-   /* Last written value to avail->idx in guest byte order */
-   u16 avail_idx_shadow;
+   /* Last written value to avail->flags */
+   u16 avail_flags_shadow;
+
+   /* Last written value to avail->idx in
+* guest byte order. */
+   u16 avail_idx_shadow;
+   };
+
+   /* Available for packed ring */
+   struct {
+   /* Actual memory layout for this queue. */
+   struct vring_packed vring_packed;
+
+   /* Driver ring wrap counter. */
+   u8 avail_wrap_counter;
+
+   /* Device ring wrap counter. */
+   u8 used_wrap_counter;

How about just using a boolean?

I can change it to bool if you want.


Yes, please.



[...]

-static int vring_mapping_error(const struct vring_virtqueue *vq,
-  dma_addr_t addr)
-{
-   if (!vring_use_dma_api(vq->vq.vdev))
-   return 0;
-
-   return dma_mapping_error(vring_dma_dev(vq), addr);
-}

It looks to me that if you keep vring_mapping_error() behind
vring_unmap_one_split(), lots of changes would be unnecessary.


[...]

+   }
+   /* That should have freed everything. */
+   BUG_ON(vq->vq.num_free != vq->vring.num);
+
+   END_USE(vq);
+   return NULL;
+}

I think those copy-and-paste hunks could be avoided and the diff should
only contain the renaming of the functions. If so, it would be very welcome,
since otherwise it requires comparing the changes verbatim.

Michael suggested to lay out the code as:

XXX_split

XXX_packed

XXX wrappers

https://lkml.org/lkml/2018/4/13/410

That's why I moved some code.


I see; then there is no need to change it, but it still looks unnecessary.




+
+/*
+ * The layout for the packed ring is a continuous chunk of memory
+ * which looks like this.
+ *
+ * struct vring_packed {
+ * // The actual descriptors (16 bytes each)
+ * struct vring_packed_desc desc[num];
+ *
+ * // Padding to the next align boundary.
+ * char pad[];
+ *
+ * // Driver Event Suppression
+ * struct vring_packed_desc_event driver;
+ *
+ * // Device Event Suppression
+ * struct vring_packed_desc_event device;
+ * };
+ */
+static inline void vring_init_packed(struct vring_packed *vr, unsigned int num,
+void *p, unsigned long align)
+{
+   vr->num = num;
+   vr->desc = p;
+   vr->driver = (void *)(((uintptr_t)p + sizeof(struct vring_packed_desc)
+   * num + align - 1) & ~(align - 1));

If we choose not to go uapi, maybe we can use ALIGN() macro here?

Okay.


+   vr->device = vr->driver + 1;
+}
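
For reference, with the kernel's ALIGN() helper (align is a power of two
here) the driver area computation could be written as:

	vr->driver = (void *)ALIGN((uintptr_t)p +
				   sizeof(struct vring_packed_desc) * num,
				   align);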

[...]

+/* Only available for split ring */
   const struct vring *virtqueue_get_vring(struct virtqueue *vq)
   {

A possible issue with this is:

After commit d4674240f31f8c4289abba07d64291c6ddce51bc ("KVM: s390:
virtio-ccw revision 1 SET_VQ"), CCW tries to use
virtqueue_get_avail()/virtqueue_get_used(). Looks like a bug either here or
in the ccw code.

Do we still need to support:

include/linux/virtio.h
/*
  * Legacy accessors -- in almost all cases, these are the wrong functions
  * to use.
  */
static inline void *virtqueue_get_desc(struct virtqueue *vq)
{
 return v

Re: [RFC v5 3/5] virtio_ring: add packed ring support

2018-05-28 Thread Jason Wang




On 2018年05月22日 16:16, Tiwei Bie wrote:

This commit introduces the support (without EVENT_IDX) for
packed ring.

Signed-off-by: Tiwei Bie 
---
  drivers/virtio/virtio_ring.c | 474 ++-
  1 file changed, 468 insertions(+), 6 deletions(-)

diff --git a/drivers/virtio/virtio_ring.c b/drivers/virtio/virtio_ring.c
index f5ef5f42a7cf..eb9fd5207a68 100644
--- a/drivers/virtio/virtio_ring.c
+++ b/drivers/virtio/virtio_ring.c
@@ -62,6 +62,12 @@ struct vring_desc_state {
  };
  
  struct vring_desc_state_packed {

+   void *data; /* Data for callback. */
+   struct vring_packed_desc *indir_desc; /* Indirect descriptor, if any. */
+   int num;/* Descriptor list length. */
+   dma_addr_t addr;/* Buffer DMA addr. */
+   u32 len;/* Buffer length. */
+   u16 flags;  /* Descriptor flags. */
int next;   /* The next desc state. */
  };
  
@@ -758,6 +764,72 @@ static inline unsigned vring_size_packed(unsigned int num, unsigned long align)

& ~(align - 1)) + sizeof(struct vring_packed_desc_event) * 2;
  }
  
+static void vring_unmap_state_packed(const struct vring_virtqueue *vq,

+struct vring_desc_state_packed *state)
+{
+   u16 flags;
+
+   if (!vring_use_dma_api(vq->vq.vdev))
+   return;
+
+   flags = state->flags;
+
+   if (flags & VRING_DESC_F_INDIRECT) {
+   dma_unmap_single(vring_dma_dev(vq),
+state->addr, state->len,
+(flags & VRING_DESC_F_WRITE) ?
+DMA_FROM_DEVICE : DMA_TO_DEVICE);
+   } else {
+   dma_unmap_page(vring_dma_dev(vq),
+  state->addr, state->len,
+  (flags & VRING_DESC_F_WRITE) ?
+  DMA_FROM_DEVICE : DMA_TO_DEVICE);
+   }
+}
+
+static void vring_unmap_desc_packed(const struct vring_virtqueue *vq,
+  struct vring_packed_desc *desc)
+{
+   u16 flags;
+
+   if (!vring_use_dma_api(vq->vq.vdev))
+   return;
+
+   flags = virtio16_to_cpu(vq->vq.vdev, desc->flags);
+
+   if (flags & VRING_DESC_F_INDIRECT) {
+   dma_unmap_single(vring_dma_dev(vq),
+virtio64_to_cpu(vq->vq.vdev, desc->addr),
+virtio32_to_cpu(vq->vq.vdev, desc->len),
+(flags & VRING_DESC_F_WRITE) ?
+DMA_FROM_DEVICE : DMA_TO_DEVICE);
+   } else {
+   dma_unmap_page(vring_dma_dev(vq),
+  virtio64_to_cpu(vq->vq.vdev, desc->addr),
+  virtio32_to_cpu(vq->vq.vdev, desc->len),
+  (flags & VRING_DESC_F_WRITE) ?
+  DMA_FROM_DEVICE : DMA_TO_DEVICE);
+   }
+}
+
+static struct vring_packed_desc *alloc_indirect_packed(struct virtqueue *_vq,
+  unsigned int total_sg,
+  gfp_t gfp)
+{
+   struct vring_packed_desc *desc;
+
+   /*
+* We require lowmem mappings for the descriptors because
+* otherwise virt_to_phys will give us bogus addresses in the
+* virtqueue.
+*/
+   gfp &= ~__GFP_HIGHMEM;
+
+   desc = kmalloc(total_sg * sizeof(struct vring_packed_desc), gfp);
+
+   return desc;
+}
+
  static inline int virtqueue_add_packed(struct virtqueue *_vq,
   struct scatterlist *sgs[],
   unsigned int total_sg,
@@ -767,47 +839,437 @@ static inline int virtqueue_add_packed(struct virtqueue 
*_vq,
   void *ctx,
   gfp_t gfp)
  {
+   struct vring_virtqueue *vq = to_vvq(_vq);
+   struct vring_packed_desc *desc;
+   struct scatterlist *sg;
+   unsigned int i, n, descs_used, uninitialized_var(prev), err_idx;
+   __virtio16 uninitialized_var(head_flags), flags;
+   u16 head, avail_wrap_counter, id, cur;
+   bool indirect;
+
+   START_USE(vq);
+
+   BUG_ON(data == NULL);
+   BUG_ON(ctx && vq->indirect);
+
+   if (unlikely(vq->broken)) {
+   END_USE(vq);
+   return -EIO;
+   }
+
+#ifdef DEBUG
+   {
+   ktime_t now = ktime_get();
+
+   /* No kick or get, with .1 second between?  Warn. */
+   if (vq->last_add_time_valid)
+   WARN_ON(ktime_to_ms(ktime_sub(now, vq->last_add_time))
+   > 100);
+   vq->last_add_time = now;
+   vq->last_add_time_valid = true;
+   }
+#endif
+
+   

Re: [RFC v5 2/5] virtio_ring: support creating packed ring

2018-05-28 Thread Jason Wang




On 2018-05-22 16:16, Tiwei Bie wrote:

This commit introduces the support for creating packed ring.
All split ring specific functions are added _split suffix.
Some necessary stubs for packed ring are also added.

Signed-off-by: Tiwei Bie 
---
  drivers/virtio/virtio_ring.c | 801 +++
  include/linux/virtio_ring.h  |   8 +-
  2 files changed, 546 insertions(+), 263 deletions(-)

diff --git a/drivers/virtio/virtio_ring.c b/drivers/virtio/virtio_ring.c
index 71458f493cf8..f5ef5f42a7cf 100644
--- a/drivers/virtio/virtio_ring.c
+++ b/drivers/virtio/virtio_ring.c
@@ -61,11 +61,15 @@ struct vring_desc_state {
struct vring_desc *indir_desc;  /* Indirect descriptor, if any. */
  };
  
+struct vring_desc_state_packed {

+   int next;   /* The next desc state. */
+};
+
  struct vring_virtqueue {
struct virtqueue vq;
  
-	/* Actual memory layout for this queue */

-   struct vring vring;
+   /* Is this a packed ring? */
+   bool packed;
  
  	/* Can we use weak barriers? */

bool weak_barriers;
@@ -87,11 +91,39 @@ struct vring_virtqueue {
/* Last used index we've seen. */
u16 last_used_idx;
  
-	/* Last written value to avail->flags */

-   u16 avail_flags_shadow;
+   union {
+   /* Available for split ring */
+   struct {
+   /* Actual memory layout for this queue. */
+   struct vring vring;
  
-	/* Last written value to avail->idx in guest byte order */

-   u16 avail_idx_shadow;
+   /* Last written value to avail->flags */
+   u16 avail_flags_shadow;
+
+   /* Last written value to avail->idx in
+* guest byte order. */
+   u16 avail_idx_shadow;
+   };
+
+   /* Available for packed ring */
+   struct {
+   /* Actual memory layout for this queue. */
+   struct vring_packed vring_packed;
+
+   /* Driver ring wrap counter. */
+   u8 avail_wrap_counter;
+
+   /* Device ring wrap counter. */
+   u8 used_wrap_counter;


How about just use boolean?
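
i.e., presumably something like this (sketch of the suggested change only):

			/* Driver ring wrap counter. */
			bool avail_wrap_counter;

			/* Device ring wrap counter. */
			bool used_wrap_counter;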


+
+   /* Index of the next avail descriptor. */
+   u16 next_avail_idx;
+
+   /* Last written value to driver->flags in
+* guest byte order. */
+   u16 event_flags_shadow;
+   };
+   };
  
  	/* How to notify other side. FIXME: commonalize hcalls! */

bool (*notify)(struct virtqueue *vq);
@@ -111,11 +143,24 @@ struct vring_virtqueue {
  #endif
  
  	/* Per-descriptor state. */

-   struct vring_desc_state desc_state[];
+   union {
+   struct vring_desc_state desc_state[1];
+   struct vring_desc_state_packed desc_state_packed[1];
+   };
  };
  
  #define to_vvq(_vq) container_of(_vq, struct vring_virtqueue, vq)
  
+static inline bool virtqueue_use_indirect(struct virtqueue *_vq,

+ unsigned int total_sg)
+{
+   struct vring_virtqueue *vq = to_vvq(_vq);
+
+   /* If the host supports indirect descriptor tables, and we have multiple
+* buffers, then go indirect. FIXME: tune this threshold */
+   return (vq->indirect && total_sg > 1 && vq->vq.num_free);
+}
+
  /*
   * Modern virtio devices have feature bits to specify whether they need a
   * quirk and bypass the IOMMU. If not there, just use the DMA API.
@@ -201,8 +246,17 @@ static dma_addr_t vring_map_single(const struct 
vring_virtqueue *vq,
  cpu_addr, size, direction);
  }
  
-static void vring_unmap_one(const struct vring_virtqueue *vq,

-   struct vring_desc *desc)
+static int vring_mapping_error(const struct vring_virtqueue *vq,
+  dma_addr_t addr)
+{
+   if (!vring_use_dma_api(vq->vq.vdev))
+   return 0;
+
+   return dma_mapping_error(vring_dma_dev(vq), addr);
+}
+
+static void vring_unmap_one_split(const struct vring_virtqueue *vq,
+ struct vring_desc *desc)
  {
u16 flags;
  
@@ -226,17 +280,9 @@ static void vring_unmap_one(const struct vring_virtqueue *vq,

}
  }
  
-static int vring_mapping_error(const struct vring_virtqueue *vq,

-  dma_addr_t addr)
-{
-   if (!vring_use_dma_api(vq->vq.vdev))
-   return 0;
-
-   return dma_mapping_error(vring_dma_dev(vq), addr);
-}


It looks to me that if you keep vring_mapping_error() behind 
vring_unmap_one_split(), lots of these changes would be unnecessary.



-
-static struct vring_desc *alloc_indirect(struct virtqueue *_vq,
-unsigned int total_sg, gfp_t gfp)
+static struct vring_desc 

Re: [PATCH v2 net] tun: Fix NULL pointer dereference in XDP redirect

2018-05-28 Thread Jason Wang
55.255483] R10: 0040 R11: 8dbcf010 R12: 998b097a9000
[ 2055.262681] R13: 8dbd8c98c000 R14:  R15: 998b07607d78
[ 2055.269879] FS:  () GS:8dbd8ff0() 
knlGS:
[ 2055.278039] CS:  0010 DS:  ES:  CR0: 80050033
[ 2055.283834] CR2: 0018 CR3: 000c0c8cc005 CR4: 007626e0
[ 2055.291030] DR0:  DR1:  DR2: 
[ 2055.298227] DR3:  DR6: fffe0ff0 DR7: 0400
[ 2055.305424] PKRU: 5554
[ 2055.308153] Call Trace:
[ 2055.310624]  xdp_do_redirect+0x7b/0x380
[ 2055.314499]  tun_get_user+0x10fe/0x12a0 [tun]
[ 2055.318895]  tun_sendmsg+0x52/0x70 [tun]
[ 2055.322852]  handle_tx+0x2ad/0x5f0 [vhost_net]
[ 2055.327337]  vhost_worker+0xa5/0x100 [vhost]
[ 2055.331646]  kthread+0xf5/0x130
[ 2055.334813]  ? vhost_dev_ioctl+0x3b0/0x3b0 [vhost]
[ 2055.339646]  ? kthread_bind+0x10/0x10
[ 2055.343343]  ret_from_fork+0x35/0x40
[ 2055.346950] Code: 0e 74 15 83 f8 10 75 05 e9 e9 aa b3 ff f3 c3 0f 1f 80 00 00 00 
00 f3 c3 e9 c9 9d b3 ff 66 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 <8b> 47 18 83 
f8 0e 74 0d 83 f8 10 75 05 e9 e9 a9 b3 ff 31 c0 c3
[ 2055.366004] RIP: __xdp_map_lookup_elem+0x5/0x30 RSP: 998b07607cc0
[ 2055.372500] CR2: 0018
[ 2055.375856] ---[ end trace 2a2dcc5e9e174268 ]---
[ 2055.523626] Kernel panic - not syncing: Fatal exception
[ 2055.529796] Kernel Offset: 0x2e00 from 0x8100 (relocation 
range: 0x8000-0xbfff)
[ 2055.677539] ---[ end Kernel panic - not syncing: Fatal exception ]---

v2:
  - Removed preempt_disable/enable since local_bh_disable will prevent
preemption as well, feedback from Jason Wang.

Fixes: 761876c857cb ("tap: XDP support")
Signed-off-by: Toshiaki Makita 
---
  drivers/net/tun.c | 15 +--
  1 file changed, 9 insertions(+), 6 deletions(-)

diff --git a/drivers/net/tun.c b/drivers/net/tun.c
index 45d8077..23e9eb6 100644
--- a/drivers/net/tun.c
+++ b/drivers/net/tun.c
@@ -1650,7 +1650,7 @@ static struct sk_buff *tun_build_skb(struct tun_struct 
*tun,
else
*skb_xdp = 0;
  
-	preempt_disable();

+   local_bh_disable();
rcu_read_lock();
xdp_prog = rcu_dereference(tun->xdp_prog);
if (xdp_prog && !*skb_xdp) {
@@ -1675,7 +1675,7 @@ static struct sk_buff *tun_build_skb(struct tun_struct 
*tun,
if (err)
goto err_redirect;
rcu_read_unlock();
-   preempt_enable();
+   local_bh_enable();
return NULL;
case XDP_TX:
get_page(alloc_frag->page);
@@ -1684,7 +1684,7 @@ static struct sk_buff *tun_build_skb(struct tun_struct 
*tun,
goto err_redirect;
tun_xdp_flush(tun->dev);
rcu_read_unlock();
-   preempt_enable();
+   local_bh_enable();
return NULL;
case XDP_PASS:
delta = orig_data - xdp.data;
@@ -1703,7 +1703,7 @@ static struct sk_buff *tun_build_skb(struct tun_struct 
*tun,
skb = build_skb(buf, buflen);
if (!skb) {
rcu_read_unlock();
-   preempt_enable();
+   local_bh_enable();
return ERR_PTR(-ENOMEM);
}
  
@@ -1713,7 +1713,7 @@ static struct sk_buff *tun_build_skb(struct tun_struct *tun,

alloc_frag->offset += buflen;
  
  	rcu_read_unlock();

-   preempt_enable();
+   local_bh_enable();
  
  	return skb;
  
@@ -1721,7 +1721,7 @@ static struct sk_buff *tun_build_skb(struct tun_struct *tun,

put_page(alloc_frag->page);
  err_xdp:
rcu_read_unlock();
-   preempt_enable();
+   local_bh_enable();
this_cpu_inc(tun->pcpu_stats->rx_dropped);
return NULL;
  }
@@ -1917,16 +1917,19 @@ static ssize_t tun_get_user(struct tun_struct *tun, 
struct tun_file *tfile,
struct bpf_prog *xdp_prog;
int ret;
  
+		local_bh_disable();

rcu_read_lock();
xdp_prog = rcu_dereference(tun->xdp_prog);
if (xdp_prog) {
ret = do_xdp_generic(xdp_prog, skb);
if (ret != XDP_PASS) {
rcu_read_unlock();
+   local_bh_enable();
return total_len;
}
}
rcu_read_unlock();
+   local_bh_enable();
}
  
  	rcu_read_lock();


Acked-by: Jason Wang 

Thanks.



[RFC V5 PATCH 4/8] vhost_net: do not explicitly manipulate vhost_used_elem

2018-05-28 Thread Jason Wang
Two helpers of setting/getting used len were introduced to avoid
explicitly manipulating vhost_used_elem in zerocopy code. This will be
used to hide used_elem internals and simplify packed ring
implementation.

Signed-off-by: Jason Wang 
---
 drivers/vhost/net.c   | 11 +--
 drivers/vhost/vhost.c | 12 ++--
 drivers/vhost/vhost.h |  5 +
 3 files changed, 20 insertions(+), 8 deletions(-)

diff --git a/drivers/vhost/net.c b/drivers/vhost/net.c
index 3826f1f..30273ad 100644
--- a/drivers/vhost/net.c
+++ b/drivers/vhost/net.c
@@ -341,9 +341,10 @@ static void vhost_zerocopy_signal_used(struct vhost_net 
*net,
int j = 0;
 
for (i = nvq->done_idx; i != nvq->upend_idx; i = (i + 1) % UIO_MAXIOV) {
-   if (vq->heads[i].elem.len == VHOST_DMA_FAILED_LEN)
+   if (vhost_get_used_len(vq, &vq->heads[i]) ==
+   VHOST_DMA_FAILED_LEN)
vhost_net_tx_err(net);
-   if (VHOST_DMA_IS_DONE(vq->heads[i].elem.len)) {
+   if (VHOST_DMA_IS_DONE(vhost_get_used_len(vq, &vq->heads[i]))) {
vq->heads[i].elem.len = VHOST_DMA_CLEAR_LEN;
++j;
} else
@@ -542,10 +543,8 @@ static void handle_tx(struct vhost_net *net)
struct ubuf_info *ubuf;
ubuf = nvq->ubuf_info + nvq->upend_idx;
 
-   vq->heads[nvq->upend_idx].elem.id =
-   cpu_to_vhost32(vq, used.elem.id);
-   vq->heads[nvq->upend_idx].elem.len =
-   VHOST_DMA_IN_PROGRESS;
+   vhost_set_used_len(vq, &used, VHOST_DMA_IN_PROGRESS);
+   vq->heads[nvq->upend_idx] = used;
ubuf->callback = vhost_zerocopy_callback;
ubuf->ctx = nvq->ubufs;
ubuf->desc = nvq->upend_idx;
diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
index 2b2a776..e0fcfec 100644
--- a/drivers/vhost/vhost.c
+++ b/drivers/vhost/vhost.c
@@ -2067,11 +2067,19 @@ int vhost_get_vq_desc(struct vhost_virtqueue *vq,
 }
 EXPORT_SYMBOL_GPL(vhost_get_vq_desc);
 
-static void vhost_set_used_len(struct vhost_virtqueue *vq,
-  struct vhost_used_elem *used, int len)
+void vhost_set_used_len(struct vhost_virtqueue *vq,
+   struct vhost_used_elem *used, int len)
 {
used->elem.len = cpu_to_vhost32(vq, len);
 }
+EXPORT_SYMBOL_GPL(vhost_set_used_len);
+
+int vhost_get_used_len(struct vhost_virtqueue *vq,
+  struct vhost_used_elem *used)
+{
+   return vhost32_to_cpu(vq, used->elem.len);
+}
+EXPORT_SYMBOL_GPL(vhost_get_used_len);
 
 /* This is a multi-buffer version of vhost_get_desc, that works if
  * vq has read descriptors only.
diff --git a/drivers/vhost/vhost.h b/drivers/vhost/vhost.h
index 8dea44b..604821b 100644
--- a/drivers/vhost/vhost.h
+++ b/drivers/vhost/vhost.h
@@ -198,6 +198,11 @@ int vhost_get_bufs(struct vhost_virtqueue *vq,
   unsigned *log_num,
   unsigned int quota,
   s16 *count);
+void vhost_set_used_len(struct vhost_virtqueue *vq,
+   struct vhost_used_elem *used,
+   int len);
+int vhost_get_used_len(struct vhost_virtqueue *vq,
+  struct vhost_used_elem *used);
 void vhost_discard_vq_desc(struct vhost_virtqueue *, int n);
 
 int vhost_vq_init_access(struct vhost_virtqueue *);
-- 
2.7.4



[RFC V5 PATCH 0/8] Packed ring layout for vhost

2018-05-28 Thread Jason Wang
Hi all:

This RFC implement packed ring layout. The code were tested with
Tiwei's RFC V5 at https://lkml.org/lkml/2018/5/22/138. Some fixups and
tweaks were needed on top of Tiwei's code to make it run for event
index.

Pktgen reports about a 20% improvement in TX PPS when doing pktgen from
guest to host. No obvious improvement in RX PPS. We can do lots of
optimizations on top, but for simplicity and correctness first, this
version does not do much.

This version were tested with:

- Zerocopy (Out of Order) support
- vIOMMU support
- mergeable buffer on/off
- busy polling on/off

Notes for tester:

- Starting from this version, vhost needs qemu co-operation to work
  correctly. Alternatively, you can comment out the packed-specific code for
  GET/SET_VRING_BASE.

- Changes from V4:
- fix signalled_used index recording
- track avail index correctly
- various minor fixes

Changes from V3:
- Fix math on event idx checking
- Sync last avail wrap counter through GET/SET_VRING_BASE
- remove desc_event prefix in the driver/device structure

Changes from V2:
- do not use & in checking desc_event_flags
- off should be most significant bit
- remove the workaround of mergeable buffer for dpdk prototype
- id should be in the last descriptor in the chain
- keep _F_WRITE for write descriptor when adding used
- device flags updating should use ADDR_USED type
- return error on unexpected unavail descriptor in a chain
- return false in vhost_ve_avail_empty is descriptor is available
- track last seen avail_wrap_counter
- correctly examine available descriptor in get_indirect_packed()
- vhost_idx_diff should return u16 instead of bool

Changes from V1:

- Refactor vhost used elem code to avoid open coding on used elem
- Event suppression support (compile test only).
- Indirect descriptor support (compile test only).
- Zerocopy support.
- vIOMMU support.
- SCSI/VSOCK support (compile test only).
- Fix several bugs

Jason Wang (8):
  vhost: move get_rx_bufs to vhost.c
  vhost: hide used ring layout from device
  vhost: do not use vring_used_elem
  vhost_net: do not explicitly manipulate vhost_used_elem
  vhost: vhost_put_user() can accept metadata type
  virtio: introduce packed ring defines
  vhost: packed ring support
  vhost: event suppression for packed ring

 drivers/vhost/net.c| 144 ++
 drivers/vhost/scsi.c   |  62 +--
 drivers/vhost/vhost.c  | 926 +
 drivers/vhost/vhost.h  |  52 ++-
 drivers/vhost/vsock.c  |  42 +-
 include/uapi/linux/virtio_config.h |   9 +
 include/uapi/linux/virtio_ring.h   |  32 ++
 7 files changed, 1000 insertions(+), 267 deletions(-)

-- 
2.7.4



[RFC V5 PATCH 3/8] vhost: do not use vring_used_elem

2018-05-28 Thread Jason Wang
Instead of depending on the exported vring_used_elem, this patch
switches to use a new internal structure vhost_used_elem which embed
vring_used_elem in itself. This could be used to let vhost to record
extra metadata for the incoming packed ring layout.

Signed-off-by: Jason Wang 
---
 drivers/vhost/net.c   | 19 +++---
 drivers/vhost/scsi.c  | 10 
 drivers/vhost/vhost.c | 68 ---
 drivers/vhost/vhost.h | 18 --
 drivers/vhost/vsock.c |  6 ++---
 5 files changed, 45 insertions(+), 76 deletions(-)

diff --git a/drivers/vhost/net.c b/drivers/vhost/net.c
index 826489c..3826f1f 100644
--- a/drivers/vhost/net.c
+++ b/drivers/vhost/net.c
@@ -341,10 +341,10 @@ static void vhost_zerocopy_signal_used(struct vhost_net 
*net,
int j = 0;
 
for (i = nvq->done_idx; i != nvq->upend_idx; i = (i + 1) % UIO_MAXIOV) {
-   if (vq->heads[i].len == VHOST_DMA_FAILED_LEN)
+   if (vq->heads[i].elem.len == VHOST_DMA_FAILED_LEN)
vhost_net_tx_err(net);
-   if (VHOST_DMA_IS_DONE(vq->heads[i].len)) {
-   vq->heads[i].len = VHOST_DMA_CLEAR_LEN;
+   if (VHOST_DMA_IS_DONE(vq->heads[i].elem.len)) {
+   vq->heads[i].elem.len = VHOST_DMA_CLEAR_LEN;
++j;
} else
break;
@@ -367,7 +367,7 @@ static void vhost_zerocopy_callback(struct ubuf_info *ubuf, 
bool success)
rcu_read_lock_bh();
 
/* set len to mark this desc buffers done DMA */
-   vq->heads[ubuf->desc].len = success ?
+   vq->heads[ubuf->desc].elem.len = success ?
VHOST_DMA_DONE_LEN : VHOST_DMA_FAILED_LEN;
cnt = vhost_net_ubuf_put(ubufs);
 
@@ -426,7 +426,7 @@ static int vhost_net_enable_vq(struct vhost_net *n,
 
 static int vhost_net_tx_get_vq_desc(struct vhost_net *net,
struct vhost_virtqueue *vq,
-   struct vring_used_elem *used_elem,
+   struct vhost_used_elem *used_elem,
struct iovec iov[], unsigned int iov_size,
unsigned int *out_num, unsigned int *in_num)
 {
@@ -477,7 +477,7 @@ static void handle_tx(struct vhost_net *net)
size_t hdr_size;
struct socket *sock;
struct vhost_net_ubuf_ref *uninitialized_var(ubufs);
-   struct vring_used_elem used;
+   struct vhost_used_elem used;
bool zcopy, zcopy_used;
int sent_pkts = 0;
 
@@ -542,9 +542,10 @@ static void handle_tx(struct vhost_net *net)
struct ubuf_info *ubuf;
ubuf = nvq->ubuf_info + nvq->upend_idx;
 
-   vq->heads[nvq->upend_idx].id =
-   cpu_to_vhost32(vq, used.id);
-   vq->heads[nvq->upend_idx].len = VHOST_DMA_IN_PROGRESS;
+   vq->heads[nvq->upend_idx].elem.id =
+   cpu_to_vhost32(vq, used.elem.id);
+   vq->heads[nvq->upend_idx].elem.len =
+   VHOST_DMA_IN_PROGRESS;
ubuf->callback = vhost_zerocopy_callback;
ubuf->ctx = nvq->ubufs;
ubuf->desc = nvq->upend_idx;
diff --git a/drivers/vhost/scsi.c b/drivers/vhost/scsi.c
index 654c71f..ac11412 100644
--- a/drivers/vhost/scsi.c
+++ b/drivers/vhost/scsi.c
@@ -67,7 +67,7 @@ struct vhost_scsi_inflight {
 
 struct vhost_scsi_cmd {
/* Descriptor from vhost_get_vq_desc() for virt_queue segment */
-   struct vring_used_elem tvc_vq_used;
+   struct vhost_used_elem tvc_vq_used;
/* virtio-scsi initiator task attribute */
int tvc_task_attr;
/* virtio-scsi response incoming iovecs */
@@ -441,7 +441,7 @@ vhost_scsi_do_evt_work(struct vhost_scsi *vs, struct 
vhost_scsi_evt *evt)
struct vhost_virtqueue *vq = &vs->vqs[VHOST_SCSI_VQ_EVT].vq;
struct virtio_scsi_event *event = >event;
struct virtio_scsi_event __user *eventp;
-   struct vring_used_elem used;
+   struct vhost_used_elem used;
unsigned out, in;
int ret;
 
@@ -785,7 +785,7 @@ static void vhost_scsi_submission_work(struct work_struct 
*work)
 static void
 vhost_scsi_send_bad_target(struct vhost_scsi *vs,
   struct vhost_virtqueue *vq,
-  struct vring_used_elem *used, unsigned out)
+  struct vhost_used_elem *used, unsigned out)
 {
struct virtio_scsi_cmd_resp __user *resp;
struct virtio_scsi_cmd_resp rsp;
@@ -808,7 +808,7 @@ vhost_scsi_handle_vq(struct vhost_scsi *vs, struct 
vhost_virtqueue *vq)
struct virtio_scsi_cmd_req v_req;
struct virtio_scsi_cmd_req_pi v_req_pi;

[RFC V5 PATCH 1/8] vhost: move get_rx_bufs to vhost.c

2018-05-28 Thread Jason Wang
Move get_rx_bufs() to vhost.c and rename it to
vhost_get_bufs(). This helps to hide vring internal layout from
specific device implementation. Packed ring implementation will
benefit from this.

Signed-off-by: Jason Wang 
---
 drivers/vhost/net.c   | 83 ++-
 drivers/vhost/vhost.c | 78 +++
 drivers/vhost/vhost.h |  7 +
 3 files changed, 88 insertions(+), 80 deletions(-)

diff --git a/drivers/vhost/net.c b/drivers/vhost/net.c
index 986058a..762aa81 100644
--- a/drivers/vhost/net.c
+++ b/drivers/vhost/net.c
@@ -664,83 +664,6 @@ static int vhost_net_rx_peek_head_len(struct vhost_net 
*net, struct sock *sk)
return len;
 }
 
-/* This is a multi-buffer version of vhost_get_desc, that works if
- * vq has read descriptors only.
- * @vq - the relevant virtqueue
- * @datalen- data length we'll be reading
- * @iovcount   - returned count of io vectors we fill
- * @log- vhost log
- * @log_num- log offset
- * @quota   - headcount quota, 1 for big buffer
- * returns number of buffer heads allocated, negative on error
- */
-static int get_rx_bufs(struct vhost_virtqueue *vq,
-  struct vring_used_elem *heads,
-  int datalen,
-  unsigned *iovcount,
-  struct vhost_log *log,
-  unsigned *log_num,
-  unsigned int quota)
-{
-   unsigned int out, in;
-   int seg = 0;
-   int headcount = 0;
-   unsigned d;
-   int r, nlogs = 0;
-   /* len is always initialized before use since we are always called with
-* datalen > 0.
-*/
-   u32 uninitialized_var(len);
-
-   while (datalen > 0 && headcount < quota) {
-   if (unlikely(seg >= UIO_MAXIOV)) {
-   r = -ENOBUFS;
-   goto err;
-   }
-   r = vhost_get_vq_desc(vq, vq->iov + seg,
- ARRAY_SIZE(vq->iov) - seg, &out,
- &in, log, log_num);
-   if (unlikely(r < 0))
-   goto err;
-
-   d = r;
-   if (d == vq->num) {
-   r = 0;
-   goto err;
-   }
-   if (unlikely(out || in <= 0)) {
-   vq_err(vq, "unexpected descriptor format for RX: "
-   "out %d, in %d\n", out, in);
-   r = -EINVAL;
-   goto err;
-   }
-   if (unlikely(log)) {
-   nlogs += *log_num;
-   log += *log_num;
-   }
-   heads[headcount].id = cpu_to_vhost32(vq, d);
-   len = iov_length(vq->iov + seg, in);
-   heads[headcount].len = cpu_to_vhost32(vq, len);
-   datalen -= len;
-   ++headcount;
-   seg += in;
-   }
-   heads[headcount - 1].len = cpu_to_vhost32(vq, len + datalen);
-   *iovcount = seg;
-   if (unlikely(log))
-   *log_num = nlogs;
-
-   /* Detect overrun */
-   if (unlikely(datalen > 0)) {
-   r = UIO_MAXIOV + 1;
-   goto err;
-   }
-   return headcount;
-err:
-   vhost_discard_vq_desc(vq, headcount);
-   return r;
-}
-
 /* Expects to be always run from workqueue - which acts as
  * read-size critical section for our kind of RCU. */
 static void handle_rx(struct vhost_net *net)
@@ -790,9 +713,9 @@ static void handle_rx(struct vhost_net *net)
while ((sock_len = vhost_net_rx_peek_head_len(net, sock->sk))) {
sock_len += sock_hlen;
vhost_len = sock_len + vhost_hlen;
-   headcount = get_rx_bufs(vq, vq->heads + nheads, vhost_len,
-   &in, vq_log, &log,
-   likely(mergeable) ? UIO_MAXIOV : 1);
+   headcount = vhost_get_bufs(vq, vq->heads + nheads, vhost_len,
+  &in, vq_log, &log,
+  likely(mergeable) ? UIO_MAXIOV : 1);
/* On error, stop handling until the next kick. */
if (unlikely(headcount < 0))
goto out;
diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
index f0be5f3..096a688 100644
--- a/drivers/vhost/vhost.c
+++ b/drivers/vhost/vhost.c
@@ -2100,6 +2100,84 @@ int vhost_get_vq_desc(struct vhost_virtqueue *vq,
 }
 EXPORT_SYMBOL_GPL(vhost_get_vq_desc);
 
+/* This is a multi-buffer version of vhost_get_desc, that works if
+ * vq has read descriptors only.
+ * @vq - the relevant virtqueue
+ * @datalen- data length we'll be reading
+ * @iovcount   - returned count of io vectors we fill
+ * @log

[RFC V5 PATCH 2/8] vhost: hide used ring layout from device

2018-05-28 Thread Jason Wang
We used to return descriptor head by vhost_get_vq_desc() to device and
pass it back to vhost_add_used() and its friends. This exposes the
internal used ring layout to device which makes it hard to be extended for
e.g packed ring layout.

So this patch tries to hide the used ring layout by

- letting vhost_get_vq_desc() return pointer to struct vring_used_elem
- accepting pointer to struct vring_used_elem in vhost_add_used() and
  vhost_add_used_and_signal()

This could help to hide used ring layout and make it easier to
implement packed ring on top.

Signed-off-by: Jason Wang 
---
 drivers/vhost/net.c   | 46 +-
 drivers/vhost/scsi.c  | 62 +++
 drivers/vhost/vhost.c | 52 +-
 drivers/vhost/vhost.h |  9 +---
 drivers/vhost/vsock.c | 42 +-
 5 files changed, 112 insertions(+), 99 deletions(-)

diff --git a/drivers/vhost/net.c b/drivers/vhost/net.c
index 762aa81..826489c 100644
--- a/drivers/vhost/net.c
+++ b/drivers/vhost/net.c
@@ -426,22 +426,24 @@ static int vhost_net_enable_vq(struct vhost_net *n,
 
 static int vhost_net_tx_get_vq_desc(struct vhost_net *net,
struct vhost_virtqueue *vq,
+   struct vring_used_elem *used_elem,
struct iovec iov[], unsigned int iov_size,
unsigned int *out_num, unsigned int *in_num)
 {
unsigned long uninitialized_var(endtime);
-   int r = vhost_get_vq_desc(vq, vq->iov, ARRAY_SIZE(vq->iov),
+   int r = vhost_get_vq_desc(vq, used_elem, vq->iov, ARRAY_SIZE(vq->iov),
  out_num, in_num, NULL, NULL);
 
-   if (r == vq->num && vq->busyloop_timeout) {
+   if (r == -ENOSPC && vq->busyloop_timeout) {
preempt_disable();
endtime = busy_clock() + vq->busyloop_timeout;
while (vhost_can_busy_poll(vq->dev, endtime) &&
   vhost_vq_avail_empty(vq->dev, vq))
cpu_relax();
preempt_enable();
-   r = vhost_get_vq_desc(vq, vq->iov, ARRAY_SIZE(vq->iov),
- out_num, in_num, NULL, NULL);
+   r = vhost_get_vq_desc(vq, used_elem, vq->iov,
+ ARRAY_SIZE(vq->iov), out_num, in_num,
+ NULL, NULL);
}
 
return r;
@@ -463,7 +465,6 @@ static void handle_tx(struct vhost_net *net)
struct vhost_net_virtqueue *nvq = &net->vqs[VHOST_NET_VQ_TX];
struct vhost_virtqueue *vq = &nvq->vq;
unsigned out, in;
-   int head;
struct msghdr msg = {
.msg_name = NULL,
.msg_namelen = 0,
@@ -476,6 +477,7 @@ static void handle_tx(struct vhost_net *net)
size_t hdr_size;
struct socket *sock;
struct vhost_net_ubuf_ref *uninitialized_var(ubufs);
+   struct vring_used_elem used;
bool zcopy, zcopy_used;
int sent_pkts = 0;
 
@@ -499,20 +501,20 @@ static void handle_tx(struct vhost_net *net)
vhost_zerocopy_signal_used(net, vq);
 
 
-   head = vhost_net_tx_get_vq_desc(net, vq, vq->iov,
-   ARRAY_SIZE(vq->iov),
-   &out, &in);
-   /* On error, stop handling until the next kick. */
-   if (unlikely(head < 0))
-   break;
+   err = vhost_net_tx_get_vq_desc(net, vq, &used, vq->iov,
+  ARRAY_SIZE(vq->iov),
+  &out, &in);
/* Nothing new?  Wait for eventfd to tell us they refilled. */
-   if (head == vq->num) {
+   if (err == -ENOSPC) {
if (unlikely(vhost_enable_notify(&net->dev, vq))) {
vhost_disable_notify(&net->dev, vq);
continue;
}
break;
}
+   /* On error, stop handling until the next kick. */
+   if (unlikely(err < 0))
+   break;
if (in) {
vq_err(vq, "Unexpected descriptor format for TX: "
   "out %d, int %d\n", out, in);
@@ -540,7 +542,8 @@ static void handle_tx(struct vhost_net *net)
struct ubuf_info *ubuf;
ubuf = nvq->ubuf_info + nvq->upend_idx;
 
-   vq->heads[nvq->upend_idx].id = cpu_to_vhost32(vq, head);
+   vq->heads[nvq->upend_idx].id =
+   cpu_to_vh

[RFC V5 PATCH 6/8] virtio: introduce packed ring defines

2018-05-28 Thread Jason Wang
Signed-off-by: Jason Wang 
---
 include/uapi/linux/virtio_config.h |  9 +
 include/uapi/linux/virtio_ring.h   | 13 +
 2 files changed, 22 insertions(+)

diff --git a/include/uapi/linux/virtio_config.h 
b/include/uapi/linux/virtio_config.h
index 308e209..5903d51 100644
--- a/include/uapi/linux/virtio_config.h
+++ b/include/uapi/linux/virtio_config.h
@@ -71,4 +71,13 @@
  * this is for compatibility with legacy systems.
  */
 #define VIRTIO_F_IOMMU_PLATFORM33
+
+#define VIRTIO_F_RING_PACKED   34
+
+/*
+ * This feature indicates that all buffers are used by the device in
+ * the same order in which they have been made available.
+ */
+#define VIRTIO_F_IN_ORDER  35
+
 #endif /* _UAPI_LINUX_VIRTIO_CONFIG_H */
diff --git a/include/uapi/linux/virtio_ring.h b/include/uapi/linux/virtio_ring.h
index 6d5d5fa..e297580 100644
--- a/include/uapi/linux/virtio_ring.h
+++ b/include/uapi/linux/virtio_ring.h
@@ -43,6 +43,8 @@
 #define VRING_DESC_F_WRITE 2
 /* This means the buffer contains a list of buffer descriptors. */
 #define VRING_DESC_F_INDIRECT  4
+#define VRING_DESC_F_AVAIL  7
+#define VRING_DESC_F_USED  15
 
 /* The Host uses this in used->flags to advise the Guest: don't kick me when
  * you add a buffer.  It's unreliable, so it's simply an optimization.  Guest
@@ -62,6 +64,17 @@
  * at the end of the used ring. Guest should ignore the used->flags field. */
 #define VIRTIO_RING_F_EVENT_IDX29
 
+struct vring_desc_packed {
+   /* Buffer Address. */
+   __virtio64 addr;
+   /* Buffer Length. */
+   __virtio32 len;
+   /* Buffer ID. */
+   __virtio16 id;
+   /* The flags depending on descriptor type. */
+   __virtio16 flags;
+};
+
 /* Virtio ring descriptors: 16 bytes.  These can chain together via "next". */
 struct vring_desc {
/* Address (guest-physical). */
-- 
2.7.4
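
To illustrate how these two bit positions are meant to be used (my sketch, based
on the packed ring proposal rather than on this patch): the driver sets the
AVAIL bit equal to its ring wrap counter and the USED bit to its inverse when
making a descriptor available; the device later writes both bits equal to its
own wrap counter when marking the descriptor used.

/* Sketch only: not part of this patch. */
static inline u16 packed_desc_avail_flags(u16 flags, bool wrap_counter)
{
	flags &= ~((1 << VRING_DESC_F_AVAIL) | (1 << VRING_DESC_F_USED));
	flags |= (u16)wrap_counter << VRING_DESC_F_AVAIL;
	flags |= (u16)!wrap_counter << VRING_DESC_F_USED;
	return flags;
}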



[RFC V5 PATCH 8/8] vhost: event suppression for packed ring

2018-05-28 Thread Jason Wang
This patch introduces basic support for event suppression aka driver
and device area.

Signed-off-by: Jason Wang 
---
 drivers/vhost/vhost.c| 191 ---
 drivers/vhost/vhost.h|  10 +-
 include/uapi/linux/virtio_ring.h |  19 
 3 files changed, 204 insertions(+), 16 deletions(-)

diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
index a36e5ad2..112f680 100644
--- a/drivers/vhost/vhost.c
+++ b/drivers/vhost/vhost.c
@@ -1112,10 +1112,15 @@ static int vq_access_ok_packed(struct vhost_virtqueue 
*vq, unsigned int num,
   struct vring_used __user *used)
 {
struct vring_desc_packed *packed = (struct vring_desc_packed *)desc;
+   struct vring_packed_desc_event *driver_event =
+   (struct vring_packed_desc_event *)avail;
+   struct vring_packed_desc_event *device_event =
+   (struct vring_packed_desc_event *)used;
 
-   /* FIXME: check device area and driver area */
return access_ok(VERIFY_READ, packed, num * sizeof(*packed)) &&
-  access_ok(VERIFY_WRITE, packed, num * sizeof(*packed));
+  access_ok(VERIFY_WRITE, packed, num * sizeof(*packed)) &&
+  access_ok(VERIFY_READ, driver_event, sizeof(*driver_event)) &&
+  access_ok(VERIFY_WRITE, device_event, sizeof(*device_event));
 }
 
 static int vq_access_ok_split(struct vhost_virtqueue *vq, unsigned int num,
@@ -1190,14 +1195,27 @@ static bool iotlb_access_ok(struct vhost_virtqueue *vq,
return true;
 }
 
-int vq_iotlb_prefetch(struct vhost_virtqueue *vq)
+int vq_iotlb_prefetch_packed(struct vhost_virtqueue *vq)
+{
+   int num = vq->num;
+
+   return iotlb_access_ok(vq, VHOST_ACCESS_RO, (u64)(uintptr_t)vq->desc,
+  num * sizeof(*vq->desc), VHOST_ADDR_DESC) &&
+  iotlb_access_ok(vq, VHOST_ACCESS_WO, (u64)(uintptr_t)vq->desc,
+  num * sizeof(*vq->desc), VHOST_ADDR_DESC) &&
+  iotlb_access_ok(vq, VHOST_ACCESS_RO,
+  (u64)(uintptr_t)vq->driver_event,
+  sizeof(*vq->driver_event), VHOST_ADDR_AVAIL) &&
+  iotlb_access_ok(vq, VHOST_ACCESS_WO,
+  (u64)(uintptr_t)vq->device_event,
+  sizeof(*vq->device_event), VHOST_ADDR_USED);
+}
+
+int vq_iotlb_prefetch_split(struct vhost_virtqueue *vq)
 {
size_t s = vhost_has_feature(vq, VIRTIO_RING_F_EVENT_IDX) ? 2 : 0;
unsigned int num = vq->num;
 
-   if (!vq->iotlb)
-   return 1;
-
return iotlb_access_ok(vq, VHOST_ACCESS_RO, (u64)(uintptr_t)vq->desc,
   num * sizeof(*vq->desc), VHOST_ADDR_DESC) &&
   iotlb_access_ok(vq, VHOST_ACCESS_RO, (u64)(uintptr_t)vq->avail,
@@ -1209,6 +1227,17 @@ int vq_iotlb_prefetch(struct vhost_virtqueue *vq)
   num * sizeof(*vq->used->ring) + s,
   VHOST_ADDR_USED);
 }
+
+int vq_iotlb_prefetch(struct vhost_virtqueue *vq)
+{
+   if (!vq->iotlb)
+   return 1;
+
+   if (vhost_has_feature(vq, VIRTIO_F_RING_PACKED))
+   return vq_iotlb_prefetch_packed(vq);
+   else
+   return vq_iotlb_prefetch_split(vq);
+}
 EXPORT_SYMBOL_GPL(vq_iotlb_prefetch);
 
 /* Can we log writes? */
@@ -1730,6 +1759,50 @@ static int vhost_update_used_flags(struct 
vhost_virtqueue *vq)
return 0;
 }
 
+static int vhost_update_device_flags(struct vhost_virtqueue *vq,
+__virtio16 device_flags)
+{
+   void __user *flags;
+
+   if (vhost_put_user(vq, device_flags, &vq->device_event->flags,
+  VHOST_ADDR_USED) < 0)
+   return -EFAULT;
+   if (unlikely(vq->log_used)) {
+   /* Make sure the flag is seen before log. */
+   smp_wmb();
+   /* Log used flag write. */
+   flags = &vq->device_event->flags;
+   log_write(vq->log_base, vq->log_addr +
+ (flags - (void __user *)vq->device_event),
+ sizeof(vq->device_event->flags));
+   if (vq->log_ctx)
+   eventfd_signal(vq->log_ctx, 1);
+   }
+   return 0;
+}
+
+static int vhost_update_device_off_wrap(struct vhost_virtqueue *vq,
+   __virtio16 device_off_wrap)
+{
+   void __user *off_wrap;
+
+   if (vhost_put_user(vq, device_off_wrap, &vq->device_event->off_wrap,
+  VHOST_ADDR_USED) < 0)
+   return -EFAULT;
+   if (unlikely(vq->log_used)) {
+   /* Make sure the flag is seen before log. */
+   smp_wmb();
+   /* Log used

[RFC V5 PATCH 7/8] vhost: packed ring support

2018-05-28 Thread Jason Wang
Signed-off-by: Jason Wang 
---
 drivers/vhost/net.c   |  13 +-
 drivers/vhost/vhost.c | 585 ++
 drivers/vhost/vhost.h |  13 +-
 3 files changed, 566 insertions(+), 45 deletions(-)

diff --git a/drivers/vhost/net.c b/drivers/vhost/net.c
index 30273ad..4991aa4 100644
--- a/drivers/vhost/net.c
+++ b/drivers/vhost/net.c
@@ -71,7 +71,8 @@ enum {
VHOST_NET_FEATURES = VHOST_FEATURES |
 (1ULL << VHOST_NET_F_VIRTIO_NET_HDR) |
 (1ULL << VIRTIO_NET_F_MRG_RXBUF) |
-(1ULL << VIRTIO_F_IOMMU_PLATFORM)
+(1ULL << VIRTIO_F_IOMMU_PLATFORM) |
+(1ULL << VIRTIO_F_RING_PACKED)
 };
 
 enum {
@@ -576,7 +577,7 @@ static void handle_tx(struct vhost_net *net)
nvq->upend_idx = ((unsigned)nvq->upend_idx - 1)
% UIO_MAXIOV;
}
-   vhost_discard_vq_desc(vq, 1);
+   vhost_discard_vq_desc(vq, &used, 1);
vhost_net_enable_vq(net, vq);
break;
}
@@ -714,9 +715,11 @@ static void handle_rx(struct vhost_net *net)
mergeable = vhost_has_feature(vq, VIRTIO_NET_F_MRG_RXBUF);
 
while ((sock_len = vhost_net_rx_peek_head_len(net, sock->sk))) {
+   struct vhost_used_elem *used = vq->heads + nheads;
+
sock_len += sock_hlen;
vhost_len = sock_len + vhost_hlen;
-   err = vhost_get_bufs(vq, vq->heads + nheads, vhost_len,
+   err = vhost_get_bufs(vq, used, vhost_len,
 &in, vq_log, &log,
 likely(mergeable) ? UIO_MAXIOV : 1,
 &headcount);
@@ -762,7 +765,7 @@ static void handle_rx(struct vhost_net *net)
if (unlikely(err != sock_len)) {
pr_debug("Discarded rx packet: "
 " len %d, expected %zd\n", err, sock_len);
-   vhost_discard_vq_desc(vq, headcount);
+   vhost_discard_vq_desc(vq, used, 1);
continue;
}
/* Supply virtio_net_hdr if VHOST_NET_F_VIRTIO_NET_HDR */
@@ -786,7 +789,7 @@ static void handle_rx(struct vhost_net *net)
copy_to_iter(&num_buffers, sizeof num_buffers,
 &fixup) != sizeof num_buffers) {
vq_err(vq, "Failed num_buffers write");
-   vhost_discard_vq_desc(vq, headcount);
+   vhost_discard_vq_desc(vq, used, 1);
goto out;
}
nheads += headcount;
diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
index 4031a8f..a36e5ad2 100644
--- a/drivers/vhost/vhost.c
+++ b/drivers/vhost/vhost.c
@@ -323,6 +323,9 @@ static void vhost_vq_reset(struct vhost_dev *dev,
vhost_reset_is_le(vq);
vhost_disable_cross_endian(vq);
vq->busyloop_timeout = 0;
+   vq->used_wrap_counter = true;
+   vq->last_avail_wrap_counter = true;
+   vq->avail_wrap_counter = true;
vq->umem = NULL;
vq->iotlb = NULL;
__vhost_vq_meta_reset(vq);
@@ -1103,11 +1106,22 @@ static int vhost_iotlb_miss(struct vhost_virtqueue *vq, 
u64 iova, int access)
return 0;
 }
 
-static bool vq_access_ok(struct vhost_virtqueue *vq, unsigned int num,
-struct vring_desc __user *desc,
-struct vring_avail __user *avail,
-struct vring_used __user *used)
+static int vq_access_ok_packed(struct vhost_virtqueue *vq, unsigned int num,
+  struct vring_desc __user *desc,
+  struct vring_avail __user *avail,
+  struct vring_used __user *used)
+{
+   struct vring_desc_packed *packed = (struct vring_desc_packed *)desc;
+
+   /* FIXME: check device area and driver area */
+   return access_ok(VERIFY_READ, packed, num * sizeof(*packed)) &&
+  access_ok(VERIFY_WRITE, packed, num * sizeof(*packed));
+}
 
+static int vq_access_ok_split(struct vhost_virtqueue *vq, unsigned int num,
+ struct vring_desc __user *desc,
+ struct vring_avail __user *avail,
+ struct vring_used __user *used)
 {
size_t s = vhost_has_feature(vq, VIRTIO_RING_F_EVENT_IDX) ? 2 : 0;
 
@@ -1118,6 +1132,17 @@ static bool vq_access_ok(struct vhost_virtqueue *vq, 
unsigned int num,
sizeof *used + num * sizeof *used->ring + s);
 }
 
+static int vq_access_ok(struct vhost_virtqueue *vq, unsigned int num,
+   struct vring_des

[RFC V5 PATCH 5/8] vhost: vhost_put_user() can accept metadata type

2018-05-28 Thread Jason Wang
We assumes used ring update is the only user for vhost_put_user() in
the past. This may not be the case for the incoming packed ring which
may update the descriptor ring for used. So introduce a new type
parameter.

Signed-off-by: Jason Wang 
---
 drivers/vhost/vhost.c | 14 +++---
 1 file changed, 7 insertions(+), 7 deletions(-)

diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
index e0fcfec..4031a8f 100644
--- a/drivers/vhost/vhost.c
+++ b/drivers/vhost/vhost.c
@@ -811,7 +811,7 @@ static inline void __user *__vhost_get_user(struct 
vhost_virtqueue *vq,
return __vhost_get_user_slow(vq, addr, size, type);
 }
 
-#define vhost_put_user(vq, x, ptr) \
+#define vhost_put_user(vq, x, ptr, type)   \
 ({ \
int ret = -EFAULT; \
if (!vq->iotlb) { \
@@ -819,7 +819,7 @@ static inline void __user *__vhost_get_user(struct 
vhost_virtqueue *vq,
} else { \
__typeof__(ptr) to = \
(__typeof__(ptr)) __vhost_get_user(vq, ptr, \
- sizeof(*ptr), VHOST_ADDR_USED); \
+ sizeof(*ptr), type); \
if (to != NULL) \
ret = __put_user(x, to); \
else \
@@ -1683,7 +1683,7 @@ static int vhost_update_used_flags(struct vhost_virtqueue 
*vq)
 {
void __user *used;
if (vhost_put_user(vq, cpu_to_vhost16(vq, vq->used_flags),
-  &vq->used->flags) < 0)
+  &vq->used->flags, VHOST_ADDR_USED) < 0)
return -EFAULT;
if (unlikely(vq->log_used)) {
/* Make sure the flag is seen before log. */
@@ -1702,7 +1702,7 @@ static int vhost_update_used_flags(struct vhost_virtqueue 
*vq)
 static int vhost_update_avail_event(struct vhost_virtqueue *vq, u16 
avail_event)
 {
if (vhost_put_user(vq, cpu_to_vhost16(vq, vq->avail_idx),
-  vhost_avail_event(vq)))
+  vhost_avail_event(vq), VHOST_ADDR_USED))
return -EFAULT;
if (unlikely(vq->log_used)) {
void __user *used;
@@ -2185,12 +2185,12 @@ static int __vhost_add_used_n(struct vhost_virtqueue 
*vq,
used = vq->used->ring + start;
for (i = 0; i < count; i++) {
if (unlikely(vhost_put_user(vq, heads[i].elem.id,
-   &used[i].id))) {
+   &used[i].id, VHOST_ADDR_USED))) {
vq_err(vq, "Failed to write used id");
return -EFAULT;
}
if (unlikely(vhost_put_user(vq, heads[i].elem.len,
-   &used[i].len))) {
+   &used[i].len, VHOST_ADDR_USED))) {
vq_err(vq, "Failed to write used len");
return -EFAULT;
}
@@ -2236,7 +2236,7 @@ int vhost_add_used_n(struct vhost_virtqueue *vq, struct 
vhost_used_elem *heads,
/* Make sure buffer is written before we update index. */
smp_wmb();
if (vhost_put_user(vq, cpu_to_vhost16(vq, vq->last_used_idx),
-  &vq->used->idx)) {
+  &vq->used->idx, VHOST_ADDR_USED)) {
vq_err(vq, "Failed to increment used idx");
return -EFAULT;
}
-- 
2.7.4



Re: [PATCH net] tun: Fix NULL pointer dereference in XDP redirect

2018-05-27 Thread Jason Wang



On 2018-05-25 21:43, Toshiaki Makita wrote:

[...]

@@ -1917,16 +1923,22 @@ static ssize_t tun_get_user(struct 
tun_struct *tun, struct tun_file *tfile,

  struct bpf_prog *xdp_prog;
  int ret;
+    local_bh_disable();
+    preempt_disable();
  rcu_read_lock();
  xdp_prog = rcu_dereference(tun->xdp_prog);
  if (xdp_prog) {
  ret = do_xdp_generic(xdp_prog, skb);
  if (ret != XDP_PASS) {
  rcu_read_unlock();
+    preempt_enable();
+    local_bh_enable();
  return total_len;
  }
  }
  rcu_read_unlock();
+    preempt_enable();
+    local_bh_enable();
  }
  rcu_read_lock();


Good catch, thanks.

But I think we can just replace preempt_disable()/enable() with 
local_bh_disable()/local_bh_enable() ?


I actually thought the same, but noticed this patch.
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=9ea4c380066fbe 



It looks like they do not think local_bh_disable() implies 
preempt_disable(). But I'm not sure why..


Toshiaki Makita 


I see, there are probably some subtle differences and implications, 
e.g. for the scheduler or others.


What we want here is to make sure the process is not moved to another 
CPU and that bh is disabled. A check with the preemptible() function shows 
that preemption is disabled after local_bh_disable(). So I think we're safe here.


Thanks
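
To make that concrete, a small illustration (mine, not from the patch) of the
property being relied on: local_bh_disable() bumps the softirq part of
preempt_count, so preemptible() reports false and the task cannot migrate while
the XDP path runs.

	/* Illustration only. */
	local_bh_disable();
	WARN_ON_ONCE(preemptible());	/* expected: never fires here */
	/* ... safe to run do_xdp_generic()/xdp_do_redirect() ... */
	local_bh_enable();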


Re: [PATCH net] tun: Fix NULL pointer dereference in XDP redirect

2018-05-25 Thread Jason Wang



On 2018-05-25 12:32, Toshiaki Makita wrote:

Calling XDP redirection requires preempt/bh disabled. Especially softirq
can call another XDP function and redirection functions, then percpu
value ri->map can be overwritten to NULL.

This is a generic XDP case called from tun.

[ 3535.736058] BUG: unable to handle kernel NULL pointer dereference at 
0018
[ 3535.743974] PGD 0 P4D 0
[ 3535.746530] Oops:  [#1] SMP PTI
[ 3535.750049] Modules linked in: vhost_net vhost tap tun bridge stp llc 
ebtable_filter ebtables ip6table_filter ip6_tables iptable_filter sunrpc vfat 
fat ext4 mbcache jbd2 intel_rapl skx_edac nfit libnvdimm x86_pkg_temp_thermal 
intel_powerclamp coretemp kvm_intel kvm ipmi_ssif irqbypass crct10dif_pclmul 
crc32_pclmul ghash_clmulni_intel pcbc ses aesni_intel crypto_simd cryptd 
enclosure hpwdt hpilo glue_helper ipmi_si pcspkr wmi mei_me ioatdma mei 
ipmi_devintf shpchp dca ipmi_msghandler lpc_ich acpi_power_meter sch_fq_codel 
ip_tables xfs libcrc32c sd_mod mgag200 i2c_algo_bit drm_kms_helper syscopyarea 
sysfillrect sysimgblt fb_sys_fops ttm drm smartpqi i40e crc32c_intel 
scsi_transport_sas tg3 i2c_core ptp pps_core
[ 3535.813456] CPU: 5 PID: 1630 Comm: vhost-1614 Not tainted 4.17.0-rc4 #2
[ 3535.820127] Hardware name: HPE ProLiant DL360 Gen10/ProLiant DL360 Gen10, 
BIOS U32 11/14/2017
[ 3535.828732] RIP: 0010:__xdp_map_lookup_elem+0x5/0x30
[ 3535.833740] RSP: 0018:b4bc47bf7c58 EFLAGS: 00010246
[ 3535.839009] RAX: 9fdfcfea1c40 RBX:  RCX: 9fdf27fe3100
[ 3535.846205] RDX: 9fdfca769200 RSI:  RDI: 
[ 3535.853402] RBP: b4bc491d9000 R08: 45ad R09: 0ec0
[ 3535.860597] R10: 0001 R11: 9fdf26c3ce4e R12: 9fdf9e72c000
[ 3535.867794] R13:  R14: fff2 R15: 9fdfc82cdd00
[ 3535.874990] FS:  () GS:9fdfcfe8() 
knlGS:
[ 3535.883152] CS:  0010 DS:  ES:  CR0: 80050033
[ 3535.888948] CR2: 0018 CR3: 000bde724004 CR4: 007626e0
[ 3535.896145] DR0:  DR1:  DR2: 
[ 3535.903342] DR3:  DR6: fffe0ff0 DR7: 0400
[ 3535.910538] PKRU: 5554
[ 3535.913267] Call Trace:
[ 3535.915736]  xdp_do_generic_redirect+0x7a/0x310
[ 3535.920310]  do_xdp_generic.part.117+0x285/0x370
[ 3535.924970]  tun_get_user+0x5b9/0x1260 [tun]
[ 3535.929279]  tun_sendmsg+0x52/0x70 [tun]
[ 3535.933237]  handle_tx+0x2ad/0x5f0 [vhost_net]
[ 3535.937721]  vhost_worker+0xa5/0x100 [vhost]
[ 3535.942030]  kthread+0xf5/0x130
[ 3535.945198]  ? vhost_dev_ioctl+0x3b0/0x3b0 [vhost]
[ 3535.950031]  ? kthread_bind+0x10/0x10
[ 3535.953727]  ret_from_fork+0x35/0x40
[ 3535.957334] Code: 0e 74 15 83 f8 10 75 05 e9 49 aa b3 ff f3 c3 0f 1f 80 00 00 00 
00 f3 c3 e9 29 9d b3 ff 66 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 <8b> 47 18 83 
f8 0e 74 0d 83 f8 10 75 05 e9 49 a9 b3 ff 31 c0 c3
[ 3535.976387] RIP: __xdp_map_lookup_elem+0x5/0x30 RSP: b4bc47bf7c58
[ 3535.982883] CR2: 0018
[ 3535.987096] ---[ end trace 383b299dd1430240 ]---
[ 3536.131325] Kernel panic - not syncing: Fatal exception
[ 3536.137484] Kernel Offset: 0x26a0 from 0x8100 (relocation 
range: 0x8000-0xbfff)
[ 3536.281406] ---[ end Kernel panic - not syncing: Fatal exception ]---

And a kernel with generic case fixed still panics in tun driver XDP
redirect, because it did not disable bh.

[ 2055.128746] BUG: unable to handle kernel NULL pointer dereference at 
0018
[ 2055.136662] PGD 0 P4D 0
[ 2055.139219] Oops:  [#1] SMP PTI
[ 2055.142736] Modules linked in: vhost_net vhost tap tun bridge stp llc 
ebtable_filter ebtables ip6table_filter ip6_tables iptable_filter sunrpc vfat 
fat ext4 mbcache jbd2 intel_rapl skx_edac nfit libnvdimm x86_pkg_temp_thermal 
intel_powerclamp coretemp kvm_intel kvm irqbypass crct10dif_pclmul crc32_pclmul 
ghash_clmulni_intel pcbc ses aesni_intel ipmi_ssif crypto_simd enclosure cryptd 
hpwdt glue_helper ioatdma hpilo wmi dca pcspkr ipmi_si acpi_power_meter 
ipmi_devintf shpchp mei_me ipmi_msghandler mei lpc_ich sch_fq_codel ip_tables 
xfs libcrc32c sd_mod mgag200 i2c_algo_bit drm_kms_helper syscopyarea 
sysfillrect sysimgblt fb_sys_fops ttm drm i40e smartpqi tg3 scsi_transport_sas 
crc32c_intel i2c_core ptp pps_core
[ 2055.206142] CPU: 6 PID: 1693 Comm: vhost-1683 Tainted: GW 
4.17.0-rc5-fix-tun+ #1
[ 2055.215011] Hardware name: HPE ProLiant DL360 Gen10/ProLiant DL360 Gen10, 
BIOS U32 11/14/2017
[ 2055.223617] RIP: 0010:__xdp_map_lookup_elem+0x5/0x30
[ 2055.228624] RSP: 0018:998b07607cc0 EFLAGS: 00010246
[ 2055.233892] RAX: 8dbd8e235700 RBX: 8dbd8ff21c40 RCX: 0004
[ 2055.241089] RDX: 998b097a9000 RSI:  RDI: 
[ 2055.248286] RBP:  R08: 65a8 R09: 5d80
[ 2055.255483] R10: 

Re: [RFC v5 0/5] virtio: support packed ring

2018-05-24 Thread Jason Wang



On 2018-05-22 16:16, Tiwei Bie wrote:

Hello everyone,

This RFC implements packed ring support in virtio driver.

Some simple functional tests have been done with Jason's
packed ring implementation in vhost (RFC v4):

https://lkml.org/lkml/2018/5/16/501

Both of ping and netperf worked as expected w/ EVENT_IDX
disabled. Ping worked as expected w/ EVENT_IDX enabled,
but netperf didn't (A hack has been added in the driver
to make netperf test pass in this case. The hack can be
found by `grep -rw XXX` in the code).


Looks like this is because of a bug in vhost which wrongly tracks 
signalled_used and may miss an interrupt. After fixing that, both sides 
work like a charm.


I'm testing vIOMMU and zerocopy, and will post a new version shortly. 
(Hopefully it will be the last RFC version.)


Thanks


Re: [RFC V4 PATCH 8/8] vhost: event suppression for packed ring

2018-05-24 Thread Jason Wang



On 2018-05-16 20:32, Jason Wang wrote:

+static bool vhost_notify_packed(struct vhost_dev *dev,
+   struct vhost_virtqueue *vq)
+{
+   __virtio16 event_off_wrap, event_flags;
+   __u16 old, new, off_wrap;
+   bool v;
+
+   /* Flush out used descriptors updates. This is paired
+* with the barrier that the Guest executes when enabling
+* interrupts.
+*/
+   smp_mb();
+
+   if (vhost_get_avail(vq, event_flags,
+  &vq->driver_event->flags) < 0) {
+   vq_err(vq, "Failed to get driver desc_event_flags");
+   return true;
+   }
+
+   if (event_flags == cpu_to_vhost16(vq, RING_EVENT_FLAGS_DISABLE))
+   return false;
+   else if (event_flags == cpu_to_vhost16(vq, RING_EVENT_FLAGS_ENABLE))
+   return true;
+
+   /* Read desc event flags before event_off and event_wrap */
+   smp_rmb();
+
+   if (vhost_get_avail(vq, event_off_wrap,
+   &vq->driver_event->off_warp) < 0) {
+   vq_err(vq, "Failed to get driver desc_event_off/wrap");
+   return true;
+   }
+
+   off_wrap = vhost16_to_cpu(vq, event_off_wrap);
+
+   old = vq->signalled_used;
+   v = vq->signalled_used_valid;
+   new = vq->signalled_used = vq->last_used_idx;
+   vq->signalled_used_valid = true;


We should move this idx tracking before checking event_flags. Otherwise 
we may lose interrupts because of a wrong signalled_used value.


Thanks


+
+   if (unlikely(!v))
+   return true;
+
+   return vhost_vring_packed_need_event(vq, off_wrap, new, old);
+}
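
In other words, roughly like this (my sketch of the suggested ordering;
everything else is unchanged from the code quoted above):

static bool vhost_notify_packed(struct vhost_dev *dev,
				struct vhost_virtqueue *vq)
{
	__virtio16 event_off_wrap, event_flags;
	__u16 old, new, off_wrap;
	bool v;

	/* Flush out used descriptors updates. */
	smp_mb();

	/* Track the used index first, so it is updated even when we
	 * return early below.
	 */
	old = vq->signalled_used;
	v = vq->signalled_used_valid;
	new = vq->signalled_used = vq->last_used_idx;
	vq->signalled_used_valid = true;

	if (vhost_get_avail(vq, event_flags,
			    &vq->driver_event->flags) < 0) {
		vq_err(vq, "Failed to get driver desc_event_flags");
		return true;
	}

	if (event_flags == cpu_to_vhost16(vq, RING_EVENT_FLAGS_DISABLE))
		return false;
	else if (event_flags == cpu_to_vhost16(vq, RING_EVENT_FLAGS_ENABLE))
		return true;

	/* Read desc event flags before event_off and event_wrap */
	smp_rmb();

	if (vhost_get_avail(vq, event_off_wrap,
			    &vq->driver_event->off_warp) < 0) {
		vq_err(vq, "Failed to get driver desc_event_off/wrap");
		return true;
	}

	off_wrap = vhost16_to_cpu(vq, event_off_wrap);

	if (unlikely(!v))
		return true;

	return vhost_vring_packed_need_event(vq, off_wrap, new, old);
}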




Re: [RFC V4 PATCH 7/8] vhost: packed ring support

2018-05-23 Thread Jason Wang



On 2018-05-23 15:17, Wei Xu wrote:

On Wed, May 23, 2018 at 09:39:28AM +0800, Jason Wang wrote:


On 2018-05-23 00:54, Wei Xu wrote:

On Wed, May 16, 2018 at 08:32:20PM +0800, Jason Wang wrote:

Signed-off-by: Jason Wang <jasow...@redhat.com>
---
  drivers/vhost/net.c   |   3 +-
  drivers/vhost/vhost.c | 539 ++
  drivers/vhost/vhost.h |   8 +-
  3 files changed, 513 insertions(+), 37 deletions(-)

diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
index 8304c30..f2a0f5b 100644
--- a/drivers/vhost/vhost.c
+++ b/drivers/vhost/vhost.c
@@ -1358,6 +1382,8 @@ long vhost_vring_ioctl(struct vhost_dev *d, unsigned int 
ioctl, void __user *arg
break;
}
vq->last_avail_idx = s.num;
+   if (vhost_has_feature(vq, VIRTIO_F_RING_PACKED))
+   vq->avail_wrap_counter = s.num >> 31;
/* Forget the cached index value. */
vq->avail_idx = vq->last_avail_idx;
break;
@@ -1366,6 +1392,8 @@ long vhost_vring_ioctl(struct vhost_dev *d, unsigned int 
ioctl, void __user *arg
s.num = vq->last_avail_idx;
if (copy_to_user(argp, &s, sizeof s))
r = -EFAULT;
+   if (vhost_has_feature(vq, VIRTIO_F_RING_PACKED))
+   s.num |= vq->avail_wrap_counter << 31;
break;
case VHOST_SET_VRING_ADDR:
if (copy_from_user(&a, argp, sizeof a)) {

'last_used_idx' also needs to be saved/restored here.

I have figured out the root cause of broken device after reloading
'virtio-net' module, all indices have been reset for a reloading but
'last_used_idx' is not properly reset in this case. This confuses
handle_rx()/tx().

Wei


Good catch, so we probably need a new ioctl to sync between qemu and vhost.

Something like VHOST_SET/GET_USED_BASE.

Sure, or can we expand 'vhost_vring_state' to keep them done in a bunch?


It's part of uapi, so we can't.

Thanks
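
Purely as a hypothetical illustration of the VHOST_GET/SET_USED_BASE idea
(nothing below is existing uapi; the ioctl name and the bit-31 packing are made
up for the sketch, mirroring how the avail base already carries its wrap
counter above):

	/* Hypothetical sketch only. */
	case VHOST_GET_USED_BASE:
		s.index = idx;
		s.num = vq->last_used_idx;
		if (vhost_has_feature(vq, VIRTIO_F_RING_PACKED))
			s.num |= vq->used_wrap_counter << 31;
		if (copy_to_user(argp, &s, sizeof s))
			r = -EFAULT;
		break;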




Thanks





Re: [RFC V4 PATCH 7/8] vhost: packed ring support

2018-05-22 Thread Jason Wang



On 2018-05-23 00:54, Wei Xu wrote:

On Wed, May 16, 2018 at 08:32:20PM +0800, Jason Wang wrote:

Signed-off-by: Jason Wang <jasow...@redhat.com>
---
  drivers/vhost/net.c   |   3 +-
  drivers/vhost/vhost.c | 539 ++
  drivers/vhost/vhost.h |   8 +-
  3 files changed, 513 insertions(+), 37 deletions(-)

diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
index 8304c30..f2a0f5b 100644
--- a/drivers/vhost/vhost.c
+++ b/drivers/vhost/vhost.c
@@ -1358,6 +1382,8 @@ long vhost_vring_ioctl(struct vhost_dev *d, unsigned int 
ioctl, void __user *arg
break;
}
vq->last_avail_idx = s.num;
+   if (vhost_has_feature(vq, VIRTIO_F_RING_PACKED))
+   vq->avail_wrap_counter = s.num >> 31;
/* Forget the cached index value. */
vq->avail_idx = vq->last_avail_idx;
break;
@@ -1366,6 +1392,8 @@ long vhost_vring_ioctl(struct vhost_dev *d, unsigned int 
ioctl, void __user *arg
s.num = vq->last_avail_idx;
if (copy_to_user(argp, &s, sizeof s))
r = -EFAULT;
+   if (vhost_has_feature(vq, VIRTIO_F_RING_PACKED))
+   s.num |= vq->avail_wrap_counter << 31;
break;
case VHOST_SET_VRING_ADDR:
if (copy_from_user(&a, argp, sizeof a)) {

'last_used_idx' also needs to be saved/restored here.

I have figured out the root cause of broken device after reloading
'virtio-net' module, all indices have been reset for a reloading but
'last_used_idx' is not properly reset in this case. This confuses
handle_rx()/tx().

Wei



Good catch, so we probably need a new ioctl to sync between qemu and vhost.

Something like VHOST_SET/GET_USED_BASE.

Thanks



Re: [RFC PATCH net-next 10/12] vhost_net: build xdp buff

2018-05-22 Thread Jason Wang



On 2018-05-22 00:56, Jesse Brandeburg wrote:

On Mon, 21 May 2018 17:04:31 +0800 Jason wrote:

This patch implement build XDP buffers in vhost_net. The idea is do
userspace copy in vhost_net and build XDP buff based on the
page. Vhost_net can then submit one or an array of XDP buffs to
underlayer socket (e.g TUN). TUN can choose to do XDP or call
build_skb() to build skb. To support build skb, vnet header were also
stored into the header of the XDP buff.

This userspace copy and XDP buffs building is key to achieve XDP
batching in TUN, since TUN does not need to care about userspace copy
and then can disable premmption for several XDP buffs to achieve
batching from XDP.

TODO: reserve headroom based on the TUN XDP.

Signed-off-by: Jason Wang <jasow...@redhat.com>
---
  drivers/vhost/net.c | 74 +
  1 file changed, 74 insertions(+)

diff --git a/drivers/vhost/net.c b/drivers/vhost/net.c
index f0639d7..1209e84 100644
--- a/drivers/vhost/net.c
+++ b/drivers/vhost/net.c
@@ -492,6 +492,80 @@ static bool vhost_has_more_pkts(struct vhost_net *net,
   likely(!vhost_exceeds_maxpend(net));
  }
  
+#define VHOST_NET_HEADROOM 256

+#define VHOST_NET_RX_PAD (NET_IP_ALIGN + NET_SKB_PAD)
+
+static int vhost_net_build_xdp(struct vhost_net_virtqueue *nvq,
+  struct iov_iter *from,
+  struct xdp_buff *xdp)
+{
+   struct vhost_virtqueue *vq = &nvq->vq;
+   struct page_frag *alloc_frag = &current->task_frag;
+   struct virtio_net_hdr *gso;
+   size_t len = iov_iter_count(from);
+   int buflen = SKB_DATA_ALIGN(sizeof(struct skb_shared_info));
+   int pad = SKB_DATA_ALIGN(VHOST_NET_RX_PAD + VHOST_NET_HEADROOM
++ nvq->sock_hlen);
+   int sock_hlen = nvq->sock_hlen;
+   void *buf;
+   int copied;
+
+   if (len < nvq->sock_hlen)
+   return -EFAULT;
+
+   if (SKB_DATA_ALIGN(len + pad) +
+   SKB_DATA_ALIGN(sizeof(struct skb_shared_info)) > PAGE_SIZE)
+   return -ENOSPC;
+
+   buflen += SKB_DATA_ALIGN(len + pad);

maybe store the result of SKB_DATA_ALIGN in a local instead of doing
the work twice?


Ok.
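
i.e. something along these lines (sketch; 'copy_len' is just a name I picked):

	/* Sketch: compute the aligned payload size once and reuse it. */
	int copy_len = SKB_DATA_ALIGN(len + pad);

	/* buflen already holds SKB_DATA_ALIGN(sizeof(struct skb_shared_info)) */
	if (copy_len + buflen > PAGE_SIZE)
		return -ENOSPC;

	buflen += copy_len;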




+   alloc_frag->offset = ALIGN((u64)alloc_frag->offset, SMP_CACHE_BYTES);
+   if (unlikely(!skb_page_frag_refill(buflen, alloc_frag, GFP_KERNEL)))
+   return -ENOMEM;
+
+   buf = (char *)page_address(alloc_frag->page) + alloc_frag->offset;
+
+   /* We store two kinds of metadata in the header which will be
+* used for XDP_PASS to do build_skb():
+* offset 0: buflen
+* offset sizeof(int): vnet header
+*/
+   copied = copy_page_from_iter(alloc_frag->page,
+alloc_frag->offset + sizeof(int), 
sock_hlen, from);
+   if (copied != sock_hlen)
+   return -EFAULT;
+
+   gso = (struct virtio_net_hdr *)(buf + sizeof(int));
+
+   if ((gso->flags & VIRTIO_NET_HDR_F_NEEDS_CSUM) &&
+   vhost16_to_cpu(vq, gso->csum_start) +
+   vhost16_to_cpu(vq, gso->csum_offset) + 2 >
+   vhost16_to_cpu(vq, gso->hdr_len)) {
+   gso->hdr_len = cpu_to_vhost16(vq,
+  vhost16_to_cpu(vq, gso->csum_start) +
+  vhost16_to_cpu(vq, gso->csum_offset) + 2);
+
+   if (vhost16_to_cpu(vq, gso->hdr_len) > len)
+   return -EINVAL;
+   }
+
+   len -= sock_hlen;
+   copied = copy_page_from_iter(alloc_frag->page,
+alloc_frag->offset + pad,
+len, from);
+   if (copied != len)
+   return -EFAULT;
+
+   xdp->data_hard_start = buf;
+   xdp->data = buf + pad;
+   xdp->data_end = xdp->data + len;
+   *(int *)(xdp->data_hard_start)= buflen;

space before =


Yes.

Thanks




+
+   get_page(alloc_frag->page);
+   alloc_frag->offset += buflen;
+
+   return 0;
+}
+
  static void handle_tx_copy(struct vhost_net *net)
  {
struct vhost_net_virtqueue *nvq = &net->vqs[VHOST_NET_VQ_TX];




Re: [RFC PATCH net-next 04/12] vhost_net: split out datacopy logic

2018-05-22 Thread Jason Wang



On 2018-05-22 00:46, Jesse Brandeburg wrote:

On Mon, 21 May 2018 17:04:25 +0800 Jason wrote:

Instead of mixing zerocopy and datacopy logics, this patch tries to
split datacopy logic out. This results for a more compact code and
specific optimization could be done on top more easily.

Signed-off-by: Jason Wang <jasow...@redhat.com>
---
  drivers/vhost/net.c | 111 +++-
  1 file changed, 102 insertions(+), 9 deletions(-)

diff --git a/drivers/vhost/net.c b/drivers/vhost/net.c
index 4ebac76..4682fcc 100644
--- a/drivers/vhost/net.c
+++ b/drivers/vhost/net.c
@@ -492,9 +492,95 @@ static bool vhost_has_more_pkts(struct vhost_net *net,
   likely(!vhost_exceeds_maxpend(net));
  }
  
+static void handle_tx_copy(struct vhost_net *net)

+{
+   struct vhost_net_virtqueue *nvq = &net->vqs[VHOST_NET_VQ_TX];
+   struct vhost_virtqueue *vq = &nvq->vq;
+   unsigned out, in;

move "out" and "in" down to inside the for loop as well.


+   int head;

move this "head" declaration inside for-loop below.


Will do these in the next version; basically this function was just copied 
from handle_tx_zerocopy with the unnecessary parts removed. So I'm thinking of 
an independent patch from the beginning to avoid changes in two places.



+   struct msghdr msg = {
+   .msg_name = NULL,
+   .msg_namelen = 0,
+   .msg_control = NULL,
+   .msg_controllen = 0,
+   .msg_flags = MSG_DONTWAIT,
+   };
+   size_t len, total_len = 0;
+   int err;
+   size_t hdr_size;
+   struct socket *sock;
+   struct vhost_net_ubuf_ref *uninitialized_var(ubufs);
+   int sent_pkts = 0;

why do we need so many locals?


So it looks to me like ubufs is unnecessary but all the others are really needed.




+
+   mutex_lock(&vq->mutex);
+   sock = vq->private_data;
+   if (!sock)
+   goto out;
+
+   if (!vq_iotlb_prefetch(vq))
+   goto out;
+
+   vhost_disable_notify(&net->dev, vq);
+   vhost_net_disable_vq(net, vq);
+
+   hdr_size = nvq->vhost_hlen;
+
+   for (;;) {
+   head = vhost_net_tx_get_vq_desc(net, vq, vq->iov,
+   ARRAY_SIZE(vq->iov),
+   &out, &in);
+   /* On error, stop handling until the next kick. */
+   if (unlikely(head < 0))
+   break;
+   /* Nothing new?  Wait for eventfd to tell us they refilled. */
+   if (head == vq->num) {
+   if (unlikely(vhost_enable_notify(&net->dev, vq))) {
+   vhost_disable_notify(&net->dev, vq);
+   continue;
+   }
+   break;
+   }
+   if (in) {
+   vq_err(vq, "Unexpected descriptor format for TX: "
+  "out %d, int %d\n", out, in);

don't break strings, keep all the bits between " " together, even if it
is longer than 80 chars.


Ok.




+   break;
+   }
+
+   len = init_iov_iter(vq, &msg.msg_iter, hdr_size, out);
+   if (len < 0)
+   break;

same comment as previous patch, len is a size_t which is unsigned.


Yes.
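
For example (sketch; this assumes init_iov_iter() is made to return a signed
value so that the error path can actually trigger):

		ssize_t len;	/* signed, so "len < 0" can be true */

		len = init_iov_iter(vq, &msg.msg_iter, hdr_size, out);
		if (unlikely(len < 0))
			break;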




+
+   total_len += len;
+   if (total_len < VHOST_NET_WEIGHT &&
+   vhost_has_more_pkts(net, vq)) {
+   msg.msg_flags |= MSG_MORE;
+   } else {
+   msg.msg_flags &= ~MSG_MORE;
+   }

don't need { } here.


Right.


+
+   /* TODO: Check specific error and bomb out unless ENOBUFS? */
+   err = sock->ops->sendmsg(sock, &msg, len);
+   if (unlikely(err < 0)) {
+   vhost_discard_vq_desc(vq, 1);
+   vhost_net_enable_vq(net, vq);
+   break;
+   }
+   if (err != len)
+   pr_debug("Truncated TX packet: "
+" len %d != %zd\n", err, len);

single line " " string please



Will fix.

Thanks


Re: [RFC PATCH net-next 03/12] vhost_net: introduce vhost_has_more_pkts()

2018-05-22 Thread Jason Wang



On 2018年05月22日 00:39, Jesse Brandeburg wrote:

On Mon, 21 May 2018 17:04:24 +0800 Jason wrote:

Signed-off-by: Jason Wang <jasow...@redhat.com>
---
  drivers/vhost/net.c | 12 +---
  1 file changed, 9 insertions(+), 3 deletions(-)

diff --git a/drivers/vhost/net.c b/drivers/vhost/net.c
index de544ee..4ebac76 100644
--- a/drivers/vhost/net.c
+++ b/drivers/vhost/net.c
@@ -485,6 +485,13 @@ static bool vhost_exceeds_weight(int pkts, int total_len)
   unlikely(pkts >= VHOST_NET_PKT_WEIGHT);
  }
  
+static bool vhost_has_more_pkts(struct vhost_net *net,

+   struct vhost_virtqueue *vq)
+{
+	return !vhost_vq_avail_empty(&net->dev, vq) &&
+  likely(!vhost_exceeds_maxpend(net));

This really seems like mis-use of likely/unlikely, in the middle of a
sequence of operations that will always be run when this function is
called.  I think you should remove the likely from this helper,
especially, and control the branch from the branch point.


Yes, so I'm considering making it a macro in the next version.
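
For reference, a minimal sketch of what such a macro could look like (purely
illustrative, not part of the posted series; it only wraps the same expression
so the likely() hint stays visible at the call site):

#define vhost_has_more_pkts(net, vq)				\
	(!vhost_vq_avail_empty(&(net)->dev, (vq)) &&		\
	 likely(!vhost_exceeds_maxpend(net)))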





+}
+
  /* Expects to be always run from workqueue - which acts as
   * read-size critical section for our kind of RCU. */
  static void handle_tx(struct vhost_net *net)
@@ -578,8 +585,7 @@ static void handle_tx(struct vhost_net *net)
}
total_len += len;
if (total_len < VHOST_NET_WEIGHT &&
-	    !vhost_vq_avail_empty(&net->dev, vq) &&
-   likely(!vhost_exceeds_maxpend(net))) {
+   vhost_has_more_pkts(net, vq)) {

Yes, I know it came from here, but likely/unlikely are for branch
control, so they should encapsulate everything inside the if, unless
I'm mistaken.


Ok.




msg.msg_flags |= MSG_MORE;
} else {
msg.msg_flags &= ~MSG_MORE;
@@ -605,7 +611,7 @@ static void handle_tx(struct vhost_net *net)
else
vhost_zerocopy_signal_used(net, vq);
vhost_net_tx_packet(net);
-   if (unlikely(vhost_exceeds_weight(++sent_pkts, total_len))) {
+   if (vhost_exceeds_weight(++sent_pkts, total_len)) {

You should have kept the unlikely here, and not had it inside the
helper (as per the previous patch.  Also, why wasn't this change part
of the previous patch?


Yes, will squash the above into previous one.

Thanks




			vhost_poll_queue(&vq->poll);
break;
}




Re: [RFC PATCH net-next 02/12] vhost_net: introduce vhost_exceeds_weight()

2018-05-22 Thread Jason Wang



On 2018年05月22日 00:29, Jesse Brandeburg wrote:

On Mon, 21 May 2018 17:04:23 +0800 Jason wrote:

Signed-off-by: Jason Wang <jasow...@redhat.com>
---
  drivers/vhost/net.c | 13 -
  1 file changed, 8 insertions(+), 5 deletions(-)

diff --git a/drivers/vhost/net.c b/drivers/vhost/net.c
index 15d191a..de544ee 100644
--- a/drivers/vhost/net.c
+++ b/drivers/vhost/net.c
@@ -479,6 +479,12 @@ static size_t init_iov_iter(struct vhost_virtqueue *vq, 
struct iov_iter *iter,
return len;
  }
  
+static bool vhost_exceeds_weight(int pkts, int total_len)

+{
+   return unlikely(total_len >= VHOST_NET_WEIGHT) ||
+  unlikely(pkts >= VHOST_NET_PKT_WEIGHT);

I was going to say just one unlikely, but then the caller of this
function also says unlikely(vhost_exceeds...), so I think you should
just drop the unlikely statements here (both of them)


Ok.




+}
+
  /* Expects to be always run from workqueue - which acts as
   * read-size critical section for our kind of RCU. */
  static void handle_tx(struct vhost_net *net)
@@ -570,7 +576,6 @@ static void handle_tx(struct vhost_net *net)
msg.msg_control = NULL;
ubufs = NULL;
}
-

unrelated whitespace changes?


Yes.

Thanks




total_len += len;
if (total_len < VHOST_NET_WEIGHT &&
		    !vhost_vq_avail_empty(&net->dev, vq) &&
@@ -600,8 +605,7 @@ static void handle_tx(struct vhost_net *net)
else
vhost_zerocopy_signal_used(net, vq);
vhost_net_tx_packet(net);
-   if (unlikely(total_len >= VHOST_NET_WEIGHT) ||
-   unlikely(++sent_pkts >= VHOST_NET_PKT_WEIGHT)) {
+   if (unlikely(vhost_exceeds_weight(++sent_pkts, total_len))) {
			vhost_poll_queue(&vq->poll);
break;
}
@@ -887,8 +891,7 @@ static void handle_rx(struct vhost_net *net)
if (unlikely(vq_log))
vhost_log_write(vq, vq_log, log, vhost_len);
total_len += vhost_len;
-   if (unlikely(total_len >= VHOST_NET_WEIGHT) ||
-   unlikely(++recv_pkts >= VHOST_NET_PKT_WEIGHT)) {
+   if (unlikely(vhost_exceeds_weight(++recv_pkts, total_len))) {
			vhost_poll_queue(&vq->poll);
goto out;
}




Re: [RFC PATCH net-next 01/12] vhost_net: introduce helper to initialize tx iov iter

2018-05-22 Thread Jason Wang



On 2018年05月22日 00:24, Jesse Brandeburg wrote:

Hi Jason, a few nits.

On Mon, 21 May 2018 17:04:22 +0800 Jason wrote:

Signed-off-by: Jason Wang <jasow...@redhat.com>
---
  drivers/vhost/net.c | 34 +++---
  1 file changed, 23 insertions(+), 11 deletions(-)

diff --git a/drivers/vhost/net.c b/drivers/vhost/net.c
index c4b49fc..15d191a 100644
--- a/drivers/vhost/net.c
+++ b/drivers/vhost/net.c
@@ -459,6 +459,26 @@ static bool vhost_exceeds_maxpend(struct vhost_net *net)
   min_t(unsigned int, VHOST_MAX_PEND, vq->num >> 2);
  }
  
+static size_t init_iov_iter(struct vhost_virtqueue *vq, struct iov_iter *iter,

+   size_t hdr_size, int out)
+{
+   /* Skip header. TODO: support TSO. */
+   size_t len = iov_length(vq->iov, out);
+
+   iov_iter_init(iter, WRITE, vq->iov, out, len);
+   iov_iter_advance(iter, hdr_size);
+   /* Sanity check */
+   if (!iov_iter_count(iter)) {
+   vq_err(vq, "Unexpected header len for TX: "
+   "%zd expected %zd\n",
+   len, hdr_size);

ok, it was like this before, but please unwrap the string in " ", there
should be no line breaks in string declarations and they are allowed to
go over 80 characters.


Ok.




+   return -EFAULT;
+   }
+   len = iov_iter_count(iter);
+
+   return len;
+}
+
  /* Expects to be always run from workqueue - which acts as
   * read-size critical section for our kind of RCU. */
  static void handle_tx(struct vhost_net *net)
@@ -521,18 +541,10 @@ static void handle_tx(struct vhost_net *net)
   "out %d, int %d\n", out, in);
break;
}
-   /* Skip header. TODO: support TSO. */
-   len = iov_length(vq->iov, out);
-	iov_iter_init(&msg.msg_iter, WRITE, vq->iov, out, len);
-	iov_iter_advance(&msg.msg_iter, hdr_size);
-	/* Sanity check */
-	if (!msg_data_left(&msg)) {
-		vq_err(vq, "Unexpected header len for TX: "
-		       "%zd expected %zd\n",
-		       len, hdr_size);
+
+	len = init_iov_iter(vq, &msg.msg_iter, hdr_size, out);
+   if (len < 0)

len is declared as size_t, which is unsigned, and can never be
negative.  I'm pretty sure this is a bug.


Yes, let me fix it in next version.

Thanks





break;
-   }
-	len = msg_data_left(&msg);
  
  		zcopy_used = zcopy && len >= VHOST_GOODCOPY_LEN

   && !vhost_exceeds_maxpend(net)




[PATCH net] vhost: synchronize IOTLB message with dev cleanup

2018-05-22 Thread Jason Wang
DaeRyong Jeong reports a race between vhost_dev_cleanup() and
vhost_process_iotlb_msg():

Thread interleaving:
CPU0 (vhost_process_iotlb_msg)  CPU1 (vhost_dev_cleanup)
(In the case of both VHOST_IOTLB_UPDATE and
VHOST_IOTLB_INVALIDATE)
=   =
vhost_umem_clean(dev->iotlb);
if (!dev->iotlb) {
ret = -EFAULT;
break;
}
dev->iotlb = NULL;

The reason is that we don't synchronize between them; fix this by protecting
vhost_process_iotlb_msg() with the dev mutex.
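
As an aside, here is a self-contained userspace analogue (illustrative only;
a pthread mutex stands in for the dev mutex) of why the NULL check and the
use of dev->iotlb must run under the same lock as the code that frees it:

#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>

struct dev { pthread_mutex_t mutex; int *iotlb; };

static void process_msg(struct dev *d)
{
	pthread_mutex_lock(&d->mutex);
	if (d->iotlb)		/* check and use are now atomic wrt cleanup() */
		*d->iotlb += 1;
	pthread_mutex_unlock(&d->mutex);
}

static void cleanup(struct dev *d)
{
	pthread_mutex_lock(&d->mutex);
	free(d->iotlb);
	d->iotlb = NULL;
	pthread_mutex_unlock(&d->mutex);
}

int main(void)
{
	struct dev d = { PTHREAD_MUTEX_INITIALIZER, calloc(1, sizeof(int)) };

	process_msg(&d);
	cleanup(&d);
	process_msg(&d);	/* safe: sees NULL and does nothing */
	printf("done\n");
	return 0;
}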

Reported-by: DaeRyong Jeong <threeear...@gmail.com>
Fixes: 6b1e6cc7855b0 ("vhost: new device IOTLB API")
Signed-off-by: Jason Wang <jasow...@redhat.com>
---
 drivers/vhost/vhost.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
index f3bd8e9..f0be5f3 100644
--- a/drivers/vhost/vhost.c
+++ b/drivers/vhost/vhost.c
@@ -981,6 +981,7 @@ static int vhost_process_iotlb_msg(struct vhost_dev *dev,
 {
int ret = 0;
 
+	mutex_lock(&dev->mutex);
vhost_dev_lock_vqs(dev);
switch (msg->type) {
case VHOST_IOTLB_UPDATE:
@@ -1016,6 +1017,8 @@ static int vhost_process_iotlb_msg(struct vhost_dev *dev,
}
 
vhost_dev_unlock_vqs(dev);
+	mutex_unlock(&dev->mutex);
+
return ret;
 }
 ssize_t vhost_chr_write_iter(struct vhost_dev *dev,
-- 
2.7.4



Re: KASAN: use-after-free Read in vhost_chr_write_iter

2018-05-22 Thread Jason Wang



On 2018年05月22日 16:38, DaeRyong Jeong wrote:

On Mon, May 21, 2018 at 10:38:10AM +0800, Jason Wang wrote:

On 2018年05月18日 17:24, Jason Wang wrote:

On 2018年05月17日 21:45, DaeRyong Jeong wrote:

We report the crash: KASAN: use-after-free Read in vhost_chr_write_iter

This crash has been found in v4.17-rc1 using RaceFuzzer (a modified
version of Syzkaller), which we describe more at the end of this
report. Our analysis shows that the race occurs when invoking two
syscalls concurrently, write$vnet and ioctl$VHOST_RESET_OWNER.


Analysis:
We think the concurrent execution of vhost_process_iotlb_msg() and
vhost_dev_cleanup() causes the crash.
Both of functions can run concurrently (please see call sequence below),
and possibly, there is a race on dev->iotlb.
If the switch occurs right after vhost_dev_cleanup() frees
dev->iotlb, vhost_process_iotlb_msg() still sees the non-null value and it
keeps executing without returning -EFAULT. Consequently, a use-after-free
occurs.


Thread interleaving:
CPU0 (vhost_process_iotlb_msg)    CPU1 (vhost_dev_cleanup)
(In the case of both VHOST_IOTLB_UPDATE and
VHOST_IOTLB_INVALIDATE)
=    =
     vhost_umem_clean(dev->iotlb);
if (!dev->iotlb) {
     ret = -EFAULT;
     break;
}
     dev->iotlb = NULL;


Call Sequence:
CPU0
=
vhost_net_chr_write_iter
 vhost_chr_write_iter
     vhost_process_iotlb_msg

CPU1
=
vhost_net_ioctl
 vhost_net_reset_owner
     vhost_dev_reset_owner
     vhost_dev_cleanup

Thanks a lot for the analysis.

This could be addressed by simply protect it with dev mutex.

Will post a patch.


Could you please help to test the attached patch? I've done some smoking
test.

Thanks

Sorry to say this, but we don't have a reproducer for this bug yet since our
reproducer is still being implemented.

This crash has occurred a few times in our fuzzer, so I inspected the code
manually.

The patch seems good to me, but we can't test it for now.
Sorry.



No problem.

I'm trying to craft a reproducer; it doesn't look hard.

Thanks


[PATCH net] tuntap: correctly set SOCKWQ_ASYNC_NOSPACE

2018-05-22 Thread Jason Wang
When link is down, writes to the device might fail with
-EIO. Userspace needs an indication when the status is resolved.  As a
fix, tun_net_open() attempts to wake up writers - but that is only
effective if SOCKWQ_ASYNC_NOSPACE has been set in the past. This is
not the case for vhost_net, which only polls for EPOLLOUT after it meets
errors during sendmsg().

This patch fixes this by making sure SOCKWQ_ASYNC_NOSPACE is set when
socket is not writable or device is down to guarantee EPOLLOUT will be
raised in either tun_chr_poll() or tun_sock_write_space() after device
is up.
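
To see why the writable test is done twice around the flag, here is a
self-contained userspace sketch of the same pattern (illustrative only;
C11 atomics stand in for the socket flag bit):

#include <stdatomic.h>
#include <stdbool.h>
#include <stdio.h>

static atomic_flag nospace = ATOMIC_FLAG_INIT;
static _Atomic bool writable = false;

/* Mirrors the shape of the tun_chr_poll() check below: if the first test
 * fails, set NOSPACE and test again, so space freed in between cannot be
 * missed and a wakeup is never lost. */
static bool poll_out(void)
{
	return atomic_load(&writable) ||
	       (!atomic_flag_test_and_set(&nospace) && atomic_load(&writable));
}

int main(void)
{
	printf("not writable -> EPOLLOUT %d (NOSPACE now set)\n", poll_out());
	atomic_store(&writable, true);
	printf("writable     -> EPOLLOUT %d\n", poll_out());
	return 0;
}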

Cc: Hannes Frederic Sowa <han...@stressinduktion.org>
Cc: Eric Dumazet <eduma...@google.com>
Fixes: 1bd4978a88ac2 ("tun: honor IFF_UP in tun_get_user()")
Signed-off-by: Jason Wang <jasow...@redhat.com>
---
 drivers/net/tun.c | 19 +++
 1 file changed, 15 insertions(+), 4 deletions(-)

diff --git a/drivers/net/tun.c b/drivers/net/tun.c
index d45ac37..45d8077 100644
--- a/drivers/net/tun.c
+++ b/drivers/net/tun.c
@@ -1423,6 +1423,13 @@ static void tun_net_init(struct net_device *dev)
dev->max_mtu = MAX_MTU - dev->hard_header_len;
 }
 
+static bool tun_sock_writeable(struct tun_struct *tun, struct tun_file *tfile)
+{
+   struct sock *sk = tfile->socket.sk;
+
+   return (tun->dev->flags & IFF_UP) && sock_writeable(sk);
+}
+
 /* Character device part */
 
 /* Poll */
@@ -1445,10 +1452,14 @@ static __poll_t tun_chr_poll(struct file *file, 
poll_table *wait)
	if (!ptr_ring_empty(&tfile->tx_ring))
mask |= EPOLLIN | EPOLLRDNORM;
 
-   if (tun->dev->flags & IFF_UP &&
-   (sock_writeable(sk) ||
-	     (!test_and_set_bit(SOCKWQ_ASYNC_NOSPACE, &sk->sk_socket->flags) &&
-	      sock_writeable(sk))))
+   /* Make sure SOCKWQ_ASYNC_NOSPACE is set if not writable to
+* guarantee EPOLLOUT to be raised by either here or
+* tun_sock_write_space(). Then process could get notification
+* after it writes to a down device and meets -EIO.
+*/
+   if (tun_sock_writeable(tun, tfile) ||
+	    (!test_and_set_bit(SOCKWQ_ASYNC_NOSPACE, &sk->sk_socket->flags) &&
+tun_sock_writeable(tun, tfile)))
mask |= EPOLLOUT | EPOLLWRNORM;
 
if (tun->dev->reg_state != NETREG_REGISTERED)
-- 
2.7.4



Re: [PATCH net] tuntap: raise EPOLLOUT on device up

2018-05-21 Thread Jason Wang



On 2018年05月22日 11:46, Michael S. Tsirkin wrote:

On Tue, May 22, 2018 at 11:22:11AM +0800, Jason Wang wrote:


On 2018年05月22日 06:08, Michael S. Tsirkin wrote:

On Mon, May 21, 2018 at 11:47:42AM -0400, David Miller wrote:

From: Jason Wang <jasow...@redhat.com>
Date: Fri, 18 May 2018 21:00:43 +0800


We return -EIO on device down but can not raise EPOLLOUT after it was
up. This may confuse user like vhost which expects tuntap to raise
EPOLLOUT to re-enable its TX routine after tuntap is down. This could
be easily reproduced by transmitting packets from VM while down and up
the tap device. Fixing this by set SOCKWQ_ASYNC_NOSPACE on -EIO.

Cc: Hannes Frederic Sowa <han...@stressinduktion.org>
Cc: Eric Dumazet <eduma...@google.com>
Fixes: 1bd4978a88ac2 ("tun: honor IFF_UP in tun_get_user()")
Signed-off-by: Jason Wang <jasow...@redhat.com>

I'm not so sure what to do with this patch.

Like Michael says, this flag bit is only checked upon transmit, which
may or may not happen after this point.  It doesn't seem to be
guaranteed.

The flag is checked in tun_chr_poll() as well.


Jason, can't we detect a link up transition and respond accordingly?
What do you think?


I think we've already tried to do this: in tun_net_open() we call
write_space(). But the problem is that the bit may not be set at that time.

A second thought is to set the bit in tun_chr_poll() instead of -EIO like:

diff --git a/drivers/net/tun.c b/drivers/net/tun.c
index d45ac37..46a1573 100644
--- a/drivers/net/tun.c
+++ b/drivers/net/tun.c
@@ -1423,6 +1423,13 @@ static void tun_net_init(struct net_device *dev)
     dev->max_mtu = MAX_MTU - dev->hard_header_len;
  }

+static bool tun_sock_writeable(struct tun_struct *tun, struct tun_file *tfile)
+{
+   struct sock *sk = tfile->socket.sk;
+
+   return (tun->dev->flags & IFF_UP) && sock_writeable(sk);
+}
+
  /* Character device part */

  /* Poll */
@@ -1445,10 +1452,9 @@ static __poll_t tun_chr_poll(struct file *file, poll_table *wait)
 	if (!ptr_ring_empty(&tfile->tx_ring))
 		mask |= EPOLLIN | EPOLLRDNORM;

-	if (tun->dev->flags & IFF_UP &&
-	    (sock_writeable(sk) ||
-	     (!test_and_set_bit(SOCKWQ_ASYNC_NOSPACE, &sk->sk_socket->flags) &&
-	      sock_writeable(sk))))
+	if (tun_sock_writeable(tun, tfile) ||
+	    (!test_and_set_bit(SOCKWQ_ASYNC_NOSPACE, &sk->sk_socket->flags) &&
+	     tun_sock_writeable(tun, tfile)))
     mask |= EPOLLOUT | EPOLLWRNORM;

     if (tun->dev->reg_state != NETREG_REGISTERED)

Does this make more sense?

Thanks

I just understood the motivation for doing it on EIO.
Maybe there's a reason it makes sense here as well,
but it's far from obvious. I suggest you repost adding
an explanation in the comment. The original patch will
be fine with an explanation as well.



Ok, let me add an explanation to both the code and the commit log.

Thanks



Re: KASAN: use-after-free Read in vhost_chr_write_iter

2018-05-21 Thread Jason Wang



On 2018年05月21日 22:42, Michael S. Tsirkin wrote:

On Mon, May 21, 2018 at 10:38:10AM +0800, Jason Wang wrote:

On 2018年05月18日 17:24, Jason Wang wrote:

On 2018年05月17日 21:45, DaeRyong Jeong wrote:

We report the crash: KASAN: use-after-free Read in vhost_chr_write_iter

This crash has been found in v4.17-rc1 using RaceFuzzer (a modified
version of Syzkaller), which we describe more at the end of this
report. Our analysis shows that the race occurs when invoking two
syscalls concurrently, write$vnet and ioctl$VHOST_RESET_OWNER.


Analysis:
We think the concurrent execution of vhost_process_iotlb_msg() and
vhost_dev_cleanup() causes the crash.
Both of functions can run concurrently (please see call sequence below),
and possibly, there is a race on dev->iotlb.
If the switch occurs right after vhost_dev_cleanup() frees
dev->iotlb, vhost_process_iotlb_msg() still sees the non-null value and it
keeps executing without returning -EFAULT. Consequently, a use-after-free
occurs.


Thread interleaving:
CPU0 (vhost_process_iotlb_msg)    CPU1 (vhost_dev_cleanup)
(In the case of both VHOST_IOTLB_UPDATE and
VHOST_IOTLB_INVALIDATE)
=    =
     vhost_umem_clean(dev->iotlb);
if (!dev->iotlb) {
     ret = -EFAULT;
     break;
}
     dev->iotlb = NULL;


Call Sequence:
CPU0
=
vhost_net_chr_write_iter
 vhost_chr_write_iter
     vhost_process_iotlb_msg

CPU1
=
vhost_net_ioctl
 vhost_net_reset_owner
     vhost_dev_reset_owner
     vhost_dev_cleanup

Thanks a lot for the analysis.

This could be addressed by simply protect it with dev mutex.

Will post a patch.


Could you please help to test the attached patch? I've done some smoking
test.

Thanks
>From 88328386f3f652e684ee33dc4cf63dcaed871aea Mon Sep 17 00:00:00 2001
From: Jason Wang<jasow...@redhat.com>
Date: Fri, 18 May 2018 17:33:27 +0800
Subject: [PATCH] vhost: synchronize IOTLB message with dev cleanup

DaeRyong Jeong reports a race between vhost_dev_cleanup() and
vhost_process_iotlb_msg():

Thread interleaving:
CPU0 (vhost_process_iotlb_msg)  CPU1 (vhost_dev_cleanup)
(In the case of both VHOST_IOTLB_UPDATE and
VHOST_IOTLB_INVALIDATE)
=   =
vhost_umem_clean(dev->iotlb);
if (!dev->iotlb) {
ret = -EFAULT;
break;
}
dev->iotlb = NULL;

The reason is we don't synchronize between them, fixing by protecting
vhost_process_iotlb_msg() with dev mutex.

Reported-by: DaeRyong Jeong<threeear...@gmail.com>
Fixes: 6b1e6cc7855b0 ("vhost: new device IOTLB API")
Reported-by: DaeRyong Jeong<threeear...@gmail.com>

Long term we might want to move the iotlb into the vqs
so that messages can be processed in parallel.
Not sure how to do it yet.



Then we probably need to extend the IOTLB msg to have a queue idx. But I
think it would probably only help if we split tx/rx into separate processes.


Thanks




[PATCH net V2 2/4] virtio-net: correctly transmit XDP buff after linearizing

2018-05-21 Thread Jason Wang
We should not go for the error path after successfully transmitting an
XDP buffer after linearizing, since the error path may try to pop and
drop the next packet and increase the drop counters. Fix this by simply
dropping the refcnt of the original page and going for the xmit path.

Fixes: 72979a6c3590 ("virtio_net: xdp, add slowpath case for non contiguous 
buffers")
Cc: John Fastabend <john.fastab...@gmail.com>
Acked-by: Michael S. Tsirkin <m...@redhat.com>
Signed-off-by: Jason Wang <jasow...@redhat.com>
---
 drivers/net/virtio_net.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
index c15d240..6260d65 100644
--- a/drivers/net/virtio_net.c
+++ b/drivers/net/virtio_net.c
@@ -775,7 +775,7 @@ static struct sk_buff *receive_mergeable(struct net_device 
*dev,
}
*xdp_xmit = true;
if (unlikely(xdp_page != page))
-   goto err_xdp;
+   put_page(page);
rcu_read_unlock();
goto xdp_xmit;
case XDP_REDIRECT:
-- 
2.7.4



[PATCH net V2 1/4] virtio-net: correctly redirect linearized packet

2018-05-21 Thread Jason Wang
After a linearized packet was redirected by XDP, we should not go for
the err path, which will try to pop buffers for the next packet and
increase the drop counter. Fix this by just dropping the page refcnt
of the original page.

Fixes: 186b3c998c50 ("virtio-net: support XDP_REDIRECT")
Reported-by: David Ahern <dsah...@gmail.com>
Tested-by: David Ahern <dsah...@gmail.com>
Acked-by: Michael S. Tsirkin <m...@redhat.com>
Signed-off-by: Jason Wang <jasow...@redhat.com>
---
 drivers/net/virtio_net.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
index 770422e..c15d240 100644
--- a/drivers/net/virtio_net.c
+++ b/drivers/net/virtio_net.c
@@ -787,7 +787,7 @@ static struct sk_buff *receive_mergeable(struct net_device 
*dev,
}
*xdp_xmit = true;
if (unlikely(xdp_page != page))
-   goto err_xdp;
+   put_page(page);
rcu_read_unlock();
goto xdp_xmit;
default:
-- 
2.7.4



[PATCH net V2 0/4] Fix several issues of virtio-net mergeable XDP

2018-05-21 Thread Jason Wang
Hi:

Please review the patches, which try to fix several issues with
virtio-net mergeable XDP.

Changes from V1:
- check against 1 before decreasing instead of resetting to 1
- typo fixes

Jason Wang (4):
  virtio-net: correctly redirect linearized packet
  virtio-net: correctly transmit XDP buff after linearizing
  virtio-net: correctly check num_buf during err path
  virtio-net: fix leaking page for gso packet during mergeable XDP

 drivers/net/virtio_net.c | 21 ++---
 1 file changed, 10 insertions(+), 11 deletions(-)

-- 
2.7.4



[PATCH net V2 3/4] virtio-net: correctly check num_buf during err path

2018-05-21 Thread Jason Wang
If we successfully linearize the packet, num_buf will be set to zero,
which may confuse the error handling path, which assumes num_buf is at
least 1; this can lead the code to try to pop the descriptor of the next
buffer. Fix this by checking num_buf against 1 before decreasing.
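
To make the off-by-one concrete, a tiny self-contained demo (illustrative
only, not kernel code) of how the two loop forms behave once num_buf has
already reached zero:

#include <stdio.h>

int main(void)
{
	int num_buf, old_iters = 0, new_iters = 0;

	/* old form: with num_buf == 0, --num_buf yields -1 (truthy), so the
	 * err path would still try to pop buffers that were never posted */
	num_buf = 0;
	while (--num_buf) {
		if (++old_iters > 3)	/* cap the runaway loop for the demo */
			break;
	}

	/* new form: 0 > 1 is false, the loop body never runs */
	num_buf = 0;
	while (num_buf-- > 1)
		new_iters++;

	printf("old form iterated %d time(s), new form %d time(s)\n",
	       old_iters, new_iters);
	return 0;
}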

Fixes: 4941d472bf95 ("virtio-net: do not reset during XDP set")
Signed-off-by: Jason Wang <jasow...@redhat.com>
---
 drivers/net/virtio_net.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
index 6260d65..326e247 100644
--- a/drivers/net/virtio_net.c
+++ b/drivers/net/virtio_net.c
@@ -875,7 +875,7 @@ static struct sk_buff *receive_mergeable(struct net_device 
*dev,
rcu_read_unlock();
 err_skb:
put_page(page);
-   while (--num_buf) {
+   while (num_buf-- > 1) {
		buf = virtqueue_get_buf(rq->vq, &len);
if (unlikely(!buf)) {
pr_debug("%s: rx error: %d buffers missing\n",
-- 
2.7.4



[PATCH net V2 4/4] virtio-net: fix leaking page for gso packet during mergeable XDP

2018-05-21 Thread Jason Wang
We need to drop the refcnt of xdp_page if we see a gso packet, otherwise
it will be leaked. Fix this by moving the check for gso packets above
the linearizing logic. While at it, remove the useless comment as well.

Cc: John Fastabend <john.fastab...@gmail.com>
Fixes: 72979a6c3590 ("virtio_net: xdp, add slowpath case for non contiguous 
buffers")
Signed-off-by: Jason Wang <jasow...@redhat.com>
---
 drivers/net/virtio_net.c | 15 +++
 1 file changed, 7 insertions(+), 8 deletions(-)

diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
index 326e247..032e1ac 100644
--- a/drivers/net/virtio_net.c
+++ b/drivers/net/virtio_net.c
@@ -707,6 +707,13 @@ static struct sk_buff *receive_mergeable(struct net_device 
*dev,
void *data;
u32 act;
 
+   /* Transient failure which in theory could occur if
+* in-flight packets from before XDP was enabled reach
+* the receive path after XDP is loaded.
+*/
+   if (unlikely(hdr->hdr.gso_type))
+   goto err_xdp;
+
/* This happens when rx buffer size is underestimated
 * or headroom is not enough because of the buffer
 * was refilled before XDP is set. This should only
@@ -727,14 +734,6 @@ static struct sk_buff *receive_mergeable(struct net_device 
*dev,
xdp_page = page;
}
 
-   /* Transient failure which in theory could occur if
-* in-flight packets from before XDP was enabled reach
-* the receive path after XDP is loaded. In practice I
-* was not able to create this condition.
-*/
-   if (unlikely(hdr->hdr.gso_type))
-   goto err_xdp;
-
/* Allow consuming headroom but reserve enough space to push
 * the descriptor on if we get an XDP_TX return code.
 */
-- 
2.7.4



Re: [PATCH net 4/4] virito-net: fix leaking page for gso packet during mergeable XDP

2018-05-21 Thread Jason Wang



On 2018年05月21日 23:01, Michael S. Tsirkin wrote:

On Mon, May 21, 2018 at 04:35:06PM +0800, Jason Wang wrote:

We need to drop refcnt to xdp_page if we see a gso packet. Otherwise
it will be leaked. Fixing this by moving the check of gso packet above
the linearizing logic.

Cc: John Fastabend <john.fastab...@gmail.com>
Fixes: 72979a6c3590 ("virtio_net: xdp, add slowpath case for non contiguous 
buffers")
Signed-off-by: Jason Wang <jasow...@redhat.com>

typo in subject


Let me fix it in V2.


---
  drivers/net/virtio_net.c | 16 
  1 file changed, 8 insertions(+), 8 deletions(-)

diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
index 165a922..f8db809 100644
--- a/drivers/net/virtio_net.c
+++ b/drivers/net/virtio_net.c
@@ -707,6 +707,14 @@ static struct sk_buff *receive_mergeable(struct net_device 
*dev,
void *data;
u32 act;
  
+		/* Transient failure which in theory could occur if

+* in-flight packets from before XDP was enabled reach
+* the receive path after XDP is loaded. In practice I
+* was not able to create this condition.

BTW we should probably drop the last sentence. It already says "in theory",
which should be enough.


Ok.

Thanks


+*/
+   if (unlikely(hdr->hdr.gso_type))
+   goto err_xdp;
+
/* This happens when rx buffer size is underestimated
 * or headroom is not enough because of the buffer
 * was refilled before XDP is set. This should only
@@ -728,14 +736,6 @@ static struct sk_buff *receive_mergeable(struct net_device 
*dev,
xdp_page = page;
}
  
-		/* Transient failure which in theory could occur if

-* in-flight packets from before XDP was enabled reach
-* the receive path after XDP is loaded. In practice I
-* was not able to create this condition.
-*/
-   if (unlikely(hdr->hdr.gso_type))
-   goto err_xdp;
-
/* Allow consuming headroom but reserve enough space to push
 * the descriptor on if we get an XDP_TX return code.
 */
--
2.7.4




Re: [PATCH net 3/4] virtio-net: reset num_buf to 1 after linearizing packet

2018-05-21 Thread Jason Wang



On 2018年05月21日 22:59, Michael S. Tsirkin wrote:

On Mon, May 21, 2018 at 04:35:05PM +0800, Jason Wang wrote:

If we successfully linearize the packets, num_buf was set to zero,
which was wrong since we now have only 1 buffer to be used, e.g. in
the error path of receive_mergeable(). A zero num_buf will lead the code
to try to pop the buffers of the next packet and drop it. Fix this by setting
num_buf to 1 if we successfully linearize the packet.

Fixes: 4941d472bf95 ("virtio-net: do not reset during XDP set")
Signed-off-by: Jason Wang <jasow...@redhat.com>
---
  drivers/net/virtio_net.c | 1 +
  1 file changed, 1 insertion(+)

diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
index 6260d65..165a922 100644
--- a/drivers/net/virtio_net.c
+++ b/drivers/net/virtio_net.c
@@ -722,6 +722,7 @@ static struct sk_buff *receive_mergeable(struct net_device 
*dev,
  );
if (!xdp_page)
goto err_xdp;
+   num_buf = 1;

So this is tweaked here for the benefit of err_skb below.
That's confusing and we won't remember to change it
if we change the error handling.

How about fixing the error path?


-while (--num_buf) {
+while (num_buf-- > 1) {

Seems more robust to me.


Looks good, will do this in V2.

Thanks




offset = VIRTIO_XDP_HEADROOM;
} else {
xdp_page = page;
--
2.7.4




Re: [PATCH net] tuntap: raise EPOLLOUT on device up

2018-05-21 Thread Jason Wang



On 2018年05月22日 06:08, Michael S. Tsirkin wrote:

On Mon, May 21, 2018 at 11:47:42AM -0400, David Miller wrote:

From: Jason Wang <jasow...@redhat.com>
Date: Fri, 18 May 2018 21:00:43 +0800


We return -EIO on device down but can not raise EPOLLOUT after it was
up. This may confuse user like vhost which expects tuntap to raise
EPOLLOUT to re-enable its TX routine after tuntap is down. This could
be easily reproduced by transmitting packets from VM while down and up
the tap device. Fixing this by set SOCKWQ_ASYNC_NOSPACE on -EIO.

Cc: Hannes Frederic Sowa <han...@stressinduktion.org>
Cc: Eric Dumazet <eduma...@google.com>
Fixes: 1bd4978a88ac2 ("tun: honor IFF_UP in tun_get_user()")
Signed-off-by: Jason Wang <jasow...@redhat.com>

I'm not so sure what to do with this patch.

Like Michael says, this flag bit is only checked upon transmit, which
may or may not happen after this point.  It doesn't seem to be
guaranteed.


The flag is checked in tun_chr_poll() as well.


Jason, can't we detect a link up transition and respond accordingly?
What do you think?



I think we've already tried to do this: in tun_net_open() we call
write_space(). But the problem is that the bit may not be set at that time.


A second thought is to set the bit in tun_chr_poll() instead of -EIO like:

diff --git a/drivers/net/tun.c b/drivers/net/tun.c
index d45ac37..46a1573 100644
--- a/drivers/net/tun.c
+++ b/drivers/net/tun.c
@@ -1423,6 +1423,13 @@ static void tun_net_init(struct net_device *dev)
    dev->max_mtu = MAX_MTU - dev->hard_header_len;
 }

+static bool tun_sock_writeable(struct tun_struct *tun, struct tun_file *tfile)

+{
+   struct sock *sk = tfile->socket.sk;
+
+   return (tun->dev->flags & IFF_UP) && sock_writeable(sk);
+}
+
 /* Character device part */

 /* Poll */
@@ -1445,10 +1452,9 @@ static __poll_t tun_chr_poll(struct file *file, poll_table *wait)

 	if (!ptr_ring_empty(&tfile->tx_ring))
 		mask |= EPOLLIN | EPOLLRDNORM;

-	if (tun->dev->flags & IFF_UP &&
-	    (sock_writeable(sk) ||
-	     (!test_and_set_bit(SOCKWQ_ASYNC_NOSPACE, &sk->sk_socket->flags) &&
-	      sock_writeable(sk))))
+	if (tun_sock_writeable(tun, tfile) ||
+	    (!test_and_set_bit(SOCKWQ_ASYNC_NOSPACE, &sk->sk_socket->flags) &&
+	     tun_sock_writeable(tun, tfile)))
    mask |= EPOLLOUT | EPOLLWRNORM;

    if (tun->dev->reg_state != NETREG_REGISTERED)

Does this make more sense?

Thanks


[RFC PATCH net-next 00/12] XDP batching for TUN/vhost_net

2018-05-21 Thread Jason Wang
Hi all:

We do not support XDP batching for TUN since it can only receive one
packet at a time from vhost_net. This series tries to remove this
limitation by:

- introducing a TUN-specific msg_control that can hold a pointer to an
  array of XDP buffs
- trying to copy and build the XDP buff in vhost_net
- storing XDP buffs in an array and submitting them once for every N
  packets from vhost_net
- since TUN can only do native XDP for datacopy packets, splitting the
  datacopy logic out to simplify things and only doing batching for
  datacopy.

With this series, TX PPS can improve about 34% from 2.9Mpps to
3.9Mpps when doing xdp_redirect_map between TAP and ixgbe.
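
For reference, patches 11-12 below pack the batch size into the upper bits
of the new msg_control type field; a tiny self-contained demo of that
encoding (illustrative only, constants copied from the patches):

#include <stdio.h>

#define TUN_MSG_PTR 2	/* from include/linux/if_tun.h in patch 11 */

int main(void)
{
	int n = 16;				/* number of batched XDP buffs */
	int type = n << 16 | TUN_MSG_PTR;	/* built by vhost_tx_batch() */

	/* recovered on the other side by tun_sendmsg() */
	printf("msg type = %d, batch size = %d\n", type & 0xF, type >> 16);
	return 0;
}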

Thanks

Jason Wang (12):
  vhost_net: introduce helper to initialize tx iov iter
  vhost_net: introduce vhost_exceeds_weight()
  vhost_net: introduce vhost_has_more_pkts()
  vhost_net: split out datacopy logic
  vhost_net: batch update used ring for datacopy TX
  tuntap: enable preemption early
  tuntap: simplify error handling in tun_build_skb()
  tuntap: tweak on the path of non-xdp case in tun_build_skb()
  tuntap: split out XDP logic
  vhost_net: build xdp buff
  vhost_net: passing raw xdp buff to tun
  vhost_net: batch submitting XDP buffers to underlayer sockets

 drivers/net/tun.c  | 226 +++--
 drivers/vhost/net.c| 297 -
 include/linux/if_tun.h |   7 ++
 3 files changed, 444 insertions(+), 86 deletions(-)

-- 
2.7.4



[RFC PATCH net-next 03/12] vhost_net: introduce vhost_has_more_pkts()

2018-05-21 Thread Jason Wang
Signed-off-by: Jason Wang <jasow...@redhat.com>
---
 drivers/vhost/net.c | 12 +---
 1 file changed, 9 insertions(+), 3 deletions(-)

diff --git a/drivers/vhost/net.c b/drivers/vhost/net.c
index de544ee..4ebac76 100644
--- a/drivers/vhost/net.c
+++ b/drivers/vhost/net.c
@@ -485,6 +485,13 @@ static bool vhost_exceeds_weight(int pkts, int total_len)
   unlikely(pkts >= VHOST_NET_PKT_WEIGHT);
 }
 
+static bool vhost_has_more_pkts(struct vhost_net *net,
+   struct vhost_virtqueue *vq)
+{
+	return !vhost_vq_avail_empty(&net->dev, vq) &&
+  likely(!vhost_exceeds_maxpend(net));
+}
+
 /* Expects to be always run from workqueue - which acts as
  * read-size critical section for our kind of RCU. */
 static void handle_tx(struct vhost_net *net)
@@ -578,8 +585,7 @@ static void handle_tx(struct vhost_net *net)
}
total_len += len;
if (total_len < VHOST_NET_WEIGHT &&
-	    !vhost_vq_avail_empty(&net->dev, vq) &&
-   likely(!vhost_exceeds_maxpend(net))) {
+   vhost_has_more_pkts(net, vq)) {
msg.msg_flags |= MSG_MORE;
} else {
msg.msg_flags &= ~MSG_MORE;
@@ -605,7 +611,7 @@ static void handle_tx(struct vhost_net *net)
else
vhost_zerocopy_signal_used(net, vq);
vhost_net_tx_packet(net);
-   if (unlikely(vhost_exceeds_weight(++sent_pkts, total_len))) {
+   if (vhost_exceeds_weight(++sent_pkts, total_len)) {
			vhost_poll_queue(&vq->poll);
break;
}
-- 
2.7.4



[RFC PATCH net-next 01/12] vhost_net: introduce helper to initialize tx iov iter

2018-05-21 Thread Jason Wang
Signed-off-by: Jason Wang <jasow...@redhat.com>
---
 drivers/vhost/net.c | 34 +++---
 1 file changed, 23 insertions(+), 11 deletions(-)

diff --git a/drivers/vhost/net.c b/drivers/vhost/net.c
index c4b49fc..15d191a 100644
--- a/drivers/vhost/net.c
+++ b/drivers/vhost/net.c
@@ -459,6 +459,26 @@ static bool vhost_exceeds_maxpend(struct vhost_net *net)
   min_t(unsigned int, VHOST_MAX_PEND, vq->num >> 2);
 }
 
+static size_t init_iov_iter(struct vhost_virtqueue *vq, struct iov_iter *iter,
+   size_t hdr_size, int out)
+{
+   /* Skip header. TODO: support TSO. */
+   size_t len = iov_length(vq->iov, out);
+
+   iov_iter_init(iter, WRITE, vq->iov, out, len);
+   iov_iter_advance(iter, hdr_size);
+   /* Sanity check */
+   if (!iov_iter_count(iter)) {
+   vq_err(vq, "Unexpected header len for TX: "
+   "%zd expected %zd\n",
+   len, hdr_size);
+   return -EFAULT;
+   }
+   len = iov_iter_count(iter);
+
+   return len;
+}
+
 /* Expects to be always run from workqueue - which acts as
  * read-size critical section for our kind of RCU. */
 static void handle_tx(struct vhost_net *net)
@@ -521,18 +541,10 @@ static void handle_tx(struct vhost_net *net)
   "out %d, int %d\n", out, in);
break;
}
-   /* Skip header. TODO: support TSO. */
-   len = iov_length(vq->iov, out);
-		iov_iter_init(&msg.msg_iter, WRITE, vq->iov, out, len);
-		iov_iter_advance(&msg.msg_iter, hdr_size);
-		/* Sanity check */
-		if (!msg_data_left(&msg)) {
-			vq_err(vq, "Unexpected header len for TX: "
-			       "%zd expected %zd\n",
-			       len, hdr_size);
+
+		len = init_iov_iter(vq, &msg.msg_iter, hdr_size, out);
+		if (len < 0)
 			break;
-		}
-		len = msg_data_left(&msg);
 
zcopy_used = zcopy && len >= VHOST_GOODCOPY_LEN
   && !vhost_exceeds_maxpend(net)
-- 
2.7.4



[RFC PATCH net-next 02/12] vhost_net: introduce vhost_exceeds_weight()

2018-05-21 Thread Jason Wang
Signed-off-by: Jason Wang <jasow...@redhat.com>
---
 drivers/vhost/net.c | 13 -
 1 file changed, 8 insertions(+), 5 deletions(-)

diff --git a/drivers/vhost/net.c b/drivers/vhost/net.c
index 15d191a..de544ee 100644
--- a/drivers/vhost/net.c
+++ b/drivers/vhost/net.c
@@ -479,6 +479,12 @@ static size_t init_iov_iter(struct vhost_virtqueue *vq, 
struct iov_iter *iter,
return len;
 }
 
+static bool vhost_exceeds_weight(int pkts, int total_len)
+{
+   return unlikely(total_len >= VHOST_NET_WEIGHT) ||
+  unlikely(pkts >= VHOST_NET_PKT_WEIGHT);
+}
+
 /* Expects to be always run from workqueue - which acts as
  * read-size critical section for our kind of RCU. */
 static void handle_tx(struct vhost_net *net)
@@ -570,7 +576,6 @@ static void handle_tx(struct vhost_net *net)
msg.msg_control = NULL;
ubufs = NULL;
}
-
total_len += len;
if (total_len < VHOST_NET_WEIGHT &&
!vhost_vq_avail_empty(>dev, vq) &&
@@ -600,8 +605,7 @@ static void handle_tx(struct vhost_net *net)
else
vhost_zerocopy_signal_used(net, vq);
vhost_net_tx_packet(net);
-   if (unlikely(total_len >= VHOST_NET_WEIGHT) ||
-   unlikely(++sent_pkts >= VHOST_NET_PKT_WEIGHT)) {
+   if (unlikely(vhost_exceeds_weight(++sent_pkts, total_len))) {
vhost_poll_queue(>poll);
break;
}
@@ -887,8 +891,7 @@ static void handle_rx(struct vhost_net *net)
if (unlikely(vq_log))
vhost_log_write(vq, vq_log, log, vhost_len);
total_len += vhost_len;
-   if (unlikely(total_len >= VHOST_NET_WEIGHT) ||
-   unlikely(++recv_pkts >= VHOST_NET_PKT_WEIGHT)) {
+   if (unlikely(vhost_exceeds_weight(++recv_pkts, total_len))) {
vhost_poll_queue(>poll);
goto out;
}
-- 
2.7.4



[RFC PATCH net-next 04/12] vhost_net: split out datacopy logic

2018-05-21 Thread Jason Wang
Instead of mixing zerocopy and datacopy logics, this patch tries to
split the datacopy logic out. This results in more compact code, and
specific optimizations can be done on top more easily.

Signed-off-by: Jason Wang <jasow...@redhat.com>
---
 drivers/vhost/net.c | 111 +++-
 1 file changed, 102 insertions(+), 9 deletions(-)

diff --git a/drivers/vhost/net.c b/drivers/vhost/net.c
index 4ebac76..4682fcc 100644
--- a/drivers/vhost/net.c
+++ b/drivers/vhost/net.c
@@ -492,9 +492,95 @@ static bool vhost_has_more_pkts(struct vhost_net *net,
   likely(!vhost_exceeds_maxpend(net));
 }
 
+static void handle_tx_copy(struct vhost_net *net)
+{
+	struct vhost_net_virtqueue *nvq = &net->vqs[VHOST_NET_VQ_TX];
+	struct vhost_virtqueue *vq = &nvq->vq;
+   unsigned out, in;
+   int head;
+   struct msghdr msg = {
+   .msg_name = NULL,
+   .msg_namelen = 0,
+   .msg_control = NULL,
+   .msg_controllen = 0,
+   .msg_flags = MSG_DONTWAIT,
+   };
+   size_t len, total_len = 0;
+   int err;
+   size_t hdr_size;
+   struct socket *sock;
+   struct vhost_net_ubuf_ref *uninitialized_var(ubufs);
+   int sent_pkts = 0;
+
+   mutex_lock(>mutex);
+   sock = vq->private_data;
+   if (!sock)
+   goto out;
+
+   if (!vq_iotlb_prefetch(vq))
+   goto out;
+
+   vhost_disable_notify(>dev, vq);
+   vhost_net_disable_vq(net, vq);
+
+   hdr_size = nvq->vhost_hlen;
+
+   for (;;) {
+   head = vhost_net_tx_get_vq_desc(net, vq, vq->iov,
+   ARRAY_SIZE(vq->iov),
+						&out, &in);
+   /* On error, stop handling until the next kick. */
+   if (unlikely(head < 0))
+   break;
+   /* Nothing new?  Wait for eventfd to tell us they refilled. */
+   if (head == vq->num) {
+   if (unlikely(vhost_enable_notify(>dev, vq))) {
+   vhost_disable_notify(>dev, vq);
+   continue;
+   }
+   break;
+   }
+   if (in) {
+   vq_err(vq, "Unexpected descriptor format for TX: "
+  "out %d, int %d\n", out, in);
+   break;
+   }
+
+		len = init_iov_iter(vq, &msg.msg_iter, hdr_size, out);
+   if (len < 0)
+   break;
+
+   total_len += len;
+   if (total_len < VHOST_NET_WEIGHT &&
+   vhost_has_more_pkts(net, vq)) {
+   msg.msg_flags |= MSG_MORE;
+   } else {
+   msg.msg_flags &= ~MSG_MORE;
+   }
+
+   /* TODO: Check specific error and bomb out unless ENOBUFS? */
+		err = sock->ops->sendmsg(sock, &msg, len);
+   if (unlikely(err < 0)) {
+   vhost_discard_vq_desc(vq, 1);
+   vhost_net_enable_vq(net, vq);
+   break;
+   }
+   if (err != len)
+   pr_debug("Truncated TX packet: "
+" len %d != %zd\n", err, len);
+		vhost_add_used_and_signal(&net->dev, vq, head, 0);
+   if (vhost_exceeds_weight(++sent_pkts, total_len)) {
+   vhost_poll_queue(>poll);
+   break;
+   }
+   }
+out:
+   mutex_unlock(>mutex);
+}
+
 /* Expects to be always run from workqueue - which acts as
  * read-size critical section for our kind of RCU. */
-static void handle_tx(struct vhost_net *net)
+static void handle_tx_zerocopy(struct vhost_net *net)
 {
struct vhost_net_virtqueue *nvq = >vqs[VHOST_NET_VQ_TX];
struct vhost_virtqueue *vq = >vq;
@@ -512,7 +598,7 @@ static void handle_tx(struct vhost_net *net)
size_t hdr_size;
struct socket *sock;
struct vhost_net_ubuf_ref *uninitialized_var(ubufs);
-   bool zcopy, zcopy_used;
+   bool zcopy_used;
int sent_pkts = 0;
 
mutex_lock(>mutex);
@@ -527,13 +613,10 @@ static void handle_tx(struct vhost_net *net)
vhost_net_disable_vq(net, vq);
 
hdr_size = nvq->vhost_hlen;
-   zcopy = nvq->ubufs;
 
for (;;) {
/* Release DMAs done buffers first */
-   if (zcopy)
-   vhost_zerocopy_signal_used(net, vq);
-
+   vhost_zerocopy_signal_used(net, vq);
 
head = vhost_net_tx_get_vq_desc(net, vq, vq->iov,
ARRAY_SIZE(vq->iov),
@@ -559,9 +642,9 @@ static vo

[RFC PATCH net-next 06/12] tuntap: enable preemption early

2018-05-21 Thread Jason Wang
Signed-off-by: Jason Wang <jasow...@redhat.com>
---
 drivers/net/tun.c | 5 ++---
 1 file changed, 2 insertions(+), 3 deletions(-)

diff --git a/drivers/net/tun.c b/drivers/net/tun.c
index 44d4f3d..24ecd82 100644
--- a/drivers/net/tun.c
+++ b/drivers/net/tun.c
@@ -1697,6 +1697,8 @@ static struct sk_buff *tun_build_skb(struct tun_struct 
*tun,
goto err_xdp;
}
}
+   rcu_read_unlock();
+   preempt_enable();
 
skb = build_skb(buf, buflen);
if (!skb) {
@@ -1710,9 +1712,6 @@ static struct sk_buff *tun_build_skb(struct tun_struct 
*tun,
get_page(alloc_frag->page);
alloc_frag->offset += buflen;
 
-   rcu_read_unlock();
-   preempt_enable();
-
return skb;
 
 err_redirect:
-- 
2.7.4



[RFC PATCH net-next 05/12] vhost_net: batch update used ring for datacopy TX

2018-05-21 Thread Jason Wang
Like commit e2b3b35eb989 ("vhost_net: batch used ring update in rx"),
this patch implements batched used ring updates for datacopy TX
(zerocopy has already done some kind of batching).

Testpmd transmission from guest to ixgbe via XDP_REDIRECT shows about
15% improvement from 2.8Mpps to 3.2Mpps.
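
The shape of the batching is the usual accumulate-then-flush pattern; a
small self-contained sketch (illustrative only; VHOST_RX_BATCH is assumed
to be 64 as in drivers/vhost/net.c of this era):

#include <stdio.h>

#define VHOST_RX_BATCH 64

int main(void)
{
	int nheads = 0, signals = 0;

	for (int pkt = 0; pkt < 200; pkt++) {
		if (++nheads == VHOST_RX_BATCH) {	/* vhost_add_used_and_signal_n() */
			signals++;
			nheads = 0;
		}
	}
	if (nheads)					/* flush the remainder on exit */
		signals++;

	printf("200 packets -> %d used-ring signal(s)\n", signals);
	return 0;
}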

Signed-off-by: Jason Wang <jasow...@redhat.com>
---
 drivers/vhost/net.c | 13 -
 1 file changed, 12 insertions(+), 1 deletion(-)

diff --git a/drivers/vhost/net.c b/drivers/vhost/net.c
index 4682fcc..f0639d7 100644
--- a/drivers/vhost/net.c
+++ b/drivers/vhost/net.c
@@ -511,6 +511,7 @@ static void handle_tx_copy(struct vhost_net *net)
struct socket *sock;
struct vhost_net_ubuf_ref *uninitialized_var(ubufs);
int sent_pkts = 0;
+   s16 nheads = 0;
 
mutex_lock(>mutex);
sock = vq->private_data;
@@ -550,6 +551,9 @@ static void handle_tx_copy(struct vhost_net *net)
if (len < 0)
break;
 
+   vq->heads[nheads].id = cpu_to_vhost32(vq, head);
+   vq->heads[nheads].len = 0;
+
total_len += len;
if (total_len < VHOST_NET_WEIGHT &&
vhost_has_more_pkts(net, vq)) {
@@ -568,13 +572,20 @@ static void handle_tx_copy(struct vhost_net *net)
if (err != len)
pr_debug("Truncated TX packet: "
 " len %d != %zd\n", err, len);
-   vhost_add_used_and_signal(>dev, vq, head, 0);
+   if (++nheads == VHOST_RX_BATCH) {
+   vhost_add_used_and_signal_n(>dev, vq, vq->heads,
+   nheads);
+   nheads = 0;
+   }
if (vhost_exceeds_weight(++sent_pkts, total_len)) {
vhost_poll_queue(>poll);
break;
}
}
 out:
+   if (nheads)
+   vhost_add_used_and_signal_n(>dev, vq, vq->heads,
+   nheads);
mutex_unlock(>mutex);
 }
 
-- 
2.7.4



[RFC PATCH net-next 07/12] tuntap: simplify error handling in tun_build_skb()

2018-05-21 Thread Jason Wang
Signed-off-by: Jason Wang <jasow...@redhat.com>
---
 drivers/net/tun.c | 36 ++--
 1 file changed, 14 insertions(+), 22 deletions(-)

diff --git a/drivers/net/tun.c b/drivers/net/tun.c
index 24ecd82..f6e0f96 100644
--- a/drivers/net/tun.c
+++ b/drivers/net/tun.c
@@ -1612,7 +1612,7 @@ static struct sk_buff *tun_build_skb(struct tun_struct 
*tun,
 int len, int *skb_xdp)
 {
	struct page_frag *alloc_frag = &current->task_frag;
-   struct sk_buff *skb;
+   struct sk_buff *skb = NULL;
struct bpf_prog *xdp_prog;
int buflen = SKB_DATA_ALIGN(sizeof(struct skb_shared_info));
unsigned int delta = 0;
@@ -1638,6 +1638,9 @@ static struct sk_buff *tun_build_skb(struct tun_struct 
*tun,
if (copied != len)
return ERR_PTR(-EFAULT);
 
+   get_page(alloc_frag->page);
+   alloc_frag->offset += buflen;
+
/* There's a small window that XDP may be set after the check
 * of xdp_prog above, this should be rare and for simplicity
 * we do XDP on skb in case the headroom is not enough.
@@ -1665,24 +1668,16 @@ static struct sk_buff *tun_build_skb(struct tun_struct 
*tun,
 
switch (act) {
case XDP_REDIRECT:
-   get_page(alloc_frag->page);
-   alloc_frag->offset += buflen;
err = xdp_do_redirect(tun->dev, , xdp_prog);
xdp_do_flush_map();
if (err)
-   goto err_redirect;
-   rcu_read_unlock();
-   preempt_enable();
-   return NULL;
+   goto err_xdp;
+   goto out;
case XDP_TX:
-   get_page(alloc_frag->page);
-   alloc_frag->offset += buflen;
if (tun_xdp_tx(tun->dev, ))
-   goto err_redirect;
+   goto err_xdp;
tun_xdp_flush(tun->dev);
-   rcu_read_unlock();
-   preempt_enable();
-   return NULL;
+   goto out;
case XDP_PASS:
delta = orig_data - xdp.data;
len = xdp.data_end - xdp.data;
@@ -1702,25 +1697,22 @@ static struct sk_buff *tun_build_skb(struct tun_struct 
*tun,
 
skb = build_skb(buf, buflen);
if (!skb) {
-   rcu_read_unlock();
-   preempt_enable();
-   return ERR_PTR(-ENOMEM);
+   skb = ERR_PTR(-ENOMEM);
+   goto out;
}
 
skb_reserve(skb, pad - delta);
skb_put(skb, len);
-   get_page(alloc_frag->page);
-   alloc_frag->offset += buflen;
 
return skb;
 
-err_redirect:
-   put_page(alloc_frag->page);
 err_xdp:
+   alloc_frag->offset -= buflen;
+   put_page(alloc_frag->page);
+out:
rcu_read_unlock();
preempt_enable();
-   this_cpu_inc(tun->pcpu_stats->rx_dropped);
-   return NULL;
+   return skb;
 }
 
 /* Get packet from user space buffer */
-- 
2.7.4



[RFC PATCH net-next 08/12] tuntap: tweak on the path of non-xdp case in tun_build_skb()

2018-05-21 Thread Jason Wang
If we're sure not to go for native XDP, there's no need for several things
like touching the preemption counter and the RCU stuff.

Signed-off-by: Jason Wang <jasow...@redhat.com>
---
 drivers/net/tun.c | 9 ++---
 1 file changed, 6 insertions(+), 3 deletions(-)

diff --git a/drivers/net/tun.c b/drivers/net/tun.c
index f6e0f96..ed04291 100644
--- a/drivers/net/tun.c
+++ b/drivers/net/tun.c
@@ -1645,10 +1645,12 @@ static struct sk_buff *tun_build_skb(struct tun_struct 
*tun,
 * of xdp_prog above, this should be rare and for simplicity
 * we do XDP on skb in case the headroom is not enough.
 */
-   if (hdr->gso_type || !xdp_prog)
+   if (hdr->gso_type || !xdp_prog) {
*skb_xdp = 1;
-   else
-   *skb_xdp = 0;
+   goto build;
+   }
+
+   *skb_xdp = 0;
 
preempt_disable();
rcu_read_lock();
@@ -1695,6 +1697,7 @@ static struct sk_buff *tun_build_skb(struct tun_struct 
*tun,
rcu_read_unlock();
preempt_enable();
 
+build:
skb = build_skb(buf, buflen);
if (!skb) {
skb = ERR_PTR(-ENOMEM);
-- 
2.7.4



[RFC PATCH net-next 09/12] tuntap: split out XDP logic

2018-05-21 Thread Jason Wang
This patch splits the XDP logic out into a single function. This makes it
reusable by the XDP batching path in the following patch.

Signed-off-by: Jason Wang <jasow...@redhat.com>
---
 drivers/net/tun.c | 85 +--
 1 file changed, 51 insertions(+), 34 deletions(-)

diff --git a/drivers/net/tun.c b/drivers/net/tun.c
index ed04291..2560378 100644
--- a/drivers/net/tun.c
+++ b/drivers/net/tun.c
@@ -1605,6 +1605,44 @@ static bool tun_can_build_skb(struct tun_struct *tun, 
struct tun_file *tfile,
return true;
 }
 
+static u32 tun_do_xdp(struct tun_struct *tun,
+ struct tun_file *tfile,
+ struct bpf_prog *xdp_prog,
+ struct xdp_buff *xdp,
+ int *err)
+{
+   u32 act = bpf_prog_run_xdp(xdp_prog, xdp);
+
+   switch (act) {
+   case XDP_REDIRECT:
+   *err = xdp_do_redirect(tun->dev, xdp, xdp_prog);
+   xdp_do_flush_map();
+   if (*err)
+   break;
+   goto out;
+   case XDP_TX:
+   *err = tun_xdp_tx(tun->dev, xdp);
+   if (*err)
+   break;
+   tun_xdp_flush(tun->dev);
+   goto out;
+   case XDP_PASS:
+   goto out;
+   default:
+   bpf_warn_invalid_xdp_action(act);
+   /* fall through */
+   case XDP_ABORTED:
+   trace_xdp_exception(tun->dev, xdp_prog, act);
+   /* fall through */
+   case XDP_DROP:
+   break;
+   }
+
+   put_page(virt_to_head_page(xdp->data_hard_start));
+out:
+   return act;
+}
+
 static struct sk_buff *tun_build_skb(struct tun_struct *tun,
 struct tun_file *tfile,
 struct iov_iter *from,
@@ -1615,10 +1653,10 @@ static struct sk_buff *tun_build_skb(struct tun_struct 
*tun,
struct sk_buff *skb = NULL;
struct bpf_prog *xdp_prog;
int buflen = SKB_DATA_ALIGN(sizeof(struct skb_shared_info));
-   unsigned int delta = 0;
char *buf;
size_t copied;
-   int err, pad = TUN_RX_PAD;
+   int pad = TUN_RX_PAD;
+   int err = 0;
 
rcu_read_lock();
xdp_prog = rcu_dereference(tun->xdp_prog);
@@ -1655,9 +1693,8 @@ static struct sk_buff *tun_build_skb(struct tun_struct 
*tun,
preempt_disable();
rcu_read_lock();
xdp_prog = rcu_dereference(tun->xdp_prog);
-   if (xdp_prog && !*skb_xdp) {
+   if (xdp_prog) {
struct xdp_buff xdp;
-   void *orig_data;
u32 act;
 
xdp.data_hard_start = buf;
@@ -1665,34 +1702,14 @@ static struct sk_buff *tun_build_skb(struct tun_struct 
*tun,
 		xdp_set_data_meta_invalid(&xdp);
 		xdp.data_end = xdp.data + len;
 		xdp.rxq = &tfile->xdp_rxq;
-   orig_data = xdp.data;
-   act = bpf_prog_run_xdp(xdp_prog, );
-
-   switch (act) {
-   case XDP_REDIRECT:
-   err = xdp_do_redirect(tun->dev, , xdp_prog);
-   xdp_do_flush_map();
-   if (err)
-   goto err_xdp;
-   goto out;
-   case XDP_TX:
-   if (tun_xdp_tx(tun->dev, ))
-   goto err_xdp;
-   tun_xdp_flush(tun->dev);
-   goto out;
-   case XDP_PASS:
-   delta = orig_data - xdp.data;
-   len = xdp.data_end - xdp.data;
-   break;
-   default:
-   bpf_warn_invalid_xdp_action(act);
-   /* fall through */
-   case XDP_ABORTED:
-   trace_xdp_exception(tun->dev, xdp_prog, act);
-   /* fall through */
-   case XDP_DROP:
+		act = tun_do_xdp(tun, tfile, xdp_prog, &xdp, &err);
+   if (err)
goto err_xdp;
-   }
+   if (act != XDP_PASS)
+   goto out;
+
+   pad = xdp.data - xdp.data_hard_start;
+   len = xdp.data_end - xdp.data;
}
rcu_read_unlock();
preempt_enable();
@@ -1700,18 +1717,18 @@ static struct sk_buff *tun_build_skb(struct tun_struct 
*tun,
 build:
skb = build_skb(buf, buflen);
if (!skb) {
+   put_page(alloc_frag->page);
skb = ERR_PTR(-ENOMEM);
goto out;
}
 
-   skb_reserve(skb, pad - delta);
+   skb_reserve(skb, pad);
skb_put(skb, len);
 
return skb;
 
 err_xdp:
-   alloc_frag->offset -= buflen;
-   put_page(alloc_frag->page);
+   this_cpu_inc(tun->pcpu_stats->rx_dropped);
 out:
rcu_read_unlock();
preempt_enable();
-- 
2.7.4



[RFC PATCH net-next 12/12] vhost_net: batch submitting XDP buffers to underlayer sockets

2018-05-21 Thread Jason Wang
This patch implements XDP batching for vhost_net with tun. This is
done by batching XDP buffs in vhost and submitting them when:

- vhost_net can not build an XDP buff (mostly because of the size of the packet)
- #batched exceeds the limitation (VHOST_NET_RX_BATCH)
- tun accepts a batch of XDP buffs through msg_control and processes them
  in a batch

With this, tun XDP can benefit from e.g. batched transmission during
XDP_REDIRECT or XDP_TX.

Tests show a 21% improvement in TX pps (from ~3.2Mpps to ~3.9Mpps)
while transmitting through testpmd from guest to host by
xdp_redirect_map between tap0 and ixgbe.

Signed-off-by: Jason Wang <jasow...@redhat.com>
---
 drivers/net/tun.c   | 36 +--
 drivers/vhost/net.c | 71 -
 2 files changed, 71 insertions(+), 36 deletions(-)

diff --git a/drivers/net/tun.c b/drivers/net/tun.c
index b586b3f..5d16d18 100644
--- a/drivers/net/tun.c
+++ b/drivers/net/tun.c
@@ -1616,7 +1616,6 @@ static u32 tun_do_xdp(struct tun_struct *tun,
switch (act) {
case XDP_REDIRECT:
*err = xdp_do_redirect(tun->dev, xdp, xdp_prog);
-   xdp_do_flush_map();
if (*err)
break;
goto out;
@@ -1624,7 +1623,6 @@ static u32 tun_do_xdp(struct tun_struct *tun,
*err = tun_xdp_tx(tun->dev, xdp);
if (*err)
break;
-   tun_xdp_flush(tun->dev);
goto out;
case XDP_PASS:
goto out;
@@ -2400,9 +2398,6 @@ static int tun_xdp_one(struct tun_struct *tun,
int err = 0;
bool skb_xdp = false;
 
-   preempt_disable();
-   rcu_read_lock();
-
xdp_prog = rcu_dereference(tun->xdp_prog);
if (xdp_prog) {
if (gso->gso_type) {
@@ -2461,15 +2456,12 @@ static int tun_xdp_one(struct tun_struct *tun,
tun_flow_update(tun, rxhash, tfile);
 
 out:
-   rcu_read_unlock();
-   preempt_enable();
-
return err;
 }
 
 static int tun_sendmsg(struct socket *sock, struct msghdr *m, size_t total_len)
 {
-   int ret;
+   int ret, i;
struct tun_file *tfile = container_of(sock, struct tun_file, socket);
struct tun_struct *tun = tun_get(tfile);
struct tun_msg_ctl *ctl = m->msg_control;
@@ -2477,10 +2469,28 @@ static int tun_sendmsg(struct socket *sock, struct 
msghdr *m, size_t total_len)
if (!tun)
return -EBADFD;
 
-   if (ctl && ctl->type == TUN_MSG_PTR) {
-   ret = tun_xdp_one(tun, tfile, ctl->ptr);
-   if (!ret)
-   ret = total_len;
+   if (ctl && ((ctl->type & 0xF) == TUN_MSG_PTR)) {
+   int n = ctl->type >> 16;
+
+   preempt_disable();
+   rcu_read_lock();
+
+   for (i = 0; i < n; i++) {
+   struct xdp_buff *x = (struct xdp_buff *)ctl->ptr;
+			struct xdp_buff *xdp = &x[i];
+
+			xdp_set_data_meta_invalid(xdp);
+			xdp->rxq = &tfile->xdp_rxq;
+   tun_xdp_one(tun, tfile, xdp);
+   }
+
+   xdp_do_flush_map();
+   tun_xdp_flush(tun->dev);
+
+   rcu_read_unlock();
+   preempt_enable();
+
+   ret = total_len;
goto out;
}
 
diff --git a/drivers/vhost/net.c b/drivers/vhost/net.c
index 0d84de6..bec4109 100644
--- a/drivers/vhost/net.c
+++ b/drivers/vhost/net.c
@@ -118,6 +118,7 @@ struct vhost_net_virtqueue {
struct ptr_ring *rx_ring;
struct vhost_net_buf rxq;
struct xdp_buff xdp[VHOST_RX_BATCH];
+   struct vring_used_elem heads[VHOST_RX_BATCH];
 };
 
 struct vhost_net {
@@ -511,7 +512,7 @@ static int vhost_net_build_xdp(struct vhost_net_virtqueue 
*nvq,
void *buf;
int copied;
 
-   if (len < nvq->sock_hlen)
+   if (unlikely(len < nvq->sock_hlen))
return -EFAULT;
 
if (SKB_DATA_ALIGN(len + pad) +
@@ -567,11 +568,37 @@ static int vhost_net_build_xdp(struct vhost_net_virtqueue 
*nvq,
return 0;
 }
 
+static void vhost_tx_batch(struct vhost_net *net,
+  struct vhost_net_virtqueue *nvq,
+  struct socket *sock,
+  struct msghdr *msghdr, int n)
+{
+   struct tun_msg_ctl ctl = {
+   .type = n << 16 | TUN_MSG_PTR,
+   .ptr = nvq->xdp,
+   };
+   int err;
+
+   if (n == 0)
+   return;
+
+	msghdr->msg_control = &ctl;
+	err = sock->ops->sendmsg(sock, msghdr, 0);
+
+	if (unlikely(err < 0)) {
+		/* FIXME vq_err() */
+		vq_err(&nvq->vq, "sendmsg err!\n");
+   return;
+   }
+   vhost_add_used_and_signal_n(&g

[RFC PATCH net-next 10/12] vhost_net: build xdp buff

2018-05-21 Thread Jason Wang
This patch implements building XDP buffers in vhost_net. The idea is to do
the userspace copy in vhost_net and build an XDP buff based on the
page. Vhost_net can then submit one or an array of XDP buffs to the
underlying socket (e.g. TUN). TUN can choose to do XDP or call
build_skb() to build an skb. To support building the skb, the vnet header
is also stored into the header of the XDP buff.

This userspace copy and XDP buff building is key to achieving XDP
batching in TUN, since TUN does not need to care about the userspace copy
and can then disable preemption across several XDP buffs to achieve
batching from XDP.

TODO: reserve headroom based on the TUN XDP.
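
To make the metadata layout concrete, a self-contained userspace sketch
(illustrative only; the vnet header is stubbed and the pad value is made up)
of how the producer lays the buffer out and how the TUN side reads it back:

#include <stdio.h>
#include <string.h>

struct virtio_net_hdr { unsigned char flags, gso_type; };	/* stub */

int main(void)
{
	char buf[2048];
	int buflen = sizeof(buf);
	int pad = 256;			/* stands in for the aligned headroom */
	struct virtio_net_hdr gso = { 0, 0 };

	/* producer (vhost_net_build_xdp): buflen at offset 0, vnet header
	 * right after it, packet data at buf + pad */
	memcpy(buf, &buflen, sizeof(int));
	memcpy(buf + sizeof(int), &gso, sizeof(gso));
	memcpy(buf + pad, "packet", 7);

	/* consumer (tun_xdp_one in the next patch): read the metadata back */
	int got_buflen;
	memcpy(&got_buflen, buf, sizeof(int));
	struct virtio_net_hdr *hdr = (struct virtio_net_hdr *)(buf + sizeof(int));

	printf("buflen=%d gso_type=%u data=\"%s\"\n",
	       got_buflen, hdr->gso_type, buf + pad);
	return 0;
}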

Signed-off-by: Jason Wang <jasow...@redhat.com>
---
 drivers/vhost/net.c | 74 +
 1 file changed, 74 insertions(+)

diff --git a/drivers/vhost/net.c b/drivers/vhost/net.c
index f0639d7..1209e84 100644
--- a/drivers/vhost/net.c
+++ b/drivers/vhost/net.c
@@ -492,6 +492,80 @@ static bool vhost_has_more_pkts(struct vhost_net *net,
   likely(!vhost_exceeds_maxpend(net));
 }
 
+#define VHOST_NET_HEADROOM 256
+#define VHOST_NET_RX_PAD (NET_IP_ALIGN + NET_SKB_PAD)
+
+static int vhost_net_build_xdp(struct vhost_net_virtqueue *nvq,
+  struct iov_iter *from,
+  struct xdp_buff *xdp)
+{
+	struct vhost_virtqueue *vq = &nvq->vq;
+	struct page_frag *alloc_frag = &current->task_frag;
+   struct virtio_net_hdr *gso;
+   size_t len = iov_iter_count(from);
+   int buflen = SKB_DATA_ALIGN(sizeof(struct skb_shared_info));
+   int pad = SKB_DATA_ALIGN(VHOST_NET_RX_PAD + VHOST_NET_HEADROOM
++ nvq->sock_hlen);
+   int sock_hlen = nvq->sock_hlen;
+   void *buf;
+   int copied;
+
+   if (len < nvq->sock_hlen)
+   return -EFAULT;
+
+   if (SKB_DATA_ALIGN(len + pad) +
+   SKB_DATA_ALIGN(sizeof(struct skb_shared_info)) > PAGE_SIZE)
+   return -ENOSPC;
+
+   buflen += SKB_DATA_ALIGN(len + pad);
+   alloc_frag->offset = ALIGN((u64)alloc_frag->offset, SMP_CACHE_BYTES);
+   if (unlikely(!skb_page_frag_refill(buflen, alloc_frag, GFP_KERNEL)))
+   return -ENOMEM;
+
+   buf = (char *)page_address(alloc_frag->page) + alloc_frag->offset;
+
+   /* We store two kinds of metadata in the header which will be
+* used for XDP_PASS to do build_skb():
+* offset 0: buflen
+* offset sizeof(int): vnet header
+*/
+   copied = copy_page_from_iter(alloc_frag->page,
+alloc_frag->offset + sizeof(int), 
sock_hlen, from);
+   if (copied != sock_hlen)
+   return -EFAULT;
+
+   gso = (struct virtio_net_hdr *)(buf + sizeof(int));
+
+   if ((gso->flags & VIRTIO_NET_HDR_F_NEEDS_CSUM) &&
+   vhost16_to_cpu(vq, gso->csum_start) +
+   vhost16_to_cpu(vq, gso->csum_offset) + 2 >
+   vhost16_to_cpu(vq, gso->hdr_len)) {
+   gso->hdr_len = cpu_to_vhost16(vq,
+  vhost16_to_cpu(vq, gso->csum_start) +
+  vhost16_to_cpu(vq, gso->csum_offset) + 2);
+
+   if (vhost16_to_cpu(vq, gso->hdr_len) > len)
+   return -EINVAL;
+   }
+
+   len -= sock_hlen;
+   copied = copy_page_from_iter(alloc_frag->page,
+alloc_frag->offset + pad,
+len, from);
+   if (copied != len)
+   return -EFAULT;
+
+   xdp->data_hard_start = buf;
+   xdp->data = buf + pad;
+   xdp->data_end = xdp->data + len;
+   *(int *)(xdp->data_hard_start)= buflen;
+
+   get_page(alloc_frag->page);
+   alloc_frag->offset += buflen;
+
+   return 0;
+}
+
 static void handle_tx_copy(struct vhost_net *net)
 {
struct vhost_net_virtqueue *nvq = >vqs[VHOST_NET_VQ_TX];
-- 
2.7.4



[RFC PATCH net-next 11/12] vhost_net: passing raw xdp buff to tun

2018-05-21 Thread Jason Wang
This patch implements a TUN-specific msg_control:

#define TUN_MSG_UBUF 1
#define TUN_MSG_PTR  2
struct tun_msg_ctl {
   int type;
   void *ptr;
};

The first supported type is ubuf, which is already used by the vhost_net
zerocopy code. The second is an XDP buff, which allows vhost_net to pass an
XDP buff to TUN. This could be used to implement accepting an array of
XDP buffs from vhost_net in the following patches.
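As a rough usage sketch (an assumption about the caller, not code from this
patch; the function name is made up), vhost_net would hand a prepared XDP
buff to TUN through the new control type roughly like this:

    static int example_send_xdp(struct socket *sock, struct xdp_buff *xdp,
                                size_t len)
    {
            struct tun_msg_ctl ctl = {
                    .type = TUN_MSG_PTR,
                    .ptr  = xdp,
            };
            struct msghdr msg = {
                    .msg_control = &ctl,
            };

            /* dispatches to tun_sendmsg(), which hands the buff to tun_xdp_one() */
            return sock->ops->sendmsg(sock, &msg, len);
    }

Keeping the type/pointer pair in msg_control avoids changing the sendmsg()
prototype while still letting TUN distinguish ubuf callers from XDP callers.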

Signed-off-by: Jason Wang <jasow...@redhat.com>
---
 drivers/net/tun.c  | 91 +-
 drivers/vhost/net.c| 21 ++--
 include/linux/if_tun.h |  7 
 3 files changed, 116 insertions(+), 3 deletions(-)

diff --git a/drivers/net/tun.c b/drivers/net/tun.c
index 2560378..b586b3f 100644
--- a/drivers/net/tun.c
+++ b/drivers/net/tun.c
@@ -2387,18 +2387,107 @@ static void tun_sock_write_space(struct sock *sk)
	kill_fasync(&tfile->fasync, SIGIO, POLL_OUT);
 }
 
+static int tun_xdp_one(struct tun_struct *tun,
+  struct tun_file *tfile,
+  struct xdp_buff *xdp)
+{
+   struct virtio_net_hdr *gso = xdp->data_hard_start + sizeof(int);
+   struct tun_pcpu_stats *stats;
+   struct bpf_prog *xdp_prog;
+   struct sk_buff *skb = NULL;
+   u32 rxhash = 0, act;
+   int buflen = *(int *)xdp->data_hard_start;
+   int err = 0;
+   bool skb_xdp = false;
+
+   preempt_disable();
+   rcu_read_lock();
+
+   xdp_prog = rcu_dereference(tun->xdp_prog);
+   if (xdp_prog) {
+   if (gso->gso_type) {
+   skb_xdp = true;
+   goto build;
+   }
+   xdp_set_data_meta_invalid(xdp);
+   xdp->rxq = &tfile->xdp_rxq;
+   act = tun_do_xdp(tun, tfile, xdp_prog, xdp, &err);
+   if (err)
+   goto out;
+   if (act != XDP_PASS)
+   goto out;
+   }
+
+build:
+   skb = build_skb(xdp->data_hard_start, buflen);
+   if (!skb) {
+   err = -ENOMEM;
+   goto out;
+   }
+
+   if (skb_xdp) {
+   err = do_xdp_generic(xdp_prog, skb);
+   if (err != XDP_PASS)
+   goto out;
+   }
+
+   skb_reserve(skb, xdp->data - xdp->data_hard_start);
+   skb_put(skb, xdp->data_end - xdp->data);
+
+   if (virtio_net_hdr_to_skb(skb, gso, tun_is_little_endian(tun))) {
+   this_cpu_inc(tun->pcpu_stats->rx_frame_errors);
+   kfree_skb(skb);
+   err = -EINVAL;
+   goto out;
+   }
+
+   skb->protocol = eth_type_trans(skb, tun->dev);
+   skb_reset_network_header(skb);
+   skb_probe_transport_header(skb, 0);
+
+   if (!rcu_dereference(tun->steering_prog))
+   rxhash = __skb_get_hash_symmetric(skb);
+
+   netif_receive_skb(skb);
+
+   stats = get_cpu_ptr(tun->pcpu_stats);
+   u64_stats_update_begin(&stats->syncp);
+   stats->rx_packets++;
+   stats->rx_bytes += skb->len;
+   u64_stats_update_end(&stats->syncp);
+   put_cpu_ptr(stats);
+
+   if (rxhash)
+   tun_flow_update(tun, rxhash, tfile);
+
+out:
+   rcu_read_unlock();
+   preempt_enable();
+
+   return err;
+}
+
 static int tun_sendmsg(struct socket *sock, struct msghdr *m, size_t total_len)
 {
int ret;
struct tun_file *tfile = container_of(sock, struct tun_file, socket);
struct tun_struct *tun = tun_get(tfile);
+   struct tun_msg_ctl *ctl = m->msg_control;
 
if (!tun)
return -EBADFD;
 
-   ret = tun_get_user(tun, tfile, m->msg_control, &m->msg_iter,
+   if (ctl && ctl->type == TUN_MSG_PTR) {
+   ret = tun_xdp_one(tun, tfile, ctl->ptr);
+   if (!ret)
+   ret = total_len;
+   goto out;
+   }
+
+   ret = tun_get_user(tun, tfile, ctl ? ctl->ptr : NULL, &m->msg_iter,
   m->msg_flags & MSG_DONTWAIT,
   m->msg_flags & MSG_MORE);
+out:
tun_put(tun);
return ret;
 }
diff --git a/drivers/vhost/net.c b/drivers/vhost/net.c
index 1209e84..0d84de6 100644
--- a/drivers/vhost/net.c
+++ b/drivers/vhost/net.c
@@ -117,6 +117,7 @@ struct vhost_net_virtqueue {
struct vhost_net_ubuf_ref *ubufs;
struct ptr_ring *rx_ring;
struct vhost_net_buf rxq;
+   struct xdp_buff xdp[VHOST_RX_BATCH];
 };
 
 struct vhost_net {
@@ -570,6 +571,7 @@ static void handle_tx_copy(struct vhost_net *net)
 {
	struct vhost_net_virtqueue *nvq = &net->vqs[VHOST_NET_VQ_TX];
	struct vhost_virtqueue *vq = &nvq->vq;
+   struct xdp_buff xdp;
unsigned out, in;
int head;
struct msghdr msg = {
@@ -584,6 +586,7 @@ static void handle_tx_copy(struct vhost_net *net)
size_t hdr_size;
struct sock

[PATCH net 0/4] Fix several issues of virtio-net mergeable XDP

2018-05-21 Thread Jason Wang
Hi:

Please review the patches that try to fix several issues of
virtio-net mergeable XDP.

Thanks

Jason Wang (4):
  virtio-net: correctly redirect linearized packet
  virtio-net: correctly transmit XDP buff after linearizing
  virtio-net: reset num_buf to 1 after linearizing packet
  virtio-net: fix leaking page for gso packet during mergeable XDP

 drivers/net/virtio_net.c | 21 +++--
 1 file changed, 11 insertions(+), 10 deletions(-)

-- 
2.7.4



[PATCH net 3/4] virtio-net: reset num_buf to 1 after linearizing packet

2018-05-21 Thread Jason Wang
If we successfully linearize the packet, num_buf is set to zero, which is
wrong since we now have only 1 buffer to be used, e.g. in the error path of
receive_mergeable(). A zero num_buf will lead the code to try to pop the
buffers of the next packet and drop it. Fix this by setting num_buf to 1 if
we successfully linearize the packet.

Fixes: 4941d472bf95 ("virtio-net: do not reset during XDP set")
Signed-off-by: Jason Wang <jasow...@redhat.com>
---
 drivers/net/virtio_net.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
index 6260d65..165a922 100644
--- a/drivers/net/virtio_net.c
+++ b/drivers/net/virtio_net.c
@@ -722,6 +722,7 @@ static struct sk_buff *receive_mergeable(struct net_device 
*dev,
  &len);
if (!xdp_page)
goto err_xdp;
+   num_buf = 1;
offset = VIRTIO_XDP_HEADROOM;
} else {
xdp_page = page;
-- 
2.7.4



[PATCH net 2/4] virtio-net: correctly transmit XDP buff after linearizing

2018-05-21 Thread Jason Wang
We should not go to the error path after successfully transmitting an XDP
buffer after linearizing, since the error path may try to pop and drop the
next packet and increase the drop counters. Fix this by simply dropping the
refcnt of the original page and going to the xmit path.

Fixes: 72979a6c3590 ("virtio_net: xdp, add slowpath case for non contiguous 
buffers")
Cc: John Fastabend <john.fastab...@gmail.com>
Signed-off-by: Jason Wang <jasow...@redhat.com>
---
 drivers/net/virtio_net.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
index c15d240..6260d65 100644
--- a/drivers/net/virtio_net.c
+++ b/drivers/net/virtio_net.c
@@ -775,7 +775,7 @@ static struct sk_buff *receive_mergeable(struct net_device 
*dev,
}
*xdp_xmit = true;
if (unlikely(xdp_page != page))
-   goto err_xdp;
+   put_page(page);
rcu_read_unlock();
goto xdp_xmit;
case XDP_REDIRECT:
-- 
2.7.4



[PATCH net 1/4] virtio-net: correctly redirect linearized packet

2018-05-21 Thread Jason Wang
After a linearized packet is redirected by XDP, we should not go to the
error path, which will try to pop buffers for the next packet and increase
the drop counter. Fix this by just dropping the page refcnt of the original
page.

Fixes: 186b3c998c50 ("virtio-net: support XDP_REDIRECT")
Reported-by: David Ahern <dsah...@gmail.com>
Tested-by: David Ahern <dsah...@gmail.com>
Signed-off-by: Jason Wang <jasow...@redhat.com>
---
 drivers/net/virtio_net.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
index 770422e..c15d240 100644
--- a/drivers/net/virtio_net.c
+++ b/drivers/net/virtio_net.c
@@ -787,7 +787,7 @@ static struct sk_buff *receive_mergeable(struct net_device 
*dev,
}
*xdp_xmit = true;
if (unlikely(xdp_page != page))
-   goto err_xdp;
+   put_page(page);
rcu_read_unlock();
goto xdp_xmit;
default:
-- 
2.7.4



[PATCH net 4/4] virtio-net: fix leaking page for gso packet during mergeable XDP

2018-05-21 Thread Jason Wang
We need to drop the refcnt of xdp_page if we see a gso packet, otherwise
it will be leaked. Fix this by moving the gso packet check above the
linearizing logic.

Cc: John Fastabend <john.fastab...@gmail.com>
Fixes: 72979a6c3590 ("virtio_net: xdp, add slowpath case for non contiguous 
buffers")
Signed-off-by: Jason Wang <jasow...@redhat.com>
---
 drivers/net/virtio_net.c | 16 
 1 file changed, 8 insertions(+), 8 deletions(-)

diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
index 165a922..f8db809 100644
--- a/drivers/net/virtio_net.c
+++ b/drivers/net/virtio_net.c
@@ -707,6 +707,14 @@ static struct sk_buff *receive_mergeable(struct net_device 
*dev,
void *data;
u32 act;
 
+   /* Transient failure which in theory could occur if
+* in-flight packets from before XDP was enabled reach
+* the receive path after XDP is loaded. In practice I
+* was not able to create this condition.
+*/
+   if (unlikely(hdr->hdr.gso_type))
+   goto err_xdp;
+
/* This happens when rx buffer size is underestimated
 * or headroom is not enough because of the buffer
 * was refilled before XDP is set. This should only
@@ -728,14 +736,6 @@ static struct sk_buff *receive_mergeable(struct net_device 
*dev,
xdp_page = page;
}
 
-   /* Transient failure which in theory could occur if
-* in-flight packets from before XDP was enabled reach
-* the receive path after XDP is loaded. In practice I
-* was not able to create this condition.
-*/
-   if (unlikely(hdr->hdr.gso_type))
-   goto err_xdp;
-
/* Allow consuming headroom but reserve enough space to push
 * the descriptor on if we get an XDP_TX return code.
 */
-- 
2.7.4



Re: KASAN: use-after-free Read in vhost_chr_write_iter

2018-05-20 Thread Jason Wang



On 2018年05月18日 17:24, Jason Wang wrote:



On 2018年05月17日 21:45, DaeRyong Jeong wrote:

We report the crash: KASAN: use-after-free Read in vhost_chr_write_iter

This crash has been found in v4.17-rc1 using RaceFuzzer (a modified
version of Syzkaller), which we describe more at the end of this
report. Our analysis shows that the race occurs when invoking two
syscalls concurrently, write$vnet and ioctl$VHOST_RESET_OWNER.


Analysis:
We think the concurrent execution of vhost_process_iotlb_msg() and
vhost_dev_cleanup() causes the crash.
Both functions can run concurrently (please see the call sequence below),
and possibly, there is a race on dev->iotlb.
If the switch occurs right after vhost_dev_cleanup() frees
dev->iotlb, vhost_process_iotlb_msg() still sees the non-null value and
keeps executing without returning -EFAULT. Consequently, a use-after-free
occurs.


Thread interleaving:
CPU0 (vhost_process_iotlb_msg)    CPU1 (vhost_dev_cleanup)
(In the case of both VHOST_IOTLB_UPDATE and
VHOST_IOTLB_INVALIDATE)
=    =
    vhost_umem_clean(dev->iotlb);
if (!dev->iotlb) {
    ret = -EFAULT;
    break;
}
    dev->iotlb = NULL;


Call Sequence:
CPU0
=
vhost_net_chr_write_iter
vhost_chr_write_iter
    vhost_process_iotlb_msg

CPU1
=
vhost_net_ioctl
vhost_net_reset_owner
    vhost_dev_reset_owner
    vhost_dev_cleanup
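
In other words, the handler follows a classic check-then-use pattern on
dev->iotlb without synchronization; roughly (vhost_iotlb_use() below is
just a made-up placeholder for whatever the handler does with the iotlb):

    if (!dev->iotlb)                 /* check */
            return -EFAULT;
    /* vhost_dev_cleanup() may free and NULL dev->iotlb right here ...      */
    vhost_iotlb_use(dev->iotlb);     /* ... so this use can be a use-after-free */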


Thanks a lot for the analysis.

This could be addressed by simply protecting it with the dev mutex.

Will post a patch.



Could you please help to test the attached patch? I've done some smoke
testing.


Thanks
>From 88328386f3f652e684ee33dc4cf63dcaed871aea Mon Sep 17 00:00:00 2001
From: Jason Wang <jasow...@redhat.com>
Date: Fri, 18 May 2018 17:33:27 +0800
Subject: [PATCH] vhost: synchronize IOTLB message with dev cleanup

DaeRyong Jeong reports a race between vhost_dev_cleanup() and
vhost_process_iotlb_msg():

Thread interleaving:
CPU0 (vhost_process_iotlb_msg)			CPU1 (vhost_dev_cleanup)
(In the case of both VHOST_IOTLB_UPDATE and
VHOST_IOTLB_INVALIDATE)
=		=
		vhost_umem_clean(dev->iotlb);
if (!dev->iotlb) {
	ret = -EFAULT;
		break;
}
		dev->iotlb = NULL;

The reason is that we don't synchronize between them; fix this by protecting
vhost_process_iotlb_msg() with the dev mutex.

Reported-by: DaeRyong Jeong <threeear...@gmail.com>
Fixes: 6b1e6cc7855b0 ("vhost: new device IOTLB API")
---
 drivers/vhost/vhost.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
index f3bd8e9..f0be5f3 100644
--- a/drivers/vhost/vhost.c
+++ b/drivers/vhost/vhost.c
@@ -981,6 +981,7 @@ static int vhost_process_iotlb_msg(struct vhost_dev *dev,
 {
 	int ret = 0;
 
+	mutex_lock(&dev->mutex);
 	vhost_dev_lock_vqs(dev);
 	switch (msg->type) {
 	case VHOST_IOTLB_UPDATE:
@@ -1016,6 +1017,8 @@ static int vhost_process_iotlb_msg(struct vhost_dev *dev,
 	}
 
 	vhost_dev_unlock_vqs(dev);
+	mutex_unlock(&dev->mutex);
+
 	return ret;
 }
 ssize_t vhost_chr_write_iter(struct vhost_dev *dev,
-- 
2.7.4



Re: [RFC V4 PATCH 0/8] Packed ring layout for vhost

2018-05-20 Thread Jason Wang



On 2018年05月21日 00:25, Wei Xu wrote:

On Wed, May 16, 2018 at 08:32:13PM +0800, Jason Wang wrote:

Hi all:

This RFC implement packed ring layout. The code were tested with
Tiwei's RFC V3 ahttps://lkml.org/lkml/2018/4/25/34. Some fixups and
tweaks were needed on top of Tiwei's code to make it run for event
index.

Could you please show the changes based on Tiwei's code to ease others'
testing?


Please try Tiwei's V4 instead of just waiting for the fixup. It should
work unless you try zerocopy or vIOMMU.




Pktgen reports about 20% improvement on PPS (event index is off). More
testing is ongoing.

Notes for tester:

- Starting from this version, vhost needs qemu co-operation to work
   correctly, or you can comment out the packed-specific code for
   GET/SET_VRING_BASE.

Do you mean the code in vhost_virtqueue_start/stop?


For qemu, probably.


Both Tiwei's and your v3
work fortunately correctly which should be avoided since the ring should be
definitely different.


I don't understand this, you mean reset work?

Thanks



Wei


Changes from V3:
- Fix math on event idx checking
- Sync last avail wrap counter through GET/SET_VRING_BASE
- remove desc_event prefix in the driver/device structure

Changes from V2:
- do not use & in checking desc_event_flags
- off should be most significant bit
- remove the workaround of mergeable buffer for dpdk prototype
- id should be in the last descriptor in the chain
- keep _F_WRITE for write descriptor when adding used
- device flags updating should use ADDR_USED type
- return error on unexpected unavail descriptor in a chain
- return false in vhost_vq_avail_empty() if the descriptor is available
- track last seen avail_wrap_counter
- correctly examine available descriptor in get_indirect_packed()
- vhost_idx_diff should return u16 instead of bool

Changes from V1:

- Refactor vhost used elem code to avoid open coding on used elem
- Event suppression support (compile test only).
- Indirect descriptor support (compile test only).
- Zerocopy support.
- vIOMMU support.
- SCSI/VSOCK support (compile test only).
- Fix several bugs

Jason Wang (8):
   vhost: move get_rx_bufs to vhost.c
   vhost: hide used ring layout from device
   vhost: do not use vring_used_elem
   vhost_net: do not explicitly manipulate vhost_used_elem
   vhost: vhost_put_user() can accept metadata type
   virtio: introduce packed ring defines
   vhost: packed ring support
   vhost: event suppression for packed ring

  drivers/vhost/net.c| 136 ++
  drivers/vhost/scsi.c   |  62 +--
  drivers/vhost/vhost.c  | 861 -
  drivers/vhost/vhost.h  |  47 +-
  drivers/vhost/vsock.c  |  42 +-
  include/uapi/linux/virtio_config.h |   9 +
  include/uapi/linux/virtio_ring.h   |  32 ++
  7 files changed, 928 insertions(+), 261 deletions(-)

--
2.7.4





Re: [RFC v4 3/5] virtio_ring: add packed ring support

2018-05-20 Thread Jason Wang



On 2018年05月19日 10:29, Tiwei Bie wrote:

I don't hope so.


I agreed driver should track the DMA addrs or some
other necessary things from the very beginning. And
I also repeated the spec to emphasize that it does
make sense. And I'd like to do that.

What I was saying is that, to support OOO, we may
need to manage these context (which saves DMA addrs
etc) via a list which is similar to the desc list
maintained via `next` in split ring instead of an
array whose elements always can be indexed directly.

My point is that this context is a must (not only for OOO).

Yeah, and I have the exactly same point after you
pointed that I shouldn't get the addrs from descs.
I do think it makes sense. I'll do it in the next
version. I don't have any doubt about it. All my
questions are about the OOO, instead of whether we
should save context or not. It just seems that you
thought I don't want to do it, and were trying to
convince me that I should do it.


Right, but looks like I was wrong :)




The desc ring in split ring is an array, but its
free entries are managed as list via next. I was
just wondering, do we want to manage such a list
because of OOO. It's just a very simple question
that I want to hear your opinion... (It doesn't
means anything, e.g. It doesn't mean I don't want
to support OOO. It's just a simple question...)

So the answer is yes. But I admit I don't have a better idea other than what
you propose here (something like the split ring, which is a little bit sad).
Maybe Michael has.

Yeah, that's why I asked this question. It will
make the packed ring a bit similar to split ring
at least in the driver part. So I want to draw
your attention on this to make sure that we're
on the same page.


Yes. I think we are.

Thanks


Best regards,
Tiwei Bie
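
A purely hypothetical sketch of the per-descriptor context being discussed
in this thread (none of these names are from the actual patches): the
driver records what it needs at map time so that completion never has to
read the descriptor ring back, and free entries are chained through 'next'
much like the split ring's free list, which is what makes out-of-order
completion possible:

    struct packed_desc_ctx {
            void *data;        /* caller's token, returned on completion  */
            dma_addr_t addr;   /* DMA address saved when mapping          */
            u32 len;           /* mapped length, needed for dma unmap     */
            u16 num;           /* number of descriptors in the chain      */
            u16 next;          /* free-list link enabling OOO completion  */
    };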





Re: [RFC v4 3/5] virtio_ring: add packed ring support

2018-05-18 Thread Jason Wang



On 2018年05月18日 22:33, Tiwei Bie wrote:

On Fri, May 18, 2018 at 09:17:05PM +0800, Jason Wang wrote:

On 2018年05月18日 19:29, Tiwei Bie wrote:

On Thu, May 17, 2018 at 08:01:52PM +0800, Jason Wang wrote:

On 2018年05月16日 22:33, Tiwei Bie wrote:

On Wed, May 16, 2018 at 10:05:44PM +0800, Jason Wang wrote:

On 2018年05月16日 21:45, Tiwei Bie wrote:

On Wed, May 16, 2018 at 08:51:43PM +0800, Jason Wang wrote:

On 2018年05月16日 20:39, Tiwei Bie wrote:

On Wed, May 16, 2018 at 07:50:16PM +0800, Jason Wang wrote:

On 2018年05月16日 16:37, Tiwei Bie wrote:

[...]

+static void detach_buf_packed(struct vring_virtqueue *vq, unsigned int head,
+ unsigned int id, void **ctx)
+{
+   struct vring_packed_desc *desc;
+   unsigned int i, j;
+
+   /* Clear data ptr. */
+   vq->desc_state[id].data = NULL;
+
+   i = head;
+
+   for (j = 0; j < vq->desc_state[id].num; j++) {
+   desc = &vq->vring_packed.desc[i];
+   vring_unmap_one_packed(vq, desc);

As mentioned in previous discussion, this probably won't work for the case
of out of order completion since it depends on the information in the
descriptor ring. We probably need to extend ctx to record such information.

Above code doesn't depend on the information in the descriptor
ring. The vq->desc_state[] is the extended ctx.

Best regards,
Tiwei Bie

Yes, but desc is a pointer to descriptor ring I think so
vring_unmap_one_packed() still depends on the content of descriptor ring?


I got your point now. I think it makes sense to reserve
the bits of the addr field. Driver shouldn't try to get
addrs from the descriptors when cleanup the descriptors
no matter whether we support out-of-order or not.

Maybe I was wrong, but I remember spec mentioned something like this.

You're right. Spec mentioned this. I was just repeating
the spec to emphasize that it does make sense. :)


But combining it with the out-of-order support, it will
mean that the driver still needs to maintain a desc/ctx
list that is very similar to the desc ring in the split
ring. I'm not quite sure whether it's something we want.
If it is true, I'll do it. So do you think we also want
to maintain such a desc/ctx list for packed ring?

To make it work for OOO backends I think we need something like this
(hardware NIC drivers usually have something like this).

Which hardware NIC drivers have this?

It's quite common I think, e.g. drivers track dma addrs and page frags
somewhere, e.g. the ring->rx_info in the mlx4 driver.

It seems that I had a misunderstanding on your
previous comments. I know it's quite common for
drivers to track e.g. DMA addrs somewhere (and
I think one reason behind this is that they want
to reuse the bits of addr field).

Yes, we may want this for virtio-net as well in the future.


   But tracking
addrs somewhere doesn't mean supporting OOO.
I thought you were saying it's quite common for
hardware NIC drivers to support OOO (i.e. NICs
will return the descriptors OOO):

I'm not familiar with mlx4, maybe I'm wrong.
I just had a quick glance. And I found below
comments in mlx4_en_process_rx_cq():

```
/* We assume a 1:1 mapping between CQEs and Rx descriptors, so Rx
   * descriptor offset can be deduced from the CQE index instead of
   * reading 'cqe->index' */
index = cq->mcq.cons_index & ring->size_mask;
cqe = mlx4_en_get_cqe(cq->buf, index, priv->cqe_size) + factor;
```

It seems that although they have a completion
queue, they are still using the ring in order.

I guess so (at least from the above bits). Git grep -i "out of order" in
drivers/net gives some hints. Looks like there are only a few devices that do this.


I guess maybe storage device may want OOO.

Right, some iSCSI did.

But tracking them elsewhere is not only for OOO.

Spec said:

for element address

"
In a used descriptor, Element Address is unused.
"

for Next flag:

"
For example, if descriptors are used in the same order in which they are
made available, this will result in
the used descriptor overwriting the first available descriptor in the list,
the used descriptor for the next list
overwriting the first available descriptor in the next list, etc.
"

for in order completion:

"
This will result in the used descriptor overwriting the first available
descriptor in the batch, the used descriptor
for the next batch overwriting the first available descriptor in the next
batch, etc.
"

So:

- It's in alignment with the spec
- the device may (or should) overwrite the descriptor, which also makes the
address field useless.

You didn't get my point...


I don't hope so.


I agreed driver should track the DMA addrs or some
other necessary things from the very beginning. And
I also repeated the spec to emphasize that it does
make sense. And I'd like to do that.

What I was saying is that, to support OOO, we may
need to manage these context (which saves DMA addrs
etc) via a list which is similar to the desc list
