Re: Question about IRQs during the .remove() of virtio-vsock driver

2019-05-21 Thread Jason Wang



On 2019/5/21 下午9:56, Michael S. Tsirkin wrote:

On Tue, May 21, 2019 at 03:49:20PM +0200, Stefano Garzarella wrote:

On Tue, May 21, 2019 at 06:05:31AM -0400, Michael S. Tsirkin wrote:

On Tue, May 21, 2019 at 11:44:07AM +0200, Stefano Garzarella wrote:

Hi Michael, Jason,
as suggested by Stefan, I'm checking if we have some races in the
virtio-vsock driver. We found some races in .probe() and .remove()
with the upper layer (socket) and I'll fix them.

Now my attention is on the bottom layer (virtio device) and my question is:
during the .remove() of the virtio-vsock driver (virtio_vsock_remove), could it
happen that an IRQ arrives and one of our callbacks (e.g. virtio_vsock_rx_done()) is
executed, queueing new work?

I tried to follow the code in both cases (device unplugged or module removed)
and maybe it can't happen because we remove the device from the bus's knowledge,
but I'm not sure and your advice would be very helpful.

Thanks in advance,
Stefano


Great question! This should be better documented: patches welcome!

Once this is clear to me, I'll be happy to document it.


Here's my understanding:


A typical removal flow works like this:

- prevent linux from sending new kick requests to device
   and flush such outstanding requests if any
   (device can still send notifications to linux)

- call
   vi->vdev->config->reset(vi->vdev);
   this will flush all device writes and interrupts.
   device will not use any more buffers.
   previously outstanding callbacks might still be active.

- Then call
   vdev->config->del_vqs(vdev);
   to flush outstanding callbacks if any.

Thanks for sharing this useful information.

So, IIUC, between step 1 (e.g. in virtio-vsock we flush all work queues) and
step 2, new IRQs could arrive and new work would be queued in the virtio-vsock
driver.

To handle this case, I'm thinking of adding a new variable
'work_enabled' to struct virtio_vsock, setting it to false at the start
of .remove(), then calling synchronize_rcu() before flushing all work
queues, and using a helper function virtio_transport_queue_work() to queue
new work, where the check of work_enabled and the queue_work() are inside an
RCU read-side critical section.

Here is some pseudo code to better explain the idea:

virtio_vsock_remove() {
	vsock->work_enabled = false;

	/* Wait for other CPUs to finish queueing works */
	synchronize_rcu();

	flush_works();

	vdev->config->reset(vdev);

	...

	vdev->config->del_vqs(vdev);
}

virtio_vsock_queue_work(vsock, work) {
	rcu_read_lock();

	if (!vsock->work_enabled)
		goto out;

	queue_work(virtio_vsock_workqueue, work);

out:
	rcu_read_unlock();
}


Do you think this can work?
Please tell me if there is a better way to handle this case.

Thanks,
Stefano


Instead of RCU tricks I would just have rx_run and tx_run flags and check them
within the queued work - presumably under the tx or rx lock.

Then queueing an extra work item becomes harmless,
and you flush it after del_vqs, which flushes everything for you.
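
A minimal sketch of that idea, assuming rx_run/tx_run flags in struct
virtio_vsock guarded by the existing rx_lock/tx_lock mutexes (names are
illustrative, not the final code):

static void virtio_transport_rx_work(struct work_struct *work)
{
	struct virtio_vsock *vsock =
		container_of(work, struct virtio_vsock, rx_work);

	mutex_lock(&vsock->rx_lock);
	if (!vsock->rx_run)
		goto out;

	/* ... drain the rx virtqueue as before ... */

out:
	mutex_unlock(&vsock->rx_lock);
}

static void virtio_vsock_remove(struct virtio_device *vdev)
{
	struct virtio_vsock *vsock = vdev->priv;

	/* stop the work handlers from touching the vqs */
	mutex_lock(&vsock->rx_lock);
	vsock->rx_run = false;		/* likewise for tx_run/event_run */
	mutex_unlock(&vsock->rx_lock);

	vdev->config->reset(vdev);
	vdev->config->del_vqs(vdev);	/* flushes outstanding callbacks */

	/* any work queued in the meantime is now harmless, flush it last */
	flush_work(&vsock->rx_work);
}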




It looks to me that we need to guarantee that no work is queued or scheduled
before del_vqs. Otherwise it may lead to a use-after-free? (E.g. the net
driver disables NAPI before del_vqs.)


Thanks



Re: [PATCH bpf-next 3/3] veth: Support bulk XDP_TX

2019-05-23 Thread Jason Wang



On 2019/5/23 下午9:51, Toshiaki Makita wrote:

On 19/05/23 (木) 22:29:27, Jesper Dangaard Brouer wrote:

On Thu, 23 May 2019 20:35:50 +0900
Toshiaki Makita  wrote:


On 2019/05/23 20:25, Toke Høiland-Jørgensen wrote:

Toshiaki Makita  writes:

This improves XDP_TX performance by about 8%.

Here are single core XDP_TX test results. CPU consumptions are taken
from "perf report --no-child".

- Before:

   7.26 Mpps

   _raw_spin_lock  7.83%
   veth_xdp_xmit  12.23%

- After:

   7.84 Mpps

   _raw_spin_lock  1.17%
   veth_xdp_xmit   6.45%

Signed-off-by: Toshiaki Makita 
---
  drivers/net/veth.c | 26 +-
  1 file changed, 25 insertions(+), 1 deletion(-)

diff --git a/drivers/net/veth.c b/drivers/net/veth.c
index 52110e5..4edc75f 100644
--- a/drivers/net/veth.c
+++ b/drivers/net/veth.c
@@ -442,6 +442,23 @@ static int veth_xdp_xmit(struct net_device 
*dev, int n,

  return ret;
  }
  +static void veth_xdp_flush_bq(struct net_device *dev)
+{
+    struct xdp_tx_bulk_queue *bq = this_cpu_ptr(&xdp_tx_bq);
+    int sent, i, err = 0;
+
+    sent = veth_xdp_xmit(dev, bq->count, bq->q, 0);


Wait, veth_xdp_xmit() is just putting frames on a pointer ring. So
you're introducing an additional per-cpu bulk queue, only to avoid lock
contention around the existing pointer ring. But the pointer ring is
per-rq, so if you have lock contention, this means you must have
multiple CPUs servicing the same rq, no?


Yes, it's possible. Not recommended though.



I think the general per-cpu TX bulk queue is overkill.  There is a loop
over packets in veth_xdp_rcv(struct veth_rq *rq, budget, *status), and
the caller veth_poll() will call veth_xdp_flush(rq->dev).

Why can't you store this "temp" bulk array in struct veth_rq ?


Of course I can. But I thought tun has the same problem and we can 
decrease memory footprint by sharing the same storage between devices.



For TUN and for its fast path where vhost passes a bulk of XDP frames 
(through msg_control) to us, we probably just need a temporary bulk 
array in tun_xdp_one() instead of a global one. I can post a patch, or 
maybe you can if you're interested in this.


Thanks


Or if other devices want to reduce queues so that we can use XDP on 
many-cpu servers and introduce locks, we can use this storage for that 
case as well.


Still, do you prefer a veth-specific solution?



You could even alloc/create it on the stack of veth_poll() and send it
along via a pointer to veth_xdp_rcv().
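
For illustration, a rough sketch of the per-rq (or on-stack) bulk queue variant
being suggested here; the struct, field and size names are hypothetical:

#define VETH_XDP_TX_BULK_SIZE	16	/* illustrative size */

struct veth_xdp_tx_bq {
	struct xdp_frame *q[VETH_XDP_TX_BULK_SIZE];
	unsigned int count;
};

/* could live in struct veth_rq, or on the stack of veth_poll() and be
 * passed down to veth_xdp_rcv() */
static void veth_xdp_flush_bq(struct net_device *dev,
			      struct veth_xdp_tx_bq *bq)
{
	int sent;

	if (!bq->count)
		return;

	sent = veth_xdp_xmit(dev, bq->count, bq->q, 0);
	if (sent < 0)
		sent = 0;	/* freeing of unsent frames omitted in this sketch */
	bq->count = 0;
}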



Toshiaki Makita


Re: [PATCH bpf-next 3/3] veth: Support bulk XDP_TX

2019-05-23 Thread Jason Wang



On 2019/5/24 上午11:28, Toshiaki Makita wrote:

On 2019/05/24 12:13, Jason Wang wrote:

On 2019/5/23 下午9:51, Toshiaki Makita wrote:

On 19/05/23 (木) 22:29:27, Jesper Dangaard Brouer wrote:

On Thu, 23 May 2019 20:35:50 +0900
Toshiaki Makita  wrote:


On 2019/05/23 20:25, Toke Høiland-Jørgensen wrote:

Toshiaki Makita  writes:

This improves XDP_TX performance by about 8%.

Here are single core XDP_TX test results. CPU consumptions are taken
from "perf report --no-child".

- Before:

    7.26 Mpps

    _raw_spin_lock  7.83%
    veth_xdp_xmit  12.23%

- After:

    7.84 Mpps

    _raw_spin_lock  1.17%
    veth_xdp_xmit   6.45%

Signed-off-by: Toshiaki Makita 
---
   drivers/net/veth.c | 26 +-
   1 file changed, 25 insertions(+), 1 deletion(-)

diff --git a/drivers/net/veth.c b/drivers/net/veth.c
index 52110e5..4edc75f 100644
--- a/drivers/net/veth.c
+++ b/drivers/net/veth.c
@@ -442,6 +442,23 @@ static int veth_xdp_xmit(struct net_device
*dev, int n,
   return ret;
   }
   +static void veth_xdp_flush_bq(struct net_device *dev)
+{
+    struct xdp_tx_bulk_queue *bq = this_cpu_ptr(&xdp_tx_bq);
+    int sent, i, err = 0;
+
+    sent = veth_xdp_xmit(dev, bq->count, bq->q, 0);

Wait, veth_xdp_xmit() is just putting frames on a pointer ring. So
you're introducing an additional per-cpu bulk queue, only to avoid lock
contention around the existing pointer ring. But the pointer ring is
per-rq, so if you have lock contention, this means you must have
multiple CPUs servicing the same rq, no?

Yes, it's possible. Not recommended though.


I think the general per-cpu TX bulk queue is overkill.  There is a loop
over packets in veth_xdp_rcv(struct veth_rq *rq, budget, *status), and
the caller veth_poll() will call veth_xdp_flush(rq->dev).

Why can't you store this "temp" bulk array in struct veth_rq ?

Of course I can. But I thought tun has the same problem and we can
decrease memory footprint by sharing the same storage between devices.


For TUN and for its fast path where vhost passes a bulk of XDP frames
(through msg_control) to us, we probably just need a temporary bulk
array in tun_xdp_one() instead of a global one. I can post patch or
maybe you if you're interested in this.

Of course you/I can. What I'm concerned about is that it could be a waste of
cache lines when softirq runs the veth napi handler and then the tun napi handler.



Well, technically the bulk queue passed to TUN could be reused. I admit 
it may save cache lines in the ideal case, but I wonder how much we could gain 
on a real workload. (Note TUN doesn't use a napi handler to do XDP; it has a 
NAPI mode but it is mainly used for hardening and XDP is not 
implemented there - maybe we should fix this.)


Thanks



Thanks



Or if other devices want to reduce queues so that we can use XDP on
many-cpu servers and introduce locks, we can use this storage for that
case as well.

Still do you prefer veth-specific solution?


You could even alloc/create it on the stack of veth_poll() and send it
along via a pointer to veth_xdp_rcv).


Toshiaki Makita




Re: [PATCH v3 2/2] net: core: support XDP generic on stacked devices.

2019-05-23 Thread Jason Wang



On 2019/5/24 上午3:19, Saeed Mahameed wrote:

On Thu, 2019-05-23 at 10:54 -0700, Stephen Hemminger wrote:

When a device is stacked like (team, bonding, failsafe or netvsc) the
XDP generic program for the parent device was not called.

Move the call to XDP generic inside __netif_receive_skb_core where
it can be done multiple times for stacked case.

Suggested-by: Jiri Pirko 
Fixes: d445516966dc ("net: xdp: support xdp generic on virtual
devices")
Signed-off-by: Stephen Hemminger 
---
v1 - call xdp_generic in netvsc handler
v2 - do xdp_generic in generic rx handler processing
v3 - move xdp_generic call inside the another pass loop

  net/core/dev.c | 56 ++

  1 file changed, 11 insertions(+), 45 deletions(-)

diff --git a/net/core/dev.c b/net/core/dev.c
index b6b8505cfb3e..696776e14d00 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -4502,23 +4502,6 @@ static int netif_rx_internal(struct sk_buff
*skb)
  
  	trace_netif_rx(skb);
  
-	if (static_branch_unlikely(&generic_xdp_needed_key)) {
-		int ret;
-
-		preempt_disable();
-		rcu_read_lock();
-		ret = do_xdp_generic(rcu_dereference(skb->dev->xdp_prog), skb);
-		rcu_read_unlock();
-		preempt_enable();
-
-		/* Consider XDP consuming the packet a success from
-		 * the netdev point of view we do not want to count
-		 * this as an error.
-		 */
-		if (ret != XDP_PASS)
-			return NET_RX_SUCCESS;
-	}
-

Adding Jesper,

There is a small behavioral change due to this patch,



At least not for the current code. We don't support skb in cpumap (though 
the fix should not be hard); in xdp_do_generic_redirect_map() we have:


	} else {
		/* TODO: Handle BPF_MAP_TYPE_CPUMAP */
		err = -EBADRQC;
		goto err;
	}



the XDP program after this patch will run on the RPS CPU, if
configured, which could cause some behavioral changes in
xdp_redirect_cpu: bpf_redirect_map(cpu_map).

Maybe this is acceptable, but it should be documented, as the current
assumption dictates: XDP program runs on the core where the XDP
frame/SKB was first seen.



At least for TUN, this is not true. XDP frames are built by vhost_net 
and passed to TUN. There's no guarantee that the vhost_net kthread won't 
move to another core.


Thanks





Re: [PATCH v3 2/2] net: core: support XDP generic on stacked devices.

2019-05-24 Thread Jason Wang



On 2019/5/24 下午6:07, Jesper Dangaard Brouer wrote:

On Fri, 24 May 2019 12:17:27 +0800
Jason Wang  wrote:


Maybe this is acceptable, but it should be documented, as the current
assumption dictates: XDP program runs on the core where the XDP
frame/SKB was first seen.


At least for TUN, this is not true. XDP frames are built by vhost_net
and passed to TUN. There's no guarantee that the vhost_net kthread won't
move to another core.

This sounds a little scary, as we depend on per-CPU variables (e.g.
bpf_redirect_info). Can the vhost_net kthread move between CPUs
within/during the NAPI poll?



The RX path of TUN does not usually go through NAPI. What we do is:

1) The vhost kthread prepares an array of XDP frames and passes them to TUN 
through sendmsg().

2) TUN disables bh and runs XDP for each frame, then re-enables bh.

So the kthread can move to another CPU before 2), but we guarantee that the 
per-CPU dependencies of XDP hold during 2).
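
Roughly, the shape of that path looks like the sketch below (simplified, not
the actual tun.c code; vhost_net hands TUN an array of XDP buffers via
msg_control):

static int tun_xmit_xdp_batch_sketch(struct tun_struct *tun,
				     struct tun_file *tfile,
				     struct xdp_buff **xdp, int n)
{
	int i;

	local_bh_disable();
	rcu_read_lock();

	for (i = 0; i < n; i++) {
		/* per-buffer XDP processing, cf. tun_xdp_one() */
	}
	xdp_do_flush_map();	/* flush pending XDP_REDIRECTs */

	rcu_read_unlock();
	local_bh_enable();

	return n;
}

With bh disabled for the whole loop, the per-CPU XDP state (e.g.
bpf_redirect_info) stays consistent even if the vhost kthread is migrated
before or after the call.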


TUN indeed has a NAPI mode which is mainly used for hardening, and XDP 
was not implemented on that path (this could be fixed in the future).


Thanks



Re: [PATCH v3 2/2] net: core: support XDP generic on stacked devices.

2019-05-24 Thread Jason Wang



On 2019/5/24 下午5:33, Jesper Dangaard Brouer wrote:

On Thu, 23 May 2019 13:15:44 -0700
Stephen Hemminger  wrote:


On Thu, 23 May 2019 19:19:40 +
Saeed Mahameed  wrote:


On Thu, 2019-05-23 at 10:54 -0700, Stephen Hemminger wrote:

When a device is stacked like (team, bonding, failsafe or netvsc) the
XDP generic program for the parent device was not called.

Move the call to XDP generic inside __netif_receive_skb_core where
it can be done multiple times for stacked case.

Suggested-by: Jiri Pirko 
Fixes: d445516966dc ("net: xdp: support xdp generic on virtual
devices")
Signed-off-by: Stephen Hemminger 
---
v1 - call xdp_generic in netvsc handler
v2 - do xdp_generic in generic rx handler processing
v3 - move xdp_generic call inside the another pass loop

  net/core/dev.c | 56 ++

  1 file changed, 11 insertions(+), 45 deletions(-)

diff --git a/net/core/dev.c b/net/core/dev.c
index b6b8505cfb3e..696776e14d00 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -4502,23 +4502,6 @@ static int netif_rx_internal(struct sk_buff
*skb)
  
  	trace_netif_rx(skb);
  
-	if (static_branch_unlikely(&generic_xdp_needed_key)) {
-		int ret;
-
-		preempt_disable();
-		rcu_read_lock();
-		ret = do_xdp_generic(rcu_dereference(skb->dev->xdp_prog), skb);
-		rcu_read_unlock();
-		preempt_enable();
-
-		/* Consider XDP consuming the packet a success from
-		 * the netdev point of view we do not want to count
-		 * this as an error.
-		 */
-		if (ret != XDP_PASS)
-			return NET_RX_SUCCESS;
-	}
-

Adding Jesper,

There is a small behavioral change due to this patch,
the XDP program after this patch will run on the RPS CPU, if
configured, which could cause some behavioral changes in
xdp_redirect_cpu: bpf_redirect_map(cpu_map).

Maybe this is acceptable, but it should be documented, as the current
assumption dictates: XDP program runs on the core where the XDP
frame/SKB was first seen.

This does break some assumptions that I worry about. I've not
optimized generic XDP much, as this is supposed to be slow-path, but as
you can see in my evaluation [1] generic XDP does have performance
potential (XDP drop: native=12Mpps and generic=8.4Mpps), but the
XDP-generic performance dies as soon as we e.g. do XDP_TX
(native=10Mpps and generic=4Mpps). The reason is the lack of bulking.

We could implement the same kind of RX-bulking tricks as we do for
XDP_REDIRECT, where bpf_redirect_map store frames in the map and send
them once NAPI-poll exit via a xdp_do_flush_map().  These tricks
depend on per-CPU data (bpf_redirect_info), thus I cannot see how this
could work if XDP-generic now happens after RPS on a remote CPU.



RPS uses backlog NAPI so per-CPU is probably not an issue.

Thanks




Notice that TX bulking at the XDP-generic level is actually rather
simple, as the netstack TX path code supports xmit_more via creating a list
of SKBs... Last time I hacked it up, I saw a 20%-30% speedup... anyone
motivated to do this?


Or maybe XDP should just force off RPS (like it does for GRO).

I guess we could force off RPS. But I do see one valid use-case of
combining CPUMAP redirect with RFS (Receive Flow Steering), which is part
of RPS. Yes, I know we/I *also* have to implement generic-XDP-cpumap
support. But native-XDP CPUMAP redirect, from 1-2 RX CPUs into
N remote CPUs via CPUMAP, and then letting RFS send SKBs to where the
application runs, does make sense to me. (I do have an "assignment" to
implement this in eBPF here [2].)


[1] 
https://github.com/netoptimizer/prototype-kernel/blob/master/kernel/Documentation/blogposts/xdp25_eval_generic_xdp_tx.rst

[2] 
https://github.com/xdp-project/xdp-project/blob/master/areas/cpumap.org#cpumap-implement-dynamic-load-balancer-that-is-ooo-safe


Re: [RFC] virtio-net: share receive_*() and add_recvbuf_*() with virtio-vsock

2019-07-15 Thread Jason Wang




   struct sk_buff *virtskb_receive_small(struct virtskb *vs, ...);
   struct sk_buff *virtskb_receive_big(struct virtskb *vs, ...);
   struct sk_buff *virtskb_receive_mergeable(struct virtskb *vs, ...);

   int virtskb_add_recvbuf_small(struct virtskb *vs, ...);
   int virtskb_add_recvbuf_big(struct virtskb *vs, ...);
   int virtskb_add_recvbuf_mergeable(struct virtskb *vs, ...);

For the Guest->Host path it should be easier, so maybe I can add a
"virtskb_send(struct virtskb *vs, struct sk_buff *skb)" with a part of the code
of xmit_skb().

I may be missing something, but I don't see anything that prevents us from
using xmit_skb() directly.


Yes, but my initial idea was to make it more parametric and not related to the
virtio_net_hdr, so the 'hdr_len' could be a parameter and the
'num_buffers' should be handled by the caller.


Let me know if you have better names in mind or if I should put these
functions in another place.

I would like to leave the control part completely separate, so, for example,
the two drivers will negotiate the features independently and they will call
the right virtskb_receive_*() function based on the negotiation.

If it's only the issue of negotiation, we can simply change
virtnet_probe() to deal with different devices.



I already started to work on it, but before to do more steps and send an RFC
patch, I would like to hear your opinion.
Do you think that makes sense?
Do you see any issue or a better solution?

I still think we need to seek a way of adding some code to virtio-net.c
directly if there's no huge difference in the processing of TX/RX. That would
save us a lot of time.

After reading the buffers from the virtqueue I think the process
is slightly different, because virtio-net interfaces with the network
stack, while virtio-vsock interfaces with the vsock core (socket).
So virtio-vsock implements the following:
- a flow control mechanism to avoid losing packets, informing the peer
about the amount of memory available in the receive queue using some
fields in the virtio_vsock_hdr
- de-multiplexing: parsing the virtio_vsock_hdr and choosing the right
socket depending on the port
- socket state handling


I think it's just a branch: for ethernet, go to the networking stack;
otherwise, go to the vsock core?
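
Something like the following rough sketch, where the vsock-side flag and entry
point are hypothetical names:

static void virtskb_deliver(struct virtnet_info *vi, struct receive_queue *rq,
			    struct sk_buff *skb)
{
	if (vi->is_vsock)				/* hypothetical flag */
		virtio_vsock_core_rcv(skb);		/* hypothetical vsock entry point */
	else
		napi_gro_receive(&rq->napi, skb);	/* normal networking stack */
}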


Yes, that should work.

So I should refactor the functions that can also be called from the vsock
core, in order to remove the "struct net_device *dev" parameter,
maybe creating some wrappers for the network stack.

Otherwise I should create a fake net_device for vsock_core.

What do you suggest?



I'm not quite sure I get the question. Can you just use the one that is 
created by virtio-net?



Thanks


[PATCH 6/6] vhost: don't do synchronize_rcu() in vhost_uninit_vq_maps()

2019-07-23 Thread Jason Wang
There's no need for RCU synchronization in vhost_uninit_vq_maps()
since we've already serialized with readers (memory accessors). This
also avoids a possible userspace DoS through ioctl() due to the
high latency that synchronize_rcu() can cause.

Reported-by: Michael S. Tsirkin 
Fixes: 7f466032dc9e ("vhost: access vq metadata through kernel virtual address")
Signed-off-by: Jason Wang 
---
 drivers/vhost/vhost.c | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
index 5b8821d00fe4..a17df1f4069a 100644
--- a/drivers/vhost/vhost.c
+++ b/drivers/vhost/vhost.c
@@ -334,7 +334,9 @@ static void vhost_uninit_vq_maps(struct vhost_virtqueue *vq)
}
spin_unlock(&vq->mmu_lock);
 
-   synchronize_rcu();
+   /* No need for synchronize_rcu() or kfree_rcu() since we are
+* serialized with memory accessors (e.g vq mutex held).
+*/
 
for (i = 0; i < VHOST_NUM_ADDRS; i++)
if (map[i])
-- 
2.18.1



Re: [PATCH v1] tun: mark small packets as owned by the tap sock

2019-07-23 Thread Jason Wang



On 2019/7/23 下午9:01, Alexis Bauvin wrote:

Small packets going out of a tap device go through an optimized code
path that uses build_skb() rather than sock_alloc_send_pskb(). The
latter calls skb_set_owner_w(), but the small packet code path does not.

The net effect is that small packets are not owned by the userland
application's socket (e.g. QEMU), while large packets are.
This can be seen with a TCP session, where packets are not owned when
the window size is small enough (around PAGE_SIZE), while they are once
the window grows (note that this requires the host to support virtio
tso for the guest to offload segmentation).
All this leads to inconsistent behaviour in the kernel, especially on
netfilter modules that uses sk->socket (e.g. xt_owner).

Signed-off-by: Alexis Bauvin 
Fixes: 66ccbc9c87c2 ("tap: use build_skb() for small packet")
---
  drivers/net/tun.c | 71 ---
  1 file changed, 37 insertions(+), 34 deletions(-)

diff --git a/drivers/net/tun.c b/drivers/net/tun.c
index 3d443597bd04..ac56b6a29eb2 100644
--- a/drivers/net/tun.c
+++ b/drivers/net/tun.c
@@ -1656,6 +1656,7 @@ static struct sk_buff *tun_build_skb(struct tun_struct 
*tun,
  {
	struct page_frag *alloc_frag = &current->task_frag;
struct bpf_prog *xdp_prog;
+   struct sk_buff *skb;
int buflen = SKB_DATA_ALIGN(sizeof(struct skb_shared_info));
char *buf;
size_t copied;
@@ -1686,44 +1687,46 @@ static struct sk_buff *tun_build_skb(struct tun_struct 
*tun,
 */
if (hdr->gso_type || !xdp_prog) {
*skb_xdp = 1;
-   return __tun_build_skb(alloc_frag, buf, buflen, len, pad);
-   }
-
-   *skb_xdp = 0;
+   } else {
+   *skb_xdp = 0;
  
-	local_bh_disable();

-   rcu_read_lock();
-   xdp_prog = rcu_dereference(tun->xdp_prog);
-   if (xdp_prog) {
-   struct xdp_buff xdp;
-   u32 act;
-
-   xdp.data_hard_start = buf;
-   xdp.data = buf + pad;
-   xdp_set_data_meta_invalid(&xdp);
-   xdp.data_end = xdp.data + len;
-   xdp.rxq = &tfile->xdp_rxq;
-
-   act = bpf_prog_run_xdp(xdp_prog, &xdp);
-   if (act == XDP_REDIRECT || act == XDP_TX) {
-   get_page(alloc_frag->page);
-   alloc_frag->offset += buflen;
+   local_bh_disable();
+   rcu_read_lock();
+   xdp_prog = rcu_dereference(tun->xdp_prog);
+   if (xdp_prog) {
+   struct xdp_buff xdp;
+   u32 act;
+
+   xdp.data_hard_start = buf;
+   xdp.data = buf + pad;
+   xdp_set_data_meta_invalid(&xdp);
+   xdp.data_end = xdp.data + len;
+   xdp.rxq = &tfile->xdp_rxq;
+
+   act = bpf_prog_run_xdp(xdp_prog, &xdp);
+   if (act == XDP_REDIRECT || act == XDP_TX) {
+   get_page(alloc_frag->page);
+   alloc_frag->offset += buflen;
+   }
+   err = tun_xdp_act(tun, xdp_prog, &xdp, act);
+   if (err < 0)
+   goto err_xdp;
+   if (err == XDP_REDIRECT)
+   xdp_do_flush_map();
+   if (err != XDP_PASS)
+   goto out;
+
+   pad = xdp.data - xdp.data_hard_start;
+   len = xdp.data_end - xdp.data;
}
-   err = tun_xdp_act(tun, xdp_prog, &xdp, act);
-   if (err < 0)
-   goto err_xdp;
-   if (err == XDP_REDIRECT)
-   xdp_do_flush_map();
-   if (err != XDP_PASS)
-   goto out;
-
-   pad = xdp.data - xdp.data_hard_start;
-   len = xdp.data_end - xdp.data;
+   rcu_read_unlock();
+   local_bh_enable();
}
-   rcu_read_unlock();
-   local_bh_enable();
  
-	return __tun_build_skb(alloc_frag, buf, buflen, len, pad);

+   skb = __tun_build_skb(alloc_frag, buf, buflen, len, pad);
+   if (skb)
+   skb_set_owner_w(skb, tfile->socket.sk);
+   return skb;
  
  err_xdp:

put_page(alloc_frag->page);



To reduce the changeset, can you move the skb_set_owner_w() call into 
__tun_build_skb()?


Thanks




Re: [PATCH v2] tun: mark small packets as owned by the tap sock

2019-07-23 Thread Jason Wang



On 2019/7/23 下午10:23, Alexis Bauvin wrote:

- v1 -> v2: Move skb_set_owner_w to __tun_build_skb to reduce patch size

Small packets going out of a tap device go through an optimized code
path that uses build_skb() rather than sock_alloc_send_pskb(). The
latter calls skb_set_owner_w(), but the small packet code path does not.

The net effect is that small packets are not owned by the userland
application's socket (e.g. QEMU), while large packets are.
This can be seen with a TCP session, where packets are not owned when
the window size is small enough (around PAGE_SIZE), while they are once
the window grows (note that this requires the host to support virtio
tso for the guest to offload segmentation).
All this leads to inconsistent behaviour in the kernel, especially on
netfilter modules that uses sk->socket (e.g. xt_owner).

Signed-off-by: Alexis Bauvin 
Fixes: 66ccbc9c87c2 ("tap: use build_skb() for small packet")



Acked-by: Jason Wang 



---
  drivers/net/tun.c | 9 ++---
  1 file changed, 6 insertions(+), 3 deletions(-)

diff --git a/drivers/net/tun.c b/drivers/net/tun.c
index 3d443597bd04..db16d7a13e00 100644
--- a/drivers/net/tun.c
+++ b/drivers/net/tun.c
@@ -1599,7 +1599,8 @@ static bool tun_can_build_skb(struct tun_struct *tun, 
struct tun_file *tfile,
return true;
  }
  
-static struct sk_buff *__tun_build_skb(struct page_frag *alloc_frag, char *buf,

+static struct sk_buff *__tun_build_skb(struct tun_file *tfile,
+  struct page_frag *alloc_frag, char *buf,
   int buflen, int len, int pad)
  {
struct sk_buff *skb = build_skb(buf, buflen);
@@ -1609,6 +1610,7 @@ static struct sk_buff *__tun_build_skb(struct page_frag 
*alloc_frag, char *buf,
  
  	skb_reserve(skb, pad);

skb_put(skb, len);
+   skb_set_owner_w(skb, tfile->socket.sk);
  
  	get_page(alloc_frag->page);

alloc_frag->offset += buflen;
@@ -1686,7 +1688,8 @@ static struct sk_buff *tun_build_skb(struct tun_struct 
*tun,
 */
if (hdr->gso_type || !xdp_prog) {
*skb_xdp = 1;
-   return __tun_build_skb(alloc_frag, buf, buflen, len, pad);
+   return __tun_build_skb(tfile, alloc_frag, buf, buflen, len,
+  pad);
}
  
  	*skb_xdp = 0;

@@ -1723,7 +1726,7 @@ static struct sk_buff *tun_build_skb(struct tun_struct 
*tun,
rcu_read_unlock();
local_bh_enable();
  
-	return __tun_build_skb(alloc_frag, buf, buflen, len, pad);

+   return __tun_build_skb(tfile, alloc_frag, buf, buflen, len, pad);
  
  err_xdp:

put_page(alloc_frag->page);


Re: tun: mark small packets as owned by the tap sock

2019-08-13 Thread Jason Wang



On 2019/8/13 上午6:19, Dave Jones wrote:

On Wed, Aug 07, 2019 at 12:30:07AM +, Linux Kernel wrote:
  > Commit: 4b663366246be1d1d4b1b8b01245b2e88ad9e706
  > Parent: 16b2084a8afa1432d14ba72b7c97d7908e178178
  > Web:
https://git.kernel.org/torvalds/c/4b663366246be1d1d4b1b8b01245b2e88ad9e706
  > Author: Alexis Bauvin 
  > AuthorDate: Tue Jul 23 16:23:01 2019 +0200
  >
  > tun: mark small packets as owned by the tap sock
  >
  > - v1 -> v2: Move skb_set_owner_w to __tun_build_skb to reduce patch size
  >
  > Small packets going out of a tap device go through an optimized code
  > path that uses build_skb() rather than sock_alloc_send_pskb(). The
  > latter calls skb_set_owner_w(), but the small packet code path does not.
  >
  > The net effect is that small packets are not owned by the userland
  > application's socket (e.g. QEMU), while large packets are.
  > This can be seen with a TCP session, where packets are not owned when
  > the window size is small enough (around PAGE_SIZE), while they are once
  > the window grows (note that this requires the host to support virtio
  > tso for the guest to offload segmentation).
  > All this leads to inconsistent behaviour in the kernel, especially on
  > netfilter modules that uses sk->socket (e.g. xt_owner).
  >
  > Fixes: 66ccbc9c87c2 ("tap: use build_skb() for small packet")
  > Signed-off-by: Alexis Bauvin 
  > Acked-by: Jason Wang 

This commit breaks IPv6 routing when deployed on a Linode.
It seems to work briefly after boot, and then silently all packets get
dropped. (Presumably, it's dropping RA or ND packets)

With this reverted, everything works as it did in rc3.

Dave



Hi:

Two questions:

- Are you using XDP for TUN?

- Does it work before 66ccbc9c87c2? If yes, could you show us the result 
of net_dropmonitor?


Thanks






Re: [PATCH v3] vhost_vdpa: fix the problem in vhost_vdpa_set_config_call

2021-01-12 Thread Jason Wang



On 2021/1/12 下午1:36, Cindy Lu wrote:

In vhost_vdpa_set_config_call(), cb.private should be the vhost_vdpa.
This cb.private will eventually be used in vhost_vdpa_config_cb() as a
vhost_vdpa. Fix this issue.

Fixes: 776f395004d82 ("vhost_vdpa: Support config interrupt in vdpa")
Acked-by: Jason Wang 
Signed-off-by: Cindy Lu 
---



Hi Cindy:

I think at least you forgot to cc stable.

Thanks



  drivers/vhost/vdpa.c | 2 +-
  1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/vhost/vdpa.c b/drivers/vhost/vdpa.c
index ef688c8c0e0e..3fbb9c1f49da 100644
--- a/drivers/vhost/vdpa.c
+++ b/drivers/vhost/vdpa.c
@@ -319,7 +319,7 @@ static long vhost_vdpa_set_config_call(struct vhost_vdpa 
*v, u32 __user *argp)
struct eventfd_ctx *ctx;
  
  	cb.callback = vhost_vdpa_config_cb;

-   cb.private = v->vdpa;
+   cb.private = v;
if (copy_from_user(&fd, argp, sizeof(fd)))
return  -EFAULT;
  




Re: [RFC PATCH 0/7] Support for virtio-net hash reporting

2021-01-12 Thread Jason Wang



On 2021/1/13 上午7:47, Willem de Bruijn wrote:

On Tue, Jan 12, 2021 at 3:29 PM Yuri Benditovich
 wrote:

On Tue, Jan 12, 2021 at 9:49 PM Yuri Benditovich
 wrote:

On Tue, Jan 12, 2021 at 9:41 PM Yuri Benditovich
 wrote:

Existing TUN module is able to use provided "steering eBPF" to
calculate per-packet hash and derive the destination queue to
place the packet to. The eBPF uses mapped configuration data
containing a key for hash calculation and indirection table
with array of queues' indices.

This series of patches adds support for virtio-net hash reporting
feature as defined in virtio specification. It extends the TUN module
and the "steering eBPF" as follows:

Extended steering eBPF calculates the hash value and hash type, keeps
hash value in the skb->hash and returns index of destination virtqueue
and the type of the hash. TUN module keeps returned hash type in
(currently unused) field of the skb.
skb->__unused renamed to 'hash_report_type'.

When TUN module is called later to allocate and fill the virtio-net
header and push it to destination virtqueue it populates the hash
and the hash type into virtio-net header.

VHOST driver is made aware of respective virtio-net feature that
extends the virtio-net header to report the hash value and hash report
type.

Comment from Willem de Bruijn:

Skbuff fields are in short supply. I don't think we need to add one
just for this narrow path entirely internal to the tun device.


We understand that and try to minimize the impact by using an already
existing unused field of skb.

Not anymore. It was repurposed as a flags field very recently.

This use case is also very narrow in scope. And a very short path from
data producer to consumer. So I don't think it needs to claim scarce
bits in the skb.

tun_ebpf_select_queue stores the field, tun_put_user reads it and
converts it to the virtio_net_hdr in the descriptor.

tun_ebpf_select_queue is called from .ndo_select_queue.  Storing the
field in skb->cb is fragile, as in theory some code could overwrite
that between field between ndo_select_queue and
ndo_start_xmit/tun_net_xmit, from which point it is fully under tun
control again. But in practice, I don't believe anything does.

Alternatively an existing skb field that is used only on disjoint
datapaths, such as ingress-only, could be viable.



A question here. We have metadata support in XDP for cooperation between 
eBPF programs. Do we have something similar for the skb?


E.g. in the RSS case, we may want to pass some metadata between the 
eBPF program and the logic that generates the vnet header (either hard-coded 
logic in the kernel or another eBPF program). Is there any way to do that 
which avoids possible conflicts with qdiscs?






Instead, you could just run the flow_dissector in tun_put_user if the
feature is negotiated. Indeed, the flow dissector seems more apt to me
than BPF here. Note that the flow dissector internally can be
overridden by a BPF program if the admin so chooses.
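
For reference, a sketch of what running the flow dissector from tun_put_user()
could look like; the helper name here is made up, while
skb_flow_dissect_flow_keys() and flow_hash_from_keys() are the existing kernel
APIs:

/* Sketch: derive a flow hash for the skb via the flow dissector
 * (which a BPF flow dissector program can override system-wide). */
static u32 tun_vnet_hash_sketch(const struct sk_buff *skb)
{
	struct flow_keys keys;

	memset(&keys, 0, sizeof(keys));
	if (!skb_flow_dissect_flow_keys(skb, &keys, 0))
		return 0;

	return flow_hash_from_keys(&keys);
}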


When this set of patches is related to hash delivery in the virtio-net
packet in general,
it was prepared in context of RSS feature implementation as defined in
virtio spec [1]
In case of RSS it is not enough to run the flow_dissector in tun_put_user:
in tun_ebpf_select_queue the TUN calls eBPF to calculate the hash,
hash type and queue index
according to the (mapped) parameters (key, hash types, indirection
table) received from the guest.

TUNSETSTEERINGEBPF was added to support more diverse queue selection
than the default in case of multiqueue tun. Not sure what the exact
use cases are.

But RSS is exactly the purpose of the flow dissector. It is used for
that purpose in the software variant RPS. The flow dissector
implements a superset of the RSS spec, and certainly computes a
four-tuple for TCP/IPv6. In the case of RPS, it is skipped if the NIC
has already computed a 4-tuple hash.

What it does not give is a type indication, such as
VIRTIO_NET_HASH_TYPE_TCPv6. I don't understand how this would be used.
In datapaths where the NIC has already computed the four-tuple hash
and stored it in skb->hash --the common case for servers--, That type
field is the only reason to have to compute again.



The problem is there's no guarantee that the packet comes from the NIC; 
it could be a simple VM2VM or host2VM packet.


And even if the packet is coming from a NIC that calculates the hash, 
there's no guarantee that it's the hash the guest wants (the guest may use 
different RSS keys).


Thanks





Our intention is to keep the hash and hash type in the skb to populate them
into a virtio-net header later in tun_put_user.
Note that in this case the type of calculated hash is selected not
only from flow dissections
but also from limitations provided by the guest.

This is already implemented in qemu (for case of vhost=off), see [2]
(virtio_net_process_rss)
For case of vhost=on there are WIP for qemu to load eBPF and attach it to TUN.
Note that exact way of selecting rx virtqueue depends on the guest,

Re: [RFC PATCH 0/7] Support for virtio-net hash reporting

2021-01-13 Thread Jason Wang



On 2021/1/13 下午10:33, Willem de Bruijn wrote:

On Tue, Jan 12, 2021 at 11:11 PM Jason Wang  wrote:


On 2021/1/13 上午7:47, Willem de Bruijn wrote:

On Tue, Jan 12, 2021 at 3:29 PM Yuri Benditovich
 wrote:

On Tue, Jan 12, 2021 at 9:49 PM Yuri Benditovich
 wrote:

On Tue, Jan 12, 2021 at 9:41 PM Yuri Benditovich
 wrote:

Existing TUN module is able to use provided "steering eBPF" to
calculate per-packet hash and derive the destination queue to
place the packet to. The eBPF uses mapped configuration data
containing a key for hash calculation and indirection table
with array of queues' indices.

This series of patches adds support for virtio-net hash reporting
feature as defined in virtio specification. It extends the TUN module
and the "steering eBPF" as follows:

Extended steering eBPF calculates the hash value and hash type, keeps
hash value in the skb->hash and returns index of destination virtqueue
and the type of the hash. TUN module keeps returned hash type in
(currently unused) field of the skb.
skb->__unused renamed to 'hash_report_type'.

When TUN module is called later to allocate and fill the virtio-net
header and push it to destination virtqueue it populates the hash
and the hash type into virtio-net header.

VHOST driver is made aware of respective virtio-net feature that
extends the virtio-net header to report the hash value and hash report
type.

Comment from Willem de Bruijn:

Skbuff fields are in short supply. I don't think we need to add one
just for this narrow path entirely internal to the tun device.


We understand that and try to minimize the impact by using an already
existing unused field of skb.

Not anymore. It was repurposed as a flags field very recently.

This use case is also very narrow in scope. And a very short path from
data producer to consumer. So I don't think it needs to claim scarce
bits in the skb.

tun_ebpf_select_queue stores the field, tun_put_user reads it and
converts it to the virtio_net_hdr in the descriptor.

tun_ebpf_select_queue is called from .ndo_select_queue.  Storing the
field in skb->cb is fragile, as in theory some code could overwrite
that between field between ndo_select_queue and
ndo_start_xmit/tun_net_xmit, from which point it is fully under tun
control again. But in practice, I don't believe anything does.

Alternatively an existing skb field that is used only on disjoint
datapaths, such as ingress-only, could be viable.


A question here. We had metadata support in XDP for cooperation between
eBPF programs. Do we have something similar in the skb?

E.g in the RSS, if we want to pass some metadata information between
eBPF program and the logic that generates the vnet header (either hard
logic in the kernel or another eBPF program). Is there any way that can
avoid the possible conflicts of qdiscs?

Not that I am aware of. The closest thing is cb[].

It'll have to alias a field like that, one that is known to be unused for the given path.



Right, we need to make sure cb is not used by others. I'm not sure 
how hard that is to achieve, considering QEMU installs the eBPF program but 
doesn't deal with networking configuration.





One other approach that has been used within linear call stacks is out
of band. Like percpu variables softnet_data.xmit.more and
mirred_rec_level. But that is perhaps a bit overwrought for this use
case.



Yes, and if we go that way then eBPF turns out to be a burden since we 
need to invent helpers to access those auxiliary data structures. It 
would be better then to hard-code the RSS in the kernel.






Instead, you could just run the flow_dissector in tun_put_user if the
feature is negotiated. Indeed, the flow dissector seems more apt to me
than BPF here. Note that the flow dissector internally can be
overridden by a BPF program if the admin so chooses.


When this set of patches is related to hash delivery in the virtio-net
packet in general,
it was prepared in context of RSS feature implementation as defined in
virtio spec [1]
In case of RSS it is not enough to run the flow_dissector in tun_put_user:
in tun_ebpf_select_queue the TUN calls eBPF to calculate the hash,
hash type and queue index
according to the (mapped) parameters (key, hash types, indirection
table) received from the guest.

TUNSETSTEERINGEBPF was added to support more diverse queue selection
than the default in case of multiqueue tun. Not sure what the exact
use cases are.

But RSS is exactly the purpose of the flow dissector. It is used for
that purpose in the software variant RPS. The flow dissector
implements a superset of the RSS spec, and certainly computes a
four-tuple for TCP/IPv6. In the case of RPS, it is skipped if the NIC
has already computed a 4-tuple hash.

What it does not give is a type indication, such as
VIRTIO_NET_HASH_TYPE_TCPv6. I don't understand how this would be used.
In datapaths where the NIC has already computed the four-tuple hash
and stored it in s

Re: [PATCH linux-next v3 6/6] vdpa_sim_net: Add support for user supported devices

2021-01-13 Thread Jason Wang



On 2021/1/7 上午11:48, Parav Pandit wrote:



From: Michael S. Tsirkin 
Sent: Tuesday, January 5, 2021 6:53 PM

On Tue, Jan 05, 2021 at 12:30:15PM +, Parav Pandit wrote:



From: Michael S. Tsirkin 
Sent: Tuesday, January 5, 2021 5:45 PM

On Tue, Jan 05, 2021 at 12:02:33PM +, Parav Pandit wrote:



From: Michael S. Tsirkin 
Sent: Tuesday, January 5, 2021 5:19 PM

On Tue, Jan 05, 2021 at 12:32:03PM +0200, Parav Pandit wrote:

Enable user to create vdpasim net simulate devices.


$ vdpa dev add mgmtdev vdpasim_net name foo2

Show the newly created vdpa device by its name:
$ vdpa dev show foo2
foo2: type network mgmtdev vdpasim_net vendor_id 0 max_vqs 2
max_vq_size 256

$ vdpa dev show foo2 -jp
{
 "dev": {
 "foo2": {
 "type": "network",
 "mgmtdev": "vdpasim_net",
 "vendor_id": 0,
 "max_vqs": 2,
 "max_vq_size": 256
 }
 }
}


I'd like an example of how do device specific (e.g. net
specific) interfaces tie in to this.

Not sure I follow your question.
Do you mean how to set mac address or mtu of this vdpa device of
type

net?

If so, dev add command will be extended shortly in subsequent
series to

set this net specific attributes.

(I did mention in the next steps in cover letter).


+static int __init vdpasim_net_init(void) {
+   int ret;
+
+   if (macaddr) {
+   mac_pton(macaddr, macaddr_buf);
+   if (!is_valid_ether_addr(macaddr_buf))
+   return -EADDRNOTAVAIL;
+   } else {
+   eth_random_addr(macaddr_buf);
}

Hmm so all devices start out with the same MAC until changed?
And how is the change effected?

Post this patchset and post we have iproute2 vdpa in the tree,
will add the

mac address as the input attribute during "vdpa dev add" command.

So that each different vdpa device can have user specified
(different) mac

address.

For now maybe just avoid VIRTIO_NET_F_MAC then for new devices

then?

That would require book keeping existing net vdpa_sim devices created to

avoid setting VIRTIO_NET_F_MAC.

Such book keeping code will be short lived anyway.
Not sure if its worth it.
Until now only one device was created. So not sure two vdpa devices with

same mac address will be a real issue.

When we add mac address attribute in add command, at that point also

remove the module parameter macaddr.

Will that be mandatory? I'm not too happy with a UAPI we intend to break
straight away ...

No. Specifying mac address shouldn't be mandatory. UAPI wont' be broken.



If it's not mandatory, does it mean the vDPA parent needs to use its own 
logic to generate a valid MAC? I'm not sure this is what management 
(libvirt) wants.


Thanks








Re: [PATCH linux-next v3 6/6] vdpa_sim_net: Add support for user supported devices

2021-01-14 Thread Jason Wang



On 2021/1/14 下午3:58, Parav Pandit wrote:



From: Jason Wang 
Sent: Thursday, January 14, 2021 9:48 AM

On 2021/1/7 上午11:48, Parav Pandit wrote:

From: Michael S. Tsirkin 
Sent: Tuesday, January 5, 2021 6:53 PM

On Tue, Jan 05, 2021 at 12:30:15PM +, Parav Pandit wrote:

From: Michael S. Tsirkin 
Sent: Tuesday, January 5, 2021 5:45 PM

On Tue, Jan 05, 2021 at 12:02:33PM +, Parav Pandit wrote:

From: Michael S. Tsirkin 
Sent: Tuesday, January 5, 2021 5:19 PM

On Tue, Jan 05, 2021 at 12:32:03PM +0200, Parav Pandit wrote:

Enable user to create vdpasim net simulate devices.


$ vdpa dev add mgmtdev vdpasim_net name foo2

Show the newly created vdpa device by its name:
$ vdpa dev show foo2
foo2: type network mgmtdev vdpasim_net vendor_id 0 max_vqs 2
max_vq_size 256

$ vdpa dev show foo2 -jp
{
  "dev": {
  "foo2": {
  "type": "network",
  "mgmtdev": "vdpasim_net",
  "vendor_id": 0,
  "max_vqs": 2,
  "max_vq_size": 256
  }
  }
}

I'd like an example of how do device specific (e.g. net
specific) interfaces tie in to this.

Not sure I follow your question.
Do you mean how to set mac address or mtu of this vdpa device of
type

net?

If so, dev add command will be extended shortly in subsequent
series to

set this net specific attributes.

(I did mention in the next steps in cover letter).


+static int __init vdpasim_net_init(void) {
+   int ret;
+
+   if (macaddr) {
+   mac_pton(macaddr, macaddr_buf);
+   if (!is_valid_ether_addr(macaddr_buf))
+   return -EADDRNOTAVAIL;
+   } else {
+   eth_random_addr(macaddr_buf);
}

Hmm so all devices start out with the same MAC until changed?
And how is the change effected?

Post this patchset and post we have iproute2 vdpa in the tree,
will add the

mac address as the input attribute during "vdpa dev add" command.

So that each different vdpa device can have user specified
(different) mac

address.

For now maybe just avoid VIRTIO_NET_F_MAC then for new devices

then?

That would require book keeping existing net vdpa_sim devices
created to

avoid setting VIRTIO_NET_F_MAC.

Such book keeping code will be short lived anyway.
Not sure if its worth it.
Until now only one device was created. So not sure two vdpa devices
with

same mac address will be a real issue.

When we add mac address attribute in add command, at that point also

remove the module parameter macaddr.

Will that be mandatory? I'm not to happy with a UAPI we intend to
break straight away ...

No. Specifying mac address shouldn't be mandatory. UAPI wont' be

broken.


If it's not mandatory. Does it mean the vDPA parent need to use its own logic
to generate a validate mac? I'm not sure this is what management (libvirt
want).


There are a few use cases that I see with PFs, VFs and SFs supporting vdpa 
devices.

1. The user wants to use the VF only for vdpa purposes. Here the user got a VF 
which was pre-set-up by the sysadmin.
In this case whatever MAC is assigned to the VF can be used by its vdpa device.
Here, the user doesn't need to pass the MAC address at vdpa device creation 
time.
This is done because the same MAC has been set up in the ACL rules on the switch side.
Non-vdpa users of a VF typically use the VF this way for netdev and rdma 
functionality.
They might continue the same way for a vdpa application as well.
Here the VF MAC is set using either
(a) the devlink port function set hw_addr command or
(b) ip link set vf mac
So the vdpa tool didn't pass the MAC (it is optional).
Though VIRTIO_NET_F_MAC is still valid.

2. The user may want to create one or more vdpa devices out of the mgmt. device.
Here the user wants more/full control of all features, overriding what the 
sysadmin has set up as the MAC of the VF/SF.
In this case the user will specify the MAC via the mgmt tool.
(a) This is also used by those vdpa devices which don't have eswitch offloads.
(b) This will work with eswitch offloads as well, which do source learning.
(c) The user chose to use the vdpa device of a VF while the VF netdev and rdma 
device are used by the hypervisor for something else as well.
VIRTIO_NET_F_MAC remains valid in all of 2.{a,b,c}.

3. A vendor mgmt. device always expects its user to provide a MAC for its vdpa 
devices.
So when it is not provided, it can fail with an error message string in extack, or 
clear VIRTIO_NET_F_MAC and let it work using virtio spec section 5.1.5 point 5 to 
proceed.

As a common denominator of all the above cases, if QEMU or the user passes the MAC 
during creation, it will almost always work.
Advanced users, and QEMU with switchdev mode support that has done 1.a/1.b, will 
omit it.
I do not know how deep the integration of QEMU with switchdev mode support is.

With that mac, mtu as optional input fields provide the necessary flexibility 
for different stacks t

Re: [PATCH net-next v7] vhost_net: avoid tx queue stuck when sendmsg fails

2021-01-14 Thread Jason Wang



On 2021/1/15 下午12:46, wangyunjian wrote:

From: Yunjian Wang 

Currently the driver doesn't drop a packet which can't be sent by tun
(e.g. a bad packet). In this case, the driver will keep processing the
same packet, leading to the tx queue getting stuck.

To fix this issue:
1. in the case of a persistent failure (e.g. bad packet), the driver
can skip this descriptor by ignoring the error.
2. in the case of a transient failure (e.g. -ENOBUFS, -EAGAIN and -ENOMEM),
the driver schedules the worker to try again.

Signed-off-by: Yunjian Wang 



Acked-by: Jason Wang 



---
v7:
* code rebase
v6:
* update code styles and commit log
---
  drivers/vhost/net.c | 26 ++
  1 file changed, 14 insertions(+), 12 deletions(-)

diff --git a/drivers/vhost/net.c b/drivers/vhost/net.c
index 3b744031ec8f..df82b124170e 100644
--- a/drivers/vhost/net.c
+++ b/drivers/vhost/net.c
@@ -828,14 +828,15 @@ static void handle_tx_copy(struct vhost_net *net, struct 
socket *sock)
msg.msg_flags &= ~MSG_MORE;
}
  
-		/* TODO: Check specific error and bomb out unless ENOBUFS? */

err = sock->ops->sendmsg(sock, &msg, len);
if (unlikely(err < 0)) {
-   vhost_discard_vq_desc(vq, 1);
-   vhost_net_enable_vq(net, vq);
-   break;
-   }
-   if (err != len)
+   if (err == -EAGAIN || err == -ENOMEM || err == -ENOBUFS) {
+   vhost_discard_vq_desc(vq, 1);
+   vhost_net_enable_vq(net, vq);
+   break;
+   }
+   pr_debug("Fail to send packet: err %d", err);
+   } else if (unlikely(err != len))
pr_debug("Truncated TX packet: len %d != %zd\n",
 err, len);
  done:
@@ -924,7 +925,6 @@ static void handle_tx_zerocopy(struct vhost_net *net, 
struct socket *sock)
msg.msg_flags &= ~MSG_MORE;
}
  
-		/* TODO: Check specific error and bomb out unless ENOBUFS? */

err = sock->ops->sendmsg(sock, &msg, len);
if (unlikely(err < 0)) {
if (zcopy_used) {
@@ -933,11 +933,13 @@ static void handle_tx_zerocopy(struct vhost_net *net, 
struct socket *sock)
nvq->upend_idx = ((unsigned)nvq->upend_idx - 1)
% UIO_MAXIOV;
}
-   vhost_discard_vq_desc(vq, 1);
-   vhost_net_enable_vq(net, vq);
-   break;
-   }
-   if (err != len)
+   if (err == -EAGAIN || err == -ENOMEM || err == -ENOBUFS) {
+   vhost_discard_vq_desc(vq, 1);
+   vhost_net_enable_vq(net, vq);
+   break;
+   }
+   pr_debug("Fail to send packet: err %d", err);
+   } else if (unlikely(err != len))
pr_debug("Truncated TX packet: "
 " len %d != %zd\n", err, len);
if (!zcopy_used)




Re: [RFC PATCH 0/7] Support for virtio-net hash reporting

2021-01-17 Thread Jason Wang



On 2021/1/17 下午3:57, Yuri Benditovich wrote:

On Thu, Jan 14, 2021 at 5:39 AM Jason Wang  wrote:


On 2021/1/13 下午10:33, Willem de Bruijn wrote:

On Tue, Jan 12, 2021 at 11:11 PM Jason Wang  wrote:

On 2021/1/13 上午7:47, Willem de Bruijn wrote:

On Tue, Jan 12, 2021 at 3:29 PM Yuri Benditovich
 wrote:

On Tue, Jan 12, 2021 at 9:49 PM Yuri Benditovich
 wrote:

On Tue, Jan 12, 2021 at 9:41 PM Yuri Benditovich
 wrote:

Existing TUN module is able to use provided "steering eBPF" to
calculate per-packet hash and derive the destination queue to
place the packet to. The eBPF uses mapped configuration data
containing a key for hash calculation and indirection table
with array of queues' indices.

This series of patches adds support for virtio-net hash reporting
feature as defined in virtio specification. It extends the TUN module
and the "steering eBPF" as follows:

Extended steering eBPF calculates the hash value and hash type, keeps
hash value in the skb->hash and returns index of destination virtqueue
and the type of the hash. TUN module keeps returned hash type in
(currently unused) field of the skb.
skb->__unused renamed to 'hash_report_type'.

When TUN module is called later to allocate and fill the virtio-net
header and push it to destination virtqueue it populates the hash
and the hash type into virtio-net header.

VHOST driver is made aware of respective virtio-net feature that
extends the virtio-net header to report the hash value and hash report
type.

Comment from Willem de Bruijn:

Skbuff fields are in short supply. I don't think we need to add one
just for this narrow path entirely internal to the tun device.


We understand that and try to minimize the impact by using an already
existing unused field of skb.

Not anymore. It was repurposed as a flags field very recently.

This use case is also very narrow in scope. And a very short path from
data producer to consumer. So I don't think it needs to claim scarce
bits in the skb.

tun_ebpf_select_queue stores the field, tun_put_user reads it and
converts it to the virtio_net_hdr in the descriptor.

tun_ebpf_select_queue is called from .ndo_select_queue.  Storing the
field in skb->cb is fragile, as in theory some code could overwrite
that between field between ndo_select_queue and
ndo_start_xmit/tun_net_xmit, from which point it is fully under tun
control again. But in practice, I don't believe anything does.

Alternatively an existing skb field that is used only on disjoint
datapaths, such as ingress-only, could be viable.

A question here. We had metadata support in XDP for cooperation between
eBPF programs. Do we have something similar in the skb?

E.g in the RSS, if we want to pass some metadata information between
eBPF program and the logic that generates the vnet header (either hard
logic in the kernel or another eBPF program). Is there any way that can
avoid the possible conflicts of qdiscs?

Not that I am aware of. The closest thing is cb[].

It'll have to aliase a field like that, that is known unused for the given path.


Right, we need to make sure cb is not used by other ones. I'm not sure
how hard to achieve that consider Qemu installs the eBPF program but it
doesn't deal with networking configurations.



One other approach that has been used within linear call stacks is out
of band. Like percpu variables softnet_data.xmit.more and
mirred_rec_level. But that is perhaps a bit overwrought for this use
case.


Yes, and if we go that way then eBPF turns out to be a burden since we
need to invent helpers to access those auxiliary data structure. It
would be better then to hard-coded the RSS in the kernel.



Instead, you could just run the flow_dissector in tun_put_user if the
feature is negotiated. Indeed, the flow dissector seems more apt to me
than BPF here. Note that the flow dissector internally can be
overridden by a BPF program if the admin so chooses.


When this set of patches is related to hash delivery in the virtio-net
packet in general,
it was prepared in context of RSS feature implementation as defined in
virtio spec [1]
In case of RSS it is not enough to run the flow_dissector in tun_put_user:
in tun_ebpf_select_queue the TUN calls eBPF to calculate the hash,
hash type and queue index
according to the (mapped) parameters (key, hash types, indirection
table) received from the guest.

TUNSETSTEERINGEBPF was added to support more diverse queue selection
than the default in case of multiqueue tun. Not sure what the exact
use cases are.

But RSS is exactly the purpose of the flow dissector. It is used for
that purpose in the software variant RPS. The flow dissector
implements a superset of the RSS spec, and certainly computes a
four-tuple for TCP/IPv6. In the case of RPS, it is skipped if the NIC
has already computed a 4-tuple hash.

What it does not give is a type indication, such as
VIRTIO_NET_HASH_TYPE_TCPv6. I don't understand how this would be used.

Re: [PATCH net-next v2 0/7] virtio-net support xdp socket zero copy xmit

2021-01-17 Thread Jason Wang



On 2021/1/16 上午10:59, Xuan Zhuo wrote:

XDP socket is an excellent kernel-bypass network transmission framework. The
zero copy feature of xsk (XDP socket) needs to be supported by the driver. The
performance of zero copy is very good. mlx5 and Intel ixgbe already support this
feature; this patch set allows virtio-net to support xsk's zerocopy xmit
feature.

xsk's zerocopy rx also requires major changes to virtio-net, and I hope to
submit it after this patch set is accepted.

Compared with other drivers, virtio-net does not directly obtain the DMA
address, so I first obtain the xsk page and then pass the page to virtio.

When recycling the sent packets, we have to distinguish between skb and xdp.
Now we also have to distinguish between skb, xdp and xsk, so the second patch
solves this problem first.

The last four patches are used to support xsk zerocopy in virtio-net:

  1. support xsk enable/disable
  2. realize the function of xsk packet sending
  3. implement xsk wakeup callback
  4. set xsk completed when packet sent done


 Performance Testing 

For performance testing, a UDP packet tool implemented on top of the xsk
interface is compared against sockperf (kernel UDP):

xsk zero copy in virtio-net:
CPU    PPS     MSGSIZE
28.7%  3833857 64
38.5%  3689491 512
38.9%  2787096 1456



Some questions on the results:

1) What's the setup on the vhost?
2) What's the setup of the mitigation in both host and guest?
3) Any analysis of the possible bottleneck via perf or other tools?

Thanks




xsk without zero copy in virtio-net:
CPU    PPS     MSGSIZE
100%   1916747 64
100%   1775988 512
100%   1440054 1456

sockperf:
CPU    PPS     MSGSIZE
100%   713274  64
100%   701024  512
100%   695832  1456

Xuan Zhuo (7):
   xsk: support get page for drv
   virtio-net, xsk: distinguish XDP_TX and XSK XMIT ctx
   xsk, virtio-net: prepare for support xsk zerocopy xmit
   virtio-net, xsk: support xsk enable/disable
   virtio-net, xsk: realize the function of xsk packet sending
   virtio-net, xsk: implement xsk wakeup callback
   virtio-net, xsk: set xsk completed when packet sent done

  drivers/net/virtio_net.c| 559 +++-
  include/linux/netdevice.h   |   1 +
  include/net/xdp_sock_drv.h  |  10 +
  include/net/xsk_buff_pool.h |   1 +
  net/xdp/xsk_buff_pool.c |  10 +-
  5 files changed, 523 insertions(+), 58 deletions(-)

--
1.8.3.1





Re: [PATCH net-next v2 2/7] virtio-net, xsk: distinguish XDP_TX and XSK XMIT ctx

2021-01-17 Thread Jason Wang



On 2021/1/16 上午10:59, Xuan Zhuo wrote:

When xsk is supported, a new kind of ptr has to be recovered during the
process of freeing the old ptr. In order to distinguish between a ctx sent
by XDP_TX and a ctx sent by xsk, a struct is added here to distinguish
between these two situations. virtnet_xdp_type.type is used to
distinguish the different ctx types, and virtnet_xdp_type.offset is used to
record the offset between the "true ctx" and the virtnet_xdp_type.

The newly added virtnet_xsk_hdr will be used for xsk.

Signed-off-by: Xuan Zhuo 



Any reason that you can't simply encode the type in the pointer itself 
as we used to do?


#define VIRTIO_XSK_FLAG    BIT(1)

?
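
For reference, a rough sketch of what such tagging could look like with two
low bits (a sketch only, not the driver's final code; the helper names are
made up):

/* Assumption: buffers queued to the virtqueue are at least 4-byte aligned,
 * so the two low bits of the pointer are free for tagging. */
#define VIRTIO_XDP_FLAG BIT(0)
#define VIRTIO_XSK_FLAG BIT(1)
#define VIRTIO_PTR_MASK (VIRTIO_XDP_FLAG | VIRTIO_XSK_FLAG)

static bool is_xdp_frame(void *ptr)
{
        return (unsigned long)ptr & VIRTIO_XDP_FLAG;
}

static bool is_xsk_ptr(void *ptr)
{
        return (unsigned long)ptr & VIRTIO_XSK_FLAG;
}

static void *ptr_strip(void *ptr)
{
        return (void *)((unsigned long)ptr & ~VIRTIO_PTR_MASK);
}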



---
  drivers/net/virtio_net.c | 75 ++--
  1 file changed, 60 insertions(+), 15 deletions(-)

diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
index ba8e637..e707c31 100644
--- a/drivers/net/virtio_net.c
+++ b/drivers/net/virtio_net.c
@@ -94,6 +94,22 @@ struct virtnet_rq_stats {
u64 kicks;
  };
  
+enum {

+   XDP_TYPE_XSK,
+   XDP_TYPE_TX,
+};
+
+struct virtnet_xdp_type {
+   int offset:24;
+   unsigned type:8;
+};
+
+struct virtnet_xsk_hdr {
+   struct virtnet_xdp_type type;
+   struct virtio_net_hdr_mrg_rxbuf hdr;
+   u32 len;
+};
+
  #define VIRTNET_SQ_STAT(m)offsetof(struct virtnet_sq_stats, m)
  #define VIRTNET_RQ_STAT(m)offsetof(struct virtnet_rq_stats, m)
  
@@ -251,14 +267,19 @@ static bool is_xdp_frame(void *ptr)

return (unsigned long)ptr & VIRTIO_XDP_FLAG;
  }
  
-static void *xdp_to_ptr(struct xdp_frame *ptr)

+static void *xdp_to_ptr(struct virtnet_xdp_type *ptr)
  {
return (void *)((unsigned long)ptr | VIRTIO_XDP_FLAG);
  }
  
-static struct xdp_frame *ptr_to_xdp(void *ptr)

+static struct virtnet_xdp_type *ptr_to_xtype(void *ptr)
+{
+   return (struct virtnet_xdp_type *)((unsigned long)ptr & 
~VIRTIO_XDP_FLAG);
+}
+
+static void *xtype_get_ptr(struct virtnet_xdp_type *xdptype)
  {
-   return (struct xdp_frame *)((unsigned long)ptr & ~VIRTIO_XDP_FLAG);
+   return (char *)xdptype + xdptype->offset;
  }
  
  /* Converting between virtqueue no. and kernel tx/rx queue no.

@@ -459,11 +480,16 @@ static int __virtnet_xdp_xmit_one(struct virtnet_info *vi,
   struct xdp_frame *xdpf)
  {
struct virtio_net_hdr_mrg_rxbuf *hdr;
+   struct virtnet_xdp_type *xdptype;
int err;
  
-	if (unlikely(xdpf->headroom < vi->hdr_len))

+   if (unlikely(xdpf->headroom < vi->hdr_len + sizeof(*xdptype)))
return -EOVERFLOW;
  
+	xdptype = (struct virtnet_xdp_type *)(xdpf + 1);

+   xdptype->offset = (char *)xdpf - (char *)xdptype;
+   xdptype->type = XDP_TYPE_TX;
+
/* Make room for virtqueue hdr (also change xdpf->headroom?) */
xdpf->data -= vi->hdr_len;
/* Zero header and leave csum up to XDP layers */
@@ -473,7 +499,7 @@ static int __virtnet_xdp_xmit_one(struct virtnet_info *vi,
  
  	sg_init_one(sq->sg, xdpf->data, xdpf->len);
  
-	err = virtqueue_add_outbuf(sq->vq, sq->sg, 1, xdp_to_ptr(xdpf),

+   err = virtqueue_add_outbuf(sq->vq, sq->sg, 1, xdp_to_ptr(xdptype),
   GFP_ATOMIC);
if (unlikely(err))
return -ENOSPC; /* Caller handle free/refcnt */
@@ -523,8 +549,11 @@ static int virtnet_xdp_xmit(struct net_device *dev,
/* Free up any pending old buffers before queueing new ones. */
while ((ptr = virtqueue_get_buf(sq->vq, &len)) != NULL) {
if (likely(is_xdp_frame(ptr))) {
-   struct xdp_frame *frame = ptr_to_xdp(ptr);
+   struct virtnet_xdp_type *xtype;
+   struct xdp_frame *frame;
  
+			xtype = ptr_to_xtype(ptr);

+   frame = xtype_get_ptr(xtype);
bytes += frame->len;
xdp_return_frame(frame);
} else {
@@ -1373,24 +1402,34 @@ static int virtnet_receive(struct receive_queue *rq, 
int budget,
  
  static void free_old_xmit_skbs(struct send_queue *sq, bool in_napi)

  {
-   unsigned int len;
unsigned int packets = 0;
unsigned int bytes = 0;
-   void *ptr;
+   unsigned int len;
+   struct virtnet_xdp_type *xtype;
+   struct xdp_frame*frame;
+   struct virtnet_xsk_hdr  *xskhdr;
+   struct sk_buff  *skb;
+   void*ptr;
  
  	while ((ptr = virtqueue_get_buf(sq->vq, &len)) != NULL) {

if (likely(!is_xdp_frame(ptr))) {
-   struct sk_buff *skb = ptr;
+   skb = ptr;
  
  			pr_debug("Sent skb %p\n", skb);
  
  			bytes += skb->len;

napi_consume_skb(skb, in_napi);
} else {
-   struct xdp_frame *frame = ptr_to_xdp(ptr);
+   xtype = ptr_to_xtype(ptr);
  
-			bytes += frame->len;

-   xdp_return_frame(frame);

Re: [PATCH net-next v2 5/7] virtio-net, xsk: realize the function of xsk packet sending

2021-01-18 Thread Jason Wang



On 2021/1/16 10:59 AM, Xuan Zhuo wrote:

virtnet_xsk_run will be called in the tx interrupt handling function
virtnet_poll_tx.

The sending process gets desc from the xsk tx queue, and assembles it to
send the data.

Compared with other drivers, a special place is that the page of the
data in xsk is used here instead of the dma address. Because the virtio
interface does not use the dma address.

Signed-off-by: Xuan Zhuo 
---
  drivers/net/virtio_net.c | 200 ++-
  1 file changed, 197 insertions(+), 3 deletions(-)

diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
index a62d456..42aa9ad 100644
--- a/drivers/net/virtio_net.c
+++ b/drivers/net/virtio_net.c
@@ -119,6 +119,8 @@ struct virtnet_xsk_hdr {
u32 len;
  };
  
+#define VIRTNET_STATE_XSK_WAKEUP 1

+
  #define VIRTNET_SQ_STAT(m)offsetof(struct virtnet_sq_stats, m)
  #define VIRTNET_RQ_STAT(m)offsetof(struct virtnet_rq_stats, m)
  
@@ -163,9 +165,12 @@ struct send_queue {

struct xsk_buff_pool   __rcu *pool;
struct virtnet_xsk_hdr __rcu *hdr;
  
+		unsigned long  state;

u64hdr_con;
u64hdr_pro;
u64hdr_n;
+   struct xdp_desclast_desc;
+   bool   wait_slot;
} xsk;
  };
  
@@ -284,6 +289,8 @@ static void __free_old_xmit_ptr(struct send_queue *sq, bool in_napi,

bool xsk_wakeup,
unsigned int *_packets, unsigned int *_bytes);
  static void free_old_xmit_skbs(struct send_queue *sq, bool in_napi);
+static int virtnet_xsk_run(struct send_queue *sq,
+  struct xsk_buff_pool *pool, int budget);
  
  static bool is_xdp_frame(void *ptr)

  {
@@ -1590,6 +1597,8 @@ static int virtnet_poll_tx(struct napi_struct *napi, int 
budget)
struct virtnet_info *vi = sq->vq->vdev->priv;
unsigned int index = vq2txq(sq->vq);
struct netdev_queue *txq;
+   struct xsk_buff_pool *pool;
+   int work = 0;
  
  	if (unlikely(is_xdp_raw_buffer_queue(vi, index))) {

/* We don't need to enable cb for XDP */
@@ -1599,15 +1608,26 @@ static int virtnet_poll_tx(struct napi_struct *napi, 
int budget)
  
  	txq = netdev_get_tx_queue(vi->dev, index);

__netif_tx_lock(txq, raw_smp_processor_id());
-   free_old_xmit_skbs(sq, true);
+
+   rcu_read_lock();
+   pool = rcu_dereference(sq->xsk.pool);
+   if (pool) {
+   work = virtnet_xsk_run(sq, pool, budget);
+   rcu_read_unlock();
+   } else {
+   rcu_read_unlock();
+   free_old_xmit_skbs(sq, true);
+   }
+
__netif_tx_unlock(txq);
  
-	virtqueue_napi_complete(napi, sq->vq, 0);

+   if (work < budget)
+   virtqueue_napi_complete(napi, sq->vq, 0);
  
  	if (sq->vq->num_free >= 2 + MAX_SKB_FRAGS)

netif_tx_wake_queue(txq);
  
-	return 0;

+   return work;
  }
  
  static int xmit_skb(struct send_queue *sq, struct sk_buff *skb)

@@ -2647,6 +2667,180 @@ static int virtnet_xdp(struct net_device *dev, struct 
netdev_bpf *xdp)
}
  }
  
+static int virtnet_xsk_xmit(struct send_queue *sq, struct xsk_buff_pool *pool,

+   struct xdp_desc *desc)
+{
+   struct virtnet_info *vi = sq->vq->vdev->priv;
+   void *data, *ptr;
+   struct page *page;
+   struct virtnet_xsk_hdr *xskhdr;
+   u32 idx, offset, n, i, copy, copied;
+   u64 addr;
+   int err, m;
+
+   addr = desc->addr;
+
+   data = xsk_buff_raw_get_data(pool, addr);
+   offset = offset_in_page(data);
+
+   /* one for hdr, one for the first page */
+   n = 2;
+   m = desc->len - (PAGE_SIZE - offset);
+   if (m > 0) {
+   n += m >> PAGE_SHIFT;
+   if (m & PAGE_MASK)
+   ++n;
+
+   n = min_t(u32, n, ARRAY_SIZE(sq->sg));
+   }
+
+   idx = sq->xsk.hdr_con % sq->xsk.hdr_n;



I don't understand the reason of the hdr array. It looks to me all of 
them are zero and read only from device.


Any reason for not reusing a single hdr for all xdp descriptors? Or 
maybe it's time to introduce VIRTIO_NET_F_NO_HDR.




+   xskhdr = &sq->xsk.hdr[idx];
+
+   /* xskhdr->hdr has been memset to zero, so not need to clear again */
+
+   sg_init_table(sq->sg, n);
+   sg_set_buf(sq->sg, &xskhdr->hdr, vi->hdr_len);
+
+   copied = 0;
+   for (i = 1; i < n; ++i) {
+   copy = min_t(int, desc->len - copied, PAGE_SIZE - offset);
+
+   page = xsk_buff_raw_get_page(pool, addr + copied);
+
+   sg_set_page(sq->sg + i, page, copy, offset);
+   copied += copy;
+   if (offset)
+   offset = 0;
+   }



It looks to me we need to terminate the sg:

**
 * virtqueue_add_outbuf - expose output buffers to other end
 * @vq: the struct virtqueue we're talking about.
 * @sg: scatterlist (must be well-formed and terminated!)
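
In other words, something along these lines against the code above (a rough
sketch; "used" stands for the number of sg entries actually filled):

        /* sketch: terminate the chain at the last entry actually filled so
         * virtqueue_add_outbuf() (whose sg must be well-formed and
         * terminated) stops walking at the right place */
        sg_init_table(sq->sg, ARRAY_SIZE(sq->sg));
        sg_set_buf(&sq->sg[0], &xskhdr->hdr, vi->hdr_len);
        /* ... fill sq->sg[1..used - 1] with the data pages ... */
        sg_mark_end(&sq->sg[used - 1]);
        err = virtqueue_add_outbuf(sq->vq, sq->sg, used, xskhdr, GFP_ATOMIC);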

Re: [PATCH net-next v2 6/7] virtio-net, xsk: implement xsk wakeup callback

2021-01-18 Thread Jason Wang



On 2021/1/16 10:59 AM, Xuan Zhuo wrote:

Since I did not find an interface to directly notify virtio to generate
a tx interrupt, I sent some data to trigger a new tx interrupt.

Another advantage of this is that the transmission delay will be
relatively small, and there is no need to wait for the tx interrupt to
start softirq.

Signed-off-by: Xuan Zhuo 
---
  drivers/net/virtio_net.c | 51 
  1 file changed, 51 insertions(+)

diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
index 42aa9ad..e552c2d 100644
--- a/drivers/net/virtio_net.c
+++ b/drivers/net/virtio_net.c
@@ -2841,6 +2841,56 @@ static int virtnet_xsk_run(struct send_queue *sq,
return ret;
  }
  
+static int virtnet_xsk_wakeup(struct net_device *dev, u32 qid, u32 flag)

+{
+   struct virtnet_info *vi = netdev_priv(dev);
+   struct send_queue *sq;
+   struct xsk_buff_pool *pool;
+   struct netdev_queue *txq;
+
+   if (!netif_running(dev))
+   return -ENETDOWN;
+
+   if (qid >= vi->curr_queue_pairs)
+   return -EINVAL;
+
+   sq = &vi->sq[qid];
+
+   rcu_read_lock();
+
+   pool = rcu_dereference(sq->xsk.pool);
+   if (!pool)
+   goto end;
+
+   if (test_and_set_bit(VIRTNET_STATE_XSK_WAKEUP, &sq->xsk.state))
+   goto end;
+
+   txq = netdev_get_tx_queue(dev, qid);
+
+   local_bh_disable();
+   __netif_tx_lock(txq, raw_smp_processor_id());



You can use __netif_tx_lock_bh().

Thanks
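
I.e. something like this (sketch):

        txq = netdev_get_tx_queue(dev, qid);

        /* __netif_tx_lock_bh() disables bottom halves itself, so the
         * explicit local_bh_disable()/local_bh_enable() pair goes away */
        __netif_tx_lock_bh(txq);
        virtnet_xsk_run(sq, pool, xsk_budget);
        __netif_tx_unlock_bh(txq);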



+
+   /* Send part of the package directly to reduce the delay in sending the
+* package, and this can actively trigger the tx interrupts.
+*
+* If the package is not processed, then continue processing in the
+* subsequent tx interrupt(virtnet_poll_tx).
+*
+* If no packet is sent out, the ring of the device is full. In this
+* case, we will still get a tx interrupt response. Then we will deal
+* with the subsequent packet sending work.
+*/
+
+   virtnet_xsk_run(sq, pool, xsk_budget);
+
+   __netif_tx_unlock(txq);
+   local_bh_enable();
+
+end:
+   rcu_read_unlock();
+   return 0;
+}
+
  static int virtnet_get_phys_port_name(struct net_device *dev, char *buf,
  size_t len)
  {
@@ -2895,6 +2945,7 @@ static int virtnet_set_features(struct net_device *dev,
.ndo_vlan_rx_kill_vid = virtnet_vlan_rx_kill_vid,
.ndo_bpf= virtnet_xdp,
.ndo_xdp_xmit   = virtnet_xdp_xmit,
+   .ndo_xsk_wakeup = virtnet_xsk_wakeup,
.ndo_features_check = passthru_features_check,
.ndo_get_phys_port_name = virtnet_get_phys_port_name,
.ndo_set_features   = virtnet_set_features,




Re: [PATCH linux-next v3 6/6] vdpa_sim_net: Add support for user supported devices

2021-01-19 Thread Jason Wang



On 2021/1/15 2:27 PM, Parav Pandit wrote:

With that mac, mtu as optional input fields provide the necessary flexibility

for different stacks to take appropriate shape as they desire.


Thanks for the clarification. I think we'd better document the above in the
patch that introduces the mac setting from management API.

Yes. Will do.
Thanks.



Adding Sean.

Regarding the mac address setting: do we plan to allow modifying the mac
address after creation? It looks like OpenStack wants this.


Sean may share more information on this.

Thanks



Re: [PATCH bpf-next v2 1/3] net: add priv_flags for allow tx skb without linear

2021-01-19 Thread Jason Wang



On 2021/1/19 5:45 PM, Xuan Zhuo wrote:

In some cases, we hope to construct skb directly based on the existing
memory without copying data. In this case, the page will be placed
directly in the skb, and the linear space of skb is empty. But
unfortunately, many network cards do not support this operation.
For example, the Mellanox Technologies MT27710 Family [ConnectX-4 Lx] will
get the following error message:

 mlx5_core :3b:00.1 eth1: Error cqe on cqn 0x817, ci 0x8, qn 0x1dbb, 
opcode 0xd, syndrome 0x1, vendor syndrome 0x68
 : 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
 0010: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
 0020: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
 0030: 00 00 00 00 60 10 68 01 0a 00 1d bb 00 0f 9f d2
 WQE DUMP: WQ size 1024 WQ cur size 0, WQE index 0xf, len: 64
 : 00 00 0f 0a 00 1d bb 03 00 00 00 08 00 00 00 00
 0010: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
 0020: 00 00 00 2b 00 08 00 00 00 00 00 05 9e e3 08 00
 0030: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
 mlx5_core :3b:00.1 eth1: ERR CQE on SQ: 0x1dbb

So a priv_flag is added here to indicate whether the network card
supports this feature.



I don't see any Mellanox engineers copied. I wonder if we need their
confirmation on whether it's a bug or a hardware limitation.


Thanks



Re: [PATCH net-next v2 2/7] virtio-net, xsk: distinguish XDP_TX and XSK XMIT ctx

2021-01-19 Thread Jason Wang



On 2021/1/18 7:54 PM, Xuan Zhuo wrote:

On Mon, 18 Jan 2021 14:45:16 +0800, Jason Wang  wrote:

On 2021/1/16 10:59 AM, Xuan Zhuo wrote:

If xsk is supported, a new kind of ptr will be recovered during the
process of freeing the old ptr. In order to distinguish between ctx sent
by XDP_TX and ctx sent by xsk, a struct is added here to distinguish
between these two situations. virtnet_xdp_type.type is used to
distinguish different ctx, and virtnet_xdp_type.offset is used to record
the offset between the "true ctx" and virtnet_xdp_type.

The newly added virtnet_xsk_hdr will be used for xsk.

Signed-off-by: Xuan Zhuo 


Any reason that you can't simply encode the type in the pointer itself
as we used to do?

#define VIRTIO_XSK_FLAG    BIT(1)

?

Since xdp socket does not use xdp_frame, we will encounter three data types when
recycling: skb, xdp_frame, xdp socket.



Just to make sure we are on the same page: currently, the pointer type
is encoded with 1 bit in the pointer. Can we simply use 2 bits to
distinguish skb, xdp and xsk?


Thanks








Re: [PATCH net-next v2 5/7] virtio-net, xsk: realize the function of xsk packet sending

2021-01-19 Thread Jason Wang



On 2021/1/18 8:03 PM, Xuan Zhuo wrote:

On Mon, 18 Jan 2021 17:10:24 +0800, Jason Wang  wrote:

On 2021/1/16 10:59 AM, Xuan Zhuo wrote:

virtnet_xsk_run will be called in the tx interrupt handling function
virtnet_poll_tx.

The sending process gets desc from the xsk tx queue, and assembles it to
send the data.

Compared with other drivers, a special place is that the page of the
data in xsk is used here instead of the dma address. Because the virtio
interface does not use the dma address.

Signed-off-by: Xuan Zhuo 
---
   drivers/net/virtio_net.c | 200 
++-
   1 file changed, 197 insertions(+), 3 deletions(-)

diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
index a62d456..42aa9ad 100644
--- a/drivers/net/virtio_net.c
+++ b/drivers/net/virtio_net.c
@@ -119,6 +119,8 @@ struct virtnet_xsk_hdr {
u32 len;
   };

+#define VIRTNET_STATE_XSK_WAKEUP 1
+
   #define VIRTNET_SQ_STAT(m)   offsetof(struct virtnet_sq_stats, m)
   #define VIRTNET_RQ_STAT(m)   offsetof(struct virtnet_rq_stats, m)

@@ -163,9 +165,12 @@ struct send_queue {
struct xsk_buff_pool   __rcu *pool;
struct virtnet_xsk_hdr __rcu *hdr;

+   unsigned long  state;
u64hdr_con;
u64hdr_pro;
u64hdr_n;
+   struct xdp_desclast_desc;
+   bool   wait_slot;
} xsk;
   };

@@ -284,6 +289,8 @@ static void __free_old_xmit_ptr(struct send_queue *sq, bool 
in_napi,
bool xsk_wakeup,
unsigned int *_packets, unsigned int *_bytes);
   static void free_old_xmit_skbs(struct send_queue *sq, bool in_napi);
+static int virtnet_xsk_run(struct send_queue *sq,
+  struct xsk_buff_pool *pool, int budget);

   static bool is_xdp_frame(void *ptr)
   {
@@ -1590,6 +1597,8 @@ static int virtnet_poll_tx(struct napi_struct *napi, int 
budget)
struct virtnet_info *vi = sq->vq->vdev->priv;
unsigned int index = vq2txq(sq->vq);
struct netdev_queue *txq;
+   struct xsk_buff_pool *pool;
+   int work = 0;

if (unlikely(is_xdp_raw_buffer_queue(vi, index))) {
/* We don't need to enable cb for XDP */
@@ -1599,15 +1608,26 @@ static int virtnet_poll_tx(struct napi_struct *napi, 
int budget)

txq = netdev_get_tx_queue(vi->dev, index);
__netif_tx_lock(txq, raw_smp_processor_id());
-   free_old_xmit_skbs(sq, true);
+
+   rcu_read_lock();
+   pool = rcu_dereference(sq->xsk.pool);
+   if (pool) {
+   work = virtnet_xsk_run(sq, pool, budget);
+   rcu_read_unlock();
+   } else {
+   rcu_read_unlock();
+   free_old_xmit_skbs(sq, true);
+   }
+
__netif_tx_unlock(txq);

-   virtqueue_napi_complete(napi, sq->vq, 0);
+   if (work < budget)
+   virtqueue_napi_complete(napi, sq->vq, 0);

if (sq->vq->num_free >= 2 + MAX_SKB_FRAGS)
netif_tx_wake_queue(txq);

-   return 0;
+   return work;
   }

   static int xmit_skb(struct send_queue *sq, struct sk_buff *skb)
@@ -2647,6 +2667,180 @@ static int virtnet_xdp(struct net_device *dev, struct 
netdev_bpf *xdp)
}
   }

+static int virtnet_xsk_xmit(struct send_queue *sq, struct xsk_buff_pool *pool,
+   struct xdp_desc *desc)
+{
+   struct virtnet_info *vi = sq->vq->vdev->priv;
+   void *data, *ptr;
+   struct page *page;
+   struct virtnet_xsk_hdr *xskhdr;
+   u32 idx, offset, n, i, copy, copied;
+   u64 addr;
+   int err, m;
+
+   addr = desc->addr;
+
+   data = xsk_buff_raw_get_data(pool, addr);
+   offset = offset_in_page(data);
+
+   /* one for hdr, one for the first page */
+   n = 2;
+   m = desc->len - (PAGE_SIZE - offset);
+   if (m > 0) {
+   n += m >> PAGE_SHIFT;
+   if (m & PAGE_MASK)
+   ++n;
+
+   n = min_t(u32, n, ARRAY_SIZE(sq->sg));
+   }
+
+   idx = sq->xsk.hdr_con % sq->xsk.hdr_n;


I don't understand the reason of the hdr array. It looks to me all of
them are zero and read only from device.

Any reason for not reusing a single hdr for all xdp descriptors? Or
maybe it's time to introduce VIRTIO_NET_F_NO_HDR.

Yes, you are right.
Before supporting features like checksum, it is indeed possible to achieve
this with only one hdr.



So let's drop the array logic for now since it will put unnecessary
stress on the cache.
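
Something like this should be enough (sketch; xsk_zero_hdr is a made-up name):

        /* the device only reads the all-zero virtio-net header for XSK tx,
         * so one static header can be shared by every descriptor */
        static const struct virtio_net_hdr_mrg_rxbuf xsk_zero_hdr;

        sg_init_table(sq->sg, n);
        sg_set_buf(sq->sg, &xsk_zero_hdr, vi->hdr_len);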






+   xskhdr = &sq->xsk.hdr[idx];
+
+   /* xskhdr->hdr has been memset to zero, so not need to clear again */
+
+   sg_init_table(sq->sg, n);
+   sg_set_buf(sq->sg, &xskhdr->hdr, vi->hdr_len);

Re: [RFC v3 03/11] vdpa: Remove the restriction that only supports virtio-net devices

2021-01-19 Thread Jason Wang



On 2021/1/19 12:59 PM, Xie Yongji wrote:

With VDUSE, we should be able to support all kinds of virtio devices.

Signed-off-by: Xie Yongji 
---
  drivers/vhost/vdpa.c | 29 +++--
  1 file changed, 3 insertions(+), 26 deletions(-)

diff --git a/drivers/vhost/vdpa.c b/drivers/vhost/vdpa.c
index 29ed4173f04e..448be7875b6d 100644
--- a/drivers/vhost/vdpa.c
+++ b/drivers/vhost/vdpa.c
@@ -22,6 +22,7 @@
  #include 
  #include 
  #include 
+#include 
  
  #include "vhost.h"
  
@@ -185,26 +186,6 @@ static long vhost_vdpa_set_status(struct vhost_vdpa *v, u8 __user *statusp)

return 0;
  }
  
-static int vhost_vdpa_config_validate(struct vhost_vdpa *v,

- struct vhost_vdpa_config *c)
-{
-   long size = 0;
-
-   switch (v->virtio_id) {
-   case VIRTIO_ID_NET:
-   size = sizeof(struct virtio_net_config);
-   break;
-   }
-
-   if (c->len == 0)
-   return -EINVAL;
-
-   if (c->len > size - c->off)
-   return -E2BIG;
-
-   return 0;
-}



I think we should use a separate patch for this.

Thanks



-
  static long vhost_vdpa_get_config(struct vhost_vdpa *v,
  struct vhost_vdpa_config __user *c)
  {
@@ -215,7 +196,7 @@ static long vhost_vdpa_get_config(struct vhost_vdpa *v,
  
  	if (copy_from_user(&config, c, size))

return -EFAULT;
-   if (vhost_vdpa_config_validate(v, &config))
+   if (config.len == 0)
return -EINVAL;
buf = kvzalloc(config.len, GFP_KERNEL);
if (!buf)
@@ -243,7 +224,7 @@ static long vhost_vdpa_set_config(struct vhost_vdpa *v,
  
  	if (copy_from_user(&config, c, size))

return -EFAULT;
-   if (vhost_vdpa_config_validate(v, &config))
+   if (config.len == 0)
return -EINVAL;
buf = kvzalloc(config.len, GFP_KERNEL);
if (!buf)
@@ -1025,10 +1006,6 @@ static int vhost_vdpa_probe(struct vdpa_device *vdpa)
int minor;
int r;
  
-	/* Currently, we only accept the network devices. */

-   if (ops->get_device_id(vdpa) != VIRTIO_ID_NET)
-   return -ENOTSUPP;
-
v = kzalloc(sizeof(*v), GFP_KERNEL | __GFP_RETRY_MAYFAIL);
if (!v)
return -ENOMEM;




Re: [RFC v3 04/11] vhost-vdpa: protect concurrent access to vhost device iotlb

2021-01-19 Thread Jason Wang



On 2021/1/19 12:59 PM, Xie Yongji wrote:

Introduce a mutex to protect vhost device iotlb from
concurrent access.

Fixes: 4c8cf318("vhost: introduce vDPA-based backend")
Signed-off-by: Xie Yongji 
---
  drivers/vhost/vdpa.c | 4 
  1 file changed, 4 insertions(+)

diff --git a/drivers/vhost/vdpa.c b/drivers/vhost/vdpa.c
index 448be7875b6d..4a241d380c40 100644
--- a/drivers/vhost/vdpa.c
+++ b/drivers/vhost/vdpa.c
@@ -49,6 +49,7 @@ struct vhost_vdpa {
struct eventfd_ctx *config_ctx;
int in_batch;
struct vdpa_iova_range range;
+   struct mutex mutex;



Let's use the device mutex, as vhost_process_iotlb_msg() does.

Thanks
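
Roughly (a sketch; vhost_vdpa_handle_iotlb_msg() is a hypothetical helper
standing in for the switch statement in the patch):

static int vhost_vdpa_process_iotlb_msg(struct vhost_dev *dev,
                                        struct vhost_iotlb_msg *msg)
{
        struct vhost_vdpa *v = container_of(dev, struct vhost_vdpa, vdev);
        int r;

        /* serialize with the other users of the vhost device */
        mutex_lock(&dev->mutex);
        r = vhost_vdpa_handle_iotlb_msg(v, msg); /* hypothetical helper */
        mutex_unlock(&dev->mutex);

        return r;
}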



  };
  
  static DEFINE_IDA(vhost_vdpa_ida);

@@ -728,6 +729,7 @@ static int vhost_vdpa_process_iotlb_msg(struct vhost_dev 
*dev,
if (r)
return r;
  
+	mutex_lock(&v->mutex);

switch (msg->type) {
case VHOST_IOTLB_UPDATE:
r = vhost_vdpa_process_iotlb_update(v, msg);
@@ -747,6 +749,7 @@ static int vhost_vdpa_process_iotlb_msg(struct vhost_dev 
*dev,
r = -EINVAL;
break;
}
+   mutex_unlock(&v->mutex);
  
  	return r;

  }
@@ -1017,6 +1020,7 @@ static int vhost_vdpa_probe(struct vdpa_device *vdpa)
return minor;
}
  
+	mutex_init(&v->mutex);

atomic_set(&v->opened, 0);
v->minor = minor;
v->vdpa = vdpa;




Re: [RFC v3 01/11] eventfd: track eventfd_signal() recursion depth separately in different cases

2021-01-19 Thread Jason Wang



On 2021/1/19 12:59 PM, Xie Yongji wrote:

Now we have a global percpu counter to limit the recursion depth
of eventfd_signal(). This can avoid deadlock or stack overflow.
But in stack overflow case, it should be OK to increase the
recursion depth if needed. So we add a percpu counter in eventfd_ctx
to limit the recursion depth for deadlock case. Then it could be
fine to increase the global percpu counter later.



I wonder whether it's worth introducing a percpu counter for each eventfd.

How about simply checking if eventfd_signal_count() is greater than 2?

Thanks
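
I.e. keep the single global percpu counter and just relax the check, along
the lines of the sketch below (EFD_WAKE_DEPTH is an assumed constant, and
the exact limit is up for discussion):

/* allow one extra level of nesting instead of bailing out at depth 1;
 * this helps the cross-eventfd nesting case without a per-ctx counter */
#define EFD_WAKE_DEPTH 1

__u64 eventfd_signal(struct eventfd_ctx *ctx, __u64 n)
{
        unsigned long flags;

        if (WARN_ON_ONCE(this_cpu_read(eventfd_wake_count) > EFD_WAKE_DEPTH))
                return 0;

        spin_lock_irqsave(&ctx->wqh.lock, flags);
        this_cpu_inc(eventfd_wake_count);
        if (ULLONG_MAX - ctx->count < n)
                n = ULLONG_MAX - ctx->count;
        ctx->count += n;
        if (waitqueue_active(&ctx->wqh))
                wake_up_locked_poll(&ctx->wqh, EPOLLIN);
        this_cpu_dec(eventfd_wake_count);
        spin_unlock_irqrestore(&ctx->wqh.lock, flags);

        return n;
}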




Signed-off-by: Xie Yongji 
---
  fs/aio.c|  3 ++-
  fs/eventfd.c| 20 +++-
  include/linux/eventfd.h |  5 +
  3 files changed, 22 insertions(+), 6 deletions(-)

diff --git a/fs/aio.c b/fs/aio.c
index 1f32da13d39e..5d82903161f5 100644
--- a/fs/aio.c
+++ b/fs/aio.c
@@ -1698,7 +1698,8 @@ static int aio_poll_wake(struct wait_queue_entry *wait, 
unsigned mode, int sync,
list_del(&iocb->ki_list);
iocb->ki_res.res = mangle_poll(mask);
req->done = true;
-   if (iocb->ki_eventfd && eventfd_signal_count()) {
+   if (iocb->ki_eventfd &&
+   eventfd_signal_count(iocb->ki_eventfd)) {
iocb = NULL;
INIT_WORK(&req->work, aio_poll_put_work);
schedule_work(&req->work);
diff --git a/fs/eventfd.c b/fs/eventfd.c
index e265b6dd4f34..2df24f9bada3 100644
--- a/fs/eventfd.c
+++ b/fs/eventfd.c
@@ -25,6 +25,8 @@
  #include 
  #include 
  
+#define EVENTFD_WAKE_DEPTH 0

+
  DEFINE_PER_CPU(int, eventfd_wake_count);
  
  static DEFINE_IDA(eventfd_ida);

@@ -42,9 +44,17 @@ struct eventfd_ctx {
 */
__u64 count;
unsigned int flags;
+   int __percpu *wake_count;
int id;
  };
  
+bool eventfd_signal_count(struct eventfd_ctx *ctx)

+{
+   return (this_cpu_read(*ctx->wake_count) ||
+   this_cpu_read(eventfd_wake_count) > EVENTFD_WAKE_DEPTH);
+}
+EXPORT_SYMBOL_GPL(eventfd_signal_count);
+
  /**
   * eventfd_signal - Adds @n to the eventfd counter.
   * @ctx: [in] Pointer to the eventfd context.
@@ -71,17 +81,19 @@ __u64 eventfd_signal(struct eventfd_ctx *ctx, __u64 n)
 * it returns true, the eventfd_signal() call should be deferred to a
 * safe context.
 */
-   if (WARN_ON_ONCE(this_cpu_read(eventfd_wake_count)))
+   if (WARN_ON_ONCE(eventfd_signal_count(ctx)))
return 0;
  
  	spin_lock_irqsave(&ctx->wqh.lock, flags);

this_cpu_inc(eventfd_wake_count);
+   this_cpu_inc(*ctx->wake_count);
if (ULLONG_MAX - ctx->count < n)
n = ULLONG_MAX - ctx->count;
ctx->count += n;
if (waitqueue_active(&ctx->wqh))
wake_up_locked_poll(&ctx->wqh, EPOLLIN);
this_cpu_dec(eventfd_wake_count);
+   this_cpu_dec(*ctx->wake_count);
spin_unlock_irqrestore(&ctx->wqh.lock, flags);
  
  	return n;

@@ -92,6 +104,7 @@ static void eventfd_free_ctx(struct eventfd_ctx *ctx)
  {
if (ctx->id >= 0)
ida_simple_remove(&eventfd_ida, ctx->id);
+   free_percpu(ctx->wake_count);
kfree(ctx);
  }
  
@@ -423,6 +436,11 @@ static int do_eventfd(unsigned int count, int flags)
  
  	kref_init(&ctx->kref);

init_waitqueue_head(&ctx->wqh);
+   ctx->wake_count = alloc_percpu(int);
+   if (!ctx->wake_count) {
+   kfree(ctx);
+   return -ENOMEM;
+   }
ctx->count = count;
ctx->flags = flags;
ctx->id = ida_simple_get(&eventfd_ida, 0, 0, GFP_KERNEL);
diff --git a/include/linux/eventfd.h b/include/linux/eventfd.h
index fa0a524baed0..1a11ebbd74a9 100644
--- a/include/linux/eventfd.h
+++ b/include/linux/eventfd.h
@@ -45,10 +45,7 @@ void eventfd_ctx_do_read(struct eventfd_ctx *ctx, __u64 
*cnt);
  
  DECLARE_PER_CPU(int, eventfd_wake_count);
  
-static inline bool eventfd_signal_count(void)

-{
-   return this_cpu_read(eventfd_wake_count);
-}
+bool eventfd_signal_count(struct eventfd_ctx *ctx);
  
  #else /* CONFIG_EVENTFD */
  




Re: [RFC v3 05/11] vdpa: shared virtual addressing support

2021-01-19 Thread Jason Wang



On 2021/1/19 12:59 PM, Xie Yongji wrote:

This patch introduces SVA (Shared Virtual Addressing)
support for vDPA device. During vDPA device allocation,
vDPA device driver needs to indicate whether SVA is
supported by the device. Then vhost-vdpa bus driver
will not pin user page and transfer userspace virtual
address instead of physical address during DMA mapping.

Suggested-by: Jason Wang 
Signed-off-by: Xie Yongji 
---
  drivers/vdpa/ifcvf/ifcvf_main.c   |  2 +-
  drivers/vdpa/mlx5/net/mlx5_vnet.c |  2 +-
  drivers/vdpa/vdpa.c   |  5 -
  drivers/vdpa/vdpa_sim/vdpa_sim.c  |  3 ++-
  drivers/vhost/vdpa.c  | 35 +++
  include/linux/vdpa.h  | 10 +++---
  6 files changed, 38 insertions(+), 19 deletions(-)

diff --git a/drivers/vdpa/ifcvf/ifcvf_main.c b/drivers/vdpa/ifcvf/ifcvf_main.c
index 23474af7da40..95c4601f82f5 100644
--- a/drivers/vdpa/ifcvf/ifcvf_main.c
+++ b/drivers/vdpa/ifcvf/ifcvf_main.c
@@ -439,7 +439,7 @@ static int ifcvf_probe(struct pci_dev *pdev, const struct 
pci_device_id *id)
  
  	adapter = vdpa_alloc_device(struct ifcvf_adapter, vdpa,

dev, &ifc_vdpa_ops,
-   IFCVF_MAX_QUEUE_PAIRS * 2, NULL);
+   IFCVF_MAX_QUEUE_PAIRS * 2, NULL, false);
if (adapter == NULL) {
IFCVF_ERR(pdev, "Failed to allocate vDPA structure");
return -ENOMEM;
diff --git a/drivers/vdpa/mlx5/net/mlx5_vnet.c 
b/drivers/vdpa/mlx5/net/mlx5_vnet.c
index 77595c81488d..05988d6907f2 100644
--- a/drivers/vdpa/mlx5/net/mlx5_vnet.c
+++ b/drivers/vdpa/mlx5/net/mlx5_vnet.c
@@ -1959,7 +1959,7 @@ static int mlx5v_probe(struct auxiliary_device *adev,
max_vqs = min_t(u32, max_vqs, MLX5_MAX_SUPPORTED_VQS);
  
  	ndev = vdpa_alloc_device(struct mlx5_vdpa_net, mvdev.vdev, mdev->device, &mlx5_vdpa_ops,

-2 * mlx5_vdpa_max_qps(max_vqs), NULL);
+2 * mlx5_vdpa_max_qps(max_vqs), NULL, false);
if (IS_ERR(ndev))
return PTR_ERR(ndev);
  
diff --git a/drivers/vdpa/vdpa.c b/drivers/vdpa/vdpa.c

index 32bd48baffab..50cab930b2e5 100644
--- a/drivers/vdpa/vdpa.c
+++ b/drivers/vdpa/vdpa.c
@@ -72,6 +72,7 @@ static void vdpa_release_dev(struct device *d)
   * @nvqs: number of virtqueues supported by this device
   * @size: size of the parent structure that contains private data
   * @name: name of the vdpa device; optional.
+ * @sva: indicate whether SVA (Shared Virtual Addressing) is supported
   *
   * Driver should use vdpa_alloc_device() wrapper macro instead of
   * using this directly.
@@ -81,7 +82,8 @@ static void vdpa_release_dev(struct device *d)
   */
  struct vdpa_device *__vdpa_alloc_device(struct device *parent,
const struct vdpa_config_ops *config,
-   int nvqs, size_t size, const char *name)
+   int nvqs, size_t size, const char *name,
+   bool sva)
  {
struct vdpa_device *vdev;
int err = -EINVAL;
@@ -108,6 +110,7 @@ struct vdpa_device *__vdpa_alloc_device(struct device 
*parent,
vdev->config = config;
vdev->features_valid = false;
vdev->nvqs = nvqs;
+   vdev->sva = sva;
  
  	if (name)

err = dev_set_name(&vdev->dev, "%s", name);
diff --git a/drivers/vdpa/vdpa_sim/vdpa_sim.c b/drivers/vdpa/vdpa_sim/vdpa_sim.c
index 85776e4e6749..03c796873a6b 100644
--- a/drivers/vdpa/vdpa_sim/vdpa_sim.c
+++ b/drivers/vdpa/vdpa_sim/vdpa_sim.c
@@ -367,7 +367,8 @@ static struct vdpasim *vdpasim_create(const char *name)
else
ops = &vdpasim_net_config_ops;
  
-	vdpasim = vdpa_alloc_device(struct vdpasim, vdpa, NULL, ops, VDPASIM_VQ_NUM, name);

+   vdpasim = vdpa_alloc_device(struct vdpasim, vdpa, NULL, ops,
+   VDPASIM_VQ_NUM, name, false);
if (!vdpasim)
goto err_alloc;
  
diff --git a/drivers/vhost/vdpa.c b/drivers/vhost/vdpa.c

index 4a241d380c40..36b6950ba37f 100644
--- a/drivers/vhost/vdpa.c
+++ b/drivers/vhost/vdpa.c
@@ -486,21 +486,25 @@ static long vhost_vdpa_unlocked_ioctl(struct file *filep,
  static void vhost_vdpa_iotlb_unmap(struct vhost_vdpa *v, u64 start, u64 last)
  {
struct vhost_dev *dev = &v->vdev;
+   struct vdpa_device *vdpa = v->vdpa;
struct vhost_iotlb *iotlb = dev->iotlb;
struct vhost_iotlb_map *map;
struct page *page;
unsigned long pfn, pinned;
  
  	while ((map = vhost_iotlb_itree_first(iotlb, start, last)) != NULL) {

-   pinned = map->size >> PAGE_SHIFT;
-   for (pfn = map->addr >> PAGE_SHIFT;
-pinned > 0; pfn++, pinned--) {
-   

Re: [RFC v3 06/11] vhost-vdpa: Add an opaque pointer for vhost IOTLB

2021-01-19 Thread Jason Wang



On 2021/1/19 12:59 PM, Xie Yongji wrote:

Add an opaque pointer for vhost IOTLB to store the
corresponding vma->vm_file and offset on the DMA mapping.



Let's split the patch into two.

1) opaque pointer
2) vma stuffs




It will be used in VDUSE case later.

Suggested-by: Jason Wang 
Signed-off-by: Xie Yongji 
---
  drivers/vdpa/vdpa_sim/vdpa_sim.c | 11 ---
  drivers/vhost/iotlb.c|  5 ++-
  drivers/vhost/vdpa.c | 66 +++-
  drivers/vhost/vhost.c|  4 +--
  include/linux/vdpa.h |  3 +-
  include/linux/vhost_iotlb.h  |  8 -
  6 files changed, 79 insertions(+), 18 deletions(-)

diff --git a/drivers/vdpa/vdpa_sim/vdpa_sim.c b/drivers/vdpa/vdpa_sim/vdpa_sim.c
index 03c796873a6b..1ffcef67954f 100644
--- a/drivers/vdpa/vdpa_sim/vdpa_sim.c
+++ b/drivers/vdpa/vdpa_sim/vdpa_sim.c
@@ -279,7 +279,7 @@ static dma_addr_t vdpasim_map_page(struct device *dev, 
struct page *page,
 */
spin_lock(&vdpasim->iommu_lock);
ret = vhost_iotlb_add_range(iommu, pa, pa + size - 1,
-   pa, dir_to_perm(dir));
+   pa, dir_to_perm(dir), NULL);



Maybe it's better to introduce
vhost_iotlb_add_range_ctx(), which accepts the opaque (context), and
let vhost_iotlb_add_range() just call that.
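
Something like (sketch):

int vhost_iotlb_add_range_ctx(struct vhost_iotlb *iotlb,
                              u64 start, u64 last,
                              u64 addr, unsigned int perm,
                              void *opaque);
/* ^ takes over the current body of vhost_iotlb_add_range(), plus
 *   "map->opaque = opaque;" */

int vhost_iotlb_add_range(struct vhost_iotlb *iotlb,
                          u64 start, u64 last,
                          u64 addr, unsigned int perm)
{
        /* existing callers keep the old signature and pass no context */
        return vhost_iotlb_add_range_ctx(iotlb, start, last, addr, perm, NULL);
}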




spin_unlock(&vdpasim->iommu_lock);
if (ret)
return DMA_MAPPING_ERROR;
@@ -317,7 +317,7 @@ static void *vdpasim_alloc_coherent(struct device *dev, 
size_t size,
  
  		ret = vhost_iotlb_add_range(iommu, (u64)pa,

(u64)pa + size - 1,
-   pa, VHOST_MAP_RW);
+   pa, VHOST_MAP_RW, NULL);
if (ret) {
*dma_addr = DMA_MAPPING_ERROR;
kfree(addr);
@@ -625,7 +625,8 @@ static int vdpasim_set_map(struct vdpa_device *vdpa,
for (map = vhost_iotlb_itree_first(iotlb, start, last); map;
 map = vhost_iotlb_itree_next(map, start, last)) {
ret = vhost_iotlb_add_range(vdpasim->iommu, map->start,
-   map->last, map->addr, map->perm);
+   map->last, map->addr,
+   map->perm, NULL);
if (ret)
goto err;
}
@@ -639,14 +640,14 @@ static int vdpasim_set_map(struct vdpa_device *vdpa,
  }
  
  static int vdpasim_dma_map(struct vdpa_device *vdpa, u64 iova, u64 size,

-  u64 pa, u32 perm)
+  u64 pa, u32 perm, void *opaque)
  {
struct vdpasim *vdpasim = vdpa_to_sim(vdpa);
int ret;
  
  	spin_lock(&vdpasim->iommu_lock);

ret = vhost_iotlb_add_range(vdpasim->iommu, iova, iova + size - 1, pa,
-   perm);
+   perm, NULL);
spin_unlock(&vdpasim->iommu_lock);
  
  	return ret;

diff --git a/drivers/vhost/iotlb.c b/drivers/vhost/iotlb.c
index 0fd3f87e913c..3bd5bd06cdbc 100644
--- a/drivers/vhost/iotlb.c
+++ b/drivers/vhost/iotlb.c
@@ -42,13 +42,15 @@ EXPORT_SYMBOL_GPL(vhost_iotlb_map_free);
   * @last: last of IOVA range
   * @addr: the address that is mapped to @start
   * @perm: access permission of this range
+ * @opaque: the opaque pointer for the IOTLB mapping
   *
   * Returns an error last is smaller than start or memory allocation
   * fails
   */
  int vhost_iotlb_add_range(struct vhost_iotlb *iotlb,
  u64 start, u64 last,
- u64 addr, unsigned int perm)
+ u64 addr, unsigned int perm,
+ void *opaque)
  {
struct vhost_iotlb_map *map;
  
@@ -71,6 +73,7 @@ int vhost_iotlb_add_range(struct vhost_iotlb *iotlb,

map->last = last;
map->addr = addr;
map->perm = perm;
+   map->opaque = opaque;
  
  	iotlb->nmaps++;

vhost_iotlb_itree_insert(map, &iotlb->root);
diff --git a/drivers/vhost/vdpa.c b/drivers/vhost/vdpa.c
index 36b6950ba37f..e83e5be7cec8 100644
--- a/drivers/vhost/vdpa.c
+++ b/drivers/vhost/vdpa.c
@@ -488,6 +488,7 @@ static void vhost_vdpa_iotlb_unmap(struct vhost_vdpa *v, 
u64 start, u64 last)
struct vhost_dev *dev = &v->vdev;
struct vdpa_device *vdpa = v->vdpa;
struct vhost_iotlb *iotlb = dev->iotlb;
+   struct vhost_iotlb_file *iotlb_file;
struct vhost_iotlb_map *map;
struct page *page;
unsigned long pfn, pinned;
@@ -504,6 +505,10 @@ static void vhost_vdpa_iotlb_unmap(struct vhost_vdpa *v, 
u64 start, u64 last)
}
atomic64_sub(map->size >> PAGE_SHIFT,
&

Re: [PATCH 1/1] vhost scsi: allocate vhost_scsi with GFP_NOWAIT to avoid delay

2021-01-21 Thread Jason Wang



On 2021/1/21 13:03, Dongli Zhang wrote:

The size of 'struct vhost_scsi' is order-10 (~2.3MB). kzalloc() may take a
long time compacting memory pages when there is a lack of high-order pages.
As a result, there is latency when creating a VM (with vhost-scsi) or
hot-adding vhost-scsi-based storage.

The prior commit 595cb754983d ("vhost/scsi: use vmalloc for order-10
allocation") prefers to fallback only when really needed, while this patch
changes allocation to GFP_NOWAIT in order to avoid the delay caused by
memory page compact.

Cc: Aruna Ramakrishna 
Cc: Joe Jin 
Signed-off-by: Dongli Zhang 
---
Another option is to rework by reducing the size of 'struct vhost_scsi',
e.g., by replacing inline vhost_scsi.vqs with just memory pointers while
each vhost_scsi.vqs[i] should be allocated separately. Please let me
know if that option is better.

  drivers/vhost/scsi.c | 2 +-
  1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/vhost/scsi.c b/drivers/vhost/scsi.c
index 4ce9f00ae10e..85eaa4e883f4 100644
--- a/drivers/vhost/scsi.c
+++ b/drivers/vhost/scsi.c
@@ -1814,7 +1814,7 @@ static int vhost_scsi_open(struct inode *inode, struct 
file *f)
struct vhost_virtqueue **vqs;
int r = -ENOMEM, i;
  
-	vs = kzalloc(sizeof(*vs), GFP_KERNEL | __GFP_NOWARN | __GFP_RETRY_MAYFAIL);

+   vs = kzalloc(sizeof(*vs), GFP_NOWAIT | __GFP_NOWARN);
if (!vs) {
vs = vzalloc(sizeof(*vs));
if (!vs)



Can we use kvzalloc?

Thanks
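
I.e. (sketch; error handling elided):

        /* kvzalloc() tries kmalloc first and transparently falls back to
         * vmalloc for large allocations, so the explicit two-step fallback
         * (and the GFP_NOWAIT change) would not be needed */
        vs = kvzalloc(sizeof(*vs), GFP_KERNEL);
        if (!vs)
                return -ENOMEM;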




Re: [PATCH net] vhost_net: fix high cpu load when sendmsg fails

2020-12-10 Thread Jason Wang



On 2020/12/9 9:27 PM, wangyunjian wrote:

-Original Message-
From: Michael S. Tsirkin [mailto:m...@redhat.com]
Sent: Wednesday, December 9, 2020 8:50 PM
To: wangyunjian 
Cc: jasow...@redhat.com; virtualizat...@lists.linux-foundation.org;
netdev@vger.kernel.org; Lilijun (Jerry) ;
chenchanghu ; xudingke 
Subject: Re: [PATCH net] vhost_net: fix high cpu load when sendmsg fails

On Wed, Dec 09, 2020 at 07:48:24PM +0800, wangyunjian wrote:

From: Yunjian Wang 

Currently we break the loop and wake up the vhost_worker when sendmsg
fails. When the worker wakes up again, we'll meet the same error. This
will cause high CPU load. To fix this issue, we can skip this
description by ignoring the error.

Signed-off-by: Yunjian Wang 
---
  drivers/vhost/net.c | 24 +---
  1 file changed, 5 insertions(+), 19 deletions(-)

diff --git a/drivers/vhost/net.c b/drivers/vhost/net.c index
531a00d703cd..ac950b1120f5 100644
--- a/drivers/vhost/net.c
+++ b/drivers/vhost/net.c
@@ -829,14 +829,8 @@ static void handle_tx_copy(struct vhost_net *net,
struct socket *sock)

/* TODO: Check specific error and bomb out unless ENOBUFS? */
err = sock->ops->sendmsg(sock, &msg, len);
-   if (unlikely(err < 0)) {
-   vhost_discard_vq_desc(vq, 1);
-   vhost_net_enable_vq(net, vq);
-   break;
-   }
-   if (err != len)
-   pr_debug("Truncated TX packet: len %d != %zd\n",
-err, len);
+   if (unlikely(err < 0 || err != len))
+   vq_err(vq, "Fail to sending packets err : %d, len : 
%zd\n", err,
+len);
  done:
vq->heads[nvq->done_idx].id = cpu_to_vhost32(vq, head);
vq->heads[nvq->done_idx].len = 0;

One of the reasons for sendmsg to fail is ENOBUFS.
In that case for sure we don't want to drop packet.

Now the function tap_sendmsg()/tun_sendmsg() don't return ENOBUFS.



I think not; it can happen if we exceed sndbuf. E.g. see tun_alloc_skb().

Thanks





There could be other transient errors.
Which error did you encounter, specifically?

Currently a guest vm send a skb which length is more than 64k.
If virtio hdr is wrong, the problem will also be triggered.

Thanks


@@ -925,19 +919,11 @@ static void handle_tx_zerocopy(struct vhost_net
*net, struct socket *sock)

/* TODO: Check specific error and bomb out unless ENOBUFS? */
err = sock->ops->sendmsg(sock, &msg, len);
-   if (unlikely(err < 0)) {
-   if (zcopy_used) {
+   if (unlikely(err < 0 || err != len)) {
+   if (zcopy_used && err < 0)
vhost_net_ubuf_put(ubufs);
-   nvq->upend_idx = ((unsigned)nvq->upend_idx - 1)
-   % UIO_MAXIOV;
-   }
-   vhost_discard_vq_desc(vq, 1);
-   vhost_net_enable_vq(net, vq);
-   break;
+   vq_err(vq, "Fail to sending packets err : %d, len : 
%zd\n", err,
+len);
}
-   if (err != len)
-   pr_debug("Truncated TX packet: "
-" len %d != %zd\n", err, len);
if (!zcopy_used)
vhost_add_used_and_signal(&net->dev, vq, head, 0);
else
--
2.23.0




Re: [PATCH net] vhost_net: fix high cpu load when sendmsg fails

2020-12-13 Thread Jason Wang



On 2020/12/11 3:37 PM, wangyunjian wrote:

-Original Message-
From: Jason Wang [mailto:jasow...@redhat.com]
Sent: Friday, December 11, 2020 10:53 AM
To: wangyunjian ; Michael S. Tsirkin

Cc: virtualizat...@lists.linux-foundation.org; netdev@vger.kernel.org; Lilijun
(Jerry) ; chenchanghu ;
xudingke 
Subject: Re: [PATCH net] vhost_net: fix high cpu load when sendmsg fails


On 2020/12/9 9:27 PM, wangyunjian wrote:

-Original Message-
From: Michael S. Tsirkin [mailto:m...@redhat.com]
Sent: Wednesday, December 9, 2020 8:50 PM
To: wangyunjian 
Cc: jasow...@redhat.com; virtualizat...@lists.linux-foundation.org;
netdev@vger.kernel.org; Lilijun (Jerry) ;
chenchanghu ; xudingke



Subject: Re: [PATCH net] vhost_net: fix high cpu load when sendmsg
fails

On Wed, Dec 09, 2020 at 07:48:24PM +0800, wangyunjian wrote:

From: Yunjian Wang 

Currently we break the loop and wake up the vhost_worker when
sendmsg fails. When the worker wakes up again, we'll meet the same
error. This will cause high CPU load. To fix this issue, we can skip
this description by ignoring the error.

Signed-off-by: Yunjian Wang 
---
   drivers/vhost/net.c | 24 +---
   1 file changed, 5 insertions(+), 19 deletions(-)

diff --git a/drivers/vhost/net.c b/drivers/vhost/net.c index
531a00d703cd..ac950b1120f5 100644
--- a/drivers/vhost/net.c
+++ b/drivers/vhost/net.c
@@ -829,14 +829,8 @@ static void handle_tx_copy(struct vhost_net
*net, struct socket *sock)

/* TODO: Check specific error and bomb out unless ENOBUFS?

*/

err = sock->ops->sendmsg(sock, &msg, len);
-   if (unlikely(err < 0)) {
-   vhost_discard_vq_desc(vq, 1);
-   vhost_net_enable_vq(net, vq);
-   break;
-   }
-   if (err != len)
-   pr_debug("Truncated TX packet: len %d != %zd\n",
-err, len);
+   if (unlikely(err < 0 || err != len))
+   vq_err(vq, "Fail to sending packets err : %d, len : 
%zd\n",

err,

+len);
   done:
vq->heads[nvq->done_idx].id = cpu_to_vhost32(vq, head);
vq->heads[nvq->done_idx].len = 0;

One of the reasons for sendmsg to fail is ENOBUFS.
In that case for sure we don't want to drop packet.

Now the function tap_sendmsg()/tun_sendmsg() don't return ENOBUFS.


I think not; it can happen if we exceed sndbuf. E.g. see tun_alloc_skb().

This patch 'net: add alloc_skb_with_frags() helper' modifys the return value
of sock_alloc_send_pskb() from -ENOBUFS to -EAGAIN when we exceeds sndbuf.
So the return value of tun_alloc_skb has been changed.



Ok.




We don't drop packet if the reasons for sendmsg to fail is EAGAIN.
How about this?



It should work.

Btw, the patch doesn't add the head to the used ring. This may confuse
the driver.


Thanks




Thanks


Thanks



There could be other transient errors.
Which error did you encounter, specifically?

Currently a guest vm send a skb which length is more than 64k.
If virtio hdr is wrong, the problem will also be triggered.

Thanks


@@ -925,19 +919,11 @@ static void handle_tx_zerocopy(struct
vhost_net *net, struct socket *sock)

/* TODO: Check specific error and bomb out unless ENOBUFS?

*/

err = sock->ops->sendmsg(sock, &msg, len);
-   if (unlikely(err < 0)) {
-   if (zcopy_used) {
+   if (unlikely(err < 0 || err != len)) {
+   if (zcopy_used && err < 0)
vhost_net_ubuf_put(ubufs);
-   nvq->upend_idx = ((unsigned)nvq->upend_idx - 1)
-   % UIO_MAXIOV;
-   }
-   vhost_discard_vq_desc(vq, 1);
-   vhost_net_enable_vq(net, vq);
-   break;
+   vq_err(vq, "Fail to sending packets err : %d, len : 
%zd\n",

err,

+len);
}
-   if (err != len)
-   pr_debug("Truncated TX packet: "
-" len %d != %zd\n", err, len);
if (!zcopy_used)
vhost_add_used_and_signal(&net->dev, vq, head, 0);
else
--
2.23.0




Re: [PATCH net v2] tun: fix ubuf refcount incorrectly on error path

2020-12-13 Thread Jason Wang



On 2020/12/14 9:32 AM, Willem de Bruijn wrote:

On Sat, Dec 12, 2020 at 7:18 PM Willem de Bruijn
 wrote:

afterwards, the error handling in vhost handle_tx() will try to
decrease the same refcount again. This is wrong and fix this by delay
copying ubuf_info until we're sure there's no errors.

I think the right approach is to address this in the error paths, rather than
complicate the normal datapath.

Is it sufficient to suppress the call to vhost_net_ubuf_put in the handle_tx
sendmsg error path, given that vhost_zerocopy_callback will be called on
kfree_skb?

We can not call kfree_skb() until the skb was created.


Or alternatively clear the destructor in drop:

The uarg->callback() is called immediately after we decide to do datacopy,
even if the caller wants to do zerocopy. If another error occurs later, vhost's
handle_tx() will try to decrease it again.

Oh right, I missed the else branch in this path:

 /* copy skb_ubuf_info for callback when skb has no error */
 if (zerocopy) {
 skb_shinfo(skb)->destructor_arg = msg_control;
 skb_shinfo(skb)->tx_flags |= SKBTX_DEV_ZEROCOPY;
 skb_shinfo(skb)->tx_flags |= SKBTX_SHARED_FRAG;
 } else if (msg_control) {
 struct ubuf_info *uarg = msg_control;
 uarg->callback(uarg, false);
 }

So if handle_tx_zerocopy calls tun_sendmsg with ubuf_info (and thus a
reference to release), there are these five options:

1. tun_sendmsg succeeds, ubuf_info is associated with skb.
  reference released from kfree_skb calling vhost_zerocopy_callback later

2. tun_sendmsg succeeds, ubuf_info is released immediately, as skb is
not zerocopy.

3. tun_sendmsg fails before creating skb, handle_tx_zerocopy correctly
cleans up on receiving error from tun_sendmsg.

4. tun_sendmsg fails after creating skb, but with copying: decremented
at branch shown above + again in handle_tx_zerocopy

5. tun_sendmsg fails after creating skb, with zerocopy: decremented at
kfree_skb in drop: + again in handle_tx_zerocopy

Since handle_tx_zerocopy has no idea whether on error 3, 4 or 5
occurred,

Actually, it does. If sendmsg returns an error, it can test whether
vq->heads[nvq->upend_idx].len != VHOST_DMA_IN_PROGRESS.



Just to make sure I understand this: any reason it can't be
VHOST_DMA_IN_PROGRESS here?


Thanks





either all decrement-on-error cases must be handled by
handle_tx_zerocopy or none.

Your patch chooses the latter. Makes sense.

But can this still go wrong if the xdp path is taken, but no program
exists or the program returns XDP_PASS. And then the packet hits an
error path, such as ! IFF_UP?




Re: [PATCH net v2] tun: fix ubuf refcount incorrectly on error path

2020-12-13 Thread Jason Wang



On 2020/12/14 11:56 AM, Willem de Bruijn wrote:

On Sun, Dec 13, 2020 at 10:54 PM Willem de Bruijn
 wrote:

On Sun, Dec 13, 2020 at 10:30 PM Jason Wang  wrote:


On 2020/12/14 9:32 AM, Willem de Bruijn wrote:

On Sat, Dec 12, 2020 at 7:18 PM Willem de Bruijn
 wrote:

afterwards, the error handling in vhost handle_tx() will try to
decrease the same refcount again. This is wrong and fix this by delay
copying ubuf_info until we're sure there's no errors.

I think the right approach is to address this in the error paths, rather than
complicate the normal datapath.

Is it sufficient to suppress the call to vhost_net_ubuf_put in the handle_tx
sendmsg error path, given that vhost_zerocopy_callback will be called on
kfree_skb?

We can not call kfree_skb() until the skb was created.


Or alternatively clear the destructor in drop:

The uarg->callback() is called immediately after we decide to do datacopy,
even if the caller wants to do zerocopy. If another error occurs later, vhost's
handle_tx() will try to decrease it again.

Oh right, I missed the else branch in this path:

  /* copy skb_ubuf_info for callback when skb has no error */
  if (zerocopy) {
  skb_shinfo(skb)->destructor_arg = msg_control;
  skb_shinfo(skb)->tx_flags |= SKBTX_DEV_ZEROCOPY;
  skb_shinfo(skb)->tx_flags |= SKBTX_SHARED_FRAG;
  } else if (msg_control) {
  struct ubuf_info *uarg = msg_control;
  uarg->callback(uarg, false);
  }

So if handle_tx_zerocopy calls tun_sendmsg with ubuf_info (and thus a
reference to release), there are these five options:

1. tun_sendmsg succeeds, ubuf_info is associated with skb.
   reference released from kfree_skb calling vhost_zerocopy_callback later

2. tun_sendmsg succeeds, ubuf_info is released immediately, as skb is
not zerocopy.

3. tun_sendmsg fails before creating skb, handle_tx_zerocopy correctly
cleans up on receiving error from tun_sendmsg.

4. tun_sendmsg fails after creating skb, but with copying: decremented
at branch shown above + again in handle_tx_zerocopy

5. tun_sendmsg fails after creating skb, with zerocopy: decremented at
kfree_skb in drop: + again in handle_tx_zerocopy

Since handle_tx_zerocopy has no idea whether on error 3, 4 or 5
occurred,

Actually, it does. If sendmsg returns an error, it can test whether
vq->heads[nvq->upend_idx].len != VHOST_DMA_IN_PROGRESS.


Just to make sure I understand this: any reason it can't be
VHOST_DMA_IN_PROGRESS here?

It can be, and it will be if tun_sendmsg returns EINVAL before
assigning the skb destructor.

I meant returns an error, not necessarily only EINVAL.


Only if tun_sendmsg released the zerocopy state through
kfree_skb->vhost_zerocopy_callback will it have been updated to
VHOST_DMA_DONE_LEN. And only then must the caller not try to release
the state again.





I see. So I tend to fix this in vhost instead of tun to be consistent 
with the current error handling in handle_tx_zerocopy().


Thanks



Re: [PATCH net 2/2] vhost_net: fix high cpu load when sendmsg fails

2020-12-14 Thread Jason Wang



On 2020/12/15 9:48 AM, wangyunjian wrote:

From: Yunjian Wang 

Currently we break the loop and wake up the vhost_worker when
sendmsg fails. When the worker wakes up again, we'll meet the
same error. This will cause high CPU load. To fix this issue,
we can skip this description by ignoring the error. When we
exceeds sndbuf, the return value of sendmsg is -EAGAIN. In
the case we don't skip the description and don't drop packet.

Signed-off-by: Yunjian Wang 
---
  drivers/vhost/net.c | 21 +
  1 file changed, 9 insertions(+), 12 deletions(-)

diff --git a/drivers/vhost/net.c b/drivers/vhost/net.c
index c8784dfafdd7..f966592d8900 100644
--- a/drivers/vhost/net.c
+++ b/drivers/vhost/net.c
@@ -827,16 +827,13 @@ static void handle_tx_copy(struct vhost_net *net, struct 
socket *sock)
msg.msg_flags &= ~MSG_MORE;
}
  
-		/* TODO: Check specific error and bomb out unless ENOBUFS? */

err = sock->ops->sendmsg(sock, &msg, len);
-   if (unlikely(err < 0)) {
+   if (unlikely(err == -EAGAIN)) {
vhost_discard_vq_desc(vq, 1);
vhost_net_enable_vq(net, vq);
break;
-   }



As I've pointed out in the last version: if you don't discard the descriptor,
you probably need to add the head to the used ring. Otherwise this
descriptor will always be inflight, which may confuse drivers.
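
I.e. something along these lines (sketch): only -EAGAIN keeps the descriptor
for a retry; any other error is logged and execution still falls through to
done: so the head is recycled to the used ring:

                err = sock->ops->sendmsg(sock, &msg, len);
                if (unlikely(err == -EAGAIN)) {
                        vhost_discard_vq_desc(vq, 1);
                        vhost_net_enable_vq(net, vq);
                        break;
                } else if (unlikely(err != len)) {
                        /* err != len also covers err < 0 */
                        vq_err(vq, "Fail to send packet: err %d, len %zd\n",
                               err, len);
                }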




-   if (err != len)
-   pr_debug("Truncated TX packet: len %d != %zd\n",
-err, len);
+   } else if (unlikely(err < 0 || err != len))



It looks to me err != len covers err < 0.

Thanks



+   vq_err(vq, "Fail to sending packets err : %d, len : 
%zd\n", err, len);
  done:
vq->heads[nvq->done_idx].id = cpu_to_vhost32(vq, head);
vq->heads[nvq->done_idx].len = 0;
@@ -922,7 +919,6 @@ static void handle_tx_zerocopy(struct vhost_net *net, 
struct socket *sock)
msg.msg_flags &= ~MSG_MORE;
}
  
-		/* TODO: Check specific error and bomb out unless ENOBUFS? */

err = sock->ops->sendmsg(sock, &msg, len);
if (unlikely(err < 0)) {
if (zcopy_used) {
@@ -931,13 +927,14 @@ static void handle_tx_zerocopy(struct vhost_net *net, 
struct socket *sock)
nvq->upend_idx = ((unsigned)nvq->upend_idx - 1)
% UIO_MAXIOV;
}
-   vhost_discard_vq_desc(vq, 1);
-   vhost_net_enable_vq(net, vq);
-   break;
+   if (err == -EAGAIN) {
+   vhost_discard_vq_desc(vq, 1);
+   vhost_net_enable_vq(net, vq);
+   break;
+   }
}
if (err != len)
-   pr_debug("Truncated TX packet: "
-" len %d != %zd\n", err, len);
+   vq_err(vq, "Fail to sending packets err : %d, len : 
%zd\n", err, len);
if (!zcopy_used)
vhost_add_used_and_signal(&net->dev, vq, head, 0);
else




Re: [PATCH net 2/2] vhost_net: fix high cpu load when sendmsg fails

2020-12-15 Thread Jason Wang



- Original Message -
> 
> 
> > -Original Message-
> > From: Jason Wang [mailto:jasow...@redhat.com]
> > Sent: Tuesday, December 15, 2020 12:10 PM
> > To: wangyunjian ; netdev@vger.kernel.org;
> > m...@redhat.com; willemdebruijn.ker...@gmail.com
> > Cc: virtualizat...@lists.linux-foundation.org; Lilijun (Jerry)
> > ; chenchanghu ;
> > xudingke ; huangbin (J)
> > 
> > Subject: Re: [PATCH net 2/2] vhost_net: fix high cpu load when sendmsg
> > fails
> > 
> > 
> > > On 2020/12/15 9:48 AM, wangyunjian wrote:
> > > From: Yunjian Wang 
> > >
> > > Currently we break the loop and wake up the vhost_worker when sendmsg
> > > fails. When the worker wakes up again, we'll meet the same error. This
> > > will cause high CPU load. To fix this issue, we can skip this
> > > description by ignoring the error. When we exceeds sndbuf, the return
> > > value of sendmsg is -EAGAIN. In the case we don't skip the description
> > > and don't drop packet.
> > >
> > > Signed-off-by: Yunjian Wang 
> > > ---
> > >   drivers/vhost/net.c | 21 +
> > >   1 file changed, 9 insertions(+), 12 deletions(-)
> > >
> > > diff --git a/drivers/vhost/net.c b/drivers/vhost/net.c index
> > > c8784dfafdd7..f966592d8900 100644
> > > --- a/drivers/vhost/net.c
> > > +++ b/drivers/vhost/net.c
> > > @@ -827,16 +827,13 @@ static void handle_tx_copy(struct vhost_net *net,
> > struct socket *sock)
> > >   msg.msg_flags &= ~MSG_MORE;
> > >   }
> > >
> > > - /* TODO: Check specific error and bomb out unless ENOBUFS? */
> > >   err = sock->ops->sendmsg(sock, &msg, len);
> > > - if (unlikely(err < 0)) {
> > > + if (unlikely(err == -EAGAIN)) {
> > >   vhost_discard_vq_desc(vq, 1);
> > >   vhost_net_enable_vq(net, vq);
> > >   break;
> > > - }
> > 
> > 
> > As I've pointed out in last version. If you don't discard descriptor, you
> > probably
> > need to add the head to used ring. Otherwise this descriptor will be always
> > inflight that may confuse drivers.
> 
> Sorry for missing the comment.
> 
> After deleting discard descriptor and break, the next processing will be the
> same
> as the normal success of sendmsg(), and vhost_zerocopy_signal_used() or
> vhost_add_used_and_signal() method will be called to add the head to used
> ring.

It's the next head, not the one that contains the buggy packet?

Thanks

> 
> Thanks
> > 
> > 
> > > - if (err != len)
> > > - pr_debug("Truncated TX packet: len %d != %zd\n",
> > > -  err, len);
> > > + } else if (unlikely(err < 0 || err != len))
> > 
> > 
> > It looks to me err != len covers err < 0.
> 
> OK
> 
> > 
> > Thanks
> > 
> > 
> > > + vq_err(vq, "Fail to sending packets err : %d, len : 
> > > %zd\n", err,
> > > +len);
> > >   done:
> > >   vq->heads[nvq->done_idx].id = cpu_to_vhost32(vq, head);
> > >   vq->heads[nvq->done_idx].len = 0;
> > > @@ -922,7 +919,6 @@ static void handle_tx_zerocopy(struct vhost_net
> > *net, struct socket *sock)
> > >   msg.msg_flags &= ~MSG_MORE;
> > >   }
> > >
> > > - /* TODO: Check specific error and bomb out unless ENOBUFS? */
> > >   err = sock->ops->sendmsg(sock, &msg, len);
> > >   if (unlikely(err < 0)) {
> > >   if (zcopy_used) {
> > > @@ -931,13 +927,14 @@ static void handle_tx_zerocopy(struct vhost_net
> > *net, struct socket *sock)
> > >   nvq->upend_idx = 
> > > ((unsigned)nvq->upend_idx - 1)
> > >   % UIO_MAXIOV;
> > >   }
> > > - vhost_discard_vq_desc(vq, 1);
> > > - vhost_net_enable_vq(net, vq);
> > > - break;
> > > + if (err == -EAGAIN) {
> > > + vhost_discard_vq_desc(vq, 1);
> > > + vhost_net_enable_vq(net, vq);
> > > + break;
> > > + }
> > >   }
> > >   if (err != len)
> > > - pr_debug("Truncated TX packet: "
> > > -  " len %d != %zd\n", err, len);
> > > + vq_err(vq, "Fail to sending packets err : %d, len : 
> > > %zd\n", err,
> > > +len);
> > >   if (!zcopy_used)
> > >   vhost_add_used_and_signal(&net->dev, vq, head, 
> > > 0);
> > >   else
> 
> 



[PATCH 00/21] Control VQ support in vDPA

2020-12-15 Thread Jason Wang
Hi All:

This series tries to add the support for control virtqueue in vDPA.

The control virtqueue is used by networking devices to accept various
commands from the driver. It is required for supporting multiqueue and
other configurations.

When used by the vhost-vDPA bus driver for a VM, the control virtqueue
should be shadowed by the userspace VMM (Qemu) instead of being assigned
directly to the guest. This is because Qemu needs to know the device state
in order to start and stop the device correctly (e.g. for live migration).

This requires isolating the memory mapping for the control virtqueue
presented by vhost-vDPA to prevent the guest from accessing it directly.

To achieve this, vDPA introduces two new abstractions:

- address space: identified through an address space id (ASID) and
 maintaining a set of memory mappings
- virtqueue group: the minimal set of virtqueues that must share an
 address space

The device needs to advertise the following attributes to vDPA:

- the number of address spaces supported by the device
- the number of virtqueue groups supported by the device
- the mapping from a specific virtqueue to its virtqueue group

The mapping from virtqueue to virtqueue group is fixed and defined
by the vDPA device driver. E.g:

- A device that has hardware ASID support can simply advertise one
  virtqueue group per virtqueue.
- A device that does not have hardware ASID support can simply
  advertise a single virtqueue group that contains all virtqueues. Or,
  if it wants a software emulated control virtqueue, it can advertise
  two virtqueue groups: one for the cvq, another for the rest of the
  virtqueues.

vDPA also allows changing the association between a virtqueue group and
an address space. So in the case of the control virtqueue, the userspace
VMM (Qemu) may use a dedicated address space for the control virtqueue
group to isolate its memory mappings.

vhost/vhost-vDPA is also extended so that userspace can:

- query the number of virtqueue groups and address spaces supported by
  the device
- query the virtqueue group for a specific virtqueue
- associate a virtqueue group with an address space
- send ASID based IOTLB commands

This helps the userspace VMM (Qemu) to detect whether the control vq
can be supported and to isolate the memory mappings of the control
virtqueue from the others.

To demonstrate the usage, the vDPA simulator is extended to support
setting the MAC address via an emulated control virtqueue.

Please review.

Changes since RFC:

- tweak vhost uAPI documentation
- switch to use device specific IOTLB really in patch 4
- tweak the commit log
- fix that the ASID in vhost was claimed to be 32-bit but is actually
  16-bit
- fix use after free when using ASID with IOTLB batching requests
- switch to use Stefano's patch for having separated iov
- remove unused "used_as" variable
- fix the iotlb/asid checking in vhost_vdpa_unmap()

Thanks

Jason Wang (20):
  vhost: move the backend feature bits to vhost_types.h
  virtio-vdpa: don't set callback if virtio doesn't need it
  vhost-vdpa: passing iotlb to IOMMU mapping helpers
  vhost-vdpa: switch to use vhost-vdpa specific IOTLB
  vdpa: add the missing comment for nvqs in struct vdpa_device
  vdpa: introduce virtqueue groups
  vdpa: multiple address spaces support
  vdpa: introduce config operations for associating ASID to a virtqueue
group
  vhost_iotlb: split out IOTLB initialization
  vhost: support ASID in IOTLB API
  vhost-vdpa: introduce asid based IOTLB
  vhost-vdpa: introduce uAPI to get the number of virtqueue groups
  vhost-vdpa: introduce uAPI to get the number of address spaces
  vhost-vdpa: uAPI to get virtqueue group id
  vhost-vdpa: introduce uAPI to set group ASID
  vhost-vdpa: support ASID based IOTLB API
  vdpa_sim: advertise VIRTIO_NET_F_MTU
  vdpa_sim: factor out buffer completion logic
  vdpa_sim: filter destination mac address
  vdpasim: control virtqueue support

Stefano Garzarella (1):
  vdpa_sim: split vdpasim_virtqueue's iov field in out_iov and in_iov

 drivers/vdpa/ifcvf/ifcvf_main.c   |   9 +-
 drivers/vdpa/mlx5/net/mlx5_vnet.c |  11 +-
 drivers/vdpa/vdpa.c   |   8 +-
 drivers/vdpa/vdpa_sim/vdpa_sim.c  | 292 --
 drivers/vhost/iotlb.c |  23 ++-
 drivers/vhost/vdpa.c  | 246 -
 drivers/vhost/vhost.c |  23 ++-
 drivers/vhost/vhost.h |   4 +-
 drivers/virtio/virtio_vdpa.c  |   2 +-
 include/linux/vdpa.h  |  42 -
 include/linux/vhost_iotlb.h   |   2 +
 include/uapi/linux/vhost.h|  25 ++-
 include/uapi/linux/vhost_types.h  |  10 +-
 13 files changed, 561 insertions(+), 136 deletions(-)

-- 
2.25.1



[PATCH 02/21] virtio-vdpa: don't set callback if virtio doesn't need it

2020-12-15 Thread Jason Wang
There's no need to set a callback for a driver that doesn't care
about it.

Signed-off-by: Jason Wang 
---
 drivers/virtio/virtio_vdpa.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/virtio/virtio_vdpa.c b/drivers/virtio/virtio_vdpa.c
index 4a9ddb44b2a7..af6ee677f319 100644
--- a/drivers/virtio/virtio_vdpa.c
+++ b/drivers/virtio/virtio_vdpa.c
@@ -175,7 +175,7 @@ virtio_vdpa_setup_vq(struct virtio_device *vdev, unsigned 
int index,
}
 
/* Setup virtqueue callback */
-   cb.callback = virtio_vdpa_virtqueue_cb;
+   cb.callback = callback ? virtio_vdpa_virtqueue_cb : NULL;
cb.private = info;
ops->set_vq_cb(vdpa, index, &cb);
ops->set_vq_num(vdpa, index, virtqueue_get_vring_size(vq));
-- 
2.25.1



[PATCH 03/21] vhost-vdpa: passing iotlb to IOMMU mapping helpers

2020-12-15 Thread Jason Wang
To prepare for ASID support in vhost-vdpa, pass the IOTLB object to the
DMA helpers. No functional changes; it's just a preparation for
supporting multiple IOTLBs.

Signed-off-by: Jason Wang 
---
 drivers/vhost/vdpa.c | 41 +++--
 1 file changed, 23 insertions(+), 18 deletions(-)

diff --git a/drivers/vhost/vdpa.c b/drivers/vhost/vdpa.c
index 29ed4173f04e..07f92d48c173 100644
--- a/drivers/vhost/vdpa.c
+++ b/drivers/vhost/vdpa.c
@@ -501,10 +501,11 @@ static long vhost_vdpa_unlocked_ioctl(struct file *filep,
return r;
 }
 
-static void vhost_vdpa_iotlb_unmap(struct vhost_vdpa *v, u64 start, u64 last)
+static void vhost_vdpa_iotlb_unmap(struct vhost_vdpa *v,
+  struct vhost_iotlb *iotlb,
+  u64 start, u64 last)
 {
struct vhost_dev *dev = &v->vdev;
-   struct vhost_iotlb *iotlb = dev->iotlb;
struct vhost_iotlb_map *map;
struct page *page;
unsigned long pfn, pinned;
@@ -526,8 +527,9 @@ static void vhost_vdpa_iotlb_unmap(struct vhost_vdpa *v, 
u64 start, u64 last)
 static void vhost_vdpa_iotlb_free(struct vhost_vdpa *v)
 {
struct vhost_dev *dev = &v->vdev;
+   struct vhost_iotlb *iotlb = dev->iotlb;
 
-   vhost_vdpa_iotlb_unmap(v, 0ULL, 0ULL - 1);
+   vhost_vdpa_iotlb_unmap(v, iotlb, 0ULL, 0ULL - 1);
kfree(dev->iotlb);
dev->iotlb = NULL;
 }
@@ -554,7 +556,7 @@ static int perm_to_iommu_flags(u32 perm)
return flags | IOMMU_CACHE;
 }
 
-static int vhost_vdpa_map(struct vhost_vdpa *v,
+static int vhost_vdpa_map(struct vhost_vdpa *v, struct vhost_iotlb *iotlb,
  u64 iova, u64 size, u64 pa, u32 perm)
 {
struct vhost_dev *dev = &v->vdev;
@@ -562,7 +564,7 @@ static int vhost_vdpa_map(struct vhost_vdpa *v,
const struct vdpa_config_ops *ops = vdpa->config;
int r = 0;
 
-   r = vhost_iotlb_add_range(dev->iotlb, iova, iova + size - 1,
+   r = vhost_iotlb_add_range(iotlb, iova, iova + size - 1,
  pa, perm);
if (r)
return r;
@@ -571,43 +573,44 @@ static int vhost_vdpa_map(struct vhost_vdpa *v,
r = ops->dma_map(vdpa, iova, size, pa, perm);
} else if (ops->set_map) {
if (!v->in_batch)
-   r = ops->set_map(vdpa, dev->iotlb);
+   r = ops->set_map(vdpa, iotlb);
} else {
r = iommu_map(v->domain, iova, pa, size,
  perm_to_iommu_flags(perm));
}
 
if (r)
-   vhost_iotlb_del_range(dev->iotlb, iova, iova + size - 1);
+   vhost_iotlb_del_range(iotlb, iova, iova + size - 1);
else
atomic64_add(size >> PAGE_SHIFT, &dev->mm->pinned_vm);
 
return r;
 }
 
-static void vhost_vdpa_unmap(struct vhost_vdpa *v, u64 iova, u64 size)
+static void vhost_vdpa_unmap(struct vhost_vdpa *v,
+struct vhost_iotlb *iotlb,
+u64 iova, u64 size)
 {
-   struct vhost_dev *dev = &v->vdev;
struct vdpa_device *vdpa = v->vdpa;
const struct vdpa_config_ops *ops = vdpa->config;
 
-   vhost_vdpa_iotlb_unmap(v, iova, iova + size - 1);
+   vhost_vdpa_iotlb_unmap(v, iotlb, iova, iova + size - 1);
 
if (ops->dma_map) {
ops->dma_unmap(vdpa, iova, size);
} else if (ops->set_map) {
if (!v->in_batch)
-   ops->set_map(vdpa, dev->iotlb);
+   ops->set_map(vdpa, iotlb);
} else {
iommu_unmap(v->domain, iova, size);
}
 }
 
 static int vhost_vdpa_process_iotlb_update(struct vhost_vdpa *v,
+  struct vhost_iotlb *iotlb,
   struct vhost_iotlb_msg *msg)
 {
struct vhost_dev *dev = &v->vdev;
-   struct vhost_iotlb *iotlb = dev->iotlb;
struct page **page_list;
unsigned long list_size = PAGE_SIZE / sizeof(struct page *);
unsigned int gup_flags = FOLL_LONGTERM;
@@ -676,7 +679,7 @@ static int vhost_vdpa_process_iotlb_update(struct 
vhost_vdpa *v,
if (last_pfn && (this_pfn != last_pfn + 1)) {
/* Pin a contiguous chunk of memory */
csize = (last_pfn - map_pfn + 1) << PAGE_SHIFT;
-   ret = vhost_vdpa_map(v, iova, csize,
+   ret = vhost_vdpa_map(v, iotlb, iova, csize,
 map_pfn << PAGE_SHIFT,
 msg->perm);
if (ret) {
@@ -706,7 +709,8 @@ static int vhost_v

[PATCH 04/21] vhost-vdpa: switch to use vhost-vdpa specific IOTLB

2020-12-15 Thread Jason Wang
To ease the implementation of per-group ASID support for vDPA
devices, this patch switches to a vhost-vdpa specific IOTLB to
avoid unnecessary refactoring of the vhost core.

Signed-off-by: Jason Wang 
---
 drivers/vhost/vdpa.c | 18 ++
 1 file changed, 10 insertions(+), 8 deletions(-)

diff --git a/drivers/vhost/vdpa.c b/drivers/vhost/vdpa.c
index 07f92d48c173..9bcc03d4e68b 100644
--- a/drivers/vhost/vdpa.c
+++ b/drivers/vhost/vdpa.c
@@ -39,6 +39,7 @@ struct vhost_vdpa {
struct vhost_virtqueue *vqs;
struct completion completion;
struct vdpa_device *vdpa;
+   struct vhost_iotlb *iotlb;
struct device dev;
struct cdev cdev;
atomic_t opened;
@@ -526,12 +527,11 @@ static void vhost_vdpa_iotlb_unmap(struct vhost_vdpa *v,
 
 static void vhost_vdpa_iotlb_free(struct vhost_vdpa *v)
 {
-   struct vhost_dev *dev = &v->vdev;
-   struct vhost_iotlb *iotlb = dev->iotlb;
+   struct vhost_iotlb *iotlb = v->iotlb;
 
vhost_vdpa_iotlb_unmap(v, iotlb, 0ULL, 0ULL - 1);
-   kfree(dev->iotlb);
-   dev->iotlb = NULL;
+   kfree(v->iotlb);
+   v->iotlb = NULL;
 }
 
 static int perm_to_iommu_flags(u32 perm)
@@ -745,7 +745,7 @@ static int vhost_vdpa_process_iotlb_msg(struct vhost_dev 
*dev,
struct vhost_vdpa *v = container_of(dev, struct vhost_vdpa, vdev);
struct vdpa_device *vdpa = v->vdpa;
const struct vdpa_config_ops *ops = vdpa->config;
-   struct vhost_iotlb *iotlb = dev->iotlb;
+   struct vhost_iotlb *iotlb = v->iotlb;
int r = 0;
 
r = vhost_dev_check_owner(dev);
@@ -883,15 +883,15 @@ static int vhost_vdpa_open(struct inode *inode, struct 
file *filep)
vhost_dev_init(dev, vqs, nvqs, 0, 0, 0, false,
   vhost_vdpa_process_iotlb_msg);
 
-   dev->iotlb = vhost_iotlb_alloc(0, 0);
-   if (!dev->iotlb) {
+   v->iotlb = vhost_iotlb_alloc(0, 0);
+   if (!v->iotlb) {
r = -ENOMEM;
goto err_init_iotlb;
}
 
r = vhost_vdpa_alloc_domain(v);
if (r)
-   goto err_init_iotlb;
+   goto err_alloc_domain;
 
vhost_vdpa_set_iova_range(v);
 
@@ -899,6 +899,8 @@ static int vhost_vdpa_open(struct inode *inode, struct file 
*filep)
 
return 0;
 
+err_alloc_domain:
+   vhost_vdpa_iotlb_free(v);
 err_init_iotlb:
vhost_dev_cleanup(&v->vdev);
kfree(vqs);
-- 
2.25.1



[PATCH 05/21] vdpa: add the missing comment for nvqs in struct vdpa_device

2020-12-15 Thread Jason Wang
Signed-off-by: Jason Wang 
---
 include/linux/vdpa.h | 1 +
 1 file changed, 1 insertion(+)

diff --git a/include/linux/vdpa.h b/include/linux/vdpa.h
index 30bc7a7223bb..8ab8dcde705d 100644
--- a/include/linux/vdpa.h
+++ b/include/linux/vdpa.h
@@ -42,6 +42,7 @@ struct vdpa_vq_state {
  * @config: the configuration ops for this device.
  * @index: device index
  * @features_valid: were features initialized? for legacy guests
+ * @nvqs: the number of virtqueues
  */
 struct vdpa_device {
struct device dev;
-- 
2.25.1



[PATCH 07/21] vdpa: multiple address spaces support

2020-12-15 Thread Jason Wang
This patch introduces multiple address spaces support for vDPA
devices. The idea is to identify a specific address space via a
dedicated identifier - ASID.

During vDPA device allocation, the vDPA device driver needs to report
the number of address spaces supported by the device, then the DMA
mapping ops of the vDPA device need to be extended to support ASID.

This helps to isolate the environments for the virtqueues that will not
be assigned directly. E.g. in the case of virtio-net, the control
virtqueue will not be assigned directly to the guest.

As a start, simply claim 1 virtqueue group and 1 address space for
all vDPA devices. And vhost-vDPA will simply reject any device with
more than 1 virtqueue group or address space.

Signed-off-by: Jason Wang 
---
 drivers/vdpa/ifcvf/ifcvf_main.c   |  2 +-
 drivers/vdpa/mlx5/net/mlx5_vnet.c |  5 +++--
 drivers/vdpa/vdpa.c   |  4 +++-
 drivers/vdpa/vdpa_sim/vdpa_sim.c  | 10 ++
 drivers/vhost/vdpa.c  | 14 +-
 include/linux/vdpa.h  | 23 ---
 6 files changed, 38 insertions(+), 20 deletions(-)

diff --git a/drivers/vdpa/ifcvf/ifcvf_main.c b/drivers/vdpa/ifcvf/ifcvf_main.c
index c629f4fcc738..8a43f562b169 100644
--- a/drivers/vdpa/ifcvf/ifcvf_main.c
+++ b/drivers/vdpa/ifcvf/ifcvf_main.c
@@ -445,7 +445,7 @@ static int ifcvf_probe(struct pci_dev *pdev, const struct 
pci_device_id *id)
 
adapter = vdpa_alloc_device(struct ifcvf_adapter, vdpa,
dev, &ifc_vdpa_ops,
-   IFCVF_MAX_QUEUE_PAIRS * 2, 1);
+   IFCVF_MAX_QUEUE_PAIRS * 2, 1, 1);
 
if (adapter == NULL) {
IFCVF_ERR(pdev, "Failed to allocate vDPA structure");
diff --git a/drivers/vdpa/mlx5/net/mlx5_vnet.c 
b/drivers/vdpa/mlx5/net/mlx5_vnet.c
index 719b52fcc547..7aaf0a4ee80d 100644
--- a/drivers/vdpa/mlx5/net/mlx5_vnet.c
+++ b/drivers/vdpa/mlx5/net/mlx5_vnet.c
@@ -1804,7 +1804,8 @@ static u32 mlx5_vdpa_get_generation(struct vdpa_device 
*vdev)
return mvdev->generation;
 }
 
-static int mlx5_vdpa_set_map(struct vdpa_device *vdev, struct vhost_iotlb 
*iotlb)
+static int mlx5_vdpa_set_map(struct vdpa_device *vdev, unsigned int asid,
+struct vhost_iotlb *iotlb)
 {
struct mlx5_vdpa_dev *mvdev = to_mvdev(vdev);
struct mlx5_vdpa_net *ndev = to_mlx5_vdpa_ndev(mvdev);
@@ -1947,7 +1948,7 @@ void *mlx5_vdpa_add_dev(struct mlx5_core_dev *mdev)
max_vqs = min_t(u32, max_vqs, MLX5_MAX_SUPPORTED_VQS);
 
ndev = vdpa_alloc_device(struct mlx5_vdpa_net, mvdev.vdev, 
mdev->device, &mlx5_vdpa_ops,
-2 * mlx5_vdpa_max_qps(max_vqs), 1);
+2 * mlx5_vdpa_max_qps(max_vqs), 1, 1);
if (IS_ERR(ndev))
return ndev;
 
diff --git a/drivers/vdpa/vdpa.c b/drivers/vdpa/vdpa.c
index 46399746ec7c..05195fa7865d 100644
--- a/drivers/vdpa/vdpa.c
+++ b/drivers/vdpa/vdpa.c
@@ -63,6 +63,7 @@ static void vdpa_release_dev(struct device *d)
  * @config: the bus operations that is supported by this device
  * @nvqs: number of virtqueues supported by this device
  * @ngroups: number of groups supported by this device
+ * @nas: number of address spaces supported by this device
  * @size: size of the parent structure that contains private data
  *
  * Driver should use vdpa_alloc_device() wrapper macro instead of
@@ -74,7 +75,7 @@ static void vdpa_release_dev(struct device *d)
 struct vdpa_device *__vdpa_alloc_device(struct device *parent,
const struct vdpa_config_ops *config,
int nvqs, unsigned int ngroups,
-   size_t size)
+   unsigned int nas, size_t size)
 {
struct vdpa_device *vdev;
int err = -EINVAL;
@@ -102,6 +103,7 @@ struct vdpa_device *__vdpa_alloc_device(struct device 
*parent,
vdev->features_valid = false;
vdev->nvqs = nvqs;
vdev->ngroups = ngroups;
+   vdev->nas = nas;
 
err = dev_set_name(&vdev->dev, "vdpa%u", vdev->index);
if (err)
diff --git a/drivers/vdpa/vdpa_sim/vdpa_sim.c b/drivers/vdpa/vdpa_sim/vdpa_sim.c
index 5d554b3cd152..140de452 100644
--- a/drivers/vdpa/vdpa_sim/vdpa_sim.c
+++ b/drivers/vdpa/vdpa_sim/vdpa_sim.c
@@ -359,7 +359,7 @@ static struct vdpasim *vdpasim_create(void)
ops = &vdpasim_net_config_ops;
 
vdpasim = vdpa_alloc_device(struct vdpasim, vdpa, NULL, ops,
-   VDPASIM_VQ_NUM, 1);
+   VDPASIM_VQ_NUM, 1, 1);
if (!vdpasim)
goto err_alloc;
 
@@ -606,7 +606,7 @@ static struct vdpa_iova_range vdpasim_get_iova_range(struct 
vdpa_device *vdpa)
return range;
 }
 
-static int vdpasim_set_map(

[PATCH 06/21] vdpa: introduce virtqueue groups

2020-12-15 Thread Jason Wang
This patch introduces virtqueue groups to vDPA devices. A virtqueue
group is the minimal set of virtqueues that must share an address
space. And an address space identifier can only be attached to
a specific virtqueue group.

A new mandated bus operation is introduced to get the virtqueue group
ID for a specific virtqueue.

All the vDPA device drivers were converted to simply support a single
virtqueue group.

Signed-off-by: Jason Wang 
---
 drivers/vdpa/ifcvf/ifcvf_main.c   |  9 -
 drivers/vdpa/mlx5/net/mlx5_vnet.c |  8 +++-
 drivers/vdpa/vdpa.c   |  4 +++-
 drivers/vdpa/vdpa_sim/vdpa_sim.c  | 11 ++-
 include/linux/vdpa.h  | 12 +---
 5 files changed, 37 insertions(+), 7 deletions(-)

diff --git a/drivers/vdpa/ifcvf/ifcvf_main.c b/drivers/vdpa/ifcvf/ifcvf_main.c
index 8b4028556cb6..c629f4fcc738 100644
--- a/drivers/vdpa/ifcvf/ifcvf_main.c
+++ b/drivers/vdpa/ifcvf/ifcvf_main.c
@@ -332,6 +332,11 @@ static u32 ifcvf_vdpa_get_vq_align(struct vdpa_device 
*vdpa_dev)
return IFCVF_QUEUE_ALIGNMENT;
 }
 
+static u32 ifcvf_vdpa_get_vq_group(struct vdpa_device *vdpa, u16 idx)
+{
+   return 0;
+}
+
 static void ifcvf_vdpa_get_config(struct vdpa_device *vdpa_dev,
  unsigned int offset,
  void *buf, unsigned int len)
@@ -392,6 +397,7 @@ static const struct vdpa_config_ops ifc_vdpa_ops = {
.get_device_id  = ifcvf_vdpa_get_device_id,
.get_vendor_id  = ifcvf_vdpa_get_vendor_id,
.get_vq_align   = ifcvf_vdpa_get_vq_align,
+   .get_vq_group   = ifcvf_vdpa_get_vq_group,
.get_config = ifcvf_vdpa_get_config,
.set_config = ifcvf_vdpa_set_config,
.set_config_cb  = ifcvf_vdpa_set_config_cb,
@@ -439,7 +445,8 @@ static int ifcvf_probe(struct pci_dev *pdev, const struct 
pci_device_id *id)
 
adapter = vdpa_alloc_device(struct ifcvf_adapter, vdpa,
dev, &ifc_vdpa_ops,
-   IFCVF_MAX_QUEUE_PAIRS * 2);
+   IFCVF_MAX_QUEUE_PAIRS * 2, 1);
+
if (adapter == NULL) {
IFCVF_ERR(pdev, "Failed to allocate vDPA structure");
return -ENOMEM;
diff --git a/drivers/vdpa/mlx5/net/mlx5_vnet.c 
b/drivers/vdpa/mlx5/net/mlx5_vnet.c
index 1fa6fcac8299..719b52fcc547 100644
--- a/drivers/vdpa/mlx5/net/mlx5_vnet.c
+++ b/drivers/vdpa/mlx5/net/mlx5_vnet.c
@@ -1436,6 +1436,11 @@ static u32 mlx5_vdpa_get_vq_align(struct vdpa_device 
*vdev)
return PAGE_SIZE;
 }
 
+static u32 mlx5_vdpa_get_vq_group(struct vdpa_device *vdpa, u16 idx)
+{
+   return 0;
+}
+
 enum { MLX5_VIRTIO_NET_F_GUEST_CSUM = 1 << 9,
MLX5_VIRTIO_NET_F_CSUM = 1 << 10,
MLX5_VIRTIO_NET_F_HOST_TSO6 = 1 << 11,
@@ -1854,6 +1859,7 @@ static const struct vdpa_config_ops mlx5_vdpa_ops = {
.get_vq_notification = mlx5_get_vq_notification,
.get_vq_irq = mlx5_get_vq_irq,
.get_vq_align = mlx5_vdpa_get_vq_align,
+   .get_vq_group = mlx5_vdpa_get_vq_group,
.get_features = mlx5_vdpa_get_features,
.set_features = mlx5_vdpa_set_features,
.set_config_cb = mlx5_vdpa_set_config_cb,
@@ -1941,7 +1947,7 @@ void *mlx5_vdpa_add_dev(struct mlx5_core_dev *mdev)
max_vqs = min_t(u32, max_vqs, MLX5_MAX_SUPPORTED_VQS);
 
ndev = vdpa_alloc_device(struct mlx5_vdpa_net, mvdev.vdev, 
mdev->device, &mlx5_vdpa_ops,
-2 * mlx5_vdpa_max_qps(max_vqs));
+2 * mlx5_vdpa_max_qps(max_vqs), 1);
if (IS_ERR(ndev))
return ndev;
 
diff --git a/drivers/vdpa/vdpa.c b/drivers/vdpa/vdpa.c
index a69ffc991e13..46399746ec7c 100644
--- a/drivers/vdpa/vdpa.c
+++ b/drivers/vdpa/vdpa.c
@@ -62,6 +62,7 @@ static void vdpa_release_dev(struct device *d)
  * @parent: the parent device
  * @config: the bus operations that is supported by this device
  * @nvqs: number of virtqueues supported by this device
+ * @ngroups: number of groups supported by this device
  * @size: size of the parent structure that contains private data
  *
  * Driver should use vdpa_alloc_device() wrapper macro instead of
@@ -72,7 +73,7 @@ static void vdpa_release_dev(struct device *d)
  */
 struct vdpa_device *__vdpa_alloc_device(struct device *parent,
const struct vdpa_config_ops *config,
-   int nvqs,
+   int nvqs, unsigned int ngroups,
size_t size)
 {
struct vdpa_device *vdev;
@@ -100,6 +101,7 @@ struct vdpa_device *__vdpa_alloc_device(struct device 
*parent,
vdev->config = config;
vdev->features_valid = false;
vdev->nvqs = nvqs;
+   vdev->ngroups = ngroups;
 
err = dev_set_name(&vdev->dev, "vdpa%u", vdev->index);
 

[PATCH 09/21] vhost_iotlb: split out IOTLB initialization

2020-12-15 Thread Jason Wang
This patch splits out IOTLB initialization so that it can be
reused by external modules.

Signed-off-by: Jason Wang 
---
 drivers/vhost/iotlb.c   | 23 ++-
 include/linux/vhost_iotlb.h |  2 ++
 2 files changed, 20 insertions(+), 5 deletions(-)

diff --git a/drivers/vhost/iotlb.c b/drivers/vhost/iotlb.c
index 0fd3f87e913c..e842d76c179e 100644
--- a/drivers/vhost/iotlb.c
+++ b/drivers/vhost/iotlb.c
@@ -98,6 +98,23 @@ void vhost_iotlb_del_range(struct vhost_iotlb *iotlb, u64 
start, u64 last)
 }
 EXPORT_SYMBOL_GPL(vhost_iotlb_del_range);
 
+/**
+ * vhost_iotlb_init - initialize a vhost IOTLB
+ * @iotlb: the IOTLB that needs to be initialized
+ * @limit: maximum number of IOTLB entries
+ * @flags: VHOST_IOTLB_FLAG_XXX
+ */
+void vhost_iotlb_init(struct vhost_iotlb *iotlb, unsigned int limit,
+ unsigned int flags)
+{
+   iotlb->root = RB_ROOT_CACHED;
+   iotlb->limit = limit;
+   iotlb->nmaps = 0;
+   iotlb->flags = flags;
+   INIT_LIST_HEAD(&iotlb->list);
+}
+EXPORT_SYMBOL_GPL(vhost_iotlb_init);
+
 /**
  * vhost_iotlb_alloc - add a new vhost IOTLB
  * @limit: maximum number of IOTLB entries
@@ -112,11 +129,7 @@ struct vhost_iotlb *vhost_iotlb_alloc(unsigned int limit, 
unsigned int flags)
if (!iotlb)
return NULL;
 
-   iotlb->root = RB_ROOT_CACHED;
-   iotlb->limit = limit;
-   iotlb->nmaps = 0;
-   iotlb->flags = flags;
-   INIT_LIST_HEAD(&iotlb->list);
+   vhost_iotlb_init(iotlb, limit, flags);
 
return iotlb;
 }
diff --git a/include/linux/vhost_iotlb.h b/include/linux/vhost_iotlb.h
index 6b09b786a762..c0df193ec3e1 100644
--- a/include/linux/vhost_iotlb.h
+++ b/include/linux/vhost_iotlb.h
@@ -33,6 +33,8 @@ int vhost_iotlb_add_range(struct vhost_iotlb *iotlb, u64 
start, u64 last,
  u64 addr, unsigned int perm);
 void vhost_iotlb_del_range(struct vhost_iotlb *iotlb, u64 start, u64 last);
 
+void vhost_iotlb_init(struct vhost_iotlb *iotlb, unsigned int limit,
+ unsigned int flags);
 struct vhost_iotlb *vhost_iotlb_alloc(unsigned int limit, unsigned int flags);
 void vhost_iotlb_free(struct vhost_iotlb *iotlb);
 void vhost_iotlb_reset(struct vhost_iotlb *iotlb);
-- 
2.25.1



[PATCH 01/21] vhost: move the backend feature bits to vhost_types.h

2020-12-15 Thread Jason Wang
We should store feature bits in vhost_types.h, as has been done
for e.g. VHOST_F_LOG_ALL.

Signed-off-by: Jason Wang 
---
 include/uapi/linux/vhost.h   | 5 -
 include/uapi/linux/vhost_types.h | 5 +
 2 files changed, 5 insertions(+), 5 deletions(-)

diff --git a/include/uapi/linux/vhost.h b/include/uapi/linux/vhost.h
index c998860d7bbc..59c6c0fbaba1 100644
--- a/include/uapi/linux/vhost.h
+++ b/include/uapi/linux/vhost.h
@@ -89,11 +89,6 @@
 
 /* Set or get vhost backend capability */
 
-/* Use message type V2 */
-#define VHOST_BACKEND_F_IOTLB_MSG_V2 0x1
-/* IOTLB can accept batching hints */
-#define VHOST_BACKEND_F_IOTLB_BATCH  0x2
-
 #define VHOST_SET_BACKEND_FEATURES _IOW(VHOST_VIRTIO, 0x25, __u64)
 #define VHOST_GET_BACKEND_FEATURES _IOR(VHOST_VIRTIO, 0x26, __u64)
 
diff --git a/include/uapi/linux/vhost_types.h b/include/uapi/linux/vhost_types.h
index f7f6a3a28977..76ee7016c501 100644
--- a/include/uapi/linux/vhost_types.h
+++ b/include/uapi/linux/vhost_types.h
@@ -153,4 +153,9 @@ struct vhost_vdpa_iova_range {
 /* vhost-net should add virtio_net_hdr for RX, and strip for TX packets. */
 #define VHOST_NET_F_VIRTIO_NET_HDR 27
 
+/* Use message type V2 */
+#define VHOST_BACKEND_F_IOTLB_MSG_V2 0x1
+/* IOTLB can accept batching hints */
+#define VHOST_BACKEND_F_IOTLB_BATCH  0x2
+
 #endif
-- 
2.25.1



[PATCH 10/21] vhost: support ASID in IOTLB API

2020-12-15 Thread Jason Wang
This patch allows userspace to send ASID based IOTLB messages to
vhost. The idea is to use the reserved u32 field in the existing V2
IOTLB message. The vhost device should advertise this capability via the
VHOST_BACKEND_F_IOTLB_ASID backend feature.
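
For illustration, userspace could probe for this capability before
tagging its IOTLB messages; a minimal fragment (assuming the updated
<linux/vhost.h> and an already-open vhost fd, message layout details
elided):

	__u64 features;
	int use_asid = 0;

	/* Sketch: only tag IOTLB messages with an ASID when the backend
	 * advertises VHOST_BACKEND_F_IOTLB_ASID.
	 */
	if (!ioctl(fd, VHOST_GET_BACKEND_FEATURES, &features) &&
	    (features & (1ULL << VHOST_BACKEND_F_IOTLB_ASID)))
		use_asid = 1;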

Signed-off-by: Jason Wang 
---
 drivers/vhost/vdpa.c |  5 -
 drivers/vhost/vhost.c| 23 ++-
 drivers/vhost/vhost.h|  4 ++--
 include/uapi/linux/vhost_types.h |  5 -
 4 files changed, 28 insertions(+), 9 deletions(-)

diff --git a/drivers/vhost/vdpa.c b/drivers/vhost/vdpa.c
index 03a9b3311c6c..feb6a58df22d 100644
--- a/drivers/vhost/vdpa.c
+++ b/drivers/vhost/vdpa.c
@@ -739,7 +739,7 @@ static int vhost_vdpa_process_iotlb_update(struct 
vhost_vdpa *v,
return ret;
 }
 
-static int vhost_vdpa_process_iotlb_msg(struct vhost_dev *dev,
+static int vhost_vdpa_process_iotlb_msg(struct vhost_dev *dev, u32 asid,
struct vhost_iotlb_msg *msg)
 {
struct vhost_vdpa *v = container_of(dev, struct vhost_vdpa, vdev);
@@ -748,6 +748,9 @@ static int vhost_vdpa_process_iotlb_msg(struct vhost_dev 
*dev,
struct vhost_iotlb *iotlb = v->iotlb;
int r = 0;
 
+   if (asid != 0)
+   return -EINVAL;
+
r = vhost_dev_check_owner(dev);
if (r)
return r;
diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
index a262e12c6dc2..7477b724c29b 100644
--- a/drivers/vhost/vhost.c
+++ b/drivers/vhost/vhost.c
@@ -468,7 +468,7 @@ void vhost_dev_init(struct vhost_dev *dev,
struct vhost_virtqueue **vqs, int nvqs,
int iov_limit, int weight, int byte_weight,
bool use_worker,
-   int (*msg_handler)(struct vhost_dev *dev,
+   int (*msg_handler)(struct vhost_dev *dev, u32 asid,
   struct vhost_iotlb_msg *msg))
 {
struct vhost_virtqueue *vq;
@@ -1084,11 +1084,14 @@ static bool umem_access_ok(u64 uaddr, u64 size, int 
access)
return true;
 }
 
-static int vhost_process_iotlb_msg(struct vhost_dev *dev,
+static int vhost_process_iotlb_msg(struct vhost_dev *dev, u16 asid,
   struct vhost_iotlb_msg *msg)
 {
int ret = 0;
 
+   if (asid != 0)
+   return -EINVAL;
+
mutex_lock(&dev->mutex);
vhost_dev_lock_vqs(dev);
switch (msg->type) {
@@ -1135,6 +1138,7 @@ ssize_t vhost_chr_write_iter(struct vhost_dev *dev,
struct vhost_iotlb_msg msg;
size_t offset;
int type, ret;
+   u16 asid = 0;
 
ret = copy_from_iter(&type, sizeof(type), from);
if (ret != sizeof(type)) {
@@ -1150,7 +1154,16 @@ ssize_t vhost_chr_write_iter(struct vhost_dev *dev,
offset = offsetof(struct vhost_msg, iotlb) - sizeof(int);
break;
case VHOST_IOTLB_MSG_V2:
-   offset = sizeof(__u32);
+   if (vhost_backend_has_feature(dev->vqs[0],
+ VHOST_BACKEND_F_IOTLB_ASID)) {
+   ret = copy_from_iter(&asid, sizeof(asid), from);
+   if (ret != sizeof(asid)) {
+   ret = -EINVAL;
+   goto done;
+   }
+   offset = sizeof(__u16);
+   } else
+   offset = sizeof(__u32);
break;
default:
ret = -EINVAL;
@@ -1165,9 +1178,9 @@ ssize_t vhost_chr_write_iter(struct vhost_dev *dev,
}
 
if (dev->msg_handler)
-   ret = dev->msg_handler(dev, &msg);
+   ret = dev->msg_handler(dev, asid, &msg);
else
-   ret = vhost_process_iotlb_msg(dev, &msg);
+   ret = vhost_process_iotlb_msg(dev, asid, &msg);
if (ret) {
ret = -EFAULT;
goto done;
diff --git a/drivers/vhost/vhost.h b/drivers/vhost/vhost.h
index b063324c7669..19753a90875c 100644
--- a/drivers/vhost/vhost.h
+++ b/drivers/vhost/vhost.h
@@ -162,7 +162,7 @@ struct vhost_dev {
int byte_weight;
u64 kcov_handle;
bool use_worker;
-   int (*msg_handler)(struct vhost_dev *dev,
+   int (*msg_handler)(struct vhost_dev *dev, u32 asid,
   struct vhost_iotlb_msg *msg);
 };
 
@@ -170,7 +170,7 @@ bool vhost_exceeds_weight(struct vhost_virtqueue *vq, int 
pkts, int total_len);
 void vhost_dev_init(struct vhost_dev *, struct vhost_virtqueue **vqs,
int nvqs, int iov_limit, int weight, int byte_weight,
bool use_worker,
-   int (*msg_handler)(struct vhost_dev *dev,
+   int (*msg_handler)(struct vhost_dev *dev, u32 asid,
   struct vhost_iotlb_msg *msg));
 long vhost_dev_set_owner(struct 

[PATCH 08/21] vdpa: introduce config operations for associating ASID to a virtqueue group

2020-12-15 Thread Jason Wang
This patch introduces a new bus operation to allow the vDPA bus driver
to associate an ASID with a virtqueue group.
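
As an illustration only (not part of this patch), a parent driver that
supports a single address space could implement the new op by rejecting
any non-default association; drivers with on-chip IOMMU support would
program their hardware translation here instead:

/* Hypothetical helper for a driver with one group and one address
 * space; returning 0 keeps the default group 0 <-> ASID 0 binding.
 */
static int single_as_set_group_asid(struct vdpa_device *vdev,
				    unsigned int group, unsigned int asid)
{
	if (group || asid)
		return -EINVAL;

	return 0;
}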

Signed-off-by: Jason Wang 
---
 include/linux/vdpa.h | 10 ++
 1 file changed, 10 insertions(+)

diff --git a/include/linux/vdpa.h b/include/linux/vdpa.h
index 0a9a754f8180..2a8671f27b0b 100644
--- a/include/linux/vdpa.h
+++ b/include/linux/vdpa.h
@@ -174,6 +174,12 @@ struct vdpa_iova_range {
  * @vdev: vdpa device
  * Returns the iova range supported by
  * the device.
+ * @set_group_asid:Set address space identifier for a
+ * virtqueue group
+ * @vdev: vdpa device
+ * @group: virtqueue group
+ * @asid: address space id for this group
+ * Returns integer: success (0) or error (< 0)
  * @set_map:   Set device memory mapping (optional)
  * Needed for device that using device
  * specific DMA translation (on-chip IOMMU)
@@ -252,6 +258,10 @@ struct vdpa_config_ops {
   u64 iova, u64 size, u64 pa, u32 perm);
int (*dma_unmap)(struct vdpa_device *vdev, unsigned int asid,
 u64 iova, u64 size);
+   int (*set_group_asid)(struct vdpa_device *vdev, unsigned int group,
+ unsigned int asid);
+
+
 
/* Free device resources */
void (*free)(struct vdpa_device *vdev);
-- 
2.25.1



[PATCH 13/21] vhost-vdpa: introduce uAPI to get the number of address spaces

2020-12-15 Thread Jason Wang
This patch introduces the uAPI for getting the number of address
spaces supported by this vDPA device.
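
Userspace usage is a single read-only ioctl; a minimal sketch (assuming
the updated <linux/vhost.h> and an open vhost-vdpa fd):

	unsigned int nas;

	/* Sketch: query how many address spaces the device supports. */
	if (ioctl(fd, VHOST_VDPA_GET_AS_NUM, &nas))
		nas = 1;	/* assume a single address space on failure */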

Signed-off-by: Jason Wang 
---
 drivers/vhost/vdpa.c   | 3 +++
 include/uapi/linux/vhost.h | 2 ++
 2 files changed, 5 insertions(+)

diff --git a/drivers/vhost/vdpa.c b/drivers/vhost/vdpa.c
index 1ba5901b28e7..bff8aa214f78 100644
--- a/drivers/vhost/vdpa.c
+++ b/drivers/vhost/vdpa.c
@@ -540,6 +540,9 @@ static long vhost_vdpa_unlocked_ioctl(struct file *filep,
r = copy_to_user(argp, &v->vdpa->ngroups,
 sizeof(v->vdpa->ngroups));
break;
+   case VHOST_VDPA_GET_AS_NUM:
+   r = copy_to_user(argp, &v->vdpa->nas, sizeof(v->vdpa->nas));
+   break;
case VHOST_SET_LOG_BASE:
case VHOST_SET_LOG_FD:
r = -ENOIOCTLCMD;
diff --git a/include/uapi/linux/vhost.h b/include/uapi/linux/vhost.h
index 8a4e6e426bbf..8762911a3cb8 100644
--- a/include/uapi/linux/vhost.h
+++ b/include/uapi/linux/vhost.h
@@ -148,4 +148,6 @@
 /* Get the number of virtqueue groups. */
 #define VHOST_VDPA_GET_GROUP_NUM   _IOR(VHOST_VIRTIO, 0x79, unsigned int)
 
+/* Get the number of address spaces. */
+#define VHOST_VDPA_GET_AS_NUM  _IOR(VHOST_VIRTIO, 0x7A, unsigned int)
 #endif
-- 
2.25.1



[PATCH 12/21] vhost-vdpa: introduce uAPI to get the number of virtqueue groups

2020-12-15 Thread Jason Wang
Following the vDPA support for multiple address spaces, this patch
introduces a uAPI for userspace to query the number of virtqueue
groups supported by the vDPA device.
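
A minimal userspace sketch (assuming the updated <linux/vhost.h> and an
open /dev/vhost-vdpa-N descriptor):

	unsigned int ngroups;

	/* Sketch: query the number of virtqueue groups exposed by the
	 * vhost-vdpa device.
	 */
	if (ioctl(fd, VHOST_VDPA_GET_GROUP_NUM, &ngroups))
		ngroups = 1;	/* assume a single group on failure */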

Signed-off-by: Jason Wang 
---
 drivers/vhost/vdpa.c   | 4 
 include/uapi/linux/vhost.h | 3 +++
 2 files changed, 7 insertions(+)

diff --git a/drivers/vhost/vdpa.c b/drivers/vhost/vdpa.c
index 060d5b5b7e64..1ba5901b28e7 100644
--- a/drivers/vhost/vdpa.c
+++ b/drivers/vhost/vdpa.c
@@ -536,6 +536,10 @@ static long vhost_vdpa_unlocked_ioctl(struct file *filep,
case VHOST_VDPA_GET_VRING_NUM:
r = vhost_vdpa_get_vring_num(v, argp);
break;
+   case VHOST_VDPA_GET_GROUP_NUM:
+   r = copy_to_user(argp, &v->vdpa->ngroups,
+sizeof(v->vdpa->ngroups));
+   break;
case VHOST_SET_LOG_BASE:
case VHOST_SET_LOG_FD:
r = -ENOIOCTLCMD;
diff --git a/include/uapi/linux/vhost.h b/include/uapi/linux/vhost.h
index 59c6c0fbaba1..8a4e6e426bbf 100644
--- a/include/uapi/linux/vhost.h
+++ b/include/uapi/linux/vhost.h
@@ -145,4 +145,7 @@
 /* Get the valid iova range */
 #define VHOST_VDPA_GET_IOVA_RANGE  _IOR(VHOST_VIRTIO, 0x78, \
 struct vhost_vdpa_iova_range)
+/* Get the number of virtqueue groups. */
+#define VHOST_VDPA_GET_GROUP_NUM   _IOR(VHOST_VIRTIO, 0x79, unsigned int)
+
 #endif
-- 
2.25.1



[PATCH 11/21] vhost-vdpa: introduce asid based IOTLB

2020-12-15 Thread Jason Wang
This patch converts the vhost-vDPA device to support multiple IOTLBs,
tagged by ASID and stored in an hlist. This will be used for supporting
multiple address spaces in the following patches.

Signed-off-by: Jason Wang 
---
 drivers/vhost/vdpa.c | 106 ---
 1 file changed, 80 insertions(+), 26 deletions(-)

diff --git a/drivers/vhost/vdpa.c b/drivers/vhost/vdpa.c
index feb6a58df22d..060d5b5b7e64 100644
--- a/drivers/vhost/vdpa.c
+++ b/drivers/vhost/vdpa.c
@@ -33,13 +33,21 @@ enum {
 
 #define VHOST_VDPA_DEV_MAX (1U << MINORBITS)
 
+#define VHOST_VDPA_IOTLB_BUCKETS 16
+
+struct vhost_vdpa_as {
+   struct hlist_node hash_link;
+   struct vhost_iotlb iotlb;
+   u32 id;
+};
+
 struct vhost_vdpa {
struct vhost_dev vdev;
struct iommu_domain *domain;
struct vhost_virtqueue *vqs;
struct completion completion;
struct vdpa_device *vdpa;
-   struct vhost_iotlb *iotlb;
+   struct hlist_head as[VHOST_VDPA_IOTLB_BUCKETS];
struct device dev;
struct cdev cdev;
atomic_t opened;
@@ -49,12 +57,64 @@ struct vhost_vdpa {
struct eventfd_ctx *config_ctx;
int in_batch;
struct vdpa_iova_range range;
+   int used_as;
 };
 
 static DEFINE_IDA(vhost_vdpa_ida);
 
 static dev_t vhost_vdpa_major;
 
+static struct vhost_vdpa_as *asid_to_as(struct vhost_vdpa *v, u32 asid)
+{
+   struct hlist_head *head = &v->as[asid % VHOST_VDPA_IOTLB_BUCKETS];
+   struct vhost_vdpa_as *as;
+
+   hlist_for_each_entry(as, head, hash_link)
+   if (as->id == asid)
+   return as;
+
+   return NULL;
+}
+
+static struct vhost_vdpa_as *vhost_vdpa_alloc_as(struct vhost_vdpa *v, u32 
asid)
+{
+   struct hlist_head *head = &v->as[asid % VHOST_VDPA_IOTLB_BUCKETS];
+   struct vhost_vdpa_as *as;
+
+   if (asid_to_as(v, asid))
+   return NULL;
+
+   as = kmalloc(sizeof(*as), GFP_KERNEL);
+   if (!as)
+   return NULL;
+
+   vhost_iotlb_init(&as->iotlb, 0, 0);
+   as->id = asid;
+   hlist_add_head(&as->hash_link, head);
+   ++v->used_as;
+
+   return as;
+}
+
+static int vhost_vdpa_remove_as(struct vhost_vdpa *v, u32 asid)
+{
+   struct vhost_vdpa_as *as = asid_to_as(v, asid);
+
+   /* Remove default address space is not allowed */
+   if (asid == 0)
+   return -EINVAL;
+
+   if (!as)
+   return -EINVAL;
+
+   hlist_del(&as->hash_link);
+   vhost_iotlb_reset(&as->iotlb);
+   kfree(as);
+   --v->used_as;
+
+   return 0;
+}
+
 static void handle_vq_kick(struct vhost_work *work)
 {
struct vhost_virtqueue *vq = container_of(work, struct vhost_virtqueue,
@@ -525,15 +585,6 @@ static void vhost_vdpa_iotlb_unmap(struct vhost_vdpa *v,
}
 }
 
-static void vhost_vdpa_iotlb_free(struct vhost_vdpa *v)
-{
-   struct vhost_iotlb *iotlb = v->iotlb;
-
-   vhost_vdpa_iotlb_unmap(v, iotlb, 0ULL, 0ULL - 1);
-   kfree(v->iotlb);
-   v->iotlb = NULL;
-}
-
 static int perm_to_iommu_flags(u32 perm)
 {
int flags = 0;
@@ -745,7 +796,8 @@ static int vhost_vdpa_process_iotlb_msg(struct vhost_dev 
*dev, u32 asid,
struct vhost_vdpa *v = container_of(dev, struct vhost_vdpa, vdev);
struct vdpa_device *vdpa = v->vdpa;
const struct vdpa_config_ops *ops = vdpa->config;
-   struct vhost_iotlb *iotlb = v->iotlb;
+   struct vhost_vdpa_as *as = asid_to_as(v, 0);
+   struct vhost_iotlb *iotlb = &as->iotlb;
int r = 0;
 
if (asid != 0)
@@ -856,6 +908,13 @@ static void vhost_vdpa_set_iova_range(struct vhost_vdpa *v)
}
 }
 
+static void vhost_vdpa_cleanup(struct vhost_vdpa *v)
+{
+   vhost_dev_cleanup(&v->vdev);
+   kfree(v->vdev.vqs);
+   vhost_vdpa_remove_as(v, 0);
+}
+
 static int vhost_vdpa_open(struct inode *inode, struct file *filep)
 {
struct vhost_vdpa *v;
@@ -886,15 +945,12 @@ static int vhost_vdpa_open(struct inode *inode, struct 
file *filep)
vhost_dev_init(dev, vqs, nvqs, 0, 0, 0, false,
   vhost_vdpa_process_iotlb_msg);
 
-   v->iotlb = vhost_iotlb_alloc(0, 0);
-   if (!v->iotlb) {
-   r = -ENOMEM;
-   goto err_init_iotlb;
-   }
+   if (!vhost_vdpa_alloc_as(v, 0))
+   goto err_alloc_as;
 
r = vhost_vdpa_alloc_domain(v);
if (r)
-   goto err_alloc_domain;
+   goto err_alloc_as;
 
vhost_vdpa_set_iova_range(v);
 
@@ -902,11 +958,8 @@ static int vhost_vdpa_open(struct inode *inode, struct 
file *filep)
 
return 0;
 
-err_alloc_domain:
-   vhost_vdpa_iotlb_free(v);
-err_init_iotlb:
-   vhost_dev_cleanup(&v->vdev);
-   kfree(vqs);
+err_alloc_as:
+   vhost_vdpa_cleanup(v);
 err:
atomic_dec(&v->opened);
re

[PATCH 14/21] vhost-vdpa: uAPI to get virtqueue group id

2020-12-15 Thread Jason Wang
Following the support for virtqueue groups in vDPA, this patch
introduces a uAPI to get the virtqueue group ID for a specific virtqueue
in vhost-vdpa.
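
A minimal userspace sketch of the new ioctl (assuming the updated
<linux/vhost.h>; vq_index and fd are caller-provided):

	struct vhost_vring_state s = { .index = vq_index };
	unsigned int group = 0;

	/* Sketch: ask which group the virtqueue belongs to; the group id
	 * comes back in s.num.
	 */
	if (!ioctl(fd, VHOST_VDPA_GET_VRING_GROUP, &s))
		group = s.num;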

Signed-off-by: Jason Wang 
---
 drivers/vhost/vdpa.c   | 8 
 include/uapi/linux/vhost.h | 8 
 2 files changed, 16 insertions(+)

diff --git a/drivers/vhost/vdpa.c b/drivers/vhost/vdpa.c
index bff8aa214f78..e7023abda12c 100644
--- a/drivers/vhost/vdpa.c
+++ b/drivers/vhost/vdpa.c
@@ -442,6 +442,14 @@ static long vhost_vdpa_vring_ioctl(struct vhost_vdpa *v, 
unsigned int cmd,
return -EFAULT;
ops->set_vq_ready(vdpa, idx, s.num);
return 0;
+   case VHOST_VDPA_GET_VRING_GROUP:
+   s.index = idx;
+   s.num = ops->get_vq_group(vdpa, idx);
+   if (s.num >= vdpa->ngroups)
+   return -EIO;
+   else if (copy_to_user(argp, &s, sizeof s))
+   return -EFAULT;
+   return 0;
case VHOST_GET_VRING_BASE:
r = ops->get_vq_state(v->vdpa, idx, &vq_state);
if (r)
diff --git a/include/uapi/linux/vhost.h b/include/uapi/linux/vhost.h
index 8762911a3cb8..99de06476fdc 100644
--- a/include/uapi/linux/vhost.h
+++ b/include/uapi/linux/vhost.h
@@ -150,4 +150,12 @@
 
 /* Get the number of address spaces. */
 #define VHOST_VDPA_GET_AS_NUM  _IOR(VHOST_VIRTIO, 0x7A, unsigned int)
+
+/* Get the group for a virtqueue: read index, write group in num,
+ * The virtqueue index is stored in the index field of
+ * vhost_vring_state. The group for this specific virtqueue is
+ * returned via num field of vhost_vring_state.
+ */
+#define VHOST_VDPA_GET_VRING_GROUP _IOWR(VHOST_VIRTIO, 0x7B,   \
+ struct vhost_vring_state)
 #endif
-- 
2.25.1



[PATCH 16/21] vhost-vdpa: support ASID based IOTLB API

2020-12-15 Thread Jason Wang
This patch extends vhost-vdpa to support the ASID based IOTLB API. The
vhost-vdpa device will allocate multiple IOTLBs for a vDPA device that
supports multiple address spaces. The IOTLBs and the vDPA device memory
mappings are determined and maintained through the ASID.

Note that we still don't support a vDPA device with more than one
address space that depends on the platform IOMMU. This work will be done
by moving the IOMMU logic from vhost-vDPA to the vDPA device driver.

Signed-off-by: Jason Wang 
---
 drivers/vhost/vdpa.c  | 127 --
 drivers/vhost/vhost.c |   4 +-
 2 files changed, 99 insertions(+), 32 deletions(-)

diff --git a/drivers/vhost/vdpa.c b/drivers/vhost/vdpa.c
index cd7c9a401a61..c4fda48d4273 100644
--- a/drivers/vhost/vdpa.c
+++ b/drivers/vhost/vdpa.c
@@ -28,7 +28,8 @@
 enum {
VHOST_VDPA_BACKEND_FEATURES =
(1ULL << VHOST_BACKEND_F_IOTLB_MSG_V2) |
-   (1ULL << VHOST_BACKEND_F_IOTLB_BATCH),
+   (1ULL << VHOST_BACKEND_F_IOTLB_BATCH) |
+   (1ULL << VHOST_BACKEND_F_IOTLB_ASID),
 };
 
 #define VHOST_VDPA_DEV_MAX (1U << MINORBITS)
@@ -57,13 +58,20 @@ struct vhost_vdpa {
struct eventfd_ctx *config_ctx;
int in_batch;
struct vdpa_iova_range range;
-   int used_as;
+   u32 batch_asid;
 };
 
 static DEFINE_IDA(vhost_vdpa_ida);
 
 static dev_t vhost_vdpa_major;
 
+static inline u32 iotlb_to_asid(struct vhost_iotlb *iotlb)
+{
+   struct vhost_vdpa_as *as = container_of(iotlb, struct
+   vhost_vdpa_as, iotlb);
+   return as->id;
+}
+
 static struct vhost_vdpa_as *asid_to_as(struct vhost_vdpa *v, u32 asid)
 {
struct hlist_head *head = &v->as[asid % VHOST_VDPA_IOTLB_BUCKETS];
@@ -76,6 +84,16 @@ static struct vhost_vdpa_as *asid_to_as(struct vhost_vdpa 
*v, u32 asid)
return NULL;
 }
 
+static struct vhost_iotlb *asid_to_iotlb(struct vhost_vdpa *v, u32 asid)
+{
+   struct vhost_vdpa_as *as = asid_to_as(v, asid);
+
+   if (!as)
+   return NULL;
+
+   return &as->iotlb;
+}
+
 static struct vhost_vdpa_as *vhost_vdpa_alloc_as(struct vhost_vdpa *v, u32 
asid)
 {
struct hlist_head *head = &v->as[asid % VHOST_VDPA_IOTLB_BUCKETS];
@@ -84,6 +102,9 @@ static struct vhost_vdpa_as *vhost_vdpa_alloc_as(struct 
vhost_vdpa *v, u32 asid)
if (asid_to_as(v, asid))
return NULL;
 
+   if (asid >= v->vdpa->nas)
+   return NULL;
+
as = kmalloc(sizeof(*as), GFP_KERNEL);
if (!as)
return NULL;
@@ -91,18 +112,24 @@ static struct vhost_vdpa_as *vhost_vdpa_alloc_as(struct 
vhost_vdpa *v, u32 asid)
vhost_iotlb_init(&as->iotlb, 0, 0);
as->id = asid;
hlist_add_head(&as->hash_link, head);
-   ++v->used_as;
 
return as;
 }
 
-static int vhost_vdpa_remove_as(struct vhost_vdpa *v, u32 asid)
+static struct vhost_vdpa_as *vhost_vdpa_find_alloc_as(struct vhost_vdpa *v,
+ u32 asid)
 {
struct vhost_vdpa_as *as = asid_to_as(v, asid);
 
-   /* Remove default address space is not allowed */
-   if (asid == 0)
-   return -EINVAL;
+   if (as)
+   return as;
+
+   return vhost_vdpa_alloc_as(v, asid);
+}
+
+static int vhost_vdpa_remove_as(struct vhost_vdpa *v, u32 asid)
+{
+   struct vhost_vdpa_as *as = asid_to_as(v, asid);
 
if (!as)
return -EINVAL;
@@ -110,7 +137,6 @@ static int vhost_vdpa_remove_as(struct vhost_vdpa *v, u32 
asid)
hlist_del(&as->hash_link);
vhost_iotlb_reset(&as->iotlb);
kfree(as);
-   --v->used_as;
 
return 0;
 }
@@ -636,6 +662,7 @@ static int vhost_vdpa_map(struct vhost_vdpa *v, struct 
vhost_iotlb *iotlb,
struct vhost_dev *dev = &v->vdev;
struct vdpa_device *vdpa = v->vdpa;
const struct vdpa_config_ops *ops = vdpa->config;
+   u32 asid = iotlb_to_asid(iotlb);
int r = 0;
 
r = vhost_iotlb_add_range(iotlb, iova, iova + size - 1,
@@ -644,10 +671,10 @@ static int vhost_vdpa_map(struct vhost_vdpa *v, struct 
vhost_iotlb *iotlb,
return r;
 
if (ops->dma_map) {
-   r = ops->dma_map(vdpa, 0, iova, size, pa, perm);
+   r = ops->dma_map(vdpa, asid, iova, size, pa, perm);
} else if (ops->set_map) {
if (!v->in_batch)
-   r = ops->set_map(vdpa, 0, iotlb);
+   r = ops->set_map(vdpa, asid, iotlb);
} else {
r = iommu_map(v->domain, iova, pa, size,
  perm_to_iommu_flags(perm));
@@ -661,23 +688,35 @@ static int vhost_vdpa_map(struct vhost_vdpa *v, struct 
vhost_iotlb *iotlb,
return r;
 }
 
-static void vhost_vdpa_unmap(struct vhost_vdpa *v,
-   

[PATCH 15/21] vhost-vdpa: introduce uAPI to set group ASID

2020-12-15 Thread Jason Wang
Following the vDPA support for associating an ASID with a specific
virtqueue group, this patch adds a uAPI so that the association can be
set from userspace.
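
A minimal userspace sketch (assuming the updated <linux/vhost.h>; group
and asid are caller-provided values obtained from the query ioctls):

	struct vhost_vring_state s = {
		.index = group,	/* virtqueue group to re-associate */
		.num = asid,	/* target address space id */
	};

	/* Sketch: fails with ENOTSUPP when the parent driver does not
	 * implement set_group_asid, and with EINVAL for an out-of-range
	 * ASID.
	 */
	int r = ioctl(fd, VHOST_VDPA_SET_GROUP_ASID, &s);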

Signed-off-by: Jason Wang 
---
 drivers/vhost/vdpa.c   | 8 
 include/uapi/linux/vhost.h | 7 +++
 2 files changed, 15 insertions(+)

diff --git a/drivers/vhost/vdpa.c b/drivers/vhost/vdpa.c
index e7023abda12c..cd7c9a401a61 100644
--- a/drivers/vhost/vdpa.c
+++ b/drivers/vhost/vdpa.c
@@ -450,6 +450,14 @@ static long vhost_vdpa_vring_ioctl(struct vhost_vdpa *v, 
unsigned int cmd,
else if (copy_to_user(argp, &s, sizeof s))
return -EFAULT;
return 0;
+   case VHOST_VDPA_SET_GROUP_ASID:
+   if (copy_from_user(&s, argp, sizeof(s)))
+   return -EFAULT;
+   if (s.num >= vdpa->nas)
+   return -EINVAL;
+   if (!ops->set_group_asid)
+   return -ENOTSUPP;
+   return ops->set_group_asid(vdpa, idx, s.num);
case VHOST_GET_VRING_BASE:
r = ops->get_vq_state(v->vdpa, idx, &vq_state);
if (r)
diff --git a/include/uapi/linux/vhost.h b/include/uapi/linux/vhost.h
index 99de06476fdc..5e083490f1aa 100644
--- a/include/uapi/linux/vhost.h
+++ b/include/uapi/linux/vhost.h
@@ -158,4 +158,11 @@
  */
 #define VHOST_VDPA_GET_VRING_GROUP _IOWR(VHOST_VIRTIO, 0x7B,   \
  struct vhost_vring_state)
+/* Set the ASID for a virtqueue group. The group index is stored in
+ * the index field of vhost_vring_state, the ASID associated with this
+ * group is stored at num field of vhost_vring_state.
+ */
+#define VHOST_VDPA_SET_GROUP_ASID  _IOW(VHOST_VIRTIO, 0x7C, \
+struct vhost_vring_state)
+
 #endif
-- 
2.25.1



[PATCH 17/21] vdpa_sim: split vdpasim_virtqueue's iov field in out_iov and in_iov

2020-12-15 Thread Jason Wang
From: Stefano Garzarella 

vringh_getdesc_iotlb() manages 2 iovs for writable and readable
descriptors. This is very useful for the block device, where for
each request we have both types of descriptor.

Let's split the vdpasim_virtqueue's iov field in out_iov and
in_iov to use them with vringh_getdesc_iotlb().

We are using VIRTIO terminology for "out" (readable by the device)
and "in" (writable by the device) descriptors.

Acked-by: Jason Wang 
Signed-off-by: Stefano Garzarella 
---
 drivers/vdpa/vdpa_sim/vdpa_sim.c | 11 ++-
 1 file changed, 6 insertions(+), 5 deletions(-)

diff --git a/drivers/vdpa/vdpa_sim/vdpa_sim.c b/drivers/vdpa/vdpa_sim/vdpa_sim.c
index 140de452..fe4888dfb70f 100644
--- a/drivers/vdpa/vdpa_sim/vdpa_sim.c
+++ b/drivers/vdpa/vdpa_sim/vdpa_sim.c
@@ -44,7 +44,8 @@ MODULE_PARM_DESC(macaddr, "Ethernet MAC address");
 
 struct vdpasim_virtqueue {
struct vringh vring;
-   struct vringh_kiov iov;
+   struct vringh_kiov in_iov;
+   struct vringh_kiov out_iov;
unsigned short head;
bool ready;
u64 desc_addr;
@@ -178,12 +179,12 @@ static void vdpasim_work(struct work_struct *work)
 
while (true) {
total_write = 0;
-   err = vringh_getdesc_iotlb(&txq->vring, &txq->iov, NULL,
+   err = vringh_getdesc_iotlb(&txq->vring, &txq->out_iov, NULL,
   &txq->head, GFP_ATOMIC);
if (err <= 0)
break;
 
-   err = vringh_getdesc_iotlb(&rxq->vring, NULL, &rxq->iov,
+   err = vringh_getdesc_iotlb(&rxq->vring, NULL, &rxq->in_iov,
   &rxq->head, GFP_ATOMIC);
if (err <= 0) {
vringh_complete_iotlb(&txq->vring, txq->head, 0);
@@ -191,13 +192,13 @@ static void vdpasim_work(struct work_struct *work)
}
 
while (true) {
-   read = vringh_iov_pull_iotlb(&txq->vring, &txq->iov,
+   read = vringh_iov_pull_iotlb(&txq->vring, &txq->out_iov,
 vdpasim->buffer,
 PAGE_SIZE);
if (read <= 0)
break;
 
-   write = vringh_iov_push_iotlb(&rxq->vring, &rxq->iov,
+   write = vringh_iov_push_iotlb(&rxq->vring, &rxq->in_iov,
  vdpasim->buffer, read);
if (write <= 0)
break;
-- 
2.25.1



[PATCH 18/21] vdpa_sim: advertise VIRTIO_NET_F_MTU

2020-12-15 Thread Jason Wang
We've already reported the maximum MTU via the config space, so let's
advertise the feature.

Signed-off-by: Jason Wang 
---
 drivers/vdpa/vdpa_sim/vdpa_sim.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/drivers/vdpa/vdpa_sim/vdpa_sim.c b/drivers/vdpa/vdpa_sim/vdpa_sim.c
index fe4888dfb70f..8d051cf25f0a 100644
--- a/drivers/vdpa/vdpa_sim/vdpa_sim.c
+++ b/drivers/vdpa/vdpa_sim/vdpa_sim.c
@@ -66,7 +66,8 @@ struct vdpasim_virtqueue {
 static u64 vdpasim_features = (1ULL << VIRTIO_F_ANY_LAYOUT) |
  (1ULL << VIRTIO_F_VERSION_1)  |
  (1ULL << VIRTIO_F_ACCESS_PLATFORM) |
- (1ULL << VIRTIO_NET_F_MAC);
+ (1ULL << VIRTIO_NET_F_MAC) |
+ (1ULL << VIRTIO_NET_F_MTU);
 
 /* State of each vdpasim device */
 struct vdpasim {
-- 
2.25.1



[PATCH 21/21] vdpasim: control virtqueue support

2020-12-15 Thread Jason Wang
This patch introduces control virtqueue support for the vDPA
simulator. This is a requirement for supporting advanced features like
multiqueue.

A requirement for the control virtqueue is to isolate its memory accesses
from the rx/tx virtqueues. This is because, when a vDPA device is used
for a VM, the control virtqueue is not directly assigned to the VM.
Userspace (Qemu) will present a shadow control virtqueue so that it can
record the device state.

The isolation is done via the virtqueue groups and ASID support in
vDPA through vhost-vdpa. The simulator is extended to have:

1) three virtqueues: RXVQ, TXVQ and CVQ (control virtqueue)
2) two virtqueue groups: group 0 contains RXVQ and TXVQ; group 1
   contains CVQ
3) two address spaces; the simulator simply implements each address
   space by mapping it 1:1 to an IOTLB.

For the VM use cases, userspace (Qemu) may set AS 0 for group 0 and AS 1
for group 1. So we have:

1) The IOTLB for virtqueue group 0 contains the mappings of the guest, so
   RX and TX can be assigned to the guest directly.
2) The IOTLB for virtqueue group 1 contains the mappings of the CVQ, which
   are buffers allocated and managed by the VMM only. So the CVQ of
   vhost-vdpa is visible to the VMM only, and the guest cannot access the
   CVQ of vhost-vdpa.

For the other use cases, AS 0 is associated with all virtqueue groups
by default, so all virtqueues share the same mappings by default.

To demonstrate the function, VIRTIO_NET_F_CTRL_MAC_ADDR is
implemented in the simulator so that the driver can set the mac address.

Signed-off-by: Jason Wang 
---
 drivers/vdpa/vdpa_sim/vdpa_sim.c | 189 +++
 1 file changed, 166 insertions(+), 23 deletions(-)

diff --git a/drivers/vdpa/vdpa_sim/vdpa_sim.c b/drivers/vdpa/vdpa_sim/vdpa_sim.c
index fe90a783bde4..0fd06ac491cd 100644
--- a/drivers/vdpa/vdpa_sim/vdpa_sim.c
+++ b/drivers/vdpa/vdpa_sim/vdpa_sim.c
@@ -60,14 +60,18 @@ struct vdpasim_virtqueue {
 #define VDPASIM_QUEUE_MAX 256
 #define VDPASIM_DEVICE_ID 0x1
 #define VDPASIM_VENDOR_ID 0
-#define VDPASIM_VQ_NUM 0x2
+#define VDPASIM_VQ_NUM 0x3
+#define VDPASIM_AS_NUM 0x2
+#define VDPASIM_GROUP_NUM 0x2
 #define VDPASIM_NAME "vdpasim-netdev"
 
 static u64 vdpasim_features = (1ULL << VIRTIO_F_ANY_LAYOUT) |
  (1ULL << VIRTIO_F_VERSION_1)  |
  (1ULL << VIRTIO_F_ACCESS_PLATFORM) |
+ (1ULL << VIRTIO_NET_F_MTU) |
  (1ULL << VIRTIO_NET_F_MAC) |
- (1ULL << VIRTIO_NET_F_MTU);
+ (1ULL << VIRTIO_NET_F_CTRL_VQ) |
+ (1ULL << VIRTIO_NET_F_CTRL_MAC_ADDR);
 
 /* State of each vdpasim device */
 struct vdpasim {
@@ -147,11 +151,17 @@ static void vdpasim_reset(struct vdpasim *vdpasim)
 {
int i;
 
-   for (i = 0; i < VDPASIM_VQ_NUM; i++)
+   spin_lock(&vdpasim->iommu_lock);
+
+   for (i = 0; i < VDPASIM_VQ_NUM; i++) {
vdpasim_vq_reset(&vdpasim->vqs[i]);
+   vringh_set_iotlb(&vdpasim->vqs[i].vring,
+&vdpasim->iommu[0]);
+   }
 
-   spin_lock(&vdpasim->iommu_lock);
-   vhost_iotlb_reset(vdpasim->iommu);
+   for (i = 0; i < VDPASIM_AS_NUM; i++) {
+   vhost_iotlb_reset(&vdpasim->iommu[i]);
+   }
spin_unlock(&vdpasim->iommu_lock);
 
vdpasim->features = 0;
@@ -191,6 +201,81 @@ static bool receive_filter(struct vdpasim *vdpasim, size_t 
len)
return false;
 }
 
+virtio_net_ctrl_ack vdpasim_handle_ctrl_mac(struct vdpasim *vdpasim,
+   u8 cmd)
+{
+   struct vdpasim_virtqueue *cvq = &vdpasim->vqs[2];
+   virtio_net_ctrl_ack status = VIRTIO_NET_ERR;
+   size_t read;
+
+   switch (cmd) {
+   case VIRTIO_NET_CTRL_MAC_ADDR_SET:
+   read = vringh_iov_pull_iotlb(&cvq->vring, &cvq->in_iov,
+(void *)vdpasim->config.mac,
+ETH_ALEN);
+   if (read == ETH_ALEN)
+   status = VIRTIO_NET_OK;
+   break;
+   default:
+   break;
+   }
+
+   return status;
+}
+
+static void vdpasim_handle_cvq(struct vdpasim *vdpasim)
+{
+   struct vdpasim_virtqueue *cvq = &vdpasim->vqs[2];
+   virtio_net_ctrl_ack status = VIRTIO_NET_ERR;
+   struct virtio_net_ctrl_hdr ctrl;
+   size_t read, write;
+   int err;
+
+   if (!(vdpasim->features & (1ULL << VIRTIO_NET_F_CTRL_VQ)))
+   return;
+
+   if (!cvq->ready)
+   return;
+
+   while (true) {
+   err = vringh_getdesc_iotlb(&cvq->vring, &cvq->in_iov,
+  &cvq->out_iov,
+ 

[PATCH 20/21] vdpa_sim: filter destination mac address

2020-12-15 Thread Jason Wang
This patch implements a simple unicast filter for the vDPA simulator.

Signed-off-by: Jason Wang 
---
 drivers/vdpa/vdpa_sim/vdpa_sim.c | 49 
 1 file changed, 31 insertions(+), 18 deletions(-)

diff --git a/drivers/vdpa/vdpa_sim/vdpa_sim.c b/drivers/vdpa/vdpa_sim/vdpa_sim.c
index e901177c6dfe..fe90a783bde4 100644
--- a/drivers/vdpa/vdpa_sim/vdpa_sim.c
+++ b/drivers/vdpa/vdpa_sim/vdpa_sim.c
@@ -175,6 +175,22 @@ static void vdpasim_complete(struct vdpasim_virtqueue *vq, 
size_t len)
local_bh_enable();
 }
 
+static bool receive_filter(struct vdpasim *vdpasim, size_t len)
+{
+   bool modern = vdpasim->features & (1ULL << VIRTIO_F_VERSION_1);
+   size_t hdr_len = modern ? sizeof(struct virtio_net_hdr_v1) :
+ sizeof(struct virtio_net_hdr);
+
+   if (len < ETH_ALEN + hdr_len)
+   return false;
+
+   if (!strncmp(vdpasim->buffer + hdr_len,
+vdpasim->config.mac, ETH_ALEN))
+   return true;
+
+   return false;
+}
+
 static void vdpasim_work(struct work_struct *work)
 {
struct vdpasim *vdpasim = container_of(work, struct
@@ -182,7 +198,6 @@ static void vdpasim_work(struct work_struct *work)
struct vdpasim_virtqueue *txq = &vdpasim->vqs[1];
struct vdpasim_virtqueue *rxq = &vdpasim->vqs[0];
ssize_t read, write;
-   size_t total_write;
int pkts = 0;
int err;
 
@@ -195,36 +210,34 @@ static void vdpasim_work(struct work_struct *work)
goto out;
 
while (true) {
-   total_write = 0;
err = vringh_getdesc_iotlb(&txq->vring, &txq->out_iov, NULL,
   &txq->head, GFP_ATOMIC);
if (err <= 0)
break;
 
+   read = vringh_iov_pull_iotlb(&txq->vring, &txq->out_iov,
+vdpasim->buffer,
+PAGE_SIZE);
+
+   if (!receive_filter(vdpasim, read)) {
+   vdpasim_complete(txq, 0);
+   continue;
+   }
+
err = vringh_getdesc_iotlb(&rxq->vring, NULL, &rxq->in_iov,
   &rxq->head, GFP_ATOMIC);
if (err <= 0) {
-   vringh_complete_iotlb(&txq->vring, txq->head, 0);
+   vdpasim_complete(txq, 0);
break;
}
 
-   while (true) {
-   read = vringh_iov_pull_iotlb(&txq->vring, &txq->out_iov,
-vdpasim->buffer,
-PAGE_SIZE);
-   if (read <= 0)
-   break;
-
-   write = vringh_iov_push_iotlb(&rxq->vring, &rxq->in_iov,
- vdpasim->buffer, read);
-   if (write <= 0)
-   break;
-
-   total_write += write;
-   }
+   write = vringh_iov_push_iotlb(&rxq->vring, &rxq->in_iov,
+ vdpasim->buffer, read);
+   if (write <= 0)
+   break;
 
vdpasim_complete(txq, 0);
-   vdpasim_complete(rxq, total_write);
+   vdpasim_complete(rxq, write);
 
if (++pkts > 4) {
schedule_work(&vdpasim->work);
-- 
2.25.1



[PATCH 19/21] vdpa_sim: factor out buffer completion logic

2020-12-15 Thread Jason Wang
Signed-off-by: Jason Wang 
---
 drivers/vdpa/vdpa_sim/vdpa_sim.c | 33 +---
 1 file changed, 18 insertions(+), 15 deletions(-)

diff --git a/drivers/vdpa/vdpa_sim/vdpa_sim.c b/drivers/vdpa/vdpa_sim/vdpa_sim.c
index 8d051cf25f0a..e901177c6dfe 100644
--- a/drivers/vdpa/vdpa_sim/vdpa_sim.c
+++ b/drivers/vdpa/vdpa_sim/vdpa_sim.c
@@ -159,6 +159,22 @@ static void vdpasim_reset(struct vdpasim *vdpasim)
++vdpasim->generation;
 }
 
+static void vdpasim_complete(struct vdpasim_virtqueue *vq, size_t len)
+{
+   /* Make sure data is wrote before advancing index */
+   smp_wmb();
+
+   vringh_complete_iotlb(&vq->vring, vq->head, len);
+
+   /* Make sure used is visible before rasing the interrupt. */
+   smp_wmb();
+
+   local_bh_disable();
+   if (vq->cb)
+   vq->cb(vq->private);
+   local_bh_enable();
+}
+
 static void vdpasim_work(struct work_struct *work)
 {
struct vdpasim *vdpasim = container_of(work, struct
@@ -207,21 +223,8 @@ static void vdpasim_work(struct work_struct *work)
total_write += write;
}
 
-   /* Make sure data is wrote before advancing index */
-   smp_wmb();
-
-   vringh_complete_iotlb(&txq->vring, txq->head, 0);
-   vringh_complete_iotlb(&rxq->vring, rxq->head, total_write);
-
-   /* Make sure used is visible before rasing the interrupt. */
-   smp_wmb();
-
-   local_bh_disable();
-   if (txq->cb)
-   txq->cb(txq->private);
-   if (rxq->cb)
-   rxq->cb(rxq->private);
-   local_bh_enable();
+   vdpasim_complete(txq, 0);
+   vdpasim_complete(rxq, total_write);
 
if (++pkts > 4) {
schedule_work(&vdpasim->work);
-- 
2.25.1



Re: [PATCH net 2/2] vhost_net: fix high cpu load when sendmsg fails

2020-12-15 Thread Jason Wang



- Original Message -
> > -Original Message-
> > From: Jason Wang [mailto:jasow...@redhat.com]
> > Sent: Wednesday, December 16, 2020 1:57 PM
> > To: wangyunjian 
> > Cc: netdev@vger.kernel.org; m...@redhat.com; willemdebruijn kernel
> > ;
> > virtualizat...@lists.linux-foundation.org;
> > Lilijun (Jerry) ; chenchanghu
> > ; xudingke ; huangbin (J)
> > 
> > Subject: Re: [PATCH net 2/2] vhost_net: fix high cpu load when sendmsg
> > fails
> > 
> > 
> > 
> > - Original Message -
> > >
> > >
> > > > -Original Message-
> > > > From: Jason Wang [mailto:jasow...@redhat.com]
> > > > Sent: Tuesday, December 15, 2020 12:10 PM
> > > > To: wangyunjian ; netdev@vger.kernel.org;
> > > > m...@redhat.com; willemdebruijn.ker...@gmail.com
> > > > Cc: virtualizat...@lists.linux-foundation.org; Lilijun (Jerry)
> > > > ; chenchanghu ;
> > > > xudingke ; huangbin (J)
> > > > 
> > > > Subject: Re: [PATCH net 2/2] vhost_net: fix high cpu load when sendmsg
> > > > fails
> > > >
> > > >
> > > > On 2020/12/15 上午9:48, wangyunjian wrote:
> > > > > From: Yunjian Wang 
> > > > >
> > > > > Currently we break the loop and wake up the vhost_worker when
> > sendmsg
> > > > > fails. When the worker wakes up again, we'll meet the same error.
> > > > > This
> > > > > will cause high CPU load. To fix this issue, we can skip this
> > > > > description by ignoring the error. When we exceeds sndbuf, the return
> > > > > value of sendmsg is -EAGAIN. In the case we don't skip the
> > > > > description
> > > > > and don't drop packet.
> > > > >
> > > > > Signed-off-by: Yunjian Wang 
> > > > > ---
> > > > >   drivers/vhost/net.c | 21 +
> > > > >   1 file changed, 9 insertions(+), 12 deletions(-)
> > > > >
> > > > > diff --git a/drivers/vhost/net.c b/drivers/vhost/net.c index
> > > > > c8784dfafdd7..f966592d8900 100644
> > > > > --- a/drivers/vhost/net.c
> > > > > +++ b/drivers/vhost/net.c
> > > > > @@ -827,16 +827,13 @@ static void handle_tx_copy(struct vhost_net
> > *net,
> > > > struct socket *sock)
> > > > >   msg.msg_flags &= ~MSG_MORE;
> > > > >   }
> > > > >
> > > > > - /* TODO: Check specific error and bomb out unless 
> > > > > ENOBUFS?
> > */
> > > > >   err = sock->ops->sendmsg(sock, &msg, len);
> > > > > - if (unlikely(err < 0)) {
> > > > > + if (unlikely(err == -EAGAIN)) {
> > > > >   vhost_discard_vq_desc(vq, 1);
> > > > >   vhost_net_enable_vq(net, vq);
> > > > >   break;
> > > > > - }
> > > >
> > > >
> > > > As I've pointed out in the last version: if you don't discard the descriptor,
> > > > you probably need to add the head to the used ring. Otherwise this
> > > > descriptor will always be inflight, which may confuse drivers.
> > >
> > > Sorry for missing the comment.
> > >
> > > After deleting the discard-descriptor-and-break handling, the next processing
> > > will be the same as a normal successful sendmsg(), and vhost_zerocopy_signal_used()
> > > or vhost_add_used_and_signal() will be called to add the head to the used ring.
> > 
> > It's the next head not the one that contains the buggy packet?
> 
> In the modified code logic, the head added to the used ring is exactly the
> one that contains the buggy packet.

-ENOTEA :( You're right, I misread the code.

Thanks

> 
> Thanks
> 
> > 
> > Thanks
> > 
> > >
> > > Thanks
> > > >
> > > >
> > > > > - if (err != len)
> > > > > - pr_debug("Truncated TX packet: len %d != %zd\n",
> > > > > -  err, len);
> > > > > + } else if (unlikely(err < 0 || err != len))
> > > >
> > > >
> > > >

Re: [PATCH net v2 2/2] vhost_net: fix high cpu load when sendmsg fails

2020-12-16 Thread Jason Wang



On 2020/12/16 下午5:23, Michael S. Tsirkin wrote:

On Wed, Dec 16, 2020 at 04:20:37PM +0800, wangyunjian wrote:

From: Yunjian Wang

Currently we break the loop and wake up the vhost_worker when
sendmsg fails. When the worker wakes up again, we'll meet the
same error. This will cause high CPU load. To fix this issue,
we can skip this description by ignoring the error. When we
exceeds sndbuf, the return value of sendmsg is -EAGAIN. In
the case we don't skip the description and don't drop packet.

Question: with this patch, what happens if sendmsg is interrupted by a signal?



Since we use MSG_DONTWAIT, we don't need to care about signals, I think.

Thanks









Re: [PATCH 00/21] Control VQ support in vDPA

2020-12-16 Thread Jason Wang



On 2020/12/16 下午5:47, Michael S. Tsirkin wrote:

On Wed, Dec 16, 2020 at 02:47:57PM +0800, Jason Wang wrote:

Hi All:

This series tries to add the support for control virtqueue in vDPA.

Control virtqueue is used by networking device for accepting various
commands from the driver. It's a must to support multiqueue and other
configurations.

When used by vhost-vDPA bus driver for VM, the control virtqueue
should be shadowed via userspace VMM (Qemu) instead of being assigned
directly to Guest. This is because Qemu needs to know the device state
in order to start and stop device correctly (e.g for Live Migration).

This requires isolating the memory mapping for the control virtqueue
presented by vhost-vDPA to prevent the guest from accessing it directly.
To achieve this, vDPA introduces two new abstractions:

- address space: identified through an address space id (ASID) and a set
  of memory mappings maintained
- virtqueue group: the minimal set of virtqueues that must share an
  address space

How will this support the pretty common case where control vq
is programmed by the kernel through the PF, and others by the VFs?



In this case, the VF parent needs to provide a software control vq and 
decode the commands, then send them to the VF.
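As a rough illustration (not from any existing driver), the decode step in
such a software control vq could look like the sketch below; struct my_vf
and vendor_vf_set_mac() are made-up placeholders for whatever vendor-specific
PF channel is used, while the control header and MAC command come from the
virtio spec:

#include <linux/errno.h>
#include <linux/if_ether.h>
#include <linux/virtio_net.h>

struct my_vf;                                            /* placeholder */
int vendor_vf_set_mac(struct my_vf *vf, const u8 *mac);  /* placeholder */

static int handle_cvq_command(struct my_vf *vf,
                              const struct virtio_net_ctrl_hdr *hdr,
                              const void *data, size_t len)
{
        if (hdr->class == VIRTIO_NET_CTRL_MAC &&
            hdr->cmd == VIRTIO_NET_CTRL_MAC_ADDR_SET && len >= ETH_ALEN)
                return vendor_vf_set_mac(vf, data);      /* goes via the PF */

        return -EOPNOTSUPP;    /* reported as VIRTIO_NET_ERR on the cvq */
}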






I actually thought the way to support it is by exposing
something like an "inject buffers" API which sends data to a given VQ.
Maybe an ioctl, and maybe down the road uio ring can support batching
these 



So the virtqueue allows the request to be processed asynchronously (e.g. 
the driver may choose to use an interrupt for the control vq). This means 
we need to support that at the uAPI level. And if we manage to do that, 
it's just another type of virtqueue.


For virtio-vDPA, this also means extensions for queue processing, 
which is a functional duplication. Using what is proposed in this series, 
we don't need any changes to kernel virtio drivers.


What's more important, this series could be used for future features 
that require DMA isolation between virtqueues:


- report dirty pages via virtqueue
- sub function level device slicing

...

Thanks






Device needs to advertise the following attributes to vDPA:

- the number of address spaces supported in the device
- the number of virtqueue groups supported in the device
- the mappings from a specific virtqueue to its virtqueue groups

The mappings from virtqueue to virtqueue groups are fixed and defined
by the vDPA device driver. E.g.:

- For the device that has hardware ASID support, it can simply
   advertise a per virtqueue virtqueue group.
- For the device that does not have hardware ASID support, it can
   simply advertise a single virtqueue group that contains all
   virtqueues. Or if it wants a software emulated control virtqueue, it
   can advertise two virtqueue groups, one is for cvq, another is for
   the rest virtqueues.

vDPA also allows changing the association between virtqueue group and
address space. So in the case of the control virtqueue, the userspace
VMM (Qemu) may use a dedicated address space for the control virtqueue
group to isolate the memory mapping.

The vhost/vhost-vDPA is also extended for the userspace to:

- query the number of virtqueue groups and address spaces supported by
   the device
- query the virtqueue group for a specific virtqueue
- associate a virtqueue group with an address space
- send ASID based IOTLB commands

This will help userspace VMM(Qemu) to detect whether the control vq
could be supported and isolate memory mappings of control virtqueue
from the others.

To demonstrate the usage, vDPA simulator is extended to support
setting the MAC address via an emulated control virtqueue.

Please review.

Changes since RFC:

- tweak vhost uAPI documentation
- switch to use device specific IOTLB really in patch 4
- tweak the commit log
- fix that the ASID in vhost was claimed to be 32 bits but is actually
   16 bits
- fix use after free when using ASID with IOTLB batching requests
- switch to use Stefano's patch for having separated iov
- remove unused "used_as" variable
- fix the iotlb/asid checking in vhost_vdpa_unmap()

Thanks

Jason Wang (20):
   vhost: move the backend feature bits to vhost_types.h
   virtio-vdpa: don't set callback if virtio doesn't need it
   vhost-vdpa: passing iotlb to IOMMU mapping helpers
   vhost-vdpa: switch to use vhost-vdpa specific IOTLB
   vdpa: add the missing comment for nvqs in struct vdpa_device
   vdpa: introduce virtqueue groups
   vdpa: multiple address spaces support
   vdpa: introduce config operations for associating ASID to a virtqueue
 group
   vhost_iotlb: split out IOTLB initialization
   vhost: support ASID in IOTLB API
   vhost-vdpa: introduce asid based IOTLB
   vhost-vdpa: introduce uAPI to get the number of virtqueue groups
   vhost-vdpa: introduce uAPI to get the number of address spaces
   vhost-vdpa: uAPI to get virtqueue group id
   vhost-vdpa: introduce uAPI to set

Re: [PATCH 00/21] Control VQ support in vDPA

2020-12-17 Thread Jason Wang



On 2020/12/17 下午3:58, Michael S. Tsirkin wrote:

On Thu, Dec 17, 2020 at 11:30:18AM +0800, Jason Wang wrote:

On 2020/12/16 下午5:47, Michael S. Tsirkin wrote:

On Wed, Dec 16, 2020 at 02:47:57PM +0800, Jason Wang wrote:

Hi All:

This series tries to add the support for control virtqueue in vDPA.

Control virtqueue is used by networking device for accepting various
commands from the driver. It's a must to support multiqueue and other
configurations.

When used by vhost-vDPA bus driver for VM, the control virtqueue
should be shadowed via userspace VMM (Qemu) instead of being assigned
directly to Guest. This is because Qemu needs to know the device state
in order to start and stop device correctly (e.g for Live Migration).

This requires isolating the memory mapping for the control virtqueue
presented by vhost-vDPA to prevent the guest from accessing it directly.
To achieve this, vDPA introduces two new abstractions:

- address space: identified through an address space id (ASID) and a set
   of memory mappings maintained
- virtqueue group: the minimal set of virtqueues that must share an
   address space

How will this support the pretty common case where control vq
is programmed by the kernel through the PF, and others by the VFs?


In this case, the VF parent need to provide a software control vq and decode
the command then send them to VF.


But how does that tie to the address space infrastructure?



In this case, address space is not a must. But the idea is to make 
the control vq work for all types of hardware:


1) control virtqueue is implemented via VF/PF communication
2) control virtqueue is implemented by VF but not through DMA
3) control virtqueue is implemented by VF DMA, it could be either a 
hardware control virtqueue or other type of DMA


The address space is a must for 3) to work and can work for both 1) and 2).








I actually thought the way to support it is by exposing
something like an "inject buffers" API which sends data to a given VQ.
Maybe an ioctl, and maybe down the road uio ring can support batching
these 


So the virtuqueue allows the request to be processed asynchronously (e.g
driver may choose to use interrupt for control vq). This means we need to
support that in uAPI level.

I don't think we need to make it async, just a regular ioctl will do.
In fact no guest uses the asynchronous property.



It is not forbidden by the spec, so we need to support it. E.g. we 
cannot assume the driver doesn't assign an interrupt to the cvq.







And if we manage to do that, it's just another
type of virtqueue.

For virtio-vDPA, this also means the extensions for queue processing which
is a functional duplication.

I don't see why, just send it to the actual control vq :)



But in the case you've pointed out, there's no hardware control vq in fact.





Using what proposed in this series, we don't
need any changes for kernel virtio drivers.

What's more important, this series could be used for future features that
requires DMA isolation between virtqueues:

- report dirty pages via virtqueue
- sub function level device slicing


I agree these are nice to have, but I am not sure basic control vq must
be tied to that.



If the control virtqueue is implemented via DMA through VF, it looks 
like a must.


Thanks





...

Thanks





Device needs to advertise the following attributes to vDPA:

- the number of address spaces supported in the device
- the number of virtqueue groups supported in the device
- the mappings from a specific virtqueue to its virtqueue groups

The mappings from virtqueue to virtqueue groups are fixed and defined
by the vDPA device driver. E.g.:

- For the device that has hardware ASID support, it can simply
advertise a per virtqueue virtqueue group.
- For the device that does not have hardware ASID support, it can
simply advertise a single virtqueue group that contains all
virtqueues. Or if it wants a software emulated control virtqueue, it
can advertise two virtqueue groups, one is for cvq, another is for
the rest virtqueues.

vDPA also allows changing the association between virtqueue group and
address space. So in the case of the control virtqueue, the userspace
VMM (Qemu) may use a dedicated address space for the control virtqueue
group to isolate the memory mapping.

The vhost/vhost-vDPA is also extended for the userspace to:

- query the number of virtqueue groups and address spaces supported by
the device
- query the virtqueue group for a specific virtqueue
- associate a virtqueue group with an address space
- send ASID based IOTLB commands

This will help userspace VMM(Qemu) to detect whether the control vq
could be supported and isolate memory mappings of control virtqueue
from the others.

To demonstrate the usage, vDPA simulator is extended to support
setting the MAC address via an emulated control virtqueue.

Please review.

Changes since RFC:

- tweak vhost uAPI docume

Re: [PATCH 00/21] Control VQ support in vDPA

2020-12-17 Thread Jason Wang



On 2020/12/18 上午6:28, Michael S. Tsirkin wrote:

On Thu, Dec 17, 2020 at 05:02:49PM +0800, Jason Wang wrote:

On 2020/12/17 下午3:58, Michael S. Tsirkin wrote:

On Thu, Dec 17, 2020 at 11:30:18AM +0800, Jason Wang wrote:

On 2020/12/16 下午5:47, Michael S. Tsirkin wrote:

On Wed, Dec 16, 2020 at 02:47:57PM +0800, Jason Wang wrote:

Hi All:

This series tries to add the support for control virtqueue in vDPA.

Control virtqueue is used by networking device for accepting various
commands from the driver. It's a must to support multiqueue and other
configurations.

When used by vhost-vDPA bus driver for VM, the control virtqueue
should be shadowed via userspace VMM (Qemu) instead of being assigned
directly to Guest. This is because Qemu needs to know the device state
in order to start and stop device correctly (e.g for Live Migration).

This requires isolating the memory mapping for the control virtqueue
presented by vhost-vDPA to prevent the guest from accessing it directly.
To achieve this, vDPA introduces two new abstractions:

- address space: identified through an address space id (ASID) and a set
of memory mappings maintained
- virtqueue group: the minimal set of virtqueues that must share an
address space

How will this support the pretty common case where control vq
is programmed by the kernel through the PF, and others by the VFs?

In this case, the VF parent need to provide a software control vq and decode
the command then send them to VF.

But how does that tie to the address space infrastructure?

In this case, address space is not a must.

That's ok, problem is I don't see how address space is going
to work in this case at all.

There's no address space there that userspace/guest can control.



The virtqueue group is mandated by the parent, but the association between 
virtqueue group and address space is under the control of userspace (Qemu).


A simple but common case is that:

1) Device advertise two virtqueue groups: group 0 contains RX and TX, 
group 1 contains CVQ.

2) Device advertise two address spaces

Then, for vhost-vDPA used by a VM (a rough sketch follows below):

1) associate group 0 with as 0, group 1 with as 1 (via the vhost-vDPA 
VHOST_VDPA_SET_GROUP_ASID ioctl)

2) Publish the guest memory mapping via IOTLB asid 0
3) Publish the control virtqueue mapping via IOTLB asid 1

Then the DMA is totally isolated in this case.

For vhost-vDPA used by DPDK, or for virtio-vDPA:

1) associate group 0 and group 1 with as 0

since we don't need DMA isolation in this case.
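To make the VM case above concrete, here is a minimal userspace sketch of
the Qemu-side calls. The two ioctls are the ones this series proposes; the
exact struct vhost_vring_state usage (index/num) is my assumption rather
than final uAPI:

#include <linux/vhost.h>
#include <sys/ioctl.h>

static int isolate_cvq(int vhost_vdpa_fd, unsigned int cvq_index)
{
        struct vhost_vring_state s = { .index = cvq_index };

        /* 1) which virtqueue group does the control vq live in? */
        if (ioctl(vhost_vdpa_fd, VHOST_VDPA_GET_VRING_GROUP, &s))
                return -1;

        /* 2) bind that group to ASID 1; the data vq group stays on ASID 0 */
        s.index = s.num;        /* group id returned above */
        s.num = 1;              /* ASID used for the shadow control vq */
        if (ioctl(vhost_vdpa_fd, VHOST_VDPA_SET_GROUP_ASID, &s))
                return -1;

        /* 3) guest memory is then published with asid-0 IOTLB messages and
         *    the shadow cvq buffers with asid-1 messages.
         */
        return 0;
}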

In order to let it be controlled by the guest, we need to extend the virtio 
spec to support those concepts.


Thanks




Re: [PATCH] virtio_net: Fix recursive call to cpus_read_lock()

2020-12-21 Thread Jason Wang



On 2020/12/22 上午11:36, Jeff Dike wrote:

virtnet_set_channels can recursively call cpus_read_lock if CONFIG_XPS
and CONFIG_HOTPLUG are enabled.

The path is:
 virtnet_set_channels - calls get_online_cpus(), which is a trivial
wrapper around cpus_read_lock()
 netif_set_real_num_tx_queues
 netif_reset_xps_queues_gt
 netif_reset_xps_queues - calls cpus_read_lock()

This call chain and potential deadlock happens when the number of TX
queues is reduced.

This commit removes the netif_set_real_num_[tr]x_queues calls from
inside the get/put_online_cpus section, as they don't require that it
be held.

Signed-off-by: Jeff Dike 
---



Adding netdev.

The patch can go with -net and is needed for -stable.

Acked-by: Jason Wang 



  drivers/net/virtio_net.c | 12 +++-
  1 file changed, 7 insertions(+), 5 deletions(-)

diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
index 052975ea0af4..e02c7e0f1cf9 100644
--- a/drivers/net/virtio_net.c
+++ b/drivers/net/virtio_net.c
@@ -2093,14 +2093,16 @@ static int virtnet_set_channels(struct net_device *dev,
  
  	get_online_cpus();

err = _virtnet_set_queues(vi, queue_pairs);
-   if (!err) {
-   netif_set_real_num_tx_queues(dev, queue_pairs);
-   netif_set_real_num_rx_queues(dev, queue_pairs);
-
-   virtnet_set_affinity(vi);
+   if (err){
+   put_online_cpus();
+   goto err;
}
+   virtnet_set_affinity(vi);
put_online_cpus();
  
+	netif_set_real_num_tx_queues(dev, queue_pairs);

+   netif_set_real_num_rx_queues(dev, queue_pairs);
+ err:
return err;
  }
  




Re: [PATCH net v2 2/2] vhost_net: fix high cpu load when sendmsg fails

2020-12-21 Thread Jason Wang



On 2020/12/22 上午7:07, Willem de Bruijn wrote:

On Wed, Dec 16, 2020 at 3:20 AM wangyunjian  wrote:

From: Yunjian Wang

Currently we break the loop and wake up the vhost_worker when
sendmsg fails. When the worker wakes up again, we'll meet the
same error.

The patch is based on the assumption that such error cases always
return EAGAIN. Can it not also be ENOMEM, such as from tun_build_skb?


This will cause high CPU load. To fix this issue,
we can skip this description by ignoring the error. When we
exceeds sndbuf, the return value of sendmsg is -EAGAIN. In
the case we don't skip the description and don't drop packet.

the -> that

here and above: description -> descriptor

Perhaps slightly revise to more explicitly state that

1. in the case of persistent failure (i.e., bad packet), the driver
drops the packet
2. in the case of transient failure (e.g,. memory pressure) the driver
schedules the worker to try again later



If we want to go this way, we need a better time to wake up the 
worker. Otherwise it just produces more stress on the cpu, which is what 
this patch tries to avoid.


Thanks









Re: [PATCH net v2 2/2] vhost_net: fix high cpu load when sendmsg fails

2020-12-22 Thread Jason Wang



On 2020/12/22 下午10:24, Willem de Bruijn wrote:

On Mon, Dec 21, 2020 at 11:41 PM Jason Wang  wrote:


On 2020/12/22 上午7:07, Willem de Bruijn wrote:

On Wed, Dec 16, 2020 at 3:20 AM wangyunjian  wrote:

From: Yunjian Wang

Currently we break the loop and wake up the vhost_worker when
sendmsg fails. When the worker wakes up again, we'll meet the
same error.

The patch is based on the assumption that such error cases always
return EAGAIN. Can it not also be ENOMEM, such as from tun_build_skb?


This will cause high CPU load. To fix this issue,
we can skip this description by ignoring the error. When we
exceeds sndbuf, the return value of sendmsg is -EAGAIN. In
the case we don't skip the description and don't drop packet.

the -> that

here and above: description -> descriptor

Perhaps slightly revise to more explicitly state that

1. in the case of persistent failure (i.e., bad packet), the driver
drops the packet
2. in the case of transient failure (e.g,. memory pressure) the driver
schedules the worker to try again later


If we want to go with this way, we need a better time to wakeup the
worker. Otherwise it just produces more stress on the cpu that is what
this patch tries to avoid.

Perhaps I misunderstood the purpose of the patch: is it to drop
everything, regardless of transient or persistent failure, until the
ring runs out of descriptors?



My understanding is that the main motivation is to avoid high cpu 
utilization when sendmsg() fails due to a guest-side reason (e.g. a bad packet).





I can understand both a blocking and drop strategy during memory
pressure. But partial drop strategy until exceeding ring capacity
seems like a peculiar hybrid?



Yes. So I wonder if we want to do better when we are under memory 
pressure. E.g. can we let the socket wake us up instead of rescheduling the 
workers here? At least in that case we know some memory might have been freed.
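A rough sketch of what I mean: a write_space callback that requeues the tx
work only when memory is available again. How the callback finds the
virtqueue (sk_user_data here) is an assumption, not how tun/tap sets it up
today:

#include <net/sock.h>
#include "vhost.h"

static void vhost_net_write_space(struct sock *sk)
{
        struct vhost_virtqueue *vq = sk->sk_user_data;   /* assumption */

        if (vq)
                vhost_poll_queue(&vq->poll);
}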


Thanks








Re: [RFC v2 00/13] Introduce VDUSE - vDPA Device in Userspace

2020-12-22 Thread Jason Wang



On 2020/12/22 下午10:52, Xie Yongji wrote:

This series introduces a framework, which can be used to implement
vDPA Devices in a userspace program. The work consist of two parts:
control path forwarding and data path offloading.

In the control path, the VDUSE driver will make use of a message
mechanism to forward the config operations from the vdpa bus driver
to userspace. Userspace can use read()/write() to receive/reply to
those control messages.

In the data path, the core is mapping the dma buffer into the VDUSE
daemon's address space, which can be implemented in different ways
depending on the vdpa bus to which the vDPA device is attached.

In the virtio-vdpa case, we implement an MMU-based on-chip IOMMU driver with
a bounce-buffering mechanism to achieve that.



Rethinking the bounce buffer stuff: instead of using kernel pages with 
mmap(), how about just using userspace pages like vhost does?


It means we need a worker to do the bouncing, but we don't need to care 
about annoying stuff like page reclaiming.
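A very rough sketch of that idea, just to show the shape of the worker; the
bounce_domain structure and its fields are made up for illustration:

#include <linux/kthread.h>
#include <linux/uaccess.h>
#include <linux/workqueue.h>

struct bounce_domain {
        struct work_struct work;
        struct mm_struct *daemon_mm;   /* mm of the VDUSE daemon */
        void __user *uaddr;            /* daemon-registered bounce buffer */
        void *kbuf;                    /* kernel data to bounce */
        size_t len;
        int err;
};

static void bounce_work(struct work_struct *work)
{
        struct bounce_domain *dom = container_of(work, struct bounce_domain, work);

        /* temporarily adopt the daemon's mm so copy_to_user() targets it */
        kthread_use_mm(dom->daemon_mm);
        dom->err = copy_to_user(dom->uaddr, dom->kbuf, dom->len) ? -EFAULT : 0;
        kthread_unuse_mm(dom->daemon_mm);
}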




And in the vhost-vdpa case, the dma
buffer resides in a userspace memory region which can be shared with the
VDUSE userspace process by transferring the shmfd.

The details and our use case are shown below:

[ASCII architecture diagram, mangled in this archive. Roughly: a container
uses /dev/vdx through the virtio-blk driver, virtio bus, virtio-blk device
and virtio-vdpa driver; a QEMU VM uses /dev/vhost-vdpa-x through the vhost
device and vhost-vdpa driver. Both sit on the vdpa bus on top of a vdpa
device backed by the vduse driver, which forwards to the userspace VDUSE
daemon (vDPA device emulation + block driver) that reaches the remote
storages over TCP/IP via the NIC.]

We make use of it to implement a block device connecting to
our distributed storage, which can be used both in containers and
VMs. Thus, we can have an unified technology stack in this two cases.

To test it with null-blk:

   $ qemu-storage-daemon \
   --chardev socket,id=charmonitor,path=/tmp/qmp.sock,server,nowait \
   --monitor chardev=charmonitor 

Re: [RFC v2 06/13] vduse: Introduce VDUSE - vDPA Device in Userspace

2020-12-23 Thread Jason Wang



On 2020/12/22 下午10:52, Xie Yongji wrote:

This VDUSE driver enables implementing vDPA devices in userspace.
Both control path and data path of vDPA devices will be able to
be handled in userspace.

In the control path, the VDUSE driver will make use of a message
mechanism to forward the config operations from the vdpa bus driver
to userspace. Userspace can use read()/write() to receive/reply to
those control messages.

In the data path, the VDUSE driver implements an MMU-based on-chip
IOMMU driver which supports mapping the kernel dma buffer to a
userspace iova region dynamically. Userspace can access those
iova regions via mmap(). Besides, the eventfd mechanism is used to
trigger interrupt callbacks and receive virtqueue kicks in userspace.

Now we only support virtio-vdpa bus driver with this patch applied.

Signed-off-by: Xie Yongji 
---
  Documentation/driver-api/vduse.rst |   74 ++
  Documentation/userspace-api/ioctl/ioctl-number.rst |1 +
  drivers/vdpa/Kconfig   |8 +
  drivers/vdpa/Makefile  |1 +
  drivers/vdpa/vdpa_user/Makefile|5 +
  drivers/vdpa/vdpa_user/eventfd.c   |  221 
  drivers/vdpa/vdpa_user/eventfd.h   |   48 +
  drivers/vdpa/vdpa_user/iova_domain.c   |  442 
  drivers/vdpa/vdpa_user/iova_domain.h   |   93 ++
  drivers/vdpa/vdpa_user/vduse.h |   59 ++
  drivers/vdpa/vdpa_user/vduse_dev.c | 1121 
  include/uapi/linux/vdpa.h  |1 +
  include/uapi/linux/vduse.h |   99 ++
  13 files changed, 2173 insertions(+)
  create mode 100644 Documentation/driver-api/vduse.rst
  create mode 100644 drivers/vdpa/vdpa_user/Makefile
  create mode 100644 drivers/vdpa/vdpa_user/eventfd.c
  create mode 100644 drivers/vdpa/vdpa_user/eventfd.h
  create mode 100644 drivers/vdpa/vdpa_user/iova_domain.c
  create mode 100644 drivers/vdpa/vdpa_user/iova_domain.h
  create mode 100644 drivers/vdpa/vdpa_user/vduse.h
  create mode 100644 drivers/vdpa/vdpa_user/vduse_dev.c
  create mode 100644 include/uapi/linux/vduse.h

diff --git a/Documentation/driver-api/vduse.rst 
b/Documentation/driver-api/vduse.rst
new file mode 100644
index ..da9b3040f20a
--- /dev/null
+++ b/Documentation/driver-api/vduse.rst
@@ -0,0 +1,74 @@
+==
+VDUSE - "vDPA Device in Userspace"
+==
+
+vDPA (virtio data path acceleration) device is a device that uses a
+datapath which complies with the virtio specifications with vendor
+specific control path. vDPA devices can be both physically located on
+the hardware or emulated by software. VDUSE is a framework that makes it
+possible to implement software-emulated vDPA devices in userspace.
+
+How VDUSE works
+
+Each userspace vDPA device is created by the VDUSE_CREATE_DEV ioctl on
+the VDUSE character device (/dev/vduse). Then a file descriptor pointing
+to the new resources will be returned, which can be used to implement the
+userspace vDPA device's control path and data path.
+
+To implement control path, the read/write operations to the file descriptor
+will be used to receive/reply the control messages from/to VDUSE driver.
+Those control messages are based on the vdpa_config_ops which defines a
+unified interface to control different types of vDPA device.
+
+The following types of messages are provided by the VDUSE framework now:
+
+- VDUSE_SET_VQ_ADDR: Set the addresses of the different aspects of virtqueue.
+
+- VDUSE_SET_VQ_NUM: Set the size of virtqueue
+
+- VDUSE_SET_VQ_READY: Set ready status of virtqueue
+
+- VDUSE_GET_VQ_READY: Get ready status of virtqueue
+
+- VDUSE_SET_FEATURES: Set virtio features supported by the driver
+
+- VDUSE_GET_FEATURES: Get virtio features supported by the device
+
+- VDUSE_SET_STATUS: Set the device status
+
+- VDUSE_GET_STATUS: Get the device status
+
+- VDUSE_SET_CONFIG: Write to device specific configuration space
+
+- VDUSE_GET_CONFIG: Read from device specific configuration space
+
+Please see include/linux/vdpa.h for details.
+
+In the data path, VDUSE framework implements a MMU-based on-chip IOMMU
+driver which supports mapping the kernel dma buffer to a userspace iova
+region dynamically. The userspace iova region can be created by passing
+the userspace vDPA device fd to mmap(2).
+
+Besides, the eventfd mechanism is used to trigger interrupt callbacks and
+receive virtqueue kicks in userspace. The following ioctls on the userspace
+vDPA device fd are provided to support that:
+
+- VDUSE_VQ_SETUP_KICKFD: set the kickfd for virtqueue, this eventfd is used
+  by VDUSE driver to notify userspace to consume the vring.
+
+- VDUSE_VQ_SETUP_IRQFD: set the irqfd for virtqueue, this eventfd is used
+  by userspace to notify VDUSE driver to trigger interrupt callbacks.
+
+MMU-based IOMMU Driver
+

Re: [RFC v2 00/13] Introduce VDUSE - vDPA Device in Userspace

2020-12-23 Thread Jason Wang



On 2020/12/23 下午2:38, Jason Wang wrote:


V1 to V2:
- Add vhost-vdpa support



I may be missing something, but I don't see any code to support that. E.g. 
neither set_map nor dma_map/unmap is implemented in the config ops.


Thanks 



Spoke too fast :(

I saw a new config op was introduced.

Let me dive into that.

Thanks



Re: [RFC v2 08/13] vdpa: Introduce process_iotlb_msg() in vdpa_config_ops

2020-12-23 Thread Jason Wang



On 2020/12/22 下午10:52, Xie Yongji wrote:

This patch introduces a new method in the vdpa_config_ops to
support processing the raw vhost memory mapping message in the
vDPA device driver.

Signed-off-by: Xie Yongji 
---
  drivers/vhost/vdpa.c | 5 -
  include/linux/vdpa.h | 7 +++
  2 files changed, 11 insertions(+), 1 deletion(-)

diff --git a/drivers/vhost/vdpa.c b/drivers/vhost/vdpa.c
index 448be7875b6d..ccbb391e38be 100644
--- a/drivers/vhost/vdpa.c
+++ b/drivers/vhost/vdpa.c
@@ -728,6 +728,9 @@ static int vhost_vdpa_process_iotlb_msg(struct vhost_dev 
*dev,
if (r)
return r;
  
+	if (ops->process_iotlb_msg)

+   return ops->process_iotlb_msg(vdpa, msg);
+
switch (msg->type) {
case VHOST_IOTLB_UPDATE:
r = vhost_vdpa_process_iotlb_update(v, msg);
@@ -770,7 +773,7 @@ static int vhost_vdpa_alloc_domain(struct vhost_vdpa *v)
int ret;
  
  	/* Device want to do DMA by itself */

-   if (ops->set_map || ops->dma_map)
+   if (ops->set_map || ops->dma_map || ops->process_iotlb_msg)
return 0;
  
  	bus = dma_dev->bus;

diff --git a/include/linux/vdpa.h b/include/linux/vdpa.h
index 656fe264234e..7bccedf22f4b 100644
--- a/include/linux/vdpa.h
+++ b/include/linux/vdpa.h
@@ -5,6 +5,7 @@
  #include 
  #include 
  #include 
+#include 
  #include 
  #include 
  
@@ -172,6 +173,10 @@ struct vdpa_iova_range {

   *@vdev: vdpa device
   *Returns the iova range supported by
   *the device.
+ * @process_iotlb_msg: Process vhost memory mapping message (optional)
+ * Only used for VDUSE device now
+ * @vdev: vdpa device
+ * @msg: vhost memory mapping message
   * @set_map:  Set device memory mapping (optional)
   *Needed for device that using device
   *specific DMA translation (on-chip IOMMU)
@@ -240,6 +245,8 @@ struct vdpa_config_ops {
struct vdpa_iova_range (*get_iova_range)(struct vdpa_device *vdev);
  
  	/* DMA ops */

+   int (*process_iotlb_msg)(struct vdpa_device *vdev,
+struct vhost_iotlb_msg *msg);
int (*set_map)(struct vdpa_device *vdev, struct vhost_iotlb *iotlb);
int (*dma_map)(struct vdpa_device *vdev, u64 iova, u64 size,
   u64 pa, u32 perm);



Is there any reason that it can't be done via dma_map/dma_unmap or set_map?
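For comparison, a set_map()-based flow would simply replay the whole IOTLB
whenever it changes, roughly like the sketch below; vduse_map_range() and
vdpa_to_vduse() are made-up helpers, while the iteration helpers are the
existing vhost IOTLB ones:

#include <linux/limits.h>
#include <linux/vdpa.h>
#include <linux/vhost_iotlb.h>

static int vduse_set_map(struct vdpa_device *vdpa, struct vhost_iotlb *iotlb)
{
        struct vduse_dev *dev = vdpa_to_vduse(vdpa);          /* placeholder */
        struct vhost_iotlb_map *map;
        int ret;

        /* walk every mapping currently in the IOTLB and push it to userspace */
        for (map = vhost_iotlb_itree_first(iotlb, 0, ULLONG_MAX); map;
             map = vhost_iotlb_itree_next(map, 0, ULLONG_MAX)) {
                ret = vduse_map_range(dev, map->start,
                                      map->last - map->start + 1,
                                      map->addr, map->perm);  /* placeholder */
                if (ret)
                        return ret;
        }
        return 0;
}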

Thanks




Re: [RFC v2 09/13] vduse: Add support for processing vhost iotlb message

2020-12-23 Thread Jason Wang



On 2020/12/22 下午10:52, Xie Yongji wrote:

To support vhost-vdpa bus driver, we need a way to share the
vhost-vdpa backend process's memory with the userspace VDUSE process.

This patch tries to make use of the vhost iotlb message to achieve
that. We will get the shm file from the iotlb message and pass it
to the userspace VDUSE process.

Signed-off-by: Xie Yongji 
---
  Documentation/driver-api/vduse.rst |  15 +++-
  drivers/vdpa/vdpa_user/vduse_dev.c | 147 -
  include/uapi/linux/vduse.h |  11 +++
  3 files changed, 171 insertions(+), 2 deletions(-)

diff --git a/Documentation/driver-api/vduse.rst 
b/Documentation/driver-api/vduse.rst
index 623f7b040ccf..48e4b1ba353f 100644
--- a/Documentation/driver-api/vduse.rst
+++ b/Documentation/driver-api/vduse.rst
@@ -46,13 +46,26 @@ The following types of messages are provided by the VDUSE 
framework now:
  
  - VDUSE_GET_CONFIG: Read from device specific configuration space
  
+- VDUSE_UPDATE_IOTLB: Update the memory mapping in device IOTLB

+
+- VDUSE_INVALIDATE_IOTLB: Invalidate the memory mapping in device IOTLB
+
  Please see include/linux/vdpa.h for details.
  
-In the data path, VDUSE framework implements a MMU-based on-chip IOMMU

+The data path of userspace vDPA device is implemented in different ways
+depending on the vdpa bus to which it is attached.
+
+In virtio-vdpa case, VDUSE framework implements a MMU-based on-chip IOMMU
  driver which supports mapping the kernel dma buffer to a userspace iova
  region dynamically. The userspace iova region can be created by passing
  the userspace vDPA device fd to mmap(2).
  
+In the vhost-vdpa case, the dma buffer resides in a userspace memory region
+which will be shared with the VDUSE userspace process via the file
+descriptor in the VDUSE_UPDATE_IOTLB message. And the corresponding address
+descriptor in VDUSE_UPDATE_IOTLB message. And the corresponding address
+mapping (IOVA of dma buffer <-> VA of the memory region) is also included
+in this message.
+
  Besides, the eventfd mechanism is used to trigger interrupt callbacks and
  receive virtqueue kicks in userspace. The following ioctls on the userspace
  vDPA device fd are provided to support that:
diff --git a/drivers/vdpa/vdpa_user/vduse_dev.c 
b/drivers/vdpa/vdpa_user/vduse_dev.c
index b974333ed4e9..d24aaacb6008 100644
--- a/drivers/vdpa/vdpa_user/vduse_dev.c
+++ b/drivers/vdpa/vdpa_user/vduse_dev.c
@@ -34,6 +34,7 @@
  
  struct vduse_dev_msg {

struct vduse_dev_request req;
+   struct file *iotlb_file;
struct vduse_dev_response resp;
struct list_head list;
wait_queue_head_t waitq;
@@ -325,12 +326,80 @@ static int vduse_dev_set_vq_state(struct vduse_dev *dev,
return ret;
  }
  
+static int vduse_dev_update_iotlb(struct vduse_dev *dev, struct file *file,

+   u64 offset, u64 iova, u64 size, u8 perm)
+{
+   struct vduse_dev_msg *msg;
+   int ret;
+
+   if (!size)
+   return -EINVAL;
+
+   msg = vduse_dev_new_msg(dev, VDUSE_UPDATE_IOTLB);
+   msg->req.size = sizeof(struct vduse_iotlb);
+   msg->req.iotlb.offset = offset;
+   msg->req.iotlb.iova = iova;
+   msg->req.iotlb.size = size;
+   msg->req.iotlb.perm = perm;
+   msg->req.iotlb.fd = -1;
+   msg->iotlb_file = get_file(file);
+
+   ret = vduse_dev_msg_sync(dev, msg);



My feeling is that we should provide a consistent API for the userspace 
device to use.


E.g. we'd better carry the IOTLB message for both virtio/vhost drivers.

It looks to me that for virtio drivers we can still use the UPDATE_IOTLB 
message by using the VDUSE file as msg->iotlb_file here.
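Roughly what I mean, reusing the helper added in this patch; dev->file is
just a stand-in for however the VDUSE device's own file ends up being
tracked, and VHOST_ACCESS_RW comes from the existing vhost uAPI:

static int vduse_publish_bounce_map(struct vduse_dev *dev,
                                    u64 iova, u64 size, u64 offset)
{
        /* same VDUSE_UPDATE_IOTLB message as the vhost-vdpa case, but the
         * backing file is the VDUSE device itself (bounce pages via mmap)
         */
        return vduse_dev_update_iotlb(dev, dev->file, offset, iova, size,
                                      VHOST_ACCESS_RW);
}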




+   vduse_dev_msg_put(msg);
+   fput(file);
+
+   return ret;
+}
+
+static int vduse_dev_invalidate_iotlb(struct vduse_dev *dev,
+   u64 iova, u64 size)
+{
+   struct vduse_dev_msg *msg;
+   int ret;
+
+   if (!size)
+   return -EINVAL;
+
+   msg = vduse_dev_new_msg(dev, VDUSE_INVALIDATE_IOTLB);
+   msg->req.size = sizeof(struct vduse_iotlb);
+   msg->req.iotlb.iova = iova;
+   msg->req.iotlb.size = size;
+
+   ret = vduse_dev_msg_sync(dev, msg);
+   vduse_dev_msg_put(msg);
+
+   return ret;
+}
+
+static unsigned int perm_to_file_flags(u8 perm)
+{
+   unsigned int flags = 0;
+
+   switch (perm) {
+   case VHOST_ACCESS_WO:
+   flags |= O_WRONLY;
+   break;
+   case VHOST_ACCESS_RO:
+   flags |= O_RDONLY;
+   break;
+   case VHOST_ACCESS_RW:
+   flags |= O_RDWR;
+   break;
+   default:
+   WARN(1, "invalidate vhost IOTLB permission\n");
+   break;
+   }
+
+   return flags;
+}
+
  static ssize_t vduse_dev_read_iter(struct kiocb *iocb, struct iov_iter *to)
  {
struct file *file = iocb->ki_filp;
struct vduse_dev *dev = file->private_data;
struct vduse_dev_msg *msg;
-   int size = sizeof(struct vduse_dev_request);
+   

Re: [External] Re: [PATCH 0/7] Introduce vdpa management tool

2020-11-30 Thread Jason Wang



On 2020/11/30 下午3:07, Yongji Xie wrote:

Thanks for adding me, Jason!

Now I'm working on a v2 patchset for VDUSE (vDPA Device in Userspace)
[1]. This tool is very useful for the vduse device. So I'm considering
integrating this into my v2 patchset. But there is one problem:

In this tool, vdpa device config action and enable action are combined
into one netlink msg: VDPA_CMD_DEV_NEW. But in vduse case, it needs to
be splitted because a chardev should be created and opened by a
userspace process before we enable the vdpa device (call
vdpa_register_device()).

So I'd like to know whether it's possible (or have some plans) to add
two new netlink msgs something like: VDPA_CMD_DEV_ENABLE and
VDPA_CMD_DEV_DISABLE to make the config path more flexible.


Actually, we've discussed such an intermediate step in some early
discussion. It looks to me VDUSE could be one of the users of this.

Or I wonder whether we can switch to use anonymous inode(fd) for VDUSE
then fetching it via an VDUSE_GET_DEVICE_FD ioctl?


Yes, we can. Actually the current implementation in VDUSE is like
this.  But it seems like this is still an intermediate step. The fd should
be bound to a name or something else which needs to be configured
before.



The name could be specified via the netlink. It looks to me the real 
issue is that until the device is connected with a userspace process, it 
can't be used. So we also need to fail the enabling if it isn't opened.


Thanks




Thanks,
Yongji





Re: [External] Re: [PATCH 0/7] Introduce vdpa management tool

2020-12-01 Thread Jason Wang



On 2020/12/1 下午5:55, Yongji Xie wrote:

On Tue, Dec 1, 2020 at 2:25 PM Jason Wang  wrote:


On 2020/11/30 下午3:07, Yongji Xie wrote:

Thanks for adding me, Jason!

Now I'm working on a v2 patchset for VDUSE (vDPA Device in Userspace)
[1]. This tool is very useful for the vduse device. So I'm considering
integrating this into my v2 patchset. But there is one problem:

In this tool, vdpa device config action and enable action are combined
into one netlink msg: VDPA_CMD_DEV_NEW. But in vduse case, it needs to
be splitted because a chardev should be created and opened by a
userspace process before we enable the vdpa device (call
vdpa_register_device()).

So I'd like to know whether it's possible (or have some plans) to add
two new netlink msgs something like: VDPA_CMD_DEV_ENABLE and
VDPA_CMD_DEV_DISABLE to make the config path more flexible.


Actually, we've discussed such intermediate step in some early
discussion. It looks to me VDUSE could be one of the users of this.

Or I wonder whether we can switch to use anonymous inode(fd) for VDUSE
then fetching it via an VDUSE_GET_DEVICE_FD ioctl?


Yes, we can. Actually the current implementation in VDUSE is like
this.  But seems like this is still a intermediate step. The fd should
be binded to a name or something else which need to be configured
before.


The name could be specified via the netlink. It looks to me the real
issue is that until the device is connected with a userspace, it can't
be used. So we also need to fail the enabling if it doesn't opened.


Yes, that's true. So you mean we can first try to fetch the fd
bound to a name/vduse_id via VDUSE_GET_DEVICE_FD, then use the
name/vduse_id as an attribute to create the vdpa device? It looks fine to
me.



Yes, something like this. The anonymous fd will be created during 
dev_add() and the fd will be carried in the msg to userspace.
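Something along these lines; every vduse_* name below is a placeholder, the
dev_add() shape follows the management tool proposal, and only
anon_inode_getfd() is the existing kernel API:

#include <linux/anon_inodes.h>
#include <linux/err.h>
#include <linux/fcntl.h>
#include <linux/fs.h>

extern const struct file_operations vduse_dev_fops;        /* placeholder */
struct vduse_dev *vduse_dev_create(const char *name);      /* placeholder */
void vduse_dev_destroy(struct vduse_dev *dev);             /* placeholder */
int vduse_notify_new_fd(struct vduse_dev *dev, int fd);    /* placeholder */

static int vduse_dev_add(struct vdpa_mgmt_dev *mdev, const char *name)
{
        struct vduse_dev *dev = vduse_dev_create(name);
        int fd;

        if (IS_ERR(dev))
                return PTR_ERR(dev);

        /* wrap the new device in an anonymous fd for the daemon */
        fd = anon_inode_getfd("[vduse]", &vduse_dev_fops, dev, O_RDWR);
        if (fd < 0) {
                vduse_dev_destroy(dev);
                return fd;
        }

        /* carry the fd in the message the daemon reads from the chardev */
        return vduse_notify_new_fd(dev, fd);
}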


Thanks




Thanks,
Yongji





Re: [External] Re: [PATCH 0/7] Introduce vdpa management tool

2020-12-01 Thread Jason Wang



On 2020/12/2 下午12:53, Parav Pandit wrote:



From: Yongji Xie 
Sent: Wednesday, December 2, 2020 9:00 AM

On Tue, Dec 1, 2020 at 11:59 PM Parav Pandit  wrote:




From: Yongji Xie 
Sent: Tuesday, December 1, 2020 7:49 PM

On Tue, Dec 1, 2020 at 7:32 PM Parav Pandit  wrote:




From: Yongji Xie 
Sent: Tuesday, December 1, 2020 3:26 PM

On Tue, Dec 1, 2020 at 2:25 PM Jason Wang 

wrote:


On 2020/11/30 下午3:07, Yongji Xie wrote:

Thanks for adding me, Jason!

Now I'm working on a v2 patchset for VDUSE (vDPA Device in
Userspace) [1]. This tool is very useful for the vduse device.
So I'm considering integrating this into my v2 patchset.
But there is one problem:

In this tool, vdpa device config action and enable action
are combined into one netlink msg: VDPA_CMD_DEV_NEW. But
in

vduse

case, it needs to be splitted because a chardev should be
created and opened by a userspace process before we enable
the vdpa device (call vdpa_register_device()).

So I'd like to know whether it's possible (or have some
plans) to add two new netlink msgs something like:
VDPA_CMD_DEV_ENABLE

and

VDPA_CMD_DEV_DISABLE to make the config path more flexible.


Actually, we've discussed such intermediate step in some
early discussion. It looks to me VDUSE could be one of the users of

this.

Or I wonder whether we can switch to use anonymous
inode(fd) for VDUSE then fetching it via an VDUSE_GET_DEVICE_FD

ioctl?

Yes, we can. Actually the current implementation in VDUSE is
like this.  But seems like this is still a intermediate step.
The fd should be binded to a name or something else which
need to be configured before.


The name could be specified via the netlink. It looks to me
the real issue is that until the device is connected with a
userspace, it can't be used. So we also need to fail the
enabling if it doesn't

opened.

Yes, that's true. So you mean we can firstly try to fetch the fd
binded to a name/vduse_id via an VDUSE_GET_DEVICE_FD, then use
the name/vduse_id as a attribute to create vdpa device? It looks fine to

me.

I probably do not well understand. I tried reading patch [1] and
few things

do not look correct as below.

Creating the vdpa device on the bus device and destroying the
device from

the workqueue seems unnecessary and racy.

It seems vduse driver needs
This is something should be done as part of the vdpa dev add
command,

instead of connecting two sides separately and ensuring race free
access to it.

So VDUSE_DEV_START and VDUSE_DEV_STOP should possibly be avoided.


Yes, we can avoid these two ioctls with the help of the management tool.


$ vdpa dev add parentdev vduse_mgmtdev type net name foo2

When above command is executed it creates necessary vdpa device
foo2

on the bus.

When user binds foo2 device with the vduse driver, in the probe(),
it

creates respective char device to access it from user space.


I see. So vduse cannot work with any existing vdpa devices like ifc, mlx5 or

netdevsim.

It has its own implementation similar to fuse with its own backend of choice.
More below.


But vduse driver is not a vdpa bus driver. It works like vdpasim
driver, but offloads the data plane and control plane to a user space process.

In that case to draw parallel lines,

1. netdevsim:
(a) create resources in kernel sw
(b) datapath simulates in kernel

2. ifc + mlx5 vdpa dev:
(a) creates resource in hw
(b) data path is in hw

3. vduse:
(a) creates resources in userspace sw
(b) data path is in user space.
hence creates data path resources for user space.
So char device is created, removed as result of vdpa device creation.

For example,
$ vdpa dev add parentdev vduse_mgmtdev type net name foo2

Above command will create char device for user space.

Similar command for ifc/mlx5 would have created similar channel for rest of

the config commands in hw.

vduse channel = char device, eventfd etc.
ifc/mlx5 hw channel = bar, irq, command interface etc Netdev sim
channel = sw direct calls

Does it make sense?

In my understanding, to make vdpa work, we need a backend (datapath
resources) and a frontend (a vdpa device attached to a vdpa bus). In the above
example, it looks like we use the command "vdpa dev add ..."
  to create a backend, so do we need another command to create a frontend?


For block device there is certainly some backend to process the IOs.
Sometimes backend to be setup first, before its front end is exposed.
"vdpa dev add" is the front end command who connects to the backend 
(implicitly) for network device.

vhost->vdpa_block_device->backend_io_processor (usr,hw,kernel).

And it needs a way to connect to backend when explicitly specified during 
creation time.
Something like,
$ vdpa dev add parentdev vdpa_vduse type block name foo3 handle 
In above example some vendor device specific unique handle is passed based on 
backend setup in hardware/user space.

In below 3 examples, vdpa block simulator is connecting to backend block

Re: [External] Re: [PATCH 0/7] Introduce vdpa management tool

2020-12-01 Thread Jason Wang



On 2020/12/2 下午2:24, Parav Pandit wrote:



From: Jason Wang 
Sent: Wednesday, December 2, 2020 11:21 AM

On 2020/12/2 下午12:53, Parav Pandit wrote:

From: Yongji Xie 
Sent: Wednesday, December 2, 2020 9:00 AM

On Tue, Dec 1, 2020 at 11:59 PM Parav Pandit  wrote:



From: Yongji Xie 
Sent: Tuesday, December 1, 2020 7:49 PM

On Tue, Dec 1, 2020 at 7:32 PM Parav Pandit 

wrote:



From: Yongji Xie 
Sent: Tuesday, December 1, 2020 3:26 PM

On Tue, Dec 1, 2020 at 2:25 PM Jason Wang



wrote:

On 2020/11/30 下午3:07, Yongji Xie wrote:

Thanks for adding me, Jason!

Now I'm working on a v2 patchset for VDUSE (vDPA Device in
Userspace) [1]. This tool is very useful for the vduse device.
So I'm considering integrating this into my v2 patchset.
But there is one problem:

In this tool, vdpa device config action and enable action are
combined into one netlink msg: VDPA_CMD_DEV_NEW. But in

vduse

case, it needs to be splitted because a chardev should be
created and opened by a userspace process before we enable
the vdpa device (call vdpa_register_device()).

So I'd like to know whether it's possible (or have some
plans) to add two new netlink msgs something like:
VDPA_CMD_DEV_ENABLE

and

VDPA_CMD_DEV_DISABLE to make the config path more

flexible.

Actually, we've discussed such intermediate step in some early
discussion. It looks to me VDUSE could be one of the users of

this.

Or I wonder whether we can switch to use anonymous
inode(fd) for VDUSE then fetching it via an
VDUSE_GET_DEVICE_FD

ioctl?

Yes, we can. Actually the current implementation in VDUSE is
like this.  But seems like this is still a intermediate step.
The fd should be binded to a name or something else which need
to be configured before.

The name could be specified via the netlink. It looks to me the
real issue is that until the device is connected with a
userspace, it can't be used. So we also need to fail the
enabling if it doesn't

opened.

Yes, that's true. So you mean we can firstly try to fetch the fd
binded to a name/vduse_id via an VDUSE_GET_DEVICE_FD, then

use

the name/vduse_id as a attribute to create vdpa device? It looks
fine to

me.

I probably do not well understand. I tried reading patch [1] and
few things

do not look correct as below.

Creating the vdpa device on the bus device and destroying the
device from

the workqueue seems unnecessary and racy.

It seems vduse driver needs
This is something should be done as part of the vdpa dev add
command,

instead of connecting two sides separately and ensuring race free
access to it.

So VDUSE_DEV_START and VDUSE_DEV_STOP should possibly be

avoided.

Yes, we can avoid these two ioctls with the help of the management

tool.

$ vdpa dev add parentdev vduse_mgmtdev type net name foo2

When above command is executed it creates necessary vdpa device
foo2

on the bus.

When user binds foo2 device with the vduse driver, in the probe(),
it

creates respective char device to access it from user space.


I see. So vduse cannot work with any existing vdpa devices like ifc,
mlx5 or

netdevsim.

It has its own implementation similar to fuse with its own backend of

choice.

More below.


But vduse driver is not a vdpa bus driver. It works like vdpasim
driver, but offloads the data plane and control plane to a user space

process.

In that case to draw parallel lines,

1. netdevsim:
(a) create resources in kernel sw
(b) datapath simulates in kernel

2. ifc + mlx5 vdpa dev:
(a) creates resource in hw
(b) data path is in hw

3. vduse:
(a) creates resources in userspace sw
(b) data path is in user space.
hence creates data path resources for user space.
So char device is created, removed as result of vdpa device creation.

For example,
$ vdpa dev add parentdev vduse_mgmtdev type net name foo2

Above command will create char device for user space.

Similar command for ifc/mlx5 would have created similar channel for
rest of

the config commands in hw.

vduse channel = char device, eventfd etc.
ifc/mlx5 hw channel = bar, irq, command interface etc Netdev sim
channel = sw direct calls

Does it make sense?

In my understanding, to make vdpa work, we need a backend (datapath
resources) and a frontend (a vdpa device attached to a vdpa bus). In
the above example, it looks like we use the command "vdpa dev add ..."
   to create a backend, so do we need another command to create a

frontend?

For block device there is certainly some backend to process the IOs.
Sometimes backend to be setup first, before its front end is exposed.
"vdpa dev add" is the front end command who connects to the backend

(implicitly) for network device.

vhost->vdpa_block_device->backend_io_processor (usr,hw,kernel).

And it needs a way to connect to backend when explicitly specified during

creation time.

Something like,
$ vdpa dev add parentdev vdpa_vduse type block name foo3 handle



In above example some vendor device specific unique handle is passed

based o

Re: [PATCH net-next] tun: fix ubuf refcount incorrectly on error path

2020-12-03 Thread Jason Wang



On 2020/12/3 下午4:00, wangyunjian wrote:

From: Yunjian Wang 

After setting callback for ubuf_info of skb, the callback
(vhost_net_zerocopy_callback) will be called to decrease
the refcount when freeing skb. But when an exception occurs
afterwards, the error handling in vhost handle_tx() will
try to decrease the same refcount again. This is wrong and
fix this by clearing ubuf_info when meeting errors.

Fixes: 4477138fa0ae ("tun: properly test for IFF_UP")
Fixes: 90e33d459407 ("tun: enable napi_gro_frags() for TUN/TAP driver")

Signed-off-by: Yunjian Wang 
---
  drivers/net/tun.c | 11 +++
  1 file changed, 11 insertions(+)

diff --git a/drivers/net/tun.c b/drivers/net/tun.c
index 2dc1988a8973..3614bb1b6d35 100644
--- a/drivers/net/tun.c
+++ b/drivers/net/tun.c
@@ -1861,6 +1861,12 @@ static ssize_t tun_get_user(struct tun_struct *tun, 
struct tun_file *tfile,
if (unlikely(!(tun->dev->flags & IFF_UP))) {
err = -EIO;
rcu_read_unlock();
+   if (zerocopy) {
+   skb_shinfo(skb)->destructor_arg = NULL;
+   skb_shinfo(skb)->tx_flags &= ~SKBTX_DEV_ZEROCOPY;
+   skb_shinfo(skb)->tx_flags &= ~SKBTX_SHARED_FRAG;
+   }
+
goto drop;
}
  
@@ -1874,6 +1880,11 @@ static ssize_t tun_get_user(struct tun_struct *tun, struct tun_file *tfile,
  
  		if (unlikely(headlen > skb_headlen(skb))) {

atomic_long_inc(&tun->dev->rx_dropped);
+   if (zerocopy) {
+   skb_shinfo(skb)->destructor_arg = NULL;
+   skb_shinfo(skb)->tx_flags &= 
~SKBTX_DEV_ZEROCOPY;
+   skb_shinfo(skb)->tx_flags &= ~SKBTX_SHARED_FRAG;
+   }
napi_free_frags(&tfile->napi);
rcu_read_unlock();
mutex_unlock(&tfile->napi_mutex);



It looks to me that we then miss the failure feedback.

The issue comes from the inconsistent error handling in tun.

I wonder whether we can simply do uarg->callback(uarg, false) if 
necessary on every failure path in tun_get_user().
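Roughly something like this helper, called on each error path before the
ubuf has been attached to the skb (sketch only):

#include <linux/skbuff.h>

static void tun_zerocopy_abort(void *msg_control)
{
        struct ubuf_info *uarg = msg_control;

        /* false: tell vhost the buffer was not sent as zerocopy */
        if (uarg)
                uarg->callback(uarg, false);
}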


Note that zerocopy has a lot of issues which make it not good for 
production environments.


Thanks



Re: [PATCH net-next] tun: fix ubuf refcount incorrectly on error path

2020-12-06 Thread Jason Wang



On 2020/12/4 下午6:22, wangyunjian wrote:

-Original Message-
From: Jason Wang [mailto:jasow...@redhat.com]
Sent: Friday, December 4, 2020 2:11 PM
To: wangyunjian ; m...@redhat.com
Cc: virtualizat...@lists.linux-foundation.org; netdev@vger.kernel.org; Lilijun
(Jerry) ; xudingke 
Subject: Re: [PATCH net-next] tun: fix ubuf refcount incorrectly on error path


On 2020/12/3 下午4:00, wangyunjian wrote:

From: Yunjian Wang 

After setting callback for ubuf_info of skb, the callback
(vhost_net_zerocopy_callback) will be called to decrease the refcount
when freeing skb. But when an exception occurs afterwards, the error
handling in vhost handle_tx() will try to decrease the same refcount
again. This is wrong and fix this by clearing ubuf_info when meeting
errors.

Fixes: 4477138fa0ae ("tun: properly test for IFF_UP")
Fixes: 90e33d459407 ("tun: enable napi_gro_frags() for TUN/TAP
driver")

Signed-off-by: Yunjian Wang 
---
   drivers/net/tun.c | 11 +++
   1 file changed, 11 insertions(+)

diff --git a/drivers/net/tun.c b/drivers/net/tun.c index
2dc1988a8973..3614bb1b6d35 100644
--- a/drivers/net/tun.c
+++ b/drivers/net/tun.c
@@ -1861,6 +1861,12 @@ static ssize_t tun_get_user(struct tun_struct

*tun, struct tun_file *tfile,

if (unlikely(!(tun->dev->flags & IFF_UP))) {
err = -EIO;
rcu_read_unlock();
+   if (zerocopy) {
+   skb_shinfo(skb)->destructor_arg = NULL;
+   skb_shinfo(skb)->tx_flags &= ~SKBTX_DEV_ZEROCOPY;
+   skb_shinfo(skb)->tx_flags &= ~SKBTX_SHARED_FRAG;
+   }
+
goto drop;
}

@@ -1874,6 +1880,11 @@ static ssize_t tun_get_user(struct tun_struct
*tun, struct tun_file *tfile,

if (unlikely(headlen > skb_headlen(skb))) {
atomic_long_inc(&tun->dev->rx_dropped);
+   if (zerocopy) {
+   skb_shinfo(skb)->destructor_arg = NULL;
+   skb_shinfo(skb)->tx_flags &= 
~SKBTX_DEV_ZEROCOPY;
+   skb_shinfo(skb)->tx_flags &= ~SKBTX_SHARED_FRAG;
+   }
napi_free_frags(&tfile->napi);
rcu_read_unlock();
mutex_unlock(&tfile->napi_mutex);


It looks to me then we miss the failure feedback.

The issues comes from the inconsistent error handling in tun.

I wonder whether we can simply do uarg->callback(uarg, false) if necessary on
every failture path on tun_get_user().

How about this?

---
  drivers/net/tun.c | 29 ++---
  1 file changed, 18 insertions(+), 11 deletions(-)

diff --git a/drivers/net/tun.c b/drivers/net/tun.c
index 2dc1988a8973..36a8d8eacd7b 100644
--- a/drivers/net/tun.c
+++ b/drivers/net/tun.c
@@ -1637,6 +1637,19 @@ static struct sk_buff *tun_build_skb(struct tun_struct 
*tun,
return NULL;
  }
  
+/* copy ubuf_info for callback when skb has no error */

+static inline void tun_copy_ubuf_info(struct sk_buff *skb, bool zerocopy, void *msg_control)
+{
+   if (zerocopy) {
+   skb_shinfo(skb)->destructor_arg = msg_control;
+   skb_shinfo(skb)->tx_flags |= SKBTX_DEV_ZEROCOPY;
+   skb_shinfo(skb)->tx_flags |= SKBTX_SHARED_FRAG;
+   } else if (msg_control) {
+   struct ubuf_info *uarg = msg_control;
+   uarg->callback(uarg, false);
+   }
+}
+
  /* Get packet from user space buffer */
  static ssize_t tun_get_user(struct tun_struct *tun, struct tun_file *tfile,
void *msg_control, struct iov_iter *from,
@@ -1812,16 +1825,6 @@ static ssize_t tun_get_user(struct tun_struct *tun, 
struct tun_file *tfile,
break;
}
  
-	/* copy skb_ubuf_info for callback when skb has no error */

-   if (zerocopy) {
-   skb_shinfo(skb)->destructor_arg = msg_control;
-   skb_shinfo(skb)->tx_flags |= SKBTX_DEV_ZEROCOPY;
-   skb_shinfo(skb)->tx_flags |= SKBTX_SHARED_FRAG;
-   } else if (msg_control) {
-   struct ubuf_info *uarg = msg_control;
-   uarg->callback(uarg, false);
-   }
-
skb_reset_network_header(skb);
skb_probe_transport_header(skb);
skb_record_rx_queue(skb, tfile->queue_index);
@@ -1830,6 +1833,7 @@ static ssize_t tun_get_user(struct tun_struct *tun, 
struct tun_file *tfile,
struct bpf_prog *xdp_prog;
int ret;
  
+		tun_copy_ubuf_info(skb, zerocopy, msg_control);



If you want to disable zerocopy for XDP (which I think makes sense), 
it's better to do that in another patch.




local_bh_disable();
rcu_read_lock();
xdp_prog = rcu_dereference(tun->xdp_prog);
@@ 

Re: [PATCH] vhost scsi: fix error return code in vhost_scsi_set_endpoint()

2020-12-06 Thread Jason Wang



On 2020/12/4 下午4:43, Zhang Changzhong wrote:

Fix to return a negative error code from the error handling
case instead of 0, as done elsewhere in this function.

Fixes: 25b98b64e284 ("vhost scsi: alloc cmds per vq instead of session")
Reported-by: Hulk Robot 
Signed-off-by: Zhang Changzhong 
---
  drivers/vhost/scsi.c | 3 ++-
  1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/drivers/vhost/scsi.c b/drivers/vhost/scsi.c
index 6ff8a5096..4ce9f00 100644
--- a/drivers/vhost/scsi.c
+++ b/drivers/vhost/scsi.c
@@ -1643,7 +1643,8 @@ vhost_scsi_set_endpoint(struct vhost_scsi *vs,
if (!vhost_vq_is_setup(vq))
continue;
  
-			if (vhost_scsi_setup_vq_cmds(vq, vq->num))

+   ret = vhost_scsi_setup_vq_cmds(vq, vq->num);
+   if (ret)
goto destroy_vq_cmds;
}
  



Acked-by: Jason Wang 




Re: [PATCH net-next] tun: fix ubuf refcount incorrectly on error path

2020-12-07 Thread Jason Wang



On 2020/12/7 下午9:38, wangyunjian wrote:

I think it's easy for newly added code to miss this problem, so I want to
copy ubuf_info only once we're sure there are no errors.

Thanks,
Yunjian



But isn't this actually disabling zerocopy?

Thanks




Re: [PATCH net-next] tun: fix ubuf refcount incorrectly on error path

2020-12-09 Thread Jason Wang



On 2020/12/8 上午10:32, Jason Wang wrote:


On 2020/12/7 下午9:38, wangyunjian wrote:

I think it's easy for newly added code to miss this problem, so I want to
copy ubuf_info only once we're sure there are no errors.

Thanks,
Yunjian



But isn't this actually disabling zerocopy?

Thanks




Sorry, I misread the patch.

Please send a formal version, and let's move the discussion there.

Thanks



Re: [PATCH net] vhost_vdpa: Return -EFUALT if copy_from_user() fails

2020-11-18 Thread Jason Wang



On 2020/11/18 下午4:59, Michael S. Tsirkin wrote:

On Wed, Nov 18, 2020 at 02:08:17PM +0800, Jason Wang wrote:

On 2020/10/26 上午10:59, Jason Wang wrote:

On 2020/10/23 下午11:34, Michael S. Tsirkin wrote:

On Fri, Oct 23, 2020 at 03:08:53PM +0300, Dan Carpenter wrote:

The copy_to/from_user() functions return the number of bytes which we
weren't able to copy but the ioctl should return -EFAULT if they fail.

Fixes: a127c5bbb6a8 ("vhost-vdpa: fix backend feature ioctls")
Signed-off-by: Dan Carpenter 

Acked-by: Michael S. Tsirkin 
Needed for stable I guess.


Agree.

Acked-by: Jason Wang 


Hi Michael.

I don't see this in your tree, please consider to merge.

Thanks


I do see it there:

commit 7922460e33c81f41e0d2421417228b32e6fdbe94
Author: Dan Carpenter 
Date:   Fri Oct 23 15:08:53 2020 +0300

 vhost_vdpa: Return -EFAULT if copy_from_user() fails
 
the reason you can't find it is probably because I fixed up
a typo in the subject.



I see that.

Thanks









Re: netconsole deadlock with virtnet

2020-11-23 Thread Jason Wang



On 2020/11/24 上午3:21, Jakub Kicinski wrote:

On Mon, 23 Nov 2020 14:09:34 -0500 Steven Rostedt wrote:

On Mon, 23 Nov 2020 10:52:52 -0800
Jakub Kicinski  wrote:


On Mon, 23 Nov 2020 09:31:28 -0500 Steven Rostedt wrote:

On Mon, 23 Nov 2020 13:08:55 +0200
Leon Romanovsky  wrote:

 

  [   10.028024] Chain exists of:
  [   10.028025]   console_owner --> target_list_lock --> _xmit_ETHER#2

Note, the problem is that we have a location that grabs the xmit_lock while
holding target_list_lock (and possibly console_owner).

Well, it try_locks the xmit_lock. Does lockdep understand try-locks?

(not that I condone the shenanigans that are going on here)

Does it?

virtnet_poll_tx() {
__netif_tx_lock() {
spin_lock(&txq->_xmit_lock);

Umpf. Right. I was looking at virtnet_poll_cleantx()


That looks like we can have:


CPU0                            CPU1
----                            ----
                                lock(xmit_lock)

lock(console)
lock(target_list_lock)
__netif_tx_lock()
  lock(xmit_lock);

  [BLOCKED]

                                lock(console)

                                [BLOCKED]


          DEADLOCK.


So where is the trylock here?

Perhaps you need the trylock in virtnet_poll_tx()?

That could work. Best if we used normal lock if !!budget, and trylock
when budget is 0. But maybe that's too hairy.



If we use trylock, we may lose (or delay) tx notifications, which could have 
side effects on the stack.





I'm assuming all this trickiness comes from virtqueue_get_buf() needing
locking vs the TX path? It's pretty unusual for the completion path to
need locking vs xmit path.



Two reasons for doing this:

1) For historical reasons, we try to free transmitted tx packets in the xmit 
path (see free_old_xmit_skbs() in start_xmit()); we can probably remove this 
if we drop the non-tx-interrupt mode.
2) The virtio core requires virtqueue_get_buf() to be synchronized with 
virtqueue_add(); we can probably solve this, but it requires some non-trivial 
refactoring in the virtio core (sketched below).
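
As a rough illustration of point 2), here is a minimal sketch (not the in-tree
code; the structure names and cleanup are simplified) of why the tx completion
path ends up taking the netdev xmit lock:

#include <linux/netdevice.h>
#include <linux/skbuff.h>
#include <linux/virtio.h>

/* virtqueue_get_buf() here and virtqueue_add_outbuf() in start_xmit()
 * touch the same virtqueue, and the virtio core requires those calls to
 * be serialized, so the completion side reuses the xmit lock. */
static void reap_tx_completions(struct net_device *dev,
				struct virtqueue *vq, int qidx)
{
	struct netdev_queue *txq = netdev_get_tx_queue(dev, qidx);
	unsigned int len;
	void *buf;

	__netif_tx_lock(txq, raw_smp_processor_id());
	while ((buf = virtqueue_get_buf(vq, &len)) != NULL)
		dev_consume_skb_any((struct sk_buff *)buf);
	__netif_tx_unlock(txq);
}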


Btw, after a quick search, there are several other drivers that use the tx 
lock in their tx NAPI.


Thanks







Re: [PATCH net-next 00/13] Add mlx5 subfunction support

2020-11-23 Thread Jason Wang



On 2020/11/21 上午3:04, Samudrala, Sridhar wrote:



On 11/20/2020 9:58 AM, Alexander Duyck wrote:

On Thu, Nov 19, 2020 at 5:29 PM Jakub Kicinski  wrote:


On Wed, 18 Nov 2020 21:35:29 -0700 David Ahern wrote:

On 11/18/20 7:14 PM, Jakub Kicinski wrote:

On Tue, 17 Nov 2020 14:49:54 -0400 Jason Gunthorpe wrote:

On Tue, Nov 17, 2020 at 09:11:20AM -0800, Jakub Kicinski wrote:

Just to refresh all our memory, we discussed and settled on the flow
in [2]; RFC [1] followed this discussion.

vdpa tool of [3] can add one or more vdpa device(s) on top of already
spawned PF, VF, SF device.


Nack for the networking part of that. It'd basically be VMDq.


What are you NAK'ing?


Spawning multiple netdevs from one device by slicing up its queues.


Why do you object to that? Slicing up h/w resources for virtual whatever
has been common practice for a long time.


My memory of the VMDq debate is hazy, let me rope in Alex into this.
I believe the argument was that we should offload software constructs,
not create HW-specific APIs which depend on HW availability and
implementation. So the path we took was offloading macvlan.


I think it somewhat depends on the type of interface we are talking
about. What we were wanting to avoid was drivers spawning their own
unique VMDq netdevs and each having a different way of doing it. The
approach Intel went with was to use a MACVLAN offload to approach it.
Although I would imagine many would argue the approach is somewhat
dated and limiting since you cannot do many offloads on a MACVLAN
interface.


Yes. We talked about this at netdev 0x14 and the limitations of 
macvlan based offloads.
https://netdevconf.info/0x14/session.html?talk-hardware-acceleration-of-container-networking-interfaces 



Subfunction seems to be a good model to expose VMDq VSI or SIOV ADI as 
a netdev for kernel containers. AF_XDP ZC in a container is one of the 
usecase this would address. Today we have to pass the entire PF/VF to 
a container to do AF_XDP.


Looks like the current model is to create a subfunction of a specific 
type on auxiliary bus, do some configuration to assign resources and 
then activate the subfunction.




With the VDPA case I believe there is a set of predefined virtio
devices that are being emulated and presented so it isn't as if they
are creating a totally new interface for this.



vDPA doesn't have any limitation on how the device is created or 
implemented. It could be predefined or created dynamically. vDPA leaves 
all of that to the parent device with the help of a unified management 
API [1]. E.g. it could be a PCI device (PF or VF), a sub-function or a 
software emulated device.





What I would be interested in seeing is if there are any other vendors
that have reviewed this and sign off on this approach.



For "this approach" do you mean vDPA subfunction? My understanding is 
that it's totally vendor specific; the vDPA subsystem doesn't want to be 
limited to a specific type of device.




What we don't
want to see is Nivida/Mellanox do this one way, then Broadcom or Intel
come along later and have yet another way of doing this. We need an
interface and feature set that will work for everyone in terms of how
this will look going forward.


For the feature set, it would be hard to force vendors to implement a common 
set of features (we can have a recommended set) considering features can be 
negotiated. So the management interface is expected to implement something 
like CPU clusters in order to ensure migration compatibility, or qemu can 
assist with the missing features at the cost of some performance loss.


Thanks




Re: [PATCH net-next 00/13] Add mlx5 subfunction support

2020-11-23 Thread Jason Wang



On 2020/11/24 下午3:01, Jason Wang wrote:


On 2020/11/21 上午3:04, Samudrala, Sridhar wrote:



On 11/20/2020 9:58 AM, Alexander Duyck wrote:

On Thu, Nov 19, 2020 at 5:29 PM Jakub Kicinski  wrote:


On Wed, 18 Nov 2020 21:35:29 -0700 David Ahern wrote:

On 11/18/20 7:14 PM, Jakub Kicinski wrote:

On Tue, 17 Nov 2020 14:49:54 -0400 Jason Gunthorpe wrote:

On Tue, Nov 17, 2020 at 09:11:20AM -0800, Jakub Kicinski wrote:

Just to refresh all our memory, we discussed and settled on the flow
in [2]; RFC [1] followed this discussion.

vdpa tool of [3] can add one or more vdpa device(s) on top of already
spawned PF, VF, SF device.


Nack for the networking part of that. It'd basically be VMDq.


What are you NAK'ing?


Spawning multiple netdevs from one device by slicing up its queues.


Why do you object to that? Slicing up h/w resources for virtual whatever
has been common practice for a long time.


My memory of the VMDq debate is hazy, let me rope in Alex into this.
I believe the argument was that we should offload software constructs,
not create HW-specific APIs which depend on HW availability and
implementation. So the path we took was offloading macvlan.


I think it somewhat depends on the type of interface we are talking
about. What we were wanting to avoid was drivers spawning their own
unique VMDq netdevs and each having a different way of doing it. The
approach Intel went with was to use a MACVLAN offload to approach it.
Although I would imagine many would argue the approach is somewhat
dated and limiting since you cannot do many offloads on a MACVLAN
interface.


Yes. We talked about this at netdev 0x14 and the limitations of 
macvlan based offloads.
https://netdevconf.info/0x14/session.html?talk-hardware-acceleration-of-container-networking-interfaces 



Subfunction seems to be a good model to expose VMDq VSI or SIOV ADI 
as a netdev for kernel containers. AF_XDP ZC in a container is one of 
the usecase this would address. Today we have to pass the entire 
PF/VF to a container to do AF_XDP.


Looks like the current model is to create a subfunction of a specific 
type on auxiliary bus, do some configuration to assign resources and 
then activate the subfunction.




With the VDPA case I believe there is a set of predefined virtio
devices that are being emulated and presented so it isn't as if they
are creating a totally new interface for this.



vDPA doesn't have any limitation on how the device is created or 
implemented. It could be predefined or created dynamically. vDPA 
leaves all of that to the parent device with the help of a unified 
management API [1]. E.g. it could be a PCI device (PF or VF), 
a sub-function or a software emulated device.



Missed the link: https://www.spinics.net/lists/netdev/msg699374.html.

Thanks







What I would be interested in seeing is if there are any other vendors
that have reviewed this and sign off on this approach.



For "this approach" do you mean vDPA subfunction? My understanding is 
that it's totally vendor specific; the vDPA subsystem doesn't want to be 
limited to a specific type of device.




What we don't
want to see is Nivida/Mellanox do this one way, then Broadcom or Intel
come along later and have yet another way of doing this. We need an
interface and feature set that will work for everyone in terms of how
this will look going forward.


For the feature set, it would be hard to force vendors to implement a common 
set of features (we can have a recommended set) considering features can be 
negotiated. So the management interface is expected to implement something 
like CPU clusters in order to ensure migration compatibility, or qemu can 
assist with the missing features at the cost of some performance loss.


Thanks






Re: netconsole deadlock with virtnet

2020-11-24 Thread Jason Wang



On 2020/11/24 下午4:01, Leon Romanovsky wrote:

On Tue, Nov 24, 2020 at 11:22:03AM +0800, Jason Wang wrote:

On 2020/11/24 上午3:21, Jakub Kicinski wrote:

On Mon, 23 Nov 2020 14:09:34 -0500 Steven Rostedt wrote:

On Mon, 23 Nov 2020 10:52:52 -0800
Jakub Kicinski  wrote:


On Mon, 23 Nov 2020 09:31:28 -0500 Steven Rostedt wrote:

On Mon, 23 Nov 2020 13:08:55 +0200
Leon Romanovsky  wrote:


   [   10.028024] Chain exists of:
   [   10.028025]   console_owner --> target_list_lock --> _xmit_ETHER#2

Note, the problem is that we have a location that grabs the xmit_lock while
holding target_list_lock (and possibly console_owner).

Well, it try_locks the xmit_lock. Does lockdep understand try-locks?

(not that I condone the shenanigans that are going on here)

Does it?

virtnet_poll_tx() {
__netif_tx_lock() {
spin_lock(&txq->_xmit_lock);

Umpf. Right. I was looking at virtnet_poll_cleantx()


That looks like we can have:


CPU0                            CPU1
----                            ----
                                lock(xmit_lock)

lock(console)
lock(target_list_lock)
__netif_tx_lock()
  lock(xmit_lock);

  [BLOCKED]

                                lock(console)

                                [BLOCKED]


          DEADLOCK.


So where is the trylock here?

Perhaps you need the trylock in virtnet_poll_tx()?

That could work. Best if we used normal lock if !!budget, and trylock
when budget is 0. But maybe that's too hairy.


If we use trylock, we may lose (or delay) tx notifications, which could have
side effects on the stack.



I'm assuming all this trickiness comes from virtqueue_get_buf() needing
locking vs the TX path? It's pretty unusual for the completion path to
need locking vs xmit path.


Two reasons for doing this:

1) For historical reasons, we try to free transmitted tx packets in the xmit
path (see free_old_xmit_skbs() in start_xmit()); we can probably remove this if
we drop the non-tx-interrupt mode.
2) The virtio core requires virtqueue_get_buf() to be synchronized with
virtqueue_add(); we can probably solve this, but it requires some non-trivial
refactoring in the virtio core.

So how will we solve our lockdep issues?

Thanks



It's not clear to me whether it's a virtio-net specific issue. E.g. the above 
deadlock looks like a generic issue, so working around it in virtio-net may 
not help other drivers.


Thanks





Btw, after a quick search, there are several other drivers that use the tx lock
in their tx NAPI.

Thanks





Re: netconsole deadlock with virtnet

2020-11-24 Thread Jason Wang



On 2020/11/24 下午10:31, Steven Rostedt wrote:

On Tue, 24 Nov 2020 11:22:03 +0800
Jason Wang  wrote:


Btw, after a quick search, there are several other drivers that use the tx
lock in their tx NAPI.

tx NAPI is not the issue. The issue is that write_msg() (in netconsole.c)
calls this polling logic with the target_list_lock held.



But in the tx NAPI poll it takes the TX lock instead of using trylock.




Are those other drivers called by netconsole? If not, then this is unique
to virtio-net.



I think the answer is yes, since netconsole is not disabled in the code.

Thanks




-- Steve





Re: netconsole deadlock with virtnet

2020-11-24 Thread Jason Wang



On 2020/11/25 上午12:20, Jakub Kicinski wrote:

On Tue, 24 Nov 2020 11:22:03 +0800 Jason Wang wrote:

Perhaps you need the trylock in virtnet_poll_tx()?

That could work. Best if we used normal lock if !!budget, and trylock
when budget is 0. But maybe that's too hairy.

If we use trylock, we may lose (or delay) tx notifications, which could
have side effects on the stack.

That's why I said only trylock with budget == 0. Only netpoll calls with
budget == 0, AFAIK.



Oh right.

So I think maybe we can switch to using trylock when budget is zero and try to 
schedule another TX NAPI run if the trylock fails.
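
A minimal sketch of that idea (the queue structure and lookup below are
assumptions for illustration, not the final in-tree fix; only netpoll calls
the poll routine with budget == 0, possibly while already holding the xmit
lock on this CPU):

#include <linux/netdevice.h>
#include <linux/skbuff.h>
#include <linux/virtio.h>

struct send_queue_sketch {		/* hypothetical, for illustration */
	struct napi_struct napi;
	struct virtqueue *vq;
	struct net_device *dev;
	int qindex;
};

static int virtnet_poll_tx_sketch(struct napi_struct *napi, int budget)
{
	struct send_queue_sketch *sq =
		container_of(napi, struct send_queue_sketch, napi);
	struct netdev_queue *txq = netdev_get_tx_queue(sq->dev, sq->qindex);
	unsigned int len;
	void *buf;

	if (budget) {
		__netif_tx_lock(txq, raw_smp_processor_id());
	} else if (!__netif_tx_trylock(txq)) {
		/* netpoll path: don't deadlock on a lock we may already
		 * hold; back off and ask to be polled again instead. */
		napi_schedule(napi);
		return 0;
	}

	while ((buf = virtqueue_get_buf(sq->vq, &len)) != NULL)
		dev_consume_skb_any((struct sk_buff *)buf);

	__netif_tx_unlock(txq);
	napi_complete_done(napi, 0);
	return 0;
}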






I'm assuming all this trickiness comes from virtqueue_get_buf() needing
locking vs the TX path? It's pretty unusual for the completion path to
need locking vs xmit path.

Two reasons for doing this:

1) For historical reasons, we try to free transmitted tx packets in the xmit
path (see free_old_xmit_skbs() in start_xmit()); we can probably remove this
if we drop the non-tx-interrupt mode.
2) The virtio core requires virtqueue_get_buf() to be synchronized with
virtqueue_add(); we can probably solve this, but it requires some non-trivial
refactoring in the virtio core.

Btw, after a quick search, there are several other drivers that use the tx
lock in their tx NAPI.

Unless they do:

netdev->priv_flags |= IFF_DISABLE_NETPOLL;

they are all broken.



Yes.

Thanks



Re: [PATCH 0/7] Introduce vdpa management tool

2020-11-26 Thread Jason Wang



On 2020/11/12 下午2:39, Parav Pandit wrote:

This patchset covers user requirements for managing existing vdpa devices,
using a tool and its internal design notes for kernel drivers.

Background and user requirements:
--
(1) Currently a VDPA device is created by the driver when the driver is loaded.
However, the user should have a choice of whether or not to create a vdpa device
for the underlying parent device.

For example, an mlx5 PCI VF or subfunction device supports multiple classes of
device such as netdev, vdpa, rdma. However it is not required to always create a
vdpa device for such a device.

(2) In another use case, a device may support creating one or multiple vdpa
devices of the same or different class such as net and block.
Creating vdpa devices at driver load time further limits this use case.

(3) A user should be able to monitor and query vdpa queue level or device level
statistics for a given vdpa device.

(4) A user should be able to query what class of vdpa devices are supported
by its parent device.

(5) A user should be able to view supported features and negotiated features
of the vdpa device.

(6) A user should be able to create a vdpa device in vendor agnostic manner
using single tool.

Hence, it is required to have a tool through which user can create one or more
vdpa devices from a parent device which addresses above user requirements.

Example devices:

  +-----------+ +-----------+ +---------+ +--------+ +-----------+
  |vdpa dev 0 | |vdpa dev 1 | |rdma dev | |netdev  | |vdpa dev 3 |
  |type=net   | |type=block | |mlx5_0   | |ens3f0  | |type=net   |
  +----+------+ +-----+-----+ +----+----+ +---+----+ +----+------+
       |               |           |          |           |
       |               |           |          |           |
  +----+-----+         |      +----+----+     |      +----+----+
  |  mlx5    +---------+      |mlx5     +-----+      |mlx5     |
  |pci vf 2  |                |pci vf 4 |            |pci sf 8 |
  |03:00:2   |                |03:00.4  |            |mlx5_sf.8|
  +----+-----+                +----+----+            +----+----+
       |                           |                      |
       |                      +----+-----+                |
       +----------------------+mlx5      +----------------+
                              |pci pf 0  |
                              |03:00.0   |
                              +----------+

vdpa tool:
--
vdpa tool is a tool to create and delete vdpa devices from a parent device. It is a
tool that enables the user to query statistics, features and maybe more attributes
in the future.

vdpa tool command draft:

(a) List parent devices which supports creating vdpa devices.
It also shows which class types supported by this parent device.
In below command example two parent devices support vdpa device creation.
First is PCI VF whose bdf is 03.00:2.
Second is PCI VF whose name is 03:00.4.
Third is PCI SF whose name is mlx5_core.sf.8

$ vdpa parentdev list
vdpasim
   supported_classes
 net
pci/:03.00:3
   supported_classes
 net block
pci/:03.00:4
   supported_classes
 net block
auxiliary/mlx5_core.sf.8
   supported_classes
 net

(b) Now add a vdpa device of networking class and show the device.
$ vdpa dev add parentdev pci/:03.00:2 type net name foo0 $ vdpa dev show 
foo0
foo0: parentdev pci/:03.00:2 type network parentdev vdpasim vendor_id 0 
max_vqs 2 max_vq_size 256

(c) Show features of a vdpa device
$ vdpa dev features show foo0
supported
   iommu platform
   version 1

(d) Dump vdpa device statistics
$ vdpa dev stats show foo0
kickdoorbells 10
wqes 100

(e) Now delete a vdpa device previously created.
$ vdpa dev del foo0

vdpa tool support in this patchset:
---
vdpa tool is created to create, delete and query vdpa devices.
examples:
Show vdpa parent device that supports creating, deleting vdpa devices.

$ vdpa parentdev show
vdpasim:
   supported_classes
 net

$ vdpa parentdev show -jp
{
    "show": {
        "vdpasim": {
            "supported_classes": {
                "net"
            }
        }
    }
}

Create a vdpa device of type networking named as "foo2" from the parent device 
vdpasim:

$ vdpa dev add parentdev vdpasim type net name foo2

Show the newly created vdpa device by its name:
$ vdpa dev show foo2
foo2: type network parentdev vdpasim vendor_id 0 max_vqs 2 max_vq_size 256

$ vdpa dev show foo2 -jp
{
    "dev": {
        "foo2": {
            "type": "network",
            "parentdev": "vdpasim",
            "vendor_id": 0,
            "max_vqs": 2,
            "max_vq_size": 256
        }
    }
}

Delete the vdpa device after its use:
$ vdpa dev del foo2

vdpa tool support by kernel:

vdpa tool user interface will be supported by the existing vdpa kernel framework,
i.e. drivers/vdpa/vdpa.c. It services user commands through a netlink interface.

Each parent device r
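
For illustration only, a hypothetical sketch of the netlink commands and
attributes such a tool could drive (the names below are made up for this
example and are not the eventual include/uapi/linux/vdpa.h definitions):

enum {
	VDPA_SKETCH_CMD_UNSPEC,
	VDPA_SKETCH_CMD_PARENTDEV_GET,	/* vdpa parentdev list/show   */
	VDPA_SKETCH_CMD_DEV_NEW,	/* vdpa dev add ...           */
	VDPA_SKETCH_CMD_DEV_DEL,	/* vdpa dev del <name>        */
	VDPA_SKETCH_CMD_DEV_GET,	/* vdpa dev show [<name>]     */
};

enum {
	VDPA_SKETCH_ATTR_UNSPEC,
	VDPA_SKETCH_ATTR_PARENTDEV_NAME,	/* string              */
	VDPA_SKETCH_ATTR_DEV_NAME,		/* string              */
	VDPA_SKETCH_ATTR_DEV_CLASS,		/* u32: net, block ... */
	VDPA_SKETCH_ATTR_DEV_MAX_VQS,		/* u32                 */
	VDPA_SKETCH_ATTR_DEV_MAX_VQ_SIZE,	/* u16                 */
};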

Re: [PATCH v4] vdpa: mlx5: fix vdpa/vhost dependencies

2020-11-29 Thread Jason Wang



On 2020/11/29 上午5:39, Randy Dunlap wrote:

drivers/vdpa/mlx5/ uses vhost_iotlb*() interfaces, so select
VHOST_IOTLB to make them be built.

However, if VHOST_IOTLB is the only VHOST symbol that is
set/enabled, the object file still won't be built because
drivers/Makefile won't descend into drivers/vhost/ to build it,
so make drivers/Makefile build the needed binary whenever
VHOST_IOTLB is set, like it does for VHOST_RING.

Fixes these build errors:
ERROR: modpost: "vhost_iotlb_itree_next" [drivers/vdpa/mlx5/mlx5_vdpa.ko] 
undefined!
ERROR: modpost: "vhost_iotlb_itree_first" [drivers/vdpa/mlx5/mlx5_vdpa.ko] 
undefined!

Fixes: 29064bfdabd5 ("vdpa/mlx5: Add support library for mlx5 VDPA 
implementation")
Fixes: aff90770e54c ("vdpa/mlx5: Fix dependency on MLX5_CORE")
Reported-by: kernel test robot 
Signed-off-by: Randy Dunlap 
Cc: Eli Cohen 
Cc: Parav Pandit 
Cc: "Michael S. Tsirkin" 
Cc: Jason Wang 
Cc: virtualizat...@lists.linux-foundation.org
Cc: Saeed Mahameed 
Cc: Leon Romanovsky 
Cc: netdev@vger.kernel.org
---
v2: change from select to depends on VHOST (Saeed)
v3: change to depends on VHOST_IOTLB (Jason)
v4: use select VHOST_IOTLB (Michael); also add to drivers/Makefile

  drivers/Makefile |1 +
  drivers/vdpa/Kconfig |1 +
  2 files changed, 2 insertions(+)

--- linux-next-20201127.orig/drivers/vdpa/Kconfig
+++ linux-next-20201127/drivers/vdpa/Kconfig
@@ -32,6 +32,7 @@ config IFCVF
  
  config MLX5_VDPA

bool
+   select VHOST_IOTLB
help
  Support library for Mellanox VDPA drivers. Provides code that is
  common for all types of VDPA drivers. The following drivers are 
planned:
--- linux-next-20201127.orig/drivers/Makefile
+++ linux-next-20201127/drivers/Makefile
@@ -143,6 +143,7 @@ obj-$(CONFIG_OF)+= of/
  obj-$(CONFIG_SSB) += ssb/
  obj-$(CONFIG_BCMA)+= bcma/
  obj-$(CONFIG_VHOST_RING)  += vhost/
+obj-$(CONFIG_VHOST_IOTLB)  += vhost/
  obj-$(CONFIG_VHOST)   += vhost/
  obj-$(CONFIG_VLYNQ)   += vlynq/
  obj-$(CONFIG_GREYBUS) += greybus/



Acked-by: Jason Wang 

Thanks



Re: [External] Re: [PATCH 0/7] Introduce vdpa management tool

2020-11-29 Thread Jason Wang



On 2020/11/27 下午1:52, Yongji Xie wrote:
On Fri, Nov 27, 2020 at 11:53 AM Jason Wang  wrote:



On 2020/11/12 下午2:39, Parav Pandit wrote:
> This patchset covers user requirements for managing existing
vdpa devices,
> using a tool and its internal design notes for kernel drivers.
>
> Background and user requirements:
> --
> (1) Currently VDPA device is created by driver when driver is
loaded.
> However, user should have a choice when to create or not create
a vdpa device
> for the underlying parent device.
>
> For example, mlx5 PCI VF and subfunction device supports
multiple classes of
> device such netdev, vdpa, rdma. Howevever it is not required to
always created
> vdpa device for such device.
>
> (2) In another use case, a device may support creating one or
multiple vdpa
> device of same or different class such as net and block.
> Creating vdpa devices at driver load time further limits this
use case.
>
> (3) A user should be able to monitor and query vdpa queue level
or device level
> statistics for a given vdpa device.
>
> (4) A user should be able to query what class of vdpa devices
are supported
> by its parent device.
>
> (5) A user should be able to view supported features and
negotiated features
> of the vdpa device.
>
> (6) A user should be able to create a vdpa device in vendor
agnostic manner
> using single tool.
>
> Hence, it is required to have a tool through which user can
create one or more
> vdpa devices from a parent device which addresses above user
requirements.
>
> Example devices:
> 
>   +---+ +---+ +-+ ++ +---+
>   |vdpa dev 0 | |vdpa dev 1 | |rdma dev | |netdev  | |vdpa dev 3 |
>   |type=net   | |type=block | |mlx5_0   | |ens3f0  | |type=net   |
>   ++--+ +-+-+ +++ +-+--+ ++--+
>        |              |            |            |    |
>        |              |            |            |    |
>   ++-+        |       +++       | +++
>   |  mlx5    ++       |mlx5     +---+ |mlx5     |
>   |pci vf 2  |                |pci vf 4 | |pci sf 8 |
>   |03:00:2   |                |03:00.4  | |mlx5_sf.8|
>   ++-+                +++ +++
>        |                           |   |
>        |                      ++-+   |
>        +--+mlx5 ++
>                               |pci pf 0  |
>                               |03:00.0   |
>                               +--+
>
> vdpa tool:
> --
> vdpa tool is a tool to create, delete vdpa devices from a parent
device. It is a
> tool that enables user to query statistics, features and may be
more attributes
> in future.
>
> vdpa tool command draft:
> 
> (a) List parent devices which supports creating vdpa devices.
> It also shows which class types supported by this parent device.
> In below command example two parent devices support vdpa device
creation.
> First is PCI VF whose bdf is 03.00:2.
> Second is PCI VF whose name is 03:00.4.
> Third is PCI SF whose name is mlx5_core.sf.8
>
> $ vdpa parentdev list
> vdpasim
>    supported_classes
>      net
> pci/:03.00:3
>    supported_classes
>      net block
> pci/:03.00:4
>    supported_classes
>      net block
> auxiliary/mlx5_core.sf.8
>    supported_classes
>      net
>
> (b) Now add a vdpa device of networking class and show the device.
> $ vdpa dev add parentdev pci/:03.00:2 type net name foo0 $
vdpa dev show foo0
> foo0: parentdev pci/:03.00:2 type network parentdev vdpasim
vendor_id 0 max_vqs 2 max_vq_size 256
>
> (c) Show features of a vdpa device
> $ vdpa dev features show foo0
> supported
>    iommu platform
>    version 1
>
> (d) Dump vdpa device statistics
> $ vdpa dev stats show foo0
> kickdoorbells 10
> wqes 100
>
> (e) Now delete a vdpa device previously created.
> $ vdpa dev del foo0
>
> vdpa tool support in this patchset:
> ---
> vdpa tool is created to create, delete and query vdpa devices.
> examples:
> Sho

Re: [PATCH netdev 1/2] virtio: add module option to turn off guest offloads

2020-11-12 Thread Jason Wang



On 2020/11/12 下午4:11, Xuan Zhuo wrote:

* VIRTIO_NET_F_GUEST_CSUM
* VIRTIO_NET_F_GUEST_TSO4
* VIRTIO_NET_F_GUEST_TSO6
* VIRTIO_NET_F_GUEST_ECN
* VIRTIO_NET_F_GUEST_UFO
* VIRTIO_NET_F_MTU

If these features are negotiated successfully, it may cause virtio-net to
receive large packets, which will cause xdp to fail to load. And in
many cases, they cannot be dynamically turned off, so add a module option
to turn off these features.



Actually we will disable those through the control virtqueue. Or does this 
mean your hardware doesn't support the control guest offloads feature?


Module parameters may introduce a lot of burden for management and 
usability.


Disabling guest offloads means you may suffer from low RX throughput.
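
For reference, a rough sketch of how guest offloads can be toggled at runtime
through the control virtqueue instead of a module parameter. The constants are
from include/uapi/linux/virtio_net.h; the command helper and the virtnet_info
layout are assumed to roughly match drivers/net/virtio_net.c:

#include <linux/virtio_config.h>
#include <linux/scatterlist.h>
#include <uapi/linux/virtio_net.h>

static int sketch_disable_guest_offloads(struct virtnet_info *vi)
{
	/* Note: the real driver keeps this in a dedicated DMA-safe buffer
	 * rather than on the stack. */
	__virtio64 offloads = cpu_to_virtio64(vi->vdev, 0); /* clear GUEST_* */
	struct scatterlist sg;

	if (!virtio_has_feature(vi->vdev, VIRTIO_NET_F_CTRL_GUEST_OFFLOADS))
		return -EOPNOTSUPP;

	sg_init_one(&sg, &offloads, sizeof(offloads));
	if (!virtnet_send_command(vi, VIRTIO_NET_CTRL_GUEST_OFFLOADS,
				  VIRTIO_NET_CTRL_GUEST_OFFLOADS_SET, &sg))
		return -EINVAL;

	return 0;
}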

Thanks




Signed-off-by: Xuan Zhuo 
---
  drivers/net/virtio_net.c | 36 +++-
  1 file changed, 35 insertions(+), 1 deletion(-)

diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
index 21b7114..232a539 100644
--- a/drivers/net/virtio_net.c
+++ b/drivers/net/virtio_net.c
@@ -26,10 +26,11 @@
  static int napi_weight = NAPI_POLL_WEIGHT;
  module_param(napi_weight, int, 0444);
  
-static bool csum = true, gso = true, napi_tx = true;

+static bool csum = true, gso = true, napi_tx, guest_offload = true;
  module_param(csum, bool, 0444);
  module_param(gso, bool, 0444);
  module_param(napi_tx, bool, 0644);
+module_param(guest_offload, bool, 0644);
  
  /* FIXME: MTU in config. */

  #define GOOD_PACKET_LEN (ETH_HLEN + VLAN_HLEN + ETH_DATA_LEN)
@@ -3245,6 +3246,18 @@ static __maybe_unused int virtnet_restore(struct 
virtio_device *vdev)
VIRTIO_NET_F_MTU, VIRTIO_NET_F_CTRL_GUEST_OFFLOADS, \
VIRTIO_NET_F_SPEED_DUPLEX, VIRTIO_NET_F_STANDBY
  
+#define VIRTNET_FEATURES_WITHOUT_GUEST_OFFLOADS \

+   VIRTIO_NET_F_CSUM, \
+   VIRTIO_NET_F_MAC, \
+   VIRTIO_NET_F_HOST_TSO4, VIRTIO_NET_F_HOST_UFO, VIRTIO_NET_F_HOST_TSO6, \
+   VIRTIO_NET_F_HOST_ECN, \
+   VIRTIO_NET_F_MRG_RXBUF, VIRTIO_NET_F_STATUS, VIRTIO_NET_F_CTRL_VQ, \
+   VIRTIO_NET_F_CTRL_RX, VIRTIO_NET_F_CTRL_VLAN, \
+   VIRTIO_NET_F_GUEST_ANNOUNCE, VIRTIO_NET_F_MQ, \
+   VIRTIO_NET_F_CTRL_MAC_ADDR, \
+   VIRTIO_NET_F_CTRL_GUEST_OFFLOADS, \
+   VIRTIO_NET_F_SPEED_DUPLEX, VIRTIO_NET_F_STANDBY
+
  static unsigned int features[] = {
VIRTNET_FEATURES,
  };
@@ -3255,6 +3268,16 @@ static __maybe_unused int virtnet_restore(struct 
virtio_device *vdev)
VIRTIO_F_ANY_LAYOUT,
  };
  
+static unsigned int features_without_offloads[] = {

+   VIRTNET_FEATURES_WITHOUT_GUEST_OFFLOADS,
+};
+
+static unsigned int features_without_offloads_legacy[] = {
+   VIRTNET_FEATURES_WITHOUT_GUEST_OFFLOADS,
+   VIRTIO_NET_F_GSO,
+   VIRTIO_F_ANY_LAYOUT,
+};
+
  static struct virtio_driver virtio_net_driver = {
.feature_table = features,
.feature_table_size = ARRAY_SIZE(features),
@@ -3288,6 +3311,17 @@ static __init int virtio_net_driver_init(void)
if (ret)
goto err_dead;
  
+	if (!guest_offload) {

+   virtio_net_driver.feature_table = features_without_offloads;
+   virtio_net_driver.feature_table_size =
+   ARRAY_SIZE(features_without_offloads);
+
+   virtio_net_driver.feature_table_legacy =
+   features_without_offloads_legacy;
+   virtio_net_driver.feature_table_size_legacy =
+   ARRAY_SIZE(features_without_offloads_legacy);
+   }
+
  ret = register_virtio_driver(&virtio_net_driver);
if (ret)
goto err_virtio;




Re: [PATCH netdev 2/2] virtio, xdp: Allow xdp to load, even if there is not enough queue

2020-11-12 Thread Jason Wang



On 2020/11/12 下午5:15, Xuan Zhuo wrote:

Since XDP_TX uses an independent queue to send data by default, and
requires a queue for each CPU, this condition is often not met when the
number of CPUs is relatively large, even though XDP_TX is not used in many
cases. I hope that in this case XDP is allowed to load and a warning message
is emitted. If the user then uses XDP_TX, a separate error is generated.

This is not a perfect solution, but I still hope to solve some of the
problems first.

Signed-off-by: Xuan Zhuo



This leads to a bad user experience.

Let's do something like this (see the sketch below):

1) When TX queues are sufficient, behave as before
2) When they are not, use the tx lock to synchronize
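
A minimal sketch of that queue-selection idea (the helper and field names are
assumptions for illustration, not the eventual in-tree implementation):

#include <linux/netdevice.h>
#include <linux/smp.h>

struct virtnet_sketch {			/* hypothetical, for illustration */
	struct net_device *dev;
	bool per_cpu_xdp_queues;	/* enough tx queues for one per CPU */
	unsigned int first_xdp_q;	/* index of first XDP-dedicated queue */
	unsigned int num_tx_queues;
};

/* Pick the tx queue index to use for XDP_TX on this CPU; lock it only
 * when queues have to be shared.  xdp_tx_queue_put() is the matching
 * unlock. */
static unsigned int xdp_tx_queue_get(struct virtnet_sketch *vi)
{
	unsigned int q;

	if (vi->per_cpu_xdp_queues)
		return vi->first_xdp_q + smp_processor_id();

	q = smp_processor_id() % vi->num_tx_queues;
	__netif_tx_lock(netdev_get_tx_queue(vi->dev, q),
			raw_smp_processor_id());
	return q;
}

static void xdp_tx_queue_put(struct virtnet_sketch *vi, unsigned int q)
{
	if (!vi->per_cpu_xdp_queues)
		__netif_tx_unlock(netdev_get_tx_queue(vi->dev, q));
}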

Thanks



Re: [PATCH] vringh: fix vringh_iov_push_*() documentation

2020-11-16 Thread Jason Wang



On 2020/11/17 上午12:16, Stefano Garzarella wrote:

vringh_iov_push_*() functions don't have 'dst' parameter, but have
the 'src' parameter.

Replace 'dst' description with 'src' description.

Signed-off-by: Stefano Garzarella 



Acked-by: Jason Wang 



---
  drivers/vhost/vringh.c | 6 +++---
  1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/drivers/vhost/vringh.c b/drivers/vhost/vringh.c
index 8bd8b403f087..b7403ba8e7f7 100644
--- a/drivers/vhost/vringh.c
+++ b/drivers/vhost/vringh.c
@@ -730,7 +730,7 @@ EXPORT_SYMBOL(vringh_iov_pull_user);
  /**
   * vringh_iov_push_user - copy bytes into vring_iov.
   * @wiov: the wiov as passed to vringh_getdesc_user() (updated as we consume)
- * @dst: the place to copy.
+ * @src: the place to copy from.
   * @len: the maximum length to copy.
   *
   * Returns the bytes copied <= len or a negative errno.
@@ -976,7 +976,7 @@ EXPORT_SYMBOL(vringh_iov_pull_kern);
  /**
   * vringh_iov_push_kern - copy bytes into vring_iov.
   * @wiov: the wiov as passed to vringh_getdesc_kern() (updated as we consume)
- * @dst: the place to copy.
+ * @src: the place to copy from.
   * @len: the maximum length to copy.
   *
   * Returns the bytes copied <= len or a negative errno.
@@ -1333,7 +1333,7 @@ EXPORT_SYMBOL(vringh_iov_pull_iotlb);
   * vringh_iov_push_iotlb - copy bytes into vring_iov.
   * @vrh: the vring.
   * @wiov: the wiov as passed to vringh_getdesc_iotlb() (updated as we consume)
- * @dst: the place to copy.
+ * @src: the place to copy from.
   * @len: the maximum length to copy.
   *
   * Returns the bytes copied <= len or a negative errno.




Re: [PATCH net] vhost_vdpa: Return -EFUALT if copy_from_user() fails

2020-11-17 Thread Jason Wang



On 2020/10/26 上午10:59, Jason Wang wrote:


On 2020/10/23 下午11:34, Michael S. Tsirkin wrote:

On Fri, Oct 23, 2020 at 03:08:53PM +0300, Dan Carpenter wrote:

The copy_to/from_user() functions return the number of bytes which we
weren't able to copy but the ioctl should return -EFAULT if they fail.

Fixes: a127c5bbb6a8 ("vhost-vdpa: fix backend feature ioctls")
Signed-off-by: Dan Carpenter 

Acked-by: Michael S. Tsirkin 
Needed for stable I guess.



Agree.

Acked-by: Jason Wang 



Hi Michael.

I don't see this in your tree, please consider to merge.

Thanks





Re: [PATCH v2] vhost/vsock: add IOTLB API support

2020-12-23 Thread Jason Wang



On 2020/12/23 下午10:36, Stefano Garzarella wrote:

This patch enables the IOTLB API support for vhost-vsock devices,
allowing the userspace to emulate an IOMMU for the guest.

These changes were made following vhost-net, in details this patch:
- exposes VIRTIO_F_ACCESS_PLATFORM feature and inits the iotlb
   device if the feature is acked
- implements VHOST_GET_BACKEND_FEATURES and
   VHOST_SET_BACKEND_FEATURES ioctls
- calls vq_meta_prefetch() before vq processing to prefetch vq
   metadata address in IOTLB
- provides .read_iter, .write_iter, and .poll callbacks for the
   chardev; they are used by the userspace to exchange IOTLB messages

This patch was tested specifying "intel_iommu=strict" in the guest
kernel command line. I used QEMU with a patch applied [1] to fix a
simple issue (that patch was merged in QEMU v5.2.0):
 $ qemu -M q35,accel=kvm,kernel-irqchip=split \
-drive file=fedora.qcow2,format=qcow2,if=virtio \
-device intel-iommu,intremap=on,device-iotlb=on \
-device vhost-vsock-pci,guest-cid=3,iommu_platform=on,ats=on

[1] https://lists.gnu.org/archive/html/qemu-devel/2020-10/msg09077.html

Reviewed-by: Stefan Hajnoczi 
Signed-off-by: Stefano Garzarella 



Acked-by: Jason Wang 



---

The patch is the same of v1, but I re-tested it with:
- QEMU v5.2.0-551-ga05f8ecd88
- Linux 5.9.15 (host)
- Linux 5.9.15 and 5.10.0 (guest)
Now, enabling 'ats' it works well, there are just a few simple changes.

v1: https://www.spinics.net/lists/kernel/msg3716022.html
v2:
- updated commit message about QEMU version and string used to test
- rebased on mst/vhost branch

Thanks,
Stefano
---
  drivers/vhost/vsock.c | 68 +--
  1 file changed, 65 insertions(+), 3 deletions(-)

diff --git a/drivers/vhost/vsock.c b/drivers/vhost/vsock.c
index a483cec31d5c..5e78fb719602 100644
--- a/drivers/vhost/vsock.c
+++ b/drivers/vhost/vsock.c
@@ -30,7 +30,12 @@
  #define VHOST_VSOCK_PKT_WEIGHT 256
  
  enum {

-   VHOST_VSOCK_FEATURES = VHOST_FEATURES,
+   VHOST_VSOCK_FEATURES = VHOST_FEATURES |
+  (1ULL << VIRTIO_F_ACCESS_PLATFORM)
+};
+
+enum {
+   VHOST_VSOCK_BACKEND_FEATURES = (1ULL << VHOST_BACKEND_F_IOTLB_MSG_V2)
  };
  
  /* Used to track all the vhost_vsock instances on the system. */

@@ -94,6 +99,9 @@ vhost_transport_do_send_pkt(struct vhost_vsock *vsock,
if (!vhost_vq_get_backend(vq))
goto out;
  
+	if (!vq_meta_prefetch(vq))

+   goto out;
+
/* Avoid further vmexits, we're already processing the virtqueue */
vhost_disable_notify(&vsock->dev, vq);
  
@@ -449,6 +457,9 @@ static void vhost_vsock_handle_tx_kick(struct vhost_work *work)

if (!vhost_vq_get_backend(vq))
goto out;
  
+	if (!vq_meta_prefetch(vq))

+   goto out;
+
vhost_disable_notify(&vsock->dev, vq);
do {
u32 len;
@@ -766,8 +777,12 @@ static int vhost_vsock_set_features(struct vhost_vsock 
*vsock, u64 features)
mutex_lock(&vsock->dev.mutex);
if ((features & (1 << VHOST_F_LOG_ALL)) &&
!vhost_log_access_ok(&vsock->dev)) {
-   mutex_unlock(&vsock->dev.mutex);
-   return -EFAULT;
+   goto err;
+   }
+
+   if ((features & (1ULL << VIRTIO_F_ACCESS_PLATFORM))) {
+   if (vhost_init_device_iotlb(&vsock->dev, true))
+   goto err;
}
  
  	for (i = 0; i < ARRAY_SIZE(vsock->vqs); i++) {

@@ -778,6 +793,10 @@ static int vhost_vsock_set_features(struct vhost_vsock 
*vsock, u64 features)
}
mutex_unlock(&vsock->dev.mutex);
return 0;
+
+err:
+   mutex_unlock(&vsock->dev.mutex);
+   return -EFAULT;
  }
  
  static long vhost_vsock_dev_ioctl(struct file *f, unsigned int ioctl,

@@ -811,6 +830,18 @@ static long vhost_vsock_dev_ioctl(struct file *f, unsigned 
int ioctl,
if (copy_from_user(&features, argp, sizeof(features)))
return -EFAULT;
return vhost_vsock_set_features(vsock, features);
+   case VHOST_GET_BACKEND_FEATURES:
+   features = VHOST_VSOCK_BACKEND_FEATURES;
+   if (copy_to_user(argp, &features, sizeof(features)))
+   return -EFAULT;
+   return 0;
+   case VHOST_SET_BACKEND_FEATURES:
+   if (copy_from_user(&features, argp, sizeof(features)))
+   return -EFAULT;
+   if (features & ~VHOST_VSOCK_BACKEND_FEATURES)
+   return -EOPNOTSUPP;
+   vhost_set_backend_features(&vsock->dev, features);
+   return 0;
default:
mutex_lock(&vsock->dev.mutex);
r = vhost_dev_ioctl(&vsock->dev, ioctl, argp);

Re: [RFC v2 00/13] Introduce VDUSE - vDPA Device in Userspace

2020-12-23 Thread Jason Wang



On 2020/12/23 下午6:59, Yongji Xie wrote:

On Wed, Dec 23, 2020 at 2:38 PM Jason Wang  wrote:


On 2020/12/22 下午10:52, Xie Yongji wrote:

This series introduces a framework, which can be used to implement
vDPA Devices in a userspace program. The work consist of two parts:
control path forwarding and data path offloading.

In the control path, the VDUSE driver will make use of message
mechnism to forward the config operation from vdpa bus driver
to userspace. Userspace can use read()/write() to receive/reply
those control messages.

In the data path, the core is mapping dma buffer into VDUSE
daemon's address space, which can be implemented in different ways
depending on the vdpa bus to which the vDPA device is attached.

In virtio-vdpa case, we implements a MMU-based on-chip IOMMU driver with
bounce-buffering mechanism to achieve that.


Rethinking the bounce buffer stuff: instead of using kernel
pages with mmap(), how about just using userspace pages like vhost does?

It means we need a worker to do the bouncing, but we don't need to care about
annoying things like page reclaim.


Now the I/O bouncing is done in the streaming DMA mapping routines
which can be called from interrupt context. If we put this into a
kworker, that means we need to synchronize with a kworker in an
interrupt context. I think it can't work.



We just need to make sure the buffer is ready before userspace tries 
to access it.


But I admit it would be tricky (it would require a shadow virtqueue, etc.), so 
it's probably not a good idea.
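
To illustrate the constraint Yongji describes above, a minimal sketch of a
bounce-style map callback (the helper names are invented for this example):
the copy has to happen synchronously inside the mapping routine because
drivers may call it from interrupt context, so deferring the copy to a
kworker and waiting for it is not an option.

#include <linux/dma-mapping.h>
#include <linux/mm.h>
#include <linux/string.h>

static dma_addr_t bounce_map_page_sketch(struct device *dev, struct page *page,
					 unsigned long offset, size_t size,
					 enum dma_data_direction dir,
					 unsigned long attrs)
{
	/* hypothetical helpers: allocate from a GFP_ATOMIC bounce pool and
	 * translate the bounce buffer into a device-visible IOVA */
	void *bounce = sketch_alloc_bounce(size);

	if (!bounce)
		return DMA_MAPPING_ERROR;

	if (dir == DMA_TO_DEVICE || dir == DMA_BIDIRECTIONAL)
		/* copy now, possibly in interrupt context; cannot sleep or
		 * wait for a worker here */
		memcpy(bounce, page_address(page) + offset, size);

	return sketch_bounce_to_iova(bounce);
}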


Thanks




Thanks,
Yongji





Re: [RFC v2 08/13] vdpa: Introduce process_iotlb_msg() in vdpa_config_ops

2020-12-23 Thread Jason Wang



On 2020/12/23 下午7:06, Yongji Xie wrote:

On Wed, Dec 23, 2020 at 4:37 PM Jason Wang  wrote:


On 2020/12/22 下午10:52, Xie Yongji wrote:

This patch introduces a new method in the vdpa_config_ops to
support processing the raw vhost memory mapping message in the
vDPA device driver.

Signed-off-by: Xie Yongji 
---
   drivers/vhost/vdpa.c | 5 -
   include/linux/vdpa.h | 7 +++
   2 files changed, 11 insertions(+), 1 deletion(-)

diff --git a/drivers/vhost/vdpa.c b/drivers/vhost/vdpa.c
index 448be7875b6d..ccbb391e38be 100644
--- a/drivers/vhost/vdpa.c
+++ b/drivers/vhost/vdpa.c
@@ -728,6 +728,9 @@ static int vhost_vdpa_process_iotlb_msg(struct vhost_dev 
*dev,
   if (r)
   return r;

+ if (ops->process_iotlb_msg)
+ return ops->process_iotlb_msg(vdpa, msg);
+
   switch (msg->type) {
   case VHOST_IOTLB_UPDATE:
   r = vhost_vdpa_process_iotlb_update(v, msg);
@@ -770,7 +773,7 @@ static int vhost_vdpa_alloc_domain(struct vhost_vdpa *v)
   int ret;

   /* Device want to do DMA by itself */
- if (ops->set_map || ops->dma_map)
+ if (ops->set_map || ops->dma_map || ops->process_iotlb_msg)
   return 0;

   bus = dma_dev->bus;
diff --git a/include/linux/vdpa.h b/include/linux/vdpa.h
index 656fe264234e..7bccedf22f4b 100644
--- a/include/linux/vdpa.h
+++ b/include/linux/vdpa.h
@@ -5,6 +5,7 @@
   #include 
   #include 
   #include 
+#include 
   #include 
   #include 

@@ -172,6 +173,10 @@ struct vdpa_iova_range {
*  @vdev: vdpa device
*  Returns the iova range supported by
*  the device.
+ * @process_iotlb_msg:   Process vhost memory mapping message 
(optional)
+ *   Only used for VDUSE device now
+ *   @vdev: vdpa device
+ *   @msg: vhost memory mapping message
* @set_map:Set device memory mapping (optional)
*  Needed for device that using device
*  specific DMA translation (on-chip IOMMU)
@@ -240,6 +245,8 @@ struct vdpa_config_ops {
   struct vdpa_iova_range (*get_iova_range)(struct vdpa_device *vdev);

   /* DMA ops */
+ int (*process_iotlb_msg)(struct vdpa_device *vdev,
+  struct vhost_iotlb_msg *msg);
   int (*set_map)(struct vdpa_device *vdev, struct vhost_iotlb *iotlb);
   int (*dma_map)(struct vdpa_device *vdev, u64 iova, u64 size,
  u64 pa, u32 perm);


Is there any reason that it can't be done via dma_map/dma_unmap or set_map?


To get the shmfd, we need the vma rather than the physical address. And
it's not necessary to pin the user pages in the VDUSE case.



Right, actually, vhost-vDPA is planning to support shared virtual 
address space.


So let's try to reuse the existing config ops. How about just introducing 
an attribute to the vdpa device that tells the bus it can do shared virtual 
memory? Then, when the device is probed by vhost-vDPA, user pages won't be 
pinned and we will do VA->VA mapping instead of IOVA->PA mapping in the vhost 
IOTLB and the config ops. The vhost IOTLB needs to be extended to accept an 
opaque pointer to store the file, and the file is passed via the config ops 
as well (a rough sketch follows below).
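
A rough sketch of that extension, with names invented for illustration (this
is not the real include/linux/vhost_iotlb.h API; it only shows the shape of
an IOTLB entry carrying a userspace VA plus an opaque backing file, and what
a VDUSE-style parent could do with it in its mapping config op):

#include <linux/fs.h>
#include <linux/types.h>

struct va_iotlb_entry {			/* hypothetical entry layout */
	u64 iova_start;
	u64 iova_last;
	u64 uaddr;		/* userspace VA instead of a pinned PA */
	u32 perm;
	struct file *file;	/* opaque backing, handed to the parent */
};

/* What a VDUSE-style parent could do in its set_map()/dma_map() config op
 * once entries carry the VA and the backing file: no get_user_pages()
 * pinning, just forward the range and the file so the userspace daemon
 * can mmap() the same backing.  sketch_send_map_msg() is hypothetical. */
static int parent_map_range_sketch(struct vduse_dev_sketch *dev,
				   const struct va_iotlb_entry *e)
{
	return sketch_send_map_msg(dev, e->iova_start,
				   e->iova_last - e->iova_start + 1,
				   e->uaddr, e->file);
}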


Thanks





Thanks,
Yongji




