Re: Question about IRQs during the .remove() of virtio-vsock driver
On 2019/5/21 9:56 PM, Michael S. Tsirkin wrote:
On Tue, May 21, 2019 at 03:49:20PM +0200, Stefano Garzarella wrote:
On Tue, May 21, 2019 at 06:05:31AM -0400, Michael S. Tsirkin wrote:
On Tue, May 21, 2019 at 11:44:07AM +0200, Stefano Garzarella wrote:

Hi Michael, Jason,
as suggested by Stefan, I'm checking if we have some races in the virtio-vsock driver. We found some races in the .probe() and .remove() with the upper layer (socket) and I'll fix them.

Now my attention is on the bottom layer (virtio device) and my question is: during the .remove() of the virtio-vsock driver (virtio_vsock_remove), could it happen that an IRQ arrives and one of our callbacks (e.g. virtio_vsock_rx_done()) is executed, queueing new works?

I tried to follow the code in both cases (device unplugged or module removed) and maybe it can't happen because we remove the device from the bus's knowledge, but I'm not sure and your advice would be very helpful.

Thanks in advance,
Stefano

Great question! This should be better documented: patches welcome!

When I'm clear, I'll be happy to document this.

Here's my understanding: a typical removal flow works like this:
- prevent Linux from sending new kick requests to the device and flush such outstanding requests if any (the device can still send notifications to Linux)
- call vi->vdev->config->reset(vi->vdev); this will flush all device writes and interrupts. The device will not use any more buffers. Previously outstanding callbacks might still be active.
- then call vdev->config->del_vqs(vdev); to flush outstanding callbacks if any.

Thanks for sharing this useful information.

So, IIUC, between step 1 (e.g. in virtio-vsock we flush all work queues) and step 2, new IRQs could happen, and in the virtio-vsock driver new work will be queued.
In order to handle this case, I'm thinking of adding a new variable 'work_enabled' in the struct virtio_vsock, setting it to false at the start of .remove(), then calling synchronize_rcu() before flushing all work queues, and using a helper function virtio_transport_queue_work() to queue new work, where the check of work_enabled and the queue_work() are in the RCU read critical section.

Here is pseudo code to explain the idea better:

    virtio_vsock_remove() {
        vsock->work_enabled = false;

        /* Wait for other CPUs to finish queueing works */
        synchronize_rcu();

        flush_works();

        vdev->config->reset(vdev);
        ...
        vdev->config->del_vqs(vdev);
    }

    virtio_vsock_queue_work(vsock, work) {
        rcu_read_lock();

        if (!vsock->work_enabled) {
            goto out;
        }

        queue_work(virtio_vsock_workqueue, work);

    out:
        rcu_read_unlock();
    }

Do you think this can work? Please tell me if there is a better way to handle this case.

Thanks,
Stefano

Instead of RCU tricks, I would just have rx_run and tx_run flags and check them within the queued work - presumably under the tx or rx lock. Then queueing an extra work becomes harmless, and you flush it after del_vqs, which flushes everything for you.

It looks to me that we need to guarantee that no work is queued or scheduled before del_vqs. Otherwise it may lead to a use-after-free? (E.g. net disables NAPI before del_vqs.)

Thanks
Re: [PATCH bpf-next 3/3] veth: Support bulk XDP_TX
On 2019/5/23 9:51 PM, Toshiaki Makita wrote:
On 19/05/23 (Thu) 22:29:27, Jesper Dangaard Brouer wrote:
On Thu, 23 May 2019 20:35:50 +0900 Toshiaki Makita wrote:
On 2019/05/23 20:25, Toke Høiland-Jørgensen wrote:
Toshiaki Makita writes:

This improves XDP_TX performance by about 8%. Here are single core XDP_TX test results. CPU consumptions are taken from "perf report --no-child".

- Before: 7.26 Mpps
  _raw_spin_lock 7.83%
  veth_xdp_xmit 12.23%
- After: 7.84 Mpps
  _raw_spin_lock 1.17%
  veth_xdp_xmit 6.45%

Signed-off-by: Toshiaki Makita
---
 drivers/net/veth.c | 26 +-
 1 file changed, 25 insertions(+), 1 deletion(-)

diff --git a/drivers/net/veth.c b/drivers/net/veth.c
index 52110e5..4edc75f 100644
--- a/drivers/net/veth.c
+++ b/drivers/net/veth.c
@@ -442,6 +442,23 @@ static int veth_xdp_xmit(struct net_device *dev, int n,
 	return ret;
 }

+static void veth_xdp_flush_bq(struct net_device *dev)
+{
+	struct xdp_tx_bulk_queue *bq = this_cpu_ptr(&xdp_tx_bq);
+	int sent, i, err = 0;
+
+	sent = veth_xdp_xmit(dev, bq->count, bq->q, 0);

Wait, veth_xdp_xmit() is just putting frames on a pointer ring. So you're introducing an additional per-cpu bulk queue, only to avoid lock contention around the existing pointer ring. But the pointer ring is per-rq, so if you have lock contention, this means you must have multiple CPUs servicing the same rq, no?

Yes, it's possible. Not recommended though.

I think the general per-cpu TX bulk queue is overkill. There is a loop over packets in veth_xdp_rcv(struct veth_rq *rq, budget, *status), and the caller veth_poll() will call veth_xdp_flush(rq->dev). Why can't you store this "temp" bulk array in struct veth_rq?

Of course I can. But I thought tun has the same problem and we can decrease the memory footprint by sharing the same storage between devices.

For TUN, and for its fast path where vhost passes a bulk of XDP frames (through msg_control) to us, we probably just need a temporary bulk array in tun_xdp_one() instead of a global one.
I can post a patch, or maybe you can if you're interested in this.

Thanks

Or if other devices want to reduce queues so that we can use XDP on many-cpu servers and introduce locks, we can use this storage for that case as well. Still, do you prefer a veth-specific solution?

You could even alloc/create it on the stack of veth_poll() and send it along via a pointer to veth_xdp_rcv().

Toshiaki Makita
Re: [PATCH bpf-next 3/3] veth: Support bulk XDP_TX
On 2019/5/24 11:28 AM, Toshiaki Makita wrote:
On 2019/05/24 12:13, Jason Wang wrote:
On 2019/5/23 9:51 PM, Toshiaki Makita wrote:
On 19/05/23 (Thu) 22:29:27, Jesper Dangaard Brouer wrote:
On Thu, 23 May 2019 20:35:50 +0900 Toshiaki Makita wrote:
On 2019/05/23 20:25, Toke Høiland-Jørgensen wrote:
Toshiaki Makita writes:

This improves XDP_TX performance by about 8%. Here are single core XDP_TX test results. CPU consumptions are taken from "perf report --no-child".

- Before: 7.26 Mpps
  _raw_spin_lock 7.83%
  veth_xdp_xmit 12.23%
- After: 7.84 Mpps
  _raw_spin_lock 1.17%
  veth_xdp_xmit 6.45%

Signed-off-by: Toshiaki Makita
---
 drivers/net/veth.c | 26 +-
 1 file changed, 25 insertions(+), 1 deletion(-)

diff --git a/drivers/net/veth.c b/drivers/net/veth.c
index 52110e5..4edc75f 100644
--- a/drivers/net/veth.c
+++ b/drivers/net/veth.c
@@ -442,6 +442,23 @@ static int veth_xdp_xmit(struct net_device *dev, int n,
 	return ret;
 }

+static void veth_xdp_flush_bq(struct net_device *dev)
+{
+	struct xdp_tx_bulk_queue *bq = this_cpu_ptr(&xdp_tx_bq);
+	int sent, i, err = 0;
+
+	sent = veth_xdp_xmit(dev, bq->count, bq->q, 0);

Wait, veth_xdp_xmit() is just putting frames on a pointer ring. So you're introducing an additional per-cpu bulk queue, only to avoid lock contention around the existing pointer ring. But the pointer ring is per-rq, so if you have lock contention, this means you must have multiple CPUs servicing the same rq, no?

Yes, it's possible. Not recommended though.

I think the general per-cpu TX bulk queue is overkill. There is a loop over packets in veth_xdp_rcv(struct veth_rq *rq, budget, *status), and the caller veth_poll() will call veth_xdp_flush(rq->dev). Why can't you store this "temp" bulk array in struct veth_rq?

Of course I can. But I thought tun has the same problem and we can decrease the memory footprint by sharing the same storage between devices.
For TUN, and for its fast path where vhost passes a bulk of XDP frames (through msg_control) to us, we probably just need a temporary bulk array in tun_xdp_one() instead of a global one. I can post a patch, or maybe you can if you're interested in this.

Of course you/I can. What I'm concerned about is that it could waste cache lines when softirq runs the veth napi handler and then the tun napi handler.

Well, technically the bulk queue passed to TUN could be reused.

I admit it may save a cacheline in the ideal case, but I wonder how much we could gain in a real workload.

(Note that TUN doesn't use a napi handler to do XDP. It has a NAPI mode, but it was mainly used for hardening and XDP was not implemented there; maybe we should fix this.)

Thanks

Thanks

Or if other devices want to reduce queues so that we can use XDP on many-cpu servers and introduce locks, we can use this storage for that case as well. Still, do you prefer a veth-specific solution?

You could even alloc/create it on the stack of veth_poll() and send it along via a pointer to veth_xdp_rcv().

Toshiaki Makita
Re: [PATCH v3 2/2] net: core: support XDP generic on stacked devices.
On 2019/5/24 3:19 AM, Saeed Mahameed wrote:
On Thu, 2019-05-23 at 10:54 -0700, Stephen Hemminger wrote:

When a device is stacked like (team, bonding, failsafe or netvsc) the XDP generic program for the parent device was not called.

Move the call to XDP generic inside __netif_receive_skb_core where it can be done multiple times for the stacked case.

Suggested-by: Jiri Pirko
Fixes: d445516966dc ("net: xdp: support xdp generic on virtual devices")
Signed-off-by: Stephen Hemminger
---
v1 - call xdp_generic in netvsc handler
v2 - do xdp_generic in generic rx handler processing
v3 - move xdp_generic call inside the another pass loop

 net/core/dev.c | 56 ++
 1 file changed, 11 insertions(+), 45 deletions(-)

diff --git a/net/core/dev.c b/net/core/dev.c
index b6b8505cfb3e..696776e14d00 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -4502,23 +4502,6 @@ static int netif_rx_internal(struct sk_buff *skb)
 	trace_netif_rx(skb);

-	if (static_branch_unlikely(&generic_xdp_needed_key)) {
-		int ret;
-
-		preempt_disable();
-		rcu_read_lock();
-		ret = do_xdp_generic(rcu_dereference(skb->dev->xdp_prog), skb);
-		rcu_read_unlock();
-		preempt_enable();
-
-		/* Consider XDP consuming the packet a success from
-		 * the netdev point of view we do not want to count
-		 * this as an error.
-		 */
-		if (ret != XDP_PASS)
-			return NET_RX_SUCCESS;
-	}
-

Adding Jesper,

There is a small behavioral change due to this patch.

At least not for current code. We don't support skb in cpumap (though the fix should not be hard); in xdp_do_generic_redirect_map() we had:

	} else {
		/* TODO: Handle BPF_MAP_TYPE_CPUMAP */
		err = -EBADRQC;
		goto err;
	}

The XDP program after this patch will run on the RPS CPU, if configured, which could cause some behavioral changes in xdp_redirect_cpu: bpf_redirect_map(cpu_map).

Maybe this is acceptable, but it should be documented, as the current assumption dictates: an XDP program runs on the core where the XDP frame/SKB was first seen.

At least for TUN, this is not true.
XDP frames were built by vhost_net and passed to TUN. There's no guarantee that the vhost_net kthread won't move to another core.

Thanks
Re: [PATCH v3 2/2] net: core: support XDP generic on stacked devices.
On 2019/5/24 6:07 PM, Jesper Dangaard Brouer wrote:
On Fri, 24 May 2019 12:17:27 +0800 Jason Wang wrote:

Maybe this is acceptable, but it should be documented, as the current assumption dictates: an XDP program runs on the core where the XDP frame/SKB was first seen.

At least for TUN, this is not true. XDP frames were built by vhost_net and passed to TUN. There's no guarantee that the vhost_net kthread won't move to another core.

This sounds a little scary, as we depend on per-CPU variables (e.g. bpf_redirect_info). Can the vhost_net kthread move between CPUs within/during the NAPI poll?

TUN RX does not usually go through NAPI. What we do is:

1) The vhost kthread prepares an array of XDP frames and passes them to TUN through sendmsg.

2) TUN disables bh and runs XDP for each frame, then enables bh.

So the kthread can move to another CPU before 2), but we guarantee that the per-CPU dependencies of XDP hold within 2).

TUN indeed has a NAPI mode, which is mainly used for hardening; XDP was not implemented on that path (this could be fixed in the future).

Thanks
Re: [PATCH v3 2/2] net: core: support XDP generic on stacked devices.
On 2019/5/24 5:33 PM, Jesper Dangaard Brouer wrote:
On Thu, 23 May 2019 13:15:44 -0700 Stephen Hemminger wrote:
On Thu, 23 May 2019 19:19:40 + Saeed Mahameed wrote:
On Thu, 2019-05-23 at 10:54 -0700, Stephen Hemminger wrote:

When a device is stacked like (team, bonding, failsafe or netvsc) the XDP generic program for the parent device was not called.

Move the call to XDP generic inside __netif_receive_skb_core where it can be done multiple times for the stacked case.

Suggested-by: Jiri Pirko
Fixes: d445516966dc ("net: xdp: support xdp generic on virtual devices")
Signed-off-by: Stephen Hemminger
---
v1 - call xdp_generic in netvsc handler
v2 - do xdp_generic in generic rx handler processing
v3 - move xdp_generic call inside the another pass loop

 net/core/dev.c | 56 ++
 1 file changed, 11 insertions(+), 45 deletions(-)

diff --git a/net/core/dev.c b/net/core/dev.c
index b6b8505cfb3e..696776e14d00 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -4502,23 +4502,6 @@ static int netif_rx_internal(struct sk_buff *skb)
 	trace_netif_rx(skb);

-	if (static_branch_unlikely(&generic_xdp_needed_key)) {
-		int ret;
-
-		preempt_disable();
-		rcu_read_lock();
-		ret = do_xdp_generic(rcu_dereference(skb->dev->xdp_prog), skb);
-		rcu_read_unlock();
-		preempt_enable();
-
-		/* Consider XDP consuming the packet a success from
-		 * the netdev point of view we do not want to count
-		 * this as an error.
-		 */
-		if (ret != XDP_PASS)
-			return NET_RX_SUCCESS;
-	}
-

Adding Jesper,

There is a small behavioral change due to this patch: the XDP program after this patch will run on the RPS CPU, if configured, which could cause some behavioral changes in xdp_redirect_cpu: bpf_redirect_map(cpu_map).

Maybe this is acceptable, but it should be documented, as the current assumption dictates: an XDP program runs on the core where the XDP frame/SKB was first seen.

This does break some assumptions that I worry about.
I've not optimized generic XDP much, as this is supposed to be the slow path, but as you can see in my evaluation[1] generic XDP does have performance potential (XDP drop: native=12Mpps and generic=8.4Mpps). However, the XDP-generic performance dies as soon as we e.g. do XDP_TX (native=10Mpps and generic=4Mpps). The reason is lack of bulking.

We could implement the same kind of RX-bulking tricks as we do for XDP_REDIRECT, where bpf_redirect_map stores frames in the map and sends them once the NAPI poll exits via xdp_do_flush_map(). These tricks depend on per-CPU data (bpf_redirect_info), thus I cannot see how this could work if XDP-generic now happens after RPS on a remote CPU.

RPS uses backlog NAPI, so per-CPU is probably not an issue.

Thanks

Notice that TX bulking at the XDP-generic level is actually rather simple, as the netstack TX path code supports xmit_more via creating a list of SKBs... Last time I hacked it up, I saw a 20%-30% speedup... anyone motivated to do this?

Or maybe XDP should just force off RPS (like it does gro)

I guess we could force off RPS. But I do see one valid use-case of combining CPUMAP redirect with RFS (Receive Flow Steering), which is part of RPS. Yes, I know we/I *also* have to implement generic-XDP cpumap support. But for native-XDP CPUMAP redirect, going from 1-2 RX CPUs into N remote CPUs via CPUMAP, and then letting RFS send SKBs to where the application runs, does make sense to me. (I do have an "assignment" to implement this in eBPF here[2].)

[1] https://github.com/netoptimizer/prototype-kernel/blob/master/kernel/Documentation/blogposts/xdp25_eval_generic_xdp_tx.rst
[2] https://github.com/xdp-project/xdp-project/blob/master/areas/cpumap.org#cpumap-implement-dynamic-load-balancer-that-is-ooo-safe
Re: [RFC] virtio-net: share receive_*() and add_recvbuf_*() with virtio-vsock
struct sk_buff *virtskb_receive_small(struct virtskb *vs, ...);
struct sk_buff *virtskb_receive_big(struct virtskb *vs, ...);
struct sk_buff *virtskb_receive_mergeable(struct virtskb *vs, ...);

int virtskb_add_recvbuf_small(struct virtskb *vs, ...);
int virtskb_add_recvbuf_big(struct virtskb *vs, ...);
int virtskb_add_recvbuf_mergeable(struct virtskb *vs, ...);

For the Guest->Host path it should be easier, so maybe I can add a "virtskb_send(struct virtskb *vs, struct sk_buff *skb)" with a part of the code of xmit_skb().

I may miss something, but I don't see anything that prevents us from using xmit_skb() directly.

Yes, but my initial idea was to make it more parametric and not tied to the virtio_net_hdr, so the 'hdr_len' could be a parameter and the 'num_buffers' would be handled by the caller.

Let me know if you have better names in mind or if I should put these functions in another place.

I would like to leave the control part completely separate, so, for example, the two drivers will negotiate the features independently and they will call the right virtskb_receive_*() function based on the negotiation.

If it's only an issue of negotiation, we can simply change virtnet_probe() to deal with different devices.

I already started to work on it, but before taking more steps and sending an RFC patch, I would like to hear your opinion. Do you think that makes sense? Do you see any issue or a better solution?

I still think we need to seek a way of adding some code in virtio-net.c directly if there's no huge difference in the processing of TX/RX. That would save us a lot of time.

After reading the buffers from the virtqueue, I think the process is slightly different, because virtio-net will interface with the network stack, while virtio-vsock will interface with the vsock core (socket).
So virtio-vsock implements the following:
- a flow control mechanism to avoid losing packets, informing the peer about the amount of memory available in the receive queue using some fields in the virtio_vsock_hdr
- de-multiplexing, parsing the virtio_vsock_hdr and choosing the right socket depending on the port
- socket state handling

I think it's just a branch: for ethernet, go to the networking stack; otherwise go to the vsock core?

Yes, that should work. So I should refactor the functions that can also be called from the vsock core, in order to remove the "struct net_device *dev" parameter. Maybe by creating some wrappers for the network stack. Otherwise I should create a fake net_device for vsock_core. What do you suggest?

I'm not quite sure I get the question. Can you just use the one created by virtio_net?

Thanks
[PATCH 6/6] vhost: don't do synchronize_rcu() in vhost_uninit_vq_maps()
There's no need for RCU synchronization in vhost_uninit_vq_maps() since we've already serialized with readers (memory accessors). This also avoids a possible userspace DoS through ioctl() because of the possible high latency caused by synchronize_rcu().

Reported-by: Michael S. Tsirkin
Fixes: 7f466032dc9e ("vhost: access vq metadata through kernel virtual address")
Signed-off-by: Jason Wang
---
 drivers/vhost/vhost.c | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
index 5b8821d00fe4..a17df1f4069a 100644
--- a/drivers/vhost/vhost.c
+++ b/drivers/vhost/vhost.c
@@ -334,7 +334,9 @@ static void vhost_uninit_vq_maps(struct vhost_virtqueue *vq)
 	}
 	spin_unlock(&vq->mmu_lock);

-	synchronize_rcu();
+	/* No need for synchronize_rcu() or kfree_rcu() since we are
+	 * serialized with memory accessors (e.g vq mutex held).
+	 */

 	for (i = 0; i < VHOST_NUM_ADDRS; i++)
 		if (map[i])
-- 
2.18.1
Re: [PATCH v1] tun: mark small packets as owned by the tap sock
On 2019/7/23 9:01 PM, Alexis Bauvin wrote:

Small packets going out of a tap device go through an optimized code path that uses build_skb() rather than sock_alloc_send_pskb(). The latter calls skb_set_owner_w(), but the small packet code path does not.

The net effect is that small packets are not owned by the userland application's socket (e.g. QEMU), while large packets are. This can be seen with a TCP session, where packets are not owned when the window size is small enough (around PAGE_SIZE), while they are once the window grows (note that this requires the host to support virtio tso for the guest to offload segmentation). All this leads to inconsistent behaviour in the kernel, especially in netfilter modules that use sk->socket (e.g. xt_owner).

Signed-off-by: Alexis Bauvin
Fixes: 66ccbc9c87c2 ("tap: use build_skb() for small packet")
---
 drivers/net/tun.c | 71 ---
 1 file changed, 37 insertions(+), 34 deletions(-)

diff --git a/drivers/net/tun.c b/drivers/net/tun.c
index 3d443597bd04..ac56b6a29eb2 100644
--- a/drivers/net/tun.c
+++ b/drivers/net/tun.c
@@ -1656,6 +1656,7 @@ static struct sk_buff *tun_build_skb(struct tun_struct *tun,
 {
 	struct page_frag *alloc_frag = &current->task_frag;
 	struct bpf_prog *xdp_prog;
+	struct sk_buff *skb;
 	int buflen = SKB_DATA_ALIGN(sizeof(struct skb_shared_info));
 	char *buf;
 	size_t copied;
@@ -1686,44 +1687,46 @@ static struct sk_buff *tun_build_skb(struct tun_struct *tun,
 	 */
 	if (hdr->gso_type || !xdp_prog) {
 		*skb_xdp = 1;
-		return __tun_build_skb(alloc_frag, buf, buflen, len, pad);
-	}
-
-	*skb_xdp = 0;
+	} else {
+		*skb_xdp = 0;

-	local_bh_disable();
-	rcu_read_lock();
-	xdp_prog = rcu_dereference(tun->xdp_prog);
-	if (xdp_prog) {
-		struct xdp_buff xdp;
-		u32 act;
-
-		xdp.data_hard_start = buf;
-		xdp.data = buf + pad;
-		xdp_set_data_meta_invalid(&xdp);
-		xdp.data_end = xdp.data + len;
-		xdp.rxq = &tfile->xdp_rxq;
-
-		act = bpf_prog_run_xdp(xdp_prog, &xdp);
-		if (act == XDP_REDIRECT || act == XDP_TX) {
-			get_page(alloc_frag->page);
-			alloc_frag->offset += buflen;
+		local_bh_disable();
+		rcu_read_lock();
+		xdp_prog = rcu_dereference(tun->xdp_prog);
+		if (xdp_prog) {
+			struct xdp_buff xdp;
+			u32 act;
+
+			xdp.data_hard_start = buf;
+			xdp.data = buf + pad;
+			xdp_set_data_meta_invalid(&xdp);
+			xdp.data_end = xdp.data + len;
+			xdp.rxq = &tfile->xdp_rxq;
+
+			act = bpf_prog_run_xdp(xdp_prog, &xdp);
+			if (act == XDP_REDIRECT || act == XDP_TX) {
+				get_page(alloc_frag->page);
+				alloc_frag->offset += buflen;
+			}
+			err = tun_xdp_act(tun, xdp_prog, &xdp, act);
+			if (err < 0)
+				goto err_xdp;
+			if (err == XDP_REDIRECT)
+				xdp_do_flush_map();
+			if (err != XDP_PASS)
+				goto out;
+
+			pad = xdp.data - xdp.data_hard_start;
+			len = xdp.data_end - xdp.data;
 		}
-		err = tun_xdp_act(tun, xdp_prog, &xdp, act);
-		if (err < 0)
-			goto err_xdp;
-		if (err == XDP_REDIRECT)
-			xdp_do_flush_map();
-		if (err != XDP_PASS)
-			goto out;
-
-		pad = xdp.data - xdp.data_hard_start;
-		len = xdp.data_end - xdp.data;
+		rcu_read_unlock();
+		local_bh_enable();
 	}
-	rcu_read_unlock();
-	local_bh_enable();
-
-	return __tun_build_skb(alloc_frag, buf, buflen, len, pad);
+	skb = __tun_build_skb(alloc_frag, buf, buflen, len, pad);
+	if (skb)
+		skb_set_owner_w(skb, tfile->socket.sk);
+	return skb;

 err_xdp:
 	put_page(alloc_frag->page);

To reduce the change set, can you move the skb_set_owner_w() into __tun_build_skb()?

Thanks
Re: [PATCH v2] tun: mark small packets as owned by the tap sock
On 2019/7/23 10:23 PM, Alexis Bauvin wrote:

- v1 -> v2: Move skb_set_owner_w to __tun_build_skb to reduce patch size

Small packets going out of a tap device go through an optimized code path that uses build_skb() rather than sock_alloc_send_pskb(). The latter calls skb_set_owner_w(), but the small packet code path does not.

The net effect is that small packets are not owned by the userland application's socket (e.g. QEMU), while large packets are. This can be seen with a TCP session, where packets are not owned when the window size is small enough (around PAGE_SIZE), while they are once the window grows (note that this requires the host to support virtio tso for the guest to offload segmentation). All this leads to inconsistent behaviour in the kernel, especially in netfilter modules that use sk->socket (e.g. xt_owner).

Signed-off-by: Alexis Bauvin
Fixes: 66ccbc9c87c2 ("tap: use build_skb() for small packet")

Acked-by: Jason Wang

---
 drivers/net/tun.c | 9 ++---
 1 file changed, 6 insertions(+), 3 deletions(-)

diff --git a/drivers/net/tun.c b/drivers/net/tun.c
index 3d443597bd04..db16d7a13e00 100644
--- a/drivers/net/tun.c
+++ b/drivers/net/tun.c
@@ -1599,7 +1599,8 @@ static bool tun_can_build_skb(struct tun_struct *tun, struct tun_file *tfile,
 	return true;
 }

-static struct sk_buff *__tun_build_skb(struct page_frag *alloc_frag, char *buf,
+static struct sk_buff *__tun_build_skb(struct tun_file *tfile,
+				       struct page_frag *alloc_frag, char *buf,
 				       int buflen, int len, int pad)
 {
 	struct sk_buff *skb = build_skb(buf, buflen);
@@ -1609,6 +1610,7 @@ static struct sk_buff *__tun_build_skb(struct page_frag *alloc_frag, char *buf,
 	skb_reserve(skb, pad);
 	skb_put(skb, len);
+	skb_set_owner_w(skb, tfile->socket.sk);

 	get_page(alloc_frag->page);
 	alloc_frag->offset += buflen;
@@ -1686,7 +1688,8 @@ static struct sk_buff *tun_build_skb(struct tun_struct *tun,
 	 */
 	if (hdr->gso_type || !xdp_prog) {
 		*skb_xdp = 1;
-		return __tun_build_skb(alloc_frag, buf, buflen, len, pad);
+		return __tun_build_skb(tfile, alloc_frag, buf, buflen, len,
+				       pad);
 	}

 	*skb_xdp = 0;
@@ -1723,7 +1726,7 @@ static struct sk_buff *tun_build_skb(struct tun_struct *tun,
 	rcu_read_unlock();
 	local_bh_enable();

-	return __tun_build_skb(alloc_frag, buf, buflen, len, pad);
+	return __tun_build_skb(tfile, alloc_frag, buf, buflen, len, pad);

 err_xdp:
 	put_page(alloc_frag->page);
Re: tun: mark small packets as owned by the tap sock
On 2019/8/13 6:19 AM, Dave Jones wrote:
On Wed, Aug 07, 2019 at 12:30:07AM +, Linux Kernel wrote:

 > Commit: 4b663366246be1d1d4b1b8b01245b2e88ad9e706
 > Parent: 16b2084a8afa1432d14ba72b7c97d7908e178178
 > Web: https://git.kernel.org/torvalds/c/4b663366246be1d1d4b1b8b01245b2e88ad9e706
 > Author: Alexis Bauvin
 > AuthorDate: Tue Jul 23 16:23:01 2019 +0200
 >
 > tun: mark small packets as owned by the tap sock
 >
 > - v1 -> v2: Move skb_set_owner_w to __tun_build_skb to reduce patch size
 >
 > Small packets going out of a tap device go through an optimized code
 > path that uses build_skb() rather than sock_alloc_send_pskb(). The
 > latter calls skb_set_owner_w(), but the small packet code path does not.
 >
 > The net effect is that small packets are not owned by the userland
 > application's socket (e.g. QEMU), while large packets are.
 > This can be seen with a TCP session, where packets are not owned when
 > the window size is small enough (around PAGE_SIZE), while they are once
 > the window grows (note that this requires the host to support virtio
 > tso for the guest to offload segmentation).
 > All this leads to inconsistent behaviour in the kernel, especially in
 > netfilter modules that use sk->socket (e.g. xt_owner).
 >
 > Fixes: 66ccbc9c87c2 ("tap: use build_skb() for small packet")
 > Signed-off-by: Alexis Bauvin
 > Acked-by: Jason Wang

This commit breaks ipv6 routing on a linode I deployed it to. It seems to work briefly after boot, and then silently all packets get dropped. (Presumably it's dropping RA or ND packets.)

With this reverted, everything works as it did in rc3.

Dave

Hi:

Two questions:

- Are you using XDP for TUN?
- Does it work before 66ccbc9c87c2? If yes, could you show us the result of net_dropmonitor?

Thanks
Re: [PATCH v3] vhost_vdpa: fix the problem in vhost_vdpa_set_config_call
On 2021/1/12 1:36 PM, Cindy Lu wrote:

In vhost_vdpa_set_config_call, the cb.private should be the vhost_vdpa; this cb.private will eventually be used in vhost_vdpa_config_cb as a vhost_vdpa. Fix this issue.

Fixes: 776f395004d82 ("vhost_vdpa: Support config interrupt in vdpa")
Acked-by: Jason Wang
Signed-off-by: Cindy Lu
---

Hi Cindy:

I think at least you forgot to cc stable.

Thanks

 drivers/vhost/vdpa.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/vhost/vdpa.c b/drivers/vhost/vdpa.c
index ef688c8c0e0e..3fbb9c1f49da 100644
--- a/drivers/vhost/vdpa.c
+++ b/drivers/vhost/vdpa.c
@@ -319,7 +319,7 @@ static long vhost_vdpa_set_config_call(struct vhost_vdpa *v, u32 __user *argp)
 	struct eventfd_ctx *ctx;

 	cb.callback = vhost_vdpa_config_cb;
-	cb.private = v->vdpa;
+	cb.private = v;

 	if (copy_from_user(&fd, argp, sizeof(fd)))
 		return -EFAULT;
Re: [RFC PATCH 0/7] Support for virtio-net hash reporting
On 2021/1/13 7:47 AM, Willem de Bruijn wrote:
On Tue, Jan 12, 2021 at 3:29 PM Yuri Benditovich wrote:
On Tue, Jan 12, 2021 at 9:49 PM Yuri Benditovich wrote:
On Tue, Jan 12, 2021 at 9:41 PM Yuri Benditovich wrote:

The existing TUN module is able to use a provided "steering eBPF" to calculate a per-packet hash and derive the destination queue to place the packet into. The eBPF uses mapped configuration data containing a key for hash calculation and an indirection table with an array of queue indices.

This series of patches adds support for the virtio-net hash reporting feature as defined in the virtio specification. It extends the TUN module and the "steering eBPF" as follows:

The extended steering eBPF calculates the hash value and hash type, keeps the hash value in skb->hash and returns the index of the destination virtqueue and the type of the hash. The TUN module keeps the returned hash type in a (currently unused) field of the skb: skb->__unused, renamed to 'hash_report_type'. When the TUN module is later called to allocate and fill the virtio-net header and push it to the destination virtqueue, it populates the hash and the hash type into the virtio-net header.

The VHOST driver is made aware of the respective virtio-net feature that extends the virtio-net header to report the hash value and hash report type.

Comment from Willem de Bruijn: Skbuff fields are in short supply. I don't think we need to add one just for this narrow path entirely internal to the tun device.

We understand that and try to minimize the impact by using an already existing unused field of the skb.

Not anymore. It was repurposed as a flags field very recently.

This use case is also very narrow in scope. And a very short path from data producer to consumer. So I don't think it needs to claim scarce bits in the skb. tun_ebpf_select_queue stores the field, tun_put_user reads it and converts it to the virtio_net_hdr in the descriptor. tun_ebpf_select_queue is called from .ndo_select_queue.
Storing the field in skb->cb is fragile, as in theory some code could overwrite that field between ndo_select_queue and ndo_start_xmit/tun_net_xmit, from which point it is fully under tun control again. But in practice, I don't believe anything does. Alternatively, an existing skb field that is used only on disjoint datapaths, such as ingress-only, could be viable.

A question here. We had metadata support in XDP for cooperation between eBPF programs. Do we have something similar in the skb? E.g. in RSS, if we want to pass some metadata between an eBPF program and the logic that generates the vnet header (either hard logic in the kernel or another eBPF program), is there any way to avoid possible conflicts with qdiscs?

Instead, you could just run the flow_dissector in tun_put_user if the feature is negotiated.

Indeed, the flow dissector seems more apt to me than BPF here. Note that the flow dissector internally can be overridden by a BPF program if the admin so chooses.

While this set of patches is related to hash delivery in the virtio-net packet in general, it was prepared in the context of the RSS feature implementation as defined in the virtio spec [1]. In the case of RSS it is not enough to run the flow_dissector in tun_put_user: in tun_ebpf_select_queue the TUN calls eBPF to calculate the hash, hash type and queue index according to the (mapped) parameters (key, hash types, indirection table) received from the guest.

TUNSETSTEERINGEBPF was added to support more diverse queue selection than the default in the case of multiqueue tun. Not sure what the exact use cases are. But RSS is exactly the purpose of the flow dissector. It is used for that purpose in the software variant, RPS. The flow dissector implements a superset of the RSS spec, and certainly computes a four-tuple for TCP/IPv6. In the case of RPS, it is skipped if the NIC has already computed a 4-tuple hash.

What it does not give is a type indication, such as VIRTIO_NET_HASH_TYPE_TCPv6.
I don't understand how this would be used. In datapaths where the NIC has already computed the four-tuple hash and stored it in skb->hash - the common case for servers - that type field is the only reason to have to compute again.

The problem is that there's no guarantee that the packet comes from the NIC; it could be a simple VM2VM or host2VM packet. And even if the packet is coming from a NIC that calculates the hash, there's no guarantee that it's the hash that the guest wants (the guest may use different RSS keys).

Thanks

Our intention is to keep the hash and hash type in the skb to populate them into a virtio-net header later in tun_put_user. Note that in this case the type of calculated hash is selected not only from flow dissection but also from the limitations provided by the guest. This is already implemented in qemu (for the case of vhost=off), see [2] (virtio_net_process_rss). For the case of vhost=on there is WIP for qemu to load eBPF and attach it to TUN.

Note that the exact way of selecting the rx virtqueue depends on the guest,
Re: [RFC PATCH 0/7] Support for virtio-net hash reporting
On 2021/1/13 下午10:33, Willem de Bruijn wrote: On Tue, Jan 12, 2021 at 11:11 PM Jason Wang wrote: On 2021/1/13 上午7:47, Willem de Bruijn wrote: On Tue, Jan 12, 2021 at 3:29 PM Yuri Benditovich wrote: On Tue, Jan 12, 2021 at 9:49 PM Yuri Benditovich wrote: On Tue, Jan 12, 2021 at 9:41 PM Yuri Benditovich wrote: Existing TUN module is able to use provided "steering eBPF" to calculate per-packet hash and derive the destination queue to place the packet to. The eBPF uses mapped configuration data containing a key for hash calculation and indirection table with array of queues' indices. This series of patches adds support for virtio-net hash reporting feature as defined in virtio specification. It extends the TUN module and the "steering eBPF" as follows: Extended steering eBPF calculates the hash value and hash type, keeps hash value in the skb->hash and returns index of destination virtqueue and the type of the hash. TUN module keeps returned hash type in (currently unused) field of the skb. skb->__unused renamed to 'hash_report_type'. When TUN module is called later to allocate and fill the virtio-net header and push it to destination virtqueue it populates the hash and the hash type into virtio-net header. VHOST driver is made aware of respective virtio-net feature that extends the virtio-net header to report the hash value and hash report type. Comment from Willem de Bruijn: Skbuff fields are in short supply. I don't think we need to add one just for this narrow path entirely internal to the tun device. We understand that and try to minimize the impact by using an already existing unused field of skb. Not anymore. It was repurposed as a flags field very recently. This use case is also very narrow in scope. And a very short path from data producer to consumer. So I don't think it needs to claim scarce bits in the skb. tun_ebpf_select_queue stores the field, tun_put_user reads it and converts it to the virtio_net_hdr in the descriptor. 
tun_ebpf_select_queue is called from .ndo_select_queue. Storing the field in skb->cb is fragile, as in theory some code could overwrite that field between ndo_select_queue and ndo_start_xmit/tun_net_xmit, from which point it is fully under tun control again. But in practice, I don't believe anything does. Alternatively an existing skb field that is used only on disjoint datapaths, such as ingress-only, could be viable. A question here. We had metadata support in XDP for cooperation between eBPF programs. Do we have something similar in the skb? E.g. in the RSS, if we want to pass some metadata information between the eBPF program and the logic that generates the vnet header (either hard logic in the kernel or another eBPF program). Is there any way that can avoid the possible conflicts of qdiscs? Not that I am aware of. The closest thing is cb[]. It'll have to alias a field like that, that is known unused for the given path. Right, we need to make sure cb is not used by other ones. I'm not sure how hard to achieve that, considering Qemu installs the eBPF program but it doesn't deal with networking configurations. One other approach that has been used within linear call stacks is out of band. Like percpu variables softnet_data.xmit.more and mirred_rec_level. But that is perhaps a bit overwrought for this use case. Yes, and if we go that way then eBPF turns out to be a burden since we need to invent helpers to access those auxiliary data structures. It would be better then to hard-code the RSS in the kernel. Instead, you could just run the flow_dissector in tun_put_user if the feature is negotiated. Indeed, the flow dissector seems more apt to me than BPF here. Note that the flow dissector internally can be overridden by a BPF program if the admin so chooses.
When this set of patches is related to hash delivery in the virtio-net packet in general, it was prepared in context of RSS feature implementation as defined in virtio spec [1] In case of RSS it is not enough to run the flow_dissector in tun_put_user: in tun_ebpf_select_queue the TUN calls eBPF to calculate the hash, hash type and queue index according to the (mapped) parameters (key, hash types, indirection table) received from the guest. TUNSETSTEERINGEBPF was added to support more diverse queue selection than the default in case of multiqueue tun. Not sure what the exact use cases are. But RSS is exactly the purpose of the flow dissector. It is used for that purpose in the software variant RPS. The flow dissector implements a superset of the RSS spec, and certainly computes a four-tuple for TCP/IPv6. In the case of RPS, it is skipped if the NIC has already computed a 4-tuple hash. What it does not give is a type indication, such as VIRTIO_NET_HASH_TYPE_TCPv6. I don't understand how this would be used. In datapaths where the NIC has already computed the four-tuple hash and stored it in s
Re: [PATCH linux-next v3 6/6] vdpa_sim_net: Add support for user supported devices
On 2021/1/7 上午11:48, Parav Pandit wrote: From: Michael S. Tsirkin Sent: Tuesday, January 5, 2021 6:53 PM On Tue, Jan 05, 2021 at 12:30:15PM +, Parav Pandit wrote: From: Michael S. Tsirkin Sent: Tuesday, January 5, 2021 5:45 PM On Tue, Jan 05, 2021 at 12:02:33PM +, Parav Pandit wrote: From: Michael S. Tsirkin Sent: Tuesday, January 5, 2021 5:19 PM On Tue, Jan 05, 2021 at 12:32:03PM +0200, Parav Pandit wrote: Enable user to create vdpasim net simulate devices. $ vdpa dev add mgmtdev vdpasim_net name foo2 Show the newly created vdpa device by its name: $ vdpa dev show foo2 foo2: type network mgmtdev vdpasim_net vendor_id 0 max_vqs 2 max_vq_size 256 $ vdpa dev show foo2 -jp { "dev": { "foo2": { "type": "network", "mgmtdev": "vdpasim_net", "vendor_id": 0, "max_vqs": 2, "max_vq_size": 256 } } } I'd like an example of how do device specific (e.g. net specific) interfaces tie in to this. Not sure I follow your question. Do you mean how to set mac address or mtu of this vdpa device of type net? If so, dev add command will be extended shortly in subsequent series to set this net specific attributes. (I did mention in the next steps in cover letter). +static int __init vdpasim_net_init(void) { + int ret; + + if (macaddr) { + mac_pton(macaddr, macaddr_buf); + if (!is_valid_ether_addr(macaddr_buf)) + return -EADDRNOTAVAIL; + } else { + eth_random_addr(macaddr_buf); } Hmm so all devices start out with the same MAC until changed? And how is the change effected? Post this patchset and post we have iproute2 vdpa in the tree, will add the mac address as the input attribute during "vdpa dev add" command. So that each different vdpa device can have user specified (different) mac address. For now maybe just avoid VIRTIO_NET_F_MAC then for new devices then? That would require book keeping existing net vdpa_sim devices created to avoid setting VIRTIO_NET_F_MAC. Such book keeping code will be short lived anyway. Not sure if its worth it. Until now only one device was created. 
So not sure two vdpa devices with the same mac address will be a real issue. When we add the mac address attribute to the add command, at that point we will also remove the module parameter macaddr. Will that be mandatory? I'm not too happy with a UAPI we intend to break straight away ... No. Specifying a mac address shouldn't be mandatory. The UAPI won't be broken. If it's not mandatory, does it mean the vDPA parent needs to use its own logic to generate a valid mac? I'm not sure this is what management (libvirt) wants. Thanks
Re: [PATCH linux-next v3 6/6] vdpa_sim_net: Add support for user supported devices
On 2021/1/14 下午3:58, Parav Pandit wrote: From: Jason Wang Sent: Thursday, January 14, 2021 9:48 AM On 2021/1/7 上午11:48, Parav Pandit wrote: From: Michael S. Tsirkin Sent: Tuesday, January 5, 2021 6:53 PM On Tue, Jan 05, 2021 at 12:30:15PM +, Parav Pandit wrote: From: Michael S. Tsirkin Sent: Tuesday, January 5, 2021 5:45 PM On Tue, Jan 05, 2021 at 12:02:33PM +, Parav Pandit wrote: From: Michael S. Tsirkin Sent: Tuesday, January 5, 2021 5:19 PM On Tue, Jan 05, 2021 at 12:32:03PM +0200, Parav Pandit wrote: Enable user to create vdpasim net simulate devices. $ vdpa dev add mgmtdev vdpasim_net name foo2 Show the newly created vdpa device by its name: $ vdpa dev show foo2 foo2: type network mgmtdev vdpasim_net vendor_id 0 max_vqs 2 max_vq_size 256 $ vdpa dev show foo2 -jp { "dev": { "foo2": { "type": "network", "mgmtdev": "vdpasim_net", "vendor_id": 0, "max_vqs": 2, "max_vq_size": 256 } } } I'd like an example of how do device specific (e.g. net specific) interfaces tie in to this. Not sure I follow your question. Do you mean how to set mac address or mtu of this vdpa device of type net? If so, dev add command will be extended shortly in subsequent series to set this net specific attributes. (I did mention in the next steps in cover letter). +static int __init vdpasim_net_init(void) { + int ret; + + if (macaddr) { + mac_pton(macaddr, macaddr_buf); + if (!is_valid_ether_addr(macaddr_buf)) + return -EADDRNOTAVAIL; + } else { + eth_random_addr(macaddr_buf); } Hmm so all devices start out with the same MAC until changed? And how is the change effected? Post this patchset and post we have iproute2 vdpa in the tree, will add the mac address as the input attribute during "vdpa dev add" command. So that each different vdpa device can have user specified (different) mac address. For now maybe just avoid VIRTIO_NET_F_MAC then for new devices then? That would require book keeping existing net vdpa_sim devices created to avoid setting VIRTIO_NET_F_MAC. 
Such book keeping code will be short lived anyway. Not sure if it's worth it. Until now only one device was created. So not sure two vdpa devices with the same mac address will be a real issue. When we add the mac address attribute to the add command, at that point we will also remove the module parameter macaddr. Will that be mandatory? I'm not too happy with a UAPI we intend to break straight away ... No. Specifying a mac address shouldn't be mandatory. The UAPI won't be broken. If it's not mandatory, does it mean the vDPA parent needs to use its own logic to generate a valid mac? I'm not sure this is what management (libvirt) wants. There are a few use cases that I see with PFs, VFs and SFs supporting vdpa devices. 1. The user wants to use the VF only for vdpa purposes. Here the user got a VF which was pre-setup by the sysadmin. In this case whatever MAC is assigned to the VF can be used by its vdpa device. Here, the user doesn't need to pass the mac address at vdpa device creation time. This is done as the same MAC has been set up in the ACL rules on the switch side. Non-VDPA users of a VF typically use the VF this way for Netdev and rdma functionality. They might continue the same way for a vdpa application as well. Here the VF mac is set either using (a) the devlink port function set hw_addr command or (b) ip link set vf mac. So the vdpa tool didn't pass the mac (optional). Though VIRTIO_NET_F_MAC is still valid. 2. The user may want to create one or more vdpa devices out of the mgmt. device. Here the user wants more/full control of all features, overriding what the sysadmin has set up as the MAC of the VF/SF. In this case the user will specify the MAC via the mgmt tool. (a) This is also used by those vdpa devices which don't have eswitch offloads. (b) This will work with eswitch offloads as well, which do source learning. (c) The user chooses to use the vdpa device of a VF while the VF Netdev and rdma device are used by the hypervisor for something else as well. VIRTIO_NET_F_MAC remains valid in all of 2.{a,b,c}. 3. A vendor mgmt.
device always expects its user to provide a mac for its vdpa devices. So when it is not provided, it can fail with an error message string in extack, or clear VIRTIO_NET_F_MAC and let it work using virtio spec's 5.1.5 point 5 to proceed. As a common denominator of all the above cases, if QEMU or the user passes the MAC during creation, it will almost always work. Advanced users and QEMU with switchdev mode support, who have done 1.a/1.b, will omit it. I do not know how deep QEMU's integration with the switchdev mode support is. With that, mac and mtu as optional input fields provide the necessary flexibility for different stacks t
Re: [PATCH net-next v7] vhost_net: avoid tx queue stuck when sendmsg fails
On 2021/1/15 下午12:46, wangyunjian wrote: From: Yunjian Wang Currently the driver doesn't drop a packet which can't be sent by tun (e.g. bad packet). In this case, the driver will always process the same packet, leading to the tx queue getting stuck. To fix this issue: 1. in the case of persistent failure (e.g. bad packet), the driver can skip this descriptor by ignoring the error. 2. in the case of transient failure (e.g. -ENOBUFS, -EAGAIN and -ENOMEM), the driver schedules the worker to try again. Signed-off-by: Yunjian Wang Acked-by: Jason Wang --- v7: * code rebase v6: * update code styles and commit log --- drivers/vhost/net.c | 26 ++ 1 file changed, 14 insertions(+), 12 deletions(-) diff --git a/drivers/vhost/net.c b/drivers/vhost/net.c index 3b744031ec8f..df82b124170e 100644 --- a/drivers/vhost/net.c +++ b/drivers/vhost/net.c @@ -828,14 +828,15 @@ static void handle_tx_copy(struct vhost_net *net, struct socket *sock) msg.msg_flags &= ~MSG_MORE; } - /* TODO: Check specific error and bomb out unless ENOBUFS? */ err = sock->ops->sendmsg(sock, &msg, len); if (unlikely(err < 0)) { - vhost_discard_vq_desc(vq, 1); - vhost_net_enable_vq(net, vq); - break; - } - if (err != len) + if (err == -EAGAIN || err == -ENOMEM || err == -ENOBUFS) { + vhost_discard_vq_desc(vq, 1); + vhost_net_enable_vq(net, vq); + break; + } + pr_debug("Fail to send packet: err %d", err); + } else if (unlikely(err != len)) pr_debug("Truncated TX packet: len %d != %zd\n", err, len); done: @@ -924,7 +925,6 @@ static void handle_tx_zerocopy(struct vhost_net *net, struct socket *sock) msg.msg_flags &= ~MSG_MORE; } - /* TODO: Check specific error and bomb out unless ENOBUFS?
*/ err = sock->ops->sendmsg(sock, &msg, len); if (unlikely(err < 0)) { if (zcopy_used) { @@ -933,11 +933,13 @@ static void handle_tx_zerocopy(struct vhost_net *net, struct socket *sock) nvq->upend_idx = ((unsigned)nvq->upend_idx - 1) % UIO_MAXIOV; } - vhost_discard_vq_desc(vq, 1); - vhost_net_enable_vq(net, vq); - break; - } - if (err != len) + if (err == -EAGAIN || err == -ENOMEM || err == -ENOBUFS) { + vhost_discard_vq_desc(vq, 1); + vhost_net_enable_vq(net, vq); + break; + } + pr_debug("Fail to send packet: err %d", err); + } else if (unlikely(err != len)) pr_debug("Truncated TX packet: " " len %d != %zd\n", err, len); if (!zcopy_used)
Re: [RFC PATCH 0/7] Support for virtio-net hash reporting
On 2021/1/17 下午3:57, Yuri Benditovich wrote: On Thu, Jan 14, 2021 at 5:39 AM Jason Wang wrote: On 2021/1/13 下午10:33, Willem de Bruijn wrote: On Tue, Jan 12, 2021 at 11:11 PM Jason Wang wrote: On 2021/1/13 上午7:47, Willem de Bruijn wrote: On Tue, Jan 12, 2021 at 3:29 PM Yuri Benditovich wrote: On Tue, Jan 12, 2021 at 9:49 PM Yuri Benditovich wrote: On Tue, Jan 12, 2021 at 9:41 PM Yuri Benditovich wrote: Existing TUN module is able to use provided "steering eBPF" to calculate per-packet hash and derive the destination queue to place the packet to. The eBPF uses mapped configuration data containing a key for hash calculation and indirection table with array of queues' indices. This series of patches adds support for virtio-net hash reporting feature as defined in virtio specification. It extends the TUN module and the "steering eBPF" as follows: Extended steering eBPF calculates the hash value and hash type, keeps hash value in the skb->hash and returns index of destination virtqueue and the type of the hash. TUN module keeps returned hash type in (currently unused) field of the skb. skb->__unused renamed to 'hash_report_type'. When TUN module is called later to allocate and fill the virtio-net header and push it to destination virtqueue it populates the hash and the hash type into virtio-net header. VHOST driver is made aware of respective virtio-net feature that extends the virtio-net header to report the hash value and hash report type. Comment from Willem de Bruijn: Skbuff fields are in short supply. I don't think we need to add one just for this narrow path entirely internal to the tun device. We understand that and try to minimize the impact by using an already existing unused field of skb. Not anymore. It was repurposed as a flags field very recently. This use case is also very narrow in scope. And a very short path from data producer to consumer. So I don't think it needs to claim scarce bits in the skb. 
tun_ebpf_select_queue stores the field, tun_put_user reads it and converts it to the virtio_net_hdr in the descriptor. tun_ebpf_select_queue is called from .ndo_select_queue. Storing the field in skb->cb is fragile, as in theory some code could overwrite that field between ndo_select_queue and ndo_start_xmit/tun_net_xmit, from which point it is fully under tun control again. But in practice, I don't believe anything does. Alternatively an existing skb field that is used only on disjoint datapaths, such as ingress-only, could be viable. A question here. We had metadata support in XDP for cooperation between eBPF programs. Do we have something similar in the skb? E.g. in the RSS, if we want to pass some metadata information between the eBPF program and the logic that generates the vnet header (either hard logic in the kernel or another eBPF program). Is there any way that can avoid the possible conflicts of qdiscs? Not that I am aware of. The closest thing is cb[]. It'll have to alias a field like that, that is known unused for the given path. Right, we need to make sure cb is not used by other ones. I'm not sure how hard to achieve that, considering Qemu installs the eBPF program but it doesn't deal with networking configurations. One other approach that has been used within linear call stacks is out of band. Like percpu variables softnet_data.xmit.more and mirred_rec_level. But that is perhaps a bit overwrought for this use case. Yes, and if we go that way then eBPF turns out to be a burden since we need to invent helpers to access those auxiliary data structures. It would be better then to hard-code the RSS in the kernel. Instead, you could just run the flow_dissector in tun_put_user if the feature is negotiated. Indeed, the flow dissector seems more apt to me than BPF here. Note that the flow dissector internally can be overridden by a BPF program if the admin so chooses.
When this set of patches is related to hash delivery in the virtio-net packet in general, it was prepared in context of RSS feature implementation as defined in virtio spec [1] In case of RSS it is not enough to run the flow_dissector in tun_put_user: in tun_ebpf_select_queue the TUN calls eBPF to calculate the hash, hash type and queue index according to the (mapped) parameters (key, hash types, indirection table) received from the guest. TUNSETSTEERINGEBPF was added to support more diverse queue selection than the default in case of multiqueue tun. Not sure what the exact use cases are. But RSS is exactly the purpose of the flow dissector. It is used for that purpose in the software variant RPS. The flow dissector implements a superset of the RSS spec, and certainly computes a four-tuple for TCP/IPv6. In the case of RPS, it is skipped if the NIC has already computed a 4-tuple hash. What it does not give is a type indication, such as VIRTIO_NET_HASH_TYPE_TCPv6. I don't understand how this would be used.
Re: [PATCH net-next v2 0/7] virtio-net support xdp socket zero copy xmit
On 2021/1/16 上午10:59, Xuan Zhuo wrote: XDP socket is an excellent kernel-bypass network transmission framework. The zero copy feature of xsk (XDP socket) needs to be supported by the driver, and the performance of zero copy is very good. mlx5 and Intel ixgbe already support this feature. This patch set allows virtio-net to support xsk's zerocopy xmit feature. xsk's zerocopy rx requires major changes to virtio-net, and I hope to submit it after this patch set is accepted. Compared with other drivers, virtio-net does not directly obtain the dma address, so I first obtain the xsk page, and then pass the page to virtio. When recycling the sent packets, we previously had to distinguish between skb and xdp; now we have to distinguish between skb, xdp and xsk, so the second patch solves this problem first. The last four patches are used to support xsk zerocopy in virtio-net: 1. support xsk enable/disable 2. realize the function of xsk packet sending 3. implement xsk wakeup callback 4. set xsk completed when packet sent done. Performance Testing: a udp packet tool implemented on the xsk interface vs sockperf (kernel udp):

xsk zero copy in virtio-net:
CPU     PPS       MSGSIZE
28.7%   3833857   64
38.5%   3689491   512
38.9%   2787096   1456

Some questions on the results: 1) What's the setup on the vhost? 2) What's the setup of the mitigation in both host and guest? 3) Any analysis of the possible bottleneck via perf or other tools?
Thanks

xsk without zero copy in virtio-net:
CPU     PPS       MSGSIZE
100%    1916747   64
100%    1775988   512
100%    1440054   1456

sockperf:
CPU     PPS       MSGSIZE
100%    713274    64
100%    701024    512
100%    695832    1456

Xuan Zhuo (7):
  xsk: support get page for drv
  virtio-net, xsk: distinguish XDP_TX and XSK XMIT ctx
  xsk, virtio-net: prepare for support xsk zerocopy xmit
  virtio-net, xsk: support xsk enable/disable
  virtio-net, xsk: realize the function of xsk packet sending
  virtio-net, xsk: implement xsk wakeup callback
  virtio-net, xsk: set xsk completed when packet sent done

 drivers/net/virtio_net.c    | 559 +++-
 include/linux/netdevice.h   |   1 +
 include/net/xdp_sock_drv.h  |  10 +
 include/net/xsk_buff_pool.h |   1 +
 net/xdp/xsk_buff_pool.c     |  10 +-
 5 files changed, 523 insertions(+), 58 deletions(-)

--
1.8.3.1
Re: [PATCH net-next v2 2/7] virtio-net, xsk: distinguish XDP_TX and XSK XMIT ctx
On 2021/1/16 上午10:59, Xuan Zhuo wrote: If support xsk, a new ptr will be recovered during the process of freeing the old ptr. In order to distinguish between ctx sent by XDP_TX and ctx sent by xsk, a struct is added here to distinguish between these two situations. virtnet_xdp_type.type It is used to distinguish different ctx, and virtnet_xdp_type.offset is used to record the offset between "true ctx" and virtnet_xdp_type. The newly added virtnet_xsk_hdr will be used for xsk. Signed-off-by: Xuan Zhuo Any reason that you can't simply encode the type in the pointer itself as we used to do? #define VIRTIO_XSK_FLAG BIT(1) ? --- drivers/net/virtio_net.c | 75 ++-- 1 file changed, 60 insertions(+), 15 deletions(-) diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c index ba8e637..e707c31 100644 --- a/drivers/net/virtio_net.c +++ b/drivers/net/virtio_net.c @@ -94,6 +94,22 @@ struct virtnet_rq_stats { u64 kicks; }; +enum { + XDP_TYPE_XSK, + XDP_TYPE_TX, +}; + +struct virtnet_xdp_type { + int offset:24; + unsigned type:8; +}; + +struct virtnet_xsk_hdr { + struct virtnet_xdp_type type; + struct virtio_net_hdr_mrg_rxbuf hdr; + u32 len; +}; + #define VIRTNET_SQ_STAT(m)offsetof(struct virtnet_sq_stats, m) #define VIRTNET_RQ_STAT(m)offsetof(struct virtnet_rq_stats, m) @@ -251,14 +267,19 @@ static bool is_xdp_frame(void *ptr) return (unsigned long)ptr & VIRTIO_XDP_FLAG; } -static void *xdp_to_ptr(struct xdp_frame *ptr) +static void *xdp_to_ptr(struct virtnet_xdp_type *ptr) { return (void *)((unsigned long)ptr | VIRTIO_XDP_FLAG); } -static struct xdp_frame *ptr_to_xdp(void *ptr) +static struct virtnet_xdp_type *ptr_to_xtype(void *ptr) +{ + return (struct virtnet_xdp_type *)((unsigned long)ptr & ~VIRTIO_XDP_FLAG); +} + +static void *xtype_get_ptr(struct virtnet_xdp_type *xdptype) { - return (struct xdp_frame *)((unsigned long)ptr & ~VIRTIO_XDP_FLAG); + return (char *)xdptype + xdptype->offset; } /* Converting between virtqueue no. and kernel tx/rx queue no. 
@@ -459,11 +480,16 @@ static int __virtnet_xdp_xmit_one(struct virtnet_info *vi, struct xdp_frame *xdpf) { struct virtio_net_hdr_mrg_rxbuf *hdr; + struct virtnet_xdp_type *xdptype; int err; - if (unlikely(xdpf->headroom < vi->hdr_len)) + if (unlikely(xdpf->headroom < vi->hdr_len + sizeof(*xdptype))) return -EOVERFLOW; + xdptype = (struct virtnet_xdp_type *)(xdpf + 1); + xdptype->offset = (char *)xdpf - (char *)xdptype; + xdptype->type = XDP_TYPE_TX; + /* Make room for virtqueue hdr (also change xdpf->headroom?) */ xdpf->data -= vi->hdr_len; /* Zero header and leave csum up to XDP layers */ @@ -473,7 +499,7 @@ static int __virtnet_xdp_xmit_one(struct virtnet_info *vi, sg_init_one(sq->sg, xdpf->data, xdpf->len); - err = virtqueue_add_outbuf(sq->vq, sq->sg, 1, xdp_to_ptr(xdpf), + err = virtqueue_add_outbuf(sq->vq, sq->sg, 1, xdp_to_ptr(xdptype), GFP_ATOMIC); if (unlikely(err)) return -ENOSPC; /* Caller handle free/refcnt */ @@ -523,8 +549,11 @@ static int virtnet_xdp_xmit(struct net_device *dev, /* Free up any pending old buffers before queueing new ones. 
*/ while ((ptr = virtqueue_get_buf(sq->vq, &len)) != NULL) { if (likely(is_xdp_frame(ptr))) { - struct xdp_frame *frame = ptr_to_xdp(ptr); + struct virtnet_xdp_type *xtype; + struct xdp_frame *frame; + xtype = ptr_to_xtype(ptr); + frame = xtype_get_ptr(xtype); bytes += frame->len; xdp_return_frame(frame); } else { @@ -1373,24 +1402,34 @@ static int virtnet_receive(struct receive_queue *rq, int budget, static void free_old_xmit_skbs(struct send_queue *sq, bool in_napi) { - unsigned int len; unsigned int packets = 0; unsigned int bytes = 0; - void *ptr; + unsigned int len; + struct virtnet_xdp_type *xtype; + struct xdp_frame*frame; + struct virtnet_xsk_hdr *xskhdr; + struct sk_buff *skb; + void*ptr; while ((ptr = virtqueue_get_buf(sq->vq, &len)) != NULL) { if (likely(!is_xdp_frame(ptr))) { - struct sk_buff *skb = ptr; + skb = ptr; pr_debug("Sent skb %p\n", skb); bytes += skb->len; napi_consume_skb(skb, in_napi); } else { - struct xdp_frame *frame = ptr_to_xdp(ptr); + xtype = ptr_to_xtype(ptr); - bytes += frame->len; - xdp_return_fra
Re: [PATCH net-next v2 5/7] virtio-net, xsk: realize the function of xsk packet sending
On 2021/1/16 上午10:59, Xuan Zhuo wrote: virtnet_xsk_run will be called in the tx interrupt handling function virtnet_poll_tx. The sending process gets desc from the xsk tx queue, and assembles it to send the data. Compared with other drivers, a special place is that the page of the data in xsk is used here instead of the dma address. Because the virtio interface does not use the dma address. Signed-off-by: Xuan Zhuo --- drivers/net/virtio_net.c | 200 ++- 1 file changed, 197 insertions(+), 3 deletions(-) diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c index a62d456..42aa9ad 100644 --- a/drivers/net/virtio_net.c +++ b/drivers/net/virtio_net.c @@ -119,6 +119,8 @@ struct virtnet_xsk_hdr { u32 len; }; +#define VIRTNET_STATE_XSK_WAKEUP 1 + #define VIRTNET_SQ_STAT(m)offsetof(struct virtnet_sq_stats, m) #define VIRTNET_RQ_STAT(m)offsetof(struct virtnet_rq_stats, m) @@ -163,9 +165,12 @@ struct send_queue { struct xsk_buff_pool __rcu *pool; struct virtnet_xsk_hdr __rcu *hdr; + unsigned long state; u64hdr_con; u64hdr_pro; u64hdr_n; + struct xdp_desclast_desc; + bool wait_slot; } xsk; }; @@ -284,6 +289,8 @@ static void __free_old_xmit_ptr(struct send_queue *sq, bool in_napi, bool xsk_wakeup, unsigned int *_packets, unsigned int *_bytes); static void free_old_xmit_skbs(struct send_queue *sq, bool in_napi); +static int virtnet_xsk_run(struct send_queue *sq, + struct xsk_buff_pool *pool, int budget); static bool is_xdp_frame(void *ptr) { @@ -1590,6 +1597,8 @@ static int virtnet_poll_tx(struct napi_struct *napi, int budget) struct virtnet_info *vi = sq->vq->vdev->priv; unsigned int index = vq2txq(sq->vq); struct netdev_queue *txq; + struct xsk_buff_pool *pool; + int work = 0; if (unlikely(is_xdp_raw_buffer_queue(vi, index))) { /* We don't need to enable cb for XDP */ @@ -1599,15 +1608,26 @@ static int virtnet_poll_tx(struct napi_struct *napi, int budget) txq = netdev_get_tx_queue(vi->dev, index); __netif_tx_lock(txq, raw_smp_processor_id()); - 
free_old_xmit_skbs(sq, true); + + rcu_read_lock(); + pool = rcu_dereference(sq->xsk.pool); + if (pool) { + work = virtnet_xsk_run(sq, pool, budget); + rcu_read_unlock(); + } else { + rcu_read_unlock(); + free_old_xmit_skbs(sq, true); + } + __netif_tx_unlock(txq); - virtqueue_napi_complete(napi, sq->vq, 0); + if (work < budget) + virtqueue_napi_complete(napi, sq->vq, 0); if (sq->vq->num_free >= 2 + MAX_SKB_FRAGS) netif_tx_wake_queue(txq); - return 0; + return work; } static int xmit_skb(struct send_queue *sq, struct sk_buff *skb) @@ -2647,6 +2667,180 @@ static int virtnet_xdp(struct net_device *dev, struct netdev_bpf *xdp) } } +static int virtnet_xsk_xmit(struct send_queue *sq, struct xsk_buff_pool *pool, + struct xdp_desc *desc) +{ + struct virtnet_info *vi = sq->vq->vdev->priv; + void *data, *ptr; + struct page *page; + struct virtnet_xsk_hdr *xskhdr; + u32 idx, offset, n, i, copy, copied; + u64 addr; + int err, m; + + addr = desc->addr; + + data = xsk_buff_raw_get_data(pool, addr); + offset = offset_in_page(data); + + /* one for hdr, one for the first page */ + n = 2; + m = desc->len - (PAGE_SIZE - offset); + if (m > 0) { + n += m >> PAGE_SHIFT; + if (m & PAGE_MASK) + ++n; + + n = min_t(u32, n, ARRAY_SIZE(sq->sg)); + } + + idx = sq->xsk.hdr_con % sq->xsk.hdr_n; I don't understand the reason of the hdr array. It looks to me all of them are zero and read only from device. Any reason for not reusing a single hdr for all xdp descriptors? Or maybe it's time to introduce VIRTIO_NET_F_NO_HDR. 
+ xskhdr = &sq->xsk.hdr[idx]; + + /* xskhdr->hdr has been memset to zero, so not need to clear again */ + + sg_init_table(sq->sg, n); + sg_set_buf(sq->sg, &xskhdr->hdr, vi->hdr_len); + + copied = 0; + for (i = 1; i < n; ++i) { + copy = min_t(int, desc->len - copied, PAGE_SIZE - offset); + + page = xsk_buff_raw_get_page(pool, addr + copied); + + sg_set_page(sq->sg + i, page, copy, offset); + copied += copy; + if (offset) + offset = 0; + } It looks to me we need to terminate the sg: ** * virtqueue_add_o
Re: [PATCH net-next v2 6/7] virtio-net, xsk: implement xsk wakeup callback
On 2021/1/16 上午10:59, Xuan Zhuo wrote: Since I did not find an interface to directly notify virtio to generate a tx interrupt, I sent some data to trigger a new tx interrupt. Another advantage of this is that the transmission delay will be relatively small, and there is no need to wait for the tx interrupt to start softirq. Signed-off-by: Xuan Zhuo --- drivers/net/virtio_net.c | 51 1 file changed, 51 insertions(+) diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c index 42aa9ad..e552c2d 100644 --- a/drivers/net/virtio_net.c +++ b/drivers/net/virtio_net.c @@ -2841,6 +2841,56 @@ static int virtnet_xsk_run(struct send_queue *sq, return ret; } +static int virtnet_xsk_wakeup(struct net_device *dev, u32 qid, u32 flag) +{ + struct virtnet_info *vi = netdev_priv(dev); + struct send_queue *sq; + struct xsk_buff_pool *pool; + struct netdev_queue *txq; + + if (!netif_running(dev)) + return -ENETDOWN; + + if (qid >= vi->curr_queue_pairs) + return -EINVAL; + + sq = &vi->sq[qid]; + + rcu_read_lock(); + + pool = rcu_dereference(sq->xsk.pool); + if (!pool) + goto end; + + if (test_and_set_bit(VIRTNET_STATE_XSK_WAKEUP, &sq->xsk.state)) + goto end; + + txq = netdev_get_tx_queue(dev, qid); + + local_bh_disable(); + __netif_tx_lock(txq, raw_smp_processor_id()); You can use __netif_tx_lock_bh(). Thanks + + /* Send part of the package directly to reduce the delay in sending the +* package, and this can actively trigger the tx interrupts. +* +* If the package is not processed, then continue processing in the +* subsequent tx interrupt(virtnet_poll_tx). +* +* If no packet is sent out, the ring of the device is full. In this +* case, we will still get a tx interrupt response. Then we will deal +* with the subsequent packet sending work. 
+*/ + + virtnet_xsk_run(sq, pool, xsk_budget); + + __netif_tx_unlock(txq); + local_bh_enable(); + +end: + rcu_read_unlock(); + return 0; +} + static int virtnet_get_phys_port_name(struct net_device *dev, char *buf, size_t len) { @@ -2895,6 +2945,7 @@ static int virtnet_set_features(struct net_device *dev, .ndo_vlan_rx_kill_vid = virtnet_vlan_rx_kill_vid, .ndo_bpf= virtnet_xdp, .ndo_xdp_xmit = virtnet_xdp_xmit, + .ndo_xsk_wakeup = virtnet_xsk_wakeup, .ndo_features_check = passthru_features_check, .ndo_get_phys_port_name = virtnet_get_phys_port_name, .ndo_set_features = virtnet_set_features,
Re: [PATCH linux-next v3 6/6] vdpa_sim_net: Add support for user supported devices
On 2021/1/15 下午2:27, Parav Pandit wrote: With that, mac and mtu as optional input fields provide the necessary flexibility for different stacks to take the shape they desire. Thanks for the clarification. I think we'd better document the above in the patch that introduces the mac setting from the management API. Yes. Will do. Thanks. Adding Sean. Regarding the mac address setting: do we plan to allow modifying the mac address after creation? It looks like OpenStack wants this. Sean may share more information on this. Thanks
Re: [PATCH bpf-next v2 1/3] net: add priv_flags for allow tx skb without linear
On 2021/1/19 下午5:45, Xuan Zhuo wrote: In some cases, we hope to construct an skb directly based on existing memory, without copying data. In this case, the page will be placed directly in the skb, and the linear space of the skb is empty. But unfortunately, many network cards do not support this operation. For example, a Mellanox Technologies MT27710 Family [ConnectX-4 Lx] will report the following error: mlx5_core :3b:00.1 eth1: Error cqe on cqn 0x817, ci 0x8, qn 0x1dbb, opcode 0xd, syndrome 0x1, vendor syndrome 0x68 : 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 0010: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 0020: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 0030: 00 00 00 00 60 10 68 01 0a 00 1d bb 00 0f 9f d2 WQE DUMP: WQ size 1024 WQ cur size 0, WQE index 0xf, len: 64 : 00 00 0f 0a 00 1d bb 03 00 00 00 08 00 00 00 00 0010: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 0020: 00 00 00 2b 00 08 00 00 00 00 00 05 9e e3 08 00 0030: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 mlx5_core :3b:00.1 eth1: ERR CQE on SQ: 0x1dbb So a priv_flag is added here to indicate whether the network card supports this feature. I don't see any Mellanox engineers copied. I wonder if we need their confirmation on whether this is a bug or a hardware limitation. Thanks
Re: [PATCH net-next v2 2/7] virtio-net, xsk: distinguish XDP_TX and XSK XMIT ctx
On 2021/1/18 下午7:54, Xuan Zhuo wrote: On Mon, 18 Jan 2021 14:45:16 +0800, Jason Wang wrote: On 2021/1/16 上午10:59, Xuan Zhuo wrote: If xsk is supported, a new ptr will be recovered during the process of freeing the old ptr. In order to distinguish ctx sent by XDP_TX from ctx sent by xsk, a struct is added here to tell these two situations apart. virtnet_xdp_type.type is used to distinguish the different ctx, and virtnet_xdp_type.offset records the offset between the "true ctx" and virtnet_xdp_type. The newly added virtnet_xsk_hdr will be used for xsk. Signed-off-by: Xuan Zhuo Any reason you can't simply encode the type in the pointer itself, as we used to do? #define VIRTIO_XSK_FLAG BIT(1) ? Since the xdp socket does not use xdp_frame, we will encounter three data types when recycling: skb, xdp_frame, and xdp socket. Just to make sure we are on the same page: currently, the pointer type is encoded with 1 bit in the pointer. Can we simply use 2 bits to distinguish skb, xdp, and xsk? Thanks
Re: [PATCH net-next v2 5/7] virtio-net, xsk: realize the function of xsk packet sending
On 2021/1/18 下午8:03, Xuan Zhuo wrote: On Mon, 18 Jan 2021 17:10:24 +0800, Jason Wang wrote: On 2021/1/16 上午10:59, Xuan Zhuo wrote: virtnet_xsk_run will be called in the tx interrupt handling function virtnet_poll_tx. The sending process gets desc from the xsk tx queue, and assembles it to send the data. Compared with other drivers, a special place is that the page of the data in xsk is used here instead of the dma address. Because the virtio interface does not use the dma address. Signed-off-by: Xuan Zhuo --- drivers/net/virtio_net.c | 200 ++- 1 file changed, 197 insertions(+), 3 deletions(-) diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c index a62d456..42aa9ad 100644 --- a/drivers/net/virtio_net.c +++ b/drivers/net/virtio_net.c @@ -119,6 +119,8 @@ struct virtnet_xsk_hdr { u32 len; }; +#define VIRTNET_STATE_XSK_WAKEUP 1 + #define VIRTNET_SQ_STAT(m) offsetof(struct virtnet_sq_stats, m) #define VIRTNET_RQ_STAT(m) offsetof(struct virtnet_rq_stats, m) @@ -163,9 +165,12 @@ struct send_queue { struct xsk_buff_pool __rcu *pool; struct virtnet_xsk_hdr __rcu *hdr; + unsigned long state; u64hdr_con; u64hdr_pro; u64hdr_n; + struct xdp_desclast_desc; + bool wait_slot; } xsk; }; @@ -284,6 +289,8 @@ static void __free_old_xmit_ptr(struct send_queue *sq, bool in_napi, bool xsk_wakeup, unsigned int *_packets, unsigned int *_bytes); static void free_old_xmit_skbs(struct send_queue *sq, bool in_napi); +static int virtnet_xsk_run(struct send_queue *sq, + struct xsk_buff_pool *pool, int budget); static bool is_xdp_frame(void *ptr) { @@ -1590,6 +1597,8 @@ static int virtnet_poll_tx(struct napi_struct *napi, int budget) struct virtnet_info *vi = sq->vq->vdev->priv; unsigned int index = vq2txq(sq->vq); struct netdev_queue *txq; + struct xsk_buff_pool *pool; + int work = 0; if (unlikely(is_xdp_raw_buffer_queue(vi, index))) { /* We don't need to enable cb for XDP */ @@ -1599,15 +1608,26 @@ static int virtnet_poll_tx(struct napi_struct *napi, int budget) txq = 
netdev_get_tx_queue(vi->dev, index); __netif_tx_lock(txq, raw_smp_processor_id()); - free_old_xmit_skbs(sq, true); + + rcu_read_lock(); + pool = rcu_dereference(sq->xsk.pool); + if (pool) { + work = virtnet_xsk_run(sq, pool, budget); + rcu_read_unlock(); + } else { + rcu_read_unlock(); + free_old_xmit_skbs(sq, true); + } + __netif_tx_unlock(txq); - virtqueue_napi_complete(napi, sq->vq, 0); + if (work < budget) + virtqueue_napi_complete(napi, sq->vq, 0); if (sq->vq->num_free >= 2 + MAX_SKB_FRAGS) netif_tx_wake_queue(txq); - return 0; + return work; } static int xmit_skb(struct send_queue *sq, struct sk_buff *skb) @@ -2647,6 +2667,180 @@ static int virtnet_xdp(struct net_device *dev, struct netdev_bpf *xdp) } } +static int virtnet_xsk_xmit(struct send_queue *sq, struct xsk_buff_pool *pool, + struct xdp_desc *desc) +{ + struct virtnet_info *vi = sq->vq->vdev->priv; + void *data, *ptr; + struct page *page; + struct virtnet_xsk_hdr *xskhdr; + u32 idx, offset, n, i, copy, copied; + u64 addr; + int err, m; + + addr = desc->addr; + + data = xsk_buff_raw_get_data(pool, addr); + offset = offset_in_page(data); + + /* one for hdr, one for the first page */ + n = 2; + m = desc->len - (PAGE_SIZE - offset); + if (m > 0) { + n += m >> PAGE_SHIFT; + if (m & PAGE_MASK) + ++n; + + n = min_t(u32, n, ARRAY_SIZE(sq->sg)); + } + + idx = sq->xsk.hdr_con % sq->xsk.hdr_n; I don't understand the reason of the hdr array. It looks to me all of them are zero and read only from device. Any reason for not reusing a single hdr for all xdp descriptors? Or maybe it's time to introduce VIRTIO_NET_F_NO_HDR. Yes, You are right. Before supporting functions like csum, here it is indeed possible to achieve it with only one hdr. So let's drop the array logic for now since it will give unnecessary stress on the cache. + xskhdr = &sq->xsk.hdr[idx]; + + /* xskhdr->hdr has been memset to zero, so not need to clear again */ + + sg_init_table(sq->sg, n); + sg_set_buf(sq->sg, &xskhdr->hdr, vi->
Re: [RFC v3 03/11] vdpa: Remove the restriction that only supports virtio-net devices
On 2021/1/19 下午12:59, Xie Yongji wrote: With VDUSE, we should be able to support all kinds of virtio devices. Signed-off-by: Xie Yongji --- drivers/vhost/vdpa.c | 29 +++-- 1 file changed, 3 insertions(+), 26 deletions(-) diff --git a/drivers/vhost/vdpa.c b/drivers/vhost/vdpa.c index 29ed4173f04e..448be7875b6d 100644 --- a/drivers/vhost/vdpa.c +++ b/drivers/vhost/vdpa.c @@ -22,6 +22,7 @@ #include #include #include +#include #include "vhost.h" @@ -185,26 +186,6 @@ static long vhost_vdpa_set_status(struct vhost_vdpa *v, u8 __user *statusp) return 0; } -static int vhost_vdpa_config_validate(struct vhost_vdpa *v, - struct vhost_vdpa_config *c) -{ - long size = 0; - - switch (v->virtio_id) { - case VIRTIO_ID_NET: - size = sizeof(struct virtio_net_config); - break; - } - - if (c->len == 0) - return -EINVAL; - - if (c->len > size - c->off) - return -E2BIG; - - return 0; -} I think we should use a separate patch for this. Thanks - static long vhost_vdpa_get_config(struct vhost_vdpa *v, struct vhost_vdpa_config __user *c) { @@ -215,7 +196,7 @@ static long vhost_vdpa_get_config(struct vhost_vdpa *v, if (copy_from_user(&config, c, size)) return -EFAULT; - if (vhost_vdpa_config_validate(v, &config)) + if (config.len == 0) return -EINVAL; buf = kvzalloc(config.len, GFP_KERNEL); if (!buf) @@ -243,7 +224,7 @@ static long vhost_vdpa_set_config(struct vhost_vdpa *v, if (copy_from_user(&config, c, size)) return -EFAULT; - if (vhost_vdpa_config_validate(v, &config)) + if (config.len == 0) return -EINVAL; buf = kvzalloc(config.len, GFP_KERNEL); if (!buf) @@ -1025,10 +1006,6 @@ static int vhost_vdpa_probe(struct vdpa_device *vdpa) int minor; int r; - /* Currently, we only accept the network devices. */ - if (ops->get_device_id(vdpa) != VIRTIO_ID_NET) - return -ENOTSUPP; - v = kzalloc(sizeof(*v), GFP_KERNEL | __GFP_RETRY_MAYFAIL); if (!v) return -ENOMEM;
Re: [RFC v3 04/11] vhost-vdpa: protect concurrent access to vhost device iotlb
On 2021/1/19 下午12:59, Xie Yongji wrote: Introduce a mutex to protect vhost device iotlb from concurrent access. Fixes: 4c8cf318("vhost: introduce vDPA-based backend") Signed-off-by: Xie Yongji --- drivers/vhost/vdpa.c | 4 1 file changed, 4 insertions(+) diff --git a/drivers/vhost/vdpa.c b/drivers/vhost/vdpa.c index 448be7875b6d..4a241d380c40 100644 --- a/drivers/vhost/vdpa.c +++ b/drivers/vhost/vdpa.c @@ -49,6 +49,7 @@ struct vhost_vdpa { struct eventfd_ctx *config_ctx; int in_batch; struct vdpa_iova_range range; + struct mutex mutex; Let's use the device mutex like what vhost_process_iotlb_msg() did. Thanks }; static DEFINE_IDA(vhost_vdpa_ida); @@ -728,6 +729,7 @@ static int vhost_vdpa_process_iotlb_msg(struct vhost_dev *dev, if (r) return r; + mutex_lock(&v->mutex); switch (msg->type) { case VHOST_IOTLB_UPDATE: r = vhost_vdpa_process_iotlb_update(v, msg); @@ -747,6 +749,7 @@ static int vhost_vdpa_process_iotlb_msg(struct vhost_dev *dev, r = -EINVAL; break; } + mutex_unlock(&v->mutex); return r; } @@ -1017,6 +1020,7 @@ static int vhost_vdpa_probe(struct vdpa_device *vdpa) return minor; } + mutex_init(&v->mutex); atomic_set(&v->opened, 0); v->minor = minor; v->vdpa = vdpa;
Re: [RFC v3 01/11] eventfd: track eventfd_signal() recursion depth separately in different cases
On 2021/1/19 下午12:59, Xie Yongji wrote: Now we have a global percpu counter to limit the recursion depth of eventfd_signal(). This can avoid deadlock or stack overflow. But in stack overflow case, it should be OK to increase the recursion depth if needed. So we add a percpu counter in eventfd_ctx to limit the recursion depth for deadlock case. Then it could be fine to increase the global percpu counter later. I wonder whether or not it's worth to introduce percpu for each eventfd. How about simply check if eventfd_signal_count() is greater than 2? Thanks Signed-off-by: Xie Yongji --- fs/aio.c| 3 ++- fs/eventfd.c| 20 +++- include/linux/eventfd.h | 5 + 3 files changed, 22 insertions(+), 6 deletions(-) diff --git a/fs/aio.c b/fs/aio.c index 1f32da13d39e..5d82903161f5 100644 --- a/fs/aio.c +++ b/fs/aio.c @@ -1698,7 +1698,8 @@ static int aio_poll_wake(struct wait_queue_entry *wait, unsigned mode, int sync, list_del(&iocb->ki_list); iocb->ki_res.res = mangle_poll(mask); req->done = true; - if (iocb->ki_eventfd && eventfd_signal_count()) { + if (iocb->ki_eventfd && + eventfd_signal_count(iocb->ki_eventfd)) { iocb = NULL; INIT_WORK(&req->work, aio_poll_put_work); schedule_work(&req->work); diff --git a/fs/eventfd.c b/fs/eventfd.c index e265b6dd4f34..2df24f9bada3 100644 --- a/fs/eventfd.c +++ b/fs/eventfd.c @@ -25,6 +25,8 @@ #include #include +#define EVENTFD_WAKE_DEPTH 0 + DEFINE_PER_CPU(int, eventfd_wake_count); static DEFINE_IDA(eventfd_ida); @@ -42,9 +44,17 @@ struct eventfd_ctx { */ __u64 count; unsigned int flags; + int __percpu *wake_count; int id; }; +bool eventfd_signal_count(struct eventfd_ctx *ctx) +{ + return (this_cpu_read(*ctx->wake_count) || + this_cpu_read(eventfd_wake_count) > EVENTFD_WAKE_DEPTH); +} +EXPORT_SYMBOL_GPL(eventfd_signal_count); + /** * eventfd_signal - Adds @n to the eventfd counter. * @ctx: [in] Pointer to the eventfd context. 
@@ -71,17 +81,19 @@ __u64 eventfd_signal(struct eventfd_ctx *ctx, __u64 n) * it returns true, the eventfd_signal() call should be deferred to a * safe context. */ - if (WARN_ON_ONCE(this_cpu_read(eventfd_wake_count))) + if (WARN_ON_ONCE(eventfd_signal_count(ctx))) return 0; spin_lock_irqsave(&ctx->wqh.lock, flags); this_cpu_inc(eventfd_wake_count); + this_cpu_inc(*ctx->wake_count); if (ULLONG_MAX - ctx->count < n) n = ULLONG_MAX - ctx->count; ctx->count += n; if (waitqueue_active(&ctx->wqh)) wake_up_locked_poll(&ctx->wqh, EPOLLIN); this_cpu_dec(eventfd_wake_count); + this_cpu_dec(*ctx->wake_count); spin_unlock_irqrestore(&ctx->wqh.lock, flags); return n; @@ -92,6 +104,7 @@ static void eventfd_free_ctx(struct eventfd_ctx *ctx) { if (ctx->id >= 0) ida_simple_remove(&eventfd_ida, ctx->id); + free_percpu(ctx->wake_count); kfree(ctx); } @@ -423,6 +436,11 @@ static int do_eventfd(unsigned int count, int flags) kref_init(&ctx->kref); init_waitqueue_head(&ctx->wqh); + ctx->wake_count = alloc_percpu(int); + if (!ctx->wake_count) { + kfree(ctx); + return -ENOMEM; + } ctx->count = count; ctx->flags = flags; ctx->id = ida_simple_get(&eventfd_ida, 0, 0, GFP_KERNEL); diff --git a/include/linux/eventfd.h b/include/linux/eventfd.h index fa0a524baed0..1a11ebbd74a9 100644 --- a/include/linux/eventfd.h +++ b/include/linux/eventfd.h @@ -45,10 +45,7 @@ void eventfd_ctx_do_read(struct eventfd_ctx *ctx, __u64 *cnt); DECLARE_PER_CPU(int, eventfd_wake_count); -static inline bool eventfd_signal_count(void) -{ - return this_cpu_read(eventfd_wake_count); -} +bool eventfd_signal_count(struct eventfd_ctx *ctx); #else /* CONFIG_EVENTFD */
Re: [RFC v3 05/11] vdpa: shared virtual addressing support
On 2021/1/19 下午12:59, Xie Yongji wrote: This patches introduces SVA (Shared Virtual Addressing) support for vDPA device. During vDPA device allocation, vDPA device driver needs to indicate whether SVA is supported by the device. Then vhost-vdpa bus driver will not pin user page and transfer userspace virtual address instead of physical address during DMA mapping. Suggested-by: Jason Wang Signed-off-by: Xie Yongji --- drivers/vdpa/ifcvf/ifcvf_main.c | 2 +- drivers/vdpa/mlx5/net/mlx5_vnet.c | 2 +- drivers/vdpa/vdpa.c | 5 - drivers/vdpa/vdpa_sim/vdpa_sim.c | 3 ++- drivers/vhost/vdpa.c | 35 +++ include/linux/vdpa.h | 10 +++--- 6 files changed, 38 insertions(+), 19 deletions(-) diff --git a/drivers/vdpa/ifcvf/ifcvf_main.c b/drivers/vdpa/ifcvf/ifcvf_main.c index 23474af7da40..95c4601f82f5 100644 --- a/drivers/vdpa/ifcvf/ifcvf_main.c +++ b/drivers/vdpa/ifcvf/ifcvf_main.c @@ -439,7 +439,7 @@ static int ifcvf_probe(struct pci_dev *pdev, const struct pci_device_id *id) adapter = vdpa_alloc_device(struct ifcvf_adapter, vdpa, dev, &ifc_vdpa_ops, - IFCVF_MAX_QUEUE_PAIRS * 2, NULL); + IFCVF_MAX_QUEUE_PAIRS * 2, NULL, false); if (adapter == NULL) { IFCVF_ERR(pdev, "Failed to allocate vDPA structure"); return -ENOMEM; diff --git a/drivers/vdpa/mlx5/net/mlx5_vnet.c b/drivers/vdpa/mlx5/net/mlx5_vnet.c index 77595c81488d..05988d6907f2 100644 --- a/drivers/vdpa/mlx5/net/mlx5_vnet.c +++ b/drivers/vdpa/mlx5/net/mlx5_vnet.c @@ -1959,7 +1959,7 @@ static int mlx5v_probe(struct auxiliary_device *adev, max_vqs = min_t(u32, max_vqs, MLX5_MAX_SUPPORTED_VQS); ndev = vdpa_alloc_device(struct mlx5_vdpa_net, mvdev.vdev, mdev->device, &mlx5_vdpa_ops, -2 * mlx5_vdpa_max_qps(max_vqs), NULL); +2 * mlx5_vdpa_max_qps(max_vqs), NULL, false); if (IS_ERR(ndev)) return PTR_ERR(ndev); diff --git a/drivers/vdpa/vdpa.c b/drivers/vdpa/vdpa.c index 32bd48baffab..50cab930b2e5 100644 --- a/drivers/vdpa/vdpa.c +++ b/drivers/vdpa/vdpa.c @@ -72,6 +72,7 @@ static void vdpa_release_dev(struct device *d) * @nvqs: number 
of virtqueues supported by this device * @size: size of the parent structure that contains private data * @name: name of the vdpa device; optional. + * @sva: indicate whether SVA (Shared Virtual Addressing) is supported * * Driver should use vdpa_alloc_device() wrapper macro instead of * using this directly. @@ -81,7 +82,8 @@ static void vdpa_release_dev(struct device *d) */ struct vdpa_device *__vdpa_alloc_device(struct device *parent, const struct vdpa_config_ops *config, - int nvqs, size_t size, const char *name) + int nvqs, size_t size, const char *name, + bool sva) { struct vdpa_device *vdev; int err = -EINVAL; @@ -108,6 +110,7 @@ struct vdpa_device *__vdpa_alloc_device(struct device *parent, vdev->config = config; vdev->features_valid = false; vdev->nvqs = nvqs; + vdev->sva = sva; if (name) err = dev_set_name(&vdev->dev, "%s", name); diff --git a/drivers/vdpa/vdpa_sim/vdpa_sim.c b/drivers/vdpa/vdpa_sim/vdpa_sim.c index 85776e4e6749..03c796873a6b 100644 --- a/drivers/vdpa/vdpa_sim/vdpa_sim.c +++ b/drivers/vdpa/vdpa_sim/vdpa_sim.c @@ -367,7 +367,8 @@ static struct vdpasim *vdpasim_create(const char *name) else ops = &vdpasim_net_config_ops; - vdpasim = vdpa_alloc_device(struct vdpasim, vdpa, NULL, ops, VDPASIM_VQ_NUM, name); + vdpasim = vdpa_alloc_device(struct vdpasim, vdpa, NULL, ops, + VDPASIM_VQ_NUM, name, false); if (!vdpasim) goto err_alloc; diff --git a/drivers/vhost/vdpa.c b/drivers/vhost/vdpa.c index 4a241d380c40..36b6950ba37f 100644 --- a/drivers/vhost/vdpa.c +++ b/drivers/vhost/vdpa.c @@ -486,21 +486,25 @@ static long vhost_vdpa_unlocked_ioctl(struct file *filep, static void vhost_vdpa_iotlb_unmap(struct vhost_vdpa *v, u64 start, u64 last) { struct vhost_dev *dev = &v->vdev; + struct vdpa_device *vdpa = v->vdpa; struct vhost_iotlb *iotlb = dev->iotlb; struct vhost_iotlb_map *map; struct page *page; unsigned long pfn, pinned; while ((map = vhost_iotlb_itree_first(iotlb, start, last)) != NULL) { - pinned = map->size >> PAGE_SHIFT; - for (pfn = 
map->addr >> PAGE_SHIFT; -pinned > 0; pfn++, pinned--) { -
Re: [RFC v3 06/11] vhost-vdpa: Add an opaque pointer for vhost IOTLB
On 2021/1/19 下午12:59, Xie Yongji wrote: Add an opaque pointer for vhost IOTLB to store the corresponding vma->vm_file and offset on the DMA mapping. Let's split the patch into two. 1) opaque pointer 2) vma stuffs It will be used in VDUSE case later. Suggested-by: Jason Wang Signed-off-by: Xie Yongji --- drivers/vdpa/vdpa_sim/vdpa_sim.c | 11 --- drivers/vhost/iotlb.c| 5 ++- drivers/vhost/vdpa.c | 66 +++- drivers/vhost/vhost.c| 4 +-- include/linux/vdpa.h | 3 +- include/linux/vhost_iotlb.h | 8 - 6 files changed, 79 insertions(+), 18 deletions(-) diff --git a/drivers/vdpa/vdpa_sim/vdpa_sim.c b/drivers/vdpa/vdpa_sim/vdpa_sim.c index 03c796873a6b..1ffcef67954f 100644 --- a/drivers/vdpa/vdpa_sim/vdpa_sim.c +++ b/drivers/vdpa/vdpa_sim/vdpa_sim.c @@ -279,7 +279,7 @@ static dma_addr_t vdpasim_map_page(struct device *dev, struct page *page, */ spin_lock(&vdpasim->iommu_lock); ret = vhost_iotlb_add_range(iommu, pa, pa + size - 1, - pa, dir_to_perm(dir)); + pa, dir_to_perm(dir), NULL); Maybe its better to introduce vhost_iotlb_add_range_ctx() which can accepts the opaque (context). And let vhost_iotlb_add_range() just call that. 
spin_unlock(&vdpasim->iommu_lock); if (ret) return DMA_MAPPING_ERROR; @@ -317,7 +317,7 @@ static void *vdpasim_alloc_coherent(struct device *dev, size_t size, ret = vhost_iotlb_add_range(iommu, (u64)pa, (u64)pa + size - 1, - pa, VHOST_MAP_RW); + pa, VHOST_MAP_RW, NULL); if (ret) { *dma_addr = DMA_MAPPING_ERROR; kfree(addr); @@ -625,7 +625,8 @@ static int vdpasim_set_map(struct vdpa_device *vdpa, for (map = vhost_iotlb_itree_first(iotlb, start, last); map; map = vhost_iotlb_itree_next(map, start, last)) { ret = vhost_iotlb_add_range(vdpasim->iommu, map->start, - map->last, map->addr, map->perm); + map->last, map->addr, + map->perm, NULL); if (ret) goto err; } @@ -639,14 +640,14 @@ static int vdpasim_set_map(struct vdpa_device *vdpa, } static int vdpasim_dma_map(struct vdpa_device *vdpa, u64 iova, u64 size, - u64 pa, u32 perm) + u64 pa, u32 perm, void *opaque) { struct vdpasim *vdpasim = vdpa_to_sim(vdpa); int ret; spin_lock(&vdpasim->iommu_lock); ret = vhost_iotlb_add_range(vdpasim->iommu, iova, iova + size - 1, pa, - perm); + perm, NULL); spin_unlock(&vdpasim->iommu_lock); return ret; diff --git a/drivers/vhost/iotlb.c b/drivers/vhost/iotlb.c index 0fd3f87e913c..3bd5bd06cdbc 100644 --- a/drivers/vhost/iotlb.c +++ b/drivers/vhost/iotlb.c @@ -42,13 +42,15 @@ EXPORT_SYMBOL_GPL(vhost_iotlb_map_free); * @last: last of IOVA range * @addr: the address that is mapped to @start * @perm: access permission of this range + * @opaque: the opaque pointer for the IOTLB mapping * * Returns an error last is smaller than start or memory allocation * fails */ int vhost_iotlb_add_range(struct vhost_iotlb *iotlb, u64 start, u64 last, - u64 addr, unsigned int perm) + u64 addr, unsigned int perm, + void *opaque) { struct vhost_iotlb_map *map; @@ -71,6 +73,7 @@ int vhost_iotlb_add_range(struct vhost_iotlb *iotlb, map->last = last; map->addr = addr; map->perm = perm; + map->opaque = opaque; iotlb->nmaps++; vhost_iotlb_itree_insert(map, &iotlb->root); diff --git a/drivers/vhost/vdpa.c 
b/drivers/vhost/vdpa.c index 36b6950ba37f..e83e5be7cec8 100644 --- a/drivers/vhost/vdpa.c +++ b/drivers/vhost/vdpa.c @@ -488,6 +488,7 @@ static void vhost_vdpa_iotlb_unmap(struct vhost_vdpa *v, u64 start, u64 last) struct vhost_dev *dev = &v->vdev; struct vdpa_device *vdpa = v->vdpa; struct vhost_iotlb *iotlb = dev->iotlb; + struct vhost_iotlb_file *iotlb_file; struct vhost_iotlb_map *map; struct page *page; unsigned long pfn, pinned; @@ -504,6 +505,10 @@ static void vhost_vdpa_iotlb_unmap(struct vhost_vdpa *v, u64 start, u64 last) } atomic64_sub(map->size >> PAGE_SHIFT, &
Re: [PATCH 1/1] vhost scsi: allocate vhost_scsi with GFP_NOWAIT to avoid delay
On 2021/1/21 13:03, Dongli Zhang wrote: The size of 'struct vhost_scsi' is order-10 (~2.3MB). kzalloc() may take a long time compacting memory pages when high-order pages are scarce. As a result, there is latency when creating a VM (with vhost-scsi) or hot-adding vhost-scsi-based storage. The prior commit 595cb754983d ("vhost/scsi: use vmalloc for order-10 allocation") prefers to fall back only when really needed, while this patch changes the allocation to GFP_NOWAIT in order to avoid the delay caused by memory compaction. Cc: Aruna Ramakrishna Cc: Joe Jin Signed-off-by: Dongli Zhang --- Another option is to rework this by reducing the size of 'struct vhost_scsi', e.g., by replacing the inline vhost_scsi.vqs with pointers, with each vhost_scsi.vqs[i] allocated separately. Please let me know if that option is better. drivers/vhost/scsi.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/drivers/vhost/scsi.c b/drivers/vhost/scsi.c index 4ce9f00ae10e..85eaa4e883f4 100644 --- a/drivers/vhost/scsi.c +++ b/drivers/vhost/scsi.c @@ -1814,7 +1814,7 @@ static int vhost_scsi_open(struct inode *inode, struct file *f) struct vhost_virtqueue **vqs; int r = -ENOMEM, i; - vs = kzalloc(sizeof(*vs), GFP_KERNEL | __GFP_NOWARN | __GFP_RETRY_MAYFAIL); + vs = kzalloc(sizeof(*vs), GFP_NOWAIT | __GFP_NOWARN); if (!vs) { vs = vzalloc(sizeof(*vs)); if (!vs) Can we use kvzalloc? Thanks
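If kvzalloc() is acceptable here, the open-coded kmalloc-then-vmalloc fallback collapses to one call — for costly orders kvzalloc() already avoids aggressive reclaim/compaction before falling back to vmalloc, which addresses the latency concern without GFP_NOWAIT. A hypothetical sketch against the patched code (the `err_vs` label is invented; the original's error path is truncated above):

```diff
-	vs = kzalloc(sizeof(*vs), GFP_NOWAIT | __GFP_NOWARN);
-	if (!vs) {
-		vs = vzalloc(sizeof(*vs));
-		if (!vs)
-			goto err_vs;
-	}
+	vs = kvzalloc(sizeof(*vs), GFP_KERNEL);
+	if (!vs)
+		goto err_vs;
```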
Re: [PATCH net] vhost_net: fix high cpu load when sendmsg fails
On 2020/12/9 下午9:27, wangyunjian wrote: -Original Message- From: Michael S. Tsirkin [mailto:m...@redhat.com] Sent: Wednesday, December 9, 2020 8:50 PM To: wangyunjian Cc: jasow...@redhat.com; virtualizat...@lists.linux-foundation.org; netdev@vger.kernel.org; Lilijun (Jerry) ; chenchanghu ; xudingke Subject: Re: [PATCH net] vhost_net: fix high cpu load when sendmsg fails On Wed, Dec 09, 2020 at 07:48:24PM +0800, wangyunjian wrote: From: Yunjian Wang Currently we break the loop and wake up the vhost_worker when sendmsg fails. When the worker wakes up again, we'll meet the same error. This will cause high CPU load. To fix this issue, we can skip this description by ignoring the error. Signed-off-by: Yunjian Wang --- drivers/vhost/net.c | 24 +--- 1 file changed, 5 insertions(+), 19 deletions(-) diff --git a/drivers/vhost/net.c b/drivers/vhost/net.c index 531a00d703cd..ac950b1120f5 100644 --- a/drivers/vhost/net.c +++ b/drivers/vhost/net.c @@ -829,14 +829,8 @@ static void handle_tx_copy(struct vhost_net *net, struct socket *sock) /* TODO: Check specific error and bomb out unless ENOBUFS? */ err = sock->ops->sendmsg(sock, &msg, len); - if (unlikely(err < 0)) { - vhost_discard_vq_desc(vq, 1); - vhost_net_enable_vq(net, vq); - break; - } - if (err != len) - pr_debug("Truncated TX packet: len %d != %zd\n", -err, len); + if (unlikely(err < 0 || err != len)) + vq_err(vq, "Fail to sending packets err : %d, len : %zd\n", err, +len); done: vq->heads[nvq->done_idx].id = cpu_to_vhost32(vq, head); vq->heads[nvq->done_idx].len = 0; One of the reasons for sendmsg to fail is ENOBUFS. In that case for sure we don't want to drop packet. Now the function tap_sendmsg()/tun_sendmsg() don't return ENOBUFS. I think not, it can happen if we exceeds sndbuf. E.g see tun_alloc_skb(). Thanks There could be other transient errors. Which error did you encounter, specifically? Currently a guest vm send a skb which length is more than 64k. If virtio hdr is wrong, the problem will also be triggered. 
Thanks @@ -925,19 +919,11 @@ static void handle_tx_zerocopy(struct vhost_net *net, struct socket *sock) /* TODO: Check specific error and bomb out unless ENOBUFS? */ err = sock->ops->sendmsg(sock, &msg, len); - if (unlikely(err < 0)) { - if (zcopy_used) { + if (unlikely(err < 0 || err != len)) { + if (zcopy_used && err < 0) vhost_net_ubuf_put(ubufs); - nvq->upend_idx = ((unsigned)nvq->upend_idx - 1) - % UIO_MAXIOV; - } - vhost_discard_vq_desc(vq, 1); - vhost_net_enable_vq(net, vq); - break; + vq_err(vq, "Fail to sending packets err : %d, len : %zd\n", err, +len); } - if (err != len) - pr_debug("Truncated TX packet: " -" len %d != %zd\n", err, len); if (!zcopy_used) vhost_add_used_and_signal(&net->dev, vq, head, 0); else -- 2.23.0
Re: [PATCH net] vhost_net: fix high cpu load when sendmsg fails
On 2020/12/11 下午3:37, wangyunjian wrote: -Original Message- From: Jason Wang [mailto:jasow...@redhat.com] Sent: Friday, December 11, 2020 10:53 AM To: wangyunjian ; Michael S. Tsirkin Cc: virtualizat...@lists.linux-foundation.org; netdev@vger.kernel.org; Lilijun (Jerry) ; chenchanghu ; xudingke Subject: Re: [PATCH net] vhost_net: fix high cpu load when sendmsg fails On 2020/12/9 下午9:27, wangyunjian wrote: -Original Message- From: Michael S. Tsirkin [mailto:m...@redhat.com] Sent: Wednesday, December 9, 2020 8:50 PM To: wangyunjian Cc: jasow...@redhat.com; virtualizat...@lists.linux-foundation.org; netdev@vger.kernel.org; Lilijun (Jerry) ; chenchanghu ; xudingke Subject: Re: [PATCH net] vhost_net: fix high cpu load when sendmsg fails On Wed, Dec 09, 2020 at 07:48:24PM +0800, wangyunjian wrote: From: Yunjian Wang Currently we break the loop and wake up the vhost_worker when sendmsg fails. When the worker wakes up again, we'll meet the same error. This will cause high CPU load. To fix this issue, we can skip this description by ignoring the error. Signed-off-by: Yunjian Wang --- drivers/vhost/net.c | 24 +--- 1 file changed, 5 insertions(+), 19 deletions(-) diff --git a/drivers/vhost/net.c b/drivers/vhost/net.c index 531a00d703cd..ac950b1120f5 100644 --- a/drivers/vhost/net.c +++ b/drivers/vhost/net.c @@ -829,14 +829,8 @@ static void handle_tx_copy(struct vhost_net *net, struct socket *sock) /* TODO: Check specific error and bomb out unless ENOBUFS? */ err = sock->ops->sendmsg(sock, &msg, len); - if (unlikely(err < 0)) { - vhost_discard_vq_desc(vq, 1); - vhost_net_enable_vq(net, vq); - break; - } - if (err != len) - pr_debug("Truncated TX packet: len %d != %zd\n", -err, len); + if (unlikely(err < 0 || err != len)) + vq_err(vq, "Fail to sending packets err : %d, len : %zd\n", err, +len); done: vq->heads[nvq->done_idx].id = cpu_to_vhost32(vq, head); vq->heads[nvq->done_idx].len = 0; One of the reasons for sendmsg to fail is ENOBUFS. 
In that case for sure we don't want to drop packet. Now the function tap_sendmsg()/tun_sendmsg() don't return ENOBUFS. I think not, it can happen if we exceeds sndbuf. E.g see tun_alloc_skb(). This patch 'net: add alloc_skb_with_frags() helper' modifys the return value of sock_alloc_send_pskb() from -ENOBUFS to -EAGAIN when we exceeds sndbuf. So the return value of tun_alloc_skb has been changed. Ok. We don't drop packet if the reasons for sendmsg to fail is EAGAIN. How about this? It should work. Btw, the patch doesn't add the head to the used ring. This may confuses the driver. Thanks Thanks Thanks There could be other transient errors. Which error did you encounter, specifically? Currently a guest vm send a skb which length is more than 64k. If virtio hdr is wrong, the problem will also be triggered. Thanks @@ -925,19 +919,11 @@ static void handle_tx_zerocopy(struct vhost_net *net, struct socket *sock) /* TODO: Check specific error and bomb out unless ENOBUFS? */ err = sock->ops->sendmsg(sock, &msg, len); - if (unlikely(err < 0)) { - if (zcopy_used) { + if (unlikely(err < 0 || err != len)) { + if (zcopy_used && err < 0) vhost_net_ubuf_put(ubufs); - nvq->upend_idx = ((unsigned)nvq->upend_idx - 1) - % UIO_MAXIOV; - } - vhost_discard_vq_desc(vq, 1); - vhost_net_enable_vq(net, vq); - break; + vq_err(vq, "Fail to sending packets err : %d, len : %zd\n", err, +len); } - if (err != len) - pr_debug("Truncated TX packet: " -" len %d != %zd\n", err, len); if (!zcopy_used) vhost_add_used_and_signal(&net->dev, vq, head, 0); else -- 2.23.0
Re: [PATCH net v2] tun: fix ubuf refcount incorrectly on error path
On 2020/12/14 上午9:32, Willem de Bruijn wrote: On Sat, Dec 12, 2020 at 7:18 PM Willem de Bruijn wrote: afterwards, the error handling in vhost handle_tx() will try to decrease the same refcount again. This is wrong and fix this by delay copying ubuf_info until we're sure there's no errors. I think the right approach is to address this in the error paths, rather than complicate the normal datapath. Is it sufficient to suppress the call to vhost_net_ubuf_put in the handle_tx sendmsg error path, given that vhost_zerocopy_callback will be called on kfree_skb? We can not call kfree_skb() until the skb was created. Or alternatively clear the destructor in drop: The uarg->callback() is called immediately after we decide do datacopy even if caller want to do zerocopy. If another error occurs later, the vhost handle_tx() will try to decrease it again. Oh right, I missed the else branch in this path: /* copy skb_ubuf_info for callback when skb has no error */ if (zerocopy) { skb_shinfo(skb)->destructor_arg = msg_control; skb_shinfo(skb)->tx_flags |= SKBTX_DEV_ZEROCOPY; skb_shinfo(skb)->tx_flags |= SKBTX_SHARED_FRAG; } else if (msg_control) { struct ubuf_info *uarg = msg_control; uarg->callback(uarg, false); } So if handle_tx_zerocopy calls tun_sendmsg with ubuf_info (and thus a reference to release), there are these five options: 1. tun_sendmsg succeeds, ubuf_info is associated with skb. reference released from kfree_skb calling vhost_zerocopy_callback later 2. tun_sendmsg succeeds, ubuf_info is released immediately, as skb is not zerocopy. 3. tun_sendmsg fails before creating skb, handle_tx_zerocopy correctly cleans up on receiving error from tun_sendmsg. 4. tun_sendmsg fails after creating skb, but with copying: decremented at branch shown above + again in handle_tx_zerocopy 5. 
tun_sendmsg fails after creating skb, with zerocopy: decremented at kfree_skb in drop: + again in handle_tx_zerocopy Since handle_tx_zerocopy has no idea whether on error 3, 4 or 5 occurred, Actually, it does. If sendmsg returns an error, it can test whether vq->heads[nvq->upend_idx].len != VHOST_DMA_IN_PROGRESS. Just to make sure I understand this. Any reason for it can't be VHOST_DMA_IN_PROGRESS here? Thanks either all decrement-on-error cases must be handled by handle_tx_zerocopy or none. Your patch chooses the latter. Makes sense. But can this still go wrong if the xdp path is taken, but no program exists or the program returns XDP_PASS. And then the packet hits an error path, such as ! IFF_UP?
Re: [PATCH net v2] tun: fix ubuf refcount incorrectly on error path
On 2020/12/14 上午11:56, Willem de Bruijn wrote: On Sun, Dec 13, 2020 at 10:54 PM Willem de Bruijn wrote: On Sun, Dec 13, 2020 at 10:30 PM Jason Wang wrote: On 2020/12/14 上午9:32, Willem de Bruijn wrote: On Sat, Dec 12, 2020 at 7:18 PM Willem de Bruijn wrote: afterwards, the error handling in vhost handle_tx() will try to decrease the same refcount again. This is wrong and fix this by delay copying ubuf_info until we're sure there's no errors. I think the right approach is to address this in the error paths, rather than complicate the normal datapath. Is it sufficient to suppress the call to vhost_net_ubuf_put in the handle_tx sendmsg error path, given that vhost_zerocopy_callback will be called on kfree_skb? We can not call kfree_skb() until the skb was created. Or alternatively clear the destructor in drop: The uarg->callback() is called immediately after we decide do datacopy even if caller want to do zerocopy. If another error occurs later, the vhost handle_tx() will try to decrease it again. Oh right, I missed the else branch in this path: /* copy skb_ubuf_info for callback when skb has no error */ if (zerocopy) { skb_shinfo(skb)->destructor_arg = msg_control; skb_shinfo(skb)->tx_flags |= SKBTX_DEV_ZEROCOPY; skb_shinfo(skb)->tx_flags |= SKBTX_SHARED_FRAG; } else if (msg_control) { struct ubuf_info *uarg = msg_control; uarg->callback(uarg, false); } So if handle_tx_zerocopy calls tun_sendmsg with ubuf_info (and thus a reference to release), there are these five options: 1. tun_sendmsg succeeds, ubuf_info is associated with skb. reference released from kfree_skb calling vhost_zerocopy_callback later 2. tun_sendmsg succeeds, ubuf_info is released immediately, as skb is not zerocopy. 3. tun_sendmsg fails before creating skb, handle_tx_zerocopy correctly cleans up on receiving error from tun_sendmsg. 4. tun_sendmsg fails after creating skb, but with copying: decremented at branch shown above + again in handle_tx_zerocopy 5. 
tun_sendmsg fails after creating skb, with zerocopy: decremented at kfree_skb in drop: + again in handle_tx_zerocopy Since handle_tx_zerocopy has no idea whether on error 3, 4 or 5 occurred, Actually, it does. If sendmsg returns an error, it can test whether vq->heads[nvq->upend_idx].len != VHOST_DMA_IN_PROGRESS. Just to make sure I understand this. Any reason for it can't be VHOST_DMA_IN_PROGRESS here? It can be, and it will be if tun_sendmsg returns EINVAL before assigning the skb destructor. I meant returns an error, not necessarily only EINVAL. Only if tun_sendmsg released the zerocopy state through kfree_skb->vhost_zerocopy_callback will it have been updated to VHOST_DMA_DONE_LEN. And only then must the caller not try to release the state again. I see. So I tend to fix this in vhost instead of tun to be consistent with the current error handling in handle_tx_zerocopy(). Thanks
Re: [PATCH net 2/2] vhost_net: fix high cpu load when sendmsg fails
On 2020/12/15 上午9:48, wangyunjian wrote: From: Yunjian Wang Currently we break the loop and wake up the vhost_worker when sendmsg fails. When the worker wakes up again, we'll meet the same error. This will cause high CPU load. To fix this issue, we can skip this descriptor by ignoring the error. When we exceed sndbuf, the return value of sendmsg is -EAGAIN. In that case we don't skip the descriptor and don't drop the packet. Signed-off-by: Yunjian Wang --- drivers/vhost/net.c | 21 + 1 file changed, 9 insertions(+), 12 deletions(-) diff --git a/drivers/vhost/net.c b/drivers/vhost/net.c index c8784dfafdd7..f966592d8900 100644 --- a/drivers/vhost/net.c +++ b/drivers/vhost/net.c @@ -827,16 +827,13 @@ static void handle_tx_copy(struct vhost_net *net, struct socket *sock) msg.msg_flags &= ~MSG_MORE; } - /* TODO: Check specific error and bomb out unless ENOBUFS? */ err = sock->ops->sendmsg(sock, &msg, len); - if (unlikely(err < 0)) { + if (unlikely(err == -EAGAIN)) { vhost_discard_vq_desc(vq, 1); vhost_net_enable_vq(net, vq); break; - } As I've pointed out in the last version: if you don't discard the descriptor, you probably need to add the head to the used ring. Otherwise this descriptor will stay inflight forever, which may confuse drivers. - if (err != len) - pr_debug("Truncated TX packet: len %d != %zd\n", -err, len); + } else if (unlikely(err < 0 || err != len)) It looks to me err != len covers err < 0. Thanks + vq_err(vq, "Fail to sending packets err : %d, len : %zd\n", err, len); done: vq->heads[nvq->done_idx].id = cpu_to_vhost32(vq, head); vq->heads[nvq->done_idx].len = 0; @@ -922,7 +919,6 @@ static void handle_tx_zerocopy(struct vhost_net *net, struct socket *sock) msg.msg_flags &= ~MSG_MORE; } - /* TODO: Check specific error and bomb out unless ENOBUFS?
*/ err = sock->ops->sendmsg(sock, &msg, len); if (unlikely(err < 0)) { if (zcopy_used) { @@ -931,13 +927,14 @@ static void handle_tx_zerocopy(struct vhost_net *net, struct socket *sock) nvq->upend_idx = ((unsigned)nvq->upend_idx - 1) % UIO_MAXIOV; } - vhost_discard_vq_desc(vq, 1); - vhost_net_enable_vq(net, vq); - break; + if (err == -EAGAIN) { + vhost_discard_vq_desc(vq, 1); + vhost_net_enable_vq(net, vq); + break; + } } if (err != len) - pr_debug("Truncated TX packet: " -" len %d != %zd\n", err, len); + vq_err(vq, "Fail to sending packets err : %d, len : %zd\n", err, len); if (!zcopy_used) vhost_add_used_and_signal(&net->dev, vq, head, 0); else
Re: [PATCH net 2/2] vhost_net: fix high cpu load when sendmsg fails
- Original Message - > > > > -Original Message- > > From: Jason Wang [mailto:jasow...@redhat.com] > > Sent: Tuesday, December 15, 2020 12:10 PM > > To: wangyunjian ; netdev@vger.kernel.org; > > m...@redhat.com; willemdebruijn.ker...@gmail.com > > Cc: virtualizat...@lists.linux-foundation.org; Lilijun (Jerry) > > ; chenchanghu ; > > xudingke ; huangbin (J) > > > > Subject: Re: [PATCH net 2/2] vhost_net: fix high cpu load when sendmsg > > fails > > > > > > On 2020/12/15 上午9:48, wangyunjian wrote: > > > From: Yunjian Wang > > > > > > Currently we break the loop and wake up the vhost_worker when sendmsg > > > fails. When the worker wakes up again, we'll meet the same error. This > > > will cause high CPU load. To fix this issue, we can skip this > > > description by ignoring the error. When we exceeds sndbuf, the return > > > value of sendmsg is -EAGAIN. In the case we don't skip the description > > > and don't drop packet. > > > > > > Signed-off-by: Yunjian Wang > > > --- > > > drivers/vhost/net.c | 21 + > > > 1 file changed, 9 insertions(+), 12 deletions(-) > > > > > > diff --git a/drivers/vhost/net.c b/drivers/vhost/net.c index > > > c8784dfafdd7..f966592d8900 100644 > > > --- a/drivers/vhost/net.c > > > +++ b/drivers/vhost/net.c > > > @@ -827,16 +827,13 @@ static void handle_tx_copy(struct vhost_net *net, > > struct socket *sock) > > > msg.msg_flags &= ~MSG_MORE; > > > } > > > > > > - /* TODO: Check specific error and bomb out unless ENOBUFS? */ > > > err = sock->ops->sendmsg(sock, &msg, len); > > > - if (unlikely(err < 0)) { > > > + if (unlikely(err == -EAGAIN)) { > > > vhost_discard_vq_desc(vq, 1); > > > vhost_net_enable_vq(net, vq); > > > break; > > > - } > > > > > > As I've pointed out in last version. If you don't discard descriptor, you > > probably > > need to add the head to used ring. Otherwise this descriptor will be always > > inflight that may confuse drivers. > > Sorry for missing the comment. 
> > After deleting discard descriptor and break, the next processing will be the > same > as the normal success of sendmsg(), and vhost_zerocopy_signal_used() or > vhost_add_used_and_signal() method will be called to add the head to used > ring. It's the next head not the one that contains the buggy packet? Thanks > > Thanks > > > > > > > - if (err != len) > > > - pr_debug("Truncated TX packet: len %d != %zd\n", > > > - err, len); > > > + } else if (unlikely(err < 0 || err != len)) > > > > > > It looks to me err != len covers err < 0. > > OK > > > > > Thanks > > > > > > > + vq_err(vq, "Fail to sending packets err : %d, len : > > > %zd\n", err, > > > +len); > > > done: > > > vq->heads[nvq->done_idx].id = cpu_to_vhost32(vq, head); > > > vq->heads[nvq->done_idx].len = 0; > > > @@ -922,7 +919,6 @@ static void handle_tx_zerocopy(struct vhost_net > > *net, struct socket *sock) > > > msg.msg_flags &= ~MSG_MORE; > > > } > > > > > > - /* TODO: Check specific error and bomb out unless ENOBUFS? */ > > > err = sock->ops->sendmsg(sock, &msg, len); > > > if (unlikely(err < 0)) { > > > if (zcopy_used) { > > > @@ -931,13 +927,14 @@ static void handle_tx_zerocopy(struct vhost_net > > *net, struct socket *sock) > > > nvq->upend_idx = > > > ((unsigned)nvq->upend_idx - 1) > > > % UIO_MAXIOV; > > > } > > > - vhost_discard_vq_desc(vq, 1); > > > - vhost_net_enable_vq(net, vq); > > > - break; > > > + if (err == -EAGAIN) { > > > + vhost_discard_vq_desc(vq, 1); > > > + vhost_net_enable_vq(net, vq); > > > + break; > > > + } > > > } > > > if (err != len) > > > - pr_debug("Truncated TX packet: " > > > - " len %d != %zd\n", err, len); > > > + vq_err(vq, "Fail to sending packets err : %d, len : > > > %zd\n", err, > > > +len); > > > if (!zcopy_used) > > > vhost_add_used_and_signal(&net->dev, vq, head, > > > 0); > > > else > >
[PATCH 00/21] Control VQ support in vDPA
Hi All: This series tries to add support for the control virtqueue in vDPA. The control virtqueue is used by networking devices to accept various commands from the driver. It's a must for supporting multiqueue and other configurations. When used by the vhost-vDPA bus driver for a VM, the control virtqueue should be shadowed via the userspace VMM (Qemu) instead of being assigned directly to the guest. This is because Qemu needs to know the device state in order to start and stop the device correctly (e.g. for Live Migration). This requires isolating the memory mapping for the control virtqueue presented by vhost-vDPA, to prevent the guest from accessing it directly. To achieve this, vDPA introduces two new abstractions:

- address space: identified through an address space id (ASID), in which a set of memory mappings is maintained
- virtqueue group: the minimal set of virtqueues that must share an address space

A device needs to advertise the following attributes to vDPA:

- the number of address spaces supported by the device
- the number of virtqueue groups supported by the device
- the mapping from a specific virtqueue to its virtqueue group

The mapping from virtqueues to virtqueue groups is fixed and defined by the vDPA device driver. E.g.:

- A device that has hardware ASID support can simply advertise a per-virtqueue virtqueue group.
- A device that does not have hardware ASID support can simply advertise a single virtqueue group that contains all virtqueues. Or, if it wants a software-emulated control virtqueue, it can advertise two virtqueue groups: one for the cvq, another for the rest of the virtqueues.

vDPA also allows changing the association between a virtqueue group and an address space. So in the case of the control virtqueue, the userspace VMM (Qemu) may use a dedicated address space for the control virtqueue group to isolate its memory mapping.
The vhost/vhost-vDPA is also extended for userspace to:

- query the number of virtqueue groups and address spaces supported by the device
- query the virtqueue group for a specific virtqueue
- associate a virtqueue group with an address space
- send ASID-based IOTLB commands

This will help the userspace VMM (Qemu) to detect whether the control vq can be supported and to isolate the memory mappings of the control virtqueue from the others. To demonstrate the usage, the vDPA simulator is extended to support setting the MAC address via an emulated control virtqueue. Please review.

Changes since RFC:

- tweak vhost uAPI documentation
- switch to really use the device-specific IOTLB in patch 4
- tweak the commit log
- fix that the ASID in vhost was claimed to be 32bit but is actually 16bit
- fix use after free when using ASID with IOTLB batching requests
- switch to use Stefano's patch for having separated iov
- remove unused "used_as" variable
- fix the iotlb/asid checking in vhost_vdpa_unmap()

Thanks

Jason Wang (20): vhost: move the backend feature bits to vhost_types.h virtio-vdpa: don't set callback if virtio doesn't need it vhost-vdpa: passing iotlb to IOMMU mapping helpers vhost-vdpa: switch to use vhost-vdpa specific IOTLB vdpa: add the missing comment for nvqs in struct vdpa_device vdpa: introduce virtqueue groups vdpa: multiple address spaces support vdpa: introduce config operations for associating ASID to a virtqueue group vhost_iotlb: split out IOTLB initialization vhost: support ASID in IOTLB API vhost-vdpa: introduce asid based IOTLB vhost-vdpa: introduce uAPI to get the number of virtqueue groups vhost-vdpa: introduce uAPI to get the number of address spaces vhost-vdpa: uAPI to get virtqueue group id vhost-vdpa: introduce uAPI to set group ASID vhost-vdpa: support ASID based IOTLB API vdpa_sim: advertise VIRTIO_NET_F_MTU vdpa_sim: factor out buffer completion logic vdpa_sim: filter destination mac address vdpasim: control virtqueue support Stefano Garzarella (1): vdpa_sim: split
vdpasim_virtqueue's iov field in out_iov and in_iov drivers/vdpa/ifcvf/ifcvf_main.c | 9 +- drivers/vdpa/mlx5/net/mlx5_vnet.c | 11 +- drivers/vdpa/vdpa.c | 8 +- drivers/vdpa/vdpa_sim/vdpa_sim.c | 292 -- drivers/vhost/iotlb.c | 23 ++- drivers/vhost/vdpa.c | 246 - drivers/vhost/vhost.c | 23 ++- drivers/vhost/vhost.h | 4 +- drivers/virtio/virtio_vdpa.c | 2 +- include/linux/vdpa.h | 42 - include/linux/vhost_iotlb.h | 2 + include/uapi/linux/vhost.h| 25 ++- include/uapi/linux/vhost_types.h | 10 +- 13 files changed, 561 insertions(+), 136 deletions(-) -- 2.25.1
[PATCH 02/21] virtio-vdpa: don't set callback if virtio doesn't need it
There's no need to set a callback for a driver that doesn't care about it. Signed-off-by: Jason Wang --- drivers/virtio/virtio_vdpa.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/drivers/virtio/virtio_vdpa.c b/drivers/virtio/virtio_vdpa.c index 4a9ddb44b2a7..af6ee677f319 100644 --- a/drivers/virtio/virtio_vdpa.c +++ b/drivers/virtio/virtio_vdpa.c @@ -175,7 +175,7 @@ virtio_vdpa_setup_vq(struct virtio_device *vdev, unsigned int index, } /* Setup virtqueue callback */ - cb.callback = virtio_vdpa_virtqueue_cb; + cb.callback = callback ? virtio_vdpa_virtqueue_cb : NULL; cb.private = info; ops->set_vq_cb(vdpa, index, &cb); ops->set_vq_num(vdpa, index, virtqueue_get_vring_size(vq)); -- 2.25.1
[PATCH 03/21] vhost-vdpa: passing iotlb to IOMMU mapping helpers
To prepare for the ASID support for vhost-vdpa, try to pass IOTLB object to dma helpers. No functional changes, it's just a preparation for support multiple IOTLBs. Signed-off-by: Jason Wang --- drivers/vhost/vdpa.c | 41 +++-- 1 file changed, 23 insertions(+), 18 deletions(-) diff --git a/drivers/vhost/vdpa.c b/drivers/vhost/vdpa.c index 29ed4173f04e..07f92d48c173 100644 --- a/drivers/vhost/vdpa.c +++ b/drivers/vhost/vdpa.c @@ -501,10 +501,11 @@ static long vhost_vdpa_unlocked_ioctl(struct file *filep, return r; } -static void vhost_vdpa_iotlb_unmap(struct vhost_vdpa *v, u64 start, u64 last) +static void vhost_vdpa_iotlb_unmap(struct vhost_vdpa *v, + struct vhost_iotlb *iotlb, + u64 start, u64 last) { struct vhost_dev *dev = &v->vdev; - struct vhost_iotlb *iotlb = dev->iotlb; struct vhost_iotlb_map *map; struct page *page; unsigned long pfn, pinned; @@ -526,8 +527,9 @@ static void vhost_vdpa_iotlb_unmap(struct vhost_vdpa *v, u64 start, u64 last) static void vhost_vdpa_iotlb_free(struct vhost_vdpa *v) { struct vhost_dev *dev = &v->vdev; + struct vhost_iotlb *iotlb = dev->iotlb; - vhost_vdpa_iotlb_unmap(v, 0ULL, 0ULL - 1); + vhost_vdpa_iotlb_unmap(v, iotlb, 0ULL, 0ULL - 1); kfree(dev->iotlb); dev->iotlb = NULL; } @@ -554,7 +556,7 @@ static int perm_to_iommu_flags(u32 perm) return flags | IOMMU_CACHE; } -static int vhost_vdpa_map(struct vhost_vdpa *v, +static int vhost_vdpa_map(struct vhost_vdpa *v, struct vhost_iotlb *iotlb, u64 iova, u64 size, u64 pa, u32 perm) { struct vhost_dev *dev = &v->vdev; @@ -562,7 +564,7 @@ static int vhost_vdpa_map(struct vhost_vdpa *v, const struct vdpa_config_ops *ops = vdpa->config; int r = 0; - r = vhost_iotlb_add_range(dev->iotlb, iova, iova + size - 1, + r = vhost_iotlb_add_range(iotlb, iova, iova + size - 1, pa, perm); if (r) return r; @@ -571,43 +573,44 @@ static int vhost_vdpa_map(struct vhost_vdpa *v, r = ops->dma_map(vdpa, iova, size, pa, perm); } else if (ops->set_map) { if (!v->in_batch) - r = ops->set_map(vdpa, dev->iotlb); + 
r = ops->set_map(vdpa, iotlb); } else { r = iommu_map(v->domain, iova, pa, size, perm_to_iommu_flags(perm)); } if (r) - vhost_iotlb_del_range(dev->iotlb, iova, iova + size - 1); + vhost_iotlb_del_range(iotlb, iova, iova + size - 1); else atomic64_add(size >> PAGE_SHIFT, &dev->mm->pinned_vm); return r; } -static void vhost_vdpa_unmap(struct vhost_vdpa *v, u64 iova, u64 size) +static void vhost_vdpa_unmap(struct vhost_vdpa *v, +struct vhost_iotlb *iotlb, +u64 iova, u64 size) { - struct vhost_dev *dev = &v->vdev; struct vdpa_device *vdpa = v->vdpa; const struct vdpa_config_ops *ops = vdpa->config; - vhost_vdpa_iotlb_unmap(v, iova, iova + size - 1); + vhost_vdpa_iotlb_unmap(v, iotlb, iova, iova + size - 1); if (ops->dma_map) { ops->dma_unmap(vdpa, iova, size); } else if (ops->set_map) { if (!v->in_batch) - ops->set_map(vdpa, dev->iotlb); + ops->set_map(vdpa, iotlb); } else { iommu_unmap(v->domain, iova, size); } } static int vhost_vdpa_process_iotlb_update(struct vhost_vdpa *v, + struct vhost_iotlb *iotlb, struct vhost_iotlb_msg *msg) { struct vhost_dev *dev = &v->vdev; - struct vhost_iotlb *iotlb = dev->iotlb; struct page **page_list; unsigned long list_size = PAGE_SIZE / sizeof(struct page *); unsigned int gup_flags = FOLL_LONGTERM; @@ -676,7 +679,7 @@ static int vhost_vdpa_process_iotlb_update(struct vhost_vdpa *v, if (last_pfn && (this_pfn != last_pfn + 1)) { /* Pin a contiguous chunk of memory */ csize = (last_pfn - map_pfn + 1) << PAGE_SHIFT; - ret = vhost_vdpa_map(v, iova, csize, + ret = vhost_vdpa_map(v, iotlb, iova, csize, map_pfn << PAGE_SHIFT, msg->perm); if (ret) { @@ -706,7 +709,8 @@ static int vhost_v
[PATCH 04/21] vhost-vdpa: switch to use vhost-vdpa specific IOTLB
To ease the implementation of per-group ASID support for vDPA devices, this patch switches to a vhost-vdpa specific IOTLB, avoiding unnecessary refactoring of the vhost core. Signed-off-by: Jason Wang --- drivers/vhost/vdpa.c | 18 ++ 1 file changed, 10 insertions(+), 8 deletions(-) diff --git a/drivers/vhost/vdpa.c b/drivers/vhost/vdpa.c index 07f92d48c173..9bcc03d4e68b 100644 --- a/drivers/vhost/vdpa.c +++ b/drivers/vhost/vdpa.c @@ -39,6 +39,7 @@ struct vhost_vdpa { struct vhost_virtqueue *vqs; struct completion completion; struct vdpa_device *vdpa; + struct vhost_iotlb *iotlb; struct device dev; struct cdev cdev; atomic_t opened; @@ -526,12 +527,11 @@ static void vhost_vdpa_iotlb_unmap(struct vhost_vdpa *v, static void vhost_vdpa_iotlb_free(struct vhost_vdpa *v) { - struct vhost_dev *dev = &v->vdev; - struct vhost_iotlb *iotlb = dev->iotlb; + struct vhost_iotlb *iotlb = v->iotlb; vhost_vdpa_iotlb_unmap(v, iotlb, 0ULL, 0ULL - 1); - kfree(dev->iotlb); - dev->iotlb = NULL; + kfree(v->iotlb); + v->iotlb = NULL; } static int perm_to_iommu_flags(u32 perm) @@ -745,7 +745,7 @@ static int vhost_vdpa_process_iotlb_msg(struct vhost_dev *dev, struct vhost_vdpa *v = container_of(dev, struct vhost_vdpa, vdev); struct vdpa_device *vdpa = v->vdpa; const struct vdpa_config_ops *ops = vdpa->config; - struct vhost_iotlb *iotlb = dev->iotlb; + struct vhost_iotlb *iotlb = v->iotlb; int r = 0; r = vhost_dev_check_owner(dev); @@ -883,15 +883,15 @@ static int vhost_vdpa_open(struct inode *inode, struct file *filep) vhost_dev_init(dev, vqs, nvqs, 0, 0, 0, false, vhost_vdpa_process_iotlb_msg); - dev->iotlb = vhost_iotlb_alloc(0, 0); - if (!dev->iotlb) { + v->iotlb = vhost_iotlb_alloc(0, 0); + if (!v->iotlb) { r = -ENOMEM; goto err_init_iotlb; } r = vhost_vdpa_alloc_domain(v); if (r) - goto err_init_iotlb; + goto err_alloc_domain; vhost_vdpa_set_iova_range(v); @@ -899,6 +899,8 @@ static int vhost_vdpa_open(struct inode *inode, struct file *filep) return 0; +err_alloc_domain: +
vhost_vdpa_iotlb_free(v); err_init_iotlb: vhost_dev_cleanup(&v->vdev); kfree(vqs); -- 2.25.1
[PATCH 05/21] vdpa: add the missing comment for nvqs in struct vdpa_device
Signed-off-by: Jason Wang --- include/linux/vdpa.h | 1 + 1 file changed, 1 insertion(+) diff --git a/include/linux/vdpa.h b/include/linux/vdpa.h index 30bc7a7223bb..8ab8dcde705d 100644 --- a/include/linux/vdpa.h +++ b/include/linux/vdpa.h @@ -42,6 +42,7 @@ struct vdpa_vq_state { * @config: the configuration ops for this device. * @index: device index * @features_valid: were features initialized? for legacy guests + * @nvqs: the number of virtqueues */ struct vdpa_device { struct device dev; -- 2.25.1
[PATCH 07/21] vdpa: multiple address spaces support
This patch introduces multiple address spaces support for vDPA devices. The idea is to identify a specific address space via a dedicated identifier - ASID. During vDPA device allocation, the vDPA device driver needs to report the number of address spaces supported by the device, and the DMA mapping ops of the vDPA device need to be extended to support ASID. This helps to isolate the environments for virtqueues that will not be assigned directly. E.g. in the case of virtio-net, the control virtqueue will not be assigned directly to the guest. As a start, simply claim 1 virtqueue group and 1 address space for all vDPA devices. And vhost-vDPA will simply reject devices with more than 1 virtqueue group or address space. Signed-off-by: Jason Wang --- drivers/vdpa/ifcvf/ifcvf_main.c | 2 +- drivers/vdpa/mlx5/net/mlx5_vnet.c | 5 +++-- drivers/vdpa/vdpa.c | 4 +++- drivers/vdpa/vdpa_sim/vdpa_sim.c | 10 ++ drivers/vhost/vdpa.c | 14 +- include/linux/vdpa.h | 23 --- 6 files changed, 38 insertions(+), 20 deletions(-) diff --git a/drivers/vdpa/ifcvf/ifcvf_main.c b/drivers/vdpa/ifcvf/ifcvf_main.c index c629f4fcc738..8a43f562b169 100644 --- a/drivers/vdpa/ifcvf/ifcvf_main.c +++ b/drivers/vdpa/ifcvf/ifcvf_main.c @@ -445,7 +445,7 @@ static int ifcvf_probe(struct pci_dev *pdev, const struct pci_device_id *id) adapter = vdpa_alloc_device(struct ifcvf_adapter, vdpa, dev, &ifc_vdpa_ops, - IFCVF_MAX_QUEUE_PAIRS * 2, 1); + IFCVF_MAX_QUEUE_PAIRS * 2, 1, 1); if (adapter == NULL) { IFCVF_ERR(pdev, "Failed to allocate vDPA structure"); diff --git a/drivers/vdpa/mlx5/net/mlx5_vnet.c b/drivers/vdpa/mlx5/net/mlx5_vnet.c index 719b52fcc547..7aaf0a4ee80d 100644 --- a/drivers/vdpa/mlx5/net/mlx5_vnet.c +++ b/drivers/vdpa/mlx5/net/mlx5_vnet.c @@ -1804,7 +1804,8 @@ static u32 mlx5_vdpa_get_generation(struct vdpa_device *vdev) return mvdev->generation; } -static int mlx5_vdpa_set_map(struct vdpa_device *vdev, struct vhost_iotlb *iotlb) +static int mlx5_vdpa_set_map(struct vdpa_device *vdev,
unsigned int asid, +struct vhost_iotlb *iotlb) { struct mlx5_vdpa_dev *mvdev = to_mvdev(vdev); struct mlx5_vdpa_net *ndev = to_mlx5_vdpa_ndev(mvdev); @@ -1947,7 +1948,7 @@ void *mlx5_vdpa_add_dev(struct mlx5_core_dev *mdev) max_vqs = min_t(u32, max_vqs, MLX5_MAX_SUPPORTED_VQS); ndev = vdpa_alloc_device(struct mlx5_vdpa_net, mvdev.vdev, mdev->device, &mlx5_vdpa_ops, -2 * mlx5_vdpa_max_qps(max_vqs), 1); +2 * mlx5_vdpa_max_qps(max_vqs), 1, 1); if (IS_ERR(ndev)) return ndev; diff --git a/drivers/vdpa/vdpa.c b/drivers/vdpa/vdpa.c index 46399746ec7c..05195fa7865d 100644 --- a/drivers/vdpa/vdpa.c +++ b/drivers/vdpa/vdpa.c @@ -63,6 +63,7 @@ static void vdpa_release_dev(struct device *d) * @config: the bus operations that is supported by this device * @nvqs: number of virtqueues supported by this device * @ngroups: number of groups supported by this device + * @nas: number of address spaces supported by this device * @size: size of the parent structure that contains private data * * Driver should use vdpa_alloc_device() wrapper macro instead of @@ -74,7 +75,7 @@ static void vdpa_release_dev(struct device *d) struct vdpa_device *__vdpa_alloc_device(struct device *parent, const struct vdpa_config_ops *config, int nvqs, unsigned int ngroups, - size_t size) + unsigned int nas, size_t size) { struct vdpa_device *vdev; int err = -EINVAL; @@ -102,6 +103,7 @@ struct vdpa_device *__vdpa_alloc_device(struct device *parent, vdev->features_valid = false; vdev->nvqs = nvqs; vdev->ngroups = ngroups; + vdev->nas = nas; err = dev_set_name(&vdev->dev, "vdpa%u", vdev->index); if (err) diff --git a/drivers/vdpa/vdpa_sim/vdpa_sim.c b/drivers/vdpa/vdpa_sim/vdpa_sim.c index 5d554b3cd152..140de452 100644 --- a/drivers/vdpa/vdpa_sim/vdpa_sim.c +++ b/drivers/vdpa/vdpa_sim/vdpa_sim.c @@ -359,7 +359,7 @@ static struct vdpasim *vdpasim_create(void) ops = &vdpasim_net_config_ops; vdpasim = vdpa_alloc_device(struct vdpasim, vdpa, NULL, ops, - VDPASIM_VQ_NUM, 1); + VDPASIM_VQ_NUM, 1, 1); if (!vdpasim) 
goto err_alloc; @@ -606,7 +606,7 @@ static struct vdpa_iova_range vdpasim_get_iova_range(struct vdpa_device *vdpa) return range; } -static int vdpasim_set_map(
[PATCH 06/21] vdpa: introduce virtqueue groups
This patch introduces virtqueue groups to vDPA device. The virtqueue group is the minimal set of virtqueues that must share an address space. And the adddress space identifier could only be attached to a specific virtqueue group. A new mandated bus operation is introduced to get the virtqueue group ID for a specific virtqueue. All the vDPA device drivers were converted to simply support a single virtqueue group. Signed-off-by: Jason Wang --- drivers/vdpa/ifcvf/ifcvf_main.c | 9 - drivers/vdpa/mlx5/net/mlx5_vnet.c | 8 +++- drivers/vdpa/vdpa.c | 4 +++- drivers/vdpa/vdpa_sim/vdpa_sim.c | 11 ++- include/linux/vdpa.h | 12 +--- 5 files changed, 37 insertions(+), 7 deletions(-) diff --git a/drivers/vdpa/ifcvf/ifcvf_main.c b/drivers/vdpa/ifcvf/ifcvf_main.c index 8b4028556cb6..c629f4fcc738 100644 --- a/drivers/vdpa/ifcvf/ifcvf_main.c +++ b/drivers/vdpa/ifcvf/ifcvf_main.c @@ -332,6 +332,11 @@ static u32 ifcvf_vdpa_get_vq_align(struct vdpa_device *vdpa_dev) return IFCVF_QUEUE_ALIGNMENT; } +static u32 ifcvf_vdpa_get_vq_group(struct vdpa_device *vdpa, u16 idx) +{ + return 0; +} + static void ifcvf_vdpa_get_config(struct vdpa_device *vdpa_dev, unsigned int offset, void *buf, unsigned int len) @@ -392,6 +397,7 @@ static const struct vdpa_config_ops ifc_vdpa_ops = { .get_device_id = ifcvf_vdpa_get_device_id, .get_vendor_id = ifcvf_vdpa_get_vendor_id, .get_vq_align = ifcvf_vdpa_get_vq_align, + .get_vq_group = ifcvf_vdpa_get_vq_group, .get_config = ifcvf_vdpa_get_config, .set_config = ifcvf_vdpa_set_config, .set_config_cb = ifcvf_vdpa_set_config_cb, @@ -439,7 +445,8 @@ static int ifcvf_probe(struct pci_dev *pdev, const struct pci_device_id *id) adapter = vdpa_alloc_device(struct ifcvf_adapter, vdpa, dev, &ifc_vdpa_ops, - IFCVF_MAX_QUEUE_PAIRS * 2); + IFCVF_MAX_QUEUE_PAIRS * 2, 1); + if (adapter == NULL) { IFCVF_ERR(pdev, "Failed to allocate vDPA structure"); return -ENOMEM; diff --git a/drivers/vdpa/mlx5/net/mlx5_vnet.c b/drivers/vdpa/mlx5/net/mlx5_vnet.c index 
1fa6fcac8299..719b52fcc547 100644 --- a/drivers/vdpa/mlx5/net/mlx5_vnet.c +++ b/drivers/vdpa/mlx5/net/mlx5_vnet.c @@ -1436,6 +1436,11 @@ static u32 mlx5_vdpa_get_vq_align(struct vdpa_device *vdev) return PAGE_SIZE; } +static u32 mlx5_vdpa_get_vq_group(struct vdpa_device *vdpa, u16 idx) +{ + return 0; +} + enum { MLX5_VIRTIO_NET_F_GUEST_CSUM = 1 << 9, MLX5_VIRTIO_NET_F_CSUM = 1 << 10, MLX5_VIRTIO_NET_F_HOST_TSO6 = 1 << 11, @@ -1854,6 +1859,7 @@ static const struct vdpa_config_ops mlx5_vdpa_ops = { .get_vq_notification = mlx5_get_vq_notification, .get_vq_irq = mlx5_get_vq_irq, .get_vq_align = mlx5_vdpa_get_vq_align, + .get_vq_group = mlx5_vdpa_get_vq_group, .get_features = mlx5_vdpa_get_features, .set_features = mlx5_vdpa_set_features, .set_config_cb = mlx5_vdpa_set_config_cb, @@ -1941,7 +1947,7 @@ void *mlx5_vdpa_add_dev(struct mlx5_core_dev *mdev) max_vqs = min_t(u32, max_vqs, MLX5_MAX_SUPPORTED_VQS); ndev = vdpa_alloc_device(struct mlx5_vdpa_net, mvdev.vdev, mdev->device, &mlx5_vdpa_ops, -2 * mlx5_vdpa_max_qps(max_vqs)); +2 * mlx5_vdpa_max_qps(max_vqs), 1); if (IS_ERR(ndev)) return ndev; diff --git a/drivers/vdpa/vdpa.c b/drivers/vdpa/vdpa.c index a69ffc991e13..46399746ec7c 100644 --- a/drivers/vdpa/vdpa.c +++ b/drivers/vdpa/vdpa.c @@ -62,6 +62,7 @@ static void vdpa_release_dev(struct device *d) * @parent: the parent device * @config: the bus operations that is supported by this device * @nvqs: number of virtqueues supported by this device + * @ngroups: number of groups supported by this device * @size: size of the parent structure that contains private data * * Driver should use vdpa_alloc_device() wrapper macro instead of @@ -72,7 +73,7 @@ static void vdpa_release_dev(struct device *d) */ struct vdpa_device *__vdpa_alloc_device(struct device *parent, const struct vdpa_config_ops *config, - int nvqs, + int nvqs, unsigned int ngroups, size_t size) { struct vdpa_device *vdev; @@ -100,6 +101,7 @@ struct vdpa_device *__vdpa_alloc_device(struct device *parent, 
vdev->config = config; vdev->features_valid = false; vdev->nvqs = nvqs; + vdev->ngroups = ngroups; err = dev_set_name(&vdev->dev, "vdpa%u", vdev->index);
[PATCH 09/21] vhost_iotlb: split out IOTLB initialization
This patch splits out IOTLB initialization to make sure it could be reused by external modules. Signed-off-by: Jason Wang --- drivers/vhost/iotlb.c | 23 ++- include/linux/vhost_iotlb.h | 2 ++ 2 files changed, 20 insertions(+), 5 deletions(-) diff --git a/drivers/vhost/iotlb.c b/drivers/vhost/iotlb.c index 0fd3f87e913c..e842d76c179e 100644 --- a/drivers/vhost/iotlb.c +++ b/drivers/vhost/iotlb.c @@ -98,6 +98,23 @@ void vhost_iotlb_del_range(struct vhost_iotlb *iotlb, u64 start, u64 last) } EXPORT_SYMBOL_GPL(vhost_iotlb_del_range); +/** + * vhost_iotlb_init - initialize a vhost IOTLB + * @iotlb: the IOTLB that needs to be initialized + * @limit: maximum number of IOTLB entries + * @flags: VHOST_IOTLB_FLAG_XXX + */ +void vhost_iotlb_init(struct vhost_iotlb *iotlb, unsigned int limit, + unsigned int flags) +{ + iotlb->root = RB_ROOT_CACHED; + iotlb->limit = limit; + iotlb->nmaps = 0; + iotlb->flags = flags; + INIT_LIST_HEAD(&iotlb->list); +} +EXPORT_SYMBOL_GPL(vhost_iotlb_init); + /** * vhost_iotlb_alloc - add a new vhost IOTLB * @limit: maximum number of IOTLB entries @@ -112,11 +129,7 @@ struct vhost_iotlb *vhost_iotlb_alloc(unsigned int limit, unsigned int flags) if (!iotlb) return NULL; - iotlb->root = RB_ROOT_CACHED; - iotlb->limit = limit; - iotlb->nmaps = 0; - iotlb->flags = flags; - INIT_LIST_HEAD(&iotlb->list); + vhost_iotlb_init(iotlb, limit, flags); return iotlb; } diff --git a/include/linux/vhost_iotlb.h b/include/linux/vhost_iotlb.h index 6b09b786a762..c0df193ec3e1 100644 --- a/include/linux/vhost_iotlb.h +++ b/include/linux/vhost_iotlb.h @@ -33,6 +33,8 @@ int vhost_iotlb_add_range(struct vhost_iotlb *iotlb, u64 start, u64 last, u64 addr, unsigned int perm); void vhost_iotlb_del_range(struct vhost_iotlb *iotlb, u64 start, u64 last); +void vhost_iotlb_init(struct vhost_iotlb *iotlb, unsigned int limit, + unsigned int flags); struct vhost_iotlb *vhost_iotlb_alloc(unsigned int limit, unsigned int flags); void vhost_iotlb_free(struct vhost_iotlb *iotlb); void 
vhost_iotlb_reset(struct vhost_iotlb *iotlb); -- 2.25.1
[PATCH 01/21] vhost: move the backend feature bits to vhost_types.h
We should store feature bits in vhost_types.h as what has been done for e.g VHOST_F_LOG_ALL. Signed-off-by: Jason Wang --- include/uapi/linux/vhost.h | 5 - include/uapi/linux/vhost_types.h | 5 + 2 files changed, 5 insertions(+), 5 deletions(-) diff --git a/include/uapi/linux/vhost.h b/include/uapi/linux/vhost.h index c998860d7bbc..59c6c0fbaba1 100644 --- a/include/uapi/linux/vhost.h +++ b/include/uapi/linux/vhost.h @@ -89,11 +89,6 @@ /* Set or get vhost backend capability */ -/* Use message type V2 */ -#define VHOST_BACKEND_F_IOTLB_MSG_V2 0x1 -/* IOTLB can accept batching hints */ -#define VHOST_BACKEND_F_IOTLB_BATCH 0x2 - #define VHOST_SET_BACKEND_FEATURES _IOW(VHOST_VIRTIO, 0x25, __u64) #define VHOST_GET_BACKEND_FEATURES _IOR(VHOST_VIRTIO, 0x26, __u64) diff --git a/include/uapi/linux/vhost_types.h b/include/uapi/linux/vhost_types.h index f7f6a3a28977..76ee7016c501 100644 --- a/include/uapi/linux/vhost_types.h +++ b/include/uapi/linux/vhost_types.h @@ -153,4 +153,9 @@ struct vhost_vdpa_iova_range { /* vhost-net should add virtio_net_hdr for RX, and strip for TX packets. */ #define VHOST_NET_F_VIRTIO_NET_HDR 27 +/* Use message type V2 */ +#define VHOST_BACKEND_F_IOTLB_MSG_V2 0x1 +/* IOTLB can accept batching hints */ +#define VHOST_BACKEND_F_IOTLB_BATCH 0x2 + #endif -- 2.25.1
[PATCH 10/21] vhost: support ASID in IOTLB API
This patch allows userspace to send ASID-based IOTLB messages to vhost. The idea is to use the reserved u32 field in the existing V2 IOTLB message. Vhost device should advertise this capability via VHOST_BACKEND_F_IOTLB_ASID backend feature. Signed-off-by: Jason Wang --- drivers/vhost/vdpa.c | 5 - drivers/vhost/vhost.c| 23 ++- drivers/vhost/vhost.h| 4 ++-- include/uapi/linux/vhost_types.h | 5 - 4 files changed, 28 insertions(+), 9 deletions(-) diff --git a/drivers/vhost/vdpa.c b/drivers/vhost/vdpa.c index 03a9b3311c6c..feb6a58df22d 100644 --- a/drivers/vhost/vdpa.c +++ b/drivers/vhost/vdpa.c @@ -739,7 +739,7 @@ static int vhost_vdpa_process_iotlb_update(struct vhost_vdpa *v, return ret; } -static int vhost_vdpa_process_iotlb_msg(struct vhost_dev *dev, +static int vhost_vdpa_process_iotlb_msg(struct vhost_dev *dev, u32 asid, struct vhost_iotlb_msg *msg) { struct vhost_vdpa *v = container_of(dev, struct vhost_vdpa, vdev); @@ -748,6 +748,9 @@ static int vhost_vdpa_process_iotlb_msg(struct vhost_dev *dev, struct vhost_iotlb *iotlb = v->iotlb; int r = 0; + if (asid != 0) + return -EINVAL; + r = vhost_dev_check_owner(dev); if (r) return r; diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c index a262e12c6dc2..7477b724c29b 100644 --- a/drivers/vhost/vhost.c +++ b/drivers/vhost/vhost.c @@ -468,7 +468,7 @@ void vhost_dev_init(struct vhost_dev *dev, struct vhost_virtqueue **vqs, int nvqs, int iov_limit, int weight, int byte_weight, bool use_worker, - int (*msg_handler)(struct vhost_dev *dev, + int (*msg_handler)(struct vhost_dev *dev, u32 asid, struct vhost_iotlb_msg *msg)) { struct vhost_virtqueue *vq; @@ -1084,11 +1084,14 @@ static bool umem_access_ok(u64 uaddr, u64 size, int access) return true; } -static int vhost_process_iotlb_msg(struct vhost_dev *dev, +static int vhost_process_iotlb_msg(struct vhost_dev *dev, u16 asid, struct vhost_iotlb_msg *msg) { int ret = 0; + if (asid != 0) + return -EINVAL; + mutex_lock(&dev->mutex); vhost_dev_lock_vqs(dev); switch
(msg->type) { @@ -1135,6 +1138,7 @@ ssize_t vhost_chr_write_iter(struct vhost_dev *dev, struct vhost_iotlb_msg msg; size_t offset; int type, ret; + u16 asid = 0; ret = copy_from_iter(&type, sizeof(type), from); if (ret != sizeof(type)) { @@ -1150,7 +1154,16 @@ ssize_t vhost_chr_write_iter(struct vhost_dev *dev, offset = offsetof(struct vhost_msg, iotlb) - sizeof(int); break; case VHOST_IOTLB_MSG_V2: - offset = sizeof(__u32); + if (vhost_backend_has_feature(dev->vqs[0], + VHOST_BACKEND_F_IOTLB_ASID)) { + ret = copy_from_iter(&asid, sizeof(asid), from); + if (ret != sizeof(asid)) { + ret = -EINVAL; + goto done; + } + offset = sizeof(__u16); + } else + offset = sizeof(__u32); break; default: ret = -EINVAL; @@ -1165,9 +1178,9 @@ ssize_t vhost_chr_write_iter(struct vhost_dev *dev, } if (dev->msg_handler) - ret = dev->msg_handler(dev, &msg); + ret = dev->msg_handler(dev, asid, &msg); else - ret = vhost_process_iotlb_msg(dev, &msg); + ret = vhost_process_iotlb_msg(dev, asid, &msg); if (ret) { ret = -EFAULT; goto done; diff --git a/drivers/vhost/vhost.h b/drivers/vhost/vhost.h index b063324c7669..19753a90875c 100644 --- a/drivers/vhost/vhost.h +++ b/drivers/vhost/vhost.h @@ -162,7 +162,7 @@ struct vhost_dev { int byte_weight; u64 kcov_handle; bool use_worker; - int (*msg_handler)(struct vhost_dev *dev, + int (*msg_handler)(struct vhost_dev *dev, u32 asid, struct vhost_iotlb_msg *msg); }; @@ -170,7 +170,7 @@ bool vhost_exceeds_weight(struct vhost_virtqueue *vq, int pkts, int total_len); void vhost_dev_init(struct vhost_dev *, struct vhost_virtqueue **vqs, int nvqs, int iov_limit, int weight, int byte_weight, bool use_worker, - int (*msg_handler)(struct vhost_dev *dev, + int (*msg_handler)(struct vhost_dev *dev, u32 asid, struct vhost_iotlb_msg *msg)); long vhost_dev_set_owner(struct
[PATCH 08/21] vdpa: introduce config operations for associating ASID to a virtqueue group
This patch introduces a new bus operation to allow the vDPA bus driver to associate an ASID to a virtqueue group. Signed-off-by: Jason Wang --- include/linux/vdpa.h | 10 ++ 1 file changed, 10 insertions(+) diff --git a/include/linux/vdpa.h b/include/linux/vdpa.h index 0a9a754f8180..2a8671f27b0b 100644 --- a/include/linux/vdpa.h +++ b/include/linux/vdpa.h @@ -174,6 +174,12 @@ struct vdpa_iova_range { * @vdev: vdpa device * Returns the iova range supported by * the device. + * @set_group_asid:Set address space identifier for a + * virtqueue group + * @vdev: vdpa device + * @group: virtqueue group + * @asid: address space id for this group + * Returns integer: success (0) or error (< 0) * @set_map: Set device memory mapping (optional) * Needed for device that using device * specific DMA translation (on-chip IOMMU) @@ -252,6 +258,10 @@ struct vdpa_config_ops { u64 iova, u64 size, u64 pa, u32 perm); int (*dma_unmap)(struct vdpa_device *vdev, unsigned int asid, u64 iova, u64 size); + int (*set_group_asid)(struct vdpa_device *vdev, unsigned int group, + unsigned int asid); + + /* Free device resources */ void (*free)(struct vdpa_device *vdev); -- 2.25.1
[PATCH 13/21] vhost-vdpa: introduce uAPI to get the number of address spaces
This patch introduces the uAPI for getting the number of address spaces supported by this vDPA device. Signed-off-by: Jason Wang --- drivers/vhost/vdpa.c | 3 +++ include/uapi/linux/vhost.h | 2 ++ 2 files changed, 5 insertions(+) diff --git a/drivers/vhost/vdpa.c b/drivers/vhost/vdpa.c index 1ba5901b28e7..bff8aa214f78 100644 --- a/drivers/vhost/vdpa.c +++ b/drivers/vhost/vdpa.c @@ -540,6 +540,9 @@ static long vhost_vdpa_unlocked_ioctl(struct file *filep, r = copy_to_user(argp, &v->vdpa->ngroups, sizeof(v->vdpa->ngroups)); break; + case VHOST_VDPA_GET_AS_NUM: + r = copy_to_user(argp, &v->vdpa->nas, sizeof(v->vdpa->nas)); + break; case VHOST_SET_LOG_BASE: case VHOST_SET_LOG_FD: r = -ENOIOCTLCMD; diff --git a/include/uapi/linux/vhost.h b/include/uapi/linux/vhost.h index 8a4e6e426bbf..8762911a3cb8 100644 --- a/include/uapi/linux/vhost.h +++ b/include/uapi/linux/vhost.h @@ -148,4 +148,6 @@ /* Get the number of virtqueue groups. */ #define VHOST_VDPA_GET_GROUP_NUM _IOR(VHOST_VIRTIO, 0x79, unsigned int) +/* Get the number of address spaces. */ +#define VHOST_VDPA_GET_AS_NUM _IOR(VHOST_VIRTIO, 0x7A, unsigned int) #endif -- 2.25.1
[PATCH 12/21] vhost-vdpa: introduce uAPI to get the number of virtqueue groups
Following the vDPA support for multiple address spaces, this patch introduces a uAPI for userspace to query the number of virtqueue groups supported by the vDPA device. Signed-off-by: Jason Wang --- drivers/vhost/vdpa.c | 4 include/uapi/linux/vhost.h | 3 +++ 2 files changed, 7 insertions(+) diff --git a/drivers/vhost/vdpa.c b/drivers/vhost/vdpa.c index 060d5b5b7e64..1ba5901b28e7 100644 --- a/drivers/vhost/vdpa.c +++ b/drivers/vhost/vdpa.c @@ -536,6 +536,10 @@ static long vhost_vdpa_unlocked_ioctl(struct file *filep, case VHOST_VDPA_GET_VRING_NUM: r = vhost_vdpa_get_vring_num(v, argp); break; + case VHOST_VDPA_GET_GROUP_NUM: + r = copy_to_user(argp, &v->vdpa->ngroups, +sizeof(v->vdpa->ngroups)); + break; case VHOST_SET_LOG_BASE: case VHOST_SET_LOG_FD: r = -ENOIOCTLCMD; diff --git a/include/uapi/linux/vhost.h b/include/uapi/linux/vhost.h index 59c6c0fbaba1..8a4e6e426bbf 100644 --- a/include/uapi/linux/vhost.h +++ b/include/uapi/linux/vhost.h @@ -145,4 +145,7 @@ /* Get the valid iova range */ #define VHOST_VDPA_GET_IOVA_RANGE _IOR(VHOST_VIRTIO, 0x78, \ struct vhost_vdpa_iova_range) +/* Get the number of virtqueue groups. */ +#define VHOST_VDPA_GET_GROUP_NUM _IOR(VHOST_VIRTIO, 0x79, unsigned int) + #endif -- 2.25.1
[PATCH 11/21] vhost-vdpa: introduce asid based IOTLB
This patch converts the vhost-vDPA device to support multiple IOTLBs tagged via ASID via hlist. This will be used for supporting multiple address spaces in the following patches. Signed-off-by: Jason Wang --- drivers/vhost/vdpa.c | 106 --- 1 file changed, 80 insertions(+), 26 deletions(-) diff --git a/drivers/vhost/vdpa.c b/drivers/vhost/vdpa.c index feb6a58df22d..060d5b5b7e64 100644 --- a/drivers/vhost/vdpa.c +++ b/drivers/vhost/vdpa.c @@ -33,13 +33,21 @@ enum { #define VHOST_VDPA_DEV_MAX (1U << MINORBITS) +#define VHOST_VDPA_IOTLB_BUCKETS 16 + +struct vhost_vdpa_as { + struct hlist_node hash_link; + struct vhost_iotlb iotlb; + u32 id; +}; + struct vhost_vdpa { struct vhost_dev vdev; struct iommu_domain *domain; struct vhost_virtqueue *vqs; struct completion completion; struct vdpa_device *vdpa; - struct vhost_iotlb *iotlb; + struct hlist_head as[VHOST_VDPA_IOTLB_BUCKETS]; struct device dev; struct cdev cdev; atomic_t opened; @@ -49,12 +57,64 @@ struct vhost_vdpa { struct eventfd_ctx *config_ctx; int in_batch; struct vdpa_iova_range range; + int used_as; }; static DEFINE_IDA(vhost_vdpa_ida); static dev_t vhost_vdpa_major; +static struct vhost_vdpa_as *asid_to_as(struct vhost_vdpa *v, u32 asid) +{ + struct hlist_head *head = &v->as[asid % VHOST_VDPA_IOTLB_BUCKETS]; + struct vhost_vdpa_as *as; + + hlist_for_each_entry(as, head, hash_link) + if (as->id == asid) + return as; + + return NULL; +} + +static struct vhost_vdpa_as *vhost_vdpa_alloc_as(struct vhost_vdpa *v, u32 asid) +{ + struct hlist_head *head = &v->as[asid % VHOST_VDPA_IOTLB_BUCKETS]; + struct vhost_vdpa_as *as; + + if (asid_to_as(v, asid)) + return NULL; + + as = kmalloc(sizeof(*as), GFP_KERNEL); + if (!as) + return NULL; + + vhost_iotlb_init(&as->iotlb, 0, 0); + as->id = asid; + hlist_add_head(&as->hash_link, head); + ++v->used_as; + + return as; +} + +static int vhost_vdpa_remove_as(struct vhost_vdpa *v, u32 asid) +{ + struct vhost_vdpa_as *as = asid_to_as(v, asid); + + /* Remove default address space 
is not allowed */ + if (asid == 0) + return -EINVAL; + + if (!as) + return -EINVAL; + + hlist_del(&as->hash_link); + vhost_iotlb_reset(&as->iotlb); + kfree(as); + --v->used_as; + + return 0; +} + static void handle_vq_kick(struct vhost_work *work) { struct vhost_virtqueue *vq = container_of(work, struct vhost_virtqueue, @@ -525,15 +585,6 @@ static void vhost_vdpa_iotlb_unmap(struct vhost_vdpa *v, } } -static void vhost_vdpa_iotlb_free(struct vhost_vdpa *v) -{ - struct vhost_iotlb *iotlb = v->iotlb; - - vhost_vdpa_iotlb_unmap(v, iotlb, 0ULL, 0ULL - 1); - kfree(v->iotlb); - v->iotlb = NULL; -} - static int perm_to_iommu_flags(u32 perm) { int flags = 0; @@ -745,7 +796,8 @@ static int vhost_vdpa_process_iotlb_msg(struct vhost_dev *dev, u32 asid, struct vhost_vdpa *v = container_of(dev, struct vhost_vdpa, vdev); struct vdpa_device *vdpa = v->vdpa; const struct vdpa_config_ops *ops = vdpa->config; - struct vhost_iotlb *iotlb = v->iotlb; + struct vhost_vdpa_as *as = asid_to_as(v, 0); + struct vhost_iotlb *iotlb = &as->iotlb; int r = 0; if (asid != 0) @@ -856,6 +908,13 @@ static void vhost_vdpa_set_iova_range(struct vhost_vdpa *v) } } +static void vhost_vdpa_cleanup(struct vhost_vdpa *v) +{ + vhost_dev_cleanup(&v->vdev); + kfree(v->vdev.vqs); + vhost_vdpa_remove_as(v, 0); +} + static int vhost_vdpa_open(struct inode *inode, struct file *filep) { struct vhost_vdpa *v; @@ -886,15 +945,12 @@ static int vhost_vdpa_open(struct inode *inode, struct file *filep) vhost_dev_init(dev, vqs, nvqs, 0, 0, 0, false, vhost_vdpa_process_iotlb_msg); - v->iotlb = vhost_iotlb_alloc(0, 0); - if (!v->iotlb) { - r = -ENOMEM; - goto err_init_iotlb; - } + if (!vhost_vdpa_alloc_as(v, 0)) + goto err_alloc_as; r = vhost_vdpa_alloc_domain(v); if (r) - goto err_alloc_domain; + goto err_alloc_as; vhost_vdpa_set_iova_range(v); @@ -902,11 +958,8 @@ static int vhost_vdpa_open(struct inode *inode, struct file *filep) return 0; -err_alloc_domain: - vhost_vdpa_iotlb_free(v); -err_init_iotlb: - 
vhost_dev_cleanup(&v->vdev); - kfree(vqs); +err_alloc_as: + vhost_vdpa_cleanup(v); err: atomic_dec(&v->opened); re
[PATCH 14/21] vhost-vdpa: uAPI to get virtqueue group id
Following the virtqueue group support in vDPA, this patch introduces a uAPI to get the virtqueue group ID for a specific virtqueue in vhost-vdpa. Signed-off-by: Jason Wang --- drivers/vhost/vdpa.c | 8 include/uapi/linux/vhost.h | 8 2 files changed, 16 insertions(+) diff --git a/drivers/vhost/vdpa.c b/drivers/vhost/vdpa.c index bff8aa214f78..e7023abda12c 100644 --- a/drivers/vhost/vdpa.c +++ b/drivers/vhost/vdpa.c @@ -442,6 +442,14 @@ static long vhost_vdpa_vring_ioctl(struct vhost_vdpa *v, unsigned int cmd, return -EFAULT; ops->set_vq_ready(vdpa, idx, s.num); return 0; + case VHOST_VDPA_GET_VRING_GROUP: + s.index = idx; + s.num = ops->get_vq_group(vdpa, idx); + if (s.num >= vdpa->ngroups) + return -EIO; + else if (copy_to_user(argp, &s, sizeof s)) + return -EFAULT; + return 0; case VHOST_GET_VRING_BASE: r = ops->get_vq_state(v->vdpa, idx, &vq_state); if (r) diff --git a/include/uapi/linux/vhost.h b/include/uapi/linux/vhost.h index 8762911a3cb8..99de06476fdc 100644 --- a/include/uapi/linux/vhost.h +++ b/include/uapi/linux/vhost.h @@ -150,4 +150,12 @@ /* Get the number of address spaces. */ #define VHOST_VDPA_GET_AS_NUM _IOR(VHOST_VIRTIO, 0x7A, unsigned int) + +/* Get the group for a virtqueue: read index, write group in num, + * The virtqueue index is stored in the index field of + * vhost_vring_state. The group for this specific virtqueue is + * returned via num field of vhost_vring_state. + */ +#define VHOST_VDPA_GET_VRING_GROUP _IOWR(VHOST_VIRTIO, 0x7B, \ + struct vhost_vring_state) #endif -- 2.25.1
[PATCH 16/21] vhost-vdpa: support ASID based IOTLB API
This patch extends vhost-vdpa to support the ASID-based IOTLB API. The vhost-vdpa device will allocate multiple IOTLBs for vDPA devices that support multiple address spaces. The IOTLBs and the vDPA device memory mappings are determined and maintained through the ASID. Note that we still don't support vDPA devices with more than one address space that depend on the platform IOMMU. That work will be done by moving the IOMMU logic from vhost-vDPA to the vDPA device driver. Signed-off-by: Jason Wang --- drivers/vhost/vdpa.c | 127 -- drivers/vhost/vhost.c | 4 +- 2 files changed, 99 insertions(+), 32 deletions(-) diff --git a/drivers/vhost/vdpa.c b/drivers/vhost/vdpa.c index cd7c9a401a61..c4fda48d4273 100644 --- a/drivers/vhost/vdpa.c +++ b/drivers/vhost/vdpa.c @@ -28,7 +28,8 @@ enum { VHOST_VDPA_BACKEND_FEATURES = (1ULL << VHOST_BACKEND_F_IOTLB_MSG_V2) | - (1ULL << VHOST_BACKEND_F_IOTLB_BATCH), + (1ULL << VHOST_BACKEND_F_IOTLB_BATCH) | + (1ULL << VHOST_BACKEND_F_IOTLB_ASID), }; #define VHOST_VDPA_DEV_MAX (1U << MINORBITS) @@ -57,13 +58,20 @@ struct vhost_vdpa { struct eventfd_ctx *config_ctx; int in_batch; struct vdpa_iova_range range; - int used_as; + u32 batch_asid; }; static DEFINE_IDA(vhost_vdpa_ida); static dev_t vhost_vdpa_major; +static inline u32 iotlb_to_asid(struct vhost_iotlb *iotlb) +{ + struct vhost_vdpa_as *as = container_of(iotlb, struct + vhost_vdpa_as, iotlb); + return as->id; +} + static struct vhost_vdpa_as *asid_to_as(struct vhost_vdpa *v, u32 asid) { struct hlist_head *head = &v->as[asid % VHOST_VDPA_IOTLB_BUCKETS]; @@ -76,6 +84,16 @@ static struct vhost_vdpa_as *asid_to_as(struct vhost_vdpa *v, u32 asid) return NULL; } +static struct vhost_iotlb *asid_to_iotlb(struct vhost_vdpa *v, u32 asid) +{ + struct vhost_vdpa_as *as = asid_to_as(v, asid); + + if (!as) + return NULL; + + return &as->iotlb; +} + static struct vhost_vdpa_as *vhost_vdpa_alloc_as(struct vhost_vdpa *v, u32 asid) { struct hlist_head *head = &v->as[asid % VHOST_VDPA_IOTLB_BUCKETS]; @@ -84,6 +102,9 @@
static struct vhost_vdpa_as *vhost_vdpa_alloc_as(struct vhost_vdpa *v, u32 asid) if (asid_to_as(v, asid)) return NULL; + if (asid >= v->vdpa->nas) + return NULL; + as = kmalloc(sizeof(*as), GFP_KERNEL); if (!as) return NULL; @@ -91,18 +112,24 @@ static struct vhost_vdpa_as *vhost_vdpa_alloc_as(struct vhost_vdpa *v, u32 asid) vhost_iotlb_init(&as->iotlb, 0, 0); as->id = asid; hlist_add_head(&as->hash_link, head); - ++v->used_as; return as; } -static int vhost_vdpa_remove_as(struct vhost_vdpa *v, u32 asid) +static struct vhost_vdpa_as *vhost_vdpa_find_alloc_as(struct vhost_vdpa *v, + u32 asid) { struct vhost_vdpa_as *as = asid_to_as(v, asid); - /* Remove default address space is not allowed */ - if (asid == 0) - return -EINVAL; + if (as) + return as; + + return vhost_vdpa_alloc_as(v, asid); +} + +static int vhost_vdpa_remove_as(struct vhost_vdpa *v, u32 asid) +{ + struct vhost_vdpa_as *as = asid_to_as(v, asid); if (!as) return -EINVAL; @@ -110,7 +137,6 @@ static int vhost_vdpa_remove_as(struct vhost_vdpa *v, u32 asid) hlist_del(&as->hash_link); vhost_iotlb_reset(&as->iotlb); kfree(as); - --v->used_as; return 0; } @@ -636,6 +662,7 @@ static int vhost_vdpa_map(struct vhost_vdpa *v, struct vhost_iotlb *iotlb, struct vhost_dev *dev = &v->vdev; struct vdpa_device *vdpa = v->vdpa; const struct vdpa_config_ops *ops = vdpa->config; + u32 asid = iotlb_to_asid(iotlb); int r = 0; r = vhost_iotlb_add_range(iotlb, iova, iova + size - 1, @@ -644,10 +671,10 @@ static int vhost_vdpa_map(struct vhost_vdpa *v, struct vhost_iotlb *iotlb, return r; if (ops->dma_map) { - r = ops->dma_map(vdpa, 0, iova, size, pa, perm); + r = ops->dma_map(vdpa, asid, iova, size, pa, perm); } else if (ops->set_map) { if (!v->in_batch) - r = ops->set_map(vdpa, 0, iotlb); + r = ops->set_map(vdpa, asid, iotlb); } else { r = iommu_map(v->domain, iova, pa, size, perm_to_iommu_flags(perm)); @@ -661,23 +688,35 @@ static int vhost_vdpa_map(struct vhost_vdpa *v, struct vhost_iotlb *iotlb, return r; } -static void 
vhost_vdpa_unmap(struct vhost_vdpa *v, -
[PATCH 15/21] vhost-vdpa: introduce uAPI to set group ASID
Following the vDPA support for associating an ASID with a specific virtqueue group, this patch adds a uAPI to set the association from userspace. Signed-off-by: Jason Wang --- drivers/vhost/vdpa.c | 8 include/uapi/linux/vhost.h | 7 +++ 2 files changed, 15 insertions(+) diff --git a/drivers/vhost/vdpa.c b/drivers/vhost/vdpa.c index e7023abda12c..cd7c9a401a61 100644 --- a/drivers/vhost/vdpa.c +++ b/drivers/vhost/vdpa.c @@ -450,6 +450,14 @@ static long vhost_vdpa_vring_ioctl(struct vhost_vdpa *v, unsigned int cmd, else if (copy_to_user(argp, &s, sizeof s)) return -EFAULT; return 0; + case VHOST_VDPA_SET_GROUP_ASID: + if (copy_from_user(&s, argp, sizeof(s))) + return -EFAULT; + if (s.num >= vdpa->nas) + return -EINVAL; + if (!ops->set_group_asid) + return -ENOTSUPP; + return ops->set_group_asid(vdpa, idx, s.num); case VHOST_GET_VRING_BASE: r = ops->get_vq_state(v->vdpa, idx, &vq_state); if (r) diff --git a/include/uapi/linux/vhost.h b/include/uapi/linux/vhost.h index 99de06476fdc..5e083490f1aa 100644 --- a/include/uapi/linux/vhost.h +++ b/include/uapi/linux/vhost.h @@ -158,4 +158,11 @@ */ #define VHOST_VDPA_GET_VRING_GROUP _IOWR(VHOST_VIRTIO, 0x7B, \ struct vhost_vring_state) +/* Set the ASID for a virtqueue group. The group index is stored in + * the index field of vhost_vring_state, the ASID associated with this + * group is stored at num field of vhost_vring_state. + */ +#define VHOST_VDPA_SET_GROUP_ASID _IOW(VHOST_VIRTIO, 0x7C, \ +struct vhost_vring_state) + #endif -- 2.25.1
[PATCH 17/21] vdpa_sim: split vdpasim_virtqueue's iov field in out_iov and in_iov
From: Stefano Garzarella vringh_getdesc_iotlb() manages 2 iovs for writable and readable descriptors. This is very useful for the block device, where for each request we have both types of descriptor. Let's split the vdpasim_virtqueue's iov field in out_iov and in_iov to use them with vringh_getdesc_iotlb(). We are using VIRTIO terminology for "out" (readable by the device) and "in" (writable by the device) descriptors. Acked-by: Jason Wang Signed-off-by: Stefano Garzarella --- drivers/vdpa/vdpa_sim/vdpa_sim.c | 11 ++- 1 file changed, 6 insertions(+), 5 deletions(-) diff --git a/drivers/vdpa/vdpa_sim/vdpa_sim.c b/drivers/vdpa/vdpa_sim/vdpa_sim.c index 140de452..fe4888dfb70f 100644 --- a/drivers/vdpa/vdpa_sim/vdpa_sim.c +++ b/drivers/vdpa/vdpa_sim/vdpa_sim.c @@ -44,7 +44,8 @@ MODULE_PARM_DESC(macaddr, "Ethernet MAC address"); struct vdpasim_virtqueue { struct vringh vring; - struct vringh_kiov iov; + struct vringh_kiov in_iov; + struct vringh_kiov out_iov; unsigned short head; bool ready; u64 desc_addr; @@ -178,12 +179,12 @@ static void vdpasim_work(struct work_struct *work) while (true) { total_write = 0; - err = vringh_getdesc_iotlb(&txq->vring, &txq->iov, NULL, + err = vringh_getdesc_iotlb(&txq->vring, &txq->out_iov, NULL, &txq->head, GFP_ATOMIC); if (err <= 0) break; - err = vringh_getdesc_iotlb(&rxq->vring, NULL, &rxq->iov, + err = vringh_getdesc_iotlb(&rxq->vring, NULL, &rxq->in_iov, &rxq->head, GFP_ATOMIC); if (err <= 0) { vringh_complete_iotlb(&txq->vring, txq->head, 0); @@ -191,13 +192,13 @@ static void vdpasim_work(struct work_struct *work) } while (true) { - read = vringh_iov_pull_iotlb(&txq->vring, &txq->iov, + read = vringh_iov_pull_iotlb(&txq->vring, &txq->out_iov, vdpasim->buffer, PAGE_SIZE); if (read <= 0) break; - write = vringh_iov_push_iotlb(&rxq->vring, &rxq->iov, + write = vringh_iov_push_iotlb(&rxq->vring, &rxq->in_iov, vdpasim->buffer, read); if (write <= 0) break; -- 2.25.1
[PATCH 18/21] vdpa_sim: advertise VIRTIO_NET_F_MTU
We've already reported maximum mtu via config space, so let's advertise the feature. Signed-off-by: Jason Wang --- drivers/vdpa/vdpa_sim/vdpa_sim.c | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-) diff --git a/drivers/vdpa/vdpa_sim/vdpa_sim.c b/drivers/vdpa/vdpa_sim/vdpa_sim.c index fe4888dfb70f..8d051cf25f0a 100644 --- a/drivers/vdpa/vdpa_sim/vdpa_sim.c +++ b/drivers/vdpa/vdpa_sim/vdpa_sim.c @@ -66,7 +66,8 @@ struct vdpasim_virtqueue { static u64 vdpasim_features = (1ULL << VIRTIO_F_ANY_LAYOUT) | (1ULL << VIRTIO_F_VERSION_1) | (1ULL << VIRTIO_F_ACCESS_PLATFORM) | - (1ULL << VIRTIO_NET_F_MAC); + (1ULL << VIRTIO_NET_F_MAC) | + (1ULL << VIRTIO_NET_F_MTU); /* State of each vdpasim device */ struct vdpasim { -- 2.25.1
[PATCH 21/21] vdpasim: control virtqueue support
This patch introduces control virtqueue support for the vDPA simulator, a requirement for supporting advanced features like multiqueue. A requirement for the control virtqueue is to isolate its memory accesses from the rx/tx virtqueues. This is because when a vDPA device is used by a VM, the control virtqueue is not directly assigned to the VM; userspace (Qemu) presents a shadow control virtqueue in order to intercept commands and record the device state. The isolation is done via the virtqueue group and ASID support of vDPA through vhost-vdpa. The simulator is extended to have: 1) three virtqueues: RXVQ, TXVQ and CVQ (control virtqueue); 2) two virtqueue groups: group 0 contains RXVQ and TXVQ, group 1 contains CVQ; 3) two address spaces, which the simulator implements simply by mapping each 1:1 to an IOTLB. For the VM use case, userspace (Qemu) may assign AS 0 to group 0 and AS 1 to group 1. So we have: 1) the IOTLB for virtqueue group 0 contains the guest mappings, so RX and TX can be assigned to the guest directly; 2) the IOTLB for virtqueue group 1 contains the CVQ mappings, i.e. buffers allocated and managed by the VMM only. The CVQ of vhost-vdpa is therefore visible to the VMM only, and the guest cannot access it. For the other use cases, since AS 0 is associated with all virtqueue groups by default, all virtqueues share the same mapping. To demonstrate the function, VIRTIO_NET_F_CTRL_MAC_ADDR is implemented in the simulator so that the driver can set the mac address.
Signed-off-by: Jason Wang --- drivers/vdpa/vdpa_sim/vdpa_sim.c | 189 +++ 1 file changed, 166 insertions(+), 23 deletions(-) diff --git a/drivers/vdpa/vdpa_sim/vdpa_sim.c b/drivers/vdpa/vdpa_sim/vdpa_sim.c index fe90a783bde4..0fd06ac491cd 100644 --- a/drivers/vdpa/vdpa_sim/vdpa_sim.c +++ b/drivers/vdpa/vdpa_sim/vdpa_sim.c @@ -60,14 +60,18 @@ struct vdpasim_virtqueue { #define VDPASIM_QUEUE_MAX 256 #define VDPASIM_DEVICE_ID 0x1 #define VDPASIM_VENDOR_ID 0 -#define VDPASIM_VQ_NUM 0x2 +#define VDPASIM_VQ_NUM 0x3 +#define VDPASIM_AS_NUM 0x2 +#define VDPASIM_GROUP_NUM 0x2 #define VDPASIM_NAME "vdpasim-netdev" static u64 vdpasim_features = (1ULL << VIRTIO_F_ANY_LAYOUT) | (1ULL << VIRTIO_F_VERSION_1) | (1ULL << VIRTIO_F_ACCESS_PLATFORM) | + (1ULL << VIRTIO_NET_F_MTU) | (1ULL << VIRTIO_NET_F_MAC) | - (1ULL << VIRTIO_NET_F_MTU); + (1ULL << VIRTIO_NET_F_CTRL_VQ) | + (1ULL << VIRTIO_NET_F_CTRL_MAC_ADDR); /* State of each vdpasim device */ struct vdpasim { @@ -147,11 +151,17 @@ static void vdpasim_reset(struct vdpasim *vdpasim) { int i; - for (i = 0; i < VDPASIM_VQ_NUM; i++) + spin_lock(&vdpasim->iommu_lock); + + for (i = 0; i < VDPASIM_VQ_NUM; i++) { vdpasim_vq_reset(&vdpasim->vqs[i]); + vringh_set_iotlb(&vdpasim->vqs[i].vring, +&vdpasim->iommu[0]); + } - spin_lock(&vdpasim->iommu_lock); - vhost_iotlb_reset(vdpasim->iommu); + for (i = 0; i < VDPASIM_AS_NUM; i++) { + vhost_iotlb_reset(&vdpasim->iommu[i]); + } spin_unlock(&vdpasim->iommu_lock); vdpasim->features = 0; @@ -191,6 +201,81 @@ static bool receive_filter(struct vdpasim *vdpasim, size_t len) return false; } +virtio_net_ctrl_ack vdpasim_handle_ctrl_mac(struct vdpasim *vdpasim, + u8 cmd) +{ + struct vdpasim_virtqueue *cvq = &vdpasim->vqs[2]; + virtio_net_ctrl_ack status = VIRTIO_NET_ERR; + size_t read; + + switch (cmd) { + case VIRTIO_NET_CTRL_MAC_ADDR_SET: + read = vringh_iov_pull_iotlb(&cvq->vring, &cvq->in_iov, +(void *)vdpasim->config.mac, +ETH_ALEN); + if (read == ETH_ALEN) + status = VIRTIO_NET_OK; + break; + 
default: + break; + } + + return status; +} + +static void vdpasim_handle_cvq(struct vdpasim *vdpasim) +{ + struct vdpasim_virtqueue *cvq = &vdpasim->vqs[2]; + virtio_net_ctrl_ack status = VIRTIO_NET_ERR; + struct virtio_net_ctrl_hdr ctrl; + size_t read, write; + int err; + + if (!(vdpasim->features & (1ULL << VIRTIO_NET_F_CTRL_VQ))) + return; + + if (!cvq->ready) + return; + + while (true) { + err = vringh_getdesc_iotlb(&cvq->vring, &cvq->in_iov, + &cvq->out_iov, +
[PATCH 20/21] vdpa_sim: filter destination mac address
This patch implements a simple unicast filter for vDPA simulator. Signed-off-by: Jason Wang --- drivers/vdpa/vdpa_sim/vdpa_sim.c | 49 1 file changed, 31 insertions(+), 18 deletions(-) diff --git a/drivers/vdpa/vdpa_sim/vdpa_sim.c b/drivers/vdpa/vdpa_sim/vdpa_sim.c index e901177c6dfe..fe90a783bde4 100644 --- a/drivers/vdpa/vdpa_sim/vdpa_sim.c +++ b/drivers/vdpa/vdpa_sim/vdpa_sim.c @@ -175,6 +175,22 @@ static void vdpasim_complete(struct vdpasim_virtqueue *vq, size_t len) local_bh_enable(); } +static bool receive_filter(struct vdpasim *vdpasim, size_t len) +{ + bool modern = vdpasim->features & (1ULL << VIRTIO_F_VERSION_1); + size_t hdr_len = modern ? sizeof(struct virtio_net_hdr_v1) : + sizeof(struct virtio_net_hdr); + + if (len < ETH_ALEN + hdr_len) + return false; + + if (!strncmp(vdpasim->buffer + hdr_len, +vdpasim->config.mac, ETH_ALEN)) + return true; + + return false; +} + static void vdpasim_work(struct work_struct *work) { struct vdpasim *vdpasim = container_of(work, struct @@ -182,7 +198,6 @@ static void vdpasim_work(struct work_struct *work) struct vdpasim_virtqueue *txq = &vdpasim->vqs[1]; struct vdpasim_virtqueue *rxq = &vdpasim->vqs[0]; ssize_t read, write; - size_t total_write; int pkts = 0; int err; @@ -195,36 +210,34 @@ static void vdpasim_work(struct work_struct *work) goto out; while (true) { - total_write = 0; err = vringh_getdesc_iotlb(&txq->vring, &txq->out_iov, NULL, &txq->head, GFP_ATOMIC); if (err <= 0) break; + read = vringh_iov_pull_iotlb(&txq->vring, &txq->out_iov, +vdpasim->buffer, +PAGE_SIZE); + + if (!receive_filter(vdpasim, read)) { + vdpasim_complete(txq, 0); + continue; + } + err = vringh_getdesc_iotlb(&rxq->vring, NULL, &rxq->in_iov, &rxq->head, GFP_ATOMIC); if (err <= 0) { - vringh_complete_iotlb(&txq->vring, txq->head, 0); + vdpasim_complete(txq, 0); break; } - while (true) { - read = vringh_iov_pull_iotlb(&txq->vring, &txq->out_iov, -vdpasim->buffer, -PAGE_SIZE); - if (read <= 0) - break; - - write = 
vringh_iov_push_iotlb(&rxq->vring, &rxq->in_iov, - vdpasim->buffer, read); - if (write <= 0) - break; - - total_write += write; - } + write = vringh_iov_push_iotlb(&rxq->vring, &rxq->in_iov, + vdpasim->buffer, read); + if (write <= 0) + break; vdpasim_complete(txq, 0); - vdpasim_complete(rxq, total_write); + vdpasim_complete(rxq, write); if (++pkts > 4) { schedule_work(&vdpasim->work); -- 2.25.1
[PATCH 19/21] vdpa_sim: factor out buffer completion logic
Signed-off-by: Jason Wang --- drivers/vdpa/vdpa_sim/vdpa_sim.c | 33 +--- 1 file changed, 18 insertions(+), 15 deletions(-) diff --git a/drivers/vdpa/vdpa_sim/vdpa_sim.c b/drivers/vdpa/vdpa_sim/vdpa_sim.c index 8d051cf25f0a..e901177c6dfe 100644 --- a/drivers/vdpa/vdpa_sim/vdpa_sim.c +++ b/drivers/vdpa/vdpa_sim/vdpa_sim.c @@ -159,6 +159,22 @@ static void vdpasim_reset(struct vdpasim *vdpasim) ++vdpasim->generation; } +static void vdpasim_complete(struct vdpasim_virtqueue *vq, size_t len) +{ + /* Make sure data is wrote before advancing index */ + smp_wmb(); + + vringh_complete_iotlb(&vq->vring, vq->head, len); + + /* Make sure used is visible before rasing the interrupt. */ + smp_wmb(); + + local_bh_disable(); + if (vq->cb) + vq->cb(vq->private); + local_bh_enable(); +} + static void vdpasim_work(struct work_struct *work) { struct vdpasim *vdpasim = container_of(work, struct @@ -207,21 +223,8 @@ static void vdpasim_work(struct work_struct *work) total_write += write; } - /* Make sure data is wrote before advancing index */ - smp_wmb(); - - vringh_complete_iotlb(&txq->vring, txq->head, 0); - vringh_complete_iotlb(&rxq->vring, rxq->head, total_write); - - /* Make sure used is visible before rasing the interrupt. */ - smp_wmb(); - - local_bh_disable(); - if (txq->cb) - txq->cb(txq->private); - if (rxq->cb) - rxq->cb(rxq->private); - local_bh_enable(); + vdpasim_complete(txq, 0); + vdpasim_complete(rxq, total_write); if (++pkts > 4) { schedule_work(&vdpasim->work); -- 2.25.1
Re: [PATCH net 2/2] vhost_net: fix high cpu load when sendmsg fails
- Original Message - > > -Original Message- > > From: Jason Wang [mailto:jasow...@redhat.com] > > Sent: Wednesday, December 16, 2020 1:57 PM > > To: wangyunjian > > Cc: netdev@vger.kernel.org; m...@redhat.com; willemdebruijn kernel > > ; > > virtualizat...@lists.linux-foundation.org; > > Lilijun (Jerry) ; chenchanghu > > ; xudingke ; huangbin (J) > > > > Subject: Re: [PATCH net 2/2] vhost_net: fix high cpu load when sendmsg > > fails > > > > > > > > - Original Message - > > > > > > > > > > -Original Message- > > > > From: Jason Wang [mailto:jasow...@redhat.com] > > > > Sent: Tuesday, December 15, 2020 12:10 PM > > > > To: wangyunjian ; netdev@vger.kernel.org; > > > > m...@redhat.com; willemdebruijn.ker...@gmail.com > > > > Cc: virtualizat...@lists.linux-foundation.org; Lilijun (Jerry) > > > > ; chenchanghu ; > > > > xudingke ; huangbin (J) > > > > > > > > Subject: Re: [PATCH net 2/2] vhost_net: fix high cpu load when sendmsg > > > > fails > > > > > > > > > > > > On 2020/12/15 上午9:48, wangyunjian wrote: > > > > > From: Yunjian Wang > > > > > > > > > > Currently we break the loop and wake up the vhost_worker when > > sendmsg > > > > > fails. When the worker wakes up again, we'll meet the same error. > > > > > This > > > > > will cause high CPU load. To fix this issue, we can skip this > > > > > description by ignoring the error. When we exceeds sndbuf, the return > > > > > value of sendmsg is -EAGAIN. In the case we don't skip the > > > > > description > > > > > and don't drop packet. 
> > > > > > > > > > Signed-off-by: Yunjian Wang > > > > > --- > > > > > drivers/vhost/net.c | 21 + > > > > > 1 file changed, 9 insertions(+), 12 deletions(-) > > > > > > > > > > diff --git a/drivers/vhost/net.c b/drivers/vhost/net.c index > > > > > c8784dfafdd7..f966592d8900 100644 > > > > > --- a/drivers/vhost/net.c > > > > > +++ b/drivers/vhost/net.c > > > > > @@ -827,16 +827,13 @@ static void handle_tx_copy(struct vhost_net > > *net, > > > > struct socket *sock) > > > > > msg.msg_flags &= ~MSG_MORE; > > > > > } > > > > > > > > > > - /* TODO: Check specific error and bomb out unless > > > > > ENOBUFS? > > */ > > > > > err = sock->ops->sendmsg(sock, &msg, len); > > > > > - if (unlikely(err < 0)) { > > > > > + if (unlikely(err == -EAGAIN)) { > > > > > vhost_discard_vq_desc(vq, 1); > > > > > vhost_net_enable_vq(net, vq); > > > > > break; > > > > > - } > > > > > > > > > > > > As I've pointed out in last version. If you don't discard descriptor, > > > > you > > > > probably > > > > need to add the head to used ring. Otherwise this descriptor will be > > > > always > > > > inflight that may confuse drivers. > > > > > > Sorry for missing the comment. > > > > > > After deleting discard descriptor and break, the next processing will be > > > the > > > same > > > as the normal success of sendmsg(), and vhost_zerocopy_signal_used() or > > > vhost_add_used_and_signal() method will be called to add the head to used > > > ring. > > > > It's the next head not the one that contains the buggy packet? > > In the modified code logic, the head added to used ring is exectly the > one that contains the buggy packet. -ENOTEA :( You're right, I misread the code. Thanks > > Thanks > > > > > Thanks > > > > > > > > Thanks > > > > > > > > > > > > > - if (err != len) > > > > > - pr_debug("Truncated TX packet: len %d != %zd\n", > > > > > - err, len); > > > > > + } else if (unlikely(err < 0 || err != len)) > > > > > > > > > > > >
Re: [PATCH net v2 2/2] vhost_net: fix high cpu load when sendmsg fails
On 2020/12/16 下午5:23, Michael S. Tsirkin wrote: On Wed, Dec 16, 2020 at 04:20:37PM +0800, wangyunjian wrote: From: Yunjian Wang Currently we break the loop and wake up the vhost_worker when sendmsg fails. When the worker wakes up again, we'll meet the same error. This will cause high CPU load. To fix this issue, we can skip this description by ignoring the error. When we exceeds sndbuf, the return value of sendmsg is -EAGAIN. In the case we don't skip the description and don't drop packet. Question: with this patch, what happens if sendmsg is interrupted by a signal? Since we use MSG_DONTWAIT, we don't need to care about signal I think. Thanks
Re: [PATCH 00/21] Control VQ support in vDPA
On 2020/12/16 下午5:47, Michael S. Tsirkin wrote: On Wed, Dec 16, 2020 at 02:47:57PM +0800, Jason Wang wrote: Hi All: This series tries to add support for the control virtqueue in vDPA. The control virtqueue is used by networking devices for accepting various commands from the driver. It's a must for supporting multiqueue and other configurations. When used by the vhost-vDPA bus driver for a VM, the control virtqueue should be shadowed via the userspace VMM (Qemu) instead of being assigned directly to the Guest. This is because Qemu needs to know the device state in order to start and stop the device correctly (e.g. for Live Migration). This requires isolating the memory mapping for the control virtqueue presented by vhost-vDPA to prevent the guest from accessing it directly. To achieve this, vDPA introduces two new abstractions: - address space: identified through an address space id (ASID), within which a set of memory mappings is maintained - virtqueue group: the minimal set of virtqueues that must share an address space How will this support the pretty common case where the control vq is programmed by the kernel through the PF, and others by the VFs? In this case, the VF parent needs to provide a software control vq, decode the commands and then send them to the VF. I actually thought the way to support it is by exposing something like an "inject buffers" API which sends data to a given VQ. Maybe an ioctl, and maybe down the road uio ring can support batching these So the virtqueue allows the request to be processed asynchronously (e.g. the driver may choose to use interrupts for the control vq). This means we need to support that at the uAPI level. And if we manage to do that, it's just another type of virtqueue. For virtio-vDPA, this also means extensions for queue processing, which is functional duplication. Using what's proposed in this series, we don't need any changes to kernel virtio drivers. 
What's more important, this series could be used for future features that require DMA isolation between virtqueues: - report dirty pages via virtqueue - sub-function level device slicing ... Thanks Device needs to advertise the following attributes to vDPA: - the number of address spaces supported by the device - the number of virtqueue groups supported by the device - the mapping from a specific virtqueue to its virtqueue group The mapping from virtqueues to virtqueue groups is fixed and defined by the vDPA device driver. E.g: - For a device that has hardware ASID support, it can simply advertise a per-virtqueue virtqueue group. - For a device that does not have hardware ASID support, it can simply advertise a single virtqueue group that contains all virtqueues. Or if it wants a software-emulated control virtqueue, it can advertise two virtqueue groups, one for the cvq, another for the rest of the virtqueues. vDPA also allows changing the association between a virtqueue group and an address space. So in the case of a control virtqueue, the userspace VMM (Qemu) may use a dedicated address space for the control virtqueue group to isolate its memory mapping. The vhost/vhost-vDPA uAPI is also extended for userspace to: - query the number of virtqueue groups and address spaces supported by the device - query the virtqueue group for a specific virtqueue - associate a virtqueue group with an address space - send ASID-based IOTLB commands This will help the userspace VMM (Qemu) to detect whether the control vq can be supported and to isolate the memory mappings of the control virtqueue from the others. To demonstrate the usage, the vDPA simulator is extended to support setting the MAC address via an emulated control virtqueue. Please review. 
Changes since RFC:
- tweak vhost uAPI documentation
- switch to really using a device-specific IOTLB in patch 4
- tweak the commit log
- fix that the ASID in the vhost uAPI is claimed to be 32bit but is actually 16bit
- fix use after free when using ASID with IOTLB batching requests
- switch to using Stefano's patch for having a separated iov
- remove unused "used_as" variable
- fix the iotlb/asid checking in vhost_vdpa_unmap()

Thanks

Jason Wang (20):
  vhost: move the backend feature bits to vhost_types.h
  virtio-vdpa: don't set callback if virtio doesn't need it
  vhost-vdpa: passing iotlb to IOMMU mapping helpers
  vhost-vdpa: switch to use vhost-vdpa specific IOTLB
  vdpa: add the missing comment for nvqs in struct vdpa_device
  vdpa: introduce virtqueue groups
  vdpa: multiple address spaces support
  vdpa: introduce config operations for associating ASID to a virtqueue group
  vhost_iotlb: split out IOTLB initialization
  vhost: support ASID in IOTLB API
  vhost-vdpa: introduce asid based IOTLB
  vhost-vdpa: introduce uAPI to get the number of virtqueue groups
  vhost-vdpa: introduce uAPI to get the number of address spaces
  vhost-vdpa: uAPI to get virtqueue group id
  vhost-vdpa: introduce uAPI to set
Re: [PATCH 00/21] Control VQ support in vDPA
On 2020/12/17 下午3:58, Michael S. Tsirkin wrote: On Thu, Dec 17, 2020 at 11:30:18AM +0800, Jason Wang wrote: On 2020/12/16 下午5:47, Michael S. Tsirkin wrote: On Wed, Dec 16, 2020 at 02:47:57PM +0800, Jason Wang wrote: Hi All: This series tries to add the support for control virtqueue in vDPA. Control virtqueue is used by networking device for accepting various commands from the driver. It's a must to support multiqueue and other configurations. When used by vhost-vDPA bus driver for VM, the control virtqueue should be shadowed via userspace VMM (Qemu) instead of being assigned directly to Guest. This is because Qemu needs to know the device state in order to start and stop device correctly (e.g for Live Migration). This requies to isolate the memory mapping for control virtqueue presented by vhost-vDPA to prevent guest from accesing it directly. To achieve this, vDPA introduce two new abstractions: - address space: identified through address space id (ASID) and a set of memory mapping in maintained - virtqueue group: the minimal set of virtqueues that must share an address space How will this support the pretty common case where control vq is programmed by the kernel through the PF, and others by the VFs? In this case, the VF parent need to provide a software control vq and decode the command then send them to VF. But how does that tie to the address space infrastructure? In this case, address space is not a must. But the idea is to make control vq works for all types of hardware: 1) control virtqueue is implemented via VF/PF communication 2) control virtqueue is implemented by VF but not through DMA 3) control virtqueue is implemented by VF DMA, it could be either a hardware control virtqueue or other type of DMA The address space is a must for 3) to work and can work for both 1) and 2). I actually thought the way to support it is by exposing something like an "inject buffers" API which sends data to a given VQ. 
Maybe an ioctl, and maybe down the road uio ring can support batching these So the virtuqueue allows the request to be processed asynchronously (e.g driver may choose to use interrupt for control vq). This means we need to support that in uAPI level. I don't think we need to make it async, just a regular ioctl will do. In fact no guest uses the asynchronous property. It was not forbidden by the spec then we need to support that. E.g we can not assume driver doesn't assign interrupt for cvq. And if we manage to do that, it's just another type of virtqueue. For virtio-vDPA, this also means the extensions for queue processing which is a functional duplication. I don't see why, just send it to the actual control vq :) But in the case you've pointed out, there's no hardware control vq in fact. Using what proposed in this series, we don't need any changes for kernel virtio drivers. What's more important, this series could be used for future features that requires DMA isolation between virtqueues: - report dirty pages via virtqueue - sub function level device slicing I agree these are nice to have, but I am not sure basic control vq must be tied to that. If the control virtqueue is implemented via DMA through VF, it looks like a must. Thanks ... Thanks Device needs to advertise the following attributes to vDPA: - the number of address spaces supported in the device - the number of virtqueue groups supported in the device - the mappings from a specific virtqueue to its virtqueue groups The mappings from virtqueue to virtqueue groups is fixed and defined by vDPA device driver. E.g: - For the device that has hardware ASID support, it can simply advertise a per virtqueue virtqueue group. - For the device that does not have hardware ASID support, it can simply advertise a single virtqueue group that contains all virtqueues. Or if it wants a software emulated control virtqueue, it can advertise two virtqueue groups, one is for cvq, another is for the rest virtqueues. 
vDPA also allow to change the association between virtqueue group and address space. So in the case of control virtqueue, userspace VMM(Qemu) may use a dedicated address space for the control virtqueue group to isolate the memory mapping. The vhost/vhost-vDPA is also extend for the userspace to: - query the number of virtqueue groups and address spaces supported by the device - query the virtqueue group for a specific virtqueue - assocaite a virtqueue group with an address space - send ASID based IOTLB commands This will help userspace VMM(Qemu) to detect whether the control vq could be supported and isolate memory mappings of control virtqueue from the others. To demonstrate the usage, vDPA simulator is extended to support setting MAC address via a emulated control virtqueue. Please review. Changes since RFC: - tweak vhost uAPI docume
Re: [PATCH 00/21] Control VQ support in vDPA
On 2020/12/18 上午6:28, Michael S. Tsirkin wrote: On Thu, Dec 17, 2020 at 05:02:49PM +0800, Jason Wang wrote: On 2020/12/17 下午3:58, Michael S. Tsirkin wrote: On Thu, Dec 17, 2020 at 11:30:18AM +0800, Jason Wang wrote: On 2020/12/16 下午5:47, Michael S. Tsirkin wrote: On Wed, Dec 16, 2020 at 02:47:57PM +0800, Jason Wang wrote: Hi All: This series tries to add the support for control virtqueue in vDPA. Control virtqueue is used by networking device for accepting various commands from the driver. It's a must to support multiqueue and other configurations. When used by vhost-vDPA bus driver for VM, the control virtqueue should be shadowed via userspace VMM (Qemu) instead of being assigned directly to Guest. This is because Qemu needs to know the device state in order to start and stop device correctly (e.g for Live Migration). This requies to isolate the memory mapping for control virtqueue presented by vhost-vDPA to prevent guest from accesing it directly. To achieve this, vDPA introduce two new abstractions: - address space: identified through address space id (ASID) and a set of memory mapping in maintained - virtqueue group: the minimal set of virtqueues that must share an address space How will this support the pretty common case where control vq is programmed by the kernel through the PF, and others by the VFs? In this case, the VF parent need to provide a software control vq and decode the command then send them to VF. But how does that tie to the address space infrastructure? In this case, address space is not a must. That's ok, problem is I don't see how address space is going to work in this case at all. There's no address space there that userspace/guest can control. The virtqueue group is mandated by parent but the association between virtqueue group and address space is under the control of userspace (Qemu). A simple but common case is that: 1) Device advertise two virtqueue groups: group 0 contains RX and TX, group 1 contains CVQ. 
2) Device advertises two address spaces Then, for vhost-vDPA used by a VM: 1) associate group 0 with as 0, group 1 with as 1 (via the vhost-vDPA VHOST_VDPA_SET_GROUP_ASID) 2) Publish the guest memory mapping via IOTLB asid 0 3) Publish the control virtqueue mapping via IOTLB asid 1 Then the DMA is totally isolated in this case. For vhost-vDPA used by DPDK, or virtio-vDPA: 1) associate both group 0 and group 1 with as 0, since we don't need DMA isolation in this case. In order to let this be controlled by the Guest, we would need to extend the virtio spec to support these concepts. Thanks
Re: [PATCH] virtio_net: Fix recursive call to cpus_read_lock()
On 2020/12/22 上午11:36, Jeff Dike wrote: virtnet_set_channels can recursively call cpus_read_lock() if CONFIG_XPS and CONFIG_HOTPLUG are enabled. The path is:

virtnet_set_channels - calls get_online_cpus(), which is a trivial wrapper around cpus_read_lock()
  netif_set_real_num_tx_queues
    netif_reset_xps_queues_gt
      netif_reset_xps_queues - calls cpus_read_lock()

This call chain and potential deadlock happen when the number of TX queues is reduced. This commit removes the netif_set_real_num_[tr]x_queues calls from inside the get/put_online_cpus section, as they don't require that it be held.

Signed-off-by: Jeff Dike
---
Adding netdev. The patch can go with -net and is needed for -stable.

Acked-by: Jason Wang

 drivers/net/virtio_net.c | 12 +++++++-----
 1 file changed, 7 insertions(+), 5 deletions(-)

diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
index 052975ea0af4..e02c7e0f1cf9 100644
--- a/drivers/net/virtio_net.c
+++ b/drivers/net/virtio_net.c
@@ -2093,14 +2093,16 @@ static int virtnet_set_channels(struct net_device *dev,
 
 	get_online_cpus();
 	err = _virtnet_set_queues(vi, queue_pairs);
-	if (!err) {
-		netif_set_real_num_tx_queues(dev, queue_pairs);
-		netif_set_real_num_rx_queues(dev, queue_pairs);
-
-		virtnet_set_affinity(vi);
+	if (err) {
+		put_online_cpus();
+		goto err;
 	}
+	virtnet_set_affinity(vi);
 	put_online_cpus();
 
+	netif_set_real_num_tx_queues(dev, queue_pairs);
+	netif_set_real_num_rx_queues(dev, queue_pairs);
+ err:
 	return err;
 }
Re: [PATCH net v2 2/2] vhost_net: fix high cpu load when sendmsg fails
On 2020/12/22 上午7:07, Willem de Bruijn wrote: On Wed, Dec 16, 2020 at 3:20 AM wangyunjian wrote: From: Yunjian Wang Currently we break the loop and wake up the vhost_worker when sendmsg fails. When the worker wakes up again, we'll meet the same error. The patch is based on the assumption that such error cases always return EAGAIN. Can it not also be ENOMEM, such as from tun_build_skb? This will cause high CPU load. To fix this issue, we can skip this description by ignoring the error. When we exceeds sndbuf, the return value of sendmsg is -EAGAIN. In the case we don't skip the description and don't drop packet. the -> that here and above: description -> descriptor Perhaps slightly revise to more explicitly state that 1. in the case of persistent failure (i.e., bad packet), the driver drops the packet 2. in the case of transient failure (e.g,. memory pressure) the driver schedules the worker to try again later If we want to go with this way, we need a better time to wakeup the worker. Otherwise it just produces more stress on the cpu that is what this patch tries to avoid. Thanks
Re: [PATCH net v2 2/2] vhost_net: fix high cpu load when sendmsg fails
On 2020/12/22 下午10:24, Willem de Bruijn wrote: On Mon, Dec 21, 2020 at 11:41 PM Jason Wang wrote: On 2020/12/22 上午7:07, Willem de Bruijn wrote: On Wed, Dec 16, 2020 at 3:20 AM wangyunjian wrote: From: Yunjian Wang Currently we break the loop and wake up the vhost_worker when sendmsg fails. When the worker wakes up again, we'll meet the same error. The patch is based on the assumption that such error cases always return EAGAIN. Can it not also be ENOMEM, such as from tun_build_skb? This will cause high CPU load. To fix this issue, we can skip this description by ignoring the error. When we exceeds sndbuf, the return value of sendmsg is -EAGAIN. In the case we don't skip the description and don't drop packet. the -> that here and above: description -> descriptor Perhaps slightly revise to more explicitly state that 1. in the case of persistent failure (i.e., bad packet), the driver drops the packet 2. in the case of transient failure (e.g,. memory pressure) the driver schedules the worker to try again later If we want to go with this way, we need a better time to wakeup the worker. Otherwise it just produces more stress on the cpu that is what this patch tries to avoid. Perhaps I misunderstood the purpose of the patch: is it to drop everything, regardless of transient or persistent failure, until the ring runs out of descriptors? My understanding is that the main motivation is to avoid high cpu utilization when sendmsg() fails due to a guest-side reason (e.g. a bad packet). I can understand both a blocking and drop strategy during memory pressure. But partial drop strategy until exceeding ring capacity seems like a peculiar hybrid? Yes. So I wonder if we want to do better when we are under memory pressure. E.g. can we let the socket wake us up instead of rescheduling the workers here? At least in this case we know some memory might have been freed? Thanks
Re: [RFC v2 00/13] Introduce VDUSE - vDPA Device in Userspace
On 2020/12/22 下午10:52, Xie Yongji wrote: This series introduces a framework, which can be used to implement vDPA Devices in a userspace program. The work consists of two parts: control path forwarding and data path offloading. In the control path, the VDUSE driver will make use of a message mechanism to forward the config operations from the vdpa bus driver to userspace. Userspace can use read()/write() to receive/reply to those control messages. In the data path, the core is mapping the dma buffer into the VDUSE daemon's address space, which can be implemented in different ways depending on the vdpa bus to which the vDPA device is attached. In the virtio-vdpa case, we implement an MMU-based on-chip IOMMU driver with a bounce-buffering mechanism to achieve that. Rethinking the bounce buffer stuff: I wonder, instead of using kernel pages with mmap(), how about just using userspace pages like vhost does? It means we need a worker to do the bouncing, but we don't need to care about annoying stuff like page reclaiming. And in the vhost-vdpa case, the dma buffer resides in a userspace memory region which can be shared with the VDUSE userspace process via transferring the shmfd. 
The details and our use case are shown below:

[ASCII architecture diagram, garbled in this archive: a container consuming /dev/vdx through the virtio-blk driver and virtio-vdpa driver, and a QEMU VM consuming a vhost device through the vhost-vdpa driver, both sitting on the vdpa bus; underneath, a VDUSE daemon in userspace provides the vDPA device emulation, and its block driver connects over TCP/IP (next to a NIC) to Remote Storages]

We make use of it to implement a block device connecting to our distributed storage, which can be used both in containers and VMs. Thus, we can have a unified technology stack in these two cases. To test it with null-blk: $ qemu-storage-daemon \ --chardev socket,id=charmonitor,path=/tmp/qmp.sock,server,nowait \ --monitor chardev=charmonitor
Re: [RFC v2 06/13] vduse: Introduce VDUSE - vDPA Device in Userspace
On 2020/12/22 下午10:52, Xie Yongji wrote: This VDUSE driver enables implementing vDPA devices in userspace. Both the control path and data path of vDPA devices will be able to be handled in userspace. In the control path, the VDUSE driver will make use of a message mechanism to forward the config operations from the vdpa bus driver to userspace. Userspace can use read()/write() to receive/reply to those control messages. In the data path, the VDUSE driver implements an MMU-based on-chip IOMMU driver which supports mapping the kernel dma buffer to a userspace iova region dynamically. Userspace can access those iova regions via mmap(). Besides, the eventfd mechanism is used to trigger interrupt callbacks and receive virtqueue kicks in userspace. Now we only support the virtio-vdpa bus driver with this patch applied.

Signed-off-by: Xie Yongji
---
 Documentation/driver-api/vduse.rst                 |   74 ++
 Documentation/userspace-api/ioctl/ioctl-number.rst |    1 +
 drivers/vdpa/Kconfig                               |    8 +
 drivers/vdpa/Makefile                              |    1 +
 drivers/vdpa/vdpa_user/Makefile                    |    5 +
 drivers/vdpa/vdpa_user/eventfd.c                   |  221 ++++
 drivers/vdpa/vdpa_user/eventfd.h                   |   48 +
 drivers/vdpa/vdpa_user/iova_domain.c               |  442 ++++++++
 drivers/vdpa/vdpa_user/iova_domain.h               |   93 ++
 drivers/vdpa/vdpa_user/vduse.h                     |   59 ++
 drivers/vdpa/vdpa_user/vduse_dev.c                 | 1121 ++++++++++++++++++++
 include/uapi/linux/vdpa.h                          |    1 +
 include/uapi/linux/vduse.h                         |   99 ++
 13 files changed, 2173 insertions(+)
 create mode 100644 Documentation/driver-api/vduse.rst
 create mode 100644 drivers/vdpa/vdpa_user/Makefile
 create mode 100644 drivers/vdpa/vdpa_user/eventfd.c
 create mode 100644 drivers/vdpa/vdpa_user/eventfd.h
 create mode 100644 drivers/vdpa/vdpa_user/iova_domain.c
 create mode 100644 drivers/vdpa/vdpa_user/iova_domain.h
 create mode 100644 drivers/vdpa/vdpa_user/vduse.h
 create mode 100644 drivers/vdpa/vdpa_user/vduse_dev.c
 create mode 100644 include/uapi/linux/vduse.h

diff --git a/Documentation/driver-api/vduse.rst b/Documentation/driver-api/vduse.rst new file mode 100644 index ..da9b3040f20a --- /dev/null +++ 
b/Documentation/driver-api/vduse.rst @@ -0,0 +1,74 @@ +== +VDUSE - "vDPA Device in Userspace" +== + +vDPA (virtio data path acceleration) device is a device that uses a +datapath which complies with the virtio specifications with vendor +specific control path. vDPA devices can be both physically located on +the hardware or emulated by software. VDUSE is a framework that makes it +possible to implement software-emulated vDPA devices in userspace. + +How VDUSE works + +Each userspace vDPA device is created by the VDUSE_CREATE_DEV ioctl on +the VDUSE character device (/dev/vduse). Then a file descriptor pointing +to the new resources will be returned, which can be used to implement the +userspace vDPA device's control path and data path. + +To implement control path, the read/write operations to the file descriptor +will be used to receive/reply the control messages from/to VDUSE driver. +Those control messages are based on the vdpa_config_ops which defines a +unified interface to control different types of vDPA device. + +The following types of messages are provided by the VDUSE framework now: + +- VDUSE_SET_VQ_ADDR: Set the addresses of the different aspects of virtqueue. + +- VDUSE_SET_VQ_NUM: Set the size of virtqueue + +- VDUSE_SET_VQ_READY: Set ready status of virtqueue + +- VDUSE_GET_VQ_READY: Get ready status of virtqueue + +- VDUSE_SET_FEATURES: Set virtio features supported by the driver + +- VDUSE_GET_FEATURES: Get virtio features supported by the device + +- VDUSE_SET_STATUS: Set the device status + +- VDUSE_GET_STATUS: Get the device status + +- VDUSE_SET_CONFIG: Write to device specific configuration space + +- VDUSE_GET_CONFIG: Read from device specific configuration space + +Please see include/linux/vdpa.h for details. + +In the data path, VDUSE framework implements a MMU-based on-chip IOMMU +driver which supports mapping the kernel dma buffer to a userspace iova +region dynamically. 
The userspace iova region can be created by passing +the userspace vDPA device fd to mmap(2). + +Besides, the eventfd mechanism is used to trigger interrupt callbacks and +receive virtqueue kicks in userspace. The following ioctls on the userspace +vDPA device fd are provided to support that: + +- VDUSE_VQ_SETUP_KICKFD: set the kickfd for virtqueue, this eventfd is used + by VDUSE driver to notify userspace to consume the vring. + +- VDUSE_VQ_SETUP_IRQFD: set the irqfd for virtqueue, this eventfd is used + by userspace to notify VDUSE driver to trigger interrupt callbacks. + +MMU-based IOMMU Driver +
Re: [RFC v2 00/13] Introduce VDUSE - vDPA Device in Userspace
On 2020/12/23 下午2:38, Jason Wang wrote: V1 to V2: - Add vhost-vdpa support I may have missed something, but I don't see any code to support that. E.g. neither set_map nor dma_map/unmap is implemented in the config ops. Thanks Spoke too fast :( I see a new config op was introduced. Let me dive into that. Thanks
Re: [RFC v2 08/13] vdpa: Introduce process_iotlb_msg() in vdpa_config_ops
On 2020/12/22 下午10:52, Xie Yongji wrote: This patch introduces a new method in the vdpa_config_ops to support processing the raw vhost memory mapping message in the vDPA device driver. Signed-off-by: Xie Yongji --- drivers/vhost/vdpa.c | 5 - include/linux/vdpa.h | 7 +++ 2 files changed, 11 insertions(+), 1 deletion(-) diff --git a/drivers/vhost/vdpa.c b/drivers/vhost/vdpa.c index 448be7875b6d..ccbb391e38be 100644 --- a/drivers/vhost/vdpa.c +++ b/drivers/vhost/vdpa.c @@ -728,6 +728,9 @@ static int vhost_vdpa_process_iotlb_msg(struct vhost_dev *dev, if (r) return r; + if (ops->process_iotlb_msg) + return ops->process_iotlb_msg(vdpa, msg); + switch (msg->type) { case VHOST_IOTLB_UPDATE: r = vhost_vdpa_process_iotlb_update(v, msg); @@ -770,7 +773,7 @@ static int vhost_vdpa_alloc_domain(struct vhost_vdpa *v) int ret; /* Device want to do DMA by itself */ - if (ops->set_map || ops->dma_map) + if (ops->set_map || ops->dma_map || ops->process_iotlb_msg) return 0; bus = dma_dev->bus; diff --git a/include/linux/vdpa.h b/include/linux/vdpa.h index 656fe264234e..7bccedf22f4b 100644 --- a/include/linux/vdpa.h +++ b/include/linux/vdpa.h @@ -5,6 +5,7 @@ #include #include #include +#include #include #include @@ -172,6 +173,10 @@ struct vdpa_iova_range { *@vdev: vdpa device *Returns the iova range supported by *the device. 
+ * @process_iotlb_msg: Process vhost memory mapping message (optional) + * Only used for VDUSE device now + * @vdev: vdpa device + * @msg: vhost memory mapping message * @set_map: Set device memory mapping (optional) *Needed for device that using device *specific DMA translation (on-chip IOMMU) @@ -240,6 +245,8 @@ struct vdpa_config_ops { struct vdpa_iova_range (*get_iova_range)(struct vdpa_device *vdev); /* DMA ops */ + int (*process_iotlb_msg)(struct vdpa_device *vdev, +struct vhost_iotlb_msg *msg); int (*set_map)(struct vdpa_device *vdev, struct vhost_iotlb *iotlb); int (*dma_map)(struct vdpa_device *vdev, u64 iova, u64 size, u64 pa, u32 perm); Is there any reason that it can't be done via dma_map/dma_unmap or set_map? Thanks
Re: [RFC v2 09/13] vduse: Add support for processing vhost iotlb message
On 2020/12/22 下午10:52, Xie Yongji wrote: To support vhost-vdpa bus driver, we need a way to share the vhost-vdpa backend process's memory with the userspace VDUSE process. This patch tries to make use of the vhost iotlb message to achieve that. We will get the shm file from the iotlb message and pass it to the userspace VDUSE process. Signed-off-by: Xie Yongji --- Documentation/driver-api/vduse.rst | 15 +++- drivers/vdpa/vdpa_user/vduse_dev.c | 147 - include/uapi/linux/vduse.h | 11 +++ 3 files changed, 171 insertions(+), 2 deletions(-) diff --git a/Documentation/driver-api/vduse.rst b/Documentation/driver-api/vduse.rst index 623f7b040ccf..48e4b1ba353f 100644 --- a/Documentation/driver-api/vduse.rst +++ b/Documentation/driver-api/vduse.rst @@ -46,13 +46,26 @@ The following types of messages are provided by the VDUSE framework now: - VDUSE_GET_CONFIG: Read from device specific configuration space +- VDUSE_UPDATE_IOTLB: Update the memory mapping in device IOTLB + +- VDUSE_INVALIDATE_IOTLB: Invalidate the memory mapping in device IOTLB + Please see include/linux/vdpa.h for details. -In the data path, VDUSE framework implements a MMU-based on-chip IOMMU +The data path of userspace vDPA device is implemented in different ways +depending on the vdpa bus to which it is attached. + +In virtio-vdpa case, VDUSE framework implements a MMU-based on-chip IOMMU driver which supports mapping the kernel dma buffer to a userspace iova region dynamically. The userspace iova region can be created by passing the userspace vDPA device fd to mmap(2). +In vhost-vdpa case, the dma buffer is reside in a userspace memory region +which will be shared to the VDUSE userspace processs via the file +descriptor in VDUSE_UPDATE_IOTLB message. And the corresponding address +mapping (IOVA of dma buffer <-> VA of the memory region) is also included +in this message. + Besides, the eventfd mechanism is used to trigger interrupt callbacks and receive virtqueue kicks in userspace. 
The following ioctls on the userspace vDPA device fd are provided to support that: diff --git a/drivers/vdpa/vdpa_user/vduse_dev.c b/drivers/vdpa/vdpa_user/vduse_dev.c index b974333ed4e9..d24aaacb6008 100644 --- a/drivers/vdpa/vdpa_user/vduse_dev.c +++ b/drivers/vdpa/vdpa_user/vduse_dev.c @@ -34,6 +34,7 @@ struct vduse_dev_msg { struct vduse_dev_request req; + struct file *iotlb_file; struct vduse_dev_response resp; struct list_head list; wait_queue_head_t waitq; @@ -325,12 +326,80 @@ static int vduse_dev_set_vq_state(struct vduse_dev *dev, return ret; } +static int vduse_dev_update_iotlb(struct vduse_dev *dev, struct file *file, + u64 offset, u64 iova, u64 size, u8 perm) +{ + struct vduse_dev_msg *msg; + int ret; + + if (!size) + return -EINVAL; + + msg = vduse_dev_new_msg(dev, VDUSE_UPDATE_IOTLB); + msg->req.size = sizeof(struct vduse_iotlb); + msg->req.iotlb.offset = offset; + msg->req.iotlb.iova = iova; + msg->req.iotlb.size = size; + msg->req.iotlb.perm = perm; + msg->req.iotlb.fd = -1; + msg->iotlb_file = get_file(file); + + ret = vduse_dev_msg_sync(dev, msg); My feeling is that we should provide consistent API for the userspace device to use. E.g we'd better carry the IOTLB message for both virtio/vhost drivers. It looks to me for virtio drivers we can still use UPDAT_IOTLB message by using VDUSE file as msg->iotlb_file here. 
+ vduse_dev_msg_put(msg); + fput(file); + + return ret; +} + +static int vduse_dev_invalidate_iotlb(struct vduse_dev *dev, + u64 iova, u64 size) +{ + struct vduse_dev_msg *msg; + int ret; + + if (!size) + return -EINVAL; + + msg = vduse_dev_new_msg(dev, VDUSE_INVALIDATE_IOTLB); + msg->req.size = sizeof(struct vduse_iotlb); + msg->req.iotlb.iova = iova; + msg->req.iotlb.size = size; + + ret = vduse_dev_msg_sync(dev, msg); + vduse_dev_msg_put(msg); + + return ret; +} + +static unsigned int perm_to_file_flags(u8 perm) +{ + unsigned int flags = 0; + + switch (perm) { + case VHOST_ACCESS_WO: + flags |= O_WRONLY; + break; + case VHOST_ACCESS_RO: + flags |= O_RDONLY; + break; + case VHOST_ACCESS_RW: + flags |= O_RDWR; + break; + default: + WARN(1, "invalidate vhost IOTLB permission\n"); + break; + } + + return flags; +} + static ssize_t vduse_dev_read_iter(struct kiocb *iocb, struct iov_iter *to) { struct file *file = iocb->ki_filp; struct vduse_dev *dev = file->private_data; struct vduse_dev_msg *msg; - int size = sizeof(struct vduse_dev_request); +
Re: [External] Re: [PATCH 0/7] Introduce vdpa management tool
On 2020/11/30 下午3:07, Yongji Xie wrote: Thanks for adding me, Jason! Now I'm working on a v2 patchset for VDUSE (vDPA Device in Userspace) [1]. This tool is very useful for the vduse device. So I'm considering integrating this into my v2 patchset. But there is one problem: In this tool, vdpa device config action and enable action are combined into one netlink msg: VDPA_CMD_DEV_NEW. But in the vduse case, it needs to be split because a chardev should be created and opened by a userspace process before we enable the vdpa device (call vdpa_register_device()). So I'd like to know whether it's possible (or whether there are some plans) to add two new netlink msgs, something like VDPA_CMD_DEV_ENABLE and VDPA_CMD_DEV_DISABLE, to make the config path more flexible. Actually, we've discussed such an intermediate step in some early discussion. It looks to me VDUSE could be one of the users of this. Or I wonder whether we can switch to using an anonymous inode (fd) for VDUSE and fetch it via a VDUSE_GET_DEVICE_FD ioctl? Yes, we can. Actually the current implementation in VDUSE is like this. But it seems like this is still an intermediate step. The fd should be bound to a name or something else which needs to be configured beforehand. The name could be specified via the netlink. It looks to me the real issue is that until the device is connected with a userspace process, it can't be used. So we also need to fail the enabling if it isn't opened. Thanks Thanks, Yongji
Re: [External] Re: [PATCH 0/7] Introduce vdpa management tool
On 2020/12/1 下午5:55, Yongji Xie wrote: On Tue, Dec 1, 2020 at 2:25 PM Jason Wang wrote: On 2020/11/30 下午3:07, Yongji Xie wrote: Thanks for adding me, Jason! Now I'm working on a v2 patchset for VDUSE (vDPA Device in Userspace) [1]. This tool is very useful for the vduse device. So I'm considering integrating this into my v2 patchset. But there is one problem: In this tool, vdpa device config action and enable action are combined into one netlink msg: VDPA_CMD_DEV_NEW. But in the vduse case, it needs to be split because a chardev should be created and opened by a userspace process before we enable the vdpa device (call vdpa_register_device()). So I'd like to know whether it's possible (or whether there are some plans) to add two new netlink msgs, something like VDPA_CMD_DEV_ENABLE and VDPA_CMD_DEV_DISABLE, to make the config path more flexible. Actually, we've discussed such an intermediate step in some early discussion. It looks to me VDUSE could be one of the users of this. Or I wonder whether we can switch to using an anonymous inode (fd) for VDUSE and fetch it via a VDUSE_GET_DEVICE_FD ioctl? Yes, we can. Actually the current implementation in VDUSE is like this. But it seems like this is still an intermediate step. The fd should be bound to a name or something else which needs to be configured beforehand. The name could be specified via the netlink. It looks to me the real issue is that until the device is connected with a userspace process, it can't be used. So we also need to fail the enabling if it isn't opened. Yes, that's true. So you mean we can first try to fetch the fd bound to a name/vduse_id via a VDUSE_GET_DEVICE_FD, then use the name/vduse_id as an attribute to create the vdpa device? It looks fine to me. Yes, something like this. The anonymous fd will be created during dev_add() and the fd will be carried in the msg to userspace. Thanks Thanks, Yongji
Re: [External] Re: [PATCH 0/7] Introduce vdpa management tool
On 2020/12/2 下午12:53, Parav Pandit wrote: From: Yongji Xie Sent: Wednesday, December 2, 2020 9:00 AM On Tue, Dec 1, 2020 at 11:59 PM Parav Pandit wrote: From: Yongji Xie Sent: Tuesday, December 1, 2020 7:49 PM On Tue, Dec 1, 2020 at 7:32 PM Parav Pandit wrote: From: Yongji Xie Sent: Tuesday, December 1, 2020 3:26 PM On Tue, Dec 1, 2020 at 2:25 PM Jason Wang wrote: On 2020/11/30 下午3:07, Yongji Xie wrote: Thanks for adding me, Jason! Now I'm working on a v2 patchset for VDUSE (vDPA Device in Userspace) [1]. This tool is very useful for the vduse device. So I'm considering integrating this into my v2 patchset. But there is one problem: In this tool, vdpa device config action and enable action are combined into one netlink msg: VDPA_CMD_DEV_NEW. But in vduse case, it needs to be splitted because a chardev should be created and opened by a userspace process before we enable the vdpa device (call vdpa_register_device()). So I'd like to know whether it's possible (or have some plans) to add two new netlink msgs something like: VDPA_CMD_DEV_ENABLE and VDPA_CMD_DEV_DISABLE to make the config path more flexible. Actually, we've discussed such intermediate step in some early discussion. It looks to me VDUSE could be one of the users of this. Or I wonder whether we can switch to use anonymous inode(fd) for VDUSE then fetching it via an VDUSE_GET_DEVICE_FD ioctl? Yes, we can. Actually the current implementation in VDUSE is like this. But seems like this is still a intermediate step. The fd should be binded to a name or something else which need to be configured before. The name could be specified via the netlink. It looks to me the real issue is that until the device is connected with a userspace, it can't be used. So we also need to fail the enabling if it doesn't opened. Yes, that's true. So you mean we can firstly try to fetch the fd binded to a name/vduse_id via an VDUSE_GET_DEVICE_FD, then use the name/vduse_id as a attribute to create vdpa device? It looks fine to me. 
I probably do not understand this well. I tried reading patch [1] and a few things do not look correct, as below. Creating the vdpa device on the bus device and destroying the device from the workqueue seems unnecessary and racy. It seems the vduse driver needs this. It is something that should be done as part of the vdpa dev add command, instead of connecting two sides separately and ensuring race-free access to it. So VDUSE_DEV_START and VDUSE_DEV_STOP should possibly be avoided. Yes, we can avoid these two ioctls with the help of the management tool. $ vdpa dev add parentdev vduse_mgmtdev type net name foo2 When the above command is executed it creates the necessary vdpa device foo2 on the bus. When the user binds the foo2 device with the vduse driver, in the probe(), it creates the respective char device to access it from user space. I see. So vduse cannot work with any existing vdpa devices like ifc, mlx5 or netdevsim. It has its own implementation similar to fuse with its own backend of choice. More below. But the vduse driver is not a vdpa bus driver. It works like the vdpasim driver, but offloads the data plane and control plane to a user space process. In that case, to draw parallel lines: 1. netdevsim: (a) creates resources in kernel sw (b) datapath is simulated in kernel 2. ifc + mlx5 vdpa dev: (a) creates resources in hw (b) data path is in hw 3. vduse: (a) creates resources in userspace sw (b) data path is in user space; hence it creates data path resources for user space. So the char device is created and removed as a result of vdpa device creation. For example, $ vdpa dev add parentdev vduse_mgmtdev type net name foo2 The above command will create a char device for user space. A similar command for ifc/mlx5 would have created a similar channel for the rest of the config commands in hw. vduse channel = char device, eventfd etc. ifc/mlx5 hw channel = bar, irq, command interface etc. Netdevsim channel = sw direct calls. Does it make sense?
In my understanding, to make vdpa work, we need a backend (datapath resources) and a frontend (a vdpa device attached to a vdpa bus). In the above example, it looks like we use the command "vdpa dev add ..." to create a backend, so do we need another command to create a frontend? For a block device there is certainly some backend to process the IOs. Sometimes the backend needs to be set up first, before its front end is exposed. "vdpa dev add" is the front end command which connects to the backend (implicitly) for a network device. vhost->vdpa_block_device->backend_io_processor (usr,hw,kernel). And it needs a way to connect to the backend when explicitly specified at creation time. Something like, $ vdpa dev add parentdev vdpa_vduse type block name foo3 handle In the above example some vendor-device-specific unique handle is passed based on the backend setup in hardware/user space. In the 3 examples below, the vdpa block simulator is connecting to backend block
Re: [External] Re: [PATCH 0/7] Introduce vdpa management tool
On 2020/12/2 下午2:24, Parav Pandit wrote: From: Jason Wang Sent: Wednesday, December 2, 2020 11:21 AM On 2020/12/2 下午12:53, Parav Pandit wrote: From: Yongji Xie Sent: Wednesday, December 2, 2020 9:00 AM On Tue, Dec 1, 2020 at 11:59 PM Parav Pandit wrote: From: Yongji Xie Sent: Tuesday, December 1, 2020 7:49 PM On Tue, Dec 1, 2020 at 7:32 PM Parav Pandit wrote: From: Yongji Xie Sent: Tuesday, December 1, 2020 3:26 PM On Tue, Dec 1, 2020 at 2:25 PM Jason Wang wrote: On 2020/11/30 下午3:07, Yongji Xie wrote: Thanks for adding me, Jason! Now I'm working on a v2 patchset for VDUSE (vDPA Device in Userspace) [1]. This tool is very useful for the vduse device. So I'm considering integrating this into my v2 patchset. But there is one problem: In this tool, vdpa device config action and enable action are combined into one netlink msg: VDPA_CMD_DEV_NEW. But in vduse case, it needs to be splitted because a chardev should be created and opened by a userspace process before we enable the vdpa device (call vdpa_register_device()). So I'd like to know whether it's possible (or have some plans) to add two new netlink msgs something like: VDPA_CMD_DEV_ENABLE and VDPA_CMD_DEV_DISABLE to make the config path more flexible. Actually, we've discussed such intermediate step in some early discussion. It looks to me VDUSE could be one of the users of this. Or I wonder whether we can switch to use anonymous inode(fd) for VDUSE then fetching it via an VDUSE_GET_DEVICE_FD ioctl? Yes, we can. Actually the current implementation in VDUSE is like this. But seems like this is still a intermediate step. The fd should be binded to a name or something else which need to be configured before. The name could be specified via the netlink. It looks to me the real issue is that until the device is connected with a userspace, it can't be used. So we also need to fail the enabling if it doesn't opened. Yes, that's true. 
So you mean we can firstly try to fetch the fd binded to a name/vduse_id via an VDUSE_GET_DEVICE_FD, then use the name/vduse_id as a attribute to create vdpa device? It looks fine to me. I probably do not well understand. I tried reading patch [1] and few things do not look correct as below. Creating the vdpa device on the bus device and destroying the device from the workqueue seems unnecessary and racy. It seems vduse driver needs This is something should be done as part of the vdpa dev add command, instead of connecting two sides separately and ensuring race free access to it. So VDUSE_DEV_START and VDUSE_DEV_STOP should possibly be avoided. Yes, we can avoid these two ioctls with the help of the management tool. $ vdpa dev add parentdev vduse_mgmtdev type net name foo2 When above command is executed it creates necessary vdpa device foo2 on the bus. When user binds foo2 device with the vduse driver, in the probe(), it creates respective char device to access it from user space. I see. So vduse cannot work with any existing vdpa devices like ifc, mlx5 or netdevsim. It has its own implementation similar to fuse with its own backend of choice. More below. But vduse driver is not a vdpa bus driver. It works like vdpasim driver, but offloads the data plane and control plane to a user space process. In that case to draw parallel lines, 1. netdevsim: (a) create resources in kernel sw (b) datapath simulates in kernel 2. ifc + mlx5 vdpa dev: (a) creates resource in hw (b) data path is in hw 3. vduse: (a) creates resources in userspace sw (b) data path is in user space. hence creates data path resources for user space. So char device is created, removed as result of vdpa device creation. For example, $ vdpa dev add parentdev vduse_mgmtdev type net name foo2 Above command will create char device for user space. Similar command for ifc/mlx5 would have created similar channel for rest of the config commands in hw. vduse channel = char device, eventfd etc. 
ifc/mlx5 hw channel = bar, irq, command interface etc Netdev sim channel = sw direct calls Does it make sense? In my understanding, to make vdpa work, we need a backend (datapath resources) and a frontend (a vdpa device attached to a vdpa bus). In the above example, it looks like we use the command "vdpa dev add ..." to create a backend, so do we need another command to create a frontend? For block device there is certainly some backend to process the IOs. Sometimes backend to be setup first, before its front end is exposed. "vdpa dev add" is the front end command who connects to the backend (implicitly) for network device. vhost->vdpa_block_device->backend_io_processor (usr,hw,kernel). And it needs a way to connect to backend when explicitly specified during creation time. Something like, $ vdpa dev add parentdev vdpa_vduse type block name foo3 handle In above example some vendor device specific unique handle is passed based o
Re: [PATCH net-next] tun: fix ubuf refcount incorrectly on error path
On 2020/12/3 下午4:00, wangyunjian wrote: From: Yunjian Wang After setting the callback for ubuf_info of the skb, the callback (vhost_net_zerocopy_callback) will be called to decrease the refcount when freeing the skb. But when an exception occurs afterwards, the error handling in vhost handle_tx() will try to decrease the same refcount again. This is wrong; fix it by clearing ubuf_info when meeting errors. Fixes: 4477138fa0ae ("tun: properly test for IFF_UP") Fixes: 90e33d459407 ("tun: enable napi_gro_frags() for TUN/TAP driver") Signed-off-by: Yunjian Wang --- drivers/net/tun.c | 11 +++ 1 file changed, 11 insertions(+) diff --git a/drivers/net/tun.c b/drivers/net/tun.c index 2dc1988a8973..3614bb1b6d35 100644 --- a/drivers/net/tun.c +++ b/drivers/net/tun.c @@ -1861,6 +1861,12 @@ static ssize_t tun_get_user(struct tun_struct *tun, struct tun_file *tfile, if (unlikely(!(tun->dev->flags & IFF_UP))) { err = -EIO; rcu_read_unlock(); + if (zerocopy) { + skb_shinfo(skb)->destructor_arg = NULL; + skb_shinfo(skb)->tx_flags &= ~SKBTX_DEV_ZEROCOPY; + skb_shinfo(skb)->tx_flags &= ~SKBTX_SHARED_FRAG; + } + goto drop; } @@ -1874,6 +1880,11 @@ static ssize_t tun_get_user(struct tun_struct *tun, struct tun_file *tfile, if (unlikely(headlen > skb_headlen(skb))) { atomic_long_inc(&tun->dev->rx_dropped); + if (zerocopy) { + skb_shinfo(skb)->destructor_arg = NULL; + skb_shinfo(skb)->tx_flags &= ~SKBTX_DEV_ZEROCOPY; + skb_shinfo(skb)->tx_flags &= ~SKBTX_SHARED_FRAG; + } napi_free_frags(&tfile->napi); rcu_read_unlock(); mutex_unlock(&tfile->napi_mutex); It looks to me that we then miss the failure feedback. The issue comes from the inconsistent error handling in tun. I wonder whether we can simply do uarg->callback(uarg, false) if necessary on every failure path in tun_get_user(). Note that zerocopy has a lot of issues which make it unsuitable for production environments. Thanks
Re: [PATCH net-next] tun: fix ubuf refcount incorrectly on error path
On 2020/12/4 下午6:22, wangyunjian wrote: -Original Message- From: Jason Wang [mailto:jasow...@redhat.com] Sent: Friday, December 4, 2020 2:11 PM To: wangyunjian ; m...@redhat.com Cc: virtualizat...@lists.linux-foundation.org; netdev@vger.kernel.org; Lilijun (Jerry) ; xudingke Subject: Re: [PATCH net-next] tun: fix ubuf refcount incorrectly on error path On 2020/12/3 下午4:00, wangyunjian wrote: From: Yunjian Wang After setting callback for ubuf_info of skb, the callback (vhost_net_zerocopy_callback) will be called to decrease the refcount when freeing skb. But when an exception occurs afterwards, the error handling in vhost handle_tx() will try to decrease the same refcount again. This is wrong and fix this by clearing ubuf_info when meeting errors. Fixes: 4477138fa0ae ("tun: properly test for IFF_UP") Fixes: 90e33d459407 ("tun: enable napi_gro_frags() for TUN/TAP driver") Signed-off-by: Yunjian Wang --- drivers/net/tun.c | 11 +++ 1 file changed, 11 insertions(+) diff --git a/drivers/net/tun.c b/drivers/net/tun.c index 2dc1988a8973..3614bb1b6d35 100644 --- a/drivers/net/tun.c +++ b/drivers/net/tun.c @@ -1861,6 +1861,12 @@ static ssize_t tun_get_user(struct tun_struct *tun, struct tun_file *tfile, if (unlikely(!(tun->dev->flags & IFF_UP))) { err = -EIO; rcu_read_unlock(); + if (zerocopy) { + skb_shinfo(skb)->destructor_arg = NULL; + skb_shinfo(skb)->tx_flags &= ~SKBTX_DEV_ZEROCOPY; + skb_shinfo(skb)->tx_flags &= ~SKBTX_SHARED_FRAG; + } + goto drop; } @@ -1874,6 +1880,11 @@ static ssize_t tun_get_user(struct tun_struct *tun, struct tun_file *tfile, if (unlikely(headlen > skb_headlen(skb))) { atomic_long_inc(&tun->dev->rx_dropped); + if (zerocopy) { + skb_shinfo(skb)->destructor_arg = NULL; + skb_shinfo(skb)->tx_flags &= ~SKBTX_DEV_ZEROCOPY; + skb_shinfo(skb)->tx_flags &= ~SKBTX_SHARED_FRAG; + } napi_free_frags(&tfile->napi); rcu_read_unlock(); mutex_unlock(&tfile->napi_mutex); It looks to me then we miss the failure feedback. 
The issue comes from the inconsistent error handling in tun. I wonder whether we can simply do uarg->callback(uarg, false) if necessary on every failure path in tun_get_user(). How about this? --- drivers/net/tun.c | 29 ++--- 1 file changed, 18 insertions(+), 11 deletions(-) diff --git a/drivers/net/tun.c b/drivers/net/tun.c index 2dc1988a8973..36a8d8eacd7b 100644 --- a/drivers/net/tun.c +++ b/drivers/net/tun.c @@ -1637,6 +1637,19 @@ static struct sk_buff *tun_build_skb(struct tun_struct *tun, return NULL; } +/* copy ubuf_info for callback when skb has no error */ +static inline void tun_copy_ubuf_info(struct sk_buff *skb, bool zerocopy, void *msg_control) +{ + if (zerocopy) { + skb_shinfo(skb)->destructor_arg = msg_control; + skb_shinfo(skb)->tx_flags |= SKBTX_DEV_ZEROCOPY; + skb_shinfo(skb)->tx_flags |= SKBTX_SHARED_FRAG; + } else if (msg_control) { + struct ubuf_info *uarg = msg_control; + uarg->callback(uarg, false); + } +} + /* Get packet from user space buffer */ static ssize_t tun_get_user(struct tun_struct *tun, struct tun_file *tfile, void *msg_control, struct iov_iter *from, @@ -1812,16 +1825,6 @@ static ssize_t tun_get_user(struct tun_struct *tun, struct tun_file *tfile, break; } - /* copy skb_ubuf_info for callback when skb has no error */ - if (zerocopy) { - skb_shinfo(skb)->destructor_arg = msg_control; - skb_shinfo(skb)->tx_flags |= SKBTX_DEV_ZEROCOPY; - skb_shinfo(skb)->tx_flags |= SKBTX_SHARED_FRAG; - } else if (msg_control) { - struct ubuf_info *uarg = msg_control; - uarg->callback(uarg, false); - } - skb_reset_network_header(skb); skb_probe_transport_header(skb); skb_record_rx_queue(skb, tfile->queue_index); @@ -1830,6 +1833,7 @@ static ssize_t tun_get_user(struct tun_struct *tun, struct tun_file *tfile, struct bpf_prog *xdp_prog; int ret; + tun_copy_ubuf_info(skb, zerocopy, msg_control); If you intend to disable zerocopy for XDP (which I think makes sense), it's better to do this in another patch.
local_bh_disable(); rcu_read_lock(); xdp_prog = rcu_dereference(tun->xdp_prog); @@
Re: [PATCH] vhost scsi: fix error return code in vhost_scsi_set_endpoint()
On 2020/12/4 下午4:43, Zhang Changzhong wrote: Fix to return a negative error code from the error handling case instead of 0, as done elsewhere in this function. Fixes: 25b98b64e284 ("vhost scsi: alloc cmds per vq instead of session") Reported-by: Hulk Robot Signed-off-by: Zhang Changzhong --- drivers/vhost/scsi.c | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-) diff --git a/drivers/vhost/scsi.c b/drivers/vhost/scsi.c index 6ff8a5096..4ce9f00 100644 --- a/drivers/vhost/scsi.c +++ b/drivers/vhost/scsi.c @@ -1643,7 +1643,8 @@ vhost_scsi_set_endpoint(struct vhost_scsi *vs, if (!vhost_vq_is_setup(vq)) continue; - if (vhost_scsi_setup_vq_cmds(vq, vq->num)) + ret = vhost_scsi_setup_vq_cmds(vq, vq->num); + if (ret) goto destroy_vq_cmds; } Acked-by: Jason Wang
Re: [PATCH net-next] tun: fix ubuf refcount incorrectly on error path
On 2020/12/7 下午9:38, wangyunjian wrote: I think the newly added code is easy to miss this problem, so I want to copy ubuf_info until we're sure there's no errors. Thanks, Yunjian But isn't this actually a disabling of zerocopy? Thanks
Re: [PATCH net-next] tun: fix ubuf refcount incorrectly on error path
On 2020/12/8 上午10:32, Jason Wang wrote: On 2020/12/7 下午9:38, wangyunjian wrote: I think the newly added code is easy to miss this problem, so I want to copy ubuf_info until we're sure there's no errors. Thanks, Yunjian But isn't this actually a disabling of zerocopy? Thanks Sorry, I misread the patch. Please send a formal version, and let's move the discussion there. Thanks
Re: [PATCH net] vhost_vdpa: Return -EFUALT if copy_from_user() fails
On 2020/11/18 下午4:59, Michael S. Tsirkin wrote: On Wed, Nov 18, 2020 at 02:08:17PM +0800, Jason Wang wrote: On 2020/10/26 上午10:59, Jason Wang wrote: On 2020/10/23 下午11:34, Michael S. Tsirkin wrote: On Fri, Oct 23, 2020 at 03:08:53PM +0300, Dan Carpenter wrote: The copy_to/from_user() functions return the number of bytes which we weren't able to copy but the ioctl should return -EFAULT if they fail. Fixes: a127c5bbb6a8 ("vhost-vdpa: fix backend feature ioctls") Signed-off-by: Dan Carpenter Acked-by: Michael S. Tsirkin Needed for stable I guess. Agree. Acked-by: Jason Wang Hi Michael. I don't see this in your tree, please consider to merge. Thanks I do see it there: commit 7922460e33c81f41e0d2421417228b32e6fdbe94 Author: Dan Carpenter Date: Fri Oct 23 15:08:53 2020 +0300 vhost_vdpa: Return -EFAULT if copy_from_user() fails the reason you can't find it is probably because I fixed up a typo in the subject. I see that. Thanks
Re: netconsole deadlock with virtnet
On 2020/11/24 上午3:21, Jakub Kicinski wrote: On Mon, 23 Nov 2020 14:09:34 -0500 Steven Rostedt wrote: On Mon, 23 Nov 2020 10:52:52 -0800 Jakub Kicinski wrote: On Mon, 23 Nov 2020 09:31:28 -0500 Steven Rostedt wrote: On Mon, 23 Nov 2020 13:08:55 +0200 Leon Romanovsky wrote: [ 10.028024] Chain exists of: [ 10.028025] console_owner --> target_list_lock --> _xmit_ETHER#2 Note, the problem is that we have a location that grabs the xmit_lock while holding target_list_lock (and possibly console_owner). Well, it try_locks the xmit_lock. Does lockdep understand try-locks? (not that I condone the shenanigans that are going on here) Does it? virtnet_poll_tx() { __netif_tx_lock() { spin_lock(&txq->_xmit_lock); Umpf. Right. I was looking at virtnet_poll_cleantx() That looks like we can have: CPU0: lock(xmit_lock); CPU1: lock(console), lock(target_list_lock), __netif_tx_lock() -> lock(xmit_lock) [BLOCKED]; then CPU0: lock(console) [BLOCKED]. DEADLOCK. So where is the trylock here? Perhaps you need the trylock in virtnet_poll_tx()? That could work. Best if we used a normal lock if !!budget, and trylock when budget is 0. But maybe that's too hairy. If we use trylock, we probably lose (or delay) a tx notification that may have side effects on the stack. I'm assuming all this trickiness comes from virtqueue_get_buf() needing locking vs the TX path? It's pretty unusual for the completion path to need locking vs the xmit path. Two reasons for doing this: 1) For some historical reason, we try to free transmitted tx packets in xmit (see free_old_xmit_skbs() in start_xmit()); we can probably remove this if we remove the non-tx-interrupt mode. 2) The virtio core requires virtqueue_get_buf() to be synchronized with virtqueue_add(); we can probably solve this, but it requires some non-trivial refactoring in the virtio core. Btw, after a quick search, there are several other drivers that use the tx lock in the tx NAPI. Thanks
Re: [PATCH net-next 00/13] Add mlx5 subfunction support
On 2020/11/21 上午3:04, Samudrala, Sridhar wrote: On 11/20/2020 9:58 AM, Alexander Duyck wrote: On Thu, Nov 19, 2020 at 5:29 PM Jakub Kicinski wrote: On Wed, 18 Nov 2020 21:35:29 -0700 David Ahern wrote: On 11/18/20 7:14 PM, Jakub Kicinski wrote: On Tue, 17 Nov 2020 14:49:54 -0400 Jason Gunthorpe wrote: On Tue, Nov 17, 2020 at 09:11:20AM -0800, Jakub Kicinski wrote: Just to refresh all our memory, we discussed and settled on the flow in [2]; RFC [1] followed this discussion. vdpa tool of [3] can add one or more vdpa device(s) on top of already spawned PF, VF, SF device. Nack for the networking part of that. It'd basically be VMDq. What are you NAK'ing? Spawning multiple netdevs from one device by slicing up its queues. Why do you object to that? Slicing up h/w resources for virtual what ever has been common practice for a long time. My memory of the VMDq debate is hazy, let me rope in Alex into this. I believe the argument was that we should offload software constructs, not create HW-specific APIs which depend on HW availability and implementation. So the path we took was offloading macvlan. I think it somewhat depends on the type of interface we are talking about. What we were wanting to avoid was drivers spawning their own unique VMDq netdevs and each having a different way of doing it. The approach Intel went with was to use a MACVLAN offload to approach it. Although I would imagine many would argue the approach is somewhat dated and limiting since you cannot do many offloads on a MACVLAN interface. Yes. We talked about this at netdev 0x14 and the limitations of macvlan based offloads. https://netdevconf.info/0x14/session.html?talk-hardware-acceleration-of-container-networking-interfaces Subfunction seems to be a good model to expose VMDq VSI or SIOV ADI as a netdev for kernel containers. AF_XDP ZC in a container is one of the usecase this would address. Today we have to pass the entire PF/VF to a container to do AF_XDP. 
Looks like the current model is to create a subfunction of a specific type on the auxiliary bus, do some configuration to assign resources and then activate the subfunction. With the VDPA case I believe there is a set of predefined virtio devices that are being emulated and presented, so it isn't as if they are creating a totally new interface for this. vDPA doesn't have any limitation on how the device is created or implemented. It could be predefined or created dynamically. vDPA leaves all of those to the parent device with the help of a unified management API[1]. E.g. it could be a PCI device (PF or VF), a sub-function, or a software-emulated device. What I would be interested in seeing is if there are any other vendors that have reviewed this and signed off on this approach. For "this approach" do you mean vDPA subfunction? My understanding is that it's totally vendor specific; the vDPA subsystem doesn't want to be limited by a specific type of device. What we don't want to see is Nvidia/Mellanox do this one way, then Broadcom or Intel come along later and have yet another way of doing this. We need an interface and feature set that will work for everyone in terms of how this will look going forward. For the feature set, it would be hard to force vendors to implement a common set of features (we can have a recommended set of features) considering they can be negotiated. So the management interface is expected to implement features like cpu clusters in order to ensure migration compatibility, or qemu can assist for the missing feature with a performance loss. Thanks
Re: [PATCH net-next 00/13] Add mlx5 subfunction support
On 2020/11/24 下午3:01, Jason Wang wrote: On 2020/11/21 上午3:04, Samudrala, Sridhar wrote: On 11/20/2020 9:58 AM, Alexander Duyck wrote: On Thu, Nov 19, 2020 at 5:29 PM Jakub Kicinski wrote: On Wed, 18 Nov 2020 21:35:29 -0700 David Ahern wrote: On 11/18/20 7:14 PM, Jakub Kicinski wrote: On Tue, 17 Nov 2020 14:49:54 -0400 Jason Gunthorpe wrote: On Tue, Nov 17, 2020 at 09:11:20AM -0800, Jakub Kicinski wrote: Just to refresh all our memory, we discussed and settled on the flow in [2]; RFC [1] followed this discussion. vdpa tool of [3] can add one or more vdpa device(s) on top of already spawned PF, VF, SF device. Nack for the networking part of that. It'd basically be VMDq. What are you NAK'ing? Spawning multiple netdevs from one device by slicing up its queues. Why do you object to that? Slicing up h/w resources for virtual what ever has been common practice for a long time. My memory of the VMDq debate is hazy, let me rope in Alex into this. I believe the argument was that we should offload software constructs, not create HW-specific APIs which depend on HW availability and implementation. So the path we took was offloading macvlan. I think it somewhat depends on the type of interface we are talking about. What we were wanting to avoid was drivers spawning their own unique VMDq netdevs and each having a different way of doing it. The approach Intel went with was to use a MACVLAN offload to approach it. Although I would imagine many would argue the approach is somewhat dated and limiting since you cannot do many offloads on a MACVLAN interface. Yes. We talked about this at netdev 0x14 and the limitations of macvlan based offloads. https://netdevconf.info/0x14/session.html?talk-hardware-acceleration-of-container-networking-interfaces Subfunction seems to be a good model to expose VMDq VSI or SIOV ADI as a netdev for kernel containers. AF_XDP ZC in a container is one of the usecase this would address. Today we have to pass the entire PF/VF to a container to do AF_XDP. 
Looks like the current model is to create a subfunction of a specific type on auxiliary bus, do some configuration to assign resources and then activate the subfunction. With the VDPA case I believe there is a set of predefined virtio devices that are being emulated and presented so it isn't as if they are creating a totally new interface for this. vDPA doesn't have any limitation of how the devices is created or implemented. It could be predefined or created dynamically. vDPA leaves all of those to the parent device with the help of a unified management API[1]. E.g It could be a PCI device (PF or VF), sub-function or software emulated devices. Miss the link, https://www.spinics.net/lists/netdev/msg699374.html. Thanks What I would be interested in seeing is if there are any other vendors that have reviewed this and sign off on this approach. For "this approach" do you mean vDPA subfucntion? My understanding is that it's totally vendor specific, vDPA subsystem don't want to be limited by a specific type of device. What we don't want to see is Nivida/Mellanox do this one way, then Broadcom or Intel come along later and have yet another way of doing this. We need an interface and feature set that will work for everyone in terms of how this will look going forward. For feature set, it would be hard to force (we can have a recommendation set of features) vendors to implement a common set of features consider they can be negotiated. So the management interface is expected to implement features like cpu clusters in order to make sure the migration compatibility, or qemu can assist for the missing feature with performance lose. Thanks
Re: netconsole deadlock with virtnet
On 2020/11/24 4:01 PM, Leon Romanovsky wrote: On Tue, Nov 24, 2020 at 11:22:03AM +0800, Jason Wang wrote: On 2020/11/24 3:21 AM, Jakub Kicinski wrote: On Mon, 23 Nov 2020 14:09:34 -0500 Steven Rostedt wrote: On Mon, 23 Nov 2020 10:52:52 -0800 Jakub Kicinski wrote: On Mon, 23 Nov 2020 09:31:28 -0500 Steven Rostedt wrote: On Mon, 23 Nov 2020 13:08:55 +0200 Leon Romanovsky wrote:

[   10.028024] Chain exists of:
[   10.028025]   console_owner --> target_list_lock --> _xmit_ETHER#2

Note, the problem is that we have a location that grabs the xmit_lock while holding target_list_lock (and possibly console_owner). Well, it try-locks the xmit_lock. Does lockdep understand try-locks? (not that I condone the shenanigans that are going on here) Does it?

    virtnet_poll_tx() {
        __netif_tx_lock() {
            spin_lock(&txq->_xmit_lock);

Umpf. Right. I was looking at virtnet_poll_cleantx(). That looks like we can have:

    CPU0                          CPU1
    ----                          ----
    lock(xmit_lock)               lock(console)
                                  lock(target_list_lock)
                                  __netif_tx_lock()
                                      lock(xmit_lock); [BLOCKED]
    lock(console) [BLOCKED]

    DEADLOCK.

So where is the trylock here? Perhaps you need the trylock in virtnet_poll_tx()? That could work. Best if we used the normal lock if !!budget, and trylock when budget is 0. But maybe that's too hairy. If we use trylock, we probably lose (or delay) a tx notification, which may have side effects on the stack. I'm assuming all this trickiness comes from virtqueue_get_buf() needing locking vs the TX path? It's pretty unusual for the completion path to need locking vs the xmit path. Two reasons for doing this: 1) for some historical reason, we try to free transmitted tx packets in xmit (see free_old_xmit_skbs() in start_xmit()); we can probably remove this if we remove the non-tx-interrupt mode; 2) the virtio core requires virtqueue_get_buf() to be synchronized with virtqueue_add(); we can probably solve this, but it requires some non-trivial refactoring in the virtio core. So how will we solve our lockdep issues?
Thanks. It's not clear to me whether it's a virtio-net-specific issue. E.g. the above deadlock looks like a generic issue, so working around it in virtio-net may not help other drivers. Thanks. Btw, after a quick search, there are several other drivers that use the tx lock in the tx NAPI. Thanks
Re: netconsole deadlock with virtnet
On 2020/11/24 10:31 PM, Steven Rostedt wrote: On Tue, 24 Nov 2020 11:22:03 +0800 Jason Wang wrote: Btw, after a quick search, there are several other drivers that use the tx lock in the tx NAPI. tx NAPI is not the issue. The issue is that write_msg() (in netconsole.c) calls this polling logic with the target_list_lock held. But in the tx NAPI poll it takes the TX lock instead of using trylock. Are those other drivers called by netconsole? If not, then this is unique to virtio-net. I think the answer is yes, since netconsole is not disabled in their code. Thanks -- Steve
Re: netconsole deadlock with virtnet
On 2020/11/25 12:20 AM, Jakub Kicinski wrote: On Tue, 24 Nov 2020 11:22:03 +0800 Jason Wang wrote: Perhaps you need the trylock in virtnet_poll_tx()? That could work. Best if we used the normal lock if !!budget, and trylock when budget is 0. But maybe that's too hairy. If we use trylock, we probably lose (or delay) a tx notification, which may have side effects on the stack. That's why I said only trylock with budget == 0. Only netpoll calls with budget == 0, AFAIK. Oh right. So I think we can switch to trylock when budget is zero, and try to schedule another TX NAPI if the trylock fails. I'm assuming all this trickiness comes from virtqueue_get_buf() needing locking vs the TX path? It's pretty unusual for the completion path to need locking vs the xmit path. Two reasons for doing this: 1) for some historical reason, we try to free transmitted tx packets in xmit (see free_old_xmit_skbs() in start_xmit()); we can probably remove this if we remove the non-tx-interrupt mode; 2) the virtio core requires virtqueue_get_buf() to be synchronized with virtqueue_add(); we can probably solve this, but it requires some non-trivial refactoring in the virtio core. Btw, after a quick search, there are several other drivers that use the tx lock in the tx NAPI. Unless they do: netdev->priv_flags |= IFF_DISABLE_NETPOLL; they are all broken. Yes. Thanks
Re: [PATCH 0/7] Introduce vdpa management tool
On 2020/11/12 2:39 PM, Parav Pandit wrote: This patchset covers user requirements for managing existing vdpa devices, using a tool, and its internal design notes for kernel drivers.

Background and user requirements:
---------------------------------
(1) Currently a vdpa device is created by the driver when the driver is loaded. However, the user should have a choice of when to create (or not create) a vdpa device for the underlying parent device.

For example, an mlx5 PCI VF or subfunction device supports multiple classes of device such as netdev, vdpa, and rdma. However, it is not required to always create a vdpa device for such a device.

(2) In another use case, a device may support creating one or multiple vdpa devices of the same or different classes, such as net and block. Creating vdpa devices at driver load time further limits this use case.

(3) A user should be able to monitor and query vdpa queue-level or device-level statistics for a given vdpa device.

(4) A user should be able to query what classes of vdpa devices are supported by its parent device.

(5) A user should be able to view supported features and negotiated features of the vdpa device.

(6) A user should be able to create a vdpa device in a vendor-agnostic manner using a single tool.

Hence, it is required to have a tool through which the user can create one or more vdpa devices from a parent device, which addresses the above user requirements.

Example devices:

 +-----------+  +-----------+  +---------+  +---------+  +-----------+
 |vdpa dev 0 |  |vdpa dev 1 |  |rdma dev |  |netdev   |  |vdpa dev 3 |
 |type=net   |  |type=block |  |mlx5_0   |  |ens3f0   |  |type=net   |
 +-----+-----+  +-----+-----+  +----+----+  +----+----+  +-----+-----+
       |              |             |            |             |
       +------+-------+             |            +------+------+
              |                     |                   |
        +-----+-----+         +-----+-----+       +-----+-----+
        | mlx5      |         | mlx5      |       | mlx5      |
        | pci vf 2  |         | pci vf 4  |       | pci sf 8  |
        | 03:00:2   |         | 03:00.4   |       | mlx5_sf.8 |
        +-----+-----+         +-----+-----+       +-----+-----+
              |                     |                   |
              +----------+---------+---------+----------+
                         |
                   +-----+-----+
                   | mlx5      |
                   | pci pf 0  |
                   | 03:00.0   |
                   +-----------+

vdpa tool:
----------
vdpa tool is a tool to create and delete vdpa devices from a parent device. It is a tool that enables the user to query statistics, features, and maybe more attributes in the future.
vdpa tool command draft:

(a) List parent devices which support creating vdpa devices. It also shows which class types are supported by each parent device. In the below command example, three parent devices (besides vdpasim) support vdpa device creation. First is a PCI VF whose bdf is 03.00:2. Second is a PCI VF whose name is 03:00.4. Third is a PCI SF whose name is mlx5_core.sf.8.

$ vdpa parentdev list
vdpasim
  supported_classes net
pci/:03.00:3
  supported_classes net block
pci/:03.00:4
  supported_classes net block
auxiliary/mlx5_core.sf.8
  supported_classes net

(b) Now add a vdpa device of networking class and show the device.
$ vdpa dev add parentdev pci/:03.00:2 type net name foo0
$ vdpa dev show foo0
foo0: parentdev pci/:03.00:2 type network parentdev vdpasim vendor_id 0 max_vqs 2 max_vq_size 256

(c) Show features of a vdpa device
$ vdpa dev features show foo0
supported
  iommu platform
  version 1

(d) Dump vdpa device statistics
$ vdpa dev stats show foo0
kickdoorbells 10
wqes 100

(e) Now delete a vdpa device previously created.
$ vdpa dev del foo0

vdpa tool support in this patchset:
-----------------------------------
vdpa tool is created to create, delete, and query vdpa devices.

examples:
Show the vdpa parent devices that support creating and deleting vdpa devices.
$ vdpa parentdev show
vdpasim:
  supported_classes net

$ vdpa parentdev show -jp
{
    "show": {
        "vdpasim": {
            "supported_classes": { "net" }
        }
    }
}

Create a vdpa device of type networking named "foo2" from the parent device vdpasim:
$ vdpa dev add parentdev vdpasim type net name foo2

Show the newly created vdpa device by its name:
$ vdpa dev show foo2
foo2: type network parentdev vdpasim vendor_id 0 max_vqs 2 max_vq_size 256

$ vdpa dev show foo2 -jp
{
    "dev": {
        "foo2": {
            "type": "network",
            "parentdev": "vdpasim",
            "vendor_id": 0,
            "max_vqs": 2,
            "max_vq_size": 256
        }
    }
}

Delete the vdpa device after its use:
$ vdpa dev del foo2

vdpa tool support by kernel:
vdpa tool user interface will be supported by the existing vdpa kernel framework, i.e.
drivers/vdpa/vdpa.c. It services user commands through a netlink interface. Each parent device r
Re: [PATCH v4] vdpa: mlx5: fix vdpa/vhost dependencies
On 2020/11/29 5:39 AM, Randy Dunlap wrote: drivers/vdpa/mlx5/ uses vhost_iotlb*() interfaces, so select VHOST_IOTLB to make them be built. However, if VHOST_IOTLB is the only VHOST symbol that is set/enabled, the object file still won't be built because drivers/Makefile won't descend into drivers/vhost/ to build it, so make drivers/Makefile build the needed binary whenever VHOST_IOTLB is set, like it does for VHOST_RING.

Fixes these build errors:
ERROR: modpost: "vhost_iotlb_itree_next" [drivers/vdpa/mlx5/mlx5_vdpa.ko] undefined!
ERROR: modpost: "vhost_iotlb_itree_first" [drivers/vdpa/mlx5/mlx5_vdpa.ko] undefined!

Fixes: 29064bfdabd5 ("vdpa/mlx5: Add support library for mlx5 VDPA implementation")
Fixes: aff90770e54c ("vdpa/mlx5: Fix dependency on MLX5_CORE")
Reported-by: kernel test robot
Signed-off-by: Randy Dunlap
Cc: Eli Cohen
Cc: Parav Pandit
Cc: "Michael S. Tsirkin"
Cc: Jason Wang
Cc: virtualizat...@lists.linux-foundation.org
Cc: Saeed Mahameed
Cc: Leon Romanovsky
Cc: netdev@vger.kernel.org
---
v2: change from select to depends on VHOST (Saeed)
v3: change to depends on VHOST_IOTLB (Jason)
v4: use select VHOST_IOTLB (Michael); also add to drivers/Makefile

 drivers/Makefile     | 1 +
 drivers/vdpa/Kconfig | 1 +
 2 files changed, 2 insertions(+)

--- linux-next-20201127.orig/drivers/vdpa/Kconfig
+++ linux-next-20201127/drivers/vdpa/Kconfig
@@ -32,6 +32,7 @@ config IFCVF
 
 config MLX5_VDPA
 	bool
+	select VHOST_IOTLB
 	help
 	  Support library for Mellanox VDPA drivers. Provides code that is
 	  common for all types of VDPA drivers. The following drivers are planned:

--- linux-next-20201127.orig/drivers/Makefile
+++ linux-next-20201127/drivers/Makefile
@@ -143,6 +143,7 @@ obj-$(CONFIG_OF)		+= of/
 obj-$(CONFIG_SSB)		+= ssb/
 obj-$(CONFIG_BCMA)		+= bcma/
 obj-$(CONFIG_VHOST_RING)	+= vhost/
+obj-$(CONFIG_VHOST_IOTLB)	+= vhost/
 obj-$(CONFIG_VHOST)		+= vhost/
 obj-$(CONFIG_VLYNQ)		+= vlynq/
 obj-$(CONFIG_GREYBUS)		+= greybus/

Acked-by: Jason Wang

Thanks
Re: [External] Re: [PATCH 0/7] Introduce vdpa management tool
On 2020/11/27 1:52 PM, Yongji Xie wrote: On Fri, Nov 27, 2020 at 11:53 AM Jason Wang <jasow...@redhat.com> wrote: On 2020/11/12 2:39 PM, Parav Pandit wrote:
> [...]
Re: [PATCH netdev 1/2] virtio: add module option to turn off guest offloads
On 2020/11/12 下午4:11, Xuan Zhuo wrote: * VIRTIO_NET_F_GUEST_CSUM * VIRTIO_NET_F_GUEST_TSO4 * VIRTIO_NET_F_GUEST_TSO6 * VIRTIO_NET_F_GUEST_ECN * VIRTIO_NET_F_GUEST_UFO * VIRTIO_NET_F_MTU If these features are negotiated successfully, it may cause virtio-net to receive large packages, which will cause xdp to fail to load. And in many cases, it cannot be dynamically turned off, so add a module option to turn off these features. Actually we will disable those through control virtqueue. Or does it mean your hardware doesn't support control guest offloads? Module parameters may introduce a lot of burden for management and use-ability. Disabling guest offloads means you may suffer from low RX throughput. Thanks Signed-off-by: Xuan Zhuo --- drivers/net/virtio_net.c | 36 +++- 1 file changed, 35 insertions(+), 1 deletion(-) diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c index 21b7114..232a539 100644 --- a/drivers/net/virtio_net.c +++ b/drivers/net/virtio_net.c @@ -26,10 +26,11 @@ static int napi_weight = NAPI_POLL_WEIGHT; module_param(napi_weight, int, 0444); -static bool csum = true, gso = true, napi_tx = true; +static bool csum = true, gso = true, napi_tx, guest_offload = true; module_param(csum, bool, 0444); module_param(gso, bool, 0444); module_param(napi_tx, bool, 0644); +module_param(guest_offload, bool, 0644); /* FIXME: MTU in config. 
*/ #define GOOD_PACKET_LEN (ETH_HLEN + VLAN_HLEN + ETH_DATA_LEN) @@ -3245,6 +3246,18 @@ static __maybe_unused int virtnet_restore(struct virtio_device *vdev) VIRTIO_NET_F_MTU, VIRTIO_NET_F_CTRL_GUEST_OFFLOADS, \ VIRTIO_NET_F_SPEED_DUPLEX, VIRTIO_NET_F_STANDBY +#define VIRTNET_FEATURES_WITHOUT_GUEST_OFFLOADS \ + VIRTIO_NET_F_CSUM, \ + VIRTIO_NET_F_MAC, \ + VIRTIO_NET_F_HOST_TSO4, VIRTIO_NET_F_HOST_UFO, VIRTIO_NET_F_HOST_TSO6, \ + VIRTIO_NET_F_HOST_ECN, \ + VIRTIO_NET_F_MRG_RXBUF, VIRTIO_NET_F_STATUS, VIRTIO_NET_F_CTRL_VQ, \ + VIRTIO_NET_F_CTRL_RX, VIRTIO_NET_F_CTRL_VLAN, \ + VIRTIO_NET_F_GUEST_ANNOUNCE, VIRTIO_NET_F_MQ, \ + VIRTIO_NET_F_CTRL_MAC_ADDR, \ + VIRTIO_NET_F_CTRL_GUEST_OFFLOADS, \ + VIRTIO_NET_F_SPEED_DUPLEX, VIRTIO_NET_F_STANDBY + static unsigned int features[] = { VIRTNET_FEATURES, }; @@ -3255,6 +3268,16 @@ static __maybe_unused int virtnet_restore(struct virtio_device *vdev) VIRTIO_F_ANY_LAYOUT, }; +static unsigned int features_without_offloads[] = { + VIRTNET_FEATURES_WITHOUT_GUEST_OFFLOADS, +}; + +static unsigned int features_without_offloads_legacy[] = { + VIRTNET_FEATURES_WITHOUT_GUEST_OFFLOADS, + VIRTIO_NET_F_GSO, + VIRTIO_F_ANY_LAYOUT, +}; + static struct virtio_driver virtio_net_driver = { .feature_table = features, .feature_table_size = ARRAY_SIZE(features), @@ -3288,6 +3311,17 @@ static __init int virtio_net_driver_init(void) if (ret) goto err_dead; + if (!guest_offload) { + virtio_net_driver.feature_table = features_without_offloads; + virtio_net_driver.feature_table_size = + ARRAY_SIZE(features_without_offloads); + + virtio_net_driver.feature_table_legacy = + features_without_offloads_legacy; + virtio_net_driver.feature_table_size_legacy = + ARRAY_SIZE(features_without_offloads_legacy); + } + ret = register_virtio_driver(&virtio_net_driver); if (ret) goto err_virtio;
Re: [PATCH netdev 2/2] virtio, xdp: Allow xdp to load, even if there is not enough queue
On 2020/11/12 5:15 PM, Xuan Zhuo wrote: Since XDP_TX uses an independent queue to send data by default, and requires a queue for each CPU, this condition is often not met when the number of CPUs is relatively large, even though XDP_TX is not used in many cases. I hope that in this case XDP is allowed to load, with a warning message emitted. If the user then uses XDP_TX, another error is generated. This is not a perfect solution, but I still hope to solve some of the problems first. Signed-off-by: Xuan Zhuo This leads to a bad user experience. Let's do something like this: 1) when TX queues are sufficient, go as in the past; 2) when TX queues are not sufficient, use a tx lock to synchronize. Thanks
Re: [PATCH] vringh: fix vringh_iov_push_*() documentation
On 2020/11/17 上午12:16, Stefano Garzarella wrote: vringh_iov_push_*() functions don't have 'dst' parameter, but have the 'src' parameter. Replace 'dst' description with 'src' description. Signed-off-by: Stefano Garzarella Acked-by: Jason Wang --- drivers/vhost/vringh.c | 6 +++--- 1 file changed, 3 insertions(+), 3 deletions(-) diff --git a/drivers/vhost/vringh.c b/drivers/vhost/vringh.c index 8bd8b403f087..b7403ba8e7f7 100644 --- a/drivers/vhost/vringh.c +++ b/drivers/vhost/vringh.c @@ -730,7 +730,7 @@ EXPORT_SYMBOL(vringh_iov_pull_user); /** * vringh_iov_push_user - copy bytes into vring_iov. * @wiov: the wiov as passed to vringh_getdesc_user() (updated as we consume) - * @dst: the place to copy. + * @src: the place to copy from. * @len: the maximum length to copy. * * Returns the bytes copied <= len or a negative errno. @@ -976,7 +976,7 @@ EXPORT_SYMBOL(vringh_iov_pull_kern); /** * vringh_iov_push_kern - copy bytes into vring_iov. * @wiov: the wiov as passed to vringh_getdesc_kern() (updated as we consume) - * @dst: the place to copy. + * @src: the place to copy from. * @len: the maximum length to copy. * * Returns the bytes copied <= len or a negative errno. @@ -1333,7 +1333,7 @@ EXPORT_SYMBOL(vringh_iov_pull_iotlb); * vringh_iov_push_iotlb - copy bytes into vring_iov. * @vrh: the vring. * @wiov: the wiov as passed to vringh_getdesc_iotlb() (updated as we consume) - * @dst: the place to copy. + * @src: the place to copy from. * @len: the maximum length to copy. * * Returns the bytes copied <= len or a negative errno.
Re: [PATCH net] vhost_vdpa: Return -EFAULT if copy_from_user() fails
On 2020/10/26 10:59 AM, Jason Wang wrote: On 2020/10/23 11:34 PM, Michael S. Tsirkin wrote: On Fri, Oct 23, 2020 at 03:08:53PM +0300, Dan Carpenter wrote: The copy_to/from_user() functions return the number of bytes which we weren't able to copy, but the ioctl should return -EFAULT if they fail. Fixes: a127c5bbb6a8 ("vhost-vdpa: fix backend feature ioctls") Signed-off-by: Dan Carpenter Acked-by: Michael S. Tsirkin Needed for stable I guess. Agree. Acked-by: Jason Wang Hi Michael, I don't see this in your tree; please consider merging it. Thanks
Re: [PATCH v2] vhost/vsock: add IOTLB API support
On 2020/12/23 下午10:36, Stefano Garzarella wrote: This patch enables the IOTLB API support for vhost-vsock devices, allowing the userspace to emulate an IOMMU for the guest. These changes were made following vhost-net, in details this patch: - exposes VIRTIO_F_ACCESS_PLATFORM feature and inits the iotlb device if the feature is acked - implements VHOST_GET_BACKEND_FEATURES and VHOST_SET_BACKEND_FEATURES ioctls - calls vq_meta_prefetch() before vq processing to prefetch vq metadata address in IOTLB - provides .read_iter, .write_iter, and .poll callbacks for the chardev; they are used by the userspace to exchange IOTLB messages This patch was tested specifying "intel_iommu=strict" in the guest kernel command line. I used QEMU with a patch applied [1] to fix a simple issue (that patch was merged in QEMU v5.2.0): $ qemu -M q35,accel=kvm,kernel-irqchip=split \ -drive file=fedora.qcow2,format=qcow2,if=virtio \ -device intel-iommu,intremap=on,device-iotlb=on \ -device vhost-vsock-pci,guest-cid=3,iommu_platform=on,ats=on [1] https://lists.gnu.org/archive/html/qemu-devel/2020-10/msg09077.html Reviewed-by: Stefan Hajnoczi Signed-off-by: Stefano Garzarella Acked-by: Jason Wang --- The patch is the same of v1, but I re-tested it with: - QEMU v5.2.0-551-ga05f8ecd88 - Linux 5.9.15 (host) - Linux 5.9.15 and 5.10.0 (guest) Now, enabling 'ats' it works well, there are just a few simple changes. 
v1: https://www.spinics.net/lists/kernel/msg3716022.html v2: - updated commit message about QEMU version and string used to test - rebased on mst/vhost branch Thanks, Stefano --- drivers/vhost/vsock.c | 68 +-- 1 file changed, 65 insertions(+), 3 deletions(-) diff --git a/drivers/vhost/vsock.c b/drivers/vhost/vsock.c index a483cec31d5c..5e78fb719602 100644 --- a/drivers/vhost/vsock.c +++ b/drivers/vhost/vsock.c @@ -30,7 +30,12 @@ #define VHOST_VSOCK_PKT_WEIGHT 256 enum { - VHOST_VSOCK_FEATURES = VHOST_FEATURES, + VHOST_VSOCK_FEATURES = VHOST_FEATURES | + (1ULL << VIRTIO_F_ACCESS_PLATFORM) +}; + +enum { + VHOST_VSOCK_BACKEND_FEATURES = (1ULL << VHOST_BACKEND_F_IOTLB_MSG_V2) }; /* Used to track all the vhost_vsock instances on the system. */ @@ -94,6 +99,9 @@ vhost_transport_do_send_pkt(struct vhost_vsock *vsock, if (!vhost_vq_get_backend(vq)) goto out; + if (!vq_meta_prefetch(vq)) + goto out; + /* Avoid further vmexits, we're already processing the virtqueue */ vhost_disable_notify(&vsock->dev, vq); @@ -449,6 +457,9 @@ static void vhost_vsock_handle_tx_kick(struct vhost_work *work) if (!vhost_vq_get_backend(vq)) goto out; + if (!vq_meta_prefetch(vq)) + goto out; + vhost_disable_notify(&vsock->dev, vq); do { u32 len; @@ -766,8 +777,12 @@ static int vhost_vsock_set_features(struct vhost_vsock *vsock, u64 features) mutex_lock(&vsock->dev.mutex); if ((features & (1 << VHOST_F_LOG_ALL)) && !vhost_log_access_ok(&vsock->dev)) { - mutex_unlock(&vsock->dev.mutex); - return -EFAULT; + goto err; + } + + if ((features & (1ULL << VIRTIO_F_ACCESS_PLATFORM))) { + if (vhost_init_device_iotlb(&vsock->dev, true)) + goto err; } for (i = 0; i < ARRAY_SIZE(vsock->vqs); i++) { @@ -778,6 +793,10 @@ static int vhost_vsock_set_features(struct vhost_vsock *vsock, u64 features) } mutex_unlock(&vsock->dev.mutex); return 0; + +err: + mutex_unlock(&vsock->dev.mutex); + return -EFAULT; } static long vhost_vsock_dev_ioctl(struct file *f, unsigned int ioctl, @@ -811,6 +830,18 @@ static long 
vhost_vsock_dev_ioctl(struct file *f, unsigned int ioctl, if (copy_from_user(&features, argp, sizeof(features))) return -EFAULT; return vhost_vsock_set_features(vsock, features); + case VHOST_GET_BACKEND_FEATURES: + features = VHOST_VSOCK_BACKEND_FEATURES; + if (copy_to_user(argp, &features, sizeof(features))) + return -EFAULT; + return 0; + case VHOST_SET_BACKEND_FEATURES: + if (copy_from_user(&features, argp, sizeof(features))) + return -EFAULT; + if (features & ~VHOST_VSOCK_BACKEND_FEATURES) + return -EOPNOTSUPP; + vhost_set_backend_features(&vsock->dev, features); + return 0; default: mutex_lock(&vsock->dev.mutex); r = vhost_dev_ioctl(&vsock->dev, ioctl, argp);
Re: [RFC v2 00/13] Introduce VDUSE - vDPA Device in Userspace
On 2020/12/23 6:59 PM, Yongji Xie wrote: On Wed, Dec 23, 2020 at 2:38 PM Jason Wang wrote: On 2020/12/22 10:52 PM, Xie Yongji wrote: This series introduces a framework which can be used to implement vDPA devices in a userspace program. The work consists of two parts: control path forwarding and data path offloading. In the control path, the VDUSE driver makes use of a message mechanism to forward config operations from the vdpa bus driver to userspace. Userspace can use read()/write() to receive/reply to those control messages. In the data path, the core is mapping the dma buffer into the VDUSE daemon's address space, which can be implemented in different ways depending on the vdpa bus to which the vDPA device is attached. In the virtio-vdpa case, we implement an MMU-based on-chip IOMMU driver with a bounce-buffering mechanism to achieve that. Rethinking the bounce buffer stuff: I wonder, instead of using kernel pages with mmap(), how about just using userspace pages like vhost did? It means we need a worker to do the bouncing, but we don't need to care about annoying stuff like page reclaim. Now the I/O bouncing is done in the streaming DMA mapping routines, which can be called from interrupt context. If we put this into a kworker, that means we need to synchronize with a kworker in an interrupt context. I think it can't work. We just need to make sure the buffer is ready before the user tries to access it. But I admit it would be tricky (requiring a shadow virtqueue etc.), which is probably not a good idea. Thanks Thanks, Yongji
Re: [RFC v2 08/13] vdpa: Introduce process_iotlb_msg() in vdpa_config_ops
On 2020/12/23 7:06 PM, Yongji Xie wrote: On Wed, Dec 23, 2020 at 4:37 PM Jason Wang wrote: On 2020/12/22 10:52 PM, Xie Yongji wrote: This patch introduces a new method in the vdpa_config_ops to support processing the raw vhost memory mapping message in the vDPA device driver.

Signed-off-by: Xie Yongji
---
 drivers/vhost/vdpa.c | 5 -
 include/linux/vdpa.h | 7 +++
 2 files changed, 11 insertions(+), 1 deletion(-)

diff --git a/drivers/vhost/vdpa.c b/drivers/vhost/vdpa.c
index 448be7875b6d..ccbb391e38be 100644
--- a/drivers/vhost/vdpa.c
+++ b/drivers/vhost/vdpa.c
@@ -728,6 +728,9 @@ static int vhost_vdpa_process_iotlb_msg(struct vhost_dev *dev,
 	if (r)
 		return r;
 
+	if (ops->process_iotlb_msg)
+		return ops->process_iotlb_msg(vdpa, msg);
+
 	switch (msg->type) {
 	case VHOST_IOTLB_UPDATE:
 		r = vhost_vdpa_process_iotlb_update(v, msg);
@@ -770,7 +773,7 @@ static int vhost_vdpa_alloc_domain(struct vhost_vdpa *v)
 	int ret;
 
 	/* Device want to do DMA by itself */
-	if (ops->set_map || ops->dma_map)
+	if (ops->set_map || ops->dma_map || ops->process_iotlb_msg)
 		return 0;
 
 	bus = dma_dev->bus;
diff --git a/include/linux/vdpa.h b/include/linux/vdpa.h
index 656fe264234e..7bccedf22f4b 100644
--- a/include/linux/vdpa.h
+++ b/include/linux/vdpa.h
@@ -5,6 +5,7 @@
 #include
 #include
 #include
+#include
 #include
 #include
@@ -172,6 +173,10 @@ struct vdpa_iova_range {
 *				@vdev: vdpa device
 *				Returns the iova range supported by
 *				the device.
+ * @process_iotlb_msg:		Process vhost memory mapping message (optional)
+ *				Only used for VDUSE device now
+ *				@vdev: vdpa device
+ *				@msg: vhost memory mapping message
 * @set_map:			Set device memory mapping (optional)
 *				Needed for device that using device
 *				specific DMA translation (on-chip IOMMU)
@@ -240,6 +245,8 @@ struct vdpa_config_ops {
 	struct vdpa_iova_range (*get_iova_range)(struct vdpa_device *vdev);
 
 	/* DMA ops */
+	int (*process_iotlb_msg)(struct vdpa_device *vdev,
+				 struct vhost_iotlb_msg *msg);
 	int (*set_map)(struct vdpa_device *vdev, struct vhost_iotlb *iotlb);
 	int (*dma_map)(struct vdpa_device *vdev, u64 iova, u64 size,
 		       u64 pa, u32 perm);

Is there any reason that it can't be done via dma_map/dma_unmap or set_map? To get the shmfd, we need the vma rather than the physical address. And it's not necessary to pin the user pages in the VDUSE case. Right; actually, vhost-vDPA is planning to support a shared virtual address space. So let's try to reuse the existing config ops. How about just introducing an attribute to the vdpa device that tells the bus it can do shared virtual memory? Then, when the device is probed by vhost-vDPA, user pages won't be pinned, and we will do VA->VA mapping as the IOVA->PA mapping in the vhost IOTLB and the config ops. The vhost IOTLB needs to be extended to accept an opaque pointer to store the file. And the file is passed via the config ops as well. Thanks Thanks, Yongji