[PATCH net v1 2/2] virtio_net: Close queue pairs using helper function
Use the newly introduced helper function, which performs exactly the same steps when closing the queue pairs.

Signed-off-by: Feng Liu
Reviewed-by: William Tu
Reviewed-by: Parav Pandit
---
 drivers/net/virtio_net.c | 7 ++-----
 1 file changed, 2 insertions(+), 5 deletions(-)

diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
index fc6ee833a09f..5cd78e154d14 100644
--- a/drivers/net/virtio_net.c
+++ b/drivers/net/virtio_net.c
@@ -2319,11 +2319,8 @@ static int virtnet_close(struct net_device *dev)
 	/* Make sure refill_work doesn't re-enable napi! */
 	cancel_delayed_work_sync(&vi->refill);

-	for (i = 0; i < vi->max_queue_pairs; i++) {
-		virtnet_napi_tx_disable(&vi->sq[i].napi);
-		napi_disable(&vi->rq[i].napi);
-		xdp_rxq_info_unreg(&vi->rq[i].xdp_rxq);
-	}
+	for (i = 0; i < vi->max_queue_pairs; i++)
+		virtnet_disable_qp(vi, i);

 	return 0;
 }
--
2.37.1 (Apple Git-137.1)

___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization
[PATCH net v1 1/2] virtio_net: Fix error unwinding of XDP initialization
When initializing XDP in virtnet_open(), the xdp initialization of some rq may hit an error, causing the net device open to fail. However, the previous rqs have already initialized XDP and enabled NAPI, which is not the expected behavior. We need to roll back the previous rq initialization to avoid leaks in the error unwinding of the init code.

Fixes: 754b8a21a96d ("virtio_net: setup xdp_rxq_info")
Signed-off-by: Feng Liu
Reviewed-by: William Tu
Reviewed-by: Parav Pandit
---
 drivers/net/virtio_net.c | 24 +++++++++++++++++-------
 1 file changed, 19 insertions(+), 5 deletions(-)

diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
index 8d8038538fc4..fc6ee833a09f 100644
--- a/drivers/net/virtio_net.c
+++ b/drivers/net/virtio_net.c
@@ -1868,6 +1868,13 @@ static int virtnet_poll(struct napi_struct *napi, int budget)
 	return received;
 }

+static void virtnet_disable_qp(struct virtnet_info *vi, int qp_index)
+{
+	virtnet_napi_tx_disable(&vi->sq[qp_index].napi);
+	napi_disable(&vi->rq[qp_index].napi);
+	xdp_rxq_info_unreg(&vi->rq[qp_index].xdp_rxq);
+}
+
 static int virtnet_open(struct net_device *dev)
 {
 	struct virtnet_info *vi = netdev_priv(dev);
@@ -1883,20 +1890,27 @@ static int virtnet_open(struct net_device *dev)
 		err = xdp_rxq_info_reg(&vi->rq[i].xdp_rxq, dev, i,
 				       vi->rq[i].napi.napi_id);
 		if (err < 0)
-			return err;
+			goto err_xdp_info_reg;

 		err = xdp_rxq_info_reg_mem_model(&vi->rq[i].xdp_rxq,
 						 MEM_TYPE_PAGE_SHARED, NULL);
-		if (err < 0) {
-			xdp_rxq_info_unreg(&vi->rq[i].xdp_rxq);
-			return err;
-		}
+		if (err < 0)
+			goto err_xdp_reg_mem_model;

 		virtnet_napi_enable(vi->rq[i].vq, &vi->rq[i].napi);
 		virtnet_napi_tx_enable(vi, vi->sq[i].vq, &vi->sq[i].napi);
 	}

 	return 0;
+
+	/* error unwinding of xdp init */
+err_xdp_reg_mem_model:
+	xdp_rxq_info_unreg(&vi->rq[i].xdp_rxq);
+err_xdp_info_reg:
+	for (i = i - 1; i >= 0; i--)
+		virtnet_disable_qp(vi, i);
+
+	return err;
 }

 static int virtnet_poll_tx(struct napi_struct *napi, int budget)
--
2.37.1 (Apple Git-137.1)
[PATCH v7 00/14] vhost: multiple worker support
The following patches were built over Linux's tree. They allow us to support multiple vhost worker tasks per device.

The design is a modified version of Stefan's original idea, where userspace has the kernel create a worker and we pass back the pid. In this version, instead of passing the pid between user/kernel space we use a worker_id, which is just an integer managed by the vhost driver, and we allow userspace to create and free workers and then attach them to virtqueues at setup time.

All review comments from the past reviews should be handled. If I didn't reply to a review comment, I agreed with the comment and should have handled it in this posting. Let me know if I missed one.

Results:

fio jobs        1      2      4      8      12     16
-----------------------------------------------------
1 worker        160k   488k   -      -      -      -
worker per vq   160k   310k   620k   1300k  1836k  2326k

Notes:
0. This used a simple fio command:

   fio --filename=/dev/sdb --direct=1 --rw=randrw --bs=4k \
       --ioengine=libaio --iodepth=128 --numjobs=$JOBS_ABOVE

   and I used a VM with 16 vCPUs and 16 virtqueues.

1. The patches were tested with LIO's emulate_pr=0, which drops the LIO PR lock use. This was a bottleneck at around 12 vqs/jobs.

2. Because we have a hard limit of 1024 cmds, if num jobs * iodepth was greater than 1024, I would decrease iodepth. So 12 jobs used 85 cmds, and 16 used 64.

3. The perf issue above at 2 jobs is because when we only have 1 worker we execute more cmds per vhost_work due to all vqs funneling to one worker. I have 2 patches that fix this. One just submits from the worker thread instead of kicking off to a workqueue like the vhost block patches do. I'll post this after the worker support is merged. I'm also working on patches to add back batching during lio completion and do polling on the submission side.

We will still want the threading patches, because if we batch at the fio level plus use the vhost threading patches, we can see a big boost like below. So hopefully doing it in the kernel will allow apps to just work without having to be smart like fio.
fio using io_uring and batching with the iodepth_batch* settings:

fio jobs        1      2      4      8      12     16
-----------------------------------------------------
1 worker        494k   520k   -      -      -      -
worker per vq   496k   878k   1542k  2436k  2304k  2590k

V7:
- Add more comments about assumptions.
- Drop refcounting and just use an attachment_cnt variable, so there is no confusion about when a worker is freed.
- Use an opt-in model, because vsock has an issue where it can queue works before it's running and it doesn't even need multiple workers, so there is no point in changing the driver or core code.
- Add back the get worker ioctl.
- Broke up the last patches to make them easier to read.

V6:
- Rebase against the vhost_task patchset.
- Used xa instead of idr.

V5:
- Rebase against the user_worker patchset.
- Rebase against the flush patchset.
- Redo vhost-scsi tmf flush handling so it doesn't access vq->worker.

V4:
- fix vhost-sock VSOCK_VQ_RX use.
- name functions called directly by ioctl cmds to match the ioctl cmd.
- break up VHOST_SET_VRING_WORKER into new, free and attach cmds.
- document worker lifetime, and cgroup, namespace, mm, rlimit inheritance; make it clear we currently only support sharing within the device.
- add support to attach workers while IO is running.
- instead of passing a pid_t of the kernel thread, pass an int allocated by the vhost layer with an idr.

V3:
- fully convert vhost code to use vq based APIs instead of leaving it half per dev and half per vq.
- rebase against the kernel worker API.
- Drop delayed worker creation. We always create the default worker at VHOST_SET_OWNER time. Userspace can create and bind workers after that.

V2:
- change the loop where we take a refcount to the worker.
- replaced pid == -1 with a define.
- fixed tabbing/spacing coding style issues.
- use a hash instead of a list to look up workers.
- I dropped the patch that added an ioctl cmd to get a vq's worker's pid. I saw we might do a generic netlink interface instead.
[PATCH 14/14] vhost_scsi: add support for worker ioctls
This has vhost-scsi support the worker ioctls by calling the vhost_worker_ioctl helper.

With a single worker, the single thread becomes a bottleneck when trying to use 3 or more virtqueues like:

fio --filename=/dev/sdb --direct=1 --rw=randrw --bs=4k \
    --ioengine=libaio --iodepth=128 --numjobs=3

With the patches and doing a worker per vq, we can scale to at least 16 vCPUs/vqs (that's my system limit) with the same fio command above with numjobs=16:

fio --filename=/dev/sdb --direct=1 --rw=randrw --bs=4k \
    --ioengine=libaio --iodepth=64 --numjobs=16

which gives around 2326K IOPs.

Note that for testing I dropped the iodepth to 64 above because the vhost/virt layer supports only 1024 total commands per device. And the only tuning I did was set LIO's emulate_pr to 0 to avoid LIO's PR lock in the main IO path, which becomes an issue at around 12 jobs/virtqueues.

Signed-off-by: Mike Christie
---
 drivers/vhost/scsi.c | 8 ++++++++
 1 file changed, 8 insertions(+)

diff --git a/drivers/vhost/scsi.c b/drivers/vhost/scsi.c
index 2c3cf487cc71..c83f7f043470 100644
--- a/drivers/vhost/scsi.c
+++ b/drivers/vhost/scsi.c
@@ -1927,6 +1927,14 @@ vhost_scsi_ioctl(struct file *f,
 		if (copy_from_user(&features, featurep, sizeof features))
 			return -EFAULT;
 		return vhost_scsi_set_features(vs, features);
+	case VHOST_NEW_WORKER:
+	case VHOST_FREE_WORKER:
+	case VHOST_ATTACH_VRING_WORKER:
+	case VHOST_GET_VRING_WORKER:
+		mutex_lock(&vs->dev.mutex);
+		r = vhost_worker_ioctl(&vs->dev, ioctl, argp);
+		mutex_unlock(&vs->dev.mutex);
+		return r;
 	default:
 		mutex_lock(&vs->dev.mutex);
 		r = vhost_dev_ioctl(&vs->dev, ioctl, argp);
--
2.25.1
[PATCH 12/14] vhost: replace single worker pointer with xarray
The next patch allows userspace to create multiple workers per device, so this patch replaces the vhost_worker pointer with an xarray so we can store multiple workers and look them up.

Signed-off-by: Mike Christie
---
 drivers/vhost/vhost.c | 48 ++++++++++++++++++++++++++++++++-----------
 drivers/vhost/vhost.h |  3 ++-
 2 files changed, 39 insertions(+), 12 deletions(-)

diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
index 930b703eb6b6..4b0b82292379 100644
--- a/drivers/vhost/vhost.c
+++ b/drivers/vhost/vhost.c
@@ -271,7 +271,11 @@ EXPORT_SYMBOL_GPL(vhost_vq_work_queue);

 void vhost_dev_flush(struct vhost_dev *dev)
 {
-	vhost_work_flush_on(dev->worker);
+	struct vhost_worker *worker;
+	unsigned long i;
+
+	xa_for_each(&dev->worker_xa, i, worker)
+		vhost_work_flush_on(worker);
 }
 EXPORT_SYMBOL_GPL(vhost_dev_flush);

@@ -489,7 +493,6 @@ void vhost_dev_init(struct vhost_dev *dev,
 	dev->umem = NULL;
 	dev->iotlb = NULL;
 	dev->mm = NULL;
-	dev->worker = NULL;
 	dev->iov_limit = iov_limit;
 	dev->weight = weight;
 	dev->byte_weight = byte_weight;
@@ -499,7 +502,7 @@ void vhost_dev_init(struct vhost_dev *dev,
 	INIT_LIST_HEAD(&dev->read_list);
 	INIT_LIST_HEAD(&dev->pending_list);
 	spin_lock_init(&dev->iotlb_lock);
-
+	xa_init_flags(&dev->worker_xa, XA_FLAGS_ALLOC);

 	for (i = 0; i < dev->nvqs; ++i) {
 		vq = dev->vqs[i];
@@ -562,30 +565,46 @@ static void vhost_detach_mm(struct vhost_dev *dev)
 	dev->mm = NULL;
 }

-static void vhost_worker_free(struct vhost_dev *dev)
+static void vhost_worker_destroy(struct vhost_dev *dev,
+				 struct vhost_worker *worker)
 {
-	struct vhost_worker *worker = dev->worker;
-
 	if (!worker)
 		return;

-	dev->worker = NULL;
 	WARN_ON(!llist_empty(&worker->work_list));
+	xa_erase(&dev->worker_xa, worker->id);
 	vhost_task_stop(worker->vtsk);
 	kfree(worker);
 }

+static void vhost_workers_free(struct vhost_dev *dev)
+{
+	struct vhost_worker *worker;
+	unsigned long i;
+
+	if (!dev->use_worker)
+		return;
+	/*
+	 * Free the default worker we created and cleanup workers userspace
+	 * created but couldn't clean up (it forgot or crashed).
+	 */
+	xa_for_each(&dev->worker_xa, i, worker)
+		vhost_worker_destroy(dev, worker);
+	xa_destroy(&dev->worker_xa);
+}
+
 static struct vhost_worker *vhost_worker_create(struct vhost_dev *dev)
 {
 	struct vhost_worker *worker;
 	struct vhost_task *vtsk;
 	char name[TASK_COMM_LEN];
+	int ret;
+	u32 id;

 	worker = kzalloc(sizeof(*worker), GFP_KERNEL_ACCOUNT);
 	if (!worker)
 		return NULL;

-	dev->worker = worker;
 	worker->kcov_handle = kcov_common_handle();
 	init_llist_head(&worker->work_list);
 	snprintf(name, sizeof(name), "vhost-%d", current->pid);
@@ -596,11 +615,18 @@ static struct vhost_worker *vhost_worker_create(struct vhost_dev *dev)

 	worker->vtsk = vtsk;
 	vhost_task_start(vtsk);
+
+	ret = xa_alloc(&dev->worker_xa, &id, worker, xa_limit_32b, GFP_KERNEL);
+	if (ret < 0)
+		goto stop_worker;
+	worker->id = id;
+
 	return worker;

+stop_worker:
+	vhost_task_stop(vtsk);
 free_worker:
 	kfree(worker);
-	dev->worker = NULL;
 	return NULL;
 }

@@ -654,7 +680,7 @@ long vhost_dev_set_owner(struct vhost_dev *dev)
 	return 0;

 err_iovecs:
-	vhost_worker_free(dev);
+	vhost_workers_free(dev);
 err_worker:
 	vhost_detach_mm(dev);
 err_mm:
@@ -747,7 +773,7 @@ void vhost_dev_cleanup(struct vhost_dev *dev)
 	dev->iotlb = NULL;
 	vhost_clear_msg(dev);
 	wake_up_interruptible_poll(&dev->wait, EPOLLIN | EPOLLRDNORM);
-	vhost_worker_free(dev);
+	vhost_workers_free(dev);
 	vhost_detach_mm(dev);
 }
 EXPORT_SYMBOL_GPL(vhost_dev_cleanup);
diff --git a/drivers/vhost/vhost.h b/drivers/vhost/vhost.h
index 395707c680e5..2eea20d54134 100644
--- a/drivers/vhost/vhost.h
+++ b/drivers/vhost/vhost.h
@@ -30,6 +30,7 @@ struct vhost_worker {
 	struct vhost_task	*vtsk;
 	struct llist_head	work_list;
 	u64			kcov_handle;
+	u32			id;
 };

 /* Poll a file (eventfd or socket) */
@@ -156,7 +157,6 @@ struct vhost_dev {
 	struct vhost_virtqueue **vqs;
 	int nvqs;
 	struct eventfd_ctx *log_ctx;
-	struct vhost_worker *worker;
 	struct vhost_iotlb *umem;
 	struct vhost_iotlb *iotlb;
 	spinlock_t iotlb_lock;
@@ -166,6 +166,7 @@ struct vhost_dev {
 	int iov_limit;
 	int weight;
 	int byte_weight;
+	struct xarray worker_xa;
 	bool use_worker;
 	int (*msg_handler)(struct vhost_dev *dev, u32 asid,
 			   struct vhost_iotlb_msg *msg);
--
2.25.1
[PATCH 11/14] vhost: add helper to parse userspace vring state/file
The next patches add new vhost worker ioctls, which will need to get a vhost_virtqueue from a userspace struct that specifies the vq's index. This moves the code vhost_vring_ioctl uses to do this into a helper so it can be shared.

Signed-off-by: Mike Christie
---
 drivers/vhost/vhost.c | 29 ++++++++++++++++++++-------
 1 file changed, 22 insertions(+), 7 deletions(-)

diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
index 2e27d24634f5..930b703eb6b6 100644
--- a/drivers/vhost/vhost.c
+++ b/drivers/vhost/vhost.c
@@ -604,6 +604,27 @@ static struct vhost_worker *vhost_worker_create(struct vhost_dev *dev)
 	return NULL;
 }

+static int vhost_get_vq_from_user(struct vhost_dev *dev, void __user *argp,
+				  struct vhost_virtqueue **vq, u32 *id)
+{
+	u32 __user *idxp = argp;
+	u32 idx;
+	long r;
+
+	r = get_user(idx, idxp);
+	if (r < 0)
+		return r;
+
+	if (idx >= dev->nvqs)
+		return -ENOBUFS;
+
+	idx = array_index_nospec(idx, dev->nvqs);
+
+	*vq = dev->vqs[idx];
+	*id = idx;
+	return 0;
+}
+
 /* Caller should have device mutex */
 long vhost_dev_set_owner(struct vhost_dev *dev)
 {
@@ -1614,21 +1635,15 @@ long vhost_vring_ioctl(struct vhost_dev *d, unsigned int ioctl, void __user *argp)
 	struct file *eventfp, *filep = NULL;
 	bool pollstart = false, pollstop = false;
 	struct eventfd_ctx *ctx = NULL;
-	u32 __user *idxp = argp;
 	struct vhost_virtqueue *vq;
 	struct vhost_vring_state s;
 	struct vhost_vring_file f;
 	u32 idx;
 	long r;

-	r = get_user(idx, idxp);
+	r = vhost_get_vq_from_user(d, argp, &vq, &idx);
 	if (r < 0)
 		return r;
-	if (idx >= d->nvqs)
-		return -ENOBUFS;
-
-	idx = array_index_nospec(idx, d->nvqs);
-	vq = d->vqs[idx];

 	if (ioctl == VHOST_SET_VRING_NUM ||
 	    ioctl == VHOST_SET_VRING_ADDR) {
--
2.25.1
[PATCH 09/14] vhost: remove vhost_work_queue
vhost_work_queue is no longer used. Each driver is using the poll or vq based queueing, so remove vhost_work_queue.

Signed-off-by: Mike Christie
---
 drivers/vhost/vhost.c | 6 ------
 drivers/vhost/vhost.h | 1 -
 2 files changed, 7 deletions(-)

diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
index 331728f58121..adc2678a7b80 100644
--- a/drivers/vhost/vhost.c
+++ b/drivers/vhost/vhost.c
@@ -263,12 +263,6 @@ static void vhost_work_flush_on(struct vhost_worker *worker)
 	wait_for_completion(&flush.wait_event);
 }

-void vhost_work_queue(struct vhost_dev *dev, struct vhost_work *work)
-{
-	vhost_work_queue_on(dev->worker, work);
-}
-EXPORT_SYMBOL_GPL(vhost_work_queue);
-
 void vhost_vq_work_queue(struct vhost_virtqueue *vq, struct vhost_work *work)
 {
 	vhost_work_queue_on(vq->worker, work);
diff --git a/drivers/vhost/vhost.h b/drivers/vhost/vhost.h
index d9b8abbe3a26..ef55fae2517c 100644
--- a/drivers/vhost/vhost.h
+++ b/drivers/vhost/vhost.h
@@ -45,7 +45,6 @@ struct vhost_poll {
 };

 void vhost_work_init(struct vhost_work *work, vhost_work_fn_t fn);
-void vhost_work_queue(struct vhost_dev *dev, struct vhost_work *work);

 void vhost_poll_init(struct vhost_poll *poll, vhost_work_fn_t fn,
 		     __poll_t mask, struct vhost_dev *dev,
--
2.25.1
[PATCH 13/14] vhost: allow userspace to create workers
For vhost-scsi with 3 vqs or more and a workload that tries to use them in parallel like:

fio --filename=/dev/sdb --direct=1 --rw=randrw --bs=4k \
    --ioengine=libaio --iodepth=128 --numjobs=3

the single vhost worker thread will become a bottleneck and we are stuck at around 500K IOPs no matter how many jobs, virtqueues, and CPUs are used.

To better utilize virtqueues and available CPUs, this patch allows userspace to create workers and bind them to vqs. You can have N workers per dev and also share N workers with M vqs on that dev.

This patch adds the interface related code and the next patch will hook vhost-scsi into it. The patches do not try to hook net and vsock into the interface because:

1. multiple workers don't seem to help vsock. The problem is that with only 2 virtqueues we never fully use the existing worker when doing bidirectional tests. This seems to match vhost-scsi, where we don't see the worker as a bottleneck until 3 virtqueues are used.

2. net already has a way to use multiple workers.
Signed-off-by: Mike Christie
---
 drivers/vhost/vhost.c            | 145 ++++++++++++++++++++++++++++++-
 drivers/vhost/vhost.h            |   3 +
 include/uapi/linux/vhost.h       |  33 +++++++
 include/uapi/linux/vhost_types.h |  16 ++++
 4 files changed, 196 insertions(+), 1 deletion(-)

diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
index 4b0b82292379..e8f829f35814 100644
--- a/drivers/vhost/vhost.c
+++ b/drivers/vhost/vhost.c
@@ -630,6 +630,80 @@ static struct vhost_worker *vhost_worker_create(struct vhost_dev *dev)
 	return NULL;
 }

+/* Caller must have device mutex */
+static void __vhost_vq_attach_worker(struct vhost_virtqueue *vq,
+				     struct vhost_worker *worker)
+{
+	if (vq->worker)
+		vq->worker->attachment_cnt--;
+	worker->attachment_cnt++;
+	vq->worker = worker;
+}
+
+/**
+ * vhost_vq_attach_worker - set a virtqueue's worker from an ioctl command
+ * @vq: the virtqueue we will set the worker for
+ * @info: the worker userspace has requested us to use
+ *
+ * We only allow userspace to set a virtqueue's worker if it's not active and
+ * polling is not enabled. We also assume drivers supporting this will not be
+ * internally queueing works directly or via calls like vhost_dev_flush at
+ * this time.
+ *
+ * Caller must have device and virtqueue mutex.
+ */
+static int vhost_vq_attach_worker(struct vhost_virtqueue *vq,
+				  struct vhost_vring_worker *info)
+{
+	unsigned long index = info->worker_id;
+	struct vhost_dev *dev = vq->dev;
+	struct vhost_worker *worker;
+
+	if (!dev->use_worker)
+		return -EINVAL;
+
+	if (vhost_vq_get_backend(vq) || vq->kick)
+		return -EBUSY;
+
+	worker = xa_find(&dev->worker_xa, &index, UINT_MAX, XA_PRESENT);
+	if (!worker || worker->id != info->worker_id)
+		return -ENODEV;
+
+	__vhost_vq_attach_worker(vq, worker);
+	return 0;
+}
+
+/* Caller must have device mutex */
+static int vhost_new_worker(struct vhost_dev *dev,
+			    struct vhost_worker_state *info)
+{
+	struct vhost_worker *worker;
+
+	worker = vhost_worker_create(dev);
+	if (!worker)
+		return -ENOMEM;
+
+	info->worker_id = worker->id;
+	return 0;
+}
+
+static int vhost_free_worker(struct vhost_dev *dev,
+			     struct vhost_worker_state *info)
+{
+	unsigned long index = info->worker_id;
+	struct vhost_worker *worker;
+
+	worker = xa_find(&dev->worker_xa, &index, UINT_MAX, XA_PRESENT);
+	if (!worker || worker->id != info->worker_id)
+		return -ENODEV;
+
+	if (worker->attachment_cnt)
+		return -EBUSY;
+
+	vhost_worker_destroy(dev, worker);
+	return 0;
+}
+
 static int vhost_get_vq_from_user(struct vhost_dev *dev, void __user *argp,
 				  struct vhost_virtqueue **vq, u32 *id)
 {
@@ -651,6 +725,75 @@ static int vhost_get_vq_from_user(struct vhost_dev *dev, void __user *argp,
 	return 0;
 }

+/* Caller must have device mutex */
+long vhost_worker_ioctl(struct vhost_dev *dev, unsigned int ioctl,
+			void __user *argp)
+{
+	struct vhost_vring_worker ring_worker;
+	struct vhost_worker_state state;
+	struct vhost_virtqueue *vq;
+	long ret;
+	u32 idx;
+
+	if (!dev->use_worker)
+		return -EINVAL;
+
+	if (!vhost_dev_has_owner(dev))
+		return -EINVAL;
+
+	switch (ioctl) {
+	/* dev worker ioctls */
+	case VHOST_NEW_WORKER:
+		ret = vhost_new_worker(dev, &state);
+		if (!ret && copy_to_user(argp, &state, sizeof(state)))
+			ret = -EFAULT;
+		return ret;
+	case VHOST_FREE_WORKER:
+		if (copy_from_user(&state, argp, sizeof(state)))
+			return -EFAULT;
+
[PATCH 08/14] vhost_scsi: convert to vhost_vq_work_queue
Convert from vhost_work_queue to vhost_vq_work_queue.

Signed-off-by: Mike Christie
---
 drivers/vhost/scsi.c | 18 +++++++++---------
 1 file changed, 9 insertions(+), 9 deletions(-)

diff --git a/drivers/vhost/scsi.c b/drivers/vhost/scsi.c
index a77c53bb035a..1668009bd489 100644
--- a/drivers/vhost/scsi.c
+++ b/drivers/vhost/scsi.c
@@ -353,8 +353,9 @@ static void vhost_scsi_release_cmd(struct se_cmd *se_cmd)
 	if (se_cmd->se_cmd_flags & SCF_SCSI_TMR_CDB) {
 		struct vhost_scsi_tmf *tmf = container_of(se_cmd,
 					struct vhost_scsi_tmf, se_cmd);
+		struct vhost_virtqueue *vq = &tmf->svq->vq;

-		vhost_work_queue(&tmf->vhost->dev, &tmf->vwork);
+		vhost_vq_work_queue(vq, &tmf->vwork);
 	} else {
 		struct vhost_scsi_cmd *cmd = container_of(se_cmd,
 					struct vhost_scsi_cmd, tvc_se_cmd);
@@ -1332,11 +1333,9 @@ static void vhost_scsi_ctl_handle_kick(struct vhost_work *work)
 }

 static void
-vhost_scsi_send_evt(struct vhost_scsi *vs,
-		    struct vhost_scsi_tpg *tpg,
-		    struct se_lun *lun,
-		    u32 event,
-		    u32 reason)
+vhost_scsi_send_evt(struct vhost_scsi *vs, struct vhost_virtqueue *vq,
+		    struct vhost_scsi_tpg *tpg, struct se_lun *lun,
+		    u32 event, u32 reason)
 {
 	struct vhost_scsi_evt *evt;

@@ -1358,7 +1357,7 @@ vhost_scsi_send_evt(struct vhost_scsi *vs,
 	}

 	llist_add(&evt->list, &vs->vs_event_list);
-	vhost_work_queue(&vs->dev, &vs->vs_event_work);
+	vhost_vq_work_queue(vq, &vs->vs_event_work);
 }

 static void vhost_scsi_evt_handle_kick(struct vhost_work *work)
@@ -1372,7 +1371,8 @@ static void vhost_scsi_evt_handle_kick(struct vhost_work *work)
 		goto out;

 	if (vs->vs_events_missed)
-		vhost_scsi_send_evt(vs, NULL, NULL, VIRTIO_SCSI_T_NO_EVENT, 0);
+		vhost_scsi_send_evt(vs, vq, NULL, NULL, VIRTIO_SCSI_T_NO_EVENT,
+				    0);
 out:
 	mutex_unlock(&vq->mutex);
 }
@@ -1991,7 +1991,7 @@ vhost_scsi_do_plug(struct vhost_scsi_tpg *tpg,
 		goto unlock;

 	if (vhost_has_feature(vq, VIRTIO_SCSI_F_HOTPLUG))
-		vhost_scsi_send_evt(vs, tpg, lun,
+		vhost_scsi_send_evt(vs, vq, tpg, lun,
 				    VIRTIO_SCSI_T_TRANSPORT_RESET, reason);
 unlock:
 	mutex_unlock(&vq->mutex);
--
2.25.1
[PATCH 10/14] vhost_scsi: flush IO vqs then send TMF rsp
With one worker we will always send the scsi cmd responses and then send the TMF rsp, because LIO will always complete the scsi cmds first and then call into us to send the TMF response.

With multiple workers, the IO vq workers could be running while the TMF/ctl vq worker is, so this patch has us do a flush before completing the TMF to make sure the cmds are completed when the TMF's work is later queued and run.

Signed-off-by: Mike Christie
---
 drivers/vhost/scsi.c  | 21 ++++++++++++++++++---
 drivers/vhost/vhost.c |  6 ++++++
 drivers/vhost/vhost.h |  1 +
 3 files changed, 25 insertions(+), 3 deletions(-)

diff --git a/drivers/vhost/scsi.c b/drivers/vhost/scsi.c
index 1668009bd489..2c3cf487cc71 100644
--- a/drivers/vhost/scsi.c
+++ b/drivers/vhost/scsi.c
@@ -1133,12 +1133,27 @@ static void vhost_scsi_tmf_resp_work(struct vhost_work *work)
 {
 	struct vhost_scsi_tmf *tmf = container_of(work, struct vhost_scsi_tmf,
 						  vwork);
-	int resp_code;
+	struct vhost_virtqueue *ctl_vq, *vq;
+	int resp_code, i;
+
+	if (tmf->scsi_resp == TMR_FUNCTION_COMPLETE) {
+		/*
+		 * Flush IO vqs that don't share a worker with the ctl to make
+		 * sure they have sent their responses before us.
+		 */
+		ctl_vq = &tmf->vhost->vqs[VHOST_SCSI_VQ_CTL].vq;
+		for (i = VHOST_SCSI_VQ_IO; i < tmf->vhost->dev.nvqs; i++) {
+			vq = &tmf->vhost->vqs[i].vq;
+
+			if (vhost_vq_is_setup(vq) &&
+			    vq->worker != ctl_vq->worker)
+				vhost_vq_flush(vq);
+		}

-	if (tmf->scsi_resp == TMR_FUNCTION_COMPLETE)
 		resp_code = VIRTIO_SCSI_S_FUNCTION_SUCCEEDED;
-	else
+	} else {
 		resp_code = VIRTIO_SCSI_S_FUNCTION_REJECTED;
+	}

 	vhost_scsi_send_tmf_resp(tmf->vhost, &tmf->svq->vq, tmf->in_iovs,
 				 tmf->vq_desc, &tmf->resp_iov, resp_code);
diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
index adc2678a7b80..2e27d24634f5 100644
--- a/drivers/vhost/vhost.c
+++ b/drivers/vhost/vhost.c
@@ -275,6 +275,12 @@ void vhost_dev_flush(struct vhost_dev *dev)
 }
 EXPORT_SYMBOL_GPL(vhost_dev_flush);

+void vhost_vq_flush(struct vhost_virtqueue *vq)
+{
+	vhost_work_flush_on(vq->worker);
+}
+EXPORT_SYMBOL_GPL(vhost_vq_flush);
+
 /* A lockless hint for busy polling code to exit the loop */
 bool vhost_vq_has_work(struct vhost_virtqueue *vq)
 {
diff --git a/drivers/vhost/vhost.h b/drivers/vhost/vhost.h
index ef55fae2517c..395707c680e5 100644
--- a/drivers/vhost/vhost.h
+++ b/drivers/vhost/vhost.h
@@ -53,6 +53,7 @@ int vhost_poll_start(struct vhost_poll *poll, struct file *file);
 void vhost_poll_stop(struct vhost_poll *poll);
 void vhost_poll_queue(struct vhost_poll *poll);
 void vhost_dev_flush(struct vhost_dev *dev);
+void vhost_vq_flush(struct vhost_virtqueue *vq);

 struct vhost_log {
 	u64 addr;
--
2.25.1

___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization
[PATCH 06/14] vhost_sock: convert to vhost_vq_work_queue
Convert from vhost_work_queue to vhost_vq_work_queue.

Signed-off-by: Mike Christie
---
 drivers/vhost/vsock.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/drivers/vhost/vsock.c b/drivers/vhost/vsock.c
index 6578db78f0ae..817d377a3f36 100644
--- a/drivers/vhost/vsock.c
+++ b/drivers/vhost/vsock.c
@@ -285,7 +285,7 @@ vhost_transport_send_pkt(struct sk_buff *skb)
 		atomic_inc(&vsock->queued_replies);

 	virtio_vsock_skb_queue_tail(&vsock->send_pkt_queue, skb);
-	vhost_work_queue(&vsock->dev, &vsock->send_pkt_work);
+	vhost_vq_work_queue(&vsock->vqs[VSOCK_VQ_RX], &vsock->send_pkt_work);

 	rcu_read_unlock();
 	return len;
@@ -583,7 +583,7 @@ static int vhost_vsock_start(struct vhost_vsock *vsock)
 	/* Some packets may have been queued before the device was started,
 	 * let's kick the send worker to send them.
 	 */
-	vhost_work_queue(&vsock->dev, &vsock->send_pkt_work);
+	vhost_vq_work_queue(&vsock->vqs[VSOCK_VQ_RX], &vsock->send_pkt_work);

 	mutex_unlock(&vsock->dev.mutex);

 	return 0;
--
2.25.1
[PATCH 05/14] vhost: convert poll work to be vq based
This has the drivers pass in their poll to vq mapping and then converts the core poll code to use the vq based helpers. In the next patches we will allow vqs to be handled by different workers, so to allow drivers to execute operations like queue, stop, flush, etc on specific polls/vqs we need to know the mappings.

Signed-off-by: Mike Christie
---
 drivers/vhost/net.c   | 6 ++++--
 drivers/vhost/vhost.c | 8 +++++---
 drivers/vhost/vhost.h | 4 +++-
 3 files changed, 12 insertions(+), 6 deletions(-)

diff --git a/drivers/vhost/net.c b/drivers/vhost/net.c
index 8ed63651b9eb..4a9b757071a2 100644
--- a/drivers/vhost/net.c
+++ b/drivers/vhost/net.c
@@ -1342,8 +1342,10 @@ static int vhost_net_open(struct inode *inode, struct file *f)
 		       VHOST_NET_PKT_WEIGHT, VHOST_NET_WEIGHT, true,
 		       NULL);

-	vhost_poll_init(n->poll + VHOST_NET_VQ_TX, handle_tx_net, EPOLLOUT, dev);
-	vhost_poll_init(n->poll + VHOST_NET_VQ_RX, handle_rx_net, EPOLLIN, dev);
+	vhost_poll_init(n->poll + VHOST_NET_VQ_TX, handle_tx_net, EPOLLOUT, dev,
+			vqs[VHOST_NET_VQ_TX]);
+	vhost_poll_init(n->poll + VHOST_NET_VQ_RX, handle_rx_net, EPOLLIN, dev,
+			vqs[VHOST_NET_VQ_RX]);

 	f->private_data = n;
 	n->page_frag.page = NULL;
diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
index 957f33f9ad25..331728f58121 100644
--- a/drivers/vhost/vhost.c
+++ b/drivers/vhost/vhost.c
@@ -187,13 +187,15 @@ EXPORT_SYMBOL_GPL(vhost_work_init);

 /* Init poll structure */
 void vhost_poll_init(struct vhost_poll *poll, vhost_work_fn_t fn,
-		     __poll_t mask, struct vhost_dev *dev)
+		     __poll_t mask, struct vhost_dev *dev,
+		     struct vhost_virtqueue *vq)
 {
 	init_waitqueue_func_entry(&poll->wait, vhost_poll_wakeup);
 	init_poll_funcptr(&poll->table, vhost_poll_func);
 	poll->mask = mask;
 	poll->dev = dev;
 	poll->wqh = NULL;
+	poll->vq = vq;

 	vhost_work_init(&poll->work, fn);
 }
@@ -288,7 +290,7 @@ EXPORT_SYMBOL_GPL(vhost_vq_has_work);

 void vhost_poll_queue(struct vhost_poll *poll)
 {
-	vhost_work_queue(poll->dev, &poll->work);
+	vhost_vq_work_queue(poll->vq, &poll->work);
 }
 EXPORT_SYMBOL_GPL(vhost_poll_queue);

@@ -510,7 +512,7 @@ void vhost_dev_init(struct vhost_dev *dev,
 		vhost_vq_reset(dev, vq);
 		if (vq->handle_kick)
 			vhost_poll_init(&vq->poll, vq->handle_kick,
-					EPOLLIN, dev);
+					EPOLLIN, dev, vq);
 	}
 }
 EXPORT_SYMBOL_GPL(vhost_dev_init);
diff --git a/drivers/vhost/vhost.h b/drivers/vhost/vhost.h
index b64ee4ef387d..d9b8abbe3a26 100644
--- a/drivers/vhost/vhost.h
+++ b/drivers/vhost/vhost.h
@@ -41,13 +41,15 @@ struct vhost_poll {
 	struct vhost_work	work;
 	__poll_t		mask;
 	struct vhost_dev	*dev;
+	struct vhost_virtqueue	*vq;
 };

 void vhost_work_init(struct vhost_work *work, vhost_work_fn_t fn);
 void vhost_work_queue(struct vhost_dev *dev, struct vhost_work *work);

 void vhost_poll_init(struct vhost_poll *poll, vhost_work_fn_t fn,
-		     __poll_t mask, struct vhost_dev *dev);
+		     __poll_t mask, struct vhost_dev *dev,
+		     struct vhost_virtqueue *vq);
 int vhost_poll_start(struct vhost_poll *poll, struct file *file);
 void vhost_poll_stop(struct vhost_poll *poll);
 void vhost_poll_queue(struct vhost_poll *poll);
--
2.25.1
[PATCH 07/14] vhost_scsi: make SCSI cmd completion per vq
This patch separates the SCSI cmd completion code paths so we can complete cmds based on their vq instead of having all cmds complete on the same worker/CPU. This will be useful with the next patches that allow us to create multiple worker threads and bind them to different vqs, so we can have completions running on different threads/CPUs.

Signed-off-by: Mike Christie
Reviewed-by: Stefan Hajnoczi
---
 drivers/vhost/scsi.c | 56 ++++++++++++++++++++------------------------
 1 file changed, 26 insertions(+), 30 deletions(-)

diff --git a/drivers/vhost/scsi.c b/drivers/vhost/scsi.c
index bb10fa4bb4f6..a77c53bb035a 100644
--- a/drivers/vhost/scsi.c
+++ b/drivers/vhost/scsi.c
@@ -167,6 +167,7 @@ MODULE_PARM_DESC(max_io_vqs, "Set the max number of IO virtqueues a vhost scsi d
 struct vhost_scsi_virtqueue {
 	struct vhost_virtqueue vq;
+	struct vhost_scsi *vs;
 	/*
 	 * Reference counting for inflight reqs, used for flush operation. At
 	 * each time, one reference tracks new commands submitted, while we
@@ -181,6 +182,9 @@ struct vhost_scsi_virtqueue {
 	struct vhost_scsi_cmd *scsi_cmds;
 	struct sbitmap scsi_tags;
 	int max_cmds;
+
+	struct vhost_work completion_work;
+	struct llist_head completion_list;
 };
 
 struct vhost_scsi {
@@ -190,12 +194,8 @@ struct vhost_scsi {
 	struct vhost_dev dev;
 	struct vhost_scsi_virtqueue *vqs;
-	unsigned long *compl_bitmap;
 	struct vhost_scsi_inflight **old_inflight;
 
-	struct vhost_work vs_completion_work; /* cmd completion work item */
-	struct llist_head vs_completion_list; /* cmd completion queue */
-
 	struct vhost_work vs_event_work; /* evt injection work item */
 	struct llist_head vs_event_list; /* evt injection queue */
@@ -358,10 +358,11 @@ static void vhost_scsi_release_cmd(struct se_cmd *se_cmd)
 	} else {
 		struct vhost_scsi_cmd *cmd = container_of(se_cmd,
 					struct vhost_scsi_cmd, tvc_se_cmd);
-		struct vhost_scsi *vs = cmd->tvc_vhost;
+		struct vhost_scsi_virtqueue *svq = container_of(cmd->tvc_vq,
+					struct vhost_scsi_virtqueue, vq);
 
-		llist_add(&cmd->tvc_completion_list, &vs->vs_completion_list);
-		vhost_work_queue(&vs->dev, &vs->vs_completion_work);
+		llist_add(&cmd->tvc_completion_list, &svq->completion_list);
+		vhost_vq_work_queue(&svq->vq, &svq->completion_work);
 	}
 }
 
@@ -509,17 +510,17 @@ static void vhost_scsi_evt_work(struct vhost_work *work)
  */
 static void vhost_scsi_complete_cmd_work(struct vhost_work *work)
 {
-	struct vhost_scsi *vs = container_of(work, struct vhost_scsi,
-					vs_completion_work);
+	struct vhost_scsi_virtqueue *svq = container_of(work,
+				struct vhost_scsi_virtqueue, completion_work);
 	struct virtio_scsi_cmd_resp v_rsp;
 	struct vhost_scsi_cmd *cmd, *t;
 	struct llist_node *llnode;
 	struct se_cmd *se_cmd;
 	struct iov_iter iov_iter;
-	int ret, vq;
+	bool signal = false;
+	int ret;
 
-	bitmap_zero(vs->compl_bitmap, vs->dev.nvqs);
-	llnode = llist_del_all(&vs->vs_completion_list);
+	llnode = llist_del_all(&svq->completion_list);
 	llist_for_each_entry_safe(cmd, t, llnode, tvc_completion_list) {
 		se_cmd = &cmd->tvc_se_cmd;
@@ -539,21 +540,17 @@ static void vhost_scsi_complete_cmd_work(struct vhost_work *work)
 				      cmd->tvc_in_iovs, sizeof(v_rsp));
 		ret = copy_to_iter(&v_rsp, sizeof(v_rsp), &iov_iter);
 		if (likely(ret == sizeof(v_rsp))) {
-			struct vhost_scsi_virtqueue *q;
+			signal = true;
+
 			vhost_add_used(cmd->tvc_vq, cmd->tvc_vq_desc, 0);
-			q = container_of(cmd->tvc_vq, struct vhost_scsi_virtqueue, vq);
-			vq = q - vs->vqs;
-			__set_bit(vq, vs->compl_bitmap);
 		} else
 			pr_err("Faulted on virtio_scsi_cmd_resp\n");
 
 		vhost_scsi_release_cmd_res(se_cmd);
 	}
 
-	vq = -1;
-	while ((vq = find_next_bit(vs->compl_bitmap, vs->dev.nvqs, vq + 1))
-		< vs->dev.nvqs)
-		vhost_signal(&vs->dev, &vs->vqs[vq].vq);
+	if (signal)
+		vhost_signal(&svq->vs->dev, &svq->vq);
 }
 
 static struct vhost_scsi_cmd *
@@ -1770,6 +1767,7 @@ static int vhost_scsi_set_features(struct vhost_scsi *vs, u64 features)
 static int vhost_scsi_open(struct inode *inode, struct file *f)
 {
+	struct vhost_scsi_virtqueue *svq;
 	struct vhost_scsi *vs;
 	struct vhost_virtqueue **vqs;
 	int r = -ENOMEM, i, nvqs = vhost_scsi_max_io_vqs;
@@ -1788,10 +1786,6 @@ static int vhost_scsi_open(struct inode *inode, struct file *f)
 	}
 	nvqs +=
[PATCH 04/14] vhost: take worker or vq instead of dev for flushing
This patch has the core work flush function take a worker. When we support multiple workers we can then flush each worker during device removal, stoppage, etc.

Signed-off-by: Mike Christie
Acked-by: Jason Wang
---
 drivers/vhost/vhost.c | 24 +++++++++++++++---------
 1 file changed, 15 insertions(+), 9 deletions(-)

diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
index faf1dcf0af30..957f33f9ad25 100644
--- a/drivers/vhost/vhost.c
+++ b/drivers/vhost/vhost.c
@@ -247,6 +247,20 @@ static void vhost_work_queue_on(struct vhost_worker *worker,
 	}
 }
 
+static void vhost_work_flush_on(struct vhost_worker *worker)
+{
+	struct vhost_flush_struct flush;
+
+	if (!worker)
+		return;
+
+	init_completion(&flush.wait_event);
+	vhost_work_init(&flush.work, vhost_flush_work);
+
+	vhost_work_queue_on(worker, &flush.work);
+	wait_for_completion(&flush.wait_event);
+}
+
 void vhost_work_queue(struct vhost_dev *dev, struct vhost_work *work)
 {
 	vhost_work_queue_on(dev->worker, work);
@@ -261,15 +275,7 @@ EXPORT_SYMBOL_GPL(vhost_vq_work_queue);
 
 void vhost_dev_flush(struct vhost_dev *dev)
 {
-	struct vhost_flush_struct flush;
-
-	if (dev->worker) {
-		init_completion(&flush.wait_event);
-		vhost_work_init(&flush.work, vhost_flush_work);
-
-		vhost_work_queue(dev, &flush.work);
-		wait_for_completion(&flush.wait_event);
-	}
+	vhost_work_flush_on(dev->worker);
}
 EXPORT_SYMBOL_GPL(vhost_dev_flush);
-- 
2.25.1
_______________________________________________
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization
[PATCH 03/14] vhost: take worker or vq instead of dev for queueing
This patch has the core work queueing function take a worker for when we support multiple workers. It also adds a helper that takes a vq during queueing so modules can control which vq/worker to queue work on.

This temporarily leaves vhost_work_queue in place. It will be removed when the drivers are converted in the next patches.

Signed-off-by: Mike Christie
Acked-by: Jason Wang
---
 drivers/vhost/vhost.c | 44 ++++++++++++++++++++++++++-----------------
 drivers/vhost/vhost.h |  1 +
 2 files changed, 29 insertions(+), 16 deletions(-)

diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
index aefb52952e1d..faf1dcf0af30 100644
--- a/drivers/vhost/vhost.c
+++ b/drivers/vhost/vhost.c
@@ -231,6 +231,34 @@ void vhost_poll_stop(struct vhost_poll *poll)
 }
 EXPORT_SYMBOL_GPL(vhost_poll_stop);
 
+static void vhost_work_queue_on(struct vhost_worker *worker,
+				struct vhost_work *work)
+{
+	if (!worker)
+		return;
+
+	if (!test_and_set_bit(VHOST_WORK_QUEUED, &work->flags)) {
+		/* We can only add the work to the list after we're
+		 * sure it was not in the list.
+		 * test_and_set_bit() implies a memory barrier.
+		 */
+		llist_add(&work->node, &worker->work_list);
+		wake_up_process(worker->vtsk->task);
+	}
+}
+
+void vhost_work_queue(struct vhost_dev *dev, struct vhost_work *work)
+{
+	vhost_work_queue_on(dev->worker, work);
+}
+EXPORT_SYMBOL_GPL(vhost_work_queue);
+
+void vhost_vq_work_queue(struct vhost_virtqueue *vq, struct vhost_work *work)
+{
+	vhost_work_queue_on(vq->worker, work);
+}
+EXPORT_SYMBOL_GPL(vhost_vq_work_queue);
+
 void vhost_dev_flush(struct vhost_dev *dev)
 {
 	struct vhost_flush_struct flush;
@@ -245,22 +273,6 @@ void vhost_dev_flush(struct vhost_dev *dev)
 }
 EXPORT_SYMBOL_GPL(vhost_dev_flush);
 
-void vhost_work_queue(struct vhost_dev *dev, struct vhost_work *work)
-{
-	if (!dev->worker)
-		return;
-
-	if (!test_and_set_bit(VHOST_WORK_QUEUED, &work->flags)) {
-		/* We can only add the work to the list after we're
-		 * sure it was not in the list.
-		 * test_and_set_bit() implies a memory barrier.
-		 */
-		llist_add(&work->node, &dev->worker->work_list);
-		wake_up_process(dev->worker->vtsk->task);
-	}
-}
-EXPORT_SYMBOL_GPL(vhost_work_queue);
-
 /* A lockless hint for busy polling code to exit the loop */
 bool vhost_vq_has_work(struct vhost_virtqueue *vq)
 {
diff --git a/drivers/vhost/vhost.h b/drivers/vhost/vhost.h
index 0dde119fb0ee..b64ee4ef387d 100644
--- a/drivers/vhost/vhost.h
+++ b/drivers/vhost/vhost.h
@@ -194,6 +194,7 @@ int vhost_get_vq_desc(struct vhost_virtqueue *, struct vhost_log *log,
 		      unsigned int *log_num);
 void vhost_discard_vq_desc(struct vhost_virtqueue *, int n);
 
+void vhost_vq_work_queue(struct vhost_virtqueue *vq, struct vhost_work *work);
 bool vhost_vq_has_work(struct vhost_virtqueue *vq);
 bool vhost_vq_is_setup(struct vhost_virtqueue *vq);
 int vhost_vq_init_access(struct vhost_virtqueue *);
-- 
2.25.1
[PATCH 01/14] vhost: add vhost_worker pointer to vhost_virtqueue
This patchset allows userspace to map vqs to different workers. This patch adds a worker pointer to the vq so we can store that info.

Signed-off-by: Mike Christie
Acked-by: Jason Wang
---
 drivers/vhost/vhost.c | 24 ++++++++++++------------
 drivers/vhost/vhost.h |  1 +
 2 files changed, 14 insertions(+), 11 deletions(-)

diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
index 10bf35a3db6e..7a8eef246e1f 100644
--- a/drivers/vhost/vhost.c
+++ b/drivers/vhost/vhost.c
@@ -486,6 +486,7 @@ void vhost_dev_init(struct vhost_dev *dev,
 		vq->log = NULL;
 		vq->indirect = NULL;
 		vq->heads = NULL;
+		vq->worker = NULL;
 		vq->dev = dev;
 		mutex_init(&vq->mutex);
 		vhost_vq_reset(dev, vq);
@@ -554,16 +555,15 @@ static void vhost_worker_free(struct vhost_dev *dev)
 	kfree(worker);
 }
 
-static int vhost_worker_create(struct vhost_dev *dev)
+static struct vhost_worker *vhost_worker_create(struct vhost_dev *dev)
 {
 	struct vhost_worker *worker;
 	struct vhost_task *vtsk;
 	char name[TASK_COMM_LEN];
-	int ret;
 
 	worker = kzalloc(sizeof(*worker), GFP_KERNEL_ACCOUNT);
 	if (!worker)
-		return -ENOMEM;
+		return NULL;
 
 	dev->worker = worker;
 	worker->kcov_handle = kcov_common_handle();
@@ -571,25 +571,24 @@ static int vhost_worker_create(struct vhost_dev *dev)
 	snprintf(name, sizeof(name), "vhost-%d", current->pid);
 
 	vtsk = vhost_task_create(vhost_worker, worker, name);
-	if (!vtsk) {
-		ret = -ENOMEM;
+	if (!vtsk)
 		goto free_worker;
-	}
 
 	worker->vtsk = vtsk;
 	vhost_task_start(vtsk);
-	return 0;
+	return worker;
 
 free_worker:
 	kfree(worker);
 	dev->worker = NULL;
-	return ret;
+	return NULL;
 }
 
 /* Caller should have device mutex */
 long vhost_dev_set_owner(struct vhost_dev *dev)
 {
-	int err;
+	struct vhost_worker *worker;
+	int err, i;
 
 	/* Is there an owner already? */
 	if (vhost_dev_has_owner(dev)) {
@@ -600,9 +599,12 @@ long vhost_dev_set_owner(struct vhost_dev *dev)
 	vhost_attach_mm(dev);
 
 	if (dev->use_worker) {
-		err = vhost_worker_create(dev);
-		if (err)
+		worker = vhost_worker_create(dev);
+		if (!worker)
 			goto err_worker;
+
+		for (i = 0; i < dev->nvqs; i++)
+			dev->vqs[i]->worker = worker;
 	}
 
 	err = vhost_dev_alloc_iovecs(dev);
diff --git a/drivers/vhost/vhost.h b/drivers/vhost/vhost.h
index 0308638cdeee..e72b665ba3a5 100644
--- a/drivers/vhost/vhost.h
+++ b/drivers/vhost/vhost.h
@@ -74,6 +74,7 @@ struct vhost_vring_call {
 /* The virtqueue structure describes a queue attached to a device. */
 struct vhost_virtqueue {
 	struct vhost_dev *dev;
+	struct vhost_worker *worker;
 
 	/* The actual ring of buffers. */
 	struct mutex mutex;
-- 
2.25.1
[PATCH 02/14] vhost, vhost_net: add helper to check if vq has work
In the next patches each vq might have different workers, so one could have work but others do not. For net, we only want to check specific vqs, so this adds a helper to check if a vq has work pending and converts vhost-net to use it.

Signed-off-by: Mike Christie
Acked-by: Jason Wang
---
 drivers/vhost/net.c   | 2 +-
 drivers/vhost/vhost.c | 6 +++---
 drivers/vhost/vhost.h | 2 +-
 3 files changed, 5 insertions(+), 5 deletions(-)

diff --git a/drivers/vhost/net.c b/drivers/vhost/net.c
index 07181cd8d52e..8ed63651b9eb 100644
--- a/drivers/vhost/net.c
+++ b/drivers/vhost/net.c
@@ -546,7 +546,7 @@ static void vhost_net_busy_poll(struct vhost_net *net,
 	endtime = busy_clock() + busyloop_timeout;
 
 	while (vhost_can_busy_poll(endtime)) {
-		if (vhost_has_work(&net->dev)) {
+		if (vhost_vq_has_work(vq)) {
 			*busyloop_intr = true;
 			break;
 		}
diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
index 7a8eef246e1f..aefb52952e1d 100644
--- a/drivers/vhost/vhost.c
+++ b/drivers/vhost/vhost.c
@@ -262,11 +262,11 @@ void vhost_work_queue(struct vhost_dev *dev, struct vhost_work *work)
 EXPORT_SYMBOL_GPL(vhost_work_queue);
 
 /* A lockless hint for busy polling code to exit the loop */
-bool vhost_has_work(struct vhost_dev *dev)
+bool vhost_vq_has_work(struct vhost_virtqueue *vq)
 {
-	return dev->worker && !llist_empty(&dev->worker->work_list);
+	return vq->worker && !llist_empty(&vq->worker->work_list);
 }
-EXPORT_SYMBOL_GPL(vhost_has_work);
+EXPORT_SYMBOL_GPL(vhost_vq_has_work);
 
 void vhost_poll_queue(struct vhost_poll *poll)
 {
diff --git a/drivers/vhost/vhost.h b/drivers/vhost/vhost.h
index e72b665ba3a5..0dde119fb0ee 100644
--- a/drivers/vhost/vhost.h
+++ b/drivers/vhost/vhost.h
@@ -45,7 +45,6 @@ struct vhost_poll {
 
 void vhost_work_init(struct vhost_work *work, vhost_work_fn_t fn);
 void vhost_work_queue(struct vhost_dev *dev, struct vhost_work *work);
-bool vhost_has_work(struct vhost_dev *dev);
 
 void vhost_poll_init(struct vhost_poll *poll, vhost_work_fn_t fn,
 		     __poll_t mask, struct vhost_dev *dev);
@@ -195,6 +194,7 @@ int vhost_get_vq_desc(struct vhost_virtqueue *, struct vhost_log *log,
 		      unsigned int *log_num);
 void vhost_discard_vq_desc(struct vhost_virtqueue *, int n);
 
+bool vhost_vq_has_work(struct vhost_virtqueue *vq);
 bool vhost_vq_is_setup(struct vhost_virtqueue *vq);
 int vhost_vq_init_access(struct vhost_virtqueue *);
 int vhost_add_used(struct vhost_virtqueue *, unsigned int head, int len);
-- 
2.25.1
Re: [PATCH] virtio_net: suppress cpu stall when free_unused_bufs
Qi Zheng wrote:
>
> On 2023/4/27 16:23, Michael S. Tsirkin wrote:
>> On Thu, Apr 27, 2023 at 04:13:45PM +0800, Xuan Zhuo wrote:
>>> On Thu, 27 Apr 2023 04:12:44 -0400, "Michael S. Tsirkin" wrote:
>>>> On Thu, Apr 27, 2023 at 03:13:44PM +0800, Xuan Zhuo wrote:
>>>>> On Thu, 27 Apr 2023 15:02:26 +0800, Wenliang Wang wrote:
>>>>>>
>>>>>> On 4/27/23 2:20 PM, Xuan Zhuo wrote:
>>>>>>> On Thu, 27 Apr 2023 12:34:33 +0800, Wenliang Wang wrote:
>>>>>>>> For multi-queue and large rx-ring-size use case, the following error
>>>>>>>
>>>>>>> Could you give us one number for example?
>>>>>>
>>>>>> 128 queues and 16K queue_size is typical.
>>>>>>
>>>>>>>> occurred when free_unused_bufs:
>>>>>>>> rcu: INFO: rcu_sched self-detected stall on CPU.
>>>>>>>>
>>>>>>>> Signed-off-by: Wenliang Wang
>>>>>>>> ---
>>>>>>>>  drivers/net/virtio_net.c | 1 +
>>>>>>>>  1 file changed, 1 insertion(+)
>>>>>>>>
>>>>>>>> diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
>>>>>>>> index ea1bd4bb326d..21d8382fd2c7 100644
>>>>>>>> --- a/drivers/net/virtio_net.c
>>>>>>>> +++ b/drivers/net/virtio_net.c
>>>>>>>> @@ -3565,6 +3565,7 @@ static void free_unused_bufs(struct virtnet_info *vi)
>>>>>>>>  		struct virtqueue *vq = vi->rq[i].vq;
>>>>>>>>  		while ((buf = virtqueue_detach_unused_buf(vq)) != NULL)
>>>>>>>>  			virtnet_rq_free_unused_buf(vq, buf);
>>>>>>>> +		schedule();
>>>>>>>
>>>>>>> Just for rq?
>>>>>>>
>>>>>>> Do we need to do the same thing for sq?
>>>>>>
>>>>>> Rq buffers are pre-allocated, so it takes seconds to free the rq unused
>>>>>> buffers.
>>>>>>
>>>>>> Sq unused buffers are much fewer, so doing the same for sq is optional.
>>>>>
>>>>> I see. I think we should look for a way compatible with fewer queues or
>>>>> smaller rings. Calling schedule() directly may not be a good way.
>>>>>
>>>>> Thanks.
>>>>
>>>> Why isn't it a good way?
>>>
>>> For a small ring, I don't think it is a good way; maybe we only deal with
>>> one buf, then call schedule().
>>>
>>> We can call schedule() after processing a certain number of buffers, or
>>> check need_resched() first.
>>>
>>> Thanks.
>>
>> Wenliang, does
>> 	if (need_resched())
>> 		schedule();
>
> Can we just use cond_resched()?

I believe that is preferred. But v2 still calls schedule directly.
Re: [PATCH RFC net-next v2 0/4] virtio/vsock: support datagrams
On Sat, Apr 15, 2023 at 07:13:47AM +0000, Bobby Eshleman wrote:

CC'ing virtio-...@lists.oasis-open.org because this thread is starting to touch the spec.

On Wed, Apr 19, 2023 at 12:00:17PM +0200, Stefano Garzarella wrote:

Hi Bobby,

On Fri, Apr 14, 2023 at 11:18:40AM +0000, Bobby Eshleman wrote:
> CC'ing Cong.
>
> On Fri, Apr 14, 2023 at 12:25:56AM +0000, Bobby Eshleman wrote:
> > Hey all!
> >
> > This series introduces support for datagrams to virtio/vsock.

Great! Thanks for restarting this work!

No problem!

> > It is a spin-off (and smaller version) of this series from the summer:
> > https://lore.kernel.org/all/cover.1660362668.git.bobby.eshle...@bytedance.com/
> >
> > Please note that this is an RFC and should not be merged until
> > associated changes are made to the virtio specification, which will
> > follow after discussion from this series.
> >
> > This series first supports datagrams in a basic form for virtio, and
> > then optimizes the sendpath for all transports.
> >
> > The result is a very fast datagram communication protocol that
> > outperforms even UDP on multi-queue virtio-net w/ vhost on a variety
> > of multi-threaded workload samples.
> >
> > For those that are curious, some summary data comparing UDP and VSOCK
> > DGRAM (N=5):
> >
> > vCPUS: 16
> > virtio-net queues: 16
> > payload size: 4KB
> > Setup: bare metal + vm (non-nested)
> >
> > UDP: 287.59 MB/s
> > VSOCK DGRAM: 509.2 MB/s
> >
> > Some notes about the implementation...
> >
> > This datagram implementation forces datagrams to self-throttle according
> > to the threshold set by sk_sndbuf. It behaves similar to the credits
> > used by streams in its effect on throughput and memory consumption, but
> > it is not influenced by the receiving socket as credits are.

So, sk_sndbuf influences the sender and sk_rcvbuf the receiver, right?

Correct.

We should check if VMCI behaves the same.

> > The device drops packets silently. There is room for improvement by
> > building into the device and driver some intelligence around how to
> > reduce frequency of kicking the virtqueue when packet loss is high. I
> > think there is a good discussion to be had on this.

Can you elaborate a bit here? Do you mean some mechanism to report to the sender that a destination (cid, port) is full, so the packet will be dropped?

Correct. There is also the case of there being no receiver at all for this address, since this case isn't rejected upon connect(). Ideally, such a socket (which will have 100% packet loss) will be throttled aggressively.

Before we go down too far on this path, I also want to clarify that using UDP over vhost/virtio-net also has this property... this can be observed by using tcpdump to dump the UDP packets on the bridge network your VM is using. UDP packets sent to a garbage address can be seen on the host bridge (this is the nature of UDP; how is the host supposed to know the address eventually goes nowhere).

I mention the above because I think it is possible for vsock to avoid this cost, given that it benefits from being point-to-point and g2h/h2g. If we're okay with vsock being on par, then the current series does that. I propose something below that can be added later and maybe negotiated as a feature bit too.

I see and I agree on that, let's do it step by step. If we can do it in the first phase that's great, but I think it is fine to add this feature later.

Can we adapt the credit mechanism?

I've thought about this a lot because the attraction of the approach for me would be that we could get the wait/buffer-limiting logic for free and without big changes to the protocol, but the problem is that the unreliable nature of datagrams means that the source's free-running tx_cnt will become out-of-sync with the destination's fwd_cnt upon packet loss.

We need to understand where the packet can be lost. If the packet always reaches the destination (vsock driver or device), we can discard it, but also update the counters.

Imagine a source that initializes and starts sending packets before a destination socket even is created; the source's self-throttling will be dysfunctional because its tx_cnt will always far exceed the destination's fwd_cnt.

Right. The other problem I see is that the sockets aren't connected, so we have a 1-N relationship.

We could play tricks with the meaning of the CREDIT_UPDATE message and the fwd_cnt/buf_alloc fields, but I don't think we want to go down that path.

I think that the best and simplest approach introduces a congestion notification (VIRTIO_VSOCK_OP_CN?). When a packet is dropped, the destination sends this notification. At a given repeated time period T, the source can check if it has received any notifications in the last T. If so, it halves its buffer allocation. If not, it doubles its buffer allocation unless it is already at its max or original value. An "invalid" socket which never has any receiver will converge towards a
Re: [PATCH RFC net-next v2 3/4] vsock: Add lockless sendmsg() support
On Sat, Apr 15, 2023 at 10:30:55AM +0000, Bobby Eshleman wrote:

On Wed, Apr 19, 2023 at 11:30:53AM +0200, Stefano Garzarella wrote:

On Fri, Apr 14, 2023 at 12:25:59AM +0000, Bobby Eshleman wrote:
> Because the dgram sendmsg() path for AF_VSOCK acquires the socket lock
> it does not scale when many senders share a socket.
>
> Prior to this patch the socket lock is used to protect the local_addr,
> remote_addr, transport, and buffer size variables. What follows are the
> new protection schemes for the various protected fields that ensure a
> race-free multi-sender sendmsg() path for vsock dgrams.
>
> - local_addr
>   local_addr changes as a result of binding a socket. The write path
>   for local_addr is bind() and various vsock_auto_bind() call sites.
>   After a socket has been bound via vsock_auto_bind() or bind(),
>   subsequent calls to bind()/vsock_auto_bind() do not write to
>   local_addr again. bind() rejects the user request and
>   vsock_auto_bind() early exits.
>   Therefore, the local addr can not change while a parallel thread is
>   in sendmsg() and lock-free reads of local addr in sendmsg() are safe.
>   Change: only acquire lock for auto-binding as-needed in sendmsg().
>
> - vsk->transport
>   Updated upon socket creation and it doesn't change again until the

This is true only for dgram, right?

Yes.

How do we decide which transport to assign for dgram?

The transport is assigned in proto->create() [vsock_create()]. It is assigned there *only* for dgrams, whereas for streams/seqpackets it is assigned in connect(). vsock_create() sets transport to 'transport_dgram' if sock->type == SOCK_DGRAM. vsock_sk_destruct() then eventually sets vsk->transport to NULL.

Neither destruct nor create can occur in parallel with sendmsg(): create() hasn't yet returned the sockfd for the user to call sendmsg() upon it, and AFAICT destruct is only called after the final socket reference is released, which only happens after the socket no longer exists in the fd lookup table, so sendmsg() will fail before it ever has the chance to race.

This is okay if a socket can be assigned to a single transport, but with dgrams I'm not sure we can do that. Since it is not connected, a socket can send or receive packets from different transports, so maybe we can't assign it to a specific one, but must do a lookup for every packet to understand which transport to use.

>   socket is destroyed, which only happens after the socket refcnt reaches
>   zero. This prevents any sendmsg() call from being entered because the
>   sockfd lookup fails beforehand. That is, sendmsg() and vsk->transport
>   writes cannot execute in parallel. Additionally, connect() doesn't
>   update vsk->transport for dgrams as it does for streams. Therefore
>   vsk->transport is also safe to access lock-free in the sendmsg() path.
>   No change.
>
> - buffer size variables
>   Not used by dgram, so they do not need protection. No change.

Is this true because for dgram we use the socket buffer? Is it the same for VMCI?

Yes. The buf_alloc derived from buffer_size is also always ignored after being initialized once, since credits aren't used. My reading of the VMCI code is that the buffer_size and buffer_{min,max}_size variables are used only in connectible calls (like connect and recv_listen), but not for datagrams.

Okay, thanks for checking!

> - remote_addr
>   Needs additional protection because before this patch the
>   remote_addr (consisting of several fields such as cid, port, and flags)
>   only changed atomically under socket lock context. By acquiring the
>   socket lock to read the structure, the changes made by connect() were
>   always made visible to sendmsg() atomically. Consequently, to retain
>   atomicity of updates but offer lock-free access, this patch
>   redesigns this field as an RCU-protected pointer.
>
>   Writers are still synchronized using the socket lock, but readers
>   only read inside RCU read-side critical sections.
>
> Helpers are introduced for accessing and updating the new pointer.
>
> The remote_addr structure is wrapped together with an rcu_head into a
> sockaddr_vm_rcu structure so that kfree_rcu() can be used. This removes
> the need of writers to use synchronize_rcu() after freeing old structures,
> which is simply more efficient and reduces code churn where remote_addr
> is already being updated inside read-side sections.
>
> Only virtio has been tested, but updates were necessary to the VMCI and
> hyperv code. Unfortunately the author does not have access to
> VMCI/hyperv systems so those changes are untested.
>
> Perf Tests
> vCPUS: 16
> Threads: 16
> Payload: 4KB
> Test Runs: 5
> Type: SOCK_DGRAM
>
> Before: 245.2 MB/s
> After: 509.2 MB/s (+107%)
>
> Notably, on the same test system, vsock dgram even outperforms
> multi-threaded UDP over virtio-net with vhost and MQ support enabled.

Cool! This patch is quite large, so I need to review it carefully in future