Re: [PATCH v2] fuse: rename some files and clean up Makefile

2021-12-01 Thread Stefan Hajnoczi
On Mon, Nov 29, 2021 at 09:27:17PM +0800, Tiezhu Yang wrote:
> On 11/29/2021 06:19 PM, Stefan Hajnoczi wrote:
> > On Sat, Nov 27, 2021 at 06:13:22PM +0800, Tiezhu Yang wrote:
> > > No need to generate virtio_fs.o first and then link to virtiofs.o, just
> > > rename virtio_fs.c to virtiofs.c and remove "virtiofs-y := virtio_fs.o"
> > > in Makefile, also update MAINTAINERS. Additionally, rename the private
> > > header file fuse_i.h to fuse.h, like ext4.h in fs/ext4, xfs.h in fs/xfs
> > > and f2fs.h in fs/f2fs.
> > 
> > There are two separate changes in this patch (virtio_fs.c -> virtiofs.c
> > and fuse_i.h -> fuse.h). A patch series with two patches would be easier
> > to review and cleaner to backport.
> > 
> > I'm happy with renaming virtio_fs.c to virtiofs.c:
> > 
> > Reviewed-by: Stefan Hajnoczi 
> > 
> 
> Hi Stefan and Miklos,
> 
> Thanks for your reply, what should I do now?
> 
> (1) split this patch into two separate patches to send v3;
> (2) just ignore this patch because
> "This will make backport of bugfixes harder for no good reason."
> said by Miklos.

Miklos' point makes sense to me and he is the FUSE maintainer.

Stefan



Re: [PATCH v2] fuse: rename some files and clean up Makefile

2021-11-29 Thread Stefan Hajnoczi
On Sat, Nov 27, 2021 at 06:13:22PM +0800, Tiezhu Yang wrote:
> No need to generate virtio_fs.o first and then link to virtiofs.o, just
> rename virtio_fs.c to virtiofs.c and remove "virtiofs-y := virtio_fs.o"
> in Makefile, also update MAINTAINERS. Additionally, rename the private
> header file fuse_i.h to fuse.h, like ext4.h in fs/ext4, xfs.h in fs/xfs
> and f2fs.h in fs/f2fs.

There are two separate changes in this patch (virtio_fs.c -> virtiofs.c
and fuse_i.h -> fuse.h). A patch series with two patches would be easier
to review and cleaner to backport.

I'm happy with renaming virtio_fs.c to virtiofs.c:

Reviewed-by: Stefan Hajnoczi 



Re: [PATCH] virtio-blk: modify the value type of num in virtio_queue_rq()

2021-11-24 Thread Stefan Hajnoczi
On Wed, Nov 17, 2021 at 06:39:55AM +, cgel@gmail.com wrote:
> From: Ye Guojin 
> 
> This was found by coccicheck:
> ./drivers/block/virtio_blk.c, 334, 14-17, WARNING Unsigned expression
> compared with zero  num < 0
> 
> Reported-by: Zeal Robot 
> Signed-off-by: Ye Guojin 
> ---
>  drivers/block/virtio_blk.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)

Reviewed-by: Stefan Hajnoczi 
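For readers unfamiliar with this class of warning, here is a self-contained sketch (hypothetical helper, not the driver code) of why an unsigned declaration hides the error path:

    #include <stdio.h>

    /* Stand-in for a mapping helper that reports failure as a negative errno. */
    static int map_data(void)
    {
            return -1;
    }

    int main(void)
    {
            unsigned int num_unsigned = map_data(); /* -1 wraps to a huge value */
            int num_signed = map_data();

            if (num_unsigned < 0)   /* always false: the error is silently lost */
                    printf("unsigned: error detected\n");
            if (num_signed < 0)     /* taken as expected */
                    printf("signed: error detected\n");
            return 0;
    }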



Re: [PATCH 0/2] vhost/vsock: fix used length and cleanup in vhost_vsock_handle_tx_kick()

2021-11-23 Thread Stefan Hajnoczi
On Mon, Nov 22, 2021 at 05:35:23PM +0100, Stefano Garzarella wrote:
> This is a follow-up to Michael's patch [1] and the discussion with Halil and
> Jason [2].
> 
> I made two patches, one to fix the problem and one for cleanup. This should
> simplify the backport of the fix because we've had the problem since
> vhost-vsock was introduced (v4.8) and that part has been touched a bit
> recently.
> 
> Thanks,
> Stefano
> 
> [1] 
> https://lore.kernel.org/virtualization/20211122105822.onarsa4sydzxqynu@steredhat/T/#t
> [2] 
> https://lore.kernel.org/virtualization/20211027022107.14357-1-jasow...@redhat.com/T/#t
> 
> Stefano Garzarella (2):
>   vhost/vsock: fix incorrect used length reported to the guest
>   vhost/vsock: cleanup removing `len` variable
> 
>  drivers/vhost/vsock.c | 8 ++--
>  1 file changed, 2 insertions(+), 6 deletions(-)
> 
> -- 
> 2.31.1
> 

Reviewed-by: Stefan Hajnoczi 
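The core of the fix is a one-line change in vhost_vsock_handle_tx_kick(), roughly as sketched below (variable names illustrative): TX virtqueue buffers are only read by the host, so the used length reported back to the guest must be 0 rather than the number of bytes the host consumed.

        /* before: reported how much the host read out of the guest's buffer */
        vhost_add_used(vq, head, sizeof(pkt->hdr) + payload_len);

        /* after: the host wrote nothing into this buffer, so report 0 */
        vhost_add_used(vq, head, 0);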



Re: [PATCH] vsock/virtio: suppress used length validation

2021-11-23 Thread Stefan Hajnoczi
On Mon, Nov 22, 2021 at 04:32:01AM -0500, Michael S. Tsirkin wrote:
> It turns out that vhost vsock violates the virtio spec
> by supplying the out buffer length in the used length
> (should just be the in length).
> As a result, attempts to validate the used length fail with:
> vmw_vsock_virtio_transport virtio1: tx: used len 44 is larger than in buflen 0
> 
> Since the vsock driver does not use the length for tx and
> validates the length before use for rx, it is safe to
> suppress the validation in virtio core for this driver.
> 
> Reported-by: Halil Pasic 
> Fixes: 939779f5152d ("virtio_ring: validate used buffer length")
> Cc: "Jason Wang" 
> Signed-off-by: Michael S. Tsirkin 
> ---
>  net/vmw_vsock/virtio_transport.c | 1 +
>  1 file changed, 1 insertion(+)

Reviewed-by: Stefan Hajnoczi 
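The change itself is essentially a one-line opt-out in the driver declaration, roughly along these lines (field name as introduced by the used-length validation commit named in the Fixes tag):

    static struct virtio_driver virtio_vsock_driver = {
            /* ... */
            .suppress_used_validation = true,
            /* ... */
    };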



Re: [RFC] hypercall-vsock: add a new vsock transport

2021-11-10 Thread Stefan Hajnoczi
On Wed, Nov 10, 2021 at 07:12:36AM +, Wang, Wei W wrote:
> We plan to add a new vsock transport based on hypercall (e.g. vmcall on
> Intel CPUs). It transports AF_VSOCK packets between the guest and host,
> which is similar to virtio-vsock, vmci-vsock and hyperv-vsock.
> 
> Compared to the above listed vsock transports, which are designed for
> high performance, the main advantages of hypercall-vsock are:
> 
> 1) It is VMM agnostic. For example, one guest working on hypercall-vsock
>    can run on either KVM, Hyper-V, or VMware.
> 
> 2) It is simpler. It doesn't rely on any complex bus enumeration
>    (e.g. a virtio-pci based vsock device may need the whole implementation
>    of PCI).
> 
> An example usage is the communication between MigTD and host (Page 8 at
> https://static.sched.com/hosted_files/kvmforum2021/ef/TDX%20Live%20Migration_Wei%20Wang.pdf).
> MigTD communicates with the host to assist the migration of the target
> (user) TD. MigTD is part of the TCB, so its implementation is expected to
> be as simple as possible (e.g. a bare-metal implementation without an OS,
> no PCI driver support).

AF_VSOCK is designed to allow multiple transports, so why not. There is
a cost to developing and maintaining a vsock transport though.

I think Amazon Nitro enclaves use virtio-vsock and I've CCed Andra in
case she has thoughts on the pros/cons and how to minimize the trusted
computing base.

If simplicity is the top priority then VIRTIO's MMIO transport without
indirect descriptors and using the packed virtqueue layout reduces the
size of the implementation:
https://docs.oasis-open.org/virtio/virtio/v1.1/cs01/virtio-v1.1-cs01.html#x1-1440002
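For context, the packed layout replaces the split ring's separate descriptor table, available ring and used ring with a single descriptor array; each descriptor looks roughly like this (per the spec section above, shown with Linux-style types):

    struct vring_packed_desc {
            __le64 addr;   /* buffer guest-physical address */
            __le32 len;    /* buffer length */
            __le16 id;     /* buffer ID, echoed back when the buffer is used */
            __le16 flags;  /* AVAIL/USED ownership bits plus WRITE, NEXT, ... */
    };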

Stefan



Re: [PATCH][next] virtio_blk: Fix spelling mistake: "advertisted" -> "advertised"

2021-10-28 Thread Stefan Hajnoczi
On Mon, Oct 25, 2021 at 11:22:40AM +0100, Colin Ian King wrote:
> There is a spelling mistake in a dev_err error message. Fix it.
> 
> Signed-off-by: Colin Ian King 
> ---
>  drivers/block/virtio_blk.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)

Reviewed-by: Stefan Hajnoczi 



Re: [PATCH V3 11/11] vhost: allow userspace to create workers

2021-10-27 Thread Stefan Hajnoczi
On Tue, Oct 26, 2021 at 11:49:37AM -0500, michael.chris...@oracle.com wrote:
> On 10/26/21 12:37 AM, Jason Wang wrote:
> > Do we need VHOST_VRING_FREE_WORKER? And I wonder if using dedicated ioctls 
> > are better:
> > 
> > VHOST_VRING_NEW/FREE_WORKER
> > VHOST_VRING_ATTACH_WORKER
> 
> 
> We didn't need a free worker, because the kernel handles it for userspace. I
> tried to make it easy for userspace because in some cases it may not be able
> to do syscalls like close on the device. For example if qemu crashes or for
> vhost-scsi we don't do an explicit close during VM shutdown.
> 
> So we start off with the default worker thread that's used by all vqs like we 
> do
> today. Userspace can then override it by creating a new worker. That also 
> unbinds/
> detaches the existing worker and does a put on the workers refcount. We also 
> do a
> put on the worker when we stop using it during device 
> shutdown/closure/release.
> When the worker's refcount goes to zero the kernel deletes it.

Please document the worker (p)id lifetime for the ioctl. Otherwise
userspace doesn't know whether a previously created worker is still
alive.

Stefan



Re: [PATCH V3 11/11] vhost: allow userspace to create workers

2021-10-27 Thread Stefan Hajnoczi
On Wed, Oct 27, 2021 at 10:55:04AM +0800, Jason Wang wrote:
> On Tue, Oct 26, 2021 at 11:45 PM Stefan Hajnoczi  wrote:
> >
> > On Tue, Oct 26, 2021 at 01:37:14PM +0800, Jason Wang wrote:
> > >
> > > 在 2021/10/22 下午1:19, Mike Christie 写道:
> > > > This patch allows userspace to create workers and bind them to vqs. You
> > > > can have N workers per dev and also share N workers with M vqs.
> > > >
> > > > Signed-off-by: Mike Christie 
> > >
> > >
> > > A question, who is the best one to determine the binding? Is it the VMM
> > > (Qemu etc) or the management stack? If the latter, it looks to me it's
> > > better to expose this via sysfs?
> >
> > A few options that let the management stack control vhost worker CPU
> > affinity:
> >
> > 1. The management tool opens the vhost device node, calls
> >ioctl(VHOST_SET_VRING_WORKER), sets up CPU affinity, and then passes
> >the fd to the VMM. In this case the VMM is still able to call the
> >ioctl, which may be undesirable from an attack surface perspective.
> 
> Yes, and we can't do post or dynamic configuration afterwards after
> the VM is launched?

Yes, at least it's a little risky for the management stack to keep the
vhost fd open and make ioctl calls while the VMM is using it.

> >
> > 2. The VMM calls ioctl(VHOST_SET_VRING_WORKER) itself and the management
> >tool queries the vq:worker details from the VMM (e.g. a new QEMU QMP
> >query-vhost-workers command similar to query-iothreads). The
> >management tool can then control CPU affinity on the vhost worker
> >threads.
> >
> >(This is how CPU affinity works in QEMU and libvirt today.)
> 
> Then we also need a "bind-vhost-workers" command.

The VMM doesn't but the management tool does.

Stefan



Re: [PATCH V3 11/11] vhost: allow userspace to create workers

2021-10-26 Thread Stefan Hajnoczi
On Tue, Oct 26, 2021 at 09:09:52AM -0400, Michael S. Tsirkin wrote:
> On Tue, Oct 26, 2021 at 01:37:14PM +0800, Jason Wang wrote:
> > 
> > 在 2021/10/22 下午1:19, Mike Christie 写道:
> > > This patch allows userspace to create workers and bind them to vqs. You
> > > can have N workers per dev and also share N workers with M vqs.
> > > 
> > > Signed-off-by: Mike Christie 
> > 
> > 
> > A question, who is the best one to determine the binding? Is it the VMM
> > (Qemu etc) or the management stack? If the latter, it looks to me it's
> > better to expose this via sysfs?
> 
> I think it's a bit much to expect this from management.

The management stack controls the number of vqs used as well as the vCPU
and IOThread CPU affinity. It seems natural for it to also control the
vhost worker CPU affinity. Where else should that be controlled?

Stefan



Re: [PATCH V3 11/11] vhost: allow userspace to create workers

2021-10-26 Thread Stefan Hajnoczi
On Tue, Oct 26, 2021 at 01:37:14PM +0800, Jason Wang wrote:
> 
> 在 2021/10/22 下午1:19, Mike Christie 写道:
> > This patch allows userspace to create workers and bind them to vqs. You
> > can have N workers per dev and also share N workers with M vqs.
> > 
> > Signed-off-by: Mike Christie 
> 
> 
> A question, who is the best one to determine the binding? Is it the VMM
> (Qemu etc) or the management stack? If the latter, it looks to me it's
> better to expose this via sysfs?

A few options that let the management stack control vhost worker CPU
affinity:

1. The management tool opens the vhost device node, calls
   ioctl(VHOST_SET_VRING_WORKER), sets up CPU affinity, and then passes
   the fd to the VMM. In this case the VMM is still able to call the
   ioctl, which may be undesirable from an attack surface perspective.

2. The VMM calls ioctl(VHOST_SET_VRING_WORKER) itself and the management
   tool queries the vq:worker details from the VMM (e.g. a new QEMU QMP
   query-vhost-workers command similar to query-iothreads). The
   management tool can then control CPU affinity on the vhost worker
   threads.

   (This is how CPU affinity works in QEMU and libvirt today.)

3. The sysfs approach you suggested. Does sysfs export vq-0/, vq-1/, etc
   directories with a "worker" attribute? Do we need to define a point
   when the VMM has set up vqs and the management stack is able to query
   them? Vhost devices currently pre-allocate the maximum number of vqs
   and I'm not sure how to determine the number of vqs that will
   actually be used?

   One advantage of this is that access to the vq:worker mapping can be
   limited to the management stack and the VMM cannot access it. But it
   seems a little tricky because the vhost model today doesn't use sysfs
   or define a lifecycle where the management stack can configure
   devices.

Stefan
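As an illustration of option 1, a rough userspace sketch follows (error handling omitted; it assumes the uapi additions from this series, with struct vhost_vring_worker carrying a vring index plus a worker pid as the uapi comment in patch 11 describes):

    #define _GNU_SOURCE
    #include <fcntl.h>
    #include <sched.h>
    #include <sys/ioctl.h>
    #include <linux/vhost.h>

    static int create_pinned_worker(const char *dev_path, int cpu)
    {
            int vhost_fd = open(dev_path, O_RDWR);
            struct vhost_vring_worker w = {
                    .index = 0,                     /* virtqueue to bind */
                    .pid = VHOST_VRING_NEW_WORKER,  /* ask the kernel for a new worker */
            };
            cpu_set_t cpus;

            ioctl(vhost_fd, VHOST_SET_OWNER);            /* workers exist only after this */
            ioctl(vhost_fd, VHOST_SET_VRING_WORKER, &w); /* w.pid now holds the worker pid */

            CPU_ZERO(&cpus);
            CPU_SET(cpu, &cpus);
            sched_setaffinity(w.pid, sizeof(cpus), &cpus);

            return vhost_fd;  /* hand this fd to the VMM, e.g. via SCM_RIGHTS */
    }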



Re: [PATCH V3 11/11] vhost: allow userspace to create workers

2021-10-26 Thread Stefan Hajnoczi
On Fri, Oct 22, 2021 at 12:19:11AM -0500, Mike Christie wrote:
> diff --git a/include/uapi/linux/vhost.h b/include/uapi/linux/vhost.h
> index c998860d7bbc..e5c0669430e5 100644
> --- a/include/uapi/linux/vhost.h
> +++ b/include/uapi/linux/vhost.h
> @@ -70,6 +70,17 @@
>  #define VHOST_VRING_BIG_ENDIAN 1
>  #define VHOST_SET_VRING_ENDIAN _IOW(VHOST_VIRTIO, 0x13, struct vhost_vring_state)
>  #define VHOST_GET_VRING_ENDIAN _IOW(VHOST_VIRTIO, 0x14, struct vhost_vring_state)
> +/* By default, a device gets one vhost_worker created during VHOST_SET_OWNER
> + * that its virtqueues share. This allows userspace to create a vhost_worker
> + * and map a virtqueue to it or map a virtqueue to an existing worker.
> + *
> + * If pid > 0 and it matches an existing vhost_worker thread it will be bound
> + * to the vq. If pid is VHOST_VRING_NEW_WORKER, then a new worker will be
> + * created and bound to the vq.
> + *
> + * This must be called after VHOST_SET_OWNER and before the vq is active.
> + */
> +#define VHOST_SET_VRING_WORKER _IOWR(VHOST_VIRTIO, 0x15, struct vhost_vring_worker)

Please clarify whether or not multiple devices can attach vqs to the same 
worker.

Stefan



Re: [PATCH V3 11/11] vhost: allow userspace to create workers

2021-10-26 Thread Stefan Hajnoczi
On Fri, Oct 22, 2021 at 12:19:11AM -0500, Mike Christie wrote:
> +/* Caller must have device mutex */
> +static int vhost_vq_setup_worker(struct vhost_virtqueue *vq,
> +  struct vhost_vring_worker *info)

It's clearer if the function name matches the ioctl name
(VHOST_SET_VRING_WORKER).



Re: [PATCH] virtio_blk: correct types for status handling

2021-10-25 Thread Stefan Hajnoczi
On Mon, Oct 25, 2021 at 03:58:54AM -0400, Michael S. Tsirkin wrote:
> virtblk_setup_cmd returns blk_status_t in an int, callers then assign it
> back to a blk_status_t variable. blk_status_t is either u32 or (more
> typically) u8 so it works, but is inelegant and causes sparse warnings.
> 
> Pass the status in blk_status_t in a consistent way.
> 
> Reported-by: kernel test robot 
> Fixes: b2c5221fd074 ("virtio-blk: avoid preallocating big SGL for data")
> Cc: Max Gurtovoy 
> Signed-off-by: Michael S. Tsirkin 
> ---
>  drivers/block/virtio_blk.c | 14 --
>  1 file changed, 8 insertions(+), 6 deletions(-)

Reviewed-by: Stefan Hajnoczi 



Re: [PATCH] virtio-blk: fixup coccinelle warnings

2021-10-21 Thread Stefan Hajnoczi
On Thu, Oct 21, 2021 at 12:08:23AM -0700, Joe Perches wrote:
> On Thu, 2021-10-21 at 06:51 +, cgel@gmail.com wrote:
> > From: Ye Guojin 
> > 
> > coccicheck complains about the use of snprintf() in sysfs show
> > functions:
> > WARNING  use scnprintf or sprintf
> > 
> > Use sysfs_emit instead of scnprintf or sprintf makes more sense.
> []
> > diff --git a/drivers/block/virtio_blk.c b/drivers/block/virtio_blk.c
> []
> > @@ -624,7 +624,7 @@ cache_type_show(struct device *dev, struct device_attribute *attr, char *buf)
> > -   return snprintf(buf, 40, "%s\n", virtblk_cache_types[writeback]);
> > +   return sysfs_emit(buf, "%s\n", virtblk_cache_types[writeback]);
> 
> Perhaps scripts/coccinelle/api/device_attr_show.cocci should be updated
> to be more like the script used in commit 1c7fd72687d6

This won't catch the case covered by the patch because it's
"snprintf(buf, 40" instead of "snprintf(buf, PAGE_SIZE", although
any size that's not PAGE_SIZE needs to be reviewed carefully in case the
intent of the statement is to truncate the output.

Stefan
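For comparison, the usual shape of a sysfs show function after the conversion looks like this simplified sketch: sysfs_emit() knows the buffer is a full page, so no hand-picked size like 40 is needed and accidental truncation cannot happen silently.

    static ssize_t cache_type_show(struct device *dev,
                                   struct device_attribute *attr, char *buf)
    {
            struct gendisk *disk = dev_to_disk(dev);
            struct virtio_blk *vblk = disk->private_data;
            u8 writeback = virtblk_get_cache_mode(vblk->vdev);

            return sysfs_emit(buf, "%s\n", virtblk_cache_types[writeback]);
    }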



Re: [PATCH] virtio-blk: fixup coccinelle warnings

2021-10-21 Thread Stefan Hajnoczi
On Thu, Oct 21, 2021 at 06:51:11AM +, cgel@gmail.com wrote:
> From: Ye Guojin 
> 
> coccicheck complains about the use of snprintf() in sysfs show
> functions:
> WARNING  use scnprintf or sprintf
> 
> Use sysfs_emit instead of scnprintf or sprintf makes more sense.
> 
> Reported-by: Zeal Robot 
> Signed-off-by: Ye Guojin 
> ---
>  drivers/block/virtio_blk.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)

Reviewed-by: Stefan Hajnoczi 



Re: [PATCH V3 01/10] virtio-blk: validate num_queues during probe

2021-10-20 Thread Stefan Hajnoczi
On Tue, Oct 19, 2021 at 03:01:43PM +0800, Jason Wang wrote:
> If an untrusted device negotiates BLK_F_MQ but advertises a zero
> num_queues, the driver may end up trying to allocate zero-size
> buffers, where ZERO_SIZE_PTR is returned, which may pass a check
> against NULL. This will lead to unexpected results.
> 
> Fixing this by failing the probe in this case.
> 
> Cc: Paolo Bonzini 
> Cc: Stefan Hajnoczi 
> Cc: Stefano Garzarella 
> Signed-off-by: Jason Wang 
> ---
>  drivers/block/virtio_blk.c | 4 
>  1 file changed, 4 insertions(+)

Reviewed-by: Stefan Hajnoczi 
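The check in init_vq() ends up looking roughly like this sketch (error message wording is illustrative): once BLK_F_MQ has been negotiated, a zero num_queues from the device fails the probe instead of leading to zero-sized virtqueue allocations.

    err = virtio_cread_feature(vdev, VIRTIO_BLK_F_MQ,
                               struct virtio_blk_config, num_queues,
                               &num_vqs);
    if (err)
            num_vqs = 1;

    if (!err && !num_vqs) {
            dev_err(&vdev->dev, "MQ advertised but zero queues reported\n");
            return -EINVAL;
    }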



Re: [PATCH RFC] virtio: wrap config->reset calls

2021-10-18 Thread Stefan Hajnoczi
On Wed, Oct 13, 2021 at 06:55:31AM -0400, Michael S. Tsirkin wrote:
> This will enable cleanups down the road.
> The idea is to disable cbs, then add "flush_queued_cbs" callback
> as a parameter, this way drivers can flush any work
> queued after callbacks have been disabled.
> 
> Signed-off-by: Michael S. Tsirkin 
> ---
>  arch/um/drivers/virt-pci.c | 2 +-
>  drivers/block/virtio_blk.c | 4 ++--
>  drivers/bluetooth/virtio_bt.c  | 2 +-
>  drivers/char/hw_random/virtio-rng.c| 2 +-
>  drivers/char/virtio_console.c  | 4 ++--
>  drivers/crypto/virtio/virtio_crypto_core.c | 8 
>  drivers/firmware/arm_scmi/virtio.c | 2 +-
>  drivers/gpio/gpio-virtio.c | 2 +-
>  drivers/gpu/drm/virtio/virtgpu_kms.c   | 2 +-
>  drivers/i2c/busses/i2c-virtio.c| 2 +-
>  drivers/iommu/virtio-iommu.c   | 2 +-
>  drivers/net/caif/caif_virtio.c | 2 +-
>  drivers/net/virtio_net.c   | 4 ++--
>  drivers/net/wireless/mac80211_hwsim.c  | 2 +-
>  drivers/nvdimm/virtio_pmem.c   | 2 +-
>  drivers/rpmsg/virtio_rpmsg_bus.c   | 2 +-
>  drivers/scsi/virtio_scsi.c | 2 +-
>  drivers/virtio/virtio.c| 5 +
>  drivers/virtio/virtio_balloon.c| 2 +-
>  drivers/virtio/virtio_input.c  | 2 +-
>  drivers/virtio/virtio_mem.c| 2 +-
>  fs/fuse/virtio_fs.c| 4 ++--
>  include/linux/virtio.h | 1 +
>  net/9p/trans_virtio.c  | 2 +-
>  net/vmw_vsock/virtio_transport.c   | 4 ++--
>  sound/virtio/virtio_card.c | 4 ++--
>  26 files changed, 39 insertions(+), 33 deletions(-)

Reviewed-by: Stefan Hajnoczi 
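The wrapper itself would be little more than the following sketch in drivers/virtio/virtio.c (the helper name here is assumed), with every vdev->config->reset(vdev) caller converted to use it; the value is in having one place to later disable and flush callbacks before the reset.

    void virtio_reset_device(struct virtio_device *dev)
    {
            /* future hook point: disable vq callbacks and flush queued work */
            dev->config->reset(dev);
    }
    EXPORT_SYMBOL_GPL(virtio_reset_device);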



Re: [PATCH 2/2] virtio-blk: set NUMA affinity for a tagset

2021-09-30 Thread Stefan Hajnoczi
On Wed, Sep 29, 2021 at 06:07:52PM +0300, Max Gurtovoy wrote:
> 
> On 9/28/2021 9:47 AM, Stefan Hajnoczi wrote:
> > On Mon, Sep 27, 2021 at 08:39:30PM +0300, Max Gurtovoy wrote:
> > > On 9/27/2021 11:09 AM, Stefan Hajnoczi wrote:
> > > > On Sun, Sep 26, 2021 at 05:55:18PM +0300, Max Gurtovoy wrote:
> > > > > To optimize performance, set the affinity of the block device tagset
> > > > > according to the virtio device affinity.
> > > > > 
> > > > > Signed-off-by: Max Gurtovoy 
> > > > > ---
> > > > >drivers/block/virtio_blk.c | 2 +-
> > > > >1 file changed, 1 insertion(+), 1 deletion(-)
> > > > > 
> > > > > diff --git a/drivers/block/virtio_blk.c b/drivers/block/virtio_blk.c
> > > > > index 9b3bd083b411..1c68c3e0ebf9 100644
> > > > > --- a/drivers/block/virtio_blk.c
> > > > > +++ b/drivers/block/virtio_blk.c
> > > > > @@ -774,7 +774,7 @@ static int virtblk_probe(struct virtio_device *vdev)
> > > > >   memset(&vblk->tag_set, 0, sizeof(vblk->tag_set));
> > > > >   vblk->tag_set.ops = &virtio_mq_ops;
> > > > >   vblk->tag_set.queue_depth = queue_depth;
> > > > > - vblk->tag_set.numa_node = NUMA_NO_NODE;
> > > > > + vblk->tag_set.numa_node = virtio_dev_to_node(vdev);
> > > > >   vblk->tag_set.flags = BLK_MQ_F_SHOULD_MERGE;
> > > > >   vblk->tag_set.cmd_size =
> > > > >   sizeof(struct virtblk_req) +
> > > > I implemented NUMA affinity in the past and could not demonstrate a
> > > > performance improvement:
> > > > https://lists.linuxfoundation.org/pipermail/virtualization/2020-June/048248.html
> > > > 
> > > > The pathological case is when a guest with vNUMA has the virtio-blk-pci
> > > > device on the "wrong" host NUMA node. Then memory accesses should cross
> > > > NUMA nodes. Still, it didn't seem to matter.
> > > I think the reason you didn't see any improvement is since you didn't use
> > > the right device for the node query. See my patch 1/2.
> > That doesn't seem to be the case. Please see
> > drivers/base/core.c:device_add():
> > 
> >/* use parent numa_node */
> >if (parent && (dev_to_node(dev) == NUMA_NO_NODE))
> >set_dev_node(dev, dev_to_node(parent));
> > 
> > IMO it's cleaner to use dev_to_node(&vdev->dev) than to directly access
> > the parent.
> > 
> > Have I missed something?
> 
> but dev_to_node(dev) is 0 IMO.
> 
> who set it to NUMA_NO_NODE ?

drivers/virtio/virtio.c:register_virtio_device():

  device_initialize(&dev->dev);

drivers/base/core.c:device_initialize():

  set_dev_node(dev, -1);

Stefan
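In other words, the two lookups discussed in this thread are roughly (sketch): patch 1/2's helper presumably reaches for the parent explicitly, while dev_to_node(&vdev->dev) relies on device_add() having copied the parent's node over the NUMA_NO_NODE set by device_initialize().

    /* explicit parent lookup */
    static inline int virtio_dev_to_node(struct virtio_device *vdev)
    {
            return dev_to_node(vdev->dev.parent);
    }

    /* relies on device_add() propagating the parent's node */
    static inline int virtio_dev_node(struct virtio_device *vdev)
    {
            return dev_to_node(&vdev->dev);
    }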



Re: [PATCH 2/2] virtio-blk: set NUMA affinity for a tagset

2021-09-28 Thread Stefan Hajnoczi
On Mon, Sep 27, 2021 at 08:39:30PM +0300, Max Gurtovoy wrote:
> 
> On 9/27/2021 11:09 AM, Stefan Hajnoczi wrote:
> > On Sun, Sep 26, 2021 at 05:55:18PM +0300, Max Gurtovoy wrote:
> > > To optimize performance, set the affinity of the block device tagset
> > > according to the virtio device affinity.
> > > 
> > > Signed-off-by: Max Gurtovoy 
> > > ---
> > >   drivers/block/virtio_blk.c | 2 +-
> > >   1 file changed, 1 insertion(+), 1 deletion(-)
> > > 
> > > diff --git a/drivers/block/virtio_blk.c b/drivers/block/virtio_blk.c
> > > index 9b3bd083b411..1c68c3e0ebf9 100644
> > > --- a/drivers/block/virtio_blk.c
> > > +++ b/drivers/block/virtio_blk.c
> > > @@ -774,7 +774,7 @@ static int virtblk_probe(struct virtio_device *vdev)
> > >   memset(&vblk->tag_set, 0, sizeof(vblk->tag_set));
> > >   vblk->tag_set.ops = &virtio_mq_ops;
> > >   vblk->tag_set.queue_depth = queue_depth;
> > > - vblk->tag_set.numa_node = NUMA_NO_NODE;
> > > + vblk->tag_set.numa_node = virtio_dev_to_node(vdev);
> > >   vblk->tag_set.flags = BLK_MQ_F_SHOULD_MERGE;
> > >   vblk->tag_set.cmd_size =
> > >   sizeof(struct virtblk_req) +
> > I implemented NUMA affinity in the past and could not demonstrate a
> > performance improvement:
> > https://lists.linuxfoundation.org/pipermail/virtualization/2020-June/048248.html
> > 
> > The pathological case is when a guest with vNUMA has the virtio-blk-pci
> > device on the "wrong" host NUMA node. Then memory accesses should cross
> > NUMA nodes. Still, it didn't seem to matter.
> 
> I think the reason you didn't see any improvement is since you didn't use
> the right device for the node query. See my patch 1/2.

That doesn't seem to be the case. Please see
drivers/base/core.c:device_add():

  /* use parent numa_node */
  if (parent && (dev_to_node(dev) == NUMA_NO_NODE))
  set_dev_node(dev, dev_to_node(parent));

IMO it's cleaner to use dev_to_node(&vdev->dev) than to directly access
the parent.

Have I missed something?

> 
> I can try integrating these patches in my series and fix it.
> 
> BTW, we might not see a big improvement because of other bottlenecks but
> this is known perf optimization we use often in block storage drivers.

Let's see benchmark results. Otherwise this is just dead code that adds
complexity.

Stefan



Re: [PATCH 2/2] virtio-blk: set NUMA affinity for a tagset

2021-09-27 Thread Stefan Hajnoczi
On Sun, Sep 26, 2021 at 05:55:18PM +0300, Max Gurtovoy wrote:
> To optimize performance, set the affinity of the block device tagset
> according to the virtio device affinity.
> 
> Signed-off-by: Max Gurtovoy 
> ---
>  drivers/block/virtio_blk.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/drivers/block/virtio_blk.c b/drivers/block/virtio_blk.c
> index 9b3bd083b411..1c68c3e0ebf9 100644
> --- a/drivers/block/virtio_blk.c
> +++ b/drivers/block/virtio_blk.c
> @@ -774,7 +774,7 @@ static int virtblk_probe(struct virtio_device *vdev)
>   memset(&vblk->tag_set, 0, sizeof(vblk->tag_set));
>   vblk->tag_set.ops = &virtio_mq_ops;
>   vblk->tag_set.queue_depth = queue_depth;
> - vblk->tag_set.numa_node = NUMA_NO_NODE;
> + vblk->tag_set.numa_node = virtio_dev_to_node(vdev);
>   vblk->tag_set.flags = BLK_MQ_F_SHOULD_MERGE;
>   vblk->tag_set.cmd_size =
>   sizeof(struct virtblk_req) +

I implemented NUMA affinity in the past and could not demonstrate a
performance improvement:
https://lists.linuxfoundation.org/pipermail/virtualization/2020-June/048248.html

The pathological case is when a guest with vNUMA has the virtio-blk-pci
device on the "wrong" host NUMA node. Then memory accesses should cross
NUMA nodes. Still, it didn't seem to matter.

Please share your benchmark results. If you haven't collected data yet
you could even combine our patches to see if it helps. Thanks!

Stefan



Re: [PATCH 1/2] virtio: introduce virtio_dev_to_node helper

2021-09-27 Thread Stefan Hajnoczi
On Sun, Sep 26, 2021 at 05:55:17PM +0300, Max Gurtovoy wrote:
> Also expose numa_node field as a sysfs attribute. Now virtio device
> drivers will be able to allocate memory that is node-local to the
> device. This significantly helps performance and it's often used in
> other drivers such as NVMe, for example.
> 
> Signed-off-by: Max Gurtovoy 
> ---
>  drivers/virtio/virtio.c | 10 ++
>  include/linux/virtio.h  | 13 +
>  2 files changed, 23 insertions(+)

Reviewed-by: Stefan Hajnoczi 



Re: [PATCH v3 1/1] virtio-blk: avoid preallocating big SGL for data

2021-09-14 Thread Stefan Hajnoczi
On Mon, Sep 13, 2021 at 05:50:21PM +0300, Max Gurtovoy wrote:
> 
> On 9/6/2021 6:09 PM, Stefan Hajnoczi wrote:
> > On Wed, Sep 01, 2021 at 04:14:34PM +0300, Max Gurtovoy wrote:
> > > No need to pre-allocate a big buffer for the IO SGL anymore. If a device
> > > has lots of deep queues, preallocation for the sg list can consume
> > > substantial amounts of memory. For HW virtio-blk device, nr_hw_queues
> > > can be 64 or 128 and each queue's depth might be 128. This means the
> > > resulting preallocation for the data SGLs is big.
> > > 
> > > Switch to runtime allocation for SGL for lists longer than 2 entries.
> > > This is the approach used by NVMe drivers so it should be reasonable for
> > > virtio block as well. Runtime SGL allocation has always been the case
> > > for the legacy I/O path so this is nothing new.
> > > 
> > > The preallocated small SGL depends on SG_CHAIN so if the ARCH doesn't
> > > support SG_CHAIN, use only runtime allocation for the SGL.
> > > 
> > > Re-organize the setup of the IO request to fit the new sg chain
> > > mechanism.
> > > 
> > > No performance degradation was seen (fio libaio engine with 16 jobs and
> > > 128 iodepth):
> > > 
> > > IO size   IOPs Rand Read (before/after)   IOPs Rand Write (before/after)
> > > -------   -----------------------------   ------------------------------
> > > 512B      318K/316K                       329K/325K
> > > 4KB       323K/321K                       353K/349K
> > > 16KB      199K/208K                       250K/275K
> > > 128KB     36K/36.1K                       39.2K/41.7K
> > I ran fio randread benchmarks with 4k, 16k, 64k, and 128k at iodepth 1,
> > 8, and 64 on two vCPUs. The results look fine, there is no significant
> > regression.
> > 
> > iodepth=1 and iodepth=64 are very consistent. For some reason the
> > iodepth=8 has significant variance but I don't think it's the fault of
> > this patch.
> > 
> > Fio results and the Jupyter notebook export are available here (check
> > out benchmark.html to see the graphs):
> > 
> > https://gitlab.com/stefanha/virt-playbooks/-/tree/virtio-blk-sgl-allocation-benchmark/notebook
> > 
> > Guest:
> > - Fedora 34
> > - Linux v5.14
> > - 2 vCPUs (pinned), 4 GB RAM (single host NUMA node)
> > - 1 IOThread (pinned)
> > - virtio-blk aio=native,cache=none,format=raw
> > - QEMU 6.1.0
> > 
> > Host:
> > - RHEL 8.3
> > - Linux 4.18.0-240.22.1.el8_3.x86_64
> > - Intel(R) Xeon(R) Silver 4214 CPU @ 2.20GHz
> > - Intel Optane DC P4800X
> > 
> > Stefan
> 
> Thanks, Stefan.
> 
> Would you like me to add some of the results in my commit msg ? or Tested-By
> sign ?

Thanks, there's no need to change the commit description.

Reviewed-by: Stefan Hajnoczi 
Tested-by: Stefan Hajnoczi 



Re: [PATCH 1/9] virtio-blk: validate num_queues during probe

2021-09-13 Thread Stefan Hajnoczi
On Mon, Sep 13, 2021 at 01:53:45PM +0800, Jason Wang wrote:
> If an untrusted device negotiates BLK_F_MQ but advertises a zero
> num_queues, the driver may end up trying to allocate zero-size
> buffers, where ZERO_SIZE_PTR is returned, which may pass a check
> against NULL. This will lead to unexpected results.
> 
> Fixing this by using single queue if num_queues is zero.
> 
> Cc: Paolo Bonzini 
> Cc: Stefan Hajnoczi 
> Signed-off-by: Jason Wang 
> ---
>  drivers/block/virtio_blk.c | 3 ++-
>  1 file changed, 2 insertions(+), 1 deletion(-)

Reviewed-by: Stefan Hajnoczi 



Re: [RFC PATCH 0/1] virtio: false unhandled irqs from vring_interrupt()

2021-09-06 Thread Stefan Hajnoczi
On Tue, Aug 24, 2021 at 07:31:29AM -0400, Michael S. Tsirkin wrote:
> On Tue, Aug 24, 2021 at 11:59:43AM +0100, Stefan Hajnoczi wrote:
> > While investigating an unhandled irq from vring_interrupt() with virtiofs I
> > stumbled onto a possible race that also affects virtio_gpu. This theory is
> > based on code inspection and hopefully you can point out something that 
> > makes
> > this a non-issue in practice :).
> > 
> > The vring_interrupt() function returns IRQ_NONE when an MSI-X interrupt is
> > taken with no used (completed) buffers in the virtqueue. The kernel disables
> > the irq and the driver no longer receives irqs when this happens:
> > 
> >   irq 77: nobody cared (try booting with the "irqpoll" option)
> >   ...
> >   handlers:
> >   [<a40a49bb>] vring_interrupt
> >   Disabling IRQ #77
> > 
> > Consider the following:
> > 
> > 1. An virtiofs irq is handled and the virtio_fs_requests_done_work() work
> >function is scheduled to dequeue used buffers:
> >vring_interrupt() -> virtio_fs_vq_done() -> schedule_work()
> > 
> > 2. The device adds more used requests and just before...
> > 
> > 3. ...virtio_fs_requests_done_work() empties the virtqueue with
> >virtqueue_get_buf().
> > 
> > 4. The device raises the irq and vring_interrupt() is called after
> >virtio_fs_requests_done_work emptied the virtqueue:
> > 
> >irqreturn_t vring_interrupt(int irq, void *_vq)
> >{
> >struct vring_virtqueue *vq = to_vvq(_vq);
> > 
> >if (!more_used(vq)) {
> >pr_debug("virtqueue interrupt with no work for %p\n", vq);
> >return IRQ_NONE;
> >
> > 
> > I have included a patch that switches virtiofs from spin_lock() to
> > spin_lock_irqsave() to prevent vring_interrupt() from running while the
> > virtqueue is processed from a work function.
> > 
> > virtio_gpu has a similar case where virtio_gpu_dequeue_ctrl_func() and
> > virtio_gpu_dequeue_cursor_func() work functions only use spin_lock().
> > I think this can result in the same false unhandled irq problem as virtiofs.
> > 
> > This race condition could in theory affect all drivers. The VIRTIO
> > specification says:
> > 
> >   Neither of these notification suppression methods are reliable, as they 
> > are
> >   not synchronized with the device, but they serve as useful optimizations.
> > 
> > If virtqueue_disable_cb() is just a hint and might not disable virtqueue 
> > irqs
> > then virtio_net and other drivers have a problem because an irq could be
> > raised while the driver is dequeuing used buffers. I think we haven't seen
> > this because software VIRTIO devices honor virtqueue_disable_cb(). Hardware
> > devices might cache the value and not disable notifications for some time...
> > 
> > Have I missed something?
> > 
> > The virtiofs patch I attached is being stress tested to see if the unhandled
> > irqs still occur.
> > 
> > Stefan Hajnoczi (1):
> >   fuse: disable local irqs when processing vq completions
> > 
> >  fs/fuse/virtio_fs.c | 15 ++-
> >  1 file changed, 10 insertions(+), 5 deletions(-)
> 
> Fundamentally it is not a problem to have an unhandled IRQ
> once in a while. It's only a problem if this happens time
> after time.
> 
> 
> Does the kernel you are testing include
> commit 8d622d21d24803408b256d96463eac4574dcf067
> Author: Michael S. Tsirkin 
> Date:   Tue Apr 13 01:19:16 2021 -0400
> 
> virtio: fix up virtio_disable_cb
> 
> ?
> 
> If not it's worth checking whether the latest kernel still
> has the issue.

A new kernel with your patch doesn't have this issue.

Please disregard the patch I posted, your patch seems to be enough.

Stefan



Re: [PATCH v3 1/1] virtio-blk: avoid preallocating big SGL for data

2021-09-06 Thread Stefan Hajnoczi
On Wed, Sep 01, 2021 at 04:14:34PM +0300, Max Gurtovoy wrote:
> No need to pre-allocate a big buffer for the IO SGL anymore. If a device
> has lots of deep queues, preallocation for the sg list can consume
> substantial amounts of memory. For HW virtio-blk device, nr_hw_queues
> can be 64 or 128 and each queue's depth might be 128. This means the
> resulting preallocation for the data SGLs is big.
> 
> Switch to runtime allocation for SGL for lists longer than 2 entries.
> This is the approach used by NVMe drivers so it should be reasonable for
> virtio block as well. Runtime SGL allocation has always been the case
> for the legacy I/O path so this is nothing new.
> 
> The preallocated small SGL depends on SG_CHAIN so if the ARCH doesn't
> support SG_CHAIN, use only runtime allocation for the SGL.
> 
> Re-organize the setup of the IO request to fit the new sg chain
> mechanism.
> 
> No performance degradation was seen (fio libaio engine with 16 jobs and
> 128 iodepth):
> 
> IO size   IOPs Rand Read (before/after)   IOPs Rand Write (before/after)
> -------   -----------------------------   ------------------------------
> 512B      318K/316K                       329K/325K
> 4KB       323K/321K                       353K/349K
> 16KB      199K/208K                       250K/275K
> 128KB     36K/36.1K                       39.2K/41.7K

I ran fio randread benchmarks with 4k, 16k, 64k, and 128k at iodepth 1,
8, and 64 on two vCPUs. The results look fine, there is no significant
regression.

iodepth=1 and iodepth=64 are very consistent. For some reason the
iodepth=8 has significant variance but I don't think it's the fault of
this patch.

Fio results and the Jupyter notebook export are available here (check
out benchmark.html to see the graphs):

https://gitlab.com/stefanha/virt-playbooks/-/tree/virtio-blk-sgl-allocation-benchmark/notebook

Guest:
- Fedora 34
- Linux v5.14
- 2 vCPUs (pinned), 4 GB RAM (single host NUMA node)
- 1 IOThread (pinned)
- virtio-blk aio=native,cache=none,format=raw
- QEMU 6.1.0

Host:
- RHEL 8.3
- Linux 4.18.0-240.22.1.el8_3.x86_64
- Intel(R) Xeon(R) Silver 4214 CPU @ 2.20GHz
- Intel Optane DC P4800X

Stefan



Re: [PATCH] MAINTAINERS: add VM SOCKETS (AF_VSOCK) entry

2021-09-06 Thread Stefan Hajnoczi
On Mon, Sep 06, 2021 at 11:11:59AM +0200, Stefano Garzarella wrote:
> Add a new entry for VM Sockets (AF_VSOCK) that covers vsock core,
> tests, and headers. Move some general vsock stuff from virtio-vsock
> entry into this new more general vsock entry.
> 
> I've been reviewing and contributing for the last few years,
> so I'm available to help maintain this code.
> 
> Cc: Dexuan Cui 
> Cc: Jorgen Hansen 
> Cc: Stefan Hajnoczi 
> Suggested-by: Michael S. Tsirkin 
> Signed-off-by: Stefano Garzarella 
> ---
> 
> Dexuan, Jorgen, Stefan, would you like to co-maintain or
> be added as a reviewer?
> 
> Thanks,
> Stefano
> ---
>  MAINTAINERS | 20 +---
>  1 file changed, 13 insertions(+), 7 deletions(-)

Reviewed-by: Stefan Hajnoczi 



Re: [PATCH] MAINTAINERS: add VM SOCKETS (AF_VSOCK) entry

2021-09-06 Thread Stefan Hajnoczi
On Mon, Sep 06, 2021 at 11:11:59AM +0200, Stefano Garzarella wrote:
> Add a new entry for VM Sockets (AF_VSOCK) that covers vsock core,
> tests, and headers. Move some general vsock stuff from virtio-vsock
> entry into this new more general vsock entry.
> 
> I've been reviewing and contributing for the last few years,
> so I'm available to help maintain this code.
> 
> Cc: Dexuan Cui 
> Cc: Jorgen Hansen 
> Cc: Stefan Hajnoczi 
> Suggested-by: Michael S. Tsirkin 
> Signed-off-by: Stefano Garzarella 
> ---
> 
> Dexuan, Jorgen, Stefan, would you like to co-maintain or
> be added as a reviewer?

Please skip me for now. I'm available if you take an extended period of
time off and other special situations but don't have enough time to play
an active role.

Thanks,
Stefan



Re: [PATCH 1/1] virtio-blk: remove unneeded "likely" statements

2021-09-02 Thread Stefan Hajnoczi
On Mon, Aug 30, 2021 at 03:01:11PM +0300, Max Gurtovoy wrote:
> Usually we use "likely/unlikely" to optimize the fast path. Remove
> redundant "likely" statements in the control path to ease on the code.
> 
> Signed-off-by: Max Gurtovoy 
> ---
>  drivers/block/virtio_blk.c | 4 ++--
>  1 file changed, 2 insertions(+), 2 deletions(-)

It would be nice to tweak the commit description before merging this. I
had trouble parsing the second sentence. If I understand correctly the
purpose of this patch is to make the code simpler and easier to read:

  s/ease on the code/simplify the code and make it easier to read/

Reviewed-by: Stefan Hajnoczi 



Re: [PATCH v5] virtio-vsock: add description for datagram type

2021-09-02 Thread Stefan Hajnoczi
On Thu, Jun 10, 2021 at 06:12:03PM +, Jiang Wang wrote:
> Add supports for datagram type for virtio-vsock. Datagram
> sockets are connectionless and unreliable. To avoid contention
> with stream and other sockets, add two more virtqueues and
> a new feature bit to identify if those two new queues exist or not.
> 
> Also add descriptions for resource management of datagram, which
> does not use the existing credit update mechanism associated with
> stream sockets.
> 
> Signed-off-by: Jiang Wang 
> ---

Overall this looks good. The tricky thing will be implementing dgram
sockets in a way that minimizes dropped packets and provides some degree
of fairness between senders. Those are implementation issues though and
not visible at the device specification level.

> diff --git a/virtio-vsock.tex b/virtio-vsock.tex
> index da7e641..26a62ac 100644
> --- a/virtio-vsock.tex
> +++ b/virtio-vsock.tex
> @@ -9,14 +9,37 @@ \subsection{Device ID}\label{sec:Device Types / Socket 
> Device / Device ID}
>  
>  \subsection{Virtqueues}\label{sec:Device Types / Socket Device / Virtqueues}
>  \begin{description}
> -\item[0] rx
> -\item[1] tx
> +\item[0] stream rx
> +\item[1] stream tx
> +\item[2] datagram rx
> +\item[3] datagram tx
> +\item[4] event
> +\end{description}
> +The virtio socket device uses 5 queues if feature bit VIRTIO_VSOCK_F_DRGAM 
> is set. Otherwise, it
> +only uses 3 queues, as the following.

s/as the following/as follows:/

> +
> +\begin{description}
> +\item[0] stream rx
> +\item[1] stream tx
>  \item[2] event
>  \end{description}
>  
> +When behavior differs between stream and datagram rx/tx virtqueues
> +their full names are used. Common behavior is simply described in
> +terms of rx/tx virtqueues and applies to both stream and datagram
> +virtqueues.
> +
>  \subsection{Feature bits}\label{sec:Device Types / Socket Device / Feature 
> bits}
>  
> -There are currently no feature bits defined for this device.
> +\begin{description}
> +\item[VIRTIO_VSOCK_F_STREAM (0)] Device has support for stream socket type.
> +\end{description}
> +
> +\begin{description}
> +\item[VIRTIO_VSOCK_F_DGRAM (2)] Device has support for datagram socket type.

Is this really bit 2 or did you mean bit 1 (value 0x2)?

What happens to the virtqueue layout when VIRTIO_VSOCK_F_DGRAM is
present and VIRTIO_VSOCK_F_STREAM is absent? The virtqueue section above
implies that VIRTIO_VSOCK_F_STREAM is always present.

> +\end{description}
> +
> +If no feature bits are defined, assume device only supports stream socket 
> type.

It's cleaner to define VIRTIO_VSOCK_F_NO_STREAM (0) instead. When the
bit is set the stream socket type is not available and the stream_rx/tx
virtqueues are absent.

This way it's not necessary to define special behavior depending on
certain combinations of feature bits.
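Either way, the driver side reduces to sizing its virtqueue set from the negotiated bits, e.g. (sketch using the proposed, not yet standardized constant):

    if (virtio_has_feature(vdev, VIRTIO_VSOCK_F_DGRAM))
            nvqs = 5;   /* stream rx, stream tx, dgram rx, dgram tx, event */
    else
            nvqs = 3;   /* stream rx, stream tx, event */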

>  \subsubsection{Receive and Transmit}\label{sec:Device Types / Socket Device 
> / Device Operation / Receive and Transmit}
> -The driver queues outgoing packets on the tx virtqueue and incoming packet
> +The driver queues outgoing packets on the tx virtqueue and allocates 
> incoming packet
>  receive buffers on the rx virtqueue. Packets are of the following form:

This change seems unrelated to dgram sockets. I don't think adding the
word "allocates" makes things clearer or more precise. The driver may
reuse receive buffers rather than allocating fresh buffers. I suggest
dropping this change.

>  
>  \begin{lstlisting}
> @@ -195,6 +235,7 @@ \subsubsection{Receive and Transmit}\label{sec:Device 
> Types / Socket Device / De
>  };
>  \end{lstlisting}
>  
> +
>  Virtqueue buffers for outgoing packets are read-only. Virtqueue buffers for
>  incoming packets are write-only.
>  

Unnecessary whitespace change. Please drop.



Re: [PATCH v2 1/1] virtio-blk: add num_io_queues module parameter

2021-09-02 Thread Stefan Hajnoczi
On Tue, Aug 31, 2021 at 04:50:35PM +0300, Max Gurtovoy wrote:
> Sometimes a user would like to control the amount of IO queues to be
> created for a block device. For example, for limiting the memory
> footprint of virtio-blk devices.
> 
> Signed-off-by: Max Gurtovoy 
> ---
> 
> changes from v1:
>  - use param_set_uint_minmax (from Christoph)
>  - added "Should > 0" to module description
> 
> Note: This commit apply on top of Jens's branch for-5.15/drivers
> ---
>  drivers/block/virtio_blk.c | 20 +++-
>  1 file changed, 19 insertions(+), 1 deletion(-)
> 
> diff --git a/drivers/block/virtio_blk.c b/drivers/block/virtio_blk.c
> index 4b49df2dfd23..9332fc4e9b31 100644
> --- a/drivers/block/virtio_blk.c
> +++ b/drivers/block/virtio_blk.c
> @@ -24,6 +24,22 @@
>  /* The maximum number of sg elements that fit into a virtqueue */
>  #define VIRTIO_BLK_MAX_SG_ELEMS 32768
>  
> +static int virtblk_queue_count_set(const char *val,
> + const struct kernel_param *kp)
> +{
> + return param_set_uint_minmax(val, kp, 1, nr_cpu_ids);
> +}
> +
> +static const struct kernel_param_ops queue_count_ops = {
> + .set = virtblk_queue_count_set,
> + .get = param_get_uint,
> +};
> +
> +static unsigned int num_io_queues;
> +module_param_cb(num_io_queues, &queue_count_ops, &num_io_queues, 0644);
> +MODULE_PARM_DESC(num_io_queues,
> +  "Number of IO virt queues to use for blk device. Should > 0");
> +
>  static int major;
>  static DEFINE_IDA(vd_index_ida);
>  
> @@ -501,7 +517,9 @@ static int init_vq(struct virtio_blk *vblk)
>   if (err)
>   num_vqs = 1;
>  
> - num_vqs = min_t(unsigned int, nr_cpu_ids, num_vqs);
> + num_vqs = min_t(unsigned int,
> + min_not_zero(num_io_queues, nr_cpu_ids),
> + num_vqs);

If you respin, please consider calling them request queues. That's the
terminology from the VIRTIO spec and it's nice to keep it consistent.
But the purpose of num_io_queues is clear, so:

Reviewed-by: Stefan Hajnoczi 
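A worked example of the quoted clamping logic (illustrative values):

    /* device offers num_vqs = 8, nr_cpu_ids = 4:
     *   num_io_queues = 0 (default) -> min(min_not_zero(0, 4), 8) = 4 request queues
     *   num_io_queues = 2           -> min(min_not_zero(2, 4), 8) = 2 request queues
     *   num_io_queues = 16          -> rejected up front by param_set_uint_minmax(1, nr_cpu_ids)
     */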



Re: [PATCH v3 1/1] virtio-blk: avoid preallocating big SGL for data

2021-09-02 Thread Stefan Hajnoczi
On Wed, Sep 01, 2021 at 04:14:34PM +0300, Max Gurtovoy wrote:
> No need to pre-allocate a big buffer for the IO SGL anymore. If a device
> has lots of deep queues, preallocation for the sg list can consume
> substantial amounts of memory. For HW virtio-blk device, nr_hw_queues
> can be 64 or 128 and each queue's depth might be 128. This means the
> resulting preallocation for the data SGLs is big.
> 
> Switch to runtime allocation for SGL for lists longer than 2 entries.
> This is the approach used by NVMe drivers so it should be reasonable for
> virtio block as well. Runtime SGL allocation has always been the case
> for the legacy I/O path so this is nothing new.
> 
> The preallocated small SGL depends on SG_CHAIN so if the ARCH doesn't
> support SG_CHAIN, use only runtime allocation for the SGL.
> 
> Re-organize the setup of the IO request to fit the new sg chain
> mechanism.
> 
> No performance degradation was seen (fio libaio engine with 16 jobs and
> 128 iodepth):
> 
> IO size   IOPs Rand Read (before/after)   IOPs Rand Write (before/after)
> -------   -----------------------------   ------------------------------
> 512B      318K/316K                       329K/325K
> 4KB       323K/321K                       353K/349K
> 16KB      199K/208K                       250K/275K
> 128KB     36K/36.1K                       39.2K/41.7K
> 
> Signed-off-by: Max Gurtovoy 
> Reviewed-by: Israel Rukshin 
> ---
> 
> changes from V2:
>  - initialize vbr->out_hdr.sector during virtblk_setup_cmd
> 
> changes from V1:
>  - Kconfig update (from Christoph)
>  - Re-order cmd setup (from Christoph)
>  - use flexible sg pointer in the cmd (from Christoph)
>  - added perf numbers to commit msg (from Feng Li)
> 
> ---
>  drivers/block/Kconfig  |   1 +
>  drivers/block/virtio_blk.c | 155 +++--
>  2 files changed, 100 insertions(+), 56 deletions(-)

Hi Max,
I can run benchmark to give everyone more confidence about this change.
Should I test this version or are you still planning to make code
changes?

Stefan



Re: [RFC PATCH 0/1] virtio: false unhandled irqs from vring_interrupt()

2021-08-26 Thread Stefan Hajnoczi
On Tue, Aug 24, 2021 at 07:31:29AM -0400, Michael S. Tsirkin wrote:
> On Tue, Aug 24, 2021 at 11:59:43AM +0100, Stefan Hajnoczi wrote:
> > While investigating an unhandled irq from vring_interrupt() with virtiofs I
> > stumbled onto a possible race that also affects virtio_gpu. This theory is
> > based on code inspection and hopefully you can point out something that 
> > makes
> > this a non-issue in practice :).
> > 
> > The vring_interrupt() function returns IRQ_NONE when an MSI-X interrupt is
> > taken with no used (completed) buffers in the virtqueue. The kernel disables
> > the irq and the driver no longer receives irqs when this happens:
> > 
> >   irq 77: nobody cared (try booting with the "irqpoll" option)
> >   ...
> >   handlers:
> >   [<a40a49bb>] vring_interrupt
> >   Disabling IRQ #77
> > 
> > Consider the following:
> > 
> > 1. An virtiofs irq is handled and the virtio_fs_requests_done_work() work
> >function is scheduled to dequeue used buffers:
> >vring_interrupt() -> virtio_fs_vq_done() -> schedule_work()
> > 
> > 2. The device adds more used requests and just before...
> > 
> > 3. ...virtio_fs_requests_done_work() empties the virtqueue with
> >virtqueue_get_buf().
> > 
> > 4. The device raises the irq and vring_interrupt() is called after
> >virtio_fs_requests_done_work emptied the virtqueue:
> > 
> >irqreturn_t vring_interrupt(int irq, void *_vq)
> >{
> >struct vring_virtqueue *vq = to_vvq(_vq);
> > 
> >if (!more_used(vq)) {
> >pr_debug("virtqueue interrupt with no work for %p\n", vq);
> >return IRQ_NONE;
> >
> > 
> > I have included a patch that switches virtiofs from spin_lock() to
> > spin_lock_irqsave() to prevent vring_interrupt() from running while the
> > virtqueue is processed from a work function.
> > 
> > virtio_gpu has a similar case where virtio_gpu_dequeue_ctrl_func() and
> > virtio_gpu_dequeue_cursor_func() work functions only use spin_lock().
> > I think this can result in the same false unhandled irq problem as virtiofs.
> > 
> > This race condition could in theory affect all drivers. The VIRTIO
> > specification says:
> > 
> >   Neither of these notification suppression methods are reliable, as they 
> > are
> >   not synchronized with the device, but they serve as useful optimizations.
> > 
> > If virtqueue_disable_cb() is just a hint and might not disable virtqueue 
> > irqs
> > then virtio_net and other drivers have a problem because an irq could be
> > raised while the driver is dequeuing used buffers. I think we haven't seen
> > this because software VIRTIO devices honor virtqueue_disable_cb(). Hardware
> > devices might cache the value and not disable notifications for some time...
> > 
> > Have I missed something?
> > 
> > The virtiofs patch I attached is being stress tested to see if the unhandled
> > irqs still occur.
> > 
> > Stefan Hajnoczi (1):
> >   fuse: disable local irqs when processing vq completions
> > 
> >  fs/fuse/virtio_fs.c | 15 ++-
> >  1 file changed, 10 insertions(+), 5 deletions(-)
> 
> Fundamentally it is not a problem to have an unhandled IRQ
> once in a while. It's only a problem if this happens time
> after time.

That virtiofs patch in this series passed the stress test. We must have
been getting too many unhandled interrupts due to processing the
virtqueue from a work function that's not synchronized with
vring_interrupt() irq handler.

> Does the kernel you are testing include
> commit 8d622d21d24803408b256d96463eac4574dcf067
> Author: Michael S. Tsirkin 
> Date:   Tue Apr 13 01:19:16 2021 -0400
> 
> virtio: fix up virtio_disable_cb
> 
> ?
> 
> If not it's worth checking whether the latest kernel still
> has the issue.
> 
> Note cherry picking that commit isn't trivial there are
> a bunch of dependencies.

We'll test a recent kernel with just your patch to see if it solves the
issue.

Stefan



Re: [RFC PATCH 0/1] virtio: false unhandled irqs from vring_interrupt()

2021-08-24 Thread Stefan Hajnoczi
On Tue, Aug 24, 2021 at 07:31:29AM -0400, Michael S. Tsirkin wrote:
> On Tue, Aug 24, 2021 at 11:59:43AM +0100, Stefan Hajnoczi wrote:
> > While investigating an unhandled irq from vring_interrupt() with virtiofs I
> > stumbled onto a possible race that also affects virtio_gpu. This theory is
> > based on code inspection and hopefully you can point out something that 
> > makes
> > this a non-issue in practice :).
> > 
> > The vring_interrupt() function returns IRQ_NONE when an MSI-X interrupt is
> > taken with no used (completed) buffers in the virtqueue. The kernel disables
> > the irq and the driver no longer receives irqs when this happens:
> > 
> >   irq 77: nobody cared (try booting with the "irqpoll" option)
> >   ...
> >   handlers:
> >   [<a40a49bb>] vring_interrupt
> >   Disabling IRQ #77
> > 
> > Consider the following:
> > 
> > 1. An virtiofs irq is handled and the virtio_fs_requests_done_work() work
> >function is scheduled to dequeue used buffers:
> >vring_interrupt() -> virtio_fs_vq_done() -> schedule_work()
> > 
> > 2. The device adds more used requests and just before...
> > 
> > 3. ...virtio_fs_requests_done_work() empties the virtqueue with
> >virtqueue_get_buf().
> > 
> > 4. The device raises the irq and vring_interrupt() is called after
> >virtio_fs_requests_done_work emptied the virtqueue:
> > 
> >irqreturn_t vring_interrupt(int irq, void *_vq)
> >{
> >struct vring_virtqueue *vq = to_vvq(_vq);
> > 
> >if (!more_used(vq)) {
> >pr_debug("virtqueue interrupt with no work for %p\n", vq);
> >return IRQ_NONE;
> >
> > 
> > I have included a patch that switches virtiofs from spin_lock() to
> > spin_lock_irqsave() to prevent vring_interrupt() from running while the
> > virtqueue is processed from a work function.
> > 
> > virtio_gpu has a similar case where virtio_gpu_dequeue_ctrl_func() and
> > virtio_gpu_dequeue_cursor_func() work functions only use spin_lock().
> > I think this can result in the same false unhandled irq problem as virtiofs.
> > 
> > This race condition could in theory affect all drivers. The VIRTIO
> > specification says:
> > 
> >   Neither of these notification suppression methods are reliable, as they 
> > are
> >   not synchronized with the device, but they serve as useful optimizations.
> > 
> > If virtqueue_disable_cb() is just a hint and might not disable virtqueue 
> > irqs
> > then virtio_net and other drivers have a problem because an irq 
> > could
> > be raised while the driver is dequeuing used buffers. I think we haven't 
> > seen
> > this because software VIRTIO devices honor virtqueue_disable_cb(). Hardware
> > devices might cache the value and not disable notifications for some time...
> > 
> > Have I missed something?
> > 
> > The virtiofs patch I attached is being stress tested to see if the unhandled
> > irqs still occur.
> > 
> > Stefan Hajnoczi (1):
> >   fuse: disable local irqs when processing vq completions
> > 
> >  fs/fuse/virtio_fs.c | 15 ++-
> >  1 file changed, 10 insertions(+), 5 deletions(-)
> 
> Fundamentally it is not a problem to have an unhandled IRQ
> once in a while. It's only a problem if this happens time
> after time.
> 
> 
> Does the kernel you are testing include
> commit 8d622d21d24803408b256d96463eac4574dcf067
> Author: Michael S. Tsirkin 
> Date:   Tue Apr 13 01:19:16 2021 -0400
> 
> virtio: fix up virtio_disable_cb
> 
> ?
> 
> If not it's worth checking whether the latest kernel still
> has the issue.
> 
> Note cherry picking that commit isn't trivial there are
> a bunch of dependencies.
> 
> If you want to try on an old kernel, disable event index.

Thanks for pointing out this commit. My kernel does not have it. The
device is using the split vring layout (probably with EVENT_IDX) so
virtqueue_disable_cb() has no effect.

kernel/irq/spurious.c:note_interrupt() disables the irq after 99.9k
unhandled irqs. It's surprising that virtiofs is hitting this limit
through the scenario I described, but maybe it is.
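
For reference, the accounting works roughly like this (a simplified,
illustrative sketch -- the real logic lives in kernel/irq/spurious.c and
the names below are made up, but the 100k/99.9k thresholds match the
behaviour described above):

    struct irq_stats {
            unsigned int irq;
            unsigned int count;      /* interrupts seen in this window */
            unsigned int unhandled;  /* of those, how many were IRQ_NONE */
            bool disabled;
    };

    static void note_interrupt_sketch(struct irq_stats *s, bool handled)
    {
            s->count++;
            if (!handled)
                    s->unhandled++;

            if (s->count < 100000)
                    return;

            /* every 100k interrupts: give up if ~99.9k of them went unhandled */
            if (s->unhandled > 99900) {
                    pr_err("irq %u: nobody cared\n", s->irq);
                    s->disabled = true;     /* "Disabling IRQ #n" */
            }
            s->count = 0;
            s->unhandled = 0;
    }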

I'll try different kernel versions and/or disable EVENT_IDX as you
suggested. Thanks!

Stefan



[RFC PATCH 1/1] fuse: disable local irqs when processing vq completions

2021-08-24 Thread Stefan Hajnoczi
The virtqueue completion handler function runs on a work queue and local
irqs are still enabled. There is a race where the completion handler
function grabs the next completed request just before vring_interrupt()
runs. vring_interrupt() sees an empty virtqueue and returns IRQ_NONE,
falsely declaring this interrupt unhandled.

The unhandled irq causes the kernel to disable the irq:

  irq 77: nobody cared (try booting with the "irqpoll" option)
  ...
  handlers:
  [<d33eeed7>] vring_interrupt
  Disabling IRQ #77

The driver hangs afterwards since virtqueue irqs are now ignored.

Disable local irqs before calling virtqueue_get_buf() and re-enable them
afterwards so that vring_interrupt() doesn't run during the race window.

Reported-by: Xiaoling Gao 
Cc: Michael Tsirkin 
Cc: Jason Wang 
Signed-off-by: Stefan Hajnoczi 
---
I'm not 100% convinced this fixes everything because vring_interrupt()
can still run after our critical section and find the virtqueue empty.
virtqueue_disable_cb() should minimize that but it's only a hint and
there is a small window when the race condition can happen before it's
called.
---
 fs/fuse/virtio_fs.c | 15 ++-
 1 file changed, 10 insertions(+), 5 deletions(-)

diff --git a/fs/fuse/virtio_fs.c b/fs/fuse/virtio_fs.c
index 8f52cdaa8445..57e1f264b0a8 100644
--- a/fs/fuse/virtio_fs.c
+++ b/fs/fuse/virtio_fs.c
@@ -319,9 +319,10 @@ static void virtio_fs_hiprio_done_work(struct work_struct 
*work)
struct virtio_fs_vq *fsvq = container_of(work, struct virtio_fs_vq,
 done_work);
struct virtqueue *vq = fsvq->vq;
+   unsigned long flags;
 
/* Free completed FUSE_FORGET requests */
-   spin_lock(&fsvq->lock);
+   spin_lock_irqsave(&fsvq->lock, flags);
do {
unsigned int len;
void *req;
@@ -333,7 +334,7 @@ static void virtio_fs_hiprio_done_work(struct work_struct 
*work)
dec_in_flight_req(fsvq);
}
} while (!virtqueue_enable_cb(vq) && likely(!virtqueue_is_broken(vq)));
-   spin_unlock(&fsvq->lock);
+   spin_unlock_irqrestore(&fsvq->lock, flags);
 }
 
 static void virtio_fs_request_dispatch_work(struct work_struct *work)
@@ -601,11 +602,15 @@ static void virtio_fs_requests_done_work(struct 
work_struct *work)
struct virtqueue *vq = fsvq->vq;
struct fuse_req *req;
struct fuse_req *next;
+   unsigned long flags;
unsigned int len;
LIST_HEAD(reqs);
 
-   /* Collect completed requests off the virtqueue */
-   spin_lock(&fsvq->lock);
+   /*
+* Collect completed requests off the virtqueue with irqs disabled to
+* prevent races with vring_interrupt().
+*/
+   spin_lock_irqsave(&fsvq->lock, flags);
do {
virtqueue_disable_cb(vq);
 
@@ -615,7 +620,7 @@ static void virtio_fs_requests_done_work(struct work_struct 
*work)
spin_unlock(&fsvq->lock);
}
} while (!virtqueue_enable_cb(vq) && likely(!virtqueue_is_broken(vq)));
-   spin_unlock(&fsvq->lock);
+   spin_unlock_irqrestore(&fsvq->lock, flags);
 
/* End requests */
list_for_each_entry_safe(req, next, &reqs, list) {
-- 
2.31.1



[RFC PATCH 0/1] virtio: false unhandled irqs from vring_interrupt()

2021-08-24 Thread Stefan Hajnoczi
While investigating an unhandled irq from vring_interrupt() with virtiofs I
stumbled onto a possible race that also affects virtio_gpu. This theory is
based on code inspection and hopefully you can point out something that makes
this a non-issue in practice :).

The vring_interrupt() function returns IRQ_NONE when an MSI-X interrupt is
taken with no used (completed) buffers in the virtqueue. The kernel disables
the irq and the driver no longer receives irqs when this happens:

  irq 77: nobody cared (try booting with the "irqpoll" option)
  ...
  handlers:
  [<a40a49bb>] vring_interrupt
  Disabling IRQ #77

Consider the following:

1. A virtiofs irq is handled and the virtio_fs_requests_done_work() work
   function is scheduled to dequeue used buffers:
   vring_interrupt() -> virtio_fs_vq_done() -> schedule_work()

2. The device adds more used requests and just before...

3. ...virtio_fs_requests_done_work() empties the virtqueue with
   virtqueue_get_buf().

4. The device raises the irq and vring_interrupt() is called after
   virtio_fs_requests_done_work emptied the virtqueue:

   irqreturn_t vring_interrupt(int irq, void *_vq)
   {
   struct vring_virtqueue *vq = to_vvq(_vq);

   if (!more_used(vq)) {
   pr_debug("virtqueue interrupt with no work for %p\n", vq);
   return IRQ_NONE;
   

I have included a patch that switches virtiofs from spin_lock() to
spin_lock_irqsave() to prevent vring_interrupt() from running while the
virtqueue is processed from a work function.

virtio_gpu has a similar case where virtio_gpu_dequeue_ctrl_func() and
virtio_gpu_dequeue_cursor_func() work functions only use spin_lock().
I think this can result in the same false unhandled irq problem as virtiofs.

This race condition could in theory affect all drivers. The VIRTIO
specification says:

  Neither of these notification suppression methods are reliable, as they are
  not synchronized with the device, but they serve as useful optimizations.

If virtqueue_disable_cb() is just a hint and might not disable virtqueue irqs
then virtio_net and other drivers have a problem because an irq could
be raised while the driver is dequeuing used buffers. I think we haven't seen
this because software VIRTIO devices honor virtqueue_disable_cb(). Hardware
devices might cache the value and not disable notifications for some time...

Have I missed something?

The virtiofs patch I attached is being stress tested to see if the unhandled
irqs still occur.

Stefan Hajnoczi (1):
  fuse: disable local irqs when processing vq completions

 fs/fuse/virtio_fs.c | 15 ++-
 1 file changed, 10 insertions(+), 5 deletions(-)

-- 
2.31.1




Re: [PATCH] vhost scsi: Convert to SPDX identifier

2021-08-23 Thread Stefan Hajnoczi
On Sat, Aug 21, 2021 at 08:33:20PM +0800, Cai Huoqing wrote:
> use SPDX-License-Identifier instead of a verbose license text
> 
> Signed-off-by: Cai Huoqing 
> ---
>  drivers/vhost/scsi.c | 14 +-
>  1 file changed, 1 insertion(+), 13 deletions(-)

I have CCed Nic.

Reviewed-by: Stefan Hajnoczi 

> diff --git a/drivers/vhost/scsi.c b/drivers/vhost/scsi.c
> index 46f897e41217..532e204f2b1b 100644
> --- a/drivers/vhost/scsi.c
> +++ b/drivers/vhost/scsi.c
> @@ -1,24 +1,12 @@
> +// SPDX-License-Identifier: GPL-2.0+
>  
> /***
>   * Vhost kernel TCM fabric driver for virtio SCSI initiators
>   *
>   * (C) Copyright 2010-2013 Datera, Inc.
>   * (C) Copyright 2010-2012 IBM Corp.
>   *
> - * Licensed to the Linux Foundation under the General Public License (GPL) 
> version 2.
> - *
>   * Authors: Nicholas A. Bellinger 
>   *  Stefan Hajnoczi 
> - *
> - * This program is free software; you can redistribute it and/or modify
> - * it under the terms of the GNU General Public License as published by
> - * the Free Software Foundation; either version 2 of the License, or
> - * (at your option) any later version.
> - *
> - * This program is distributed in the hope that it will be useful,
> - * but WITHOUT ANY WARRANTY; without even the implied warranty of
> - * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
> - * GNU General Public License for more details.
> - *
>   
> /
>  
>  #include 
> -- 
> 2.25.1
> 



Re: [PATCH 08/15] virtio_blk: use bvec_virt

2021-08-05 Thread Stefan Hajnoczi
On Wed, Aug 04, 2021 at 11:56:27AM +0200, Christoph Hellwig wrote:
> Use bvec_virt instead of open coding it.
> 
> Signed-off-by: Christoph Hellwig 
> ---
>  drivers/block/virtio_blk.c | 7 ++-
>  1 file changed, 2 insertions(+), 5 deletions(-)

Reviewed-by: Stefan Hajnoczi 



Re: [RFC 0/3] cpuidle: add poll_source API and virtio vq polling

2021-07-27 Thread Stefan Hajnoczi
On Mon, Jul 26, 2021 at 06:37:13PM +0200, Rafael J. Wysocki wrote:
> On Mon, Jul 26, 2021 at 6:04 PM Stefan Hajnoczi  wrote:
> >
> > On Mon, Jul 26, 2021 at 05:47:19PM +0200, Rafael J. Wysocki wrote:
> > > On Mon, Jul 26, 2021 at 5:17 PM Stefan Hajnoczi  
> > > wrote:
> > > >
> > > > On Thu, Jul 22, 2021 at 05:04:57PM +0800, Jason Wang wrote:
> > > > >
> > > > > 在 2021/7/21 下午5:41, Stefan Hajnoczi 写道:
> > > > > > On Wed, Jul 21, 2021 at 11:29:55AM +0800, Jason Wang wrote:
> > > > > > > 在 2021/7/14 上午12:19, Stefan Hajnoczi 写道:
> > > > > > > > These patches are not polished yet but I would like request 
> > > > > > > > feedback on this
> > > > > > > > approach and share performance results with you.
> > > > > > > >
> > > > > > > > Idle CPUs tentatively enter a busy wait loop before halting 
> > > > > > > > when the cpuidle
> > > > > > > > haltpoll driver is enabled inside a virtual machine. This 
> > > > > > > > reduces wakeup
> > > > > > > > latency for events that occur soon after the vCPU becomes idle.
> > > > > > > >
> > > > > > > > This patch series extends the cpuidle busy wait loop with the 
> > > > > > > > new poll_source
> > > > > > > > API so drivers can participate in polling. Such polling-aware 
> > > > > > > > drivers disable
> > > > > > > > their device's irq during the busy wait loop to avoid the cost 
> > > > > > > > of interrupts.
> > > > > > > > This reduces latency further than regular cpuidle haltpoll, 
> > > > > > > > which still relies
> > > > > > > > on irqs.
> > > > > > > >
> > > > > > > > Virtio drivers are modified to use the poll_source API so all 
> > > > > > > > virtio device
> > > > > > > > types get this feature. The following virtio-blk fio benchmark 
> > > > > > > > results show the
> > > > > > > > improvement:
> > > > > > > >
> > > > > > > >IOPS (numjobs=4, iodepth=1, 4 virtqueues)
> > > > > > > >  before   poll_source  io_poll
> > > > > > > > 4k randread167102  186049 (+11%)  186654 (+11%)
> > > > > > > > 4k randwrite   162204  181214 (+11%)  181850 (+12%)
> > > > > > > > 4k randrw  159520  177071 (+11%)  177928 (+11%)
> > > > > > > >
> > > > > > > > The comparison against io_poll shows that cpuidle poll_source 
> > > > > > > > achieves
> > > > > > > > equivalent performance to the block layer's io_poll feature 
> > > > > > > > (which I
> > > > > > > > implemented in a separate patch series [1]).
> > > > > > > >
> > > > > > > > The advantage of poll_source is that applications do not need 
> > > > > > > > to explicitly set
> > > > > > > > the RWF_HIPRI I/O request flag. The poll_source approach is 
> > > > > > > > attractive because
> > > > > > > > few applications actually use RWF_HIPRI and it takes advantage 
> > > > > > > > of CPU cycles we
> > > > > > > > would have spent in cpuidle haltpoll anyway.
> > > > > > > >
> > > > > > > > The current series does not improve virtio-net. I haven't 
> > > > > > > > investigated deeply,
> > > > > > > > but it is possible that NAPI and poll_source do not combine. 
> > > > > > > > See the final
> > > > > > > > patch for a starting point on making the two work together.
> > > > > > > >
> > > > > > > > I have not tried this on bare metal but it might help there 
> > > > > > > > too. The cost of
> > > > > > > > disabling a device's irq must be less than the savings from 
> > > > > > > > avoiding irq
> > > > > > > > handling for this optimization to make sense.
> > > > > > > >
> > > > > &

Re: [RFC 0/3] cpuidle: add poll_source API and virtio vq polling

2021-07-26 Thread Stefan Hajnoczi
On Mon, Jul 26, 2021 at 05:47:19PM +0200, Rafael J. Wysocki wrote:
> On Mon, Jul 26, 2021 at 5:17 PM Stefan Hajnoczi  wrote:
> >
> > On Thu, Jul 22, 2021 at 05:04:57PM +0800, Jason Wang wrote:
> > >
> > > 在 2021/7/21 下午5:41, Stefan Hajnoczi 写道:
> > > > On Wed, Jul 21, 2021 at 11:29:55AM +0800, Jason Wang wrote:
> > > > > 在 2021/7/14 上午12:19, Stefan Hajnoczi 写道:
> > > > > > These patches are not polished yet but I would like request 
> > > > > > feedback on this
> > > > > > approach and share performance results with you.
> > > > > >
> > > > > > Idle CPUs tentatively enter a busy wait loop before halting when 
> > > > > > the cpuidle
> > > > > > haltpoll driver is enabled inside a virtual machine. This reduces 
> > > > > > wakeup
> > > > > > latency for events that occur soon after the vCPU becomes idle.
> > > > > >
> > > > > > This patch series extends the cpuidle busy wait loop with the new 
> > > > > > poll_source
> > > > > > API so drivers can participate in polling. Such polling-aware 
> > > > > > drivers disable
> > > > > > their device's irq during the busy wait loop to avoid the cost of 
> > > > > > interrupts.
> > > > > > This reduces latency further than regular cpuidle haltpoll, which 
> > > > > > still relies
> > > > > > on irqs.
> > > > > >
> > > > > > Virtio drivers are modified to use the poll_source API so all 
> > > > > > virtio device
> > > > > > types get this feature. The following virtio-blk fio benchmark 
> > > > > > results show the
> > > > > > improvement:
> > > > > >
> > > > > >IOPS (numjobs=4, iodepth=1, 4 virtqueues)
> > > > > >  before   poll_source  io_poll
> > > > > > 4k randread167102  186049 (+11%)  186654 (+11%)
> > > > > > 4k randwrite   162204  181214 (+11%)  181850 (+12%)
> > > > > > 4k randrw  159520  177071 (+11%)  177928 (+11%)
> > > > > >
> > > > > > The comparison against io_poll shows that cpuidle poll_source 
> > > > > > achieves
> > > > > > equivalent performance to the block layer's io_poll feature (which I
> > > > > > implemented in a separate patch series [1]).
> > > > > >
> > > > > > The advantage of poll_source is that applications do not need to 
> > > > > > explicitly set
> > > > > > the RWF_HIPRI I/O request flag. The poll_source approach is 
> > > > > > attractive because
> > > > > > few applications actually use RWF_HIPRI and it takes advantage of 
> > > > > > CPU cycles we
> > > > > > would have spent in cpuidle haltpoll anyway.
> > > > > >
> > > > > > The current series does not improve virtio-net. I haven't 
> > > > > > investigated deeply,
> > > > > > but it is possible that NAPI and poll_source do not combine. See 
> > > > > > the final
> > > > > > patch for a starting point on making the two work together.
> > > > > >
> > > > > > I have not tried this on bare metal but it might help there too. 
> > > > > > The cost of
> > > > > > disabling a device's irq must be less than the savings from 
> > > > > > avoiding irq
> > > > > > handling for this optimization to make sense.
> > > > > >
> > > > > > [1] 
> > > > > > https://lore.kernel.org/linux-block/20210520141305.355961-1-stefa...@redhat.com/
> > > > >
> > > > > Hi Stefan:
> > > > >
> > > > > Some questions:
> > > > >
> > > > > 1) What's the advantages of introducing polling at virtio level 
> > > > > instead of
> > > > > doing it at each subsystems? Polling in virtio level may only work 
> > > > > well if
> > > > > all (or most) of the devices are virtio
> > > > I'm not sure I understand the question. cpuidle haltpoll benefits all
> > > > devices today, except it incurs interrupt latency. The poll_source API
> > > > eliminates the interrupt latency for drivers t

Re: [RFC 0/3] cpuidle: add poll_source API and virtio vq polling

2021-07-26 Thread Stefan Hajnoczi
On Thu, Jul 22, 2021 at 05:04:57PM +0800, Jason Wang wrote:
> 
> 在 2021/7/21 下午5:41, Stefan Hajnoczi 写道:
> > On Wed, Jul 21, 2021 at 11:29:55AM +0800, Jason Wang wrote:
> > > 在 2021/7/14 上午12:19, Stefan Hajnoczi 写道:
> > > > These patches are not polished yet but I would like request feedback on 
> > > > this
> > > > approach and share performance results with you.
> > > > 
> > > > Idle CPUs tentatively enter a busy wait loop before halting when the 
> > > > cpuidle
> > > > haltpoll driver is enabled inside a virtual machine. This reduces wakeup
> > > > latency for events that occur soon after the vCPU becomes idle.
> > > > 
> > > > This patch series extends the cpuidle busy wait loop with the new 
> > > > poll_source
> > > > API so drivers can participate in polling. Such polling-aware drivers 
> > > > disable
> > > > their device's irq during the busy wait loop to avoid the cost of 
> > > > interrupts.
> > > > This reduces latency further than regular cpuidle haltpoll, which still 
> > > > relies
> > > > on irqs.
> > > > 
> > > > Virtio drivers are modified to use the poll_source API so all virtio 
> > > > device
> > > > types get this feature. The following virtio-blk fio benchmark results 
> > > > show the
> > > > improvement:
> > > > 
> > > >IOPS (numjobs=4, iodepth=1, 4 virtqueues)
> > > >  before   poll_source  io_poll
> > > > 4k randread167102  186049 (+11%)  186654 (+11%)
> > > > 4k randwrite   162204  181214 (+11%)  181850 (+12%)
> > > > 4k randrw  159520  177071 (+11%)  177928 (+11%)
> > > > 
> > > > The comparison against io_poll shows that cpuidle poll_source achieves
> > > > equivalent performance to the block layer's io_poll feature (which I
> > > > implemented in a separate patch series [1]).
> > > > 
> > > > The advantage of poll_source is that applications do not need to 
> > > > explicitly set
> > > > the RWF_HIPRI I/O request flag. The poll_source approach is attractive 
> > > > because
> > > > few applications actually use RWF_HIPRI and it takes advantage of CPU 
> > > > cycles we
> > > > would have spent in cpuidle haltpoll anyway.
> > > > 
> > > > The current series does not improve virtio-net. I haven't investigated 
> > > > deeply,
> > > > but it is possible that NAPI and poll_source do not combine. See the 
> > > > final
> > > > patch for a starting point on making the two work together.
> > > > 
> > > > I have not tried this on bare metal but it might help there too. The 
> > > > cost of
> > > > disabling a device's irq must be less than the savings from avoiding irq
> > > > handling for this optimization to make sense.
> > > > 
> > > > [1] 
> > > > https://lore.kernel.org/linux-block/20210520141305.355961-1-stefa...@redhat.com/
> > > 
> > > Hi Stefan:
> > > 
> > > Some questions:
> > > 
> > > 1) What's the advantages of introducing polling at virtio level instead of
> > > doing it at each subsystems? Polling in virtio level may only work well if
> > > all (or most) of the devices are virtio
> > I'm not sure I understand the question. cpuidle haltpoll benefits all
> > devices today, except it incurs interrupt latency. The poll_source API
> > eliminates the interrupt latency for drivers that can disable device
> > interrupts cheaply.
> > 
> > This patch adds poll_source to core virtio code so that all virtio
> > drivers get this feature for free. No driver-specific changes are
> > needed.
> > 
> > If you mean networking, block layer, etc by "subsystems" then there's
> > nothing those subsystems can do to help. Whether poll_source can be used
> > depends on the specific driver, not the subsystem. If you consider
> > drivers/virtio/ a subsystem, then that's exactly what the patch series
> > is doing.
> 
> 
> I meant, if we choose to use idle poll, we have some several choices:
> 
> 1) bus level (e.g the virtio)
> 2) subsystem level (e.g the networking and block)
> 
> I'm not sure which one is better.

This API is intended to be driver- or bus-level. I don't think
subsystems can do very much since they don't know the hardware
capabilities (cheap interrupt disabling) 

Re: [RFC 0/3] cpuidle: add poll_source API and virtio vq polling

2021-07-21 Thread Stefan Hajnoczi
On Wed, Jul 21, 2021 at 11:29:55AM +0800, Jason Wang wrote:
> 
> 在 2021/7/14 上午12:19, Stefan Hajnoczi 写道:
> > These patches are not polished yet but I would like request feedback on this
> > approach and share performance results with you.
> > 
> > Idle CPUs tentatively enter a busy wait loop before halting when the cpuidle
> > haltpoll driver is enabled inside a virtual machine. This reduces wakeup
> > latency for events that occur soon after the vCPU becomes idle.
> > 
> > This patch series extends the cpuidle busy wait loop with the new 
> > poll_source
> > API so drivers can participate in polling. Such polling-aware drivers 
> > disable
> > their device's irq during the busy wait loop to avoid the cost of 
> > interrupts.
> > This reduces latency further than regular cpuidle haltpoll, which still 
> > relies
> > on irqs.
> > 
> > Virtio drivers are modified to use the poll_source API so all virtio device
> > types get this feature. The following virtio-blk fio benchmark results show 
> > the
> > improvement:
> > 
> >   IOPS (numjobs=4, iodepth=1, 4 virtqueues)
> > before   poll_source  io_poll
> > 4k randread167102  186049 (+11%)  186654 (+11%)
> > 4k randwrite   162204  181214 (+11%)  181850 (+12%)
> > 4k randrw  159520  177071 (+11%)  177928 (+11%)
> > 
> > The comparison against io_poll shows that cpuidle poll_source achieves
> > equivalent performance to the block layer's io_poll feature (which I
> > implemented in a separate patch series [1]).
> > 
> > The advantage of poll_source is that applications do not need to explicitly 
> > set
> > the RWF_HIPRI I/O request flag. The poll_source approach is attractive 
> > because
> > few applications actually use RWF_HIPRI and it takes advantage of CPU 
> > cycles we
> > would have spent in cpuidle haltpoll anyway.
> > 
> > The current series does not improve virtio-net. I haven't investigated 
> > deeply,
> > but it is possible that NAPI and poll_source do not combine. See the final
> > patch for a starting point on making the two work together.
> > 
> > I have not tried this on bare metal but it might help there too. The cost of
> > disabling a device's irq must be less than the savings from avoiding irq
> > handling for this optimization to make sense.
> > 
> > [1] 
> > https://lore.kernel.org/linux-block/20210520141305.355961-1-stefa...@redhat.com/
> 
> 
> Hi Stefan:
> 
> Some questions:
> 
> 1) What's the advantages of introducing polling at virtio level instead of
> doing it at each subsystems? Polling in virtio level may only work well if
> all (or most) of the devices are virtio

I'm not sure I understand the question. cpuidle haltpoll benefits all
devices today, except it incurs interrupt latency. The poll_source API
eliminates the interrupt latency for drivers that can disable device
interrupts cheaply.

This patch adds poll_source to core virtio code so that all virtio
drivers get this feature for free. No driver-specific changes are
needed.

If you mean networking, block layer, etc by "subsystems" then there's
nothing those subsystems can do to help. Whether poll_source can be used
depends on the specific driver, not the subsystem. If you consider
drivers/virtio/ a subsystem, then that's exactly what the patch series
is doing.

> 2) What's the advantages of using cpuidle instead of using a thread (and
> leverage the scheduler)?

In order to combine with the existing cpuidle infrastructure. No new
polling loop is introduced and no additional CPU cycles are spent on
polling.

If cpuidle itself is converted to threads then poll_source would
automatically operate in a thread too, but this patch series doesn't
change how the core cpuidle code works.

> 3) Any reason it's virtio_pci specific not a general virtio one?

Good idea, it is possible to move the virtio_pci changes into virtio.c.

Other transports can't use this feature yet though. Only virtio_pci
supports vq irq affinity. But the code can be generic and if other
transports ever support vq irq affinity they'll get it for free.

> (Btw, do we need to cc scheduler guys?)

I'm not sure. This patch series doesn't change how cpuidle interacts
with the scheduler. The cpuidle maintainers can pull in more people, if
necessary.

Stefan



Re: [RFC 1/3] cpuidle: add poll_source API

2021-07-20 Thread Stefan Hajnoczi
On Mon, Jul 19, 2021 at 06:03:55PM -0300, Marcelo Tosatti wrote:
> Hi Stefan,
> 
> On Tue, Jul 13, 2021 at 05:19:04PM +0100, Stefan Hajnoczi wrote:
> > Introduce an API for adding cpuidle poll callbacks:
> > 
> >   struct poll_source_ops {
> >   void (*start)(struct poll_source *src);
> >   void (*stop)(struct poll_source *src);
> >   void (*poll)(struct poll_source *src);
> >   };
> > 
> >   int poll_source_register(struct poll_source *src);
> >   int poll_source_unregister(struct poll_source *src);
> > 
> > When cpuidle enters the poll state it invokes ->start() and then invokes
> > ->poll() repeatedly from the busy wait loop. Finally ->stop() is invoked
> > when the busy wait loop finishes.
> > 
> > The ->poll() function should check for activity and cause
> > TIF_NEED_RESCHED to be set in order to stop the busy wait loop.
> > 
> > This API is intended to be used by drivers that can cheaply poll for
> > events. Participating in cpuidle polling allows them to avoid interrupt
> > latencies during periods where the CPU is going to poll anyway.
> > 
> > Note that each poll_source is bound to a particular CPU. The API is
> > mainly intended to by used by drivers that have multiple queues with irq
> > affinity.
> > 
> > Signed-off-by: Stefan Hajnoczi 
> > ---
> >  drivers/cpuidle/Makefile  |  1 +
> >  include/linux/poll_source.h   | 53 +++
> >  drivers/cpuidle/poll_source.c | 99 +++
> >  drivers/cpuidle/poll_state.c  |  6 +++
> >  4 files changed, 159 insertions(+)
> >  create mode 100644 include/linux/poll_source.h
> >  create mode 100644 drivers/cpuidle/poll_source.c
> > 
> > diff --git a/drivers/cpuidle/Makefile b/drivers/cpuidle/Makefile
> > index 26bbc5e74123..994f72d6fe95 100644
> > --- a/drivers/cpuidle/Makefile
> > +++ b/drivers/cpuidle/Makefile
> > @@ -7,6 +7,7 @@ obj-y += cpuidle.o driver.o governor.o sysfs.o governors/
> >  obj-$(CONFIG_ARCH_NEEDS_CPU_IDLE_COUPLED) += coupled.o
> >  obj-$(CONFIG_DT_IDLE_STATES) += dt_idle_states.o
> >  obj-$(CONFIG_ARCH_HAS_CPU_RELAX) += poll_state.o
> > +obj-$(CONFIG_ARCH_HAS_CPU_RELAX) += poll_source.o
> >  obj-$(CONFIG_HALTPOLL_CPUIDLE)   += cpuidle-haltpoll.o
> >  
> >  
> > ##
> > diff --git a/include/linux/poll_source.h b/include/linux/poll_source.h
> > new file mode 100644
> > index ..ccfb424e170b
> > --- /dev/null
> > +++ b/include/linux/poll_source.h
> > @@ -0,0 +1,53 @@
> > +/* SPDX-License-Identifier: GPL-2.0-or-later */
> > +/*
> > + * poll_source.h - cpuidle busy waiting API
> > + */
> > +#ifndef __LINUX_POLLSOURCE_H__
> > +#define __LINUX_POLLSOURCE_H__
> > +
> > +#include 
> > +
> > +struct poll_source;
> > +
> > +struct poll_source_ops {
> > +   void (*start)(struct poll_source *src);
> > +   void (*stop)(struct poll_source *src);
> > +   void (*poll)(struct poll_source *src);
> > +};
> > +
> > +struct poll_source {
> > +   const struct poll_source_ops *ops;
> > +   struct list_head node;
> > +   int cpu;
> > +};
> > +
> > +/**
> > + * poll_source_register - Add a poll_source for a CPU
> > + */
> > +#if defined(CONFIG_CPU_IDLE) && defined(CONFIG_ARCH_HAS_CPU_RELAX)
> > +int poll_source_register(struct poll_source *src);
> > +#else
> > +static inline int poll_source_register(struct poll_source *src)
> > +{
> > +   return 0;
> > +}
> > +#endif
> > +
> > +/**
> > + * poll_source_unregister - Remove a previously registered poll_source
> > + */
> > +#if defined(CONFIG_CPU_IDLE) && defined(CONFIG_ARCH_HAS_CPU_RELAX)
> > +int poll_source_unregister(struct poll_source *src);
> > +#else
> > +static inline int poll_source_unregister(struct poll_source *src)
> > +{
> > +   return 0;
> > +}
> > +#endif
> > +
> > +/* Used by the cpuidle driver */
> > +void poll_source_start(void);
> > +void poll_source_run_once(void);
> > +void poll_source_stop(void);
> > +
> > +#endif /* __LINUX_POLLSOURCE_H__ */
> > diff --git a/drivers/cpuidle/poll_source.c b/drivers/cpuidle/poll_source.c
> > new file mode 100644
> > index ..46100e5a71e4
> > --- /dev/null
> > +++ b/drivers/cpuidle/poll_source.c
> > @@ -0,0 +1,99 @@
> > +// SPDX-License-Identifier: GPL-2.0-or-la

Re: [RFC v3 00/29] vDPA software assisted live migration

2021-07-19 Thread Stefan Hajnoczi
On Mon, May 24, 2021 at 07:29:06AM -0400, Michael S. Tsirkin wrote:
> On Mon, May 24, 2021 at 12:37:48PM +0200, Eugenio Perez Martin wrote:
> > On Mon, May 24, 2021 at 11:38 AM Michael S. Tsirkin  wrote:
> > >
> > > On Wed, May 19, 2021 at 06:28:34PM +0200, Eugenio Pérez wrote:
> > > > Commit 17 introduces the buffer forwarding. Previous one are for
> > > > preparations again, and laters are for enabling some obvious
> > > > optimizations. However, it needs the vdpa device to be able to map
> > > > every IOVA space, and some vDPA devices are not able to do so. Checking
> > > > of this is added in previous commits.
> > >
> > > That might become a significant limitation. And it worries me that
> > > this is such a big patchset which might yet take a while to get
> > > finalized.
> > >
> > 
> > Sorry, maybe I've been unclear here: Latter commits in this series
> > address this limitation. Still not perfect: for example, it does not
> > support adding or removing guest's memory at the moment, but this
> > should be easy to implement on top.
> > 
> > The main issue I'm observing is from the kernel if I'm not wrong: If I
> > unmap every address, I cannot re-map them again. But code in this
> > patchset is mostly final, except for the comments it may arise in the
> > mail list of course.
> > 
> > > I have an idea: how about as a first step we implement a transparent
> > > switch from vdpa to a software virtio in QEMU or a software vhost in
> > > kernel?
> > >
> > > This will give us live migration quickly with performance comparable
> > > to failover but without dependance on guest cooperation.
> > >
> > 
> > I think it should be doable. I'm not sure about the effort that needs
> > to be done in qemu to hide these "hypervisor-failover devices" from
> > guest's view but it should be comparable to failover, as you say.
> > 
> > Networking should be ok by its nature, although it could require care
> > on the host hardware setup. But I'm not sure how other types of
> > vhost/vdpa devices may work that way. How would a disk/scsi device
> > switch modes? Can the kernel take control of the vdpa device through
> > vhost, and just start reporting with a dirty bitmap?
> > 
> > Thanks!
> 
> It depends of course, e.g. blk is mostly reads/writes so
> not a lot of state. just don't reorder or drop requests.

QEMU's virtio-blk does not attempt to change states (e.g. quiesce the
device or switch between vhost kernel/QEMU, etc) while there are
in-flight requests. Instead all currently active requests must complete
(in some cases they can be cancelled to stop them early). Note that
failed requests can be kept in a list across the switch and then
resubmitted later.

The underlying storage never has requests in flight while the device is
switched. The reason QEMU does this is because there's no way to hand
over an in-flight preadv(2), Linux AIO, or other host kernel block layer
request to another process.
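
The sequence is roughly the following (illustrative pseudocode with
made-up helper names -- this is not QEMU's actual API):

    /* all names below are hypothetical */
    static void switch_backend(BlockDev *dev, Backend *new_backend)
    {
            stop_virtqueue_processing(dev);  /* no new requests are started */
            drain_or_cancel_inflight(dev);   /* wait for active I/O to finish */

            /* nothing is in flight now; failed requests stay on dev->failed */
            detach_backend(dev);
            attach_backend(dev, new_backend);

            resubmit_requests(dev, dev->failed); /* retry what was kept on the list */
            start_virtqueue_processing(dev);
    }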

Stefan



Re: [virtio-dev] Fwd: [PATCH v2] Provide detailed specification of virtio-blk lifetime metrics

2021-07-14 Thread Stefan Hajnoczi
On Wed, Jul 14, 2021 at 11:36:58AM +0200, Cornelia Huck wrote:
> On Wed, Jul 14 2021, Stefan Hajnoczi  wrote:
> 
> > On Wed, May 05, 2021 at 12:37:26PM -0700, Enrico Granata wrote:
> >> -- Forwarded message -
> >> From: Enrico Granata 
> >> Date: Wed, May 5, 2021 at 1:37 PM
> >> Subject: [PATCH v2] Provide detailed specification of virtio-blk
> >> lifetime metrics
> >> To: 
> >> Cc: , , ,
> >> ,
> >> 
> >> 
> >> 
> >> In the course of review, some concerns were surfaced about the
> >> original virtio-blk lifetime proposal, as it depends on the eMMC
> >> spec which is not open
> >> 
> >> Add a more detailed description of the meaning of the fields
> >> added by that proposal to the virtio-blk specification, as to
> >> make it feasible to understand and implement the new lifetime
> >> metrics feature without needing to refer to JEDEC's specification
> >> 
> >> This patch does not change the meaning of those fields nor add
> >> any new fields, but it is intended to provide an open and more
> >> clear description of the meaning associated with those fields.
> >> 
> >> Signed-off-by: Enrico Granata 
> >> ---
> >> Changes in v2:
> >>   - clarified JEDEC references;
> >>   - added VIRTIO_BLK prefix and cleaned up comment syntax;
> >>   - clarified reserved block references
> >> 
> >>  content.tex | 40 
> >>  1 file changed, 32 insertions(+), 8 deletions(-)
> >
> > Ping?
> >
> > Reviewed-by: Stefan Hajnoczi 
> 
> [Enrico is currently on leave of absence]
> 
> Is there any outstanding feedback from the linux-block folks? I've lost
> track of this, I'm afraid. (Your R-b is the only feedback I see on this
> list...)
> 
> We can certainly put up a vote. This is one of the things I wanted to
> see fixed for the next release of the standard.

I have CCed Christoph, linux-block@, and virtualization@. Here is the
link to the patch that we're discussing:
https://patchwork.kernel.org/project/linux-block/patch/20210505193655.2414268-1-egran...@google.com/

Stefan



[RFC 2/3] virtio: add poll_source virtqueue polling

2021-07-13 Thread Stefan Hajnoczi
VIRTIO drivers can cheaply disable interrupts by setting
RING_EVENT_FLAGS_DISABLE in the packed virtqueue's Driver Event
Suppression structure. See "2.7.10 Driver and Device Event Suppression"
in the VIRTIO 1.1 specification for details
(https://docs.oasis-open.org/virtio/virtio/v1.1/virtio-v1.1.html).

Add a per-virtqueue poll_source that disables events in ->start(),
processes completed virtqueue buffers in ->poll(), and re-enables events
in ->stop().

This optimization avoids interrupt injection during cpuidle polling.
Workloads that submit requests and then halt the vCPU until completion
benefit from this.

To enable this optimization:

1. Enable the cpuidle haltpoll driver:
   https://www.kernel.org/doc/html/latest/virt/guest-halt-polling.html

2. Enable poll_source on the virtio device:
   # echo -n 1 > /sys/block/vda/device/poll_source

Note that this feature currently has no effect on split virtqueues when
VIRTIO_F_EVENT_IDX is negotiated. It may be possible to tweak
virtqueue_disable_cb_split() but I haven't attempted it here.

This optimization has been benchmarked successfully with virtio-blk
devices. Currently it does not improve virtio-net performance, probably
because it doesn't combine with NAPI polling.

Signed-off-by: Stefan Hajnoczi 
---
 drivers/virtio/virtio_pci_common.h |  7 +++
 include/linux/virtio.h |  2 +
 include/linux/virtio_config.h  |  2 +
 drivers/virtio/virtio.c| 34 
 drivers/virtio/virtio_pci_common.c | 86 ++
 drivers/virtio/virtio_pci_modern.c |  2 +
 6 files changed, 133 insertions(+)

diff --git a/drivers/virtio/virtio_pci_common.h 
b/drivers/virtio/virtio_pci_common.h
index beec047a8f8d..630746043183 100644
--- a/drivers/virtio/virtio_pci_common.h
+++ b/drivers/virtio/virtio_pci_common.h
@@ -21,6 +21,7 @@
 #include 
 #include 
 #include 
+#include 
 #include 
 #include 
 #include 
@@ -38,6 +39,9 @@ struct virtio_pci_vq_info {
 
/* MSI-X vector (or none) */
unsigned msix_vector;
+
+   /* the cpuidle poll_source */
+   struct poll_source poll_source;
 };
 
 /* Our device structure */
@@ -102,6 +106,9 @@ static struct virtio_pci_device *to_vp_device(struct 
virtio_device *vdev)
return container_of(vdev, struct virtio_pci_device, vdev);
 }
 
+/* enable poll_source API for vq polling */
+int vp_enable_poll_source(struct virtio_device *vdev, bool enable);
+
 /* wait for pending irq handlers */
 void vp_synchronize_vectors(struct virtio_device *vdev);
 /* the notify function used when creating a virt queue */
diff --git a/include/linux/virtio.h b/include/linux/virtio.h
index b1894e0323fa..baaa3c8fadb1 100644
--- a/include/linux/virtio.h
+++ b/include/linux/virtio.h
@@ -93,6 +93,7 @@ dma_addr_t virtqueue_get_used_addr(struct virtqueue *vq);
  * virtio_device - representation of a device using virtio
  * @index: unique position on the virtio bus
  * @failed: saved value for VIRTIO_CONFIG_S_FAILED bit (for restore)
+ * @poll_source_enabled: poll_source API enabled for vq polling
  * @config_enabled: configuration change reporting enabled
  * @config_change_pending: configuration change reported while disabled
  * @config_lock: protects configuration change reporting
@@ -107,6 +108,7 @@ dma_addr_t virtqueue_get_used_addr(struct virtqueue *vq);
 struct virtio_device {
int index;
bool failed;
+   bool poll_source_enabled;
bool config_enabled;
bool config_change_pending;
spinlock_t config_lock;
diff --git a/include/linux/virtio_config.h b/include/linux/virtio_config.h
index 8519b3ae5d52..5fb78d56fd44 100644
--- a/include/linux/virtio_config.h
+++ b/include/linux/virtio_config.h
@@ -72,6 +72,7 @@ struct virtio_shm_region {
  * @set_vq_affinity: set the affinity for a virtqueue (optional).
  * @get_vq_affinity: get the affinity for a virtqueue (optional).
  * @get_shm_region: get a shared memory region based on the index.
+ * @enable_poll_source: enable/disable poll_source API vq polling (optional).
  */
 typedef void vq_callback_t(struct virtqueue *);
 struct virtio_config_ops {
@@ -97,6 +98,7 @@ struct virtio_config_ops {
int index);
bool (*get_shm_region)(struct virtio_device *vdev,
   struct virtio_shm_region *region, u8 id);
+   int (*enable_poll_source)(struct virtio_device *dev, bool enable);
 };
 
 /* If driver didn't advertise the feature, it will never appear. */
diff --git a/drivers/virtio/virtio.c b/drivers/virtio/virtio.c
index 4b15c00c0a0a..22fee71bbe34 100644
--- a/drivers/virtio/virtio.c
+++ b/drivers/virtio/virtio.c
@@ -59,12 +59,44 @@ static ssize_t features_show(struct device *_d,
 }
 static DEVICE_ATTR_RO(features);
 
+static ssize_t poll_source_show(struct device *_d,
+   struct device_attribute *attr, char *buf)
+{
+   struct virtio_device *dev = dev_to_virtio(_d);
+   return sprintf(buf, "%d\n

[RFC 3/3] softirq: participate in cpuidle polling

2021-07-13 Thread Stefan Hajnoczi
Normally softirqs are invoked when exiting irqs. When polling in the
cpuidle driver there may be no irq. Therefore pending softirqs go
unnoticed and polling continues without invoking them.

Add a softirq_poll() function to explicitly check for and invoke
softirqs.

Signed-off-by: Stefan Hajnoczi 
---
This commit is not needed for virtio-blk. I added it when I realized
virtio-net's NAPI scheduling might not be detected by the cpuidle busy
wait loop because it is unaware of softirqs. However, even after doing
this virtio-net's NAPI polling doesn't combine with cpuidle haltpoll.

Perhaps this patch is still desirable for cpuidle poll_state in case a
softirq is raised?
---
 include/linux/interrupt.h |  2 ++
 drivers/cpuidle/poll_source.c |  3 +++
 kernel/softirq.c  | 14 ++
 3 files changed, 19 insertions(+)

diff --git a/include/linux/interrupt.h b/include/linux/interrupt.h
index 4777850a6dc7..9bfdcc466ba8 100644
--- a/include/linux/interrupt.h
+++ b/include/linux/interrupt.h
@@ -573,6 +573,8 @@ struct softirq_action
 asmlinkage void do_softirq(void);
 asmlinkage void __do_softirq(void);
 
+extern void softirq_poll(void);
+
 extern void open_softirq(int nr, void (*action)(struct softirq_action *));
 extern void softirq_init(void);
 extern void __raise_softirq_irqoff(unsigned int nr);
diff --git a/drivers/cpuidle/poll_source.c b/drivers/cpuidle/poll_source.c
index 46100e5a71e4..ed200feb0daa 100644
--- a/drivers/cpuidle/poll_source.c
+++ b/drivers/cpuidle/poll_source.c
@@ -6,6 +6,7 @@
 #include 
 #include 
 #include 
+#include 
 
 /* The per-cpu list of registered poll sources */
 DEFINE_PER_CPU(struct list_head, poll_source_list);
@@ -26,6 +27,8 @@ void poll_source_run_once(void)
 
list_for_each_entry(src, this_cpu_ptr(&poll_source_list), node)
src->ops->poll(src);
+
+   softirq_poll();
 }
 
 /* Called from idle task with TIF_POLLING_NRFLAG set and irqs enabled */
diff --git a/kernel/softirq.c b/kernel/softirq.c
index 4992853ef53d..f45bf0204218 100644
--- a/kernel/softirq.c
+++ b/kernel/softirq.c
@@ -611,6 +611,20 @@ void irq_enter(void)
irq_enter_rcu();
 }
 
+/**
+ * softirq_poll() - invoke pending softirqs
+ *
+ * Normally it is not necessary to explicitly poll for softirqs, but in the
+ * cpuidle driver a polling function may have raised a softirq with no irq exit
+ * to invoke it. Therefore it is necessary to poll for pending softirqs and
+ * invoke them explicitly.
+ */
+void softirq_poll(void)
+{
+   if (!in_interrupt() && local_softirq_pending())
+   invoke_softirq();
+}
+
 static inline void tick_irq_exit(void)
 {
 #ifdef CONFIG_NO_HZ_COMMON
-- 
2.31.1



[RFC 1/3] cpuidle: add poll_source API

2021-07-13 Thread Stefan Hajnoczi
Introduce an API for adding cpuidle poll callbacks:

  struct poll_source_ops {
  void (*start)(struct poll_source *src);
  void (*stop)(struct poll_source *src);
  void (*poll)(struct poll_source *src);
  };

  int poll_source_register(struct poll_source *src);
  int poll_source_unregister(struct poll_source *src);

When cpuidle enters the poll state it invokes ->start() and then invokes
->poll() repeatedly from the busy wait loop. Finally ->stop() is invoked
when the busy wait loop finishes.

The ->poll() function should check for activity and cause
TIF_NEED_RESCHED to be set in order to stop the busy wait loop.

This API is intended to be used by drivers that can cheaply poll for
events. Participating in cpuidle polling allows them to avoid interrupt
latencies during periods where the CPU is going to poll anyway.

Note that each poll_source is bound to a particular CPU. The API is
mainly intended to be used by drivers that have multiple queues with irq
affinity.
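
For illustration, a hypothetical driver with per-CPU queues could hook
into the API like this (sketch only, not part of this series; "mydev" and
its helpers are made up):

    struct mydev_queue {
            struct poll_source poll_source;
            /* irq number, ring pointers, ... */
    };

    static void mydev_poll_start(struct poll_source *src)
    {
            /* cheaply suppress this queue's irq */
    }

    static void mydev_poll(struct poll_source *src)
    {
            struct mydev_queue *q =
                    container_of(src, struct mydev_queue, poll_source);

            /* check q's ring for completions; completing work wakes a task
             * and sets TIF_NEED_RESCHED, which ends the busy wait loop */
            (void)q;
    }

    static void mydev_poll_stop(struct poll_source *src)
    {
            /* re-enable this queue's irq */
    }

    static const struct poll_source_ops mydev_poll_source_ops = {
            .start = mydev_poll_start,
            .poll  = mydev_poll,
            .stop  = mydev_poll_stop,
    };

    static int mydev_queue_init(struct mydev_queue *q, int cpu)
    {
            q->poll_source.ops = &mydev_poll_source_ops;
            q->poll_source.cpu = cpu;   /* must match the queue's irq affinity */
            return poll_source_register(&q->poll_source);
    }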

Signed-off-by: Stefan Hajnoczi 
---
 drivers/cpuidle/Makefile  |  1 +
 include/linux/poll_source.h   | 53 +++
 drivers/cpuidle/poll_source.c | 99 +++
 drivers/cpuidle/poll_state.c  |  6 +++
 4 files changed, 159 insertions(+)
 create mode 100644 include/linux/poll_source.h
 create mode 100644 drivers/cpuidle/poll_source.c

diff --git a/drivers/cpuidle/Makefile b/drivers/cpuidle/Makefile
index 26bbc5e74123..994f72d6fe95 100644
--- a/drivers/cpuidle/Makefile
+++ b/drivers/cpuidle/Makefile
@@ -7,6 +7,7 @@ obj-y += cpuidle.o driver.o governor.o sysfs.o governors/
 obj-$(CONFIG_ARCH_NEEDS_CPU_IDLE_COUPLED) += coupled.o
 obj-$(CONFIG_DT_IDLE_STATES) += dt_idle_states.o
 obj-$(CONFIG_ARCH_HAS_CPU_RELAX) += poll_state.o
+obj-$(CONFIG_ARCH_HAS_CPU_RELAX) += poll_source.o
 obj-$(CONFIG_HALTPOLL_CPUIDLE)   += cpuidle-haltpoll.o
 
 
##
diff --git a/include/linux/poll_source.h b/include/linux/poll_source.h
new file mode 100644
index ..ccfb424e170b
--- /dev/null
+++ b/include/linux/poll_source.h
@@ -0,0 +1,53 @@
+/* SPDX-License-Identifier: GPL-2.0-or-later */
+/*
+ * poll_source.h - cpuidle busy waiting API
+ */
+#ifndef __LINUX_POLLSOURCE_H__
+#define __LINUX_POLLSOURCE_H__
+
+#include 
+
+struct poll_source;
+
+struct poll_source_ops {
+   void (*start)(struct poll_source *src);
+   void (*stop)(struct poll_source *src);
+   void (*poll)(struct poll_source *src);
+};
+
+struct poll_source {
+   const struct poll_source_ops *ops;
+   struct list_head node;
+   int cpu;
+};
+
+/**
+ * poll_source_register - Add a poll_source for a CPU
+ */
+#if defined(CONFIG_CPU_IDLE) && defined(CONFIG_ARCH_HAS_CPU_RELAX)
+int poll_source_register(struct poll_source *src);
+#else
+static inline int poll_source_register(struct poll_source *src)
+{
+   return 0;
+}
+#endif
+
+/**
+ * poll_source_unregister - Remove a previously registered poll_source
+ */
+#if defined(CONFIG_CPU_IDLE) && defined(CONFIG_ARCH_HAS_CPU_RELAX)
+int poll_source_unregister(struct poll_source *src);
+#else
+static inline int poll_source_unregister(struct poll_source *src)
+{
+   return 0;
+}
+#endif
+
+/* Used by the cpuidle driver */
+void poll_source_start(void);
+void poll_source_run_once(void);
+void poll_source_stop(void);
+
+#endif /* __LINUX_POLLSOURCE_H__ */
diff --git a/drivers/cpuidle/poll_source.c b/drivers/cpuidle/poll_source.c
new file mode 100644
index ..46100e5a71e4
--- /dev/null
+++ b/drivers/cpuidle/poll_source.c
@@ -0,0 +1,99 @@
+// SPDX-License-Identifier: GPL-2.0-or-later
+/*
+ * poll_source.c - cpuidle busy waiting API
+ */
+
+#include 
+#include 
+#include 
+
+/* The per-cpu list of registered poll sources */
+DEFINE_PER_CPU(struct list_head, poll_source_list);
+
+/* Called from idle task with TIF_POLLING_NRFLAG set and irqs enabled */
+void poll_source_start(void)
+{
+   struct poll_source *src;
+
+   list_for_each_entry(src, this_cpu_ptr(&poll_source_list), node)
+   src->ops->start(src);
+}
+
+/* Called from idle task with TIF_POLLING_NRFLAG set and irqs enabled */
+void poll_source_run_once(void)
+{
+   struct poll_source *src;
+
+   list_for_each_entry(src, this_cpu_ptr(&poll_source_list), node)
+   src->ops->poll(src);
+}
+
+/* Called from idle task with TIF_POLLING_NRFLAG set and irqs enabled */
+void poll_source_stop(void)
+{
+   struct poll_source *src;
+
+   list_for_each_entry(src, this_cpu_ptr(&poll_source_list), node)
+   src->ops->stop(src);
+}
+
+static void poll_source_register_this_cpu(void *opaque)
+{
+   struct poll_source *src = opaque;
+
+   lockdep_assert_irqs_disabled();
+
+   list_add_tail(&src->node, this_cpu_ptr(&poll_source_list));
+}
+
+int poll_source_register(struct poll_source *src)
+{
+   if (

[RFC 0/3] cpuidle: add poll_source API and virtio vq polling

2021-07-13 Thread Stefan Hajnoczi
These patches are not polished yet but I would like to request feedback on this
approach and share performance results with you.

Idle CPUs tentatively enter a busy wait loop before halting when the cpuidle
haltpoll driver is enabled inside a virtual machine. This reduces wakeup
latency for events that occur soon after the vCPU becomes idle.

This patch series extends the cpuidle busy wait loop with the new poll_source
API so drivers can participate in polling. Such polling-aware drivers disable
their device's irq during the busy wait loop to avoid the cost of interrupts.
This reduces latency further than regular cpuidle haltpoll, which still relies
on irqs.

Virtio drivers are modified to use the poll_source API so all virtio device
types get this feature. The following virtio-blk fio benchmark results show the
improvement:

 IOPS (numjobs=4, iodepth=1, 4 virtqueues)
   before   poll_source  io_poll
4k randread167102  186049 (+11%)  186654 (+11%)
4k randwrite   162204  181214 (+11%)  181850 (+12%)
4k randrw  159520  177071 (+11%)  177928 (+11%)

The comparison against io_poll shows that cpuidle poll_source achieves
equivalent performance to the block layer's io_poll feature (which I
implemented in a separate patch series [1]).

The advantage of poll_source is that applications do not need to explicitly set
the RWF_HIPRI I/O request flag. The poll_source approach is attractive because
few applications actually use RWF_HIPRI and it takes advantage of CPU cycles we
would have spent in cpuidle haltpoll anyway.

The current series does not improve virtio-net. I haven't investigated deeply,
but it is possible that NAPI and poll_source do not combine. See the final
patch for a starting point on making the two work together.

I have not tried this on bare metal but it might help there too. The cost of
disabling a device's irq must be less than the savings from avoiding irq
handling for this optimization to make sense.

[1] 
https://lore.kernel.org/linux-block/20210520141305.355961-1-stefa...@redhat.com/

Stefan Hajnoczi (3):
  cpuidle: add poll_source API
  virtio: add poll_source virtqueue polling
  softirq: participate in cpuidle polling

 drivers/cpuidle/Makefile   |   1 +
 drivers/virtio/virtio_pci_common.h |   7 ++
 include/linux/interrupt.h  |   2 +
 include/linux/poll_source.h|  53 +++
 include/linux/virtio.h |   2 +
 include/linux/virtio_config.h  |   2 +
 drivers/cpuidle/poll_source.c  | 102 +
 drivers/cpuidle/poll_state.c   |   6 ++
 drivers/virtio/virtio.c|  34 ++
 drivers/virtio/virtio_pci_common.c |  86 
 drivers/virtio/virtio_pci_modern.c |   2 +
 kernel/softirq.c   |  14 
 12 files changed, 311 insertions(+)
 create mode 100644 include/linux/poll_source.h
 create mode 100644 drivers/cpuidle/poll_source.c

-- 
2.31.1



Re: [bug report]scsi: drive letter drift problem

2021-07-12 Thread Stefan Hajnoczi
On Mon, Jul 12, 2021 at 12:47:05PM +0800, Wenchao Hao wrote:
> We deploy two SCSI disks in one SCSI host (virtio-scsi bus) on one machine,
> whose IDs are [0:0:0:0] and [0:0:1:0].
> 
> Mostly, the device letters are assigned as follows after reboot:
> [0:0:0:0] --> sda
> [0:0:1:0] --> sdb
> 
> In rare cases, however, the device letters come out as:
> [0:0:0:0] --> sdb
> [0:0:1:0] --> sda
> 
> Could we guarantee "sda" is assigned to [0:0:0:0] and "sdb" is assigned to
> [0:0:1:0] or not?
> If we can, then how?

Is there a stable ID that you can use in /dev/disk/by-*?

Stefan



Re: [PATCH v8 10/10] Documentation: Add documentation for VDUSE

2021-07-08 Thread Stefan Hajnoczi
On Wed, Jul 07, 2021 at 05:09:13PM +0800, Yongji Xie wrote:
> On Tue, Jul 6, 2021 at 6:22 PM Stefan Hajnoczi  wrote:
> >
> > On Tue, Jul 06, 2021 at 11:04:18AM +0800, Yongji Xie wrote:
> > > On Mon, Jul 5, 2021 at 8:50 PM Stefan Hajnoczi  
> > > wrote:
> > > >
> > > > On Mon, Jul 05, 2021 at 11:36:15AM +0800, Jason Wang wrote:
> > > > >
> > > > > 在 2021/7/4 下午5:49, Yongji Xie 写道:
> > > > > > > > OK, I get you now. Since the VIRTIO specification says "Device
> > > > > > > > configuration space is generally used for rarely-changing or
> > > > > > > > initialization-time parameters". I assume the 
> > > > > > > > VDUSE_DEV_SET_CONFIG
> > > > > > > > ioctl should not be called frequently.
> > > > > > > The spec uses MUST and other terms to define the precise 
> > > > > > > requirements.
> > > > > > > Here the language (especially the word "generally") is weaker and 
> > > > > > > means
> > > > > > > there may be exceptions.
> > > > > > >
> > > > > > > Another type of access that doesn't work with the 
> > > > > > > VDUSE_DEV_SET_CONFIG
> > > > > > > approach is reads that have side-effects. For example, imagine a 
> > > > > > > field
> > > > > > > containing an error code if the device encounters a problem 
> > > > > > > unrelated to
> > > > > > > a specific virtqueue request. Reading from this field resets the 
> > > > > > > error
> > > > > > > code to 0, saving the driver an extra configuration space write 
> > > > > > > access
> > > > > > > and possibly race conditions. It isn't possible to implement those
> > > > > > > semantics suing VDUSE_DEV_SET_CONFIG. It's another corner case, 
> > > > > > > but it
> > > > > > > makes me think that the interface does not allow full VIRTIO 
> > > > > > > semantics.
> > > > >
> > > > >
> > > > > Note that though you're correct, my understanding is that config 
> > > > > space is
> > > > > not suitable for this kind of error propagating. And it would be very 
> > > > > hard
> > > > > to implement such kind of semantic in some transports.  Virtqueue 
> > > > > should be
> > > > > much better. As Yong Ji quoted, the config space is used for
> > > > > "rarely-changing or intialization-time parameters".
> > > > >
> > > > >
> > > > > > Agreed. I will use VDUSE_DEV_GET_CONFIG in the next version. And to
> > > > > > handle the message failure, I'm going to add a return value to
> > > > > > virtio_config_ops.get() and virtio_cread_* API so that the error can
> > > > > > be propagated to the virtio device driver. Then the virtio-blk 
> > > > > > device
> > > > > > driver can be modified to handle that.
> > > > > >
> > > > > > Jason and Stefan, what do you think of this way?
> > > >
> > > > Why does VDUSE_DEV_GET_CONFIG need to support an error return value?
> > > >
> > >
> > > We add a timeout and return error in case userspace never replies to
> > > the message.
> > >
> > > > The VIRTIO spec provides no way for the device to report errors from
> > > > config space accesses.
> > > >
> > > > The QEMU virtio-pci implementation returns -1 from invalid
> > > > virtio_config_read*() and silently discards virtio_config_write*()
> > > > accesses.
> > > >
> > > > VDUSE can take the same approach with
> > > > VDUSE_DEV_GET_CONFIG/VDUSE_DEV_SET_CONFIG.
> > > >
> > >
> > > I noticed that virtio_config_read*() only returns -1 when we access a
> > > invalid field. But in the VDUSE case, VDUSE_DEV_GET_CONFIG might fail
> > > when we access a valid field. Not sure if it's ok to silently ignore
> > > this kind of error.
> >
> > That's a good point but it's a general VIRTIO issue. Any device
> > implementation (QEMU userspace, hardware vDPA, etc) can fail, so the
> > VIRTIO specification needs to provide a way for the driver to detect
> > this.
> >
> > If userspace violates the contract then VDUSE needs to mark the device
> > broken. QEMU's device emulation does something similar with the
> > vdev->broken flag.
> >
> > The VIRTIO Device Status field DEVICE_NEEDS_RESET bit can be set by
> > vDPA/VDUSE to indicate that the device is not operational and must be
> > reset.
> >
> 
> It might be a solution. But DEVICE_NEEDS_RESET is not implemented
> currently. So I'm thinking whether it's ok to add a check of
> DEVICE_NEEDS_RESET status bit in probe function of virtio device
> driver (e.g. virtio-blk driver). Then VDUSE can make use of it to fail
> device initialization when configuration space access failed.

Okay.
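
A rough sketch of that probe-time check (untested; it assumes the driver can
reach the status through the existing get_status callback):

  /* Sketch only: bail out of probe if the device already signals
   * DEVICE_NEEDS_RESET, e.g. because a VDUSE config space read failed.
   */
  static int virtblk_check_needs_reset(struct virtio_device *vdev)
  {
          if (vdev->config->get_status(vdev) & VIRTIO_CONFIG_S_NEEDS_RESET) {
                  dev_err(&vdev->dev, "device needs reset, aborting probe\n");
                  return -EIO;
          }
          return 0;
  }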

Stefan



Re: [PATCH v8 10/10] Documentation: Add documentation for VDUSE

2021-07-08 Thread Stefan Hajnoczi
On Thu, Jul 08, 2021 at 12:17:56PM +0800, Jason Wang wrote:
> 
> 在 2021/7/7 下午11:54, Stefan Hajnoczi 写道:
> > On Wed, Jul 07, 2021 at 05:24:08PM +0800, Jason Wang wrote:
> > > 在 2021/7/7 下午4:55, Stefan Hajnoczi 写道:
> > > > On Wed, Jul 07, 2021 at 11:43:28AM +0800, Jason Wang wrote:
> > > > > 在 2021/7/7 上午1:11, Stefan Hajnoczi 写道:
> > > > > > On Tue, Jul 06, 2021 at 09:08:26PM +0800, Jason Wang wrote:
> > > > > > > On Tue, Jul 6, 2021 at 6:15 PM Stefan Hajnoczi 
> > > > > > >  wrote:
> > > > > > > > On Tue, Jul 06, 2021 at 10:34:33AM +0800, Jason Wang wrote:
> > > > > > > > > 在 2021/7/5 下午8:49, Stefan Hajnoczi 写道:
> > > > > > > > > > On Mon, Jul 05, 2021 at 11:36:15AM +0800, Jason Wang wrote:
> > > > > > > > > > > 在 2021/7/4 下午5:49, Yongji Xie 写道:
> > > > > > > > > > > > > > OK, I get you now. Since the VIRTIO specification 
> > > > > > > > > > > > > > says "Device
> > > > > > > > > > > > > > configuration space is generally used for 
> > > > > > > > > > > > > > rarely-changing or
> > > > > > > > > > > > > > initialization-time parameters". I assume the 
> > > > > > > > > > > > > > VDUSE_DEV_SET_CONFIG
> > > > > > > > > > > > > > ioctl should not be called frequently.
> > > > > > > > > > > > > The spec uses MUST and other terms to define the 
> > > > > > > > > > > > > precise requirements.
> > > > > > > > > > > > > Here the language (especially the word "generally") 
> > > > > > > > > > > > > is weaker and means
> > > > > > > > > > > > > there may be exceptions.
> > > > > > > > > > > > > 
> > > > > > > > > > > > > Another type of access that doesn't work with the 
> > > > > > > > > > > > > VDUSE_DEV_SET_CONFIG
> > > > > > > > > > > > > approach is reads that have side-effects. For 
> > > > > > > > > > > > > example, imagine a field
> > > > > > > > > > > > > containing an error code if the device encounters a 
> > > > > > > > > > > > > problem unrelated to
> > > > > > > > > > > > > a specific virtqueue request. Reading from this field 
> > > > > > > > > > > > > resets the error
> > > > > > > > > > > > > code to 0, saving the driver an extra configuration 
> > > > > > > > > > > > > space write access
> > > > > > > > > > > > > and possibly race conditions. It isn't possible to 
> > > > > > > > > > > > > implement those
> > > > > > > > > > > > > semantics using VDUSE_DEV_SET_CONFIG. It's another
> > > > > > > > > > > > > corner case, but it
> > > > > > > > > > > > > makes me think that the interface does not allow full 
> > > > > > > > > > > > > VIRTIO semantics.
> > > > > > > > > > > Note that though you're correct, my understanding is that 
> > > > > > > > > > > config space is
> > > > > > > > > > > not suitable for this kind of error propagating. And it 
> > > > > > > > > > > would be very hard
> > > > > > > > > > > to implement such kind of semantic in some transports.  
> > > > > > > > > > > Virtqueue should be
> > > > > > > > > > > much better. As Yong Ji quoted, the config space is used 
> > > > > > > > > > > for
> > > > > > > > > > > "rarely-changing or intialization-time parameters".
> > > > > > > > > > > 
> > > > > > > > > > > 
> > > > > > > > > > > > Agreed. I will use VDUSE_DEV_GET_CONFIG in the

Re: [PATCH v8 10/10] Documentation: Add documentation for VDUSE

2021-07-07 Thread Stefan Hajnoczi
On Wed, Jul 07, 2021 at 05:24:08PM +0800, Jason Wang wrote:
> 
> 在 2021/7/7 下午4:55, Stefan Hajnoczi 写道:
> > On Wed, Jul 07, 2021 at 11:43:28AM +0800, Jason Wang wrote:
> > > 在 2021/7/7 上午1:11, Stefan Hajnoczi 写道:
> > > > On Tue, Jul 06, 2021 at 09:08:26PM +0800, Jason Wang wrote:
> > > > > On Tue, Jul 6, 2021 at 6:15 PM Stefan Hajnoczi  
> > > > > wrote:
> > > > > > On Tue, Jul 06, 2021 at 10:34:33AM +0800, Jason Wang wrote:
> > > > > > > 在 2021/7/5 下午8:49, Stefan Hajnoczi 写道:
> > > > > > > > On Mon, Jul 05, 2021 at 11:36:15AM +0800, Jason Wang wrote:
> > > > > > > > > 在 2021/7/4 下午5:49, Yongji Xie 写道:
> > > > > > > > > > > > OK, I get you now. Since the VIRTIO specification says 
> > > > > > > > > > > > "Device
> > > > > > > > > > > > configuration space is generally used for 
> > > > > > > > > > > > rarely-changing or
> > > > > > > > > > > > initialization-time parameters". I assume the 
> > > > > > > > > > > > VDUSE_DEV_SET_CONFIG
> > > > > > > > > > > > ioctl should not be called frequently.
> > > > > > > > > > > The spec uses MUST and other terms to define the precise 
> > > > > > > > > > > requirements.
> > > > > > > > > > > Here the language (especially the word "generally") is 
> > > > > > > > > > > weaker and means
> > > > > > > > > > > there may be exceptions.
> > > > > > > > > > > 
> > > > > > > > > > > Another type of access that doesn't work with the 
> > > > > > > > > > > VDUSE_DEV_SET_CONFIG
> > > > > > > > > > > approach is reads that have side-effects. For example, 
> > > > > > > > > > > imagine a field
> > > > > > > > > > > containing an error code if the device encounters a 
> > > > > > > > > > > problem unrelated to
> > > > > > > > > > > a specific virtqueue request. Reading from this field 
> > > > > > > > > > > resets the error
> > > > > > > > > > > code to 0, saving the driver an extra configuration space 
> > > > > > > > > > > write access
> > > > > > > > > > > and possibly race conditions. It isn't possible to 
> > > > > > > > > > > implement those
> > > > > > > > > > > semantics using VDUSE_DEV_SET_CONFIG. It's another corner
> > > > > > > > > > > case, but it
> > > > > > > > > > > makes me think that the interface does not allow full 
> > > > > > > > > > > VIRTIO semantics.
> > > > > > > > > Note that though you're correct, my understanding is that 
> > > > > > > > > config space is
> > > > > > > > > not suitable for this kind of error propagating. And it would 
> > > > > > > > > be very hard
> > > > > > > > > to implement such kind of semantic in some transports.  
> > > > > > > > > Virtqueue should be
> > > > > > > > > much better. As Yong Ji quoted, the config space is used for
> > > > > > > > > "rarely-changing or intialization-time parameters".
> > > > > > > > > 
> > > > > > > > > 
> > > > > > > > > > Agreed. I will use VDUSE_DEV_GET_CONFIG in the next 
> > > > > > > > > > version. And to
> > > > > > > > > > handle the message failure, I'm going to add a return value 
> > > > > > > > > > to
> > > > > > > > > > virtio_config_ops.get() and virtio_cread_* API so that the 
> > > > > > > > > > error can
> > > > > > > > > > be propagated to the virtio device driver. Then the 
> > > > > > > > > > virtio-blk device
> > > > > > > > > > driver can be modified to handle that.
> > > > >

Re: [PATCH v8 09/10] vduse: Introduce VDUSE - vDPA Device in Userspace

2021-07-07 Thread Stefan Hajnoczi
On Tue, Jun 15, 2021 at 10:13:30PM +0800, Xie Yongji wrote:
> +static bool vduse_validate_config(struct vduse_dev_config *config)
> +{

The name field needs to be NUL terminated?

> + case VDUSE_CREATE_DEV: {
> + struct vduse_dev_config config;
> + unsigned long size = offsetof(struct vduse_dev_config, config);
> + void *buf;
> +
> + ret = -EFAULT;
> + if (copy_from_user(&config, argp, size))
> + break;
> +
> + ret = -EINVAL;
> + if (vduse_validate_config(&config) == false)
> + break;
> +
> + buf = vmemdup_user(argp + size, config.config_size);
> + if (IS_ERR(buf)) {
> + ret = PTR_ERR(buf);
> + break;
> + }
> + ret = vduse_create_dev(&config, buf, control->api_version);
> + break;
> + }
> + case VDUSE_DESTROY_DEV: {
> + char name[VDUSE_NAME_MAX];
> +
> + ret = -EFAULT;
> + if (copy_from_user(name, argp, VDUSE_NAME_MAX))
> + break;

Is this missing a NUL terminator?
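
The check is tiny; something along these lines would work for both ioctls
(helper name made up, not from the patch):

  /* Reject names that are not NUL-terminated within VDUSE_NAME_MAX bytes. */
  static bool vduse_name_is_valid(const char name[VDUSE_NAME_MAX])
  {
          return strnlen(name, VDUSE_NAME_MAX) < VDUSE_NAME_MAX;
  }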



Re: [PATCH 1/2] vdpa: support per virtqueue max queue size

2021-07-06 Thread Stefan Hajnoczi
On Mon, Jul 05, 2021 at 03:19:09PM +0800, Jason Wang wrote:
> Virtio spec allows the device to specify the per virtqueue max queue
> size. vDPA needs to adapt to this flexibility. E.g Qemu advertise a
> small control virtqueue for virtio-net.
> 
> So this patch adds a index parameter to get_vq_num_max bus operations
> for the device to report its per virtqueue max queue size.
> 
> Both VHOST_VDPA_GET_VRING_NUM and VDPA_ATTR_DEV_MAX_VQ_SIZE assume a
> global maximum size. So we iterate all the virtqueues to return the
> minimal size in this case. Actually, the VHOST_VDPA_GET_VRING_NUM is
> not a must for the userspace. Userspace may choose to check the
> VHOST_SET_VRING_NUM for proving or validating the maximum virtqueue
> size. Anyway, we can invent a per vq version of
> VHOST_VDPA_GET_VRING_NUM in the future if it's necessary.
> 
> Signed-off-by: Jason Wang 
> ---
>  drivers/vdpa/ifcvf/ifcvf_main.c   |  2 +-
>  drivers/vdpa/mlx5/net/mlx5_vnet.c |  2 +-
>  drivers/vdpa/vdpa.c   | 22 +-
>  drivers/vdpa/vdpa_sim/vdpa_sim.c  |  2 +-
>  drivers/vdpa/virtio_pci/vp_vdpa.c |  2 +-
>  drivers/vhost/vdpa.c  |  9 ++---
>  drivers/virtio/virtio_vdpa.c  |  2 +-
>  include/linux/vdpa.h  |  5 -
>  8 files changed, 36 insertions(+), 10 deletions(-)

Reviewed-by: Stefan Hajnoczi 
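
For readers skimming the archive, the fallback described in the commit
message amounts to something like this (illustrative sketch, not the patch
itself):

  /* Report the smallest per-virtqueue maximum when a single global value
   * (e.g. for VHOST_VDPA_GET_VRING_NUM) is still needed.
   */
  static u16 vdpa_min_vq_num_max(struct vdpa_device *vdev, u32 nvqs)
  {
          u16 min = vdev->config->get_vq_num_max(vdev, 0);
          u32 i;

          for (i = 1; i < nvqs; i++) {
                  u16 n = vdev->config->get_vq_num_max(vdev, i);

                  if (n < min)
                          min = n;
          }
          return min;
  }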



Re: [PATCH v8 10/10] Documentation: Add documentation for VDUSE

2021-07-06 Thread Stefan Hajnoczi
On Tue, Jul 06, 2021 at 11:04:18AM +0800, Yongji Xie wrote:
> On Mon, Jul 5, 2021 at 8:50 PM Stefan Hajnoczi  wrote:
> >
> > On Mon, Jul 05, 2021 at 11:36:15AM +0800, Jason Wang wrote:
> > >
> > > 在 2021/7/4 下午5:49, Yongji Xie 写道:
> > > > > > OK, I get you now. Since the VIRTIO specification says "Device
> > > > > > configuration space is generally used for rarely-changing or
> > > > > > initialization-time parameters". I assume the VDUSE_DEV_SET_CONFIG
> > > > > > ioctl should not be called frequently.
> > > > > The spec uses MUST and other terms to define the precise requirements.
> > > > > Here the language (especially the word "generally") is weaker and 
> > > > > means
> > > > > there may be exceptions.
> > > > >
> > > > > Another type of access that doesn't work with the VDUSE_DEV_SET_CONFIG
> > > > > approach is reads that have side-effects. For example, imagine a field
> > > > > containing an error code if the device encounters a problem unrelated 
> > > > > to
> > > > > a specific virtqueue request. Reading from this field resets the error
> > > > > code to 0, saving the driver an extra configuration space write access
> > > > > and possibly race conditions. It isn't possible to implement those
> > > > > semantics using VDUSE_DEV_SET_CONFIG. It's another corner case, but it
> > > > > makes me think that the interface does not allow full VIRTIO 
> > > > > semantics.
> > >
> > >
> > > Note that though you're correct, my understanding is that config space is
> > > not suitable for this kind of error propagating. And it would be very hard
> > > to implement such kind of semantic in some transports.  Virtqueue should 
> > > be
> > > much better. As Yong Ji quoted, the config space is used for
> > > "rarely-changing or intialization-time parameters".
> > >
> > >
> > > > Agreed. I will use VDUSE_DEV_GET_CONFIG in the next version. And to
> > > > handle the message failure, I'm going to add a return value to
> > > > virtio_config_ops.get() and virtio_cread_* API so that the error can
> > > > be propagated to the virtio device driver. Then the virtio-blk device
> > > > driver can be modified to handle that.
> > > >
> > > > Jason and Stefan, what do you think of this way?
> >
> > Why does VDUSE_DEV_GET_CONFIG need to support an error return value?
> >
> 
> We add a timeout and return error in case userspace never replies to
> the message.
> 
> > The VIRTIO spec provides no way for the device to report errors from
> > config space accesses.
> >
> > The QEMU virtio-pci implementation returns -1 from invalid
> > virtio_config_read*() and silently discards virtio_config_write*()
> > accesses.
> >
> > VDUSE can take the same approach with
> > VDUSE_DEV_GET_CONFIG/VDUSE_DEV_SET_CONFIG.
> >
> 
> I noticed that virtio_config_read*() only returns -1 when we access an
> invalid field. But in the VDUSE case, VDUSE_DEV_GET_CONFIG might fail
> when we access a valid field. Not sure if it's ok to silently ignore
> this kind of error.

That's a good point but it's a general VIRTIO issue. Any device
implementation (QEMU userspace, hardware vDPA, etc) can fail, so the
VIRTIO specification needs to provide a way for the driver to detect
this.

If userspace violates the contract then VDUSE needs to mark the device
broken. QEMU's device emulation does something similar with the
vdev->broken flag.

The VIRTIO Device Status field DEVICE_NEEDS_RESET bit can be set by
vDPA/VDUSE to indicate that the device is not operational and must be
reset.

The driver code may still process the -1 value read from the
configuration space. Hopefully this isn't a problem. There is currently
no VIRTIO interface besides DEVICE_NEEDS_RESET to indicate configuration
space access failures. On the other hand, drivers need to handle
malicious devices so they should be able to cope with the -1 value
anyway.

Stefan



Re: [PATCH v8 10/10] Documentation: Add documentation for VDUSE

2021-07-06 Thread Stefan Hajnoczi
On Tue, Jul 06, 2021 at 10:34:33AM +0800, Jason Wang wrote:
> 
> 在 2021/7/5 下午8:49, Stefan Hajnoczi 写道:
> > On Mon, Jul 05, 2021 at 11:36:15AM +0800, Jason Wang wrote:
> > > 在 2021/7/4 下午5:49, Yongji Xie 写道:
> > > > > > OK, I get you now. Since the VIRTIO specification says "Device
> > > > > > configuration space is generally used for rarely-changing or
> > > > > > initialization-time parameters". I assume the VDUSE_DEV_SET_CONFIG
> > > > > > ioctl should not be called frequently.
> > > > > The spec uses MUST and other terms to define the precise requirements.
> > > > > Here the language (especially the word "generally") is weaker and 
> > > > > means
> > > > > there may be exceptions.
> > > > > 
> > > > > Another type of access that doesn't work with the VDUSE_DEV_SET_CONFIG
> > > > > approach is reads that have side-effects. For example, imagine a field
> > > > > containing an error code if the device encounters a problem unrelated 
> > > > > to
> > > > > a specific virtqueue request. Reading from this field resets the error
> > > > > code to 0, saving the driver an extra configuration space write access
> > > > > and possibly race conditions. It isn't possible to implement those
> > > > > semantics using VDUSE_DEV_SET_CONFIG. It's another corner case, but it
> > > > > makes me think that the interface does not allow full VIRTIO 
> > > > > semantics.
> > > 
> > > Note that though you're correct, my understanding is that config space is
> > > not suitable for this kind of error propagating. And it would be very hard
> > > to implement such kind of semantic in some transports.  Virtqueue should 
> > > be
> > > much better. As Yong Ji quoted, the config space is used for
> > > "rarely-changing or intialization-time parameters".
> > > 
> > > 
> > > > Agreed. I will use VDUSE_DEV_GET_CONFIG in the next version. And to
> > > > handle the message failure, I'm going to add a return value to
> > > > virtio_config_ops.get() and virtio_cread_* API so that the error can
> > > > be propagated to the virtio device driver. Then the virtio-blk device
> > > > driver can be modified to handle that.
> > > > 
> > > > Jason and Stefan, what do you think of this way?
> > Why does VDUSE_DEV_GET_CONFIG need to support an error return value?
> > 
> > The VIRTIO spec provides no way for the device to report errors from
> > config space accesses.
> > 
> > The QEMU virtio-pci implementation returns -1 from invalid
> > virtio_config_read*() and silently discards virtio_config_write*()
> > accesses.
> > 
> > VDUSE can take the same approach with
> > VDUSE_DEV_GET_CONFIG/VDUSE_DEV_SET_CONFIG.
> > 
> > > I'd like to stick to the current assumption that get_config won't fail.
> > > That is to say,
> > > 
> > > 1) maintain a config in the kernel, make sure the config space read can
> > > always succeed
> > > 2) introduce an ioctl for the vduse userspace to update the config space.
> > > 3) we can synchronize with the vduse userspace during set_config
> > > 
> > > Does this work?
> > I noticed that caching is also allowed by the vhost-user protocol
> > messages (QEMU's docs/interop/vhost-user.rst), but the device doesn't
> > know whether or not caching is in effect. The interface you outlined
> > above requires caching.
> > 
> > Is there a reason why the host kernel vDPA code needs to cache the
> > configuration space?
> 
> 
> Because:
> 
> 1) Kernel can not wait forever in get_config(), this is the major difference
> with vhost-user.

virtio_cread() can sleep:

  #define virtio_cread(vdev, structname, member, ptr) \
  do {\
  typeof(((structname*)0)->member) virtio_cread_v;\
  \
  might_sleep();  \
  ^^

Which code path cannot sleep?

> 2) Stick to the current assumption that virtio_cread() should always
> succeed.

That can be done by reading -1 (like QEMU does) when the read fails.
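
In other words, something like this on the VDUSE side would keep
virtio_cread() infallible without caching (sketch only; the message helper
name is an assumption, not a real function in the patch set):

  /* If userspace never answers the config-read message, hand the driver
   * all-ones bytes, mirroring what QEMU's virtio-pci does for invalid
   * config space reads.
   */
  static void vduse_get_config_sketch(struct vduse_dev *dev,
                                      unsigned int offset,
                                      void *buf, unsigned int len)
  {
          if (vduse_dev_msg_get_config(dev, offset, buf, len) < 0)
                  memset(buf, 0xff, len);
  }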

Stefan



Re: [PATCH v8 10/10] Documentation: Add documentation for VDUSE

2021-07-05 Thread Stefan Hajnoczi
On Mon, Jul 05, 2021 at 11:36:15AM +0800, Jason Wang wrote:
> 
> 在 2021/7/4 下午5:49, Yongji Xie 写道:
> > > > OK, I get you now. Since the VIRTIO specification says "Device
> > > > configuration space is generally used for rarely-changing or
> > > > initialization-time parameters". I assume the VDUSE_DEV_SET_CONFIG
> > > > ioctl should not be called frequently.
> > > The spec uses MUST and other terms to define the precise requirements.
> > > Here the language (especially the word "generally") is weaker and means
> > > there may be exceptions.
> > > 
> > > Another type of access that doesn't work with the VDUSE_DEV_SET_CONFIG
> > > approach is reads that have side-effects. For example, imagine a field
> > > containing an error code if the device encounters a problem unrelated to
> > > a specific virtqueue request. Reading from this field resets the error
> > > code to 0, saving the driver an extra configuration space write access
> > > and possibly race conditions. It isn't possible to implement those
> > > semantics using VDUSE_DEV_SET_CONFIG. It's another corner case, but it
> > > makes me think that the interface does not allow full VIRTIO semantics.
> 
> 
> Note that though you're correct, my understanding is that config space is
> not suitable for this kind of error propagating. And it would be very hard
> to implement such kind of semantic in some transports.  Virtqueue should be
> much better. As Yong Ji quoted, the config space is used for
> "rarely-changing or intialization-time parameters".
> 
> 
> > Agreed. I will use VDUSE_DEV_GET_CONFIG in the next version. And to
> > handle the message failure, I'm going to add a return value to
> > virtio_config_ops.get() and virtio_cread_* API so that the error can
> > be propagated to the virtio device driver. Then the virtio-blk device
> > driver can be modified to handle that.
> > 
> > Jason and Stefan, what do you think of this way?

Why does VDUSE_DEV_GET_CONFIG need to support an error return value?

The VIRTIO spec provides no way for the device to report errors from
config space accesses.

The QEMU virtio-pci implementation returns -1 from invalid
virtio_config_read*() and silently discards virtio_config_write*()
accesses.

VDUSE can take the same approach with
VDUSE_DEV_GET_CONFIG/VDUSE_DEV_SET_CONFIG.

> I'd like to stick to the current assumption that get_config won't fail.
> That is to say,
> 
> 1) maintain a config in the kernel, make sure the config space read can
> always succeed
> 2) introduce an ioctl for the vduse userspace to update the config space.
> 3) we can synchronize with the vduse userspace during set_config
> 
> Does this work?

I noticed that caching is also allowed by the vhost-user protocol
messages (QEMU's docs/interop/vhost-user.rst), but the device doesn't
know whether or not caching is in effect. The interface you outlined
above requires caching.

Is there a reason why the host kernel vDPA code needs to cache the
configuration space?

Here are the vhost-user protocol messages:

  Virtio device config space
  ^^

  ++--+---+-+
  | offset | size | flags | payload |
  ++--+---+-+

  :offset: a 32-bit offset of virtio device's configuration space

  :size: a 32-bit configuration space access size in bytes

  :flags: a 32-bit value:
- 0: Vhost master messages used for writeable fields
- 1: Vhost master messages used for live migration

  :payload: Size bytes array holding the contents of the virtio
device's configuration space

  ...

  ``VHOST_USER_GET_CONFIG``
:id: 24
:equivalent ioctl: N/A
:master payload: virtio device config space
:slave payload: virtio device config space

When ``VHOST_USER_PROTOCOL_F_CONFIG`` is negotiated, this message is
submitted by the vhost-user master to fetch the contents of the
virtio device configuration space, vhost-user slave's payload size
MUST match master's request, vhost-user slave uses zero length of
payload to indicate an error to vhost-user master. The vhost-user
master may cache the contents to avoid repeated
``VHOST_USER_GET_CONFIG`` calls.

  ``VHOST_USER_SET_CONFIG``
:id: 25
:equivalent ioctl: N/A
:master payload: virtio device config space
:slave payload: N/A

When ``VHOST_USER_PROTOCOL_F_CONFIG`` is negotiated, this message is
submitted by the vhost-user master when the Guest changes the virtio
device configuration space and also can be used for live migration
on the destination host. The vhost-user slave must check the flags
field, and slaves MUST NOT accept SET_CONFIG for read-only
configuration space fields unless the live migration bit is set.

Stefan



Re: Re: Re: [PATCH v8 10/10] Documentation: Add documentation for VDUSE

2021-07-01 Thread Stefan Hajnoczi
On Thu, Jul 01, 2021 at 06:00:48PM +0800, Yongji Xie wrote:
> On Wed, Jun 30, 2021 at 6:06 PM Stefan Hajnoczi  wrote:
> >
> > On Tue, Jun 29, 2021 at 01:43:11PM +0800, Yongji Xie wrote:
> > > On Mon, Jun 28, 2021 at 9:02 PM Stefan Hajnoczi  
> > > wrote:
> > > > On Tue, Jun 15, 2021 at 10:13:31PM +0800, Xie Yongji wrote:
> > > > > + static void *iova_to_va(int dev_fd, uint64_t iova, uint64_t 
> > > > > *len)
> > > > > + {
> > > > > + int fd;
> > > > > + void *addr;
> > > > > + size_t size;
> > > > > + struct vduse_iotlb_entry entry;
> > > > > +
> > > > > + entry.start = iova;
> > > > > + entry.last = iova + 1;
> > > >
> > > > Why +1?
> > > >
> > > > I expected the request to include *len so that VDUSE can create a bounce
> > > > buffer for the full iova range, if necessary.
> > > >
> > >
> > > The function is used to translate iova to va. And the *len is not
> > > specified by the caller. Instead, it's used to tell the caller the
> > > length of the contiguous iova region from the specified iova. And the
> > > ioctl VDUSE_IOTLB_GET_FD will get the file descriptor to the first
> > > overlapped iova region. So using iova + 1 should be enough here.
> >
> > Does the entry.last field have any purpose with VDUSE_IOTLB_GET_FD? I
> > wonder why userspace needs to assign a value at all if it's always +1.
> >
> 
> If we need to get some iova regions in the specified range, we need
> the entry.last field. For example, we can use [0, ULONG_MAX] to get
> the first overlapped iova region which might be [4096, 8192]. But in
> this function, we don't use VDUSE_IOTLB_GET_FD like this. We need to
> get the iova region including the specified iova.

I see, thanks for explaining!
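
As an aside, the same ioctl can also enumerate every region. A sketch of
that pattern (error handling and the offset/perm fields omitted; assumes
struct vduse_iotlb_entry and VDUSE_IOTLB_GET_FD come from <linux/vduse.h>):

  #include <stdint.h>
  #include <unistd.h>
  #include <sys/ioctl.h>
  #include <linux/vduse.h>

  static void walk_iotlb(int dev_fd)
  {
          struct vduse_iotlb_entry entry;
          uint64_t next = 0;
          int fd;

          for (;;) {
                  entry.start = next;
                  entry.last = UINT64_MAX;
                  fd = ioctl(dev_fd, VDUSE_IOTLB_GET_FD, &entry);
                  if (fd < 0)
                          break;          /* no region at or after 'next' */
                  /* entry.start/entry.last now describe one mapping */
                  close(fd);
                  if (entry.last == UINT64_MAX)
                          break;
                  next = entry.last + 1;
          }
  }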

> > > > > + return addr + iova - entry.start;
> > > > > + }
> > > > > +
> > > > > +- VDUSE_DEV_GET_FEATURES: Get the negotiated features
> > > >
> > > > Are these VIRTIO feature bits? Please explain how feature negotiation
> > > > works. There must be a way for userspace to report the device's
> > > > supported feature bits to the kernel.
> > > >
> > >
> > > Yes, these are VIRTIO feature bits. Userspace will specify the
> > > device's supported feature bits when creating a new VDUSE device with
> > > ioctl(VDUSE_CREATE_DEV).
> >
> > Can the VDUSE device influence feature bit negotiation? For example, if
> > the VDUSE virtio-blk device does not implement discard/write-zeroes, how
> > does QEMU or the guest find out about this?
> >
> 
> There is a "features" field in struct vduse_dev_config which is used
> to do feature negotiation.

This approach is more restrictive than required by the VIRTIO
specification:

  "The device SHOULD accept any valid subset of features the driver
  accepts, otherwise it MUST fail to set the FEATURES_OK device status
  bit when the driver writes it."

  
https://docs.oasis-open.org/virtio/virtio/v1.1/cs01/virtio-v1.1-cs01.html#x1-130002

The spec allows a device to reject certain subsets of features. For
example, if feature B depends on feature A and can only be enabled when
feature A is also enabled.

From your description I think VDUSE would accept feature B without
feature A since the device implementation has no opportunity to fail
negotiation with custom logic.

Ideally VDUSE would send a SET_FEATURES message to userspace, allowing
the device implementation full flexibility in which subsets of features
to accept.

This is a corner case. Many or maybe even all existing VIRTIO devices
don't need this flexibility, but I want to point out this limitation in
the VDUSE interface because it may cause issues in the future.
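
To make the corner case concrete, this is the sort of device-side check the
current interface has no place for (standalone illustration; the virtio-net
feature bit numbers are taken from the specification):

  #include <stdbool.h>
  #include <stdint.h>

  #define VIRTIO_NET_F_CTRL_VQ 17    /* control channel available */
  #define VIRTIO_NET_F_MQ      22    /* multiqueue, depends on CTRL_VQ */

  /* Returns false when the driver picked feature B (MQ) without feature A
   * (CTRL_VQ), i.e. the device should refuse to set FEATURES_OK.
   */
  static bool features_subset_ok(uint64_t driver_features)
  {
          bool ctrl_vq = driver_features & (1ULL << VIRTIO_NET_F_CTRL_VQ);
          bool mq = driver_features & (1ULL << VIRTIO_NET_F_MQ);

          return !(mq && !ctrl_vq);
  }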

> > > > > +- VDUSE_DEV_UPDATE_CONFIG: Update the configuration space and inject 
> > > > > a config interrupt
> > > >
> > > > Does this mean the contents of the configuration space are cached by
> > > > VDUSE?
> > >
> > > Yes, but the kernel will also store the same contents.
> > >
> > > > The downside is that the userspace code cannot generate the
> > > > contents on demand. Most devices don't need to generate the contents
> > > > on demand, so I think this is okay but I had expected a different
> > > > interface:
> > > >
> > > > kernel->userspace VDUSE_DEV_GET_CONFIG
> 

Re: Re: [PATCH v8 10/10] Documentation: Add documentation for VDUSE

2021-06-30 Thread Stefan Hajnoczi
On Tue, Jun 29, 2021 at 01:43:11PM +0800, Yongji Xie wrote:
> On Mon, Jun 28, 2021 at 9:02 PM Stefan Hajnoczi  wrote:
> > On Tue, Jun 15, 2021 at 10:13:31PM +0800, Xie Yongji wrote:
> > > + static void *iova_to_va(int dev_fd, uint64_t iova, uint64_t *len)
> > > + {
> > > + int fd;
> > > + void *addr;
> > > + size_t size;
> > > + struct vduse_iotlb_entry entry;
> > > +
> > > + entry.start = iova;
> > > + entry.last = iova + 1;
> >
> > Why +1?
> >
> > I expected the request to include *len so that VDUSE can create a bounce
> > buffer for the full iova range, if necessary.
> >
> 
> The function is used to translate iova to va. And the *len is not
> specified by the caller. Instead, it's used to tell the caller the
> length of the contiguous iova region from the specified iova. And the
> ioctl VDUSE_IOTLB_GET_FD will get the file descriptor to the first
> overlapped iova region. So using iova + 1 should be enough here.

Does the entry.last field have any purpose with VDUSE_IOTLB_GET_FD? I
wonder why userspace needs to assign a value at all if it's always +1.

> 
> > > + fd = ioctl(dev_fd, VDUSE_IOTLB_GET_FD, &entry);
> > > + if (fd < 0)
> > > + return NULL;
> > > +
> > > + size = entry.last - entry.start + 1;
> > > + *len = entry.last - iova + 1;
> > > + addr = mmap(0, size, perm_to_prot(entry.perm), MAP_SHARED,
> > > + fd, entry.offset);
> > > + close(fd);
> > > + if (addr == MAP_FAILED)
> > > + return NULL;
> > > +
> > > + /* do something to cache this iova region */
> >
> > How is userspace expected to manage iotlb mmaps? When should munmap(2)
> > be called?
> >
> 
> The simple way is using a list to store the iotlb mappings. And we
> should call the munmap(2) for the old mappings when VDUSE_UPDATE_IOTLB
> or VDUSE_STOP_DATAPLANE message is received.

Thanks for explaining. It would be helpful to have a description of
IOTLB operation in this document.

> > Should userspace expect VDUSE_IOTLB_GET_FD to return a full chunk of
> > guest RAM (e.g. multiple gigabytes) that can be cached permanently or
> > will it return just enough pages to cover [start, last)?
> >
> 
> It should return one iotlb mapping that covers [start, last). In
> vhost-vdpa cases, it might be a full chunk of guest RAM. In
> virtio-vdpa cases, it might be the whole bounce buffer or one coherent
> mapping (produced by dma_alloc_coherent()).

Great, thanks. Adding something about this to the documentation would
help others implementing VDUSE devices or libraries.
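
For other implementers, the bookkeeping described above can be as simple as
this (struct and function names are made up):

  #include <stdint.h>
  #include <stdlib.h>
  #include <sys/mman.h>

  /* One cached IOTLB mapping obtained via VDUSE_IOTLB_GET_FD + mmap(). */
  struct iova_map {
          struct iova_map *next;
          uint64_t start;     /* first IOVA covered */
          uint64_t last;      /* last IOVA covered */
          void *addr;         /* mmap()ed address */
          size_t size;        /* length of the mapping in bytes */
  };

  static struct iova_map *iova_maps;

  /* Drop every cached mapping, e.g. when VDUSE_UPDATE_IOTLB or
   * VDUSE_STOP_DATAPLANE is received.
   */
  static void drop_iova_maps(void)
  {
          struct iova_map *m, *next;

          for (m = iova_maps; m; m = next) {
                  next = m->next;
                  munmap(m->addr, m->size);
                  free(m);
          }
          iova_maps = NULL;
  }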

> > > +
> > > + return addr + iova - entry.start;
> > > + }
> > > +
> > > +- VDUSE_DEV_GET_FEATURES: Get the negotiated features
> >
> > Are these VIRTIO feature bits? Please explain how feature negotiation
> > works. There must be a way for userspace to report the device's
> > supported feature bits to the kernel.
> >
> 
> Yes, these are VIRTIO feature bits. Userspace will specify the
> device's supported feature bits when creating a new VDUSE device with
> ioctl(VDUSE_CREATE_DEV).

Can the VDUSE device influence feature bit negotiation? For example, if
the VDUSE virtio-blk device does not implement discard/write-zeroes, how
does QEMU or the guest find out about this?

> > > +- VDUSE_DEV_UPDATE_CONFIG: Update the configuration space and inject a 
> > > config interrupt
> >
> > Does this mean the contents of the configuration space are cached by
> > VDUSE?
> 
> Yes, but the kernel will also store the same contents.
> 
> > The downside is that the userspace code cannot generate the
> > contents on demand. Most devices don't need to generate the contents
> > on demand, so I think this is okay but I had expected a different
> > interface:
> >
> > kernel->userspace VDUSE_DEV_GET_CONFIG
> > userspace->kernel VDUSE_DEV_INJECT_CONFIG_IRQ
> >
> 
> The problem is how to handle the failure of VDUSE_DEV_GET_CONFIG. We
> will need lots of modification of virtio codes to support that. So to
> make it simple, we choose this way:
> 
> userspace -> kernel VDUSE_DEV_SET_CONFIG
> userspace -> kernel VDUSE_DEV_INJECT_CONFIG_IRQ
> 
> > I think you can leave it the way it is, but I wanted to mention this in
> > case someone thinks it's important to support generating the contents of

Re: Re: [PATCH v8 09/10] vduse: Introduce VDUSE - vDPA Device in Userspace

2021-06-30 Thread Stefan Hajnoczi
On Tue, Jun 29, 2021 at 10:59:51AM +0800, Yongji Xie wrote:
> On Mon, Jun 28, 2021 at 9:02 PM Stefan Hajnoczi  wrote:
> >
> > On Tue, Jun 15, 2021 at 10:13:30PM +0800, Xie Yongji wrote:
> > > +/* ioctls */
> > > +
> > > +struct vduse_dev_config {
> > > + char name[VDUSE_NAME_MAX]; /* vduse device name */
> > > + __u32 vendor_id; /* virtio vendor id */
> > > + __u32 device_id; /* virtio device id */
> > > + __u64 features; /* device features */
> > > + __u64 bounce_size; /* bounce buffer size for iommu */
> > > + __u16 vq_size_max; /* the max size of virtqueue */
> >
> > The VIRTIO specification allows per-virtqueue sizes. A device can have
> > two virtqueues, where the first one allows up to 1024 descriptors and
> > the second one allows only 128 descriptors, for example.
> >
> 
> Good point! But it looks like virtio-vdpa/virtio-pci doesn't support
> that now. All virtqueues have the same maximum size.

I see struct vdpa_config_ops only supports a per-device max vq size:
u16 (*get_vq_num_max)(struct vdpa_device *vdev);

virtio-pci supports per-virtqueue sizes because the struct
virtio_pci_common_cfg->queue_size register is per-queue (controlled by
queue_select).

I guess this is a question for Jason: will vdpa keep this limitation?
If yes, then VDUSE can stick to it too without running into problems in
the future.

> > > + __u16 padding; /* padding */
> > > + __u32 vq_num; /* the number of virtqueues */
> > > + __u32 vq_align; /* the allocation alignment of virtqueue's metadata 
> > > */
> >
> > I'm not sure what this is?
> >
> 
>  This will be used by vring_create_virtqueue() too.

If there is no official definition for the meaning of this value then
"/* same as vring_create_virtqueue()'s vring_align parameter */" would
be clearer. That way the reader knows what to research in order to
understand how this field works.

I don't remember but maybe it was used to support vrings when the
host/guest have non-4KB page sizes. I wonder if anyone has an official
definition for this value?



Re: [PATCH v8 00/10] Introduce VDUSE - vDPA Device in Userspace

2021-06-28 Thread Stefan Hajnoczi
On Tue, Jun 15, 2021 at 10:13:21PM +0800, Xie Yongji wrote:
> This series introduces a framework that makes it possible to implement
> software-emulated vDPA devices in userspace. And to make it simple, the
> emulated vDPA device's control path is handled in the kernel and only the
> data path is implemented in the userspace.

This looks interesting. Unfortunately I don't have enough time to do a
full review, but I looked at the documentation and uapi header file to
give feedback on the userspace ABI.

Stefan



Re: [PATCH v8 10/10] Documentation: Add documentation for VDUSE

2021-06-28 Thread Stefan Hajnoczi
On Tue, Jun 15, 2021 at 10:13:31PM +0800, Xie Yongji wrote:
> VDUSE (vDPA Device in Userspace) is a framework to support
> implementing software-emulated vDPA devices in userspace. This
> document is intended to clarify the VDUSE design and usage.
> 
> Signed-off-by: Xie Yongji 
> ---
>  Documentation/userspace-api/index.rst |   1 +
>  Documentation/userspace-api/vduse.rst | 222 
> ++
>  2 files changed, 223 insertions(+)
>  create mode 100644 Documentation/userspace-api/vduse.rst
> 
> diff --git a/Documentation/userspace-api/index.rst 
> b/Documentation/userspace-api/index.rst
> index 0b5eefed027e..c432be070f67 100644
> --- a/Documentation/userspace-api/index.rst
> +++ b/Documentation/userspace-api/index.rst
> @@ -27,6 +27,7 @@ place where this information is gathered.
> iommu
> media/index
> sysfs-platform_profile
> +   vduse
>  
>  .. only::  subproject and html
>  
> diff --git a/Documentation/userspace-api/vduse.rst 
> b/Documentation/userspace-api/vduse.rst
> new file mode 100644
> index ..2f9cd1a4e530
> --- /dev/null
> +++ b/Documentation/userspace-api/vduse.rst
> @@ -0,0 +1,222 @@
> +==
> +VDUSE - "vDPA Device in Userspace"
> +==
> +
> +vDPA (virtio data path acceleration) device is a device that uses a
> +datapath which complies with the virtio specifications with vendor
> +specific control path. vDPA devices can be both physically located on
> +the hardware or emulated by software. VDUSE is a framework that makes it
> +possible to implement software-emulated vDPA devices in userspace. And
> +to make it simple, the emulated vDPA device's control path is handled in
> +the kernel and only the data path is implemented in the userspace.
> +
> +Note that only virtio block device is supported by VDUSE framework now,
> +which can reduce security risks when the userspace process that implements
> +the data path is run by an unprivileged user. The support for other device
> +types can be added after the security issue is clarified or fixed in the 
> future.
> +
> +Start/Stop VDUSE devices
> +
> +
> +VDUSE devices are started as follows:
> +
> +1. Create a new VDUSE instance with ioctl(VDUSE_CREATE_DEV) on
> +   /dev/vduse/control.
> +
> +2. Begin processing VDUSE messages from /dev/vduse/$NAME. The first
> +   messages will arrive while attaching the VDUSE instance to vDPA bus.
> +
> +3. Send the VDPA_CMD_DEV_NEW netlink message to attach the VDUSE
> +   instance to vDPA bus.
> +
> +VDUSE devices are stopped as follows:
> +
> +1. Send the VDPA_CMD_DEV_DEL netlink message to detach the VDUSE
> +   instance from vDPA bus.
> +
> +2. Close the file descriptor referring to /dev/vduse/$NAME
> +
> +3. Destroy the VDUSE instance with ioctl(VDUSE_DESTROY_DEV) on
> +   /dev/vduse/control
> +
> +The netlink messages mentioned above can be sent via vdpa tool in iproute2
> +or use the below sample codes:
> +
> +.. code-block:: c
> +
> + static int netlink_add_vduse(const char *name, enum vdpa_command cmd)
> + {
> + struct nl_sock *nlsock;
> + struct nl_msg *msg;
> + int famid;
> +
> + nlsock = nl_socket_alloc();
> + if (!nlsock)
> + return -ENOMEM;
> +
> + if (genl_connect(nlsock))
> + goto free_sock;
> +
> + famid = genl_ctrl_resolve(nlsock, VDPA_GENL_NAME);
> + if (famid < 0)
> + goto close_sock;
> +
> + msg = nlmsg_alloc();
> + if (!msg)
> + goto close_sock;
> +
> + if (!genlmsg_put(msg, NL_AUTO_PORT, NL_AUTO_SEQ, famid, 0, 0, 
> cmd, 0))
> + goto nla_put_failure;
> +
> + NLA_PUT_STRING(msg, VDPA_ATTR_DEV_NAME, name);
> + if (cmd == VDPA_CMD_DEV_NEW)
> + NLA_PUT_STRING(msg, VDPA_ATTR_MGMTDEV_DEV_NAME, 
> "vduse");
> +
> + if (nl_send_sync(nlsock, msg))
> + goto close_sock;
> +
> + nl_close(nlsock);
> + nl_socket_free(nlsock);
> +
> + return 0;
> + nla_put_failure:
> + nlmsg_free(msg);
> + close_sock:
> + nl_close(nlsock);
> + free_sock:
> + nl_socket_free(nlsock);
> + return -1;
> + }
> +
> +How VDUSE works
> +---
> +
> +Since the emuldated vDPA device's control path is handled in the kernel,

s/emuldated/emulated/

> +a message-based communication protocol and few types of control messages
> +are introduced by VDUSE framework to make userspace be aware of the data
> +path related changes:
> +
> +- VDUSE_GET_VQ_STATE: Get the state for virtqueue from userspace
> +
> +- VDUSE_START_DATAPLANE: Notify userspace to start the dataplane
> +
> +- VDUSE_STOP_DATAPLANE: Notify userspace to stop the dataplane
> +
> +- VDUSE_UPDATE_IOTLB: Notify userspace to update 

Re: [PATCH v8 09/10] vduse: Introduce VDUSE - vDPA Device in Userspace

2021-06-28 Thread Stefan Hajnoczi
On Tue, Jun 15, 2021 at 10:13:30PM +0800, Xie Yongji wrote:
> diff --git a/include/uapi/linux/vduse.h b/include/uapi/linux/vduse.h
> new file mode 100644
> index ..f21b2e51b5c8
> --- /dev/null
> +++ b/include/uapi/linux/vduse.h
> @@ -0,0 +1,143 @@
> +/* SPDX-License-Identifier: GPL-2.0 WITH Linux-syscall-note */
> +#ifndef _UAPI_VDUSE_H_
> +#define _UAPI_VDUSE_H_
> +
> +#include 
> +
> +#define VDUSE_API_VERSION 0
> +
> +#define VDUSE_NAME_MAX   256
> +
> +/* the control messages definition for read/write */
> +
> +enum vduse_req_type {
> + /* Get the state for virtqueue from userspace */
> + VDUSE_GET_VQ_STATE,
> + /* Notify userspace to start the dataplane, no reply */
> + VDUSE_START_DATAPLANE,
> + /* Notify userspace to stop the dataplane, no reply */
> + VDUSE_STOP_DATAPLANE,
> + /* Notify userspace to update the memory mapping in device IOTLB */
> + VDUSE_UPDATE_IOTLB,
> +};
> +
> +struct vduse_vq_state {
> + __u32 index; /* virtqueue index */
> + __u32 avail_idx; /* virtqueue state (last_avail_idx) */
> +};
> +
> +struct vduse_iova_range {
> + __u64 start; /* start of the IOVA range */
> + __u64 last; /* end of the IOVA range */

Please clarify whether this describes a closed range [start, last] or an
open range [start, last).

> +};
> +
> +struct vduse_dev_request {
> + __u32 type; /* request type */
> + __u32 request_id; /* request id */
> +#define VDUSE_REQ_FLAGS_NO_REPLY (1 << 0) /* No need to reply */
> + __u32 flags; /* request flags */
> + __u32 reserved; /* for future use */
> + union {
> + struct vduse_vq_state vq_state; /* virtqueue state */
> + struct vduse_iova_range iova; /* iova range for updating */
> + __u32 padding[16]; /* padding */
> + };
> +};
> +
> +struct vduse_dev_response {
> + __u32 request_id; /* corresponding request id */
> +#define VDUSE_REQ_RESULT_OK  0x00
> +#define VDUSE_REQ_RESULT_FAILED  0x01
> + __u32 result; /* the result of request */
> + __u32 reserved[2]; /* for future use */
> + union {
> + struct vduse_vq_state vq_state; /* virtqueue state */
> + __u32 padding[16]; /* padding */
> + };
> +};
> +
> +/* ioctls */
> +
> +struct vduse_dev_config {
> + char name[VDUSE_NAME_MAX]; /* vduse device name */
> + __u32 vendor_id; /* virtio vendor id */
> + __u32 device_id; /* virtio device id */
> + __u64 features; /* device features */
> + __u64 bounce_size; /* bounce buffer size for iommu */
> + __u16 vq_size_max; /* the max size of virtqueue */

The VIRTIO specification allows per-virtqueue sizes. A device can have
two virtqueues, where the first one allows up to 1024 descriptors and
the second one allows only 128 descriptors, for example.

This constant seems to impose the constraint that all virtqueues have
the same maximum size. Is this really necessary?
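
For comparison, a per-queue limit could be expressed with something like the
following (purely illustrative, not part of the proposed UAPI):

  #include <linux/types.h>

  /* Hypothetical alternative: each queue reports its own maximum instead
   * of a single device-wide vq_size_max.
   */
  struct vduse_vq_config {
          __u32 index;        /* virtqueue index */
          __u16 max_size;     /* max number of descriptors for this queue */
          __u16 padding;      /* padding */
  };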

> + __u16 padding; /* padding */
> + __u32 vq_num; /* the number of virtqueues */
> + __u32 vq_align; /* the allocation alignment of virtqueue's metadata */

I'm not sure what this is?

> + __u32 config_size; /* the size of the configuration space */
> + __u32 reserved[15]; /* for future use */
> + __u8 config[0]; /* the buffer of the configuration space */
> +};
> +
> +struct vduse_iotlb_entry {
> + __u64 offset; /* the mmap offset on fd */
> + __u64 start; /* start of the IOVA range */
> + __u64 last; /* last of the IOVA range */

Same here, please specify whether this is an open range or a closed
range.

> +#define VDUSE_ACCESS_RO 0x1
> +#define VDUSE_ACCESS_WO 0x2
> +#define VDUSE_ACCESS_RW 0x3
> + __u8 perm; /* access permission of this range */
> +};
> +
> +struct vduse_config_update {
> + __u32 offset; /* offset from the beginning of configuration space */
> + __u32 length; /* the length to write to configuration space */
> + __u8 buffer[0]; /* buffer used to write from */
> +};
> +
> +struct vduse_vq_info {
> + __u32 index; /* virtqueue index */
> + __u32 avail_idx; /* virtqueue state (last_avail_idx) */
> + __u64 desc_addr; /* address of desc area */
> + __u64 driver_addr; /* address of driver area */
> + __u64 device_addr; /* address of device area */
> + __u32 num; /* the size of virtqueue */
> + __u8 ready; /* ready status of virtqueue */
> +};
> +
> +struct vduse_vq_eventfd {
> + __u32 index; /* virtqueue index */
> +#define VDUSE_EVENTFD_DEASSIGN -1
> + int fd; /* eventfd, -1 means de-assigning the eventfd */
> +};
> +
> +#define VDUSE_BASE   0x81
> +
> +/* Get the version of VDUSE API. This is used for future extension */
> +#define VDUSE_GET_API_VERSION _IOR(VDUSE_BASE, 0x00, __u64)
> +
> +/* Set the version of VDUSE API. */
> +#define VDUSE_SET_API_VERSION _IOW(VDUSE_BASE, 0x01, __u64)
> +
> +/* Create a vduse device which is represented by a char device 
> (/dev/vduse/) */
> +#define 

Re: [PATCH 0/3] kthread: pass in user and check RLIMIT_NPROC

2021-06-28 Thread Stefan Hajnoczi
On Wed, Jun 23, 2021 at 10:08:01PM -0500, Mike Christie wrote:
> The vhost driver will create a kthread when userspace does a
> VHOST_SET_OWNER ioctl, but the thread is charged to the kthreadd thread.
> We can then end up violating the userspace process's RLIMIT_NPROC. This
> patchset allows drivers to pass in the user to charge/check.
> 
> The patches were made over Linus's current tree.

Makes sense from a vhost perspective and for future users, but I'm not
familiar with the kthread internals:

Acked-by: Stefan Hajnoczi 



Re: [PATCH v3] virtio-blk: Add validation for block size in config space

2021-06-22 Thread Stefan Hajnoczi
On Thu, Jun 17, 2021 at 01:10:04PM +0800, Xie Yongji wrote:
> This ensures that we will not use an invalid block size
> in config space (might come from an untrusted device).
> 
> Signed-off-by: Xie Yongji 
> ---
>  drivers/block/virtio_blk.c | 29 +++--
>  1 file changed, 23 insertions(+), 6 deletions(-)
> 
> diff --git a/drivers/block/virtio_blk.c b/drivers/block/virtio_blk.c
> index b9fa3ef5b57c..bbdae989f1ea 100644
> --- a/drivers/block/virtio_blk.c
> +++ b/drivers/block/virtio_blk.c
> @@ -696,6 +696,28 @@ static const struct blk_mq_ops virtio_mq_ops = {
>  static unsigned int virtblk_queue_depth;
>  module_param_named(queue_depth, virtblk_queue_depth, uint, 0444);
>  
> +static int virtblk_validate(struct virtio_device *vdev)
> +{
> + u32 blk_size;
> +
> + if (!vdev->config->get) {
> + dev_err(>dev, "%s failure: config access disabled\n",
> + __func__);
> + return -EINVAL;
> + }
> +
> + if (!virtio_has_feature(vdev, VIRTIO_BLK_F_BLK_SIZE))
> + return 0;
> +
> + blk_size = virtio_cread32(vdev,
> + offsetof(struct virtio_blk_config, blk_size));
> +
> + if (blk_size < SECTOR_SIZE || blk_size > PAGE_SIZE)
> + __virtio_clear_bit(vdev, VIRTIO_BLK_F_BLK_SIZE);
> +
> + return 0;
> +}

I saw Michael asked for .validate() in v2. I would prefer to keep
everything in virtblk_probe() instead of adding .validate() because:

- There is a race condition that an untrusted device can exploit since
  virtblk_probe() fetches the value again.

- It's more complex now that .validate() takes a first shot at blk_size
  and then virtblk_probe() deals with it again later on.
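
Concretely, I mean something along these lines next to the blk_size read
that virtblk_probe() already does (untested sketch):

  /* Sketch: validate the value probe already reads instead of fetching it
   * again in a separate .validate() callback.
   */
  err = virtio_cread_feature(vdev, VIRTIO_BLK_F_BLK_SIZE,
                             struct virtio_blk_config, blk_size,
                             &blk_size);
  if (!err) {
          if (blk_size < SECTOR_SIZE || blk_size > PAGE_SIZE) {
                  dev_err(&vdev->dev, "invalid block size %u, ignoring\n",
                          blk_size);
                  blk_size = SECTOR_SIZE;
          }
          blk_queue_logical_block_size(q, blk_size);
  }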



Re: [PATCH 7/9] vhost: allow userspace to create workers

2021-06-21 Thread Stefan Hajnoczi
On Thu, Jun 17, 2021 at 09:49:07PM -0500, Mike Christie wrote:
> Again it's io_uring. Check out fs/io_uring.c:__io_account_mem(). For 
> RLIMIT_MEMLOCK
> it just does the check and increments the user's counter itself. It's simple 
> like
> option 2, and it handles the issue where the process doing the ioctl wasn't 
> having
> its RLIMIT_NPROC checked/updated.

This can work too. It doesn't cover cases where code called indirectly
acquires resources, but that's probably fine for the vhost worker thread
case.

Stefan



Re: [PATCH 7/9] vhost: allow userspace to create workers

2021-06-10 Thread Stefan Hajnoczi
On Wed, Jun 09, 2021 at 04:03:55PM -0500, Mike Christie wrote:
> On 6/7/21 10:19 AM, Stefan Hajnoczi wrote:
> > My concern is that threads should probably be accounted against
> > RLIMIT_NPROC and max_threads rather than something indirect like 128 *
> > RLIMIT_NOFILE (a userspace process can only have RLIMIT_NOFILE
> > vhost-user file descriptors open).
> > 
> 
> Ah ok, I see what you want I think.
> 
> Ok, I think the options are:
> 
> 0. Nothing. Just use existing indirect/RLIMIT_NOFILE.
> 
> 1. Do something like io_uring's create_io_thread/copy_process. If we call
> copy_process from the vhost ioctl context, then the userspace process that
> did the ioctl will have it's processes count incremented and checked against
> its rlimit.
> 
> The drawbacks:
> - This gets a little more complicated than just calling copy_process though.
> We end up duplicating a lot of the kthread API.
> - We have to deal with new error cases like the parent exiting early.
> - I think all devs sharing a worker have to have the same owner. 
> kthread_use_mm
> and kthread_unuse_mm to switch between mm's for differrent owner's devs seem 
> to
> be causing lots of errors. I'm still looking into this one though.
> 
> 2.  It's not really what you want, but for unbound work io_uring has a check 
> for
> RLIMIT_NPROC in the io_uring code. It does:
> 
> wqe->acct[IO_WQ_ACCT_UNBOUND].max_workers =
>   task_rlimit(current, RLIMIT_NPROC);
> 
> then does:
> 
> if (!ret && acct->nr_workers < acct->max_workers) {
> 
> Drawbacks:
> In vhost.c, we could do something similar. It would make sure that vhost.c 
> does
> not create more worker threads than the rlimit value, but we wouldn't be
> incrementing the userspace process's process count. The userspace process 
> could
> then create RLIMIT_NPROC threads and vhost.c could also create RLIMIT_NPROC
> threads, so we end up with 2 * RLIMIT_NPROC threads.

Yes, in that case we might as well go with Option 0, so I think this
option can be eliminated.

> 3. Change the kthread and copy_process code so we can pass in the thread
> (or it's creds or some struct that has the values that need to be check) that
> needs to be checked and updated.
> 
> Drawback:
> This might be considered too ugly for how special case vhost is. For example, 
> we
> need checks/code like the io_thread/PF_IO_WORKER code in copy_process for 
> io_uring.
> I can see how added that for io_uring because it affects so many users, but I 
> can
> see how vhost is not special enough.

This seems like the most general solution. If you try it and get
negative feedback then maybe the maintainers can help suggest how to
solve this problem :).

Stefan



Re: vhost: multiple worker support

2021-06-07 Thread Stefan Hajnoczi
On Sat, Jun 05, 2021 at 05:40:17PM -0500, michael.chris...@oracle.com wrote:
> On 6/3/21 5:16 PM, Mike Christie wrote:
> > On 6/3/21 9:37 AM, Stefan Hajnoczi wrote:
> >> On Tue, May 25, 2021 at 01:05:51PM -0500, Mike Christie wrote:
> >>> The following patches apply over linus's tree or mst's vhost branch
> >>> and my cleanup patchset:
> >>>
> >>> https://urldefense.com/v3/__https://lists.linuxfoundation.org/pipermail/virtualization/2021-May/054354.html__;!!GqivPVa7Brio!P55eslnMW_iZkoTUZckwhnSw_9Z35JBScgtSYRh4CMFTRkSsWwKOYqY0huUfBfIPM8BV$
> >>>  
> >>>
> >>> These patches allow us to support multiple vhost workers per device. I
> >>> ended up just doing Stefan's original idea where userspace has the
> >>> kernel create a worker and we pass back the pid. This has the benefit
> >>> over the workqueue and userspace thread approach where we only have
> >>> one'ish code path in the kernel during setup to detect old tools. The
> >>> main IO paths and device/vq setup/teardown paths all use common code.
> >>>
> >>> The kernel patches here allow us to then do N workers device and also
> >>> share workers across devices.
> >>>
> >>> I've also included a patch for qemu so you can get an idea of how it
> >>> works. If we are ok with the kernel code then I'll break that up into
> >>> a patchset and send to qemu-devel.
> >>
> >> It seems risky to allow userspace process A to "share" a vhost worker
> >> thread with userspace process B based on a matching pid alone. Should
> >> they have ptrace_may_access() or similar?
> >>
> > 
> > I'm not sure. I already made it a little restrictive in this posting, but
> 
> Made a mistake here. In this posting I did not make it restrictive and
> I was allowing any old 2 processes to share. So we would need something
> like ptrace_may_access if go this route.
> 
> If we restrict sharing workers with the same owner, then I'm not sure if
> need anything.

Agreed.

Sharing between processes becomes most interesting when there is busy
polling (because it consumes CPU and we should consolidate polling onto
as few CPUs as possible). Without polling we can just share the threads
within a process.
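
If cross-process sharing does come back later, a minimal gate could look
like this (sketch; the worker fields are assumptions about your structures):

  /* Before attaching to a worker created by another process, require that
   * the caller could ptrace the worker task.
   */
  if (!ptrace_may_access(worker->task, PTRACE_MODE_READ_REALCREDS))
          return ERR_PTR(-EPERM);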

Stefan



Re: [PATCH 7/9] vhost: allow userspace to create workers

2021-06-07 Thread Stefan Hajnoczi
On Sat, Jun 05, 2021 at 06:53:58PM -0500, michael.chris...@oracle.com wrote:
> On 6/3/21 9:30 AM, Stefan Hajnoczi wrote:
> >> +  if (info->pid == VHOST_VRING_NEW_WORKER) {
> >> +  worker = vhost_worker_create(dev);
> > 
> > The maximum number of kthreads created is limited by
> > vhost_dev_init(nvqs)? For example VHOST_SCSI_MAX_VQ 128.
> > 
> > IIUC kthread_create is not limited by or accounted against the current
> > task, so I'm a little worried that a process can create a lot of
> > kthreads.
> > 
> > I haven't investigated other kthread_create() users reachable from
> > userspace applications to see how they bound the number of threads
> > effectively.
> 
> Do we want something like io_uring's copy_process use? It's what fork uses,
> so we get checks like RLIMIT_NPROC and max_threads.
> 
> I know I didn't look at everything, but it looks like for some software
> drivers we just allow the user to run wild. For example for nbd, when we
> create the device to do alloc_workqueue and use the default max_active
> value (256). We then don't have a limit on connections, so we could end
> up with 256 workqueue threads per device. And then there is no limit on
> devices a user can make.
> 
> 
> > 
> > Any thoughts?
> >
> 
> Is the concern a bad VM could create N devs each with 128 vqs/threads
> and it would slow down other VMs? How do we handle the case where
> some VM makes M * N devs and that is equal to N * 128 so we would end
> up with the same number of threads either way? Is there a limit to the
> number of vhost devices a VM can make and can I just stick in a similar
> check for workers?
> 
> For vhost-scsi specifically, the 128 limit does not make a lot of sense.
> I think we want the max to be the number of vCPUs the VM has so we can
> add checks for that. Then we would assume someone making a VM with lots of
> CPUs is going to have the resources to support them.
> 
> Note: It does make sense from the point of view that we don't know the
> number of vCPUs when vhost-scsi calls vhost_dev_init, so I get we had to
> select an initial limit.

My concern is that threads should probably be accounted against
RLIMIT_NPROC and max_threads rather than something indirect like 128 *
RLIMIT_NOFILE (a userspace process can only have RLIMIT_NOFILE
vhost-user file descriptors open).

> >> +  if (!dev->workers) {
> >> +  vhost_worker_put(worker);
> >> +  return -ENOMEM;
> >> +  }
> >> +  }
> >> +
> >> +  vq->worker = worker;
> >> +
> >> +  dev->workers[dev->num_workers] = worker;
> >> +  dev->num_workers++;
> > 
> > Hmm...should we really append to workers[] in the vhost_worker_find()
> > case?
> 
> 
> As it's coded now, yes. Every successful vhost_worker_find call does a
> get on the worker's refcount. Later when we delete the device, we loop
> over the workers array and for every entry we do a put.
> 
> I can add in some code to first check if the worker is already in the
> dev's worker list. If so then skip the refcount and skip adding to the
> workers array. If not in the dev's worker list then do a vhost_worker_find.
> 
> I thought it might be nicer how it is now with the single path. It's less
> code at least. Later if we add support to change a vq's worker then we also
> don't have to worry about refcounts as much. We just always drop the count
> taken from when it was added.

Thanks for explaining.

Stefan



Re: [PATCH 0/3] virtio_blk: blk-mq io_poll support

2021-06-03 Thread Stefan Hajnoczi
On Thu, May 20, 2021 at 03:13:02PM +0100, Stefan Hajnoczi wrote:
> This patch series implements blk_mq_ops->poll() so REQ_HIPRI requests can be
> polled. IOPS for 4k and 16k block sizes increases by 5-18% on a virtio-blk
> device with 4 virtqueues backed by an NVMe drive.
> 
> - Benchmark: fio ioengine=pvsync2 numjobs=4 direct=1
> - Guest: 4 vCPUs with one virtio-blk device (4 virtqueues)
> - Disk: Intel Corporation NVMe Datacenter SSD [Optane] [8086:2701]
> - CPU: Intel(R) Xeon(R) Silver 4214 CPU @ 2.20GHz
> 
> rw  bs hipri=0 hipri=1
> --
> randread4k 149,426 170,763 +14%
> randread   16k 118,939 134,269 +12%
> randread   64k  34,886  34,906   0%
> randread  128k  17,655  17,667   0%
> randwrite   4k 138,578 163,600 +18%
> randwrite  16k 102,089 120,950 +18%
> randwrite  64k  32,364  32,561   0%
> randwrite 128k  16,154  16,237   0%
> read4k 146,032 170,620 +16%
> read   16k 117,097 130,437 +11%
> read   64k  34,834  35,037   0%
> read  128k  17,680  17,658   0%
> write   4k 134,562 151,422 +12%
> write  16k 101,796 107,606  +5%
> write  64k  32,364  32,594   0%
> write 128k  16,259  16,265   0%
> 
> Larger block sizes do not benefit from polling as much but the
> improvement is worthwhile for smaller block sizes.
> 
> Stefan Hajnoczi (3):
>   virtio: add virtioqueue_more_used()
>   virtio_blk: avoid repeating vblk->vqs[qid]
>   virtio_blk: implement blk_mq_ops->poll()
> 
>  include/linux/virtio.h   |   2 +
>  drivers/block/virtio_blk.c   | 126 +--
>  drivers/virtio/virtio_ring.c |  17 +
>  3 files changed, 123 insertions(+), 22 deletions(-)

Christoph and Jens: Any more thoughts on this irq toggling approach?

Stefan



Re: [PATCH 3/3] virtio_blk: implement blk_mq_ops->poll()

2021-06-03 Thread Stefan Hajnoczi
On Thu, May 27, 2021 at 01:48:36PM +0800, Jason Wang wrote:
> 
> 在 2021/5/25 下午4:59, Stefan Hajnoczi 写道:
> > On Tue, May 25, 2021 at 11:21:41AM +0800, Jason Wang wrote:
> > > 在 2021/5/20 下午10:13, Stefan Hajnoczi 写道:
> > > > Request completion latency can be reduced by using polling instead of
> > > > irqs. Even Posted Interrupts or similar hardware support doesn't beat
> > > > polling. The reason is that disabling virtqueue notifications saves
> > > > critical-path CPU cycles on the host by skipping irq injection and in
> > > > the guest by skipping the irq handler. So let's add blk_mq_ops->poll()
> > > > support to virtio_blk.
> > > > 
> > > > The approach taken by this patch differs from the NVMe driver's
> > > > approach. NVMe dedicates hardware queues to polling and submits
> > > > REQ_HIPRI requests only on those queues. This patch does not require
> > > > exclusive polling queues for virtio_blk. Instead, it switches between
> > > > irqs and polling when one or more REQ_HIPRI requests are in flight on a
> > > > virtqueue.
> > > > 
> > > > This is possible because toggling virtqueue notifications is cheap even
> > > > while the virtqueue is running. NVMe cqs can't do this because irqs are
> > > > only enabled/disabled at queue creation time.
> > > > 
> > > > This toggling approach requires no configuration. There is no need to
> > > > dedicate queues ahead of time or to teach users and orchestration tools
> > > > how to set up polling queues.
> > > > 
> > > > Possible drawbacks of this approach:
> > > > 
> > > > - Hardware virtio_blk implementations may find virtqueue_disable_cb()
> > > > expensive since it requires DMA.
> > > 
> > > Note that it's probably not related to the behavior of the driver but the
> > > design of the event suppression mechanism.
> > > 
> > > Device can choose to ignore the suppression flag and keep sending
> > > interrupts.
> > Yes, it's the design of the event suppression mechanism.
> > 
> > If we use dedicated polling virtqueues then the hardware doesn't need to
> > check whether interrupts are enabled for each notification. However,
> > there's no mechanism to tell the device that virtqueue interrupts are
> > permanently disabled. This means that as of today, even dedicated
> > virtqueues cannot suppress interrupts without the device checking for
> > each notification.
> 
> 
> This can be detected via a transport specific way.
> 
> E.g in the case of MSI, VIRTIO_MSI_NO_VECTOR could be a hint.

Nice idea :). Then there would be no need for changes to the hardware
interface. IRQ-less virtqueues could still be mentioned explicitly in
the VIRTIO spec so that driver/device authors are aware of the
VIRTIO_MSI_NO_VECTOR trick.
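
For completeness, a driver can already end up in that state today by
passing a NULL callback for the vq it intends to poll; the virtio-pci
transport then assigns VIRTIO_MSI_NO_VECTOR to that vq. A rough sketch
(names are made up, virtio_blk doesn't do this yet):

  static int virtblk_find_vqs_with_poll_vq(struct virtio_device *vdev,
                                           struct virtqueue *vqs[])
  {
          /* vq 0 uses irqs, vq 1 is poll-only and gets no MSI-X vector */
          vq_callback_t *callbacks[] = { virtblk_done, NULL };
          const char * const names[] = { "req.0", "req.poll" };

          return virtio_find_vqs(vdev, 2, vqs, callbacks, names, NULL);
  }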

> > > > +static int virtblk_poll(struct blk_mq_hw_ctx *hctx)
> > > > +{
> > > > +   struct virtio_blk *vblk = hctx->queue->queuedata;
> > > > +   struct virtqueue *vq = vblk->vqs[hctx->queue_num].vq;
> > > > +
> > > > +   if (!virtqueue_more_used(vq))
> > > 
> > > I'm not familiar with block polling but what happens if a buffer is made
> > > available after virtqueue_more_used() returns false here?
> > Can you explain the scenario, I'm not sure I understand? "buffer is made
> > available" -> are you thinking about additional requests being submitted
> > by the driver or an in-flight request being marked used by the device?
> 
> 
> Something like that:
> 
> 1) requests are submitted
> 2) poll but virtqueue_more_used() return false
> 3) device make buffer used
> 
> In this case, will poll() be triggered again by somebody else? (I think
> interrupt is disabled here).

Yes. An example blk_poll() user is
fs/block_dev.c:__blkdev_direct_IO_simple():

  qc = submit_bio(&bio);
  for (;;) {
  set_current_state(TASK_UNINTERRUPTIBLE);
  if (!READ_ONCE(bio.bi_private))
  break;
  if (!(iocb->ki_flags & IOCB_HIPRI) ||
  !blk_poll(bdev_get_queue(bdev), qc, true))
  blk_io_schedule();
  }

That's the infinite loop. The block layer implements the generic portion
of blk_poll(). blk_poll() calls mq_ops->poll() (virtblk_poll()).

So in general the polling loop will keep iterating, but there are
exceptions:
1. need_resched() causes blk_poll() to return 0 and blk_io_schedule()
   will be called.
2. blk-mq has a fancier io_poll algorithm that estimates I/O time and
   sleeps until the I/O is expected to complete before polling.

Re: [PATCH 3/3] virtio_blk: implement blk_mq_ops->poll()

2021-06-03 Thread Stefan Hajnoczi
On Thu, May 27, 2021 at 10:44:51AM +0800, Ming Lei wrote:
> On Thu, May 20, 2021 at 03:13:05PM +0100, Stefan Hajnoczi wrote:
> > Request completion latency can be reduced by using polling instead of
> > irqs. Even Posted Interrupts or similar hardware support doesn't beat
> > polling. The reason is that disabling virtqueue notifications saves
> > critical-path CPU cycles on the host by skipping irq injection and in
> > the guest by skipping the irq handler. So let's add blk_mq_ops->poll()
> > support to virtio_blk.
> > 
> > The approach taken by this patch differs from the NVMe driver's
> > approach. NVMe dedicates hardware queues to polling and submits
> > REQ_HIPRI requests only on those queues. This patch does not require
> > exclusive polling queues for virtio_blk. Instead, it switches between
> > irqs and polling when one or more REQ_HIPRI requests are in flight on a
> > virtqueue.
> > 
> > This is possible because toggling virtqueue notifications is cheap even
> > while the virtqueue is running. NVMe cqs can't do this because irqs are
> > only enabled/disabled at queue creation time.
> > 
> > This toggling approach requires no configuration. There is no need to
> > dedicate queues ahead of time or to teach users and orchestration tools
> > how to set up polling queues.
> 
> This approach looks good, and very neat thanks per-vq lock.
> 
> BTW, is there any virt-exit saved by disabling vq interrupt? I understand
> there isn't since virt-exit may only be involved in remote completion
> via sending IPI.

This patch doesn't eliminate vmexits. QEMU already has virtqueue polling
code that disables the vq notification (the virtio-pci hardware register
write that causes a vmexit).

However, when both the guest
driver and the emulated device are polling then there are no vmexits or
interrupt injections with this patch.
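
For anyone who wants to reproduce that, the host-side polling is enabled
per iothread, e.g. (illustrative command line):

  qemu-system-x86_64 ... \
      -object iothread,id=io0,poll-max-ns=100000 \
      -device virtio-blk-pci,drive=drive0,iothread=io0

With poll-max-ns > 0 QEMU busy-waits on the virtqueue for up to that many
nanoseconds before re-enabling notifications.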

> > 
> > Possible drawbacks of this approach:
> > 
> > - Hardware virtio_blk implementations may find virtqueue_disable_cb()
> >   expensive since it requires DMA. If such devices become popular then
> 
> You mean the hardware need to consider order between DMA completion and
> interrupt notify? But it is disabling notify, guest just calls
> virtqueue_get_buf() to see if there is buffer available, if not, it will be
> polled again.

Software devices have cheap access to guest RAM for looking at the
virtqueue_disable_cb() state before injecting an irq. Hardware devices
need to perform a DMA transaction to read that state. They have to do
this every time they want to raise an irq because the guest driver may
have changed the value.

I'm not sure if the DMA overhead is acceptable. This problem is not
introduced by this patch, it's a VIRTIO spec design issue.

I was trying to express that dedicated polling queues would avoid the
DMA since the device knows that irqs are never needed for this virtqueue.

> 
> >   the virtio_blk driver could use a similar approach to NVMe when
> >   VIRTIO_F_ACCESS_PLATFORM is detected in the future.
> > 
> > - If a blk_poll() thread is descheduled it not only hurts polling
> >   performance but also delays completion of non-REQ_HIPRI requests on
> >   that virtqueue since vq notifications are disabled.
> > 
> > Performance:
> > 
> > - Benchmark: fio ioengine=pvsync2 numjobs=4 direct=1
> > - Guest: 4 vCPUs with one virtio-blk device (4 virtqueues)
> 
> 4 jobs can consume up all 4 vCPUs. Just run a quick fio test with
> 'ioengine=io_uring --numjobs=1' on single vq, and IOPS can be improved
> by ~20%(hipri=1 vs hipri=0) with the 3 patches, and the virtio-blk is
> still backed on NVMe SSD.

Nice, thank you for sharing the data!

Stefan



Re: vhost: multiple worker support

2021-06-03 Thread Stefan Hajnoczi
On Tue, May 25, 2021 at 01:05:51PM -0500, Mike Christie wrote:
> The following patches apply over linus's tree or mst's vhost branch
> and my cleanup patchset:
> 
> https://lists.linuxfoundation.org/pipermail/virtualization/2021-May/054354.html
> 
> These patches allow us to support multiple vhost workers per device. I
> ended up just doing Stefan's original idea where userspace has the
> kernel create a worker and we pass back the pid. This has the benefit
> over the workqueue and userspace thread approach where we only have
> one'ish code path in the kernel during setup to detect old tools. The
> main IO paths and device/vq setup/teardown paths all use common code.
> 
> The kernel patches here allow us to then do N workers device and also
> share workers across devices.
> 
> I've also included a patch for qemu so you can get an idea of how it
> works. If we are ok with the kernel code then I'll break that up into
> a patchset and send to qemu-devel.

It seems risky to allow userspace process A to "share" a vhost worker
thread with userspace process B based on a matching pid alone. Should
they have ptrace_may_access() or similar?

For example, two competing users could sabotage each other by running
lots of work items on their competitor's vhost worker thread.
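
Something along the lines of this sketch (illustrative, not tested) when
looking up an existing worker by pid:

  /* Only share a worker if the caller may access the task that owns it */
  static bool vhost_worker_may_share(struct vhost_worker *worker)
  {
          return ptrace_may_access(worker->task,
                                   PTRACE_MODE_READ_REALCREDS);
  }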



Re: [PATCH 9/9] vhost: support sharing workers across devs

2021-06-03 Thread Stefan Hajnoczi
On Tue, May 25, 2021 at 01:06:00PM -0500, Mike Christie wrote:
> This allows a worker to handle multiple device's vqs.
> 
> TODO:
> - The worker is attached to the cgroup of the device that created it. In
> this patch you can share workers with devices with different owners which
> could be in different cgroups. Do we want to restict sharing workers with
> devices that have the same owner (dev->mm value)?

Question for Michael or Jason.



Re: [PATCH 8/9] vhost: add vhost_dev pointer to vhost_work

2021-06-03 Thread Stefan Hajnoczi
On Tue, May 25, 2021 at 01:05:59PM -0500, Mike Christie wrote:
> The next patch allows a vhost_worker to handle different devices. To
> prepare for that, this patch adds a pointer to the device on the work so
> we can get to the different mms in the vhost_worker thread.
> 
> Signed-off-by: Mike Christie 
> ---
>  drivers/vhost/scsi.c  |  7 ---
>  drivers/vhost/vhost.c | 24 ++--
>  drivers/vhost/vhost.h |  4 +++-
>  drivers/vhost/vsock.c |  3 ++-
>  4 files changed, 23 insertions(+), 15 deletions(-)

Reviewed-by: Stefan Hajnoczi 



Re: [PATCH 7/9] vhost: allow userspace to create workers

2021-06-03 Thread Stefan Hajnoczi
On Tue, May 25, 2021 at 01:05:58PM -0500, Mike Christie wrote:
> This patch allows userspace to create workers and bind them to vqs, so you
> can have N workers per dev and also share N workers with M vqs. The next
> patch will allow sharing across devices.
> 
> Signed-off-by: Mike Christie 
> ---
>  drivers/vhost/vhost.c| 94 +++-
>  drivers/vhost/vhost.h|  3 +
>  include/uapi/linux/vhost.h   |  6 ++
>  include/uapi/linux/vhost_types.h | 12 
>  4 files changed, 113 insertions(+), 2 deletions(-)
> 
> diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
> index 345ade0af133..981e9bac7a31 100644
> --- a/drivers/vhost/vhost.c
> +++ b/drivers/vhost/vhost.c
> @@ -30,6 +30,7 @@
>  #include 
>  #include 
>  #include 
> +#include 
>  
>  #include "vhost.h"
>  
> @@ -42,6 +43,9 @@ module_param(max_iotlb_entries, int, 0444);
>  MODULE_PARM_DESC(max_iotlb_entries,
>   "Maximum number of iotlb entries. (default: 2048)");
>  
> +static DEFINE_HASHTABLE(vhost_workers, 5);
> +static DEFINE_SPINLOCK(vhost_workers_lock);
> +
>  enum {
>   VHOST_MEMORY_F_LOG = 0x1,
>  };
> @@ -617,8 +621,17 @@ static void vhost_detach_mm(struct vhost_dev *dev)
>   dev->mm = NULL;
>  }
>  
> -static void vhost_worker_free(struct vhost_worker *worker)
> +static void vhost_worker_put(struct vhost_worker *worker)
>  {
> + spin_lock(&vhost_workers_lock);
> + if (!refcount_dec_and_test(&worker->refcount)) {
> + spin_unlock(&vhost_workers_lock);
> + return;
> + }
> +
> + hash_del(&worker->h_node);
> + spin_unlock(&vhost_workers_lock);
> +
>   WARN_ON(!llist_empty(&worker->work_list));
>   kthread_stop(worker->task);
>   kfree(worker);
> @@ -632,7 +645,7 @@ static void vhost_workers_free(struct vhost_dev *dev)
>   return;
>  
>   for (i = 0; i < dev->num_workers; i++)
> - vhost_worker_free(dev->workers[i]);
> + vhost_worker_put(dev->workers[i]);
>  
>   kfree(dev->workers);
>   dev->num_workers = 0;
> @@ -652,6 +665,8 @@ static struct vhost_worker *vhost_worker_create(struct 
> vhost_dev *dev)
>   worker->id = dev->num_workers;
>   worker->dev = dev;
> + init_llist_head(&worker->work_list);
> + INIT_HLIST_NODE(&worker->h_node);
> + refcount_set(&worker->refcount, 1);
>  
>   task = kthread_create(vhost_worker, worker, "vhost-%d", current->pid);
>   if (IS_ERR(task))
> @@ -664,6 +679,9 @@ static struct vhost_worker *vhost_worker_create(struct 
> vhost_dev *dev)
>   if (ret)
>   goto stop_worker;
>  
> + spin_lock(&vhost_workers_lock);
> + hash_add(vhost_workers, &worker->h_node, worker->task->pid);
> + spin_unlock(&vhost_workers_lock);
>   return worker;
>  
>  stop_worker:
> @@ -673,6 +691,67 @@ static struct vhost_worker *vhost_worker_create(struct 
> vhost_dev *dev)
>   return NULL;
>  }
>  
> +static struct vhost_worker *vhost_worker_find(struct vhost_dev *dev, pid_t 
> pid)
> +{
> + struct vhost_worker *worker, *found_worker = NULL;
> +
> + spin_lock(&vhost_workers_lock);
> + hash_for_each_possible(vhost_workers, worker, h_node, pid) {
> + if (worker->task->pid == pid) {
> + /* tmp - next patch allows sharing across devs */
> + if (worker->dev != dev)
> + break;
> +
> + found_worker = worker;
> + refcount_inc(&worker->refcount);
> + break;
> + }
> + }
> + spin_unlock(&vhost_workers_lock);
> + return found_worker;
> +}
> +
> +/* Caller must have device mutex */
> +static int vhost_vq_set_worker(struct vhost_virtqueue *vq,
> +struct vhost_vring_worker *info)
> +{
> + struct vhost_dev *dev = vq->dev;
> + struct vhost_worker *worker;
> +
> + if (vq->worker) {
> + /* TODO - support changing while works are running */
> + return -EBUSY;
> + }
> +
> + if (info->pid == VHOST_VRING_NEW_WORKER) {
> + worker = vhost_worker_create(dev);

The maximum number of kthreads created is limited by
vhost_dev_init(nvqs)? For example VHOST_SCSI_MAX_VQ 128.

IIUC kthread_create is not limited by or accounted against the current
task, so I'm a little worried that a process can create a lot of
kthreads.

I haven't investigated other kthread_create() users reachable from
userspace applications to see how they bound the number of threads
effectively.

Any thoughts?

> + if (!worker)
> + return -ENOMEM;
> +
> + info->pid = worker->task->pid;
> + } else {
> + worker = vhost_worker_find(dev, info->pid);
> + if (!worker)
> + return -ENODEV;
> + }
> +
> + if (!dev->workers) {
> + dev->workers = kcalloc(vq->dev->nvqs,
> +sizeof(struct vhost_worker *),
> +GFP_KERNEL);

Another candidate for GFP_KERNEL_ACCOUNT.

Re: [PATCH 6/9] vhost-scsi: make SCSI cmd completion per vq

2021-06-03 Thread Stefan Hajnoczi
On Tue, May 25, 2021 at 01:05:57PM -0500, Mike Christie wrote:
> This patch separates the scsi cmd completion code paths so we can complete
> cmds based on their vq instead of having all cmds complete on the same
> worker/CPU. This will be useful with the next patch that allows us to
> create mulitple worker threads and bind them to different vqs, so we can
> have completions running on different threads/CPUs.
> 
> Signed-off-by: Mike Christie 
> ---
>  drivers/vhost/scsi.c | 48 +++-
>  1 file changed, 25 insertions(+), 23 deletions(-)

Reviewed-by: Stefan Hajnoczi 



Re: [PATCH 5/9] vhost-scsi: flush IO vqs then send TMF rsp

2021-06-03 Thread Stefan Hajnoczi
On Tue, May 25, 2021 at 01:05:56PM -0500, Mike Christie wrote:
> With one worker we will always send the scsi cmd responses then send the
> TMF rsp, because LIO will always complete the scsi cmds first which
> calls vhost_scsi_release_cmd to add them to the work queue.
> 
> When the next patches adds multiple worker support, the worker threads
> could still be sending their responses when the tmf's work is run.
> So this patch has vhost-scsi flush the IO vqs on other worker threads
> before we send the tmf response.
> 
> Signed-off-by: Mike Christie 
> ---
>  drivers/vhost/scsi.c  | 17 ++---
>  drivers/vhost/vhost.c |  6 ++
>  drivers/vhost/vhost.h |  1 +
>  3 files changed, 21 insertions(+), 3 deletions(-)
> 
> diff --git a/drivers/vhost/scsi.c b/drivers/vhost/scsi.c
> index 46f897e41217..e585f2180457 100644
> --- a/drivers/vhost/scsi.c
> +++ b/drivers/vhost/scsi.c
> @@ -1168,12 +1168,23 @@ static void vhost_scsi_tmf_resp_work(struct 
> vhost_work *work)
>  {
>   struct vhost_scsi_tmf *tmf = container_of(work, struct vhost_scsi_tmf,
> vwork);
> - int resp_code;
> + int resp_code, i;
> +
> + if (tmf->scsi_resp == TMR_FUNCTION_COMPLETE) {
> + /*
> +  * When processing a TMF lio completes the cmds then the TMF,
> +  * so with one worker the TMF always completes after cmds. For
> +  * multiple worker support we must flush the IO vqs to make
> +      * sure if they have differrent workers then the cmds have

s/differrent/different/

Reviewed-by: Stefan Hajnoczi 



Re: [PATCH 4/9] vhost: allow vhost_polls to use different vhost_workers

2021-06-03 Thread Stefan Hajnoczi
On Tue, May 25, 2021 at 01:05:55PM -0500, Mike Christie wrote:
> This allows vhost_polls to use the worker the vq the poll is associated
> with.
> 
> This also exports a helper vhost_vq_work_queue which is used here
> internally, and will be used in the vhost-scsi patches.
> 
> Signed-off-by: Mike Christie 
> ---
>  drivers/vhost/net.c   |  6 --
>  drivers/vhost/vhost.c | 19 ---
>  drivers/vhost/vhost.h |  6 +-
>  3 files changed, 25 insertions(+), 6 deletions(-)

Reviewed-by: Stefan Hajnoczi 



Re: [PATCH 3/9] vhost: modify internal functions to take a vhost_worker

2021-06-03 Thread Stefan Hajnoczi
On Tue, May 25, 2021 at 01:05:54PM -0500, Mike Christie wrote:
> -void vhost_work_queue(struct vhost_dev *dev, struct vhost_work *work)
> +static void vhost_work_queue_on(struct vhost_work *work,
> + struct vhost_worker *worker)
>  {
> - if (!dev->worker)
> - return;
> -
>   if (!test_and_set_bit(VHOST_WORK_QUEUED, &work->flags)) {
>   /* We can only add the work to the list after we're
>* sure it was not in the list.
>* test_and_set_bit() implies a memory barrier.
>*/
> - llist_add(&work->node, &dev->worker->work_list);
> - wake_up_process(dev->worker->task);
> + llist_add(&work->node, &worker->work_list);
> + wake_up_process(worker->task);
>   }
>  }
> +
> +void vhost_work_queue(struct vhost_dev *dev, struct vhost_work *work)

When should this function still be used? A doc comment contrasting it to
vhost_work_queue_on() would be helpful. I would expect callers to switch
to that instead of queuing work on dev->workers[0].
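
Something like this would work (wording and body are only a sketch,
assuming the function forwards to the first worker):

  /*
   * vhost_work_queue - queue work on the device's default (first) worker.
   *
   * Legacy callers that don't know about multiple workers use this. Code
   * that knows which vq the work belongs to should call
   * vhost_work_queue_on() with that vq's worker instead.
   */
  void vhost_work_queue(struct vhost_dev *dev, struct vhost_work *work)
  {
          vhost_work_queue_on(work, dev->workers[0]);
  }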

>  /* A lockless hint for busy polling code to exit the loop */
>  bool vhost_has_work(struct vhost_dev *dev)
>  {
> - return dev->worker && !llist_empty(&dev->worker->work_list);
> + int i;
> +
> + for (i = 0; i < dev->num_workers; i++) {
> + if (!llist_empty(&dev->workers[i]->work_list))
> + return true;
> + }
> +
> + return false;
>  }
>  EXPORT_SYMBOL_GPL(vhost_has_work);

It's probably not necessary to poll all workers:

drivers/vhost/net.c calls vhost_has_work() to busy poll a specific
virtqueue. If the vq:worker mapping is 1:1 or N:1 then vhost_has_work()
should be extended to include the struct vhost_virtqueue so we can poll
just that vq worker's work_list.
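
For example (sketch only, using the vq->worker pointer this series adds):

  /* A lockless hint for busy polling code to exit the loop */
  bool vhost_vq_has_work(struct vhost_virtqueue *vq)
  {
          return !llist_empty(&vq->worker->work_list);
  }
  EXPORT_SYMBOL_GPL(vhost_vq_has_work);

drivers/vhost/net.c could then pass the vq it is busy polling instead of
the whole device.
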
>  /* Caller must have device mutex */
>  static int vhost_worker_try_create_def(struct vhost_dev *dev)
>  {
> - if (!dev->use_worker || dev->worker)
> + struct vhost_worker *worker;
> +
> + if (!dev->use_worker || dev->workers)
>   return 0;
>  
> - return vhost_worker_create(dev);
> + dev->workers = kcalloc(1, sizeof(struct vhost_worker *), GFP_KERNEL);

GFP_KERNEL_ACCOUNT so that vhost memory associated with a process (the
one that invoked the ioctl) is accounted? This may get trickier if the
workers are shared between processes.

The same applies for struct vhost_worker in vhost_worker_create().
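
In other words (sketch, assuming vhost_worker_create() allocates the
worker with kzalloc()):

  dev->workers = kcalloc(1, sizeof(struct vhost_worker *),
                         GFP_KERNEL_ACCOUNT);
  ...
  worker = kzalloc(sizeof(*worker), GFP_KERNEL_ACCOUNT);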



Re: [PATCH 2/9] vhost: move vhost worker creation to kick setup

2021-06-03 Thread Stefan Hajnoczi
On Tue, May 25, 2021 at 01:05:53PM -0500, Mike Christie wrote:
> The next patch will add new ioctls that allows userspace to create workers
> and bind them to devs and vqs after VHOST_SET_OWNER. To support older
> tools, newer tools that want to go wild with worker threads, and newer
> tools that want the old/default behavior this patch moves the default
> worker setup to the kick setup.
> 
> After the first vq's kick/poll setup is done we could start to get works
> queued, so this is the point when we must have a worker setup. If we are
> using older tools or the newer tools just want the default single vhost
> thread per dev behavior then it will automatically be done here. If the
> tools are using the newer ioctls that have already created the workers
> then we also detect that here and do nothing.
> 
> Signed-off-by: Mike Christie 
> ---
>  drivers/vhost/vhost.c | 23 ---
>  1 file changed, 16 insertions(+), 7 deletions(-)

Reviewed-by: Stefan Hajnoczi 



Re: [PATCH 1/9] vhost: move worker thread fields to new struct

2021-06-03 Thread Stefan Hajnoczi
On Tue, May 25, 2021 at 01:05:52PM -0500, Mike Christie wrote:
> This is just a prep patch. It moves the worker related fields to a new
> vhost_worker struct and moves the code around to create some helpers that
> will be used in the next patches.
> 
> Signed-off-by: Mike Christie 
> ---
>  drivers/vhost/vhost.c | 94 +--
>  drivers/vhost/vhost.h |  9 -
>  2 files changed, 70 insertions(+), 33 deletions(-)

Reviewed-by: Stefan Hajnoczi 



Re: vhost: multiple worker support

2021-06-03 Thread Stefan Hajnoczi
On Tue, May 25, 2021 at 01:05:51PM -0500, Mike Christie wrote:
> Results:
> 
> When running with the null_blk driver and vhost-scsi I can get 1.2
> million IOPs by just running a simple
> 
> fio --filename=/dev/sda --direct=1 --rw=randrw --bs=4k --ioengine=libaio
> --iodepth=128  --numjobs=8 --time_based --group_reporting --name=iops
> --runtime=60 --eta-newline=1
> 
> The VM has 8 vCPUs and sda has 8 virtqueues and we can do a total of
> 1024 cmds per devices. To get 1.2 million IOPs I did have to tune and
> ran the virsh emulatorpin command so the vhost threads were running
> on different CPUs than the VM. If the vhost threads share CPUs then I
> get around 800K.
> 
> For a more real device that are also CPU hogs like iscsi, I can still
> get 1 million IOPs using 1 dm-multipath device over 8 iscsi paths
> (natively it gets 1.1 million IOPs).

There is no comparison against a baseline, but I guess it would be the
same 8 vCPU guest with single queue vhost-scsi?

> 
> Results/TODO Note:
> 
> - I ported the vdpa sim code to support multiple workers and as-is now
> it made perf much worse. If I increase vdpa_sim_blk's num queues to 4-8
> I get 700K IOPs with the fio command above. However with the multiple
> worker support it drops to 400K. The problem is the vdpa_sim lock
> and the iommu_lock. If I hack (like comment out locks or not worry about
> data corruption or crashes) then I can get around 1.2M - 1.6M IOPs with
> 8 queues and fio command above.
> 
> So these patches could help other drivers, but it will just take more
> work to remove those types of locks. I was hoping the 2 items could be
> done independently since it helps vhost-scsi immediately.

Cool, thank you for taking a look. That's useful info for Stefano. vDPA
and vhost can be handled independently though in the long term hopefully
they can share code.



Re: [RFC] virtio_scsi: to poll and kick the virtqueue in timeout handler

2021-05-25 Thread Stefan Hajnoczi
On Mon, May 24, 2021 at 11:33:33PM -0700, Dongli Zhang wrote:
> On 5/24/21 6:24 AM, Stefan Hajnoczi wrote:
> > On Sun, May 23, 2021 at 09:39:51AM +0200, Hannes Reinecke wrote:
> >> On 5/23/21 8:38 AM, Dongli Zhang wrote:
> >>> This RFC is to trigger the discussion about to poll and kick the
> >>> virtqueue on purpose in virtio-scsi timeout handler.
> >>>
> >>> The virtio-scsi relies on the virtio vring shared between VM and host.
> >>> The VM side produces requests to vring and kicks the virtqueue, while the
> >>> host side produces responses to vring and interrupts the VM side.
> >>>
> >>> By default the virtio-scsi handler depends on the host timeout handler
> >>> by BLK_EH_RESET_TIMER to give host a chance to perform EH.
> >>>
> >>> However, this is not helpful for the case that the responses are available
> >>> on vring but the notification from host to VM is lost.
> >>>
> >> How can this happen?
> >> If responses are lost the communication between VM and host is broken, and
> >> we should rather reset the virtio rings themselves.
> > 
> > I agree. In principle it's fine to poll the virtqueue at any time, but I
> > don't understand the failure scenario here. It's not clear to me why the
> > device-to-driver vq notification could be lost.
> > 
> 
> One example is the CPU hotplug issue before the commit bf0beec0607d ("blk-mq:
> drain I/O when all CPUs in a hctx are offline") was available. The issue is
> equivalent to loss of interrupt. Without the CPU hotplug fix, while NVMe 
> driver
> relies on the timeout handler to complete inflight IO requests, the PV
> virtio-scsi may hang permanently.
> 
> In addition, as the virtio/vhost/QEMU are complex software, we are not able to
> guarantee there is no further lost of interrupt/kick issue in the future. It 
> is
> really painful if we encounter such issue in production environment.

Any number of hardware or software bugs might exist that we don't know
about, yet we don't pre-emptively add workarounds for them because where
do you draw the line?

I checked other SCSI/block drivers and found it's rare to poll in the
timeout function so there does not seem to be a consensus that it's
useful to do this.

That said, it's technically fine to do it, the virtqueue APIs are there
and can be used like this. So if you and others think this is necessary,
then it's a pretty small change and I'm not against merging a patch like
this.

Stefan



Re: [PATCH 3/3] virtio_blk: implement blk_mq_ops->poll()

2021-05-25 Thread Stefan Hajnoczi
On Tue, May 25, 2021 at 03:38:42PM +0800, Ming Lei wrote:
> On Tue, May 25, 2021 at 09:22:48AM +0200, Paolo Bonzini wrote:
> > On 24/05/21 16:59, Christoph Hellwig wrote:
> > > On Thu, May 20, 2021 at 03:13:05PM +0100, Stefan Hajnoczi wrote:
> > > > Possible drawbacks of this approach:
> > > > 
> > > > - Hardware virtio_blk implementations may find virtqueue_disable_cb()
> > > >expensive since it requires DMA. If such devices become popular then
> > > >the virtio_blk driver could use a similar approach to NVMe when
> > > >VIRTIO_F_ACCESS_PLATFORM is detected in the future.
> > > > 
> > > > - If a blk_poll() thread is descheduled it not only hurts polling
> > > >performance but also delays completion of non-REQ_HIPRI requests on
> > > >that virtqueue since vq notifications are disabled.
> > > 
> > > Yes, I think this is a dangerous configuration.  What argument exists
> > > again just using dedicated poll queues?
> > 
> > There isn't an equivalent of the admin queue in virtio-blk, which would
> > allow the guest to configure the desired number of poll queues.  The number
> > of queues is fixed.
> 
> Dedicated vqs can be used for poll only, and I understand VM needn't to know
> if the vq is polled or driven by IRQ in VM.
> 
> I tried that in v5.4, but not see obvious IOPS boost, so give up.
> 
> https://github.com/ming1/linux/commits/my_v5.4-virtio-irq-poll

Hey, that's cool. I see a lot of similarity between our patches :).

Stefan



Re: [PATCH 3/3] virtio_blk: implement blk_mq_ops->poll()

2021-05-25 Thread Stefan Hajnoczi
On Mon, May 24, 2021 at 04:59:28PM +0200, Christoph Hellwig wrote:
> On Thu, May 20, 2021 at 03:13:05PM +0100, Stefan Hajnoczi wrote:
> > Possible drawbacks of this approach:
> > 
> > - Hardware virtio_blk implementations may find virtqueue_disable_cb()
> >   expensive since it requires DMA. If such devices become popular then
> >   the virtio_blk driver could use a similar approach to NVMe when
> >   VIRTIO_F_ACCESS_PLATFORM is detected in the future.
> > 
> > - If a blk_poll() thread is descheduled it not only hurts polling
> >   performance but also delays completion of non-REQ_HIPRI requests on
> >   that virtqueue since vq notifications are disabled.
> 
> Yes, I think this is a dangerous configuration.  What argument exists
> again just using dedicated poll queues?

Aside from what Paolo described (the lack of a hardware interface to
designate polling queues), the poll_queues=N parameter needs to be added
to the full guest and host software stack. Users also need to learn
about this so they can enable it in all the places. This is much more
hassle for users to configure. The dynamic polling mode approach
requires no configuration.
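
For comparison, this is roughly what users have to do today to get
dedicated poll queues with NVMe (and what virtio-blk would have to copy,
plus a new device interface for designating the queue types):

  # host/guest driver configuration
  modprobe nvme poll_queues=4

  # application opt-in
  fio --ioengine=pvsync2 --hipri ...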

Paolo's suggestion is to notify the driver when irqs need to be
re-enabled if the polling thread is descheduled. I actually have a
prototype that achieves something similar here:
https://gitlab.com/stefanha/linux/-/commits/cpuidle-haltpoll-virtqueue

It's a different approach from the current patch series: the cpuidle
driver provides poll_start/stop() callbacks and virtio_blk participates
in cpuidle-haltpoll. That means all virtio-blk devices temporarily use
polling mode while the vCPU is doing cpuidle-haltpoll polling. The neat
thing about the cpuidle approach is:
1. Applications don't need to set the RWF_HIPRI flag!
2. Other drivers besides virtio_blk can participate in polling too.

Maybe we should go with cpuidle polling instead?
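
The driver-side hooks boil down to something like this (simplified sketch,
not the exact code in that branch; the real version also has to reap
completions and handle races):

  /* vCPU is entering cpuidle-haltpoll polling: stop using irqs */
  static void virtblk_poll_start(void *data)
  {
          struct virtio_blk *vblk = data;
          int i;

          for (i = 0; i < vblk->num_vqs; i++)
                  virtqueue_disable_cb(vblk->vqs[i].vq);
  }

  /* Polling is over: switch back to irqs */
  static void virtblk_poll_stop(void *data)
  {
          struct virtio_blk *vblk = data;
          int i;

          for (i = 0; i < vblk->num_vqs; i++)
                  virtqueue_enable_cb(vblk->vqs[i].vq);
  }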

Stefan



Re: [PATCH 3/3] virtio_blk: implement blk_mq_ops->poll()

2021-05-25 Thread Stefan Hajnoczi
On Tue, May 25, 2021 at 11:21:41AM +0800, Jason Wang wrote:
> 
> 在 2021/5/20 下午10:13, Stefan Hajnoczi 写道:
> > Request completion latency can be reduced by using polling instead of
> > irqs. Even Posted Interrupts or similar hardware support doesn't beat
> > polling. The reason is that disabling virtqueue notifications saves
> > critical-path CPU cycles on the host by skipping irq injection and in
> > the guest by skipping the irq handler. So let's add blk_mq_ops->poll()
> > support to virtio_blk.
> > 
> > The approach taken by this patch differs from the NVMe driver's
> > approach. NVMe dedicates hardware queues to polling and submits
> > REQ_HIPRI requests only on those queues. This patch does not require
> > exclusive polling queues for virtio_blk. Instead, it switches between
> > irqs and polling when one or more REQ_HIPRI requests are in flight on a
> > virtqueue.
> > 
> > This is possible because toggling virtqueue notifications is cheap even
> > while the virtqueue is running. NVMe cqs can't do this because irqs are
> > only enabled/disabled at queue creation time.
> > 
> > This toggling approach requires no configuration. There is no need to
> > dedicate queues ahead of time or to teach users and orchestration tools
> > how to set up polling queues.
> > 
> > Possible drawbacks of this approach:
> > 
> > - Hardware virtio_blk implementations may find virtqueue_disable_cb()
> >expensive since it requires DMA.
> 
> 
> Note that it's probably not related to the behavior of the driver but the
> design of the event suppression mechanism.
> 
> Device can choose to ignore the suppression flag and keep sending
> interrupts.

Yes, it's the design of the event suppression mechanism.

If we use dedicated polling virtqueues then the hardware doesn't need to
check whether interrupts are enabled for each notification. However,
there's no mechanism to tell the device that virtqueue interrupts are
permanently disabled. This means that as of today, even dedicated
virtqueues cannot suppress interrupts without the device checking for
each notification.

> >   If such devices become popular then
> >the virtio_blk driver could use a similar approach to NVMe when
> >VIRTIO_F_ACCESS_PLATFORM is detected in the future.
> > 
> > - If a blk_poll() thread is descheduled it not only hurts polling
> >performance but also delays completion of non-REQ_HIPRI requests on
> >that virtqueue since vq notifications are disabled.
> 
> 
> Can we poll only when only high pri requests are pending?

Yes, that's what this patch does.

> If the backend is a remote one, I think the polling may cause more cpu
> cycles.

Right, but polling is only done when userspace sets the RWF_HIPRI
request flag. Most applications don't support it and for those that do
it's probably an option that the user needs to enable explicitly.
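
For reference, a polled read from userspace is just a preadv2() call with
RWF_HIPRI on an O_DIRECT file descriptor (this is essentially what fio's
pvsync2 engine does with the hipri option):

  struct iovec iov = { .iov_base = buf, .iov_len = 4096 };
  ssize_t ret = preadv2(fd, &iov, 1, offset, RWF_HIPRI);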

Stefan

> > diff --git a/drivers/block/virtio_blk.c b/drivers/block/virtio_blk.c
> > index fc0fb1dcd399..f0243dcd745a 100644
> > --- a/drivers/block/virtio_blk.c
> > +++ b/drivers/block/virtio_blk.c
> > @@ -29,6 +29,16 @@ static struct workqueue_struct *virtblk_wq;
> >   struct virtio_blk_vq {
> > struct virtqueue *vq;
> > spinlock_t lock;
> > +
> > +   /* Number of non-REQ_HIPRI requests in flight. Protected by lock. */
> > +   unsigned int num_lopri;
> > +
> > +   /* Number of REQ_HIPRI requests in flight. Protected by lock. */
> > +   unsigned int num_hipri;
> > +
> > +   /* Are vq notifications enabled? Protected by lock. */
> > +   bool cb_enabled;
> 
> 
> We had event_flag_shadow, is it sufficient to introduce a new helper
> virtqueue_cb_is_enabled()?

Yes, I'll try that in the next revision.

> > +
> > char name[VQ_NAME_LEN];
> >   } cacheline_aligned_in_smp;
> > @@ -171,33 +181,67 @@ static inline void virtblk_request_done(struct 
> > request *req)
> > blk_mq_end_request(req, virtblk_result(vbr));
> >   }
> > -static void virtblk_done(struct virtqueue *vq)
> > +/* Returns true if one or more requests completed */
> > +static bool virtblk_complete_requests(struct virtqueue *vq)
> >   {
> > struct virtio_blk *vblk = vq->vdev->priv;
> > struct virtio_blk_vq *vbq = &vblk->vqs[vq->index];
> > bool req_done = false;
> > +   bool last_hipri_done = false;
> > struct virtblk_req *vbr;
> > unsigned long flags;
> > unsigned int len;
> > spin_lock_irqsave(&vbq->lock, flags);
> > +
> > do {
> > -   virtqueue_disable_cb(vq);
> > +   if (vbq->cb_enabled)

Re: [PATCH 1/3] virtio: add virtioqueue_more_used()

2021-05-25 Thread Stefan Hajnoczi
On Tue, May 25, 2021 at 10:23:39AM +0800, Jason Wang wrote:
> 
> 在 2021/5/20 下午10:13, Stefan Hajnoczi 写道:
> > Add an API to check whether there are pending used buffers. There is
> > already a similar API called virtqueue_poll() but it only works together
> > with virtqueue_enable_cb_prepare(). The patches that follow add blk-mq
> > ->poll() support to virtio_blk and they need to check for used buffers
> > without re-enabling virtqueue callbacks, so introduce an API for it.
> > 
> > Signed-off-by: Stefan Hajnoczi 
> 
> 
> Typo in the subject.

Thanks, will fix.

> > +/**
> > + * virtqueue_more_used - check if there are used buffers pending
> > + * @_vq: the struct virtqueue we're talking about.
> > + *
> > + * Returns true if there are used buffers, false otherwise. May be called 
> > at
> > + * the same time as other virtqueue operations, but actually calling
> > + * virtqueue_get_buf() requires serialization so be mindful of the race 
> > between
> > + * calling virtqueue_more_used() and virtqueue_get_buf().
> > + */
> > +bool virtqueue_more_used(const struct virtqueue *_vq)
> > +{
> > +   struct vring_virtqueue *vq = to_vvq(_vq);
> > +
> > +   return more_used(vq);
> > +}
> > +EXPORT_SYMBOL_GPL(virtqueue_more_used);
> 
> 
> It's worth to mention that the function is not serialized (no barriers).

Thanks, will fix.

Stefan



Re: [PATCH 1/1] virtio: disable partitions scanning for no partitions block

2021-05-24 Thread Stefan Hajnoczi
On Thu, May 20, 2021 at 04:39:08PM +0300, Yury Kamenev wrote:

Hi,
Is there a VIRTIO spec change for the new VIRTIO_BLK_F_NO_PS feature
bit? Please send one:
https://www.oasis-open.org/committees/tc_home.php?wg_abbrev=virtio#feedback

GENHD_FL_NO_PART_SCAN is not used much in other drivers. This makes me
wonder if the same use case is addressed through other means with SCSI,
NVMe, etc devices. Maybe Christoph or Jens can weigh in on whether
adding a bit to disable partition scanning for a virtio-blk fits into
the big picture?

Is your goal to avoid accidentally detecting partitions because it's
confusing when that happens?

VIRTIO is currently undergoing auditing and changes to support untrusted
devices. From that perspective adding a device feature bit to disable
partition scanning does not help protect the guest from an untrusted
disk. The guest cannot trust the device, instead the guest itself would
need to be configured to avoid partition scanning of untrusted devices.

Stefan

> Signed-off-by: Yury Kamenev 
> ---
>  drivers/block/virtio_blk.c  | 6 ++
>  include/uapi/linux/virtio_blk.h | 1 +
>  2 files changed, 7 insertions(+)
> 
> diff --git a/drivers/block/virtio_blk.c b/drivers/block/virtio_blk.c
> index b9fa3ef5b57c..17edcfee2208 100644
> --- a/drivers/block/virtio_blk.c
> +++ b/drivers/block/virtio_blk.c
> @@ -799,6 +799,10 @@ static int virtblk_probe(struct virtio_device *vdev)
>   vblk->disk->flags |= GENHD_FL_EXT_DEVT;
>   vblk->index = index;
>  
> + /*Disable partitions scanning for no-partitions block*/

Formatting cleanup and rephrasing:

  /* Disable partition scanning for devices with no partitions */

> + if (virtio_has_feature(vdev, VIRTIO_BLK_F_NO_PS))

I suggest using a more obvious name:

  VIRTIO_BLK_F_NO_PART_SCAN

> + vblk->disk->flags |= GENHD_FL_NO_PART_SCAN;
> +
>   /* configure queue flush support */
>   virtblk_update_cache_mode(vdev);
>  
> @@ -977,6 +981,7 @@ static unsigned int features_legacy[] = {
>   VIRTIO_BLK_F_RO, VIRTIO_BLK_F_BLK_SIZE,
>   VIRTIO_BLK_F_FLUSH, VIRTIO_BLK_F_TOPOLOGY, VIRTIO_BLK_F_CONFIG_WCE,
>   VIRTIO_BLK_F_MQ, VIRTIO_BLK_F_DISCARD, VIRTIO_BLK_F_WRITE_ZEROES,
> + VIRTIO_BLK_F_NO_PS,
>  }
>  ;
>  static unsigned int features[] = {
> @@ -984,6 +989,7 @@ static unsigned int features[] = {
>   VIRTIO_BLK_F_RO, VIRTIO_BLK_F_BLK_SIZE,
>   VIRTIO_BLK_F_FLUSH, VIRTIO_BLK_F_TOPOLOGY, VIRTIO_BLK_F_CONFIG_WCE,
>   VIRTIO_BLK_F_MQ, VIRTIO_BLK_F_DISCARD, VIRTIO_BLK_F_WRITE_ZEROES,
> + VIRTIO_BLK_F_NO_PS,
>  };
>  
>  static struct virtio_driver virtio_blk = {
> diff --git a/include/uapi/linux/virtio_blk.h b/include/uapi/linux/virtio_blk.h
> index d888f013d9ff..f197d07afb05 100644
> --- a/include/uapi/linux/virtio_blk.h
> +++ b/include/uapi/linux/virtio_blk.h
> @@ -40,6 +40,7 @@
>  #define VIRTIO_BLK_F_MQ  12  /* support more than one vq */
>  #define VIRTIO_BLK_F_DISCARD 13  /* DISCARD is supported */
>  #define VIRTIO_BLK_F_WRITE_ZEROES14  /* WRITE ZEROES is supported */
> +#define VIRTIO_BLK_F_NO_PS  16  /* No partitions */
>  
>  /* Legacy feature bits */
>  #ifndef VIRTIO_BLK_NO_LEGACY
> -- 
> 2.24.3 (Apple Git-128)
> 



Re: [RFC] virtio_scsi: to poll and kick the virtqueue in timeout handler

2021-05-24 Thread Stefan Hajnoczi
On Sun, May 23, 2021 at 09:39:51AM +0200, Hannes Reinecke wrote:
> On 5/23/21 8:38 AM, Dongli Zhang wrote:
> > This RFC is to trigger the discussion about to poll and kick the
> > virtqueue on purpose in virtio-scsi timeout handler.
> > 
> > The virtio-scsi relies on the virtio vring shared between VM and host.
> > The VM side produces requests to vring and kicks the virtqueue, while the
> > host side produces responses to vring and interrupts the VM side.
> > 
> > By default the virtio-scsi handler depends on the host timeout handler
> > by BLK_EH_RESET_TIMER to give host a chance to perform EH.
> > 
> > However, this is not helpful for the case that the responses are available
> > on vring but the notification from host to VM is lost.
> > 
> How can this happen?
> If responses are lost the communication between VM and host is broken, and
> we should rather reset the virtio rings themselves.

I agree. In principle it's fine to poll the virtqueue at any time, but I
don't understand the failure scenario here. It's not clear to me why the
device-to-driver vq notification could be lost.

Stefan



[PATCH 2/3] virtio_blk: avoid repeating vblk->vqs[qid]

2021-05-20 Thread Stefan Hajnoczi
struct virtio_blk_vq is accessed in many places. Introduce "vbq" local
variables to avoid repeating vblk->vqs[qid] throughout the code. The
patches that follow will add more accesses, making the payoff even
greater.

virtio_commit_rqs() calls the local variable "vq", which is easily
confused with struct virtqueue. Rename to "vbq" for clarity.

Signed-off-by: Stefan Hajnoczi 
---
 drivers/block/virtio_blk.c | 34 +-
 1 file changed, 17 insertions(+), 17 deletions(-)

diff --git a/drivers/block/virtio_blk.c b/drivers/block/virtio_blk.c
index b9fa3ef5b57c..fc0fb1dcd399 100644
--- a/drivers/block/virtio_blk.c
+++ b/drivers/block/virtio_blk.c
@@ -174,16 +174,16 @@ static inline void virtblk_request_done(struct request 
*req)
 static void virtblk_done(struct virtqueue *vq)
 {
struct virtio_blk *vblk = vq->vdev->priv;
+   struct virtio_blk_vq *vbq = &vblk->vqs[vq->index];
bool req_done = false;
-   int qid = vq->index;
struct virtblk_req *vbr;
unsigned long flags;
unsigned int len;
 
-   spin_lock_irqsave(&vblk->vqs[qid].lock, flags);
+   spin_lock_irqsave(&vbq->lock, flags);
do {
virtqueue_disable_cb(vq);
-   while ((vbr = virtqueue_get_buf(vblk->vqs[qid].vq, &len)) != 
NULL) {
+   while ((vbr = virtqueue_get_buf(vq, &len)) != NULL) {
struct request *req = blk_mq_rq_from_pdu(vbr);
 
if (likely(!blk_should_fake_timeout(req->q)))
@@ -197,32 +197,32 @@ static void virtblk_done(struct virtqueue *vq)
/* In case queue is stopped waiting for more buffers. */
if (req_done)
blk_mq_start_stopped_hw_queues(vblk->disk->queue, true);
-   spin_unlock_irqrestore(&vblk->vqs[qid].lock, flags);
+   spin_unlock_irqrestore(&vbq->lock, flags);
 }
 
 static void virtio_commit_rqs(struct blk_mq_hw_ctx *hctx)
 {
struct virtio_blk *vblk = hctx->queue->queuedata;
-   struct virtio_blk_vq *vq = &vblk->vqs[hctx->queue_num];
+   struct virtio_blk_vq *vbq = &vblk->vqs[hctx->queue_num];
bool kick;
 
-   spin_lock_irq(&vq->lock);
-   kick = virtqueue_kick_prepare(vq->vq);
-   spin_unlock_irq(&vq->lock);
+   spin_lock_irq(&vbq->lock);
+   kick = virtqueue_kick_prepare(vbq->vq);
+   spin_unlock_irq(&vbq->lock);
 
if (kick)
-   virtqueue_notify(vq->vq);
+   virtqueue_notify(vbq->vq);
 }
 
 static blk_status_t virtio_queue_rq(struct blk_mq_hw_ctx *hctx,
   const struct blk_mq_queue_data *bd)
 {
struct virtio_blk *vblk = hctx->queue->queuedata;
+   struct virtio_blk_vq *vbq = &vblk->vqs[hctx->queue_num];
struct request *req = bd->rq;
struct virtblk_req *vbr = blk_mq_rq_to_pdu(req);
unsigned long flags;
unsigned int num;
-   int qid = hctx->queue_num;
int err;
bool notify = false;
bool unmap = false;
@@ -274,16 +274,16 @@ static blk_status_t virtio_queue_rq(struct blk_mq_hw_ctx 
*hctx,
vbr->out_hdr.type |= cpu_to_virtio32(vblk->vdev, 
VIRTIO_BLK_T_IN);
}
 
-   spin_lock_irqsave(&vblk->vqs[qid].lock, flags);
-   err = virtblk_add_req(vblk->vqs[qid].vq, vbr, vbr->sg, num);
+   spin_lock_irqsave(&vbq->lock, flags);
+   err = virtblk_add_req(vbq->vq, vbr, vbr->sg, num);
if (err) {
-   virtqueue_kick(vblk->vqs[qid].vq);
+   virtqueue_kick(vbq->vq);
/* Don't stop the queue if -ENOMEM: we may have failed to
 * bounce the buffer due to global resource outage.
 */
if (err == -ENOSPC)
blk_mq_stop_hw_queue(hctx);
-   spin_unlock_irqrestore(&vblk->vqs[qid].lock, flags);
+   spin_unlock_irqrestore(&vbq->lock, flags);
switch (err) {
case -ENOSPC:
return BLK_STS_DEV_RESOURCE;
@@ -294,12 +294,12 @@ static blk_status_t virtio_queue_rq(struct blk_mq_hw_ctx 
*hctx,
}
}
 
-   if (bd->last && virtqueue_kick_prepare(vblk->vqs[qid].vq))
+   if (bd->last && virtqueue_kick_prepare(vbq->vq))
notify = true;
-   spin_unlock_irqrestore(&vblk->vqs[qid].lock, flags);
+   spin_unlock_irqrestore(&vbq->lock, flags);
 
if (notify)
-   virtqueue_notify(vblk->vqs[qid].vq);
+   virtqueue_notify(vbq->vq);
return BLK_STS_OK;
 }
 
-- 
2.31.1



[PATCH 3/3] virtio_blk: implement blk_mq_ops->poll()

2021-05-20 Thread Stefan Hajnoczi
Request completion latency can be reduced by using polling instead of
irqs. Even Posted Interrupts or similar hardware support doesn't beat
polling. The reason is that disabling virtqueue notifications saves
critical-path CPU cycles on the host by skipping irq injection and in
the guest by skipping the irq handler. So let's add blk_mq_ops->poll()
support to virtio_blk.

The approach taken by this patch differs from the NVMe driver's
approach. NVMe dedicates hardware queues to polling and submits
REQ_HIPRI requests only on those queues. This patch does not require
exclusive polling queues for virtio_blk. Instead, it switches between
irqs and polling when one or more REQ_HIPRI requests are in flight on a
virtqueue.

This is possible because toggling virtqueue notifications is cheap even
while the virtqueue is running. NVMe cqs can't do this because irqs are
only enabled/disabled at queue creation time.

This toggling approach requires no configuration. There is no need to
dedicate queues ahead of time or to teach users and orchestration tools
how to set up polling queues.

Possible drawbacks of this approach:

- Hardware virtio_blk implementations may find virtqueue_disable_cb()
  expensive since it requires DMA. If such devices become popular then
  the virtio_blk driver could use a similar approach to NVMe when
  VIRTIO_F_ACCESS_PLATFORM is detected in the future.

- If a blk_poll() thread is descheduled it not only hurts polling
  performance but also delays completion of non-REQ_HIPRI requests on
  that virtqueue since vq notifications are disabled.

Performance:

- Benchmark: fio ioengine=pvsync2 numjobs=4 direct=1
- Guest: 4 vCPUs with one virtio-blk device (4 virtqueues)
- Disk: Intel Corporation NVMe Datacenter SSD [Optane] [8086:2701]
- CPU: Intel(R) Xeon(R) Silver 4214 CPU @ 2.20GHz

rw  bs hipri=0 hipri=1
--
randread4k 149,426 170,763 +14%
randread   16k 118,939 134,269 +12%
randread   64k  34,886  34,906   0%
randread  128k  17,655  17,667   0%
randwrite   4k 138,578 163,600 +18%
randwrite  16k 102,089 120,950 +18%
randwrite  64k  32,364  32,561   0%
randwrite 128k  16,154  16,237   0%
read4k 146,032 170,620 +16%
read   16k 117,097 130,437 +11%
read   64k  34,834  35,037   0%
read  128k  17,680  17,658   0%
write   4k 134,562 151,422 +12%
write  16k 101,796 107,606  +5%
write  64k  32,364  32,594   0%
write 128k  16,259  16,265   0%

Larger block sizes do not benefit from polling as much but the
improvement is worthwhile for smaller block sizes.

Signed-off-by: Stefan Hajnoczi 
---
 drivers/block/virtio_blk.c | 92 +++---
 1 file changed, 87 insertions(+), 5 deletions(-)

diff --git a/drivers/block/virtio_blk.c b/drivers/block/virtio_blk.c
index fc0fb1dcd399..f0243dcd745a 100644
--- a/drivers/block/virtio_blk.c
+++ b/drivers/block/virtio_blk.c
@@ -29,6 +29,16 @@ static struct workqueue_struct *virtblk_wq;
 struct virtio_blk_vq {
struct virtqueue *vq;
spinlock_t lock;
+
+   /* Number of non-REQ_HIPRI requests in flight. Protected by lock. */
+   unsigned int num_lopri;
+
+   /* Number of REQ_HIPRI requests in flight. Protected by lock. */
+   unsigned int num_hipri;
+
+   /* Are vq notifications enabled? Protected by lock. */
+   bool cb_enabled;
+
char name[VQ_NAME_LEN];
 } cacheline_aligned_in_smp;
 
@@ -171,33 +181,67 @@ static inline void virtblk_request_done(struct request 
*req)
blk_mq_end_request(req, virtblk_result(vbr));
 }
 
-static void virtblk_done(struct virtqueue *vq)
+/* Returns true if one or more requests completed */
+static bool virtblk_complete_requests(struct virtqueue *vq)
 {
struct virtio_blk *vblk = vq->vdev->priv;
struct virtio_blk_vq *vbq = &vblk->vqs[vq->index];
bool req_done = false;
+   bool last_hipri_done = false;
struct virtblk_req *vbr;
unsigned long flags;
unsigned int len;
 
spin_lock_irqsave(&vbq->lock, flags);
+
do {
-   virtqueue_disable_cb(vq);
+   if (vbq->cb_enabled)
+   virtqueue_disable_cb(vq);
while ((vbr = virtqueue_get_buf(vq, &len)) != NULL) {
struct request *req = blk_mq_rq_from_pdu(vbr);
 
+   if (req->cmd_flags & REQ_HIPRI) {
+   if (--vbq->num_hipri == 0)
+   last_hipri_done = true;
+   } else
+   vbq->num_lopri--;
+
if (likely(!blk_should_fake_timeout(req->q)))
blk_mq_complete_request(req);
req_done = true;
}
if (unlikely(virtqueue_is_broken(vq)))
break;
-   } while (!virtqueue_enable_cb(vq));
+
+   /* Enable vq notifi

[PATCH 1/3] virtio: add virtioqueue_more_used()

2021-05-20 Thread Stefan Hajnoczi
Add an API to check whether there are pending used buffers. There is
already a similar API called virtqueue_poll() but it only works together
with virtqueue_enable_cb_prepare(). The patches that follow add blk-mq
->poll() support to virtio_blk and they need to check for used buffers
without re-enabling virtqueue callbacks, so introduce an API for it.

Signed-off-by: Stefan Hajnoczi 
---
 include/linux/virtio.h   |  2 ++
 drivers/virtio/virtio_ring.c | 17 +
 2 files changed, 19 insertions(+)

diff --git a/include/linux/virtio.h b/include/linux/virtio.h
index b1894e0323fa..c6ad0f25f412 100644
--- a/include/linux/virtio.h
+++ b/include/linux/virtio.h
@@ -63,6 +63,8 @@ bool virtqueue_kick_prepare(struct virtqueue *vq);
 
 bool virtqueue_notify(struct virtqueue *vq);
 
+bool virtqueue_more_used(const struct virtqueue *vq);
+
 void *virtqueue_get_buf(struct virtqueue *vq, unsigned int *len);
 
 void *virtqueue_get_buf_ctx(struct virtqueue *vq, unsigned int *len,
diff --git a/drivers/virtio/virtio_ring.c b/drivers/virtio/virtio_ring.c
index 71e16b53e9c1..7c3da75da462 100644
--- a/drivers/virtio/virtio_ring.c
+++ b/drivers/virtio/virtio_ring.c
@@ -2032,6 +2032,23 @@ static inline bool more_used(const struct 
vring_virtqueue *vq)
return vq->packed_ring ? more_used_packed(vq) : more_used_split(vq);
 }
 
+/**
+ * virtqueue_more_used - check if there are used buffers pending
+ * @_vq: the struct virtqueue we're talking about.
+ *
+ * Returns true if there are used buffers, false otherwise. May be called at
+ * the same time as other virtqueue operations, but actually calling
+ * virtqueue_get_buf() requires serialization so be mindful of the race between
+ * calling virtqueue_more_used() and virtqueue_get_buf().
+ */
+bool virtqueue_more_used(const struct virtqueue *_vq)
+{
+   struct vring_virtqueue *vq = to_vvq(_vq);
+
+   return more_used(vq);
+}
+EXPORT_SYMBOL_GPL(virtqueue_more_used);
+
 irqreturn_t vring_interrupt(int irq, void *_vq)
 {
struct vring_virtqueue *vq = to_vvq(_vq);
-- 
2.31.1



[PATCH 0/3] virtio_blk: blk-mq io_poll support

2021-05-20 Thread Stefan Hajnoczi
This patch series implements blk_mq_ops->poll() so REQ_HIPRI requests can be
polled. IOPS for 4k and 16k block sizes increases by 5-18% on a virtio-blk
device with 4 virtqueues backed by an NVMe drive.

- Benchmark: fio ioengine=pvsync2 numjobs=4 direct=1
- Guest: 4 vCPUs with one virtio-blk device (4 virtqueues)
- Disk: Intel Corporation NVMe Datacenter SSD [Optane] [8086:2701]
- CPU: Intel(R) Xeon(R) Silver 4214 CPU @ 2.20GHz

rw  bs hipri=0 hipri=1
--
randread4k 149,426 170,763 +14%
randread   16k 118,939 134,269 +12%
randread   64k  34,886  34,906   0%
randread  128k  17,655  17,667   0%
randwrite   4k 138,578 163,600 +18%
randwrite  16k 102,089 120,950 +18%
randwrite  64k  32,364  32,561   0%
randwrite 128k  16,154  16,237   0%
read4k 146,032 170,620 +16%
read   16k 117,097 130,437 +11%
read   64k  34,834  35,037   0%
read  128k  17,680  17,658   0%
write   4k 134,562 151,422 +12%
write  16k 101,796 107,606  +5%
write  64k  32,364  32,594   0%
write 128k  16,259  16,265   0%

Larger block sizes do not benefit from polling as much but the
improvement is worthwhile for smaller block sizes.

Stefan Hajnoczi (3):
  virtio: add virtioqueue_more_used()
  virtio_blk: avoid repeating vblk->vqs[qid]
  virtio_blk: implement blk_mq_ops->poll()

 include/linux/virtio.h   |   2 +
 drivers/block/virtio_blk.c   | 126 +--
 drivers/virtio/virtio_ring.c |  17 +
 3 files changed, 123 insertions(+), 22 deletions(-)

-- 
2.31.1



Re: [RFC PATCH V2 0/7] Do not read from descripto ring

2021-05-14 Thread Stefan Hajnoczi
On Fri, May 14, 2021 at 03:29:20PM +0800, Jason Wang wrote:
> On Fri, May 14, 2021 at 12:27 AM Stefan Hajnoczi  wrote:
> >
> > On Fri, Apr 23, 2021 at 04:09:35PM +0800, Jason Wang wrote:
> > > Sometimes, the driver doesn't trust the device. This usually
> > > happens for the encrypted VM or VDUSE[1].
> >
> > Thanks for doing this.
> >
> > Can you describe the overall memory safety model that virtio drivers
> > must follow?
> 
> My understanding is that, basically the driver should not trust the
> device (since the driver doesn't know what kind of device that it
> tries to drive)
> 
> 1) For any read only metadata (required at the spec level) which is
> mapped as coherent, driver should not depend on the metadata that is
> stored in a place that could be written by the device. This is what this
> series tries to achieve.
> 2) For other metadata that is produced by the device, need to make
> sure there's no malicious device triggered behavior, this is somehow
> similar to what vhost did. No DOS, loop, kernel bug and other stuffs.
> 3) swiotlb is a must to enforce memory access isolation. (VDUSE or encrypted 
> VM)
> 
> > For example:
> >
> > - Driver-to-device buffers must be on dedicated pages to avoid
> >   information leaks.
> 
> It looks to me if swiotlb is used, we don't need this since the
> bouncing is done at byte granularity, not page granularity.
> 
> But if swiotlb is not used, we need to enforce this.
> 
> >
> > - Device-to-driver buffers must be on dedicated pages to avoid memory
> >   corruption.
> 
> Similar to the above.
> 
> >
> > When I say "pages" I guess it's the IOMMU page size that matters?
> >
> 
> And the IOTLB page size.
> 
> > What is the memory access granularity of VDUSE?
> 
> It has an swiotlb, but the access and bouncing is done per byte.
> 
> >
> > I'm asking these questions because there is driver code that exposes
> > kernel memory to the device and I'm not sure it's safe. For example:
> >
> >   static int virtblk_add_req(struct virtqueue *vq, struct virtblk_req *vbr,
> >                              struct scatterlist *data_sg, bool have_data)
> >   {
> >           struct scatterlist hdr, status, *sgs[3];
> >           unsigned int num_out = 0, num_in = 0;
> >
> >           sg_init_one(&hdr, &vbr->out_hdr, sizeof(vbr->out_hdr));
> >                             ^^^^^^^^^^^^^
> >           sgs[num_out++] = &hdr;
> >
> >           if (have_data) {
> >                   if (vbr->out_hdr.type & cpu_to_virtio32(vq->vdev, VIRTIO_BLK_T_OUT))
> >                           sgs[num_out++] = data_sg;
> >                   else
> >                           sgs[num_out + num_in++] = data_sg;
> >           }
> >
> >           sg_init_one(&status, &vbr->status, sizeof(vbr->status));
> >
> >           sgs[num_out + num_in++] = &status;
> >
> >           return virtqueue_add_sgs(vq, sgs, num_out, num_in, vbr, GFP_ATOMIC);
> >   }
> >
> > I guess the drivers don't need to be modified as long as swiotlb is used
> > to bounce the buffers through "insecure" memory so that the memory
> > surrounding the buffers is not exposed?
> 
> Yes, swiotlb won't bounce the whole page. So I think it's safe.

Thanks Jason and Yongji Xie for clarifying. Seems like swiotlb or a
similar mechanism can handle byte-granularity isolation so the drivers
do not need to worry about information leaks or memory corruption outside
the mapped byte range.

We still need to audit virtio guest drivers to ensure they don't trust
data that can be modified by the device. I will look at virtio-blk and
virtio-fs next week.

Stefan
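
To make the audit criterion concrete, the pattern being looked for is sketched
below: any value the device can write (here the used length) is treated as
untrusted and clamped to what the driver itself posted. The structure and
helper names are made up for the example:

  static void my_consume(void *buf, unsigned int len);   /* hypothetical consumer */

  struct my_req {
          void *buf;
          unsigned int posted_len;        /* buffer size chosen by the driver */
  };

  static void my_handle_used(struct my_req *req, unsigned int used_len)
  {
          /*
           * used_len is written by the device; a buggy or malicious device
           * can report anything, so never let it exceed the length the
           * driver actually made available.
           */
          if (used_len > req->posted_len)
                  used_len = req->posted_len;

          my_consume(req->buf, used_len);
  }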



Re: [RFC PATCH V2 0/7] Do not read from descripto ring

2021-05-13 Thread Stefan Hajnoczi
On Fri, Apr 23, 2021 at 04:09:35PM +0800, Jason Wang wrote:
> Sometimes, the driver doesn't trust the device. This usually
> happens for the encrypted VM or VDUSE[1].

Thanks for doing this.

Can you describe the overall memory safety model that virtio drivers
must follow? For example:

- Driver-to-device buffers must be on dedicated pages to avoid
  information leaks.

- Device-to-driver buffers must be on dedicated pages to avoid memory
  corruption.

When I say "pages" I guess it's the IOMMU page size that matters?

What is the memory access granularity of VDUSE?

I'm asking these questions because there is driver code that exposes
kernel memory to the device and I'm not sure it's safe. For example:

  static int virtblk_add_req(struct virtqueue *vq, struct virtblk_req *vbr,
                             struct scatterlist *data_sg, bool have_data)
  {
          struct scatterlist hdr, status, *sgs[3];
          unsigned int num_out = 0, num_in = 0;

          sg_init_one(&hdr, &vbr->out_hdr, sizeof(vbr->out_hdr));
                            ^^^^^^^^^^^^^
          sgs[num_out++] = &hdr;

          if (have_data) {
                  if (vbr->out_hdr.type & cpu_to_virtio32(vq->vdev, VIRTIO_BLK_T_OUT))
                          sgs[num_out++] = data_sg;
                  else
                          sgs[num_out + num_in++] = data_sg;
          }

          sg_init_one(&status, &vbr->status, sizeof(vbr->status));
                               ^^^^^^^^^^^^
          sgs[num_out + num_in++] = &status;

          return virtqueue_add_sgs(vq, sgs, num_out, num_in, vbr, GFP_ATOMIC);
  }

I guess the drivers don't need to be modified as long as swiotlb is used
to bounce the buffers through "insecure" memory so that the memory
surrounding the buffers is not exposed?

Stefan



Re: [RFC PATCH] virtio-vsock: add description for datagram type

2021-05-13 Thread Stefan Hajnoczi
On Thu, May 06, 2021 at 04:56:48PM -0700, Cong Wang . wrote:
> On Mon, Apr 26, 2021 at 9:07 AM Stefan Hajnoczi  wrote:
> >
> > On Wed, Apr 14, 2021 at 05:43:52PM -0700, Cong Wang . wrote:
> > > On Wed, Mar 17, 2021 at 8:45 AM Stefan Hajnoczi  
> > > wrote:
> > > >
> > > > Please stick to UDP semantics as much as possible so that applications
> > > > can be ported easily and developers aren't surprised by unexpected
> > > > behavior. UDP packets sent to a destination that has no listen socket
> > > > result in a Connection Refused error. vsock dgrams should behave in the
> > > > same way.
> > >
> > > There is no connection refused error for UDP, as it is clearly 
> > > connectionless.
> > > What you suggested might be an ICMP unreachable error, however, vsock does
> > > not have anything equivalent. So, I think there is no error returned
> > > in this scenario
> > > for vsock datagram.
> >
> > Yes, exactly. UDP uses ICMP unreachable:
> >
> >   16:55:40.380292 IP 127.0.0.1.41519 > 127.0.0.1.1234: UDP, length 5
> >   16:55:40.380308 IP 127.0.0.1 > 127.0.0.1: ICMP 127.0.0.1 udp port 1234 
> > unreachable, length 41
> >
> > This is mentioned a bit here:
> > https://tools.ietf.org/html/rfc8085#section-5.2
> >
> > Here is how Python's socket module produces an ConnectionRefused
> > exception:
> >
> >   socket(AF_INET, SOCK_DGRAM|SOCK_CLOEXEC, IPPROTO_IP) = 3
> >   sendto(3, "hello", 5, 0, {sa_family=AF_INET, sin_port=htons(1234), 
> > sin_addr=inet_addr("127.0.0.1")}, 16) = 5
> >   getsockname(3, {sa_family=AF_INET, sin_port=htons(41519), 
> > sin_addr=inet_addr("0.0.0.0")}, [16]) = 0
> >   getpeername(3, 0x7ffc43c99a20, [16])= -1 ENOTCONN (Transport endpoint 
> > is not connected)
> 
> Yes, we are all aware of these UDP behaviors.
> 
> >
> > UDP itself may not carry this information but applications do rely on
> > the ICMP information as shown above.
> >
> > The Linux network stack implementation is here:
> > net/ipv4/udp.c:__udp4_lib_err().
> 
> Sure, but applications using UDP when porting to vsock dgram, they
> must be aware AF_VSOCK has totally different protocols from AF_INET,
> hence behaves differently from AF_INET. Therefore, it is not necessary
> to do everything UDP does here, particularly not necessary to mimic ICMP
> in vsock. In short, I don't think there is any standard to enforce what
> dgram sockets must behave, each address family should have its right
> to behave reasonably differently.

They can behave differently but at the cost of making it harder to port
applications. Sending a RST packet in vsock is easy and allows
applications to get the same behavior as with UDP/ICMP. I don't see a
reason to avoid it.

Stefan
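
For reference, the UDP behaviour discussed above is easy to reproduce from C:
once the ICMP port-unreachable comes back, the next operation on a connected
UDP socket fails with ECONNREFUSED. The port below is assumed to have no
listener:

  #include <arpa/inet.h>
  #include <errno.h>
  #include <stdio.h>
  #include <sys/socket.h>
  #include <unistd.h>

  int main(void)
  {
          struct sockaddr_in dst = {
                  .sin_family = AF_INET,
                  .sin_port = htons(1234),        /* assumed closed port */
          };
          int fd = socket(AF_INET, SOCK_DGRAM, 0);
          char c;

          inet_pton(AF_INET, "127.0.0.1", &dst.sin_addr);
          connect(fd, (struct sockaddr *)&dst, sizeof(dst));

          send(fd, "hello", 5, 0);        /* triggers ICMP port unreachable */
          sleep(1);                       /* give the error time to arrive */

          if (recv(fd, &c, 1, MSG_DONTWAIT) < 0 && errno == ECONNREFUSED)
                  printf("ECONNREFUSED, matching the strace output above\n");

          close(fd);
          return 0;
  }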



Re: [RFC][PATCH] vhost/vsock: Add vsock_list file to map cid with vhost tasks

2021-05-13 Thread Stefan Hajnoczi
On Wed, May 05, 2021 at 04:38:55PM -0400, Steven Rostedt wrote:
> The new trace-cmd 3.0 (which is almost ready to be released) allows for
> tracing between host and guests with timestamp synchronization such that
> the events on the host and the guest can be interleaved in the proper order
> that they occur. KernelShark now has a plugin that visualizes this
> interaction.
> 
> The implementation requires that the guest has a vsock CID assigned, and on
> the guest a "trace-cmd agent" is running, that will listen on a port for
> the CID. Then on the host a "trace-cmd record -A guest@cid:port -e events"
> can be called and the host will connect to the guest agent through the
> cid/port pair and have the agent enable tracing on behalf of the host and
> send the trace data back down to it.
> 
> The problem is that there is no sure fire way to find the CID for a guest.
> Currently, the user must know the cid, or we have a hack that looks for the
> qemu process and parses the --guest-cid parameter from it. But this is
> prone to error and does not work on other implementation (was told that
> crosvm does not use qemu).

The crosvm command-line syntax is: crosvm run --cid 

> As I can not find a way to discover CIDs assigned to guests via any kernel
> interface, I decided to create this one. Note, I'm not attached to it. If
> there's a better way to do this, I would love to have it. But since I'm not
> an expert in the networking layer nor virtio, I decided to stick to what I
> know and add a debugfs interface that simply lists all the registered CIDs
> and the worker task that they are associated with. The worker task at
> least has the PID of the task it represents.
> 
> Now I can find the cid / host process in charge of the guest pair:
> 
>   # cat /sys/kernel/debug/vsock_list
>   3   vhost-1954:2002
> 
>   # ps aux | grep 1954
>   qemu1954  9.9 21.3 1629092 796148 ?  Sl   16:22   0:58  
> /usr/bin/qemu-kvm -name guest=Fedora21,debug-threads=on -S -object 
> secret,id=masterKey0,format=raw,file=/var/lib/libvirt/qemu/domain-1-Fedora21/master-key.aes
>  -machine pc-1.2,accel=kvm,usb=off,dump-guest-core=off -cpu qemu64 -m 1000 
> -overcommit mem-lock=off -smp 2,sockets=2,cores=1,threads=1 -uuid 
> 1eefeeb0-3ac7-07c1-926e-236908313b4c -no-user-config -nodefaults -chardev 
> socket,id=charmonitor,fd=32,server,nowait -mon 
> chardev=charmonitor,id=monitor,mode=control -rtc base=utc -no-shutdown -boot 
> strict=on -device piix3-usb-uhci,id=usb,bus=pci.0,addr=0x1.0x2 -device 
> virtio-serial-pci,id=virtio-serial0,bus=pci.0,addr=0x6 -blockdev 
> {"driver":"host_device","filename":"/dev/mapper/vg_bxtest-GuestFedora","node-name":"libvirt-1-storage","auto-read-only":true,"discard":"unmap"}
>  -blockdev 
> {"node-name":"libvirt-1-format","read-only":false,"driver":"raw","file":"libvirt-1-storage"}
>  -device ide-hd,bus=ide.0,unit=0,drive=libvirt-1-
>  format,id=ide0-0-0,bootindex=1 -netdev tap,fd=34,id=hostnet0 -device 
> rtl8139,netdev=hostnet0,id=net0,mac=52:54:00:9f:e9:d5,bus=pci.0,addr=0x3 
> -netdev tap,fd=35,id=hostnet1 -device 
> virtio-net-pci,netdev=hostnet1,id=net1,mac=52:54:00:ec:dc:6e,bus=pci.0,addr=0x5
>  -chardev pty,id=charserial0 -device 
> isa-serial,chardev=charserial0,id=serial0 -chardev 
> pipe,id=charchannel0,path=/var/lib/trace-cmd/virt/Fedora21/trace-pipe-cpu0 
> -device 
> virtserialport,bus=virtio-serial0.0,nr=1,chardev=charchannel0,id=channel0,name=trace-pipe-cpu0
>  -chardev 
> pipe,id=charchannel1,path=/var/lib/trace-cmd/virt/Fedora21/trace-pipe-cpu1 
> -device 
> virtserialport,bus=virtio-serial0.0,nr=2,chardev=charchannel1,id=channel1,name=trace-pipe-cpu1
>  -vnc 127.0.0.1:0 -device cirrus-vga,id=video0,bus=pci.0,addr=0x2 -device 
> virtio-balloon-pci,id=balloon0,bus=pci.0,addr=0x4 -sandbox 
> on,obsolete=deny,elevateprivileges=deny,spawn=deny,resourcecontrol=deny 
> -device vhost-vsock-pci,id=vsock0,guest-cid=3,vhostfd=16,bus=pci.0,addr=0x7 
> -msg 
>  timestamp=on
>   root2000  0.0  0.0  0 0 ?S16:22   0:00 
> [kvm-pit/1954]
>   root2002  0.0  0.0  0 0 ?S16:22   0:00 
> [vhost-1954]

This approach relies on the process hierarchy of the VMM (QEMU).
Multi-process QEMU is in development and will allow VIRTIO devices to
run as separate processes from the main QEMU. It then becomes harder to
correlate a VIRTIO device process with its QEMU process.

So I think in the end this approach ends up being as fragile as parsing
command-lines. The kernel doesn't really have the concept of a "VM" that
the vhost_vsock is associated with :). Maybe just parse QEMU and crosvm
command-lines?

Stefan



Re: [External] Re: [RFC v2] virtio-vsock: add description for datagram type

2021-05-13 Thread Stefan Hajnoczi
On Mon, Apr 12, 2021 at 03:39:36PM -0700, Jiang Wang . wrote:
> On Mon, Apr 12, 2021 at 6:50 AM Stefan Hajnoczi  wrote:
> > On Thu, Apr 01, 2021 at 04:36:02AM +, jiang.wang wrote:
> > > +For the rx side, dgram also uses the \field{buf_alloc}. If it is full, 
> > > the packet
> > > +is dropped by the receiver.
> >
> > UDP is connectionless so any number of other sources can send messages
> > to the same destination, causing buf_alloc's value to be unpredictable.
> > Can you explain how buf_alloc works with datagram sockets in more
> > detail?
> 
> In the linux kernel in my prototype, datagram sockets also use
> virtio_transport_inc_rx_pkt() and virtio_transport_dec_rx_pkt() to update
> vvs->rx_bytes and compare it with vvs->buf_alloc. virtio_transport_inc_rx_pkt
> is called when enqueuing the datagram packets.
> virtio_transport_dec_rx_pkt is called when dequeuing those packets.
> 
> If multiple sources send messages to the same destination, they will share
> the same buf_alloc. Do you want something with more control?
> Maybe somehow allocate a buffer for each remote CID and port? But I
> feel that is a little bit overkill. Any other suggestions?

The opposite, I want less control :). It's not possible to track buffer
space accurately with connectionless sockets, so let's not try.

A real example is the iperf3 networking benchmark where you need to set
target bitrate for UDP sockets because there is no flow control (buffer
space) mechanism in UDP. That's how UDP applications work and for
similar reasons I don't think it's possible to achieve flow control with
vsock's buf_alloc.
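
To picture the drop-on-full behaviour described in the quoted reply, a
self-contained sketch of a best-effort datagram receive path is shown below.
The socket structure and field names are illustrative only, not the kernel's
actual virtio_vsock_sock:

  #include <linux/errno.h>
  #include <linux/skbuff.h>
  #include <linux/spinlock.h>

  struct dgram_sock {
          spinlock_t rx_lock;
          unsigned int rx_bytes;          /* bytes currently queued */
          unsigned int buf_alloc;         /* receive buffer budget */
          struct sk_buff_head rx_queue;
  };

  /* Returns 0 if queued, -ENOMEM if the datagram was dropped. */
  static int dgram_rx_enqueue(struct dgram_sock *ds, struct sk_buff *skb)
  {
          int err = 0;

          spin_lock_bh(&ds->rx_lock);
          if (ds->rx_bytes + skb->len > ds->buf_alloc) {
                  err = -ENOMEM;                  /* no credit scheme: just drop */
          } else {
                  ds->rx_bytes += skb->len;       /* cf. virtio_transport_inc_rx_pkt() */
                  __skb_queue_tail(&ds->rx_queue, skb);
          }
          spin_unlock_bh(&ds->rx_lock);

          if (err)
                  kfree_skb(skb);
          return err;
  }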

> > >  \drivernormative{\paragraph}{Device Operation: Buffer Space 
> > > Management}{Device Types / Socket Device / Device Operation / Buffer 
> > > Space Management}
> > > -VIRTIO_VSOCK_OP_RW data packets MUST only be transmitted when the peer 
> > > has
> > > -sufficient free buffer space for the payload.
> > > +For stream sockets, VIRTIO_VSOCK_OP_RW data packets MUST only be 
> > > transmitted when the peer has
> > > +sufficient free buffer space for the payload. For dgram sockets, 
> > > VIRTIO_VSOCK_OP_RW data packets
> > > +MAY be transmitted when the peer buffer is full. Then the packet will be 
> > > dropped by the receiver.
> > >
> > >  All packets associated with a stream flow MUST contain valid information 
> > > in
> > >  \field{buf_alloc} and \field{fwd_cnt} fields.
> > >
> > >  \devicenormative{\paragraph}{Device Operation: Buffer Space 
> > > Management}{Device Types / Socket Device / Device Operation / Buffer 
> > > Space Management}
> > > -VIRTIO_VSOCK_OP_RW data packets MUST only be transmitted when the peer 
> > > has
> > > -sufficient free buffer space for the payload.
> > > +For stream sockets, VIRTIO_VSOCK_OP_RW data packets MUST only be 
> > > transmitted when the peer has
> > > +sufficient free buffer space for the payload. For dgram sockets, 
> > > VIRTIO_VSOCK_OP_RW data packets
> > > +MAY be transmitted when the peer buffer is full. Then the packet will be 
> > > dropped by the receiver.
> > >
> > >  All packets associated with a stream flow MUST contain valid information 
> > > in
> > >  \field{buf_alloc} and \field{fwd_cnt} fields.
> > > @@ -203,14 +234,14 @@ \subsubsection{Receive and 
> > > Transmit}\label{sec:Device Types / Socket Device / De
> > >  The \field{guest_cid} configuration field MUST be used as the source CID 
> > > when
> > >  sending outgoing packets.
> > >
> > > -A VIRTIO_VSOCK_OP_RST reply MUST be sent if a packet is received with an
> > > +For stream sockets, A VIRTIO_VSOCK_OP_RST reply MUST be sent if a packet 
> > > is received with an
> > >  unknown \field{type} value.
> >
> > What about datagram sockets? Please state what must happen and why.
> 
> I think datagram sockets should do the same thing as the stream to
> return a VIRTIO_VSOCK_OP_RST?
> Any other ideas?

Sounds good to me. I just wanted to check and request that you add that
to the spec.


