Re: [RFC v6 2/4] virtio-blk: add zoned storage emulation for zoned devices

2023-01-31 Thread Stefan Hajnoczi
On Mon, Jan 30, 2023 at 06:30:16PM +, Daniel P. Berrangé wrote:
> On Mon, Jan 30, 2023 at 10:17:48AM -0500, Stefan Hajnoczi wrote:
> > On Mon, 30 Jan 2023 at 07:33, Daniel P. Berrangé wrote:
> > >
> > > On Sun, Jan 29, 2023 at 06:39:49PM +0800, Sam Li wrote:
> > > > This patch extends virtio-blk emulation to handle zoned device commands
> > > > by calling the new block layer APIs to perform zoned device I/O on
> > > > behalf of the guest. It supports Report Zone, four zone operations (open,
> > > > close, finish, reset), and Append Zone.
> > > >
> > > > The VIRTIO_BLK_F_ZONED feature bit will only be set if the host does
> > > > support zoned block devices. It will not be set for regular block devices
> > > > (conventional zones).
> > > >
> > > > The guest OS can use blktests and fio to test those commands on zoned devices.
> > > > Furthermore, using zonefs to test zone append write is also supported.
> > > >
> > > > Signed-off-by: Sam Li
> > > > ---
> > > >  hw/block/virtio-blk-common.c |   2 +
> > > >  hw/block/virtio-blk.c        | 394 +++
> > > >  2 files changed, 396 insertions(+)
> > > >
> > >
> > > > @@ -949,6 +1311,30 @@ static void virtio_blk_update_config(VirtIODevice *vdev, uint8_t *config)
> > > >          blkcfg.write_zeroes_may_unmap = 1;
> > > >          virtio_stl_p(vdev, &blkcfg.max_write_zeroes_seg, 1);
> > > >      }
> > > > +    if (bs->bl.zoned != BLK_Z_NONE) {
> > > > +        switch (bs->bl.zoned) {
> > > > +        case BLK_Z_HM:
> > > > +            blkcfg.zoned.model = VIRTIO_BLK_Z_HM;
> > > > +            break;
> > > > +        case BLK_Z_HA:
> > > > +            blkcfg.zoned.model = VIRTIO_BLK_Z_HA;
> > > > +            break;
> > > > +        default:
> > > > +            g_assert_not_reached();
> > > > +        }
> > > > +
> > > > +        virtio_stl_p(vdev, &blkcfg.zoned.zone_sectors,
> > > > +                     bs->bl.zone_size / 512);
> > > > +        virtio_stl_p(vdev, &blkcfg.zoned.max_active_zones,
> > > > +                     bs->bl.max_active_zones);
> > > > +        virtio_stl_p(vdev, &blkcfg.zoned.max_open_zones,
> > > > +                     bs->bl.max_open_zones);
> > > > +        virtio_stl_p(vdev, &blkcfg.zoned.write_granularity, blk_size);
> > > > +        virtio_stl_p(vdev, &blkcfg.zoned.max_append_sectors,
> > > > +                     bs->bl.max_append_sectors);
> > >
> > > So these are all ABI sensitive frontend device settings, but they are
> > > not exposed as tunables on the virtio-blk device, instead they are
> > > implicitly set from the backend.
> > >
> > > We have done this kind of thing before in QEMU, but several times it
> > > has bitten QEMU maintainers/users, as having a backend affect the
> > > frontend ABI is not too typical. It wouldn't be immediately obvious
> > > when starting QEMU on a target host that the live migration would
> > > be breaking ABI if the target host wasn't using a zoned device with
> > > the exact same settings.
> > >
> > > This also limits mgmt flexibility across live migration, if the
> > > mgmt app wants/needs to change the storage backend, e.g. maybe they
> > > need to evacuate the host for an emergency, but don't have spare
> > > hosts with the same kind of storage. It might be desirable to migrate
> > > and switch to a plain block device or raw/qcow2 file, rather than
> > > let the VM die.
> > >
> > > Can we make these virtio settings explicitly controlled on the
> > > virtio-blk device? If not specified explicitly they could be
> > > auto-populated from the backend for ease of use, but if specified
> > > then simply validate the backend is a match. libvirt would then
> > > make sure these are always explicitly set on the frontend.
> > 
> > I think this is a good idea, especially if we streamline the
> > file-posix.c driver by merging --blockdev zoned_host_device into
> > --blockdev host_device. It won't be obvious from the command-line
> > whether this is a zoned or non-zoned device. There should be a
> > --device virtio-blk-pci,drive=drive0,zoned=on option that fails when
> > drive0 isn't zoned. It should probably be on/off/auto where auto is
> > the default and doesn't check anything, on requires a zoned device,
> > and off requires a non-zoned device. That will prevent accidental
> > migration between zoned/non-zoned devices.
> > 
> > I want to point out that virtio-blk doesn't have checks for the disk
> > size or other details, so what you're suggesting for zone_sectors, etc
> > is stricter than what QEMU does today. Since the virtio-blk parameters
> > you're proposing are optional, I think it doesn't hurt though.
> 
> Yeah, it is slightly different than some of the parameters handling.
> I guess you could say that with disk capacity, matching size is a
> fairly obvious constraint/expectation to manage, and also long standing. 
> 
> With disk capacity, you can add the 'raw' driver on top of any block
> driver stack, to apply an arbitrary offset+size, to make the storage
> smaller than it otherwise is on disk. Conceptually that could have
> been done on 

Re: [RFC v6 2/4] virtio-blk: add zoned storage emulation for zoned devices

2023-01-30 Thread Daniel P . Berrangé
On Mon, Jan 30, 2023 at 10:17:48AM -0500, Stefan Hajnoczi wrote:
> On Mon, 30 Jan 2023 at 07:33, Daniel P. Berrangé  wrote:
> >
> > On Sun, Jan 29, 2023 at 06:39:49PM +0800, Sam Li wrote:
> > > This patch extends virtio-blk emulation to handle zoned device commands
> > > by calling the new block layer APIs to perform zoned device I/O on
> > > behalf of the guest. It supports Report Zone, four zone operations (open,
> > > close, finish, reset), and Append Zone.
> > >
> > > The VIRTIO_BLK_F_ZONED feature bit will only be set if the host does
> > > support zoned block devices. It will not be set for regular block devices
> > > (conventional zones).
> > >
> > > The guest OS can use blktests and fio to test those commands on zoned devices.
> > > Furthermore, using zonefs to test zone append write is also supported.
> > >
> > > Signed-off-by: Sam Li
> > > ---
> > >  hw/block/virtio-blk-common.c |   2 +
> > >  hw/block/virtio-blk.c        | 394 +++
> > >  2 files changed, 396 insertions(+)
> > >
> >
> > > @@ -949,6 +1311,30 @@ static void virtio_blk_update_config(VirtIODevice *vdev, uint8_t *config)
> > >          blkcfg.write_zeroes_may_unmap = 1;
> > >          virtio_stl_p(vdev, &blkcfg.max_write_zeroes_seg, 1);
> > >      }
> > > +    if (bs->bl.zoned != BLK_Z_NONE) {
> > > +        switch (bs->bl.zoned) {
> > > +        case BLK_Z_HM:
> > > +            blkcfg.zoned.model = VIRTIO_BLK_Z_HM;
> > > +            break;
> > > +        case BLK_Z_HA:
> > > +            blkcfg.zoned.model = VIRTIO_BLK_Z_HA;
> > > +            break;
> > > +        default:
> > > +            g_assert_not_reached();
> > > +        }
> > > +
> > > +        virtio_stl_p(vdev, &blkcfg.zoned.zone_sectors,
> > > +                     bs->bl.zone_size / 512);
> > > +        virtio_stl_p(vdev, &blkcfg.zoned.max_active_zones,
> > > +                     bs->bl.max_active_zones);
> > > +        virtio_stl_p(vdev, &blkcfg.zoned.max_open_zones,
> > > +                     bs->bl.max_open_zones);
> > > +        virtio_stl_p(vdev, &blkcfg.zoned.write_granularity, blk_size);
> > > +        virtio_stl_p(vdev, &blkcfg.zoned.max_append_sectors,
> > > +                     bs->bl.max_append_sectors);
> >
> > So these are all ABI sensitive frontend device settings, but they are
> > not exposed as tunables on the virtio-blk device, instead they are
> > implicitly set from the backend.
> >
> > We have done this kind of thing before in QEMU, but several times it
> > has bitten QEMU maintainers/users, as having a backend affect the
> > frontend ABI is not too typical. It wouldn't be immediately obvious
> > when starting QEMU on a target host that the live migration would
> > be breaking ABI if the target host wasn't using a zoned device with
> > the exact same settings.
> >
> > This also limits mgmt flexibility across live migration, if the
> > mgmt app wants/needs to change the storage backend, e.g. maybe they
> > need to evacuate the host for an emergency, but don't have spare
> > hosts with the same kind of storage. It might be desirable to migrate
> > and switch to a plain block device or raw/qcow2 file, rather than
> > let the VM die.
> >
> > Can we make these virtio settings explicitly controlled on the
> > virtio-blk device? If not specified explicitly they could be
> > auto-populated from the backend for ease of use, but if specified
> > then simply validate the backend is a match. libvirt would then
> > make sure these are always explicitly set on the frontend.
> 
> I think this is a good idea, especially if we streamline the
> file-posix.c driver by merging --blockdev zoned_host_device into
> --blockdev host_device. It won't be obvious from the command-line
> whether this is a zoned or non-zoned device. There should be a
> --device virtio-blk-pci,drive=drive0,zoned=on option that fails when
> drive0 isn't zoned. It should probably be on/off/auto where auto is
> the default and doesn't check anything, on requires a zoned device,
> and off requires a non-zoned device. That will prevent accidental
> migration between zoned/non-zoned devices.
> 
> I want to point out that virtio-blk doesn't have checks for the disk
> size or other details, so what you're suggesting for zone_sectors, etc
> is stricter than what QEMU does today. Since the virtio-blk parameters
> you're proposing are optional, I think it doesn't hurt though.

Yeah, it is slightly different than some of the parameters handling.
I guess you could say that with disk capacity, matching size is a
fairly obvious constraint/expectation to manage, and also long standing. 

With disk capacity, you can add the 'raw' driver on top of any block
driver stack, to apply an arbitrary offset+size, to make the storage
smaller than it otherwise is on disk. Conceptually that could have
been done on the frontend device(s) too, but I guess it made more
sense to do it in the block layer to give consistent enforcement
of the limits across frontends. It is fuzzy whether such a use of
the 'raw' driver is really considered backend config, as opposed to
frontend config but to me 

Re: [RFC v6 2/4] virtio-blk: add zoned storage emulation for zoned devices

2023-01-30 Thread Stefan Hajnoczi
On Mon, 30 Jan 2023 at 07:33, Daniel P. Berrangé  wrote:
>
> On Sun, Jan 29, 2023 at 06:39:49PM +0800, Sam Li wrote:
> > This patch extends virtio-blk emulation to handle zoned device commands
> > by calling the new block layer APIs to perform zoned device I/O on
> > behalf of the guest. It supports Report Zone, four zone operations (open,
> > close, finish, reset), and Append Zone.
> >
> > The VIRTIO_BLK_F_ZONED feature bit will only be set if the host does
> > support zoned block devices. It will not be set for regular block devices
> > (conventional zones).
> >
> > The guest OS can use blktests and fio to test those commands on zoned devices.
> > Furthermore, using zonefs to test zone append write is also supported.
> >
> > Signed-off-by: Sam Li
> > ---
> >  hw/block/virtio-blk-common.c |   2 +
> >  hw/block/virtio-blk.c        | 394 +++
> >  2 files changed, 396 insertions(+)
> >
>
> > @@ -949,6 +1311,30 @@ static void virtio_blk_update_config(VirtIODevice *vdev, uint8_t *config)
> >          blkcfg.write_zeroes_may_unmap = 1;
> >          virtio_stl_p(vdev, &blkcfg.max_write_zeroes_seg, 1);
> >      }
> > +    if (bs->bl.zoned != BLK_Z_NONE) {
> > +        switch (bs->bl.zoned) {
> > +        case BLK_Z_HM:
> > +            blkcfg.zoned.model = VIRTIO_BLK_Z_HM;
> > +            break;
> > +        case BLK_Z_HA:
> > +            blkcfg.zoned.model = VIRTIO_BLK_Z_HA;
> > +            break;
> > +        default:
> > +            g_assert_not_reached();
> > +        }
> > +
> > +        virtio_stl_p(vdev, &blkcfg.zoned.zone_sectors,
> > +                     bs->bl.zone_size / 512);
> > +        virtio_stl_p(vdev, &blkcfg.zoned.max_active_zones,
> > +                     bs->bl.max_active_zones);
> > +        virtio_stl_p(vdev, &blkcfg.zoned.max_open_zones,
> > +                     bs->bl.max_open_zones);
> > +        virtio_stl_p(vdev, &blkcfg.zoned.write_granularity, blk_size);
> > +        virtio_stl_p(vdev, &blkcfg.zoned.max_append_sectors,
> > +                     bs->bl.max_append_sectors);
>
> So these are all ABI sensitive frontend device settings, but they are
> not exposed as tunables on the virtio-blk device, instead they are
> implicitly set from the backend.
>
> We have done this kind of thing before in QEMU, but several times it
> has bitten QEMU maintainers/users, as having a backend affect the
> frontend ABI is not too typical. It wouldn't be immediately obvious
> when starting QEMU on a target host that the live migration would
> be breaking ABI if the target host wasn't using a zoned device with
> the exact same settings.
>
> This also limits mgmt flexibility across live migration, if the
> mgmt app wants/needs to change the storage backend, e.g. maybe they
> need to evacuate the host for an emergency, but don't have spare
> hosts with the same kind of storage. It might be desirable to migrate
> and switch to a plain block device or raw/qcow2 file, rather than
> let the VM die.
>
> Can we make these virtio settings explicitly controlled on the
> virtio-blk device? If not specified explicitly they could be
> auto-populated from the backend for ease of use, but if specified
> then simply validate the backend is a match. libvirt would then
> make sure these are always explicitly set on the frontend.

I think this is a good idea, especially if we streamline the
file-posix.c driver by merging --blockdev zoned_host_device into
--blockdev host_device. It won't be obvious from the command-line
whether this is a zoned or non-zoned device. There should be a
--device virtio-blk-pci,drive=drive0,zoned=on option that fails when
drive0 isn't zoned. It should probably be on/off/auto where auto is
the default and doesn't check anything, on requires a zoned device,
and off requires a non-zoned device. That will prevent accidental
migration between zoned/non-zoned devices.
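
To make the on/off/auto semantics above concrete, here is a minimal
sketch. The ZonedMode enum and check_zoned_mode() helper are hypothetical
names used only for illustration and are not part of the patch; a real
implementation would presumably be a qdev property on virtio-blk that
reports failures through Error **. The backend_is_zoned argument stands in
for (bs->bl.zoned != BLK_Z_NONE).

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical tri-state for a zoned= property on virtio-blk. */
typedef enum {
    ZONED_MODE_AUTO, /* default: accept whatever the backend is */
    ZONED_MODE_ON,   /* require a zoned backend */
    ZONED_MODE_OFF,  /* require a non-zoned backend */
} ZonedMode;

/*
 * Return true if the frontend setting is compatible with the backend.
 * backend_is_zoned is an assumption standing in for the block layer's
 * zoned model check.
 */
static bool check_zoned_mode(ZonedMode mode, bool backend_is_zoned)
{
    switch (mode) {
    case ZONED_MODE_AUTO:
        return true;
    case ZONED_MODE_ON:
        return backend_is_zoned;
    case ZONED_MODE_OFF:
        return !backend_is_zoned;
    }
    /* Not reached for valid enum values. */
    assert(!"invalid ZonedMode");
    return false;
}
```

With this shape, zoned=on against a conventional drive0 and zoned=off
against a zoned drive0 are both rejected at device realize time, while
auto never fails.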

I want to point out that virtio-blk doesn't have checks for the disk
size or other details, so what you're suggesting for zone_sectors, etc
is stricter than what QEMU does today. Since the virtio-blk parameters
you're proposing are optional, I think it doesn't hurt though.

Stefan



Re: [RFC v6 2/4] virtio-blk: add zoned storage emulation for zoned devices

2023-01-30 Thread Daniel P . Berrangé
On Sun, Jan 29, 2023 at 06:39:49PM +0800, Sam Li wrote:
> This patch extends virtio-blk emulation to handle zoned device commands
> by calling the new block layer APIs to perform zoned device I/O on
> behalf of the guest. It supports Report Zone, four zone operations (open,
> close, finish, reset), and Append Zone.
>
> The VIRTIO_BLK_F_ZONED feature bit will only be set if the host does
> support zoned block devices. It will not be set for regular block devices
> (conventional zones).
>
> The guest OS can use blktests and fio to test those commands on zoned devices.
> Furthermore, using zonefs to test zone append write is also supported.
>
> Signed-off-by: Sam Li
> ---
>  hw/block/virtio-blk-common.c |   2 +
>  hw/block/virtio-blk.c        | 394 +++
>  2 files changed, 396 insertions(+)
> 

> @@ -949,6 +1311,30 @@ static void virtio_blk_update_config(VirtIODevice *vdev, uint8_t *config)
>          blkcfg.write_zeroes_may_unmap = 1;
>          virtio_stl_p(vdev, &blkcfg.max_write_zeroes_seg, 1);
>      }
> +    if (bs->bl.zoned != BLK_Z_NONE) {
> +        switch (bs->bl.zoned) {
> +        case BLK_Z_HM:
> +            blkcfg.zoned.model = VIRTIO_BLK_Z_HM;
> +            break;
> +        case BLK_Z_HA:
> +            blkcfg.zoned.model = VIRTIO_BLK_Z_HA;
> +            break;
> +        default:
> +            g_assert_not_reached();
> +        }
> +
> +        virtio_stl_p(vdev, &blkcfg.zoned.zone_sectors,
> +                     bs->bl.zone_size / 512);
> +        virtio_stl_p(vdev, &blkcfg.zoned.max_active_zones,
> +                     bs->bl.max_active_zones);
> +        virtio_stl_p(vdev, &blkcfg.zoned.max_open_zones,
> +                     bs->bl.max_open_zones);
> +        virtio_stl_p(vdev, &blkcfg.zoned.write_granularity, blk_size);
> +        virtio_stl_p(vdev, &blkcfg.zoned.max_append_sectors,
> +                     bs->bl.max_append_sectors);

So these are all ABI sensitive frontend device settings, but they are
not exposed as tunables on the virtio-blk device, instead they are
implicitly set from the backend.

We have done this kind of thing before in QEMU, but several times it
has bitten QEMU maintainers/users, as having a backend affect the
frontend ABI is not too typical. It wouldn't be immediately obvious
when starting QEMU on a target host that the live migration would
be breaking ABI if the target host wasn't using a zoned device with
the exact same settings.

This also limits mgmt flexibility across live migration, if the
mgmt app wants/needs to change the storage backend, e.g. maybe they
need to evacuate the host for an emergency, but don't have spare
hosts with the same kind of storage. It might be desirable to migrate
and switch to a plain block device or raw/qcow2 file, rather than
let the VM die.

Can we make these virtio settings explicitly controlled on the
virtio-blk device? If not specified explicitly they could be
auto-populated from the backend for ease of use, but if specified
then simply validate the backend is a match. libvirt would then
make sure these are always explicitly set on the frontend.
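
As a rough illustration of the "auto-populate if unset, otherwise
validate" behaviour suggested above, consider the sketch below. Everything
in it is an assumption made for the example: the ZonedLimits struct, the
ZONED_PROP_UNSET sentinel and the resolve_* helpers are hypothetical names,
not existing QEMU code, and the sentinel is UINT32_MAX only because 0 can
be a meaningful value for some of these limits.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical sentinel meaning "property not given on the command line". */
#define ZONED_PROP_UNSET UINT32_MAX

/* The zoned limits that the patch copies from the backend into config space. */
typedef struct {
    uint32_t zone_sectors;
    uint32_t max_active_zones;
    uint32_t max_open_zones;
    uint32_t max_append_sectors;
} ZonedLimits;

/* Fill in an unset frontend value, or check that a set one matches. */
static bool resolve_one(uint32_t *frontend, uint32_t backend)
{
    if (*frontend == ZONED_PROP_UNSET) {
        *frontend = backend;
        return true;
    }
    return *frontend == backend;
}

/*
 * Resolve every limit. Returns false if any explicitly set frontend value
 * disagrees with the backend; unset values are still populated in that case.
 */
static bool resolve_zoned_limits(ZonedLimits *frontend,
                                 const ZonedLimits *backend)
{
    bool ok = true;

    ok = resolve_one(&frontend->zone_sectors, backend->zone_sectors) && ok;
    ok = resolve_one(&frontend->max_active_zones,
                     backend->max_active_zones) && ok;
    ok = resolve_one(&frontend->max_open_zones,
                     backend->max_open_zones) && ok;
    ok = resolve_one(&frontend->max_append_sectors,
                     backend->max_append_sectors) && ok;
    return ok;
}
```

A management application that always sets every field would get a hard
failure on the migration target if the storage there differs, instead of a
silent guest-visible ABI change.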


With regards,
Daniel
-- 
|: https://berrange.com      -o-    https://www.flickr.com/photos/dberrange :|
|: https://libvirt.org         -o-            https://fstop138.berrange.com :|
|: https://entangle-photo.org    -o-    https://www.instagram.com/dberrange :|




[RFC v6 2/4] virtio-blk: add zoned storage emulation for zoned devices

2023-01-29 Thread Sam Li
This patch extends virtio-blk emulation to handle zoned device commands
by calling the new block layer APIs to perform zoned device I/O on
behalf of the guest. It supports Report Zone, four zone operations (open,
close, finish, reset), and Append Zone.

The VIRTIO_BLK_F_ZONED feature bit will only be set if the host does
support zoned block devices. It will not be set for regular block devices
(conventional zones).

The guest OS can use blktests and fio to test those commands on zoned devices.
Furthermore, using zonefs to test zone append write is also supported.

Signed-off-by: Sam Li
---
 hw/block/virtio-blk-common.c |   2 +
 hw/block/virtio-blk.c        | 394 +++
 2 files changed, 396 insertions(+)

diff --git a/hw/block/virtio-blk-common.c b/hw/block/virtio-blk-common.c
index ac52d7c176..e2f8e2f6da 100644
--- a/hw/block/virtio-blk-common.c
+++ b/hw/block/virtio-blk-common.c
@@ -29,6 +29,8 @@ static const VirtIOFeature feature_sizes[] = {
      .end = endof(struct virtio_blk_config, discard_sector_alignment)},
     {.flags = 1ULL << VIRTIO_BLK_F_WRITE_ZEROES,
      .end = endof(struct virtio_blk_config, write_zeroes_may_unmap)},
+    {.flags = 1ULL << VIRTIO_BLK_F_ZONED,
+     .end = endof(struct virtio_blk_config, zoned)},
     {}
 };
 
diff --git a/hw/block/virtio-blk.c b/hw/block/virtio-blk.c
index 1762517878..09220f400d 100644
--- a/hw/block/virtio-blk.c
+++ b/hw/block/virtio-blk.c
@@ -17,6 +17,7 @@
 #include "qemu/module.h"
 #include "qemu/error-report.h"
 #include "qemu/main-loop.h"
+#include "block/block_int.h"
 #include "trace.h"
 #include "hw/block/block.h"
 #include "hw/qdev-properties.h"
@@ -601,6 +602,341 @@ err:
     return err_status;
 }
 
+typedef struct ZoneCmdData {
+    VirtIOBlockReq *req;
+    struct iovec *in_iov;
+    unsigned in_num;
+    union {
+        struct {
+            unsigned int nr_zones;
+            BlockZoneDescriptor *zones;
+        } zone_report_data;
+        struct {
+            int64_t offset;
+        } zone_append_data;
+    };
+} ZoneCmdData;
+
+/*
+ * check zoned_request: error checking before issuing requests. If all checks
+ * passed, return true.
+ * append: true if only zone append requests issued.
+ */
+static bool check_zoned_request(VirtIOBlock *s, int64_t offset, int64_t len,
+                                bool append, uint8_t *status) {
+    BlockDriverState *bs = blk_bs(s->blk);
+    int index;
+
+    if (!virtio_has_feature(s->host_features, VIRTIO_BLK_F_ZONED)) {
+        *status = VIRTIO_BLK_S_UNSUPP;
+        return false;
+    }
+
+    if (offset < 0 || len < 0 || len > (bs->total_sectors << BDRV_SECTOR_BITS)
+        || offset > (bs->total_sectors << BDRV_SECTOR_BITS) - len) {
+        *status = VIRTIO_BLK_S_ZONE_INVALID_CMD;
+        return false;
+    }
+
+    if (append) {
+        if (bs->bl.write_granularity) {
+            if ((offset % bs->bl.write_granularity) != 0) {
+                *status = VIRTIO_BLK_S_ZONE_UNALIGNED_WP;
+                return false;
+            }
+        }
+
+        index = offset / bs->bl.zone_size;
+        if (BDRV_ZT_IS_CONV(bs->bl.wps->wp[index])) {
+            *status = VIRTIO_BLK_S_ZONE_INVALID_CMD;
+            return false;
+        }
+
+        if (len / 512 > bs->bl.max_append_sectors) {
+            if (bs->bl.max_append_sectors == 0) {
+                *status = VIRTIO_BLK_S_UNSUPP;
+            } else {
+                *status = VIRTIO_BLK_S_ZONE_INVALID_CMD;
+            }
+            return false;
+        }
+    }
+    return true;
+}
+
+static void virtio_blk_zone_report_complete(void *opaque, int ret)
+{
+    ZoneCmdData *data = opaque;
+    VirtIOBlockReq *req = data->req;
+    VirtIOBlock *s = req->dev;
+    VirtIODevice *vdev = VIRTIO_DEVICE(req->dev);
+    struct iovec *in_iov = data->in_iov;
+    unsigned in_num = data->in_num;
+    int64_t zrp_size, n, j = 0;
+    int64_t nz = data->zone_report_data.nr_zones;
+    int8_t err_status = VIRTIO_BLK_S_OK;
+
+    if (ret) {
+        err_status = VIRTIO_BLK_S_ZONE_INVALID_CMD;
+        goto out;
+    }
+
+    struct virtio_blk_zone_report zrp_hdr = (struct virtio_blk_zone_report) {
+        .nr_zones = cpu_to_le64(nz),
+    };
+    zrp_size = sizeof(struct virtio_blk_zone_report)
+               + sizeof(struct virtio_blk_zone_descriptor) * nz;
+    n = iov_from_buf(in_iov, in_num, 0, &zrp_hdr, sizeof(zrp_hdr));
+    if (n != sizeof(zrp_hdr)) {
+        virtio_error(vdev, "Driver provided input buffer that is too small!");
+        err_status = VIRTIO_BLK_S_ZONE_INVALID_CMD;
+        goto out;
+    }
+
+    for (size_t i = sizeof(zrp_hdr); i < zrp_size;
+        i += sizeof(struct virtio_blk_zone_descriptor), ++j) {
+        struct virtio_blk_zone_descriptor desc =
+            (struct virtio_blk_zone_descriptor) {
+                .z_start = cpu_to_le64(data->zone_report_data.zones[j].start
+                    >> BDRV_SECTOR_BITS),
+                .z_cap = cpu_to_le64(data->zone_report_data.zones[j].cap
+