Re: [PATCH 2/4] libvduse: Replace strcpy() with strncpy()

2022-06-27 Thread Yongji Xie
On Tue, Jun 28, 2022 at 8:26 AM Richard Henderson
 wrote:
>
> On 6/27/22 14:32, Xie Yongji wrote:
> > -strcpy(dev_config->name, name);
> > +strncpy(dev_config->name, name, VDUSE_NAME_MAX);
> > +dev_config->name[VDUSE_NAME_MAX - 1] = '\0';
>
> g_strlcpy
>

Now we don't have a dependency on glib, so we use strncpy here.

Thanks,
Yongji



Re: [PATCH 3/4] libvduse: Pass positive value to strerror()

2022-06-27 Thread Richard Henderson

On 6/27/22 14:32, Xie Yongji wrote:

The value passed to strerror() should be positive.
So let's fix it.

Fixes: Coverity CID 1490226, 1490223
Signed-off-by: Xie Yongji
---
  subprojects/libvduse/libvduse.c | 4 ++--
  1 file changed, 2 insertions(+), 2 deletions(-)


Reviewed-by: Richard Henderson 

r~



Re: [PATCH 2/4] libvduse: Replace strcpy() with strncpy()

2022-06-27 Thread Richard Henderson

On 6/27/22 14:32, Xie Yongji wrote:

-strcpy(dev_config->name, name);
+strncpy(dev_config->name, name, VDUSE_NAME_MAX);
+dev_config->name[VDUSE_NAME_MAX - 1] = '\0';


g_strlcpy

r~



Re: [RFC v2] Adding block layer APIs resembling Linux ZoneBlockDevice ioctls.

2022-06-27 Thread Damien Le Moal
On 6/25/22 00:49, Stefan Hajnoczi wrote:
> On Fri, Jun 24, 2022 at 11:14:32AM +0800, Sam Li wrote:
>> Hi Stefan,
>>
>> Stefan Hajnoczi  于2022年6月20日周一 15:55写道:
>>>
>>> On Mon, Jun 20, 2022 at 11:36:11AM +0800, Sam Li wrote:
>>>
>>> Hi Sam,
>>> Is this version 2 of "[RFC v1] Add support for zoned device"? Please
>>> keep the email subject line the same (except for "v2", "v3", etc) so
>>> that it's clear which patch series this new version replaces.
>>>
 Fix some mistakes before. It can report a range of zones now.
>>>
>>> This looks like the description of what changed compared to v1. Please
>>> put the changelog below "---" in the future. When patch emails are
>>> merged by git-am(1) it keeps the text above "---" and discards the text
>>> below "---". The changelog is usually no longer useful once the patches
>>> are merged, so it should be located below the "---" line.
>>>
>>> The text above the "---" is the commit description (an explanation of
>>> why this commit is necessary). In this case the commit description
>>> should explain that this patch adds .bdrv_co_zone_report() and
>>> .bdrv_co_zone_mgmt() to BlockDriver so that zoned block devices can be
>>> supported.
>>>

 Signed-off-by: Sam Li 
 ---
  block/block-backend.c |  22 
  block/coroutines.h|   5 +
  block/file-posix.c| 182 ++
  block/io.c|  23 
  include/block/block-common.h  |  43 ++-
  include/block/block-io.h  |  13 +++
  include/block/block_int-common.h  |  20 
  qemu-io-cmds.c| 118 +++
  tests/qemu-iotests/tests/zoned.sh |  52 +
  9 files changed, 477 insertions(+), 1 deletion(-)
  create mode 100644 tests/qemu-iotests/tests/zoned.sh

 diff --git a/block/block-backend.c b/block/block-backend.c
 index e0e1aff4b1..20248e4a35 100644
 --- a/block/block-backend.c
 +++ b/block/block-backend.c
 @@ -104,6 +104,8 @@ typedef struct BlockBackendAIOCB {
  int ret;
  } BlockBackendAIOCB;

 +
 +
>>>
>>> Please avoid whitespace changes in code that is otherwise untouched by
>>> your patch. Code changes can cause merge conflicts and they make it
>>> harder to use git-annotate(1), so only changes that are necessary should
>>> be included in a patch.
>>>
  static const AIOCBInfo block_backend_aiocb_info = {
  .get_aio_context = blk_aiocb_get_aio_context,
  .aiocb_size = sizeof(BlockBackendAIOCB),
 @@ -1810,6 +1812,25 @@ int blk_flush(BlockBackend *blk)
  return ret;
  }

>>>
>>> Please add a documentation comment for blk_co_zone_report() that
>>> explains how to use the functions and the purpose of the arguments. For
>>> example, does offset have to be the first byte in a zone or can it be
>>> any byte offset? What are the alignment requirements of offset and len?
>>> Why is nr_zones a pointer?
>>>
 +int blk_co_zone_report(BlockBackend *blk, int64_t offset, int64_t len,
>>>
>>> Functions that run in coroutine context must be labeled with
>>> coroutine_fn:
>>>
>>> int coroutine_fn blk_co_zone_report(...)
>>>
>>> This tells humans and tools that the function can only be called from a
>>> coroutine. There is a blog post about coroutines in QEMU here:
>>> https://blog.vmsplice.net/2014/01/coroutines-in-qemu-basics.html
>>>
 +   int64_t *nr_zones,
 +   struct BlockZoneDescriptor *zones)
>>>
>>> QEMU coding style uses typedefs when defining structs, so "struct
>>> BlockZoneDescriptor *zones" should be written as "BlockZoneDescriptor
>>> *zones".
>>>
 +{
 +int ret;
>>>
>>> This function is called from the I/O code path, please mark it with:
>>>
>>>   IO_CODE();
>>>
>>> From include/block/block-io.h:
>>>
>>>   * I/O API functions. These functions are thread-safe, and therefore
>>>   * can run in any thread as long as the thread has called
>>>   * aio_context_acquire/release().
>>>   *
>>>   * These functions can only call functions from I/O and Common categories,
>>>   * but can be invoked by GS, "I/O or GS" and I/O APIs.
>>>   *
>>>   * All functions in this category must use the macro
>>>   * IO_CODE();
>>>   * to catch when they are accidentally called by the wrong API.
>>>
 +ret = bdrv_co_zone_report(blk->root->bs, offset, len, nr_zones, 
 zones);
>>>
>>> Please add blk_inc_in_flight(blk) and blk_dec_in_flight(blk) around this
>>> function call to ensure that zone report requests finish before I/O is
>>> drained (see bdrv_drained_begin()). This is necessary so that it's
>>> possible to wait for I/O requests, including zone report, to complete.
>>>
>>> Similar to blk_co_do_preadv() we need blk_wait_while_drained(blk),
>>> blk_check_byte_request(), and bdrv_inc_in_flight(bs) before calling
>>> bdrv_co_zone_report(). bdrv_dec_in_flight(bs) needs to be called after
>>> 

[PATCH 1/2] hw: m25p80: Add Block Protect and Top Bottom bits for write protect

2022-06-27 Thread Iris Chen
Signed-off-by: Iris Chen 
---
 hw/block/m25p80.c | 74 +++
 1 file changed, 62 insertions(+), 12 deletions(-)

diff --git a/hw/block/m25p80.c b/hw/block/m25p80.c
index 50b523e5b1..0156a70f5e 100644
--- a/hw/block/m25p80.c
+++ b/hw/block/m25p80.c
@@ -38,21 +38,19 @@
 #include "trace.h"
 #include "qom/object.h"
 
-/* Fields for FlashPartInfo->flags */
-
-/* erase capabilities */
-#define ER_4K 1
-#define ER_32K 2
-/* set to allow the page program command to write 0s back to 1. Useful for
- * modelling EEPROM with SPI flash command set
- */
-#define EEPROM 0x100
-
 /* 16 MiB max in 3 byte address mode */
 #define MAX_3BYTES_SIZE 0x100
-
 #define SPI_NOR_MAX_ID_LEN 6
 
+/* Fields for FlashPartInfo->flags */
+enum spi_nor_option_flags {
+ER_4K  = BIT(0),
+ER_32K = BIT(1),
+EEPROM = BIT(2),
+SNOR_F_HAS_SR_TB   = BIT(3),
+SNOR_F_HAS_SR_BP3_BIT6 = BIT(4),
+};
+
 typedef struct FlashPartInfo {
 const char *part_name;
 /*
@@ -253,7 +251,8 @@ static const FlashPartInfo known_devices[] = {
 { INFO("n25q512a11",  0x20bb20,  0,  64 << 10, 1024, ER_4K) },
 { INFO("n25q512a13",  0x20ba20,  0,  64 << 10, 1024, ER_4K) },
 { INFO("n25q128", 0x20ba18,  0,  64 << 10, 256, 0) },
-{ INFO("n25q256a",0x20ba19,  0,  64 << 10, 512, ER_4K) },
+{ INFO("n25q256a",0x20ba19,  0,  64 << 10, 512,
+   ER_4K | SNOR_F_HAS_SR_BP3_BIT6 | SNOR_F_HAS_SR_TB) },
 { INFO("n25q512a",0x20ba20,  0,  64 << 10, 1024, ER_4K) },
 { INFO("n25q512ax3",  0x20ba20,  0x1000,  64 << 10, 1024, ER_4K) },
 { INFO("mt25ql512ab", 0x20ba20, 0x1044, 64 << 10, 1024, ER_4K | ER_32K) },
@@ -480,6 +479,11 @@ struct Flash {
 bool reset_enable;
 bool quad_enable;
 bool aai_enable;
+bool block_protect0;
+bool block_protect1;
+bool block_protect2;
+bool block_protect3;
+bool top_bottom_bit;
 bool status_register_write_disabled;
 uint8_t ear;
 
@@ -630,6 +634,29 @@ void flash_write8(Flash *s, uint32_t addr, uint8_t data)
 qemu_log_mask(LOG_GUEST_ERROR, "M25P80: write with write protect!\n");
 return;
 }
+uint32_t block_protect_value = (s->block_protect3 << 3) |
+   (s->block_protect2 << 2) |
+   (s->block_protect1 << 1) |
+   (s->block_protect0 << 0);
+
+ uint32_t num_protected_sectors = 1 << (block_protect_value - 1);
+ uint32_t sector = addr / s->pi->sector_size;
+
+ /* top_bottom_bit == 0 means TOP */
+if (!s->top_bottom_bit) {
+if (block_protect_value > 0 &&
+s->pi->n_sectors <= sector + num_protected_sectors) {
+qemu_log_mask(LOG_GUEST_ERROR,
+  "M25P80: write with write protect!\n");
+return;
+}
+} else {
+if (block_protect_value > 0 && sector < num_protected_sectors) {
+qemu_log_mask(LOG_GUEST_ERROR,
+  "M25P80: write with write protect!\n");
+return;
+}
+}
 
 if ((prev ^ data) & data) {
 trace_m25p80_programming_zero_to_one(s, addr, prev, data);
@@ -728,6 +755,15 @@ static void complete_collecting_data(Flash *s)
 break;
 case WRSR:
 s->status_register_write_disabled = extract32(s->data[0], 7, 1);
+s->block_protect0 = extract32(s->data[0], 2, 1);
+s->block_protect1 = extract32(s->data[0], 3, 1);
+s->block_protect2 = extract32(s->data[0], 4, 1);
+if (s->pi->flags & SNOR_F_HAS_SR_TB) {
+s->top_bottom_bit = extract32(s->data[0], 5, 1);
+}
+if (s->pi->flags & SNOR_F_HAS_SR_BP3_BIT6) {
+s->block_protect3 = extract32(s->data[0], 6, 1);
+}
 
 switch (get_man(s)) {
 case MAN_SPANSION:
@@ -1213,6 +1249,15 @@ static void decode_new_cmd(Flash *s, uint32_t value)
 case RDSR:
 s->data[0] = (!!s->write_enable) << 1;
 s->data[0] |= (!!s->status_register_write_disabled) << 7;
+s->data[0] |= (!!s->block_protect0) << 2;
+s->data[0] |= (!!s->block_protect1) << 3;
+s->data[0] |= (!!s->block_protect2) << 4;
+if (s->pi->flags & SNOR_F_HAS_SR_TB) {
+s->data[0] |= (!!s->top_bottom_bit) << 5;
+}
+if (s->pi->flags & SNOR_F_HAS_SR_BP3_BIT6) {
+s->data[0] |= (!!s->block_protect3) << 6;
+}
 
 if (get_man(s) == MAN_MACRONIX || get_man(s) == MAN_ISSI) {
 s->data[0] |= (!!s->quad_enable) << 6;
@@ -1553,6 +1598,11 @@ static void m25p80_reset(DeviceState *d)
 
 s->wp_level = true;
 s->status_register_write_disabled = false;
+s->block_protect0 = false;
+s->block_protect1 = false;
+s->block_protect2 = false;
+s->block_protect3 = false;
+s->top_bottom_bit = false;
 
 reset_memory(s);
 }
-- 
2.30.2




[PATCH 2/2] hw: m25p80: add tests for BP and TB bit write protect

2022-06-27 Thread Iris Chen
Signed-off-by: Iris Chen 
---
 tests/qtest/aspeed_smc-test.c | 111 ++
 1 file changed, 111 insertions(+)

diff --git a/tests/qtest/aspeed_smc-test.c b/tests/qtest/aspeed_smc-test.c
index 1258687eac..05ce941566 100644
--- a/tests/qtest/aspeed_smc-test.c
+++ b/tests/qtest/aspeed_smc-test.c
@@ -192,6 +192,24 @@ static void read_page_mem(uint32_t addr, uint32_t *page)
 }
 }
 
+static void write_page_mem(uint32_t addr, uint32_t write_value)
+{
+spi_ctrl_setmode(CTRL_WRITEMODE, PP);
+
+for (int i = 0; i < FLASH_PAGE_SIZE / 4; i++) {
+writel(ASPEED_FLASH_BASE + addr + i * 4, write_value);
+}
+}
+
+static void assert_page_mem(uint32_t addr, uint32_t expected_value)
+{
+uint32_t page[FLASH_PAGE_SIZE / 4];
+read_page_mem(addr, page);
+for (int i = 0; i < FLASH_PAGE_SIZE / 4; i++) {
+g_assert_cmphex(page[i], ==, expected_value);
+}
+}
+
 static void test_erase_sector(void)
 {
 uint32_t some_page_addr = 0x600 * FLASH_PAGE_SIZE;
@@ -501,6 +519,95 @@ static void test_status_reg_write_protection(void)
 flash_reset();
 }
 
+static void test_write_block_protect(void)
+{
+uint32_t sector_size = 65536;
+uint32_t n_sectors = 512;
+
+spi_ce_ctrl(1 << CRTL_EXTENDED0);
+spi_conf(CONF_ENABLE_W0);
+
+uint32_t bp_bits = 0b0;
+
+for (int i = 0; i < 16; i++) {
+bp_bits = ((i & 0b1000) << 3) | ((i & 0b0111) << 2);
+
+spi_ctrl_start_user();
+writeb(ASPEED_FLASH_BASE, WREN);
+writeb(ASPEED_FLASH_BASE, BULK_ERASE);
+writeb(ASPEED_FLASH_BASE, WREN);
+writeb(ASPEED_FLASH_BASE, WRSR);
+writeb(ASPEED_FLASH_BASE, bp_bits);
+writeb(ASPEED_FLASH_BASE, EN_4BYTE_ADDR);
+writeb(ASPEED_FLASH_BASE, WREN);
+spi_ctrl_stop_user();
+
+uint32_t num_protected_sectors = i ? MIN(1 << (i - 1), n_sectors) : 0;
+uint32_t protection_start = n_sectors - num_protected_sectors;
+uint32_t protection_end = n_sectors;
+
+for (int sector = 0; sector < n_sectors; sector++) {
+uint32_t addr = sector * sector_size;
+
+assert_page_mem(addr, 0x);
+write_page_mem(addr, make_be32(0xabcdef12));
+
+uint32_t expected_value = protection_start <= sector
+  && sector < protection_end
+  ? 0x : 0xabcdef12;
+
+assert_page_mem(addr, expected_value);
+}
+}
+
+flash_reset();
+}
+
+static void test_write_block_protect_bottom_bit(void)
+{
+uint32_t sector_size = 65536;
+uint32_t n_sectors = 512;
+
+spi_ce_ctrl(1 << CRTL_EXTENDED0);
+spi_conf(CONF_ENABLE_W0);
+
+/* top bottom bit is enabled */
+uint32_t bp_bits = 0b00100 << 3;
+
+for (int i = 0; i < 16; i++) {
+bp_bits = (((i & 0b1000) | 0b0100) << 3) | ((i & 0b0111) << 2);
+
+spi_ctrl_start_user();
+writeb(ASPEED_FLASH_BASE, WREN);
+writeb(ASPEED_FLASH_BASE, BULK_ERASE);
+writeb(ASPEED_FLASH_BASE, WREN);
+writeb(ASPEED_FLASH_BASE, WRSR);
+writeb(ASPEED_FLASH_BASE, bp_bits);
+writeb(ASPEED_FLASH_BASE, EN_4BYTE_ADDR);
+writeb(ASPEED_FLASH_BASE, WREN);
+spi_ctrl_stop_user();
+
+uint32_t num_protected_sectors = i ? MIN(1 << (i - 1), n_sectors) : 0;
+uint32_t protection_start = 0;
+uint32_t protection_end = num_protected_sectors;
+
+for (int sector = 0; sector < n_sectors; sector++) {
+uint32_t addr = sector * sector_size;
+
+assert_page_mem(addr, 0x);
+write_page_mem(addr, make_be32(0xabcdef12));
+
+uint32_t expected_value = protection_start <= sector
+  && sector < protection_end
+  ? 0x : 0xabcdef12;
+
+assert_page_mem(addr, expected_value);
+}
+}
+
+flash_reset();
+}
+
 static char tmp_path[] = "/tmp/qtest.m25p80.XX";
 
 int main(int argc, char **argv)
@@ -529,6 +636,10 @@ int main(int argc, char **argv)
 qtest_add_func("/ast2400/smc/read_status_reg", test_read_status_reg);
 qtest_add_func("/ast2400/smc/status_reg_write_protection",
test_status_reg_write_protection);
+qtest_add_func("/ast2400/smc/write_block_protect",
+   test_write_block_protect);
+qtest_add_func("/ast2400/smc/write_block_protect_bottom_bit",
+   test_write_block_protect_bottom_bit);
 
 flash_reset();
 ret = g_test_run();
-- 
2.30.2




[PATCH 0/2] Add Block Protect (BP) and Top Bottom (TB) bits for write protect

2022-06-27 Thread Iris Chen
Hey everyone,

Adding the 3 block protection bits and top bottom bits which configure which 
parts
of the flash are writable and read-only.

These patches are based on previous patches I just submitted:
hw: m25p80: fixing individual test failure when tests are running in isolation
hw: m25p80: add WP# pin and SRWD bit for write protection
hw: m25p80: add tests for write protect (WP# and SRWD bit)

The tests that were added iterates through every different memory protection
case so it takes a little longer to run. Let me know what you think about that
and if that's okay or not.

Thanks,
Iris

Iris Chen (2):
  hw: m25p80: Add Block Protect and Top Bottom bits for write protect
  hw: m25p80: add tests for BP and TB bit write protect

 hw/block/m25p80.c |  74 +++
 tests/qtest/aspeed_smc-test.c | 111 ++
 2 files changed, 173 insertions(+), 12 deletions(-)

--
2.30.2




Re: [PATCH v2 1/4] hw/nvme: claim NVMe 2.0 compliance

2022-06-27 Thread Christoph Hellwig
On Mon, Jun 27, 2022 at 01:47:28PM +0200, Niklas Cassel via wrote:
> CRMS.CRWMS bit shall be set to 1 on controllers compliant with versions
> later than NVMe 1.4.
> 
> The first version later than NVMe 1.4 is NVMe 2.0
> 
> Let's claim compliance with NVMe 2.0 such that a follow up patch can
> set the CRMS.CRWMS bit.
> 
> This is needed since CC.CRIME is only writable when both CRMS.CRIMS
> and CRMS.CRWMS is set.

You can also always support newer features without claiming
compliance for the new version.  I'd suggest to go through the
mandatory changes list first before upgrading the compliance.
(And one day it would be neat if someone tried to run the official
but commercial compliance tests on qemu a well..)



Re: [PATCH] hw/nvme: fix example serial in documentation

2022-06-27 Thread Klaus Jensen
On Jun 27 14:39, Niklas Cassel wrote:
> The serial prop on the controller is actually describing the nvme
> subsystem serial, which has to be identical for all controllers within
> the same nvme subsystem.
> 
> This is enforced since commit a859eb9f8f64 ("hw/nvme: enforce common
> serial per subsystem").
> 
> Fix the documentation, so that people copying the qemu command line
> example won't get an error on qemu start.
> 
> Signed-off-by: Niklas Cassel 
> ---
>  docs/system/devices/nvme.rst | 4 ++--
>  1 file changed, 2 insertions(+), 2 deletions(-)
> 
> diff --git a/docs/system/devices/nvme.rst b/docs/system/devices/nvme.rst
> index aba253304e..30f841ef62 100644
> --- a/docs/system/devices/nvme.rst
> +++ b/docs/system/devices/nvme.rst
> @@ -104,8 +104,8 @@ multipath I/O.
>  .. code-block:: console
>  
> -device nvme-subsys,id=nvme-subsys-0,nqn=subsys0
> -   -device nvme,serial=a,subsys=nvme-subsys-0
> -   -device nvme,serial=b,subsys=nvme-subsys-0
> +   -device nvme,serial=deadbeef,subsys=nvme-subsys-0
> +   -device nvme,serial=deadbeef,subsys=nvme-subsys-0
>  
>  This will create an NVM subsystem with two controllers. Having controllers
>  linked to an ``nvme-subsys`` device allows additional ``nvme-ns`` parameters:
> -- 
> 2.36.1
> 

Woops!

Thanks Niklas, applied to nvme-next!


signature.asc
Description: PGP signature


[PATCH 0/3] libvhost-user: NEED_REPLY behaviour fixes

2022-06-27 Thread Kevin Wolf
Command handlers shouldn't send replies themselves, but just prepare the
message and let vu_dispatch() send it. Otherwise, vu_dispatch() will add
a second reply as an ACK when NEED_REPLY was requested.

QEMU's vhost-user devices are not affected by these bugs because they
don't set NEED_REPLY for the commands that get it wrong. libblkio does
run into it, though.

Kevin Wolf (3):
  docs/vhost-user: Fix mismerge
  libvhost-user: Fix VHOST_USER_GET_MAX_MEM_SLOTS reply
  libvhost-user: Fix VHOST_USER_ADD_MEM_REG reply

 docs/interop/vhost-user.rst   | 16 
 subprojects/libvhost-user/libvhost-user.c | 19 +++
 2 files changed, 3 insertions(+), 32 deletions(-)

-- 
2.35.3




Re: [PATCH 1/3] docs/vhost-user: Fix mismerge

2022-06-27 Thread Peter Maydell
On Mon, 27 Jun 2022 at 14:49, Kevin Wolf  wrote:
>
> This reverts commit 76b1b64370007234279ea4cc8b09c98cbd2523de.
>
> The commit only duplicated some text that had already been merged in
> commit 31009d13cc5.
>
> Signed-off-by: Kevin Wolf 
> ---
>  docs/interop/vhost-user.rst | 16 
>  1 file changed, 16 deletions(-)

Reviewed-by: Peter Maydell 

thanks
-- PMM



[PATCH 2/3] libvhost-user: Fix VHOST_USER_GET_MAX_MEM_SLOTS reply

2022-06-27 Thread Kevin Wolf
With REPLY_NEEDED, libvhost-user sends both the acutal result and an
additional ACK reply for VHOST_USER_GET_MAX_MEM_SLOTS. This is
incorrect, the spec mandates that it behave the same with and without
REPLY_NEEDED because it always sends a reply.

Fixes: 6fb2e173d20c9bbb5466183d33a3ad7dcd0375fa
Signed-off-by: Kevin Wolf 
---
 subprojects/libvhost-user/libvhost-user.c | 11 ++-
 1 file changed, 2 insertions(+), 9 deletions(-)

diff --git a/subprojects/libvhost-user/libvhost-user.c 
b/subprojects/libvhost-user/libvhost-user.c
index b4cc3c2d68..cfa1bcc334 100644
--- a/subprojects/libvhost-user/libvhost-user.c
+++ b/subprojects/libvhost-user/libvhost-user.c
@@ -1827,18 +1827,11 @@ vu_handle_vring_kick(VuDev *dev, VhostUserMsg *vmsg)
 
 static bool vu_handle_get_max_memslots(VuDev *dev, VhostUserMsg *vmsg)
 {
-vmsg->flags = VHOST_USER_REPLY_MASK | VHOST_USER_VERSION;
-vmsg->size  = sizeof(vmsg->payload.u64);
-vmsg->payload.u64 = VHOST_USER_MAX_RAM_SLOTS;
-vmsg->fd_num = 0;
-
-if (!vu_message_write(dev, dev->sock, vmsg)) {
-vu_panic(dev, "Failed to send max ram slots: %s\n", strerror(errno));
-}
+vmsg_set_reply_u64(vmsg, VHOST_USER_MAX_RAM_SLOTS);
 
 DPRINT("u64: 0x%016"PRIx64"\n", (uint64_t) VHOST_USER_MAX_RAM_SLOTS);
 
-return false;
+return true;
 }
 
 static bool
-- 
2.35.3




[PATCH 1/3] docs/vhost-user: Fix mismerge

2022-06-27 Thread Kevin Wolf
This reverts commit 76b1b64370007234279ea4cc8b09c98cbd2523de.

The commit only duplicated some text that had already been merged in
commit 31009d13cc5.

Signed-off-by: Kevin Wolf 
---
 docs/interop/vhost-user.rst | 16 
 1 file changed, 16 deletions(-)

diff --git a/docs/interop/vhost-user.rst b/docs/interop/vhost-user.rst
index d7cf904f7f..3f18ab424e 100644
--- a/docs/interop/vhost-user.rst
+++ b/docs/interop/vhost-user.rst
@@ -1376,14 +1376,6 @@ Front-end message types
   For further details on postcopy, see ``VHOST_USER_SET_MEM_TABLE``.
   They apply to ``VHOST_USER_ADD_MEM_REG`` accordingly.
 
-  Exactly one file descriptor from which the memory is mapped is
-  passed in the ancillary data.
-
-  In postcopy mode (see ``VHOST_USER_POSTCOPY_LISTEN``), the back-end
-  replies with the bases of the memory mapped region to the front-end.
-  For further details on postcopy, see ``VHOST_USER_SET_MEM_TABLE``.
-  They apply to ``VHOST_USER_ADD_MEM_REG`` accordingly.
-
 ``VHOST_USER_REM_MEM_REG``
   :id: 38
   :equivalent ioctl: N/A
@@ -1408,14 +1400,6 @@ Front-end message types
   accept messages with one file descriptor. If a file descriptor is
   passed, the back-end MUST close it without using it otherwise.
 
-  The memory region to be removed is identified by its guest address,
-  user address and size. The mmap offset is ignored.
-
-  No file descriptors SHOULD be passed in the ancillary data. For
-  compatibility with existing incorrect implementations, the back-end MAY
-  accept messages with one file descriptor. If a file descriptor is
-  passed, the back-end MUST close it without using it otherwise.
-
 ``VHOST_USER_SET_STATUS``
   :id: 39
   :equivalent ioctl: VHOST_VDPA_SET_STATUS
-- 
2.35.3




[PATCH 3/3] libvhost-user: Fix VHOST_USER_ADD_MEM_REG reply

2022-06-27 Thread Kevin Wolf
With REPLY_NEEDED, libvhost-user sends both the acutal result and an
additional ACK reply for VHOST_USER_ADD_MEM_REG. This is incorrect, the
spec mandates that it behave the same with and without REPLY_NEEDED
because it always sends a reply.

Fixes: ec94c8e621de96c50c2d381c8c9ec94f5beec7c1
Signed-off-by: Kevin Wolf 
---
 subprojects/libvhost-user/libvhost-user.c | 8 +---
 1 file changed, 1 insertion(+), 7 deletions(-)

diff --git a/subprojects/libvhost-user/libvhost-user.c 
b/subprojects/libvhost-user/libvhost-user.c
index cfa1bcc334..ffed4729a3 100644
--- a/subprojects/libvhost-user/libvhost-user.c
+++ b/subprojects/libvhost-user/libvhost-user.c
@@ -779,15 +779,9 @@ vu_add_mem_reg(VuDev *dev, VhostUserMsg *vmsg) {
 
 /* Send the message back to qemu with the addresses filled in. */
 vmsg->fd_num = 0;
-if (!vu_send_reply(dev, dev->sock, vmsg)) {
-vu_panic(dev, "failed to respond to add-mem-region for postcopy");
-return false;
-}
-
 DPRINT("Successfully added new region in postcopy\n");
 dev->nregions++;
-return false;
-
+return true;
 } else {
 for (i = 0; i < dev->max_queues; i++) {
 if (dev->vq[i].vring.desc) {
-- 
2.35.3




[PATCH] hw/nvme: fix example serial in documentation

2022-06-27 Thread Niklas Cassel via
The serial prop on the controller is actually describing the nvme
subsystem serial, which has to be identical for all controllers within
the same nvme subsystem.

This is enforced since commit a859eb9f8f64 ("hw/nvme: enforce common
serial per subsystem").

Fix the documentation, so that people copying the qemu command line
example won't get an error on qemu start.

Signed-off-by: Niklas Cassel 
---
 docs/system/devices/nvme.rst | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/docs/system/devices/nvme.rst b/docs/system/devices/nvme.rst
index aba253304e..30f841ef62 100644
--- a/docs/system/devices/nvme.rst
+++ b/docs/system/devices/nvme.rst
@@ -104,8 +104,8 @@ multipath I/O.
 .. code-block:: console
 
-device nvme-subsys,id=nvme-subsys-0,nqn=subsys0
-   -device nvme,serial=a,subsys=nvme-subsys-0
-   -device nvme,serial=b,subsys=nvme-subsys-0
+   -device nvme,serial=deadbeef,subsys=nvme-subsys-0
+   -device nvme,serial=deadbeef,subsys=nvme-subsys-0
 
 This will create an NVM subsystem with two controllers. Having controllers
 linked to an ``nvme-subsys`` device allows additional ``nvme-ns`` parameters:
-- 
2.36.1




Re: [PATCH v2 18/21] migration: remove the QEMUFileOps 'get_buffer' callback

2022-06-27 Thread Dr. David Alan Gilbert
* Daniel P. Berrangé (berra...@redhat.com) wrote:
> This directly implements the get_buffer logic using QIOChannel APIs.
> 
> Reviewed-by: Dr. David Alan Gilbert 
> Signed-off-by: Daniel P. Berrangé 

Coverity is pointing out a fun deadcode path from this:

> diff --git a/migration/qemu-file.c b/migration/qemu-file.c
> index 5eb8cf0e28..df438724cd 100644
> --- a/migration/qemu-file.c
> +++ b/migration/qemu-file.c
> @@ -377,8 +377,22 @@ static ssize_t qemu_fill_buffer(QEMUFile *f)
>  return 0;
>  }
>  
> -len = f->ops->get_buffer(f->ioc, f->buf + pending, f->total_transferred,
> - IO_BUF_SIZE - pending, _error);
> +do {
> +len = qio_channel_read(f->ioc,
> +   (char *)f->buf + pending,
> +   IO_BUF_SIZE - pending,
> +   _error);
> +if (len == QIO_CHANNEL_ERR_BLOCK) {
> +if (qemu_in_coroutine()) {
> +qio_channel_yield(f->ioc, G_IO_IN);
> +} else {
> +qio_channel_wait(f->ioc, G_IO_IN);
> +}
> +} else if (len < 0) {
> +len = EIO;
> +}
> +} while (len == QIO_CHANNEL_ERR_BLOCK);
> +

the next code is:
if (len > 0) {
f->buf_size += len;
f->total_transferred += len;  
} else if (len == 0) {
qemu_file_set_error_obj(f, -EIO, local_error);
} else if (len != -EAGAIN) {  
qemu_file_set_error_obj(f, len, local_error);
} else {
error_free(local_error);  
}

because of the while loop, we should never actually see
len = QIO_CHANNEL_ERR_BLOCK out of the bottom; so the only
error value we should have is -EIO;  so that error_free is 
not hittable.

Dave

>  if (len > 0) {
>  f->buf_size += len;
>  f->total_transferred += len;
> diff --git a/migration/qemu-file.h b/migration/qemu-file.h
> index 4a3beedb5b..f7ed568894 100644
> --- a/migration/qemu-file.h
> +++ b/migration/qemu-file.h
> @@ -29,14 +29,6 @@
>  #include "exec/cpu-common.h"
>  #include "io/channel.h"
>  
> -/* Read a chunk of data from a file at the given position.  The pos argument
> - * can be ignored if the file is only be used for streaming.  The number of
> - * bytes actually read should be returned.
> - */
> -typedef ssize_t (QEMUFileGetBufferFunc)(void *opaque, uint8_t *buf,
> -int64_t pos, size_t size,
> -Error **errp);
> -
>  /*
>   * This function writes an iovec to file. The handler must write all
>   * of the data or return a negative errno value.
> @@ -77,7 +69,6 @@ typedef size_t (QEMURamSaveFunc)(QEMUFile *f,
>  typedef QEMUFile *(QEMURetPathFunc)(void *opaque);
>  
>  typedef struct QEMUFileOps {
> -QEMUFileGetBufferFunc *get_buffer;
>  QEMUFileWritevBufferFunc *writev_buffer;
>  QEMURetPathFunc *get_return_path;
>  } QEMUFileOps;
> -- 
> 2.36.1
> 
-- 
Dr. David Alan Gilbert / dgilb...@redhat.com / Manchester, UK




[PATCH v2 3/4] hw/nvme: add support for ratified TP4084

2022-06-27 Thread Niklas Cassel via
TP4084 adds a new mode, CC.CRIME, that can be used to mark a namespace
as ready independently from the controller.

When CC.CRIME is 0 (default), things behave as before, all namespaces
are ready when CSTS.RDY gets set to 1.

When CC.CRIME is 1, the controller will become ready when CSTS.RDY gets
set to 1, but commands accessing a namespace are allowed to return
"Namespace Not Ready" or "Admin Command Media Not Ready".
After CRTO.CRWMT amount of time, if the namespace has not yet been
marked ready, the status codes also need to have the DNR bit set.

Add a new "ready_delay" namespace device parameter, in order to emulate
different ready latencies for namespaces.

Once a namespace is ready, it will set the NRDY bit in the I/O Command
Set Independent Identify Namespace Data Structure, and then send out a
Namespace Attribute Changed event.

This new "ready_delay" is supported on controllers not part of a NVMe
subsystem. The reasons are many. One problem is that multiple controllers
can have different CC.CRIME modes running. Another problem is the extra
locking needed. The third problem is when to actually clear NRDY. If we
assume that a namespace clears NRDY when it no longer has any controller
online for that namespace. The problem then is that Linux will reset the
controllers one by one during probe time. The reset goes so fast so that
there is no time when all controllers are in reset at the same time, so
NRDY will never get cleared. (The controllers are enabled by SeaBIOS by
default.) We could introduce a reset_time param, but this would only
increase the chances that all controllers are in reset at the same time.

Signed-off-by: Niklas Cassel 
---
 hw/nvme/ctrl.c   | 123 +--
 hw/nvme/ns.c |  18 +++
 hw/nvme/nvme.h   |   6 +++
 hw/nvme/trace-events |   1 +
 include/block/nvme.h |  60 -
 5 files changed, 204 insertions(+), 4 deletions(-)

diff --git a/hw/nvme/ctrl.c b/hw/nvme/ctrl.c
index 8ca824ea14..5404f67480 100644
--- a/hw/nvme/ctrl.c
+++ b/hw/nvme/ctrl.c
@@ -88,6 +88,12 @@
  *   completion when there are no outstanding AERs. When the maximum number of
  *   enqueued events are reached, subsequent events will be dropped.
  *
+ * - `crwmt`
+ *   This is the Controller Ready With Media Timeout (CRWMT) field that is
+ *   defined in the CRTO register. This specifies the worst-case time that host
+ *   software should wait for the controller and all attached namespaces to
+ *   become ready. The value is in units of 500 milliseconds.
+ *
  * - `mdts`
  *   Indicates the maximum data transfer size for a command that transfers data
  *   between host-accessible memory and the controller. The value is specified
@@ -157,6 +163,11 @@
  *   namespace will be available in the subsystem but not attached to any
  *   controllers.
  *
+ * - `ready_delay`
+ *   This parameter specifies the amount of time that the namespace should wait
+ *   before being marked ready. Only applicable if CC.CRIME is set by the user.
+ *   The value is in units of 500 milliseconds (to be consistent with `crwmt`).
+ *
  * Setting `zoned` to true selects Zoned Command Set at the namespace.
  * In this case, the following namespace properties are available to configure
  * zoned operation:
@@ -216,6 +227,8 @@
 #define NVME_VF_RES_GRANULARITY 1
 #define NVME_VF_OFFSET 0x1
 #define NVME_VF_STRIDE 1
+#define NVME_DEFAULT_CRIMT 0xa
+#define NVME_DEFAULT_CRWMT 0xf
 
 #define NVME_GUEST_ERR(trace, fmt, ...) \
 do { \
@@ -4188,6 +4201,10 @@ static uint16_t nvme_io_cmd(NvmeCtrl *n, NvmeRequest 
*req)
 return NVME_INVALID_OPCODE | NVME_DNR;
 }
 
+if (!(ns->id_indep_ns.nstat & NVME_NSTAT_NRDY)) {
+return NVME_NS_NOT_READY;
+}
+
 if (ns->status) {
 return ns->status;
 }
@@ -4791,6 +4808,27 @@ static uint16_t nvme_identify_ns(NvmeCtrl *n, 
NvmeRequest *req, bool active)
 return NVME_INVALID_CMD_SET | NVME_DNR;
 }
 
+static uint16_t nvme_identify_cs_indep_ns(NvmeCtrl *n, NvmeRequest *req)
+{
+NvmeNamespace *ns;
+NvmeIdentify *c = (NvmeIdentify *)>cmd;
+uint32_t nsid = le32_to_cpu(c->nsid);
+
+trace_pci_nvme_identify_cs_indep_ns(nsid);
+
+if (!nvme_nsid_valid(n, nsid) || nsid == NVME_NSID_BROADCAST) {
+return NVME_INVALID_NSID | NVME_DNR;
+}
+
+ns = nvme_ns(n, nsid);
+if (unlikely(!ns)) {
+return nvme_rpt_empty_id_struct(n, req);
+}
+
+return nvme_c2h(n, (uint8_t *)>id_indep_ns, sizeof(NvmeIdNsCsIndep),
+req);
+}
+
 static uint16_t nvme_identify_ctrl_list(NvmeCtrl *n, NvmeRequest *req,
 bool attached)
 {
@@ -5081,6 +5119,8 @@ static uint16_t nvme_identify(NvmeCtrl *n, NvmeRequest 
*req)
 return nvme_identify_ns(n, req, true);
 case NVME_ID_CNS_NS_PRESENT:
 return nvme_identify_ns(n, req, false);
+case NVME_ID_CNS_CS_INDEPENDENT_NS:
+return nvme_identify_cs_indep_ns(n, req);

[PATCH v2 2/4] hw/nvme: store a pointer to the NvmeSubsystem in the NvmeNamespace

2022-06-27 Thread Niklas Cassel via
Each NvmeNamespace can be used by serveral controllers,
but a NvmeNamespace can at most belong to a single NvmeSubsystem.
Store a pointer to the NvmeSubsystem, if the namespace was realized
with a NvmeSubsystem.

This will be used by a follow up patch.

Signed-off-by: Niklas Cassel 
---
 hw/nvme/ns.c   | 1 +
 hw/nvme/nvme.h | 1 +
 2 files changed, 2 insertions(+)

diff --git a/hw/nvme/ns.c b/hw/nvme/ns.c
index 870c3ca1a2..8bee3e8b3b 100644
--- a/hw/nvme/ns.c
+++ b/hw/nvme/ns.c
@@ -559,6 +559,7 @@ static void nvme_ns_realize(DeviceState *dev, Error **errp)
 if (!qdev_set_parent_bus(dev, >bus.parent_bus, errp)) {
 return;
 }
+ns->subsys = subsys;
 }
 
 if (nvme_ns_setup(ns, errp)) {
diff --git a/hw/nvme/nvme.h b/hw/nvme/nvme.h
index 0711b9748c..5487e2db40 100644
--- a/hw/nvme/nvme.h
+++ b/hw/nvme/nvme.h
@@ -167,6 +167,7 @@ typedef struct NvmeNamespace {
 int32_t nr_active_zones;
 
 NvmeNamespaceParams params;
+NvmeSubsystem *subsys;
 
 struct {
 uint32_t err_rec;
-- 
2.36.1




[PATCH v2 4/4] hw/nvme: add new never_ready parameter to test the DNR bit

2022-06-27 Thread Niklas Cassel via
Since we verify that "ready_delay" parameter has to be smaller than CRWMT,
we know that the namespace will always become ready.
Therefore the "Namespace Not Ready" status code will never have the DNR
bit set.

Add a new parameter "never_ready" that can be used to emulate a namespace
that never gets ready, such that the DNR bit gets set after CRWMT amount
of time.

Signed-off-by: Niklas Cassel 
---
 hw/nvme/ctrl.c | 28 +++-
 hw/nvme/ns.c   |  1 +
 hw/nvme/nvme.h |  2 ++
 3 files changed, 30 insertions(+), 1 deletion(-)

diff --git a/hw/nvme/ctrl.c b/hw/nvme/ctrl.c
index 5404f67480..5f98d4778d 100644
--- a/hw/nvme/ctrl.c
+++ b/hw/nvme/ctrl.c
@@ -168,6 +168,12 @@
  *   before being marked ready. Only applicable if CC.CRIME is set by the user.
  *   The value is in units of 500 milliseconds (to be consistent with `crwmt`).
  *
+ * - `never_ready`
+ *   This parameter specifies that a namespace should never be marked as ready.
+ *   When `crwmt` amount of time has passed after enabling the controller,
+ *   status code "Namespace Not Ready" will have the DNR bit set. If specified
+ *   together with `ready_delay`, `never_ready` will take precedence.
+ *
  * Setting `zoned` to true selects Zoned Command Set at the namespace.
  * In this case, the following namespace properties are available to configure
  * zoned operation:
@@ -4156,6 +4162,14 @@ static uint16_t nvme_zone_mgmt_recv(NvmeCtrl *n, 
NvmeRequest *req)
 return status;
 }
 
+static bool nvme_ready_has_passed_timeout(NvmeCtrl *n)
+{
+int64_t current_time = qemu_clock_get_ms(QEMU_CLOCK_VIRTUAL);
+int64_t elapsed_time = current_time - n->cc_enable_timestamp;
+
+return elapsed_time > n->params.crwmt * 500;
+}
+
 static uint16_t nvme_io_cmd(NvmeCtrl *n, NvmeRequest *req)
 {
 NvmeNamespace *ns;
@@ -4202,7 +4216,11 @@ static uint16_t nvme_io_cmd(NvmeCtrl *n, NvmeRequest 
*req)
 }
 
 if (!(ns->id_indep_ns.nstat & NVME_NSTAT_NRDY)) {
-return NVME_NS_NOT_READY;
+uint16_t ret = NVME_NS_NOT_READY;
+if (ns->params.never_ready && nvme_ready_has_passed_timeout(n)) {
+ret |= NVME_DNR;
+}
+return ret;
 }
 
 if (ns->status) {
@@ -5616,6 +5634,10 @@ static void nvme_set_ready_or_start_timer(NvmeCtrl *n, 
NvmeNamespace *ns)
 {
 int64_t expire_time;
 
+if (ns->params.never_ready) {
+return;
+}
+
 if (!NVME_CC_CRIME(ldl_le_p(>bar.cc)) || ns->params.ready_delay == 0) {
 ns->id_indep_ns.nstat |= NVME_NSTAT_NRDY;
 return;
@@ -6346,6 +6368,7 @@ static void nvme_ctrl_reset(NvmeCtrl *n, NvmeResetType 
rst)
 }
 }
 
+n->cc_enable_timestamp = 0;
 n->aer_queued = 0;
 n->aer_mask = 0;
 n->outstanding_aers = 0;
@@ -6389,6 +6412,8 @@ static void nvme_ctrl_shutdown(NvmeCtrl *n)
 
 nvme_ns_shutdown(ns);
 }
+
+n->cc_enable_timestamp = 0;
 }
 
 static void nvme_ctrl_per_ns_action_on_start(NvmeCtrl *n)
@@ -6506,6 +6531,7 @@ static int nvme_start_ctrl(NvmeCtrl *n)
 NVME_CAP_SET_TO(cap, new_cap_timeout);
 stq_le_p(>bar.cap, cap);
 
+n->cc_enable_timestamp = qemu_clock_get_ms(QEMU_CLOCK_VIRTUAL);
 n->page_bits = page_bits;
 n->page_size = page_size;
 n->max_prp_ents = n->page_size / sizeof(uint64_t);
diff --git a/hw/nvme/ns.c b/hw/nvme/ns.c
index c1d70183c4..fc12b4e0d3 100644
--- a/hw/nvme/ns.c
+++ b/hw/nvme/ns.c
@@ -661,6 +661,7 @@ static Property nvme_ns_props[] = {
 DEFINE_PROP_BOOL("eui64-default", NvmeNamespace, params.eui64_default,
  false),
 DEFINE_PROP_UINT16("ready_delay", NvmeNamespace, params.ready_delay, 0),
+DEFINE_PROP_BOOL("never_ready", NvmeNamespace, params.never_ready, false),
 DEFINE_PROP_END_OF_LIST(),
 };
 
diff --git a/hw/nvme/nvme.h b/hw/nvme/nvme.h
index 9d2f13dfdb..292b1acf15 100644
--- a/hw/nvme/nvme.h
+++ b/hw/nvme/nvme.h
@@ -127,6 +127,7 @@ typedef struct NvmeNamespaceParams {
 uint64_t zrwafg;
 
 uint16_t ready_delay;
+bool never_ready;
 } NvmeNamespaceParams;
 
 typedef struct NvmeNamespace {
@@ -451,6 +452,7 @@ typedef struct NvmeCtrl {
 int cq_pending;
 uint64_thost_timestamp; /* Timestamp sent by the host 
*/
 uint64_ttimestamp_set_qemu_clock_ms;/* QEMU clock time */
+uint64_tcc_enable_timestamp;/* QEMU clock time */
 uint64_tstarttime_ms;
 uint16_ttemperature;
 uint8_t smart_critical_warning;
-- 
2.36.1




[PATCH v2 0/4] hw/nvme: add support for TP4084

2022-06-27 Thread Niklas Cassel via
Hello there,

considering that Linux v5.19 will include support for NVMe TP4084:
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/drivers/nvme/host/core.c?id=354201c53e61e493017b15327294b0c8ab522d69

I thought that it might be nice to have QEMU support for the same.

TP4084 adds a new mode, CC.CRIME, that can be used to mark a namespace
as ready independently from the controller.

When CC.CRIME is 0 (default), things behave as before, all namespaces
are ready when CSTS.RDY gets set to 1.

Add a new "ready_delay" namespace device parameter, in order to emulate
different ready latencies for namespaces when CC.CRIME is 1.

The patch series also adds a "crwmt" controller parameter, in order to
be able to expose the worst case timeout that the host should wait for
all namespaces to become ready.


Example qemu cmd line for the new options:

# delay in s (20s)
NS1_DELAY_S=20
# convert to units of 500ms
NS1_DELAY=$((NS1_DELAY_S*2))

# delay in s (60s)
NS2_DELAY_S=60
# convert to units of 500ms
NS2_DELAY=$((NS2_DELAY_S*2))

# timeout in s (120s)
CRWMT_S=120
# convert to units of 500ms
CRWMT=$((CRWMT_S*2))

 -device nvme,serial=deadbeef,crwmt=$CRWMT \
 -drive file=$NS1_DATA,id=nvm-1,format=raw,if=none \
 -device nvme-ns,drive=nvm-1,ready_delay=$NS1_DELAY \
 -drive file=$NS2_DATA,id=nvm-2,format=raw,if=none \
 -device nvme-ns,drive=nvm-2,ready_delay=$NS2_DELAY \


Changes since v1:
-Rebased on nvme-next
-Set id_indep_ns->nmic if ns->params.shared in patch 3/4.


Niklas Cassel (4):
  hw/nvme: claim NVMe 2.0 compliance
  hw/nvme: store a pointer to the NvmeSubsystem in the NvmeNamespace
  hw/nvme: add support for ratified TP4084
  hw/nvme: add new never_ready parameter to test the DNR bit

 hw/nvme/ctrl.c   | 151 +--
 hw/nvme/ns.c |  20 ++
 hw/nvme/nvme.h   |   9 +++
 hw/nvme/trace-events |   1 +
 include/block/nvme.h |  60 -
 5 files changed, 236 insertions(+), 5 deletions(-)

-- 
2.36.1




[PATCH v2 1/4] hw/nvme: claim NVMe 2.0 compliance

2022-06-27 Thread Niklas Cassel via
CRMS.CRWMS bit shall be set to 1 on controllers compliant with versions
later than NVMe 1.4.

The first version later than NVMe 1.4 is NVMe 2.0

Let's claim compliance with NVMe 2.0 such that a follow up patch can
set the CRMS.CRWMS bit.

This is needed since CC.CRIME is only writable when both CRMS.CRIMS
and CRMS.CRWMS is set.

Signed-off-by: Niklas Cassel 
---
 hw/nvme/ctrl.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/hw/nvme/ctrl.c b/hw/nvme/ctrl.c
index f8ec4a7be3..8ca824ea14 100644
--- a/hw/nvme/ctrl.c
+++ b/hw/nvme/ctrl.c
@@ -204,7 +204,7 @@
 
 #define NVME_MAX_IOQPAIRS 0x
 #define NVME_DB_SIZE  4
-#define NVME_SPEC_VER 0x00010400
+#define NVME_SPEC_VER 0x0002
 #define NVME_CMB_BIR 2
 #define NVME_PMR_BIR 4
 #define NVME_TEMPERATURE 0x143
-- 
2.36.1




Re: [PATCH v4] hw: m25p80: add tests for write protect (WP# and SRWD bit)

2022-06-27 Thread Cédric Le Goater

On 6/24/22 20:30, Iris Chen wrote:

Signed-off-by: Iris Chen 


Reviewed-by: Cédric Le Goater 

Thanks,

C.


---
Adding Signed Off By tag -- sorry I missed that !

  tests/qtest/aspeed_smc-test.c | 62 +++
  1 file changed, 62 insertions(+)

diff --git a/tests/qtest/aspeed_smc-test.c b/tests/qtest/aspeed_smc-test.c
index ec233315e6..7786addfb8 100644
--- a/tests/qtest/aspeed_smc-test.c
+++ b/tests/qtest/aspeed_smc-test.c
@@ -56,7 +56,9 @@ enum {
  BULK_ERASE = 0xc7,
  READ = 0x03,
  PP = 0x02,
+WRSR = 0x1,
  WREN = 0x6,
+SRWD = 0x80,
  RESET_ENABLE = 0x66,
  RESET_MEMORY = 0x99,
  EN_4BYTE_ADDR = 0xB7,
@@ -390,6 +392,64 @@ static void test_read_status_reg(void)
  flash_reset();
  }
  
+static void test_status_reg_write_protection(void)

+{
+uint8_t r;
+
+spi_conf(CONF_ENABLE_W0);
+
+/* default case: WP# is high and SRWD is low -> status register writable */
+spi_ctrl_start_user();
+writeb(ASPEED_FLASH_BASE, WREN);
+/* test ability to write SRWD */
+writeb(ASPEED_FLASH_BASE, WRSR);
+writeb(ASPEED_FLASH_BASE, SRWD);
+writeb(ASPEED_FLASH_BASE, RDSR);
+r = readb(ASPEED_FLASH_BASE);
+spi_ctrl_stop_user();
+g_assert_cmphex(r & SRWD, ==, SRWD);
+
+/* WP# high and SRWD high -> status register writable */
+spi_ctrl_start_user();
+writeb(ASPEED_FLASH_BASE, WREN);
+/* test ability to write SRWD */
+writeb(ASPEED_FLASH_BASE, WRSR);
+writeb(ASPEED_FLASH_BASE, 0);
+writeb(ASPEED_FLASH_BASE, RDSR);
+r = readb(ASPEED_FLASH_BASE);
+spi_ctrl_stop_user();
+g_assert_cmphex(r & SRWD, ==, 0);
+
+/* WP# low and SRWD low -> status register writable */
+qtest_set_irq_in(global_qtest,
+ "/machine/soc/fmc/ssi.0/child[0]", "WP#", 0, 0);
+spi_ctrl_start_user();
+writeb(ASPEED_FLASH_BASE, WREN);
+/* test ability to write SRWD */
+writeb(ASPEED_FLASH_BASE, WRSR);
+writeb(ASPEED_FLASH_BASE, SRWD);
+writeb(ASPEED_FLASH_BASE, RDSR);
+r = readb(ASPEED_FLASH_BASE);
+spi_ctrl_stop_user();
+g_assert_cmphex(r & SRWD, ==, SRWD);
+
+/* WP# low and SRWD high -> status register NOT writable */
+spi_ctrl_start_user();
+writeb(ASPEED_FLASH_BASE, WREN);
+/* test ability to write SRWD */
+writeb(ASPEED_FLASH_BASE, WRSR);
+writeb(ASPEED_FLASH_BASE, 0);
+writeb(ASPEED_FLASH_BASE, RDSR);
+r = readb(ASPEED_FLASH_BASE);
+spi_ctrl_stop_user();
+/* write is not successful */
+g_assert_cmphex(r & SRWD, ==, SRWD);
+
+qtest_set_irq_in(global_qtest,
+ "/machine/soc/fmc/ssi.0/child[0]", "WP#", 0, 1);
+flash_reset();
+}
+
  static char tmp_path[] = "/tmp/qtest.m25p80.XX";
  
  int main(int argc, char **argv)

@@ -416,6 +476,8 @@ int main(int argc, char **argv)
  qtest_add_func("/ast2400/smc/read_page_mem", test_read_page_mem);
  qtest_add_func("/ast2400/smc/write_page_mem", test_write_page_mem);
  qtest_add_func("/ast2400/smc/read_status_reg", test_read_status_reg);
+qtest_add_func("/ast2400/smc/status_reg_write_protection",
+   test_status_reg_write_protection);
  
  ret = g_test_run();
  





[PATCH 3/4] libvduse: Pass positive value to strerror()

2022-06-27 Thread Xie Yongji
The value passed to strerror() should be positive.
So let's fix it.

Fixes: Coverity CID 1490226, 1490223
Signed-off-by: Xie Yongji 
---
 subprojects/libvduse/libvduse.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/subprojects/libvduse/libvduse.c b/subprojects/libvduse/libvduse.c
index 1e36227388..1a5981445c 100644
--- a/subprojects/libvduse/libvduse.c
+++ b/subprojects/libvduse/libvduse.c
@@ -1257,7 +1257,7 @@ VduseDev *vduse_dev_create_by_name(const char *name, 
uint16_t num_queues,
 ret = vduse_dev_init(dev, name, num_queues, ops, priv);
 if (ret < 0) {
 fprintf(stderr, "Failed to init vduse device %s: %s\n",
-name, strerror(ret));
+name, strerror(-ret));
 free(dev);
 return NULL;
 }
@@ -1331,7 +1331,7 @@ VduseDev *vduse_dev_create(const char *name, uint32_t 
device_id,
 ret = vduse_dev_init(dev, name, num_queues, ops, priv);
 if (ret < 0) {
 fprintf(stderr, "Failed to init vduse device %s: %s\n",
-name, strerror(ret));
+name, strerror(-ret));
 goto err;
 }
 
-- 
2.20.1




[PATCH 4/4] libvduse: Check the return value of some ioctls

2022-06-27 Thread Xie Yongji
Coverity pointed out (CID 1490222, 1490227) that we called
ioctl somewhere without checking the return value. This
patch fixes these issues.

Fixes: Coverity CID 1490222, 1490227
Signed-off-by: Xie Yongji 
---
 subprojects/libvduse/libvduse.c | 10 --
 1 file changed, 8 insertions(+), 2 deletions(-)

diff --git a/subprojects/libvduse/libvduse.c b/subprojects/libvduse/libvduse.c
index 1a5981445c..bf7302c60a 100644
--- a/subprojects/libvduse/libvduse.c
+++ b/subprojects/libvduse/libvduse.c
@@ -947,7 +947,10 @@ static void vduse_queue_disable(VduseVirtq *vq)
 
 eventfd.index = vq->index;
 eventfd.fd = VDUSE_EVENTFD_DEASSIGN;
-ioctl(dev->fd, VDUSE_VQ_SETUP_KICKFD, );
+if (ioctl(dev->fd, VDUSE_VQ_SETUP_KICKFD, )) {
+fprintf(stderr, "Failed to disable eventfd for vq[%d]: %s\n",
+vq->index, strerror(errno));
+}
 close(vq->fd);
 
 assert(vq->inuse == 0);
@@ -1337,7 +1340,10 @@ VduseDev *vduse_dev_create(const char *name, uint32_t 
device_id,
 
 return dev;
 err:
-ioctl(ctrl_fd, VDUSE_DESTROY_DEV, name);
+if (ioctl(ctrl_fd, VDUSE_DESTROY_DEV, name)) {
+fprintf(stderr, "Failed to destroy vduse device %s: %s\n",
+name, strerror(errno));
+}
 err_dev:
 close(ctrl_fd);
 err_ctrl:
-- 
2.20.1




[PATCH 0/4] Fix some coverity issues on VDUSE

2022-06-27 Thread Xie Yongji
This series fixes some issues reported by coverity.

Patch 1 fixes an incorrect function name.

Patch 2 fixes Coverity CID 1490224.

Patch 3 fixes Coverity CID 1490226, 1490223.

Patch 4 fixes Coverity CID 1490222, 1490227.

Xie Yongji (4):
  libvduse: Fix the incorrect function name
  libvduse: Replace strcpy() with strncpy()
  libvduse: Pass positive value to strerror()
  libvduse: Check the return value of some ioctls

 subprojects/libvduse/libvduse.c | 23 +++
 1 file changed, 15 insertions(+), 8 deletions(-)

-- 
2.20.1




[PATCH 1/4] libvduse: Fix the incorrect function name

2022-06-27 Thread Xie Yongji
In vduse_name_is_valid(), we actually check whether
the name is invalid or not. So let's change the
function name to vduse_name_is_invalid() to match
the behavior.

Signed-off-by: Xie Yongji 
---
 subprojects/libvduse/libvduse.c | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/subprojects/libvduse/libvduse.c b/subprojects/libvduse/libvduse.c
index 9a2bcec282..6374933881 100644
--- a/subprojects/libvduse/libvduse.c
+++ b/subprojects/libvduse/libvduse.c
@@ -1193,7 +1193,7 @@ static int vduse_dev_init(VduseDev *dev, const char *name,
 return 0;
 }
 
-static inline bool vduse_name_is_valid(const char *name)
+static inline bool vduse_name_is_invalid(const char *name)
 {
 return strlen(name) >= VDUSE_NAME_MAX || strstr(name, "..");
 }
@@ -1242,7 +1242,7 @@ VduseDev *vduse_dev_create_by_name(const char *name, 
uint16_t num_queues,
 VduseDev *dev;
 int ret;
 
-if (!name || vduse_name_is_valid(name) || !ops ||
+if (!name || vduse_name_is_invalid(name) || !ops ||
 !ops->enable_queue || !ops->disable_queue) {
 fprintf(stderr, "Invalid parameter for vduse\n");
 return NULL;
@@ -1276,7 +1276,7 @@ VduseDev *vduse_dev_create(const char *name, uint32_t 
device_id,
 struct vduse_dev_config *dev_config;
 size_t size = offsetof(struct vduse_dev_config, config);
 
-if (!name || vduse_name_is_valid(name) ||
+if (!name || vduse_name_is_invalid(name) ||
 !has_feature(features,  VIRTIO_F_VERSION_1) || !config ||
 !config_size || !ops || !ops->enable_queue || !ops->disable_queue) {
 fprintf(stderr, "Invalid parameter for vduse\n");
-- 
2.20.1




[PATCH 2/4] libvduse: Replace strcpy() with strncpy()

2022-06-27 Thread Xie Yongji
Coverity reported a string overflow issue since we copied
"name" to "dev_config->name" without checking the length.
This should be a false positive since we already checked
the length of "name" in vduse_name_is_invalid(). But anyway,
let's replace strcpy() with strncpy() to fix the coverity
complaint.

Fixes: Coverity CID 1490224
Signed-off-by: Xie Yongji 
---
 subprojects/libvduse/libvduse.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/subprojects/libvduse/libvduse.c b/subprojects/libvduse/libvduse.c
index 6374933881..1e36227388 100644
--- a/subprojects/libvduse/libvduse.c
+++ b/subprojects/libvduse/libvduse.c
@@ -1309,7 +1309,8 @@ VduseDev *vduse_dev_create(const char *name, uint32_t 
device_id,
 goto err_dev;
 }
 
-strcpy(dev_config->name, name);
+strncpy(dev_config->name, name, VDUSE_NAME_MAX);
+dev_config->name[VDUSE_NAME_MAX - 1] = '\0';
 dev_config->device_id = device_id;
 dev_config->vendor_id = vendor_id;
 dev_config->features = features;
-- 
2.20.1




Re: [RFC v3 4/5] file-posix: introduce get_sysfs_str_val for device zoned model.

2022-06-27 Thread Sam Li
Hannes Reinecke  于2022年6月27日周一 15:41写道:
>
> On 6/27/22 02:19, Sam Li wrote:
> > ---
> >   block/file-posix.c   | 60 
> >   include/block/block-common.h |  4 +--
> >   2 files changed, 62 insertions(+), 2 deletions(-)
> >
> > diff --git a/block/file-posix.c b/block/file-posix.c
> > index 73c2cdfbca..74c0245e0f 100644
> > --- a/block/file-posix.c
> > +++ b/block/file-posix.c
> > @@ -1277,6 +1277,66 @@ out:
> >   #endif
> >   }
> >
> > +/*
> > + * Convert the zoned attribute file in sysfs to internal value.
> > + */
> > +static zone_model get_sysfs_str_val(int fd, struct stat *st) {
> > +#ifdef CONFIG_LINUX
> > +char buf[32];
> > +char *sysfspath = NULL;
> > +int ret;
> > +int sysfd = -1;
> > +
> > +if (S_ISCHR(st->st_mode)) {
> > +if (ioctl(fd, SG_GET_SG_TABLESIZE, ) == 0) {
> > +return ret;
> > +}
> > +return -ENOTSUP;
> > +}
> > +
> > +if (!S_ISBLK(st->st_mode)) {
> > +return -ENOTSUP;
> > +}
> > +
> > +sysfspath = g_strdup_printf("/sys/dev/block/%u:%u/queue/zoned",
> > +major(st->st_rdev), minor(st->st_rdev));
> > +sysfd = open(sysfspath, O_RDONLY);
> > +if (sysfd == -1) {
> > +ret = -errno;
> > +goto out;
> > +}
> > +do {
> > +ret = read(sysfd, buf, sizeof(buf) - 1);
> > +} while (ret == -1 && errno == EINTR);
>
> This is wrong.
> read() might return a value smaller than the 'len' argument (sizeof(buf)
> -1 in your case). But in that case it's a short read, and one need to
> call 'read()' again to fetch the remaining bytes.
>
> So the correct code would be something like:
>
> offset = 0;
> do {
>  ret = read(sysfd, buf + offset, sizeof(buf) - 1 + offset);
>  if (ret > 0)
>  offset += ret;
> } while (ret > 0);
>
> Not that you'd actually need it; reads from sysfs are basically never
> interrupted, so you should be able to read from an attribute in one go.
> IE alternatively you can drop the 'while' loop and just call read().
>
Yes, I didn't think it through.

> > +if (ret < 0) {
> > +ret = -errno;
> > +goto out;
> > +} else if (ret == 0) {
> > +ret = -EIO;
> > +goto out;
> > +}
> > +buf[ret] = 0;
> > +
> > +/* The file is ended with '\n' */
>
> I'd rather check if the string ends with an '\n', and overwrite
> it with a '\0'. That way you'd be insulated against any changes
> to sysfs.
>
Right! I was thinking about it but got hasty to hardcode everything.

> > +if (strcmp(buf, "host-managed\n") == 0) {
> > +return BLK_Z_HM;
> > +} else if (strcmp(buf, "host-aware\n") == 0) {
> > +return BLK_Z_HA;
> > +} else {
> > +return -ENOTSUP;
> > +}
> > +
> > +out:
> > +if (sysfd != -1) {
> > +close(sysfd);
> > +}
> > +g_free(sysfspath);
> > +return ret;
> > +#else
> > +return -ENOTSUP;
> > +#endif
> > +}
> > +
> >   static int hdev_get_max_segments(int fd, struct stat *st) {
> >   int ret;
> >   ret = get_sysfs_long_val(fd, st, "max_segments");
>
> And as you already set a precedent in your previous patch, I'd recommend
> split this in two patches, one introducing a generic function
> 'get_sysfs_str_val()' which returns a string and another function
> (eg hdev_get_zone_model()) which calls this function to fetch the device
> zoned model.
>
Maybe we can just return a str in get_sysfs_str_val and ignore
returning a zone_model for now? I was using zone_model for testing
purpose. Right now, this function is not used by others in
file-posix.c that causes compilation error in QEMU.

Thanks for reviewing!
Sam
> > diff --git a/include/block/block-common.h b/include/block/block-common.h
> > index 78cddeeda5..35e00afe8e 100644
> > --- a/include/block/block-common.h
> > +++ b/include/block/block-common.h
> > @@ -56,8 +56,8 @@ typedef enum zone_op {
> >   } zone_op;
> >
> >   typedef enum zone_model {
> > -BLK_Z_HM,
> > -BLK_Z_HA,
> > +BLK_Z_HM = 0x1,
> > +BLK_Z_HA = 0x2,
> >   } zone_model;
> >
> >   typedef enum BlkZoneCondition {
>
> Cheers,
>
> Hannes
> --
> Dr. Hannes ReineckeKernel Storage Architect
> h...@suse.de  +49 911 74053 688
> SUSE Software Solutions Germany GmbH, Maxfeldstr. 5, 90409 Nürnberg
> HRB 36809 (AG Nürnberg), GF: Felix Imendörffer



Re: [PULL v2 10/20] libvduse: Add VDUSE (vDPA Device in Userspace) library

2022-06-27 Thread Yongji Xie
On Mon, Jun 27, 2022 at 12:45 PM Markus Armbruster  wrote:
>
> Kevin Wolf  writes:
>
> > From: Xie Yongji 
> >
> > VDUSE [1] is a linux framework that makes it possible to implement
> > software-emulated vDPA devices in userspace. This adds a library
> > as a subproject to help implementing VDUSE backends in QEMU.
> >
> > [1] https://www.kernel.org/doc/html/latest/userspace-api/vduse.html
> >
> > Signed-off-by: Xie Yongji 
> > Message-Id: <20220523084611.91-6-xieyon...@bytedance.com>
> > Reviewed-by: Stefan Hajnoczi 
> > Signed-off-by: Kevin Wolf 
> > ---
> >  meson_options.txt   |2 +
> >  subprojects/libvduse/include/atomic.h   |1 +
> >  subprojects/libvduse/include/compiler.h |1 +
> >  subprojects/libvduse/libvduse.h |  235 
> >  subprojects/libvduse/libvduse.c | 1150 +++
> >  MAINTAINERS |5 +
> >  meson.build |   15 +
> >  scripts/meson-buildoptions.sh   |3 +
> >  subprojects/libvduse/linux-headers/linux|1 +
> >  subprojects/libvduse/meson.build|   10 +
> >  subprojects/libvduse/standard-headers/linux |1 +
> >  11 files changed, 1424 insertions(+)
> >  create mode 12 subprojects/libvduse/include/atomic.h
> >  create mode 12 subprojects/libvduse/include/compiler.h
> >  create mode 100644 subprojects/libvduse/libvduse.h
> >  create mode 100644 subprojects/libvduse/libvduse.c
> >  create mode 12 subprojects/libvduse/linux-headers/linux
> >  create mode 100644 subprojects/libvduse/meson.build
> >  create mode 12 subprojects/libvduse/standard-headers/linux
> >
> > diff --git a/meson_options.txt b/meson_options.txt
> > index f3e2f22c1e..23a9f440f7 100644
> > --- a/meson_options.txt
> > +++ b/meson_options.txt
> > @@ -257,6 +257,8 @@ option('virtfs', type: 'feature', value: 'auto',
> > description: 'virtio-9p support')
> >  option('virtiofsd', type: 'feature', value: 'auto',
> > description: 'build virtiofs daemon (virtiofsd)')
> > +option('libvduse', type: 'feature', value: 'auto',
> > +   description: 'build VDUSE Library')
> >
> >  option('capstone', type: 'feature', value: 'auto',
> > description: 'Whether and how to find the capstone library')
> > diff --git a/subprojects/libvduse/include/atomic.h 
> > b/subprojects/libvduse/include/atomic.h
> > new file mode 12
> > index 00..8c2be64f7b
> > --- /dev/null
> > +++ b/subprojects/libvduse/include/atomic.h
> > @@ -0,0 +1 @@
> > +../../../include/qemu/atomic.h
> > \ No newline at end of file
> > diff --git a/subprojects/libvduse/include/compiler.h 
> > b/subprojects/libvduse/include/compiler.h
> > new file mode 12
> > index 00..de7b70697c
> > --- /dev/null
> > +++ b/subprojects/libvduse/include/compiler.h
> > @@ -0,0 +1 @@
> > +../../../include/qemu/compiler.h
> > \ No newline at end of file
> > diff --git a/subprojects/libvduse/libvduse.h 
> > b/subprojects/libvduse/libvduse.h
> > new file mode 100644
> > index 00..6c2fe98213
> > --- /dev/null
> > +++ b/subprojects/libvduse/libvduse.h
> > @@ -0,0 +1,235 @@
> > +/*
> > + * VDUSE (vDPA Device in Userspace) library
> > + *
> > + * Copyright (C) 2022 Bytedance Inc. and/or its affiliates. All rights 
> > reserved.
> > + *
> > + * Author:
> > + *   Xie Yongji 
> > + *
> > + * This work is licensed under the terms of the GNU GPL, version 2 or
> > + * later.  See the COPYING file in the top-level directory.
> > + */
> > +
> > +#ifndef LIBVDUSE_H
> > +#define LIBVDUSE_H
> > +
> > +#include 
> > +#include 
> > +
> > +#define VIRTQUEUE_MAX_SIZE 1024
> > +
> > +/* VDUSE device structure */
> > +typedef struct VduseDev VduseDev;
> > +
> > +/* Virtqueue structure */
> > +typedef struct VduseVirtq VduseVirtq;
> > +
> > +/* Some operation of VDUSE backend */
> > +typedef struct VduseOps {
> > +/* Called when virtqueue can be processed */
> > +void (*enable_queue)(VduseDev *dev, VduseVirtq *vq);
> > +/* Called when virtqueue processing should be stopped */
> > +void (*disable_queue)(VduseDev *dev, VduseVirtq *vq);
> > +} VduseOps;
> > +
> > +/* Describing elements of the I/O buffer */
> > +typedef struct VduseVirtqElement {
> > +/* Descriptor table index */
> > +unsigned int index;
> > +/* Number of physically-contiguous device-readable descriptors */
> > +unsigned int out_num;
> > +/* Number of physically-contiguous device-writable descriptors */
> > +unsigned int in_num;
> > +/* Array to store physically-contiguous device-writable descriptors */
> > +struct iovec *in_sg;
> > +/* Array to store physically-contiguous device-readable descriptors */
> > +struct iovec *out_sg;
> > +} VduseVirtqElement;
> > +
> > +
> > +/**
> > + * vduse_get_virtio_features:
> > + *
> > + * Get supported virtio features
> > + *
> > + * Returns: supported feature bits
> > + */
> > +uint64_t vduse_get_virtio_features(void);
> 

Re: [RFC v3 3/5] file-posix: introduce get_sysfs_long_val for zoned device information.

2022-06-27 Thread Sam Li
Hannes Reinecke  于2022年6月27日周一 15:31写道:
>
> On 6/27/22 02:19, Sam Li wrote:
> > Use sysfs attribute files to get the zoned device information in case
> > that ioctl() commands of zone management interface won't work. It can
> > return long type of value like chunk_sectors, zoned_append_max_bytes,
> > max_open_zones, max_active_zones.
> > ---
> >   block/file-posix.c | 37 +
> >   1 file changed, 25 insertions(+), 12 deletions(-)
> >
> > diff --git a/block/file-posix.c b/block/file-posix.c
> > index 1b8b0d351f..73c2cdfbca 100644
> > --- a/block/file-posix.c
> > +++ b/block/file-posix.c
> > @@ -1216,15 +1216,19 @@ static int hdev_get_max_hw_transfer(int fd, struct 
> > stat *st)
> >   #endif
> >   }
> >
> > -static int hdev_get_max_segments(int fd, struct stat *st)
> > -{
> > +/*
> > + * Get zoned device information (chunk_sectors, zoned_append_max_bytes,
> > + * max_open_zones, max_active_zones) through sysfs attribute files.
> > + */
> > +static long get_sysfs_long_val(int fd, struct stat *st,
> > +   const char *attribute) {
> >   #ifdef CONFIG_LINUX
> >   char buf[32];
> >   const char *end;
> >   char *sysfspath = NULL;
> >   int ret;
> >   int sysfd = -1;
> > -long max_segments;
> > +long val;
> >
> >   if (S_ISCHR(st->st_mode)) {
> >   if (ioctl(fd, SG_GET_SG_TABLESIZE, ) == 0) {
> > @@ -1237,8 +1241,9 @@ static int hdev_get_max_segments(int fd, struct stat 
> > *st)
> >   return -ENOTSUP;
> >   }
> >
> > -sysfspath = g_strdup_printf("/sys/dev/block/%u:%u/queue/max_segments",
> > -major(st->st_rdev), minor(st->st_rdev));
> > +sysfspath = g_strdup_printf("/sys/dev/block/%u:%u/queue/%s",
> > +major(st->st_rdev), minor(st->st_rdev),
> > +attribute);
> >   sysfd = open(sysfspath, O_RDONLY);
> >   if (sysfd == -1) {
> >   ret = -errno;
> > @@ -1256,9 +1261,9 @@ static int hdev_get_max_segments(int fd, struct stat 
> > *st)
> >   }
> >   buf[ret] = 0;
> >   /* The file is ended with '\n', pass 'end' to accept that. */
> > -ret = qemu_strtol(buf, , 10, _segments);
> > +ret = qemu_strtol(buf, , 10, );
> >   if (ret == 0 && end && *end == '\n') {
> > -ret = max_segments;
> > +ret = val;
> >   }
> >
> >   out:
> > @@ -1272,6 +1277,15 @@ out:
> >   #endif
> >   }
> >
> > +static int hdev_get_max_segments(int fd, struct stat *st) {
> > +int ret;
> > +ret = get_sysfs_long_val(fd, st, "max_segments");
> > +if (ret < 0) {
> > +return -1;
> > +}
> > +return ret;
> > +}
> > +
> >   static void raw_refresh_limits(BlockDriverState *bs, Error **errp)
> >   {
> >   BDRVRawState *s = bs->opaque;
> > @@ -1872,6 +1886,7 @@ static int handle_aiocb_zone_report(void *opaque) {
> >
> >   static int handle_aiocb_zone_mgmt(void *opaque) {
> >   RawPosixAIOData *aiocb = opaque;
> > +BlockDriverState *s = aiocb->bs;
> >   int fd = aiocb->aio_fildes;
> >   int64_t offset = aiocb->aio_offset;
> >   int64_t len = aiocb->aio_nbytes;
> > @@ -1884,11 +1899,9 @@ static int handle_aiocb_zone_mgmt(void *opaque) {
> >   int64_t zone_size_mask;
> >   int ret;
> >
> > -ret = ioctl(fd, BLKGETZONESZ, _size);
> > -if (ret) {
> > -return -1;
> > -}
> > -
> > +g_autofree struct stat *file = g_new(struct stat, 1);
> > +stat(s->filename, file);
> > +zone_size = get_sysfs_long_val(fd, file, "chunk_sectors");
> >   zone_size_mask = zone_size - 1;
> >   if (offset & zone_size_mask) {
> >   error_report("offset is not the start of a zone");
>
> Round of applause.
>
Thanks! It was based on Damien's suggestion.

> Reviewed-by: Hannes Reinecke 
>
> Cheers,
>
> Hannes
> --
> Dr. Hannes ReineckeKernel Storage Architect
> h...@suse.de  +49 911 74053 688
> SUSE Software Solutions Germany GmbH, Maxfeldstr. 5, 90409 Nürnberg
> HRB 36809 (AG Nürnberg), GF: Felix Imendörffer



Re: [RFC v3 2/5] qemu-io: add zoned block device operations.

2022-06-27 Thread Sam Li
Hannes Reinecke  于2022年6月27日周一 15:30写道:
>
> On 6/27/22 02:19, Sam Li wrote:
> > ---
>
> Good coding style would advise to add some text here what the patch does.
>
> >   block/io.c   |  21 +++
> >   include/block/block-io.h |  13 +
> >   qemu-io-cmds.c   | 121 +++
> >   3 files changed, 155 insertions(+)
> >
> > diff --git a/block/io.c b/block/io.c
> > index 789e6373d5..656a1b7271 100644
> > --- a/block/io.c
> > +++ b/block/io.c
> > @@ -3258,6 +3258,27 @@ out:
> >   return co.ret;
> >   }
> >
> > +int bdrv_co_zone_report(BlockDriverState *bs, int64_t offset,
> > +int64_t len, int64_t *nr_zones,
> > +BlockZoneDescriptor *zones)
> > +{
> > +if (!bs->drv->bdrv_co_zone_report) {
> > +return -ENOTSUP;
>
> ENOTSUP or EOPNOTSUP?
> Kevin?
>
> > +}
> > +
> > +return bs->drv->bdrv_co_zone_report(bs, offset, len, nr_zones, zones);
> > +}
> > +
> > +int bdrv_co_zone_mgmt(BlockDriverState *bs, enum zone_op op,
> > +int64_t offset, int64_t len)
> > +{
> > +if (!bs->drv->bdrv_co_zone_mgmt) {
> > +return -ENOTSUP;
> > +}
> > +
> > +return bs->drv->bdrv_co_zone_mgmt(bs, op, offset, len);
> > +}
> > +
> >   void *qemu_blockalign(BlockDriverState *bs, size_t size)
> >   {
> >   IO_CODE();
> > diff --git a/include/block/block-io.h b/include/block/block-io.h
> > index 62c84f0519..c85c174579 100644
> > --- a/include/block/block-io.h
> > +++ b/include/block/block-io.h
> > @@ -80,6 +80,13 @@ int bdrv_co_ioctl(BlockDriverState *bs, int req, void 
> > *buf);
> >   /* Ensure contents are flushed to disk.  */
> >   int coroutine_fn bdrv_co_flush(BlockDriverState *bs);
> >
> > +/* Report zone information of zone block device. */
> > +int coroutine_fn bdrv_co_zone_report(BlockDriverState *bs, int64_t offset,
> > + int64_t len, int64_t *nr_zones,
> > + BlockZoneDescriptor *zones);
> > +int coroutine_fn bdrv_co_zone_mgmt(BlockDriverState *bs, zone_op op,
> > +int64_t offset, int64_t len);
> > +
>
> There's the thing with the intendation ... please make it consistent,
> and ideally follow with whatever the remaining prototypes do.
>
> >   int bdrv_co_pdiscard(BdrvChild *child, int64_t offset, int64_t bytes);
> >   bool bdrv_can_write_zeroes_with_unmap(BlockDriverState *bs);
> >   int bdrv_block_status(BlockDriverState *bs, int64_t offset,
> > @@ -290,6 +297,12 @@ bdrv_readv_vmstate(BlockDriverState *bs, QEMUIOVector 
> > *qiov, int64_t pos);
> >   int generated_co_wrapper
> >   bdrv_writev_vmstate(BlockDriverState *bs, QEMUIOVector *qiov, int64_t 
> > pos);
> >
> > +int generated_co_wrapper blk_zone_report(BlockBackend *blk, int64_t offset,
> > + int64_t len, int64_t *nr_zones,
> > + BlockZoneDescriptor *zones);
> > +int generated_co_wrapper blk_zone_mgmt(BlockBackend *blk, enum zone_op op,
> > +int64_t offset, int64_t len);
> > +
>
> Again here.
>
> >   /**
> >* bdrv_parent_drained_begin_single:
> >*
> > diff --git a/qemu-io-cmds.c b/qemu-io-cmds.c
> > index 2f0d8ac25a..3f2592b9f5 100644
> > --- a/qemu-io-cmds.c
> > +++ b/qemu-io-cmds.c
> > @@ -1706,6 +1706,122 @@ static const cmdinfo_t flush_cmd = {
> >   .oneline= "flush all in-core file state to disk",
> >   };
> >
> > +static int zone_report_f(BlockBackend *blk, int argc, char **argv)
> > +{
> > +int ret;
> > +int64_t offset, len, nr_zones;
> > +int i = 0;
> > +
> > +++optind;
> > +offset = cvtnum(argv[optind]);
> > +++optind;
> > +len = cvtnum(argv[optind]);
> > +++optind;
> > +nr_zones = cvtnum(argv[optind]);
> > +
> And 'optind' is set where?

optind is declared in getopt_core.h.

> Plus do check for 'argv' overflow; before increasing 'optind' and using
> 'argv[optind]' you have to validate that 'argv[optind]' is a valid pointer.
>

Maybe we can check if argc is bigger than what we want?


> > +g_autofree BlockZoneDescriptor *zones = g_new(BlockZoneDescriptor, 
> > nr_zones);
> > +ret = blk_zone_report(blk, offset, len, _zones, zones);
> > +while (i < nr_zones) {
> > +fprintf(stdout, "start: 0x%lx, len 0x%lx, cap 0x%lx, wptr 0x%lx, "
> > +"zcond:%u, [type: %u]\n",
> > +zones[i].start, zones[i].length, zones[i].cap, zones[i].wp,
> > +zones[i].cond, zones[i].type);
> > +++i;
> As 'i' is a simple iterator maybe use a 'for' loop here.
> But that really is a matter of preference :-)
>
Ok!

> > +}
> > +return ret;
> > +}
> > +
> > +static const cmdinfo_t zone_report_cmd = {
> > +.name = "zone_report",
> > +.altname = "f",
>
> altname 'f'?
> Is that correct?
>
No, it's not. I misunderstood altname's meaning. Changed it now.

> > +.cfunc = zone_report_f,
> > +.argmin = 3,
> > +

Re: [RFC v3 1/5] block: add block layer APIs resembling Linux ZonedBlockDevice ioctls.

2022-06-27 Thread Sam Li
Hi Hannes,

Hannes Reinecke  于2022年6月27日周一 15:21写道:

>
> On 6/27/22 02:19, Sam Li wrote:
> > By adding zone management operations in BlockDriver, storage
> > controller emulation can use the new block layer APIs including
> > zone_report and zone_mgmt(open, close, finish, reset).
> > ---
> >   block/block-backend.c|  56 
> >   block/coroutines.h   |   5 +
> >   block/file-posix.c   | 238 +++
> >   include/block/block-common.h |  43 +-
> >   include/block/block_int-common.h |  20 +++
> >   5 files changed, 361 insertions(+), 1 deletion(-)
> >
> > diff --git a/block/block-backend.c b/block/block-backend.c
> > index e0e1aff4b1..786f964d02 100644
> > --- a/block/block-backend.c
> > +++ b/block/block-backend.c
> > @@ -1810,6 +1810,62 @@ int blk_flush(BlockBackend *blk)
> >   return ret;
> >   }
> >
> > +/*
> > + * Return zone_report from BlockDriver. Offset can be any number within
> > + * the zone size. No alignment for offset and len.
> > + */
> > +int coroutine_fn blk_co_zone_report(BlockBackend *blk, int64_t offset,
> > +   int64_t len, int64_t *nr_zones,
> > +   BlockZoneDescriptor *zones)
> > +{
> > +int ret;
> > +BlockDriverState *bs;
> > +IO_CODE();
> > +
> > +blk_inc_in_flight(blk); /* increase before waiting */
> > +blk_wait_while_drained(blk);
> > +bs = blk_bs(blk);
> > +
> > +ret = blk_check_byte_request(blk, offset, len);
> > +if (ret < 0) {
> > +return ret;
> > +}
> > +
> > +bdrv_inc_in_flight(bs);
> > +ret = bdrv_co_zone_report(blk->root->bs, offset, len,
> > +  nr_zones, zones);
> > +bdrv_dec_in_flight(bs);
> > +blk_dec_in_flight(blk);
> > +return ret;
> > +}
> > +
> > +/*
> > + * Return zone_mgmt from BlockDriver.
> > + * Offset is the start of a zone and len is aligned to zones.
> > + */
> > +int coroutine_fn blk_co_zone_mgmt(BlockBackend *blk, enum zone_op op,
> > +int64_t offset, int64_t len)
> > +{
> > +int ret;
> > +BlockDriverState *bs;
> > +IO_CODE();
> > +
> > +blk_inc_in_flight(blk);
> > +blk_wait_while_drained(blk);
> > +bs = blk_bs(blk);
> > +
> > +ret = blk_check_byte_request(blk, offset, len);
> > +if (ret < 0) {
> > +return ret;
> > +}
> > +
> > +bdrv_inc_in_flight(bs);
> > +ret = bdrv_co_zone_mgmt(blk->root->bs, op, offset, len);
> > +bdrv_dec_in_flight(bs);
> > +blk_dec_in_flight(blk);
> > +return ret;
> > +}
> > +
> >   void blk_drain(BlockBackend *blk)
> >   {
> >   BlockDriverState *bs = blk_bs(blk);
> > diff --git a/block/coroutines.h b/block/coroutines.h
> > index 830ecaa733..a114d7bc30 100644
> > --- a/block/coroutines.h
> > +++ b/block/coroutines.h
> > @@ -80,6 +80,11 @@ int coroutine_fn
> >   blk_co_do_pdiscard(BlockBackend *blk, int64_t offset, int64_t bytes);
> >
> >   int coroutine_fn blk_co_do_flush(BlockBackend *blk);
> > +int coroutine_fn blk_co_zone_report(BlockBackend *blk, int64_t offset,
> > +int64_t len, int64_t *nr_zones,
> > +BlockZoneDescriptor *zones);
> > +int coroutine_fn blk_co_zone_mgmt(BlockBackend *blk, enum zone_op op,
> > +int64_t offset, int64_t len);
> >
> >
> >   /*
> > diff --git a/block/file-posix.c b/block/file-posix.c
> > index 48cd096624..1b8b0d351f 100644
> > --- a/block/file-posix.c
> > +++ b/block/file-posix.c
> > @@ -67,6 +67,7 @@
> >   #include 
> >   #include 
> >   #include 
> > +#include 
> >   #include 
> >   #include 
> >   #include 
> > @@ -216,6 +217,11 @@ typedef struct RawPosixAIOData {
> >   PreallocMode prealloc;
> >   Error **errp;
> >   } truncate;
> > +struct {
> > +int64_t *nr_zones;
> > +BlockZoneDescriptor *zones;
> > +} zone_report;
> > +zone_op op;
> >   };
> >   } RawPosixAIOData;
> >
> > @@ -1801,6 +1807,135 @@ static off_t copy_file_range(int in_fd, off_t 
> > *in_off, int out_fd,
> >   }
> >   #endif
> >
> > +/*
> > + * parse_zone - Fill a zone descriptor
> > + */
> > +static inline void parse_zone(struct BlockZoneDescriptor *zone,
> > +  struct blk_zone *blkz) {
> > +zone->start = blkz->start;
> > +zone->length = blkz->len;
> > +zone->cap = blkz->capacity;
> > +zone->wp = blkz->wp - blkz->start;
> > +zone->type = blkz->type;
> > +zone->cond = blkz->cond;
> > +}
> > +
> > +static int handle_aiocb_zone_report(void *opaque) {
> > +RawPosixAIOData *aiocb = opaque;
> > +int fd = aiocb->aio_fildes;
> > +int64_t *nr_zones = aiocb->zone_report.nr_zones;
> > +BlockZoneDescriptor *zones = aiocb->zone_report.zones;
> > +int64_t offset = aiocb->aio_offset;
> > +int64_t len = aiocb->aio_nbytes;
> > +
> > +struct blk_zone *blkz;
> > +int64_t rep_size, nrz;
> > +int ret, n = 0, i = 0;
> > +
> 

Re: [RFC v3 5/5] qemu-iotests: add zone operation tests.

2022-06-27 Thread Hannes Reinecke

On 6/27/22 02:19, Sam Li wrote:

---
  tests/qemu-iotests/tests/zoned.sh | 49 +++
  1 file changed, 49 insertions(+)
  create mode 100755 tests/qemu-iotests/tests/zoned.sh

diff --git a/tests/qemu-iotests/tests/zoned.sh 
b/tests/qemu-iotests/tests/zoned.sh
new file mode 100755
index 00..262c0b5427
--- /dev/null
+++ b/tests/qemu-iotests/tests/zoned.sh
@@ -0,0 +1,49 @@
+#!/usr/bin/env bash
+#
+# Test zone management operations.
+#
+
+QEMU_IO="build/qemu-io"
+IMG="--image-opts driver=zoned_host_device,filename=/dev/nullb0"
+QEMU_IO_OPTIONS=$QEMU_IO_OPTIONS_NO_FMT
+
+echo "Testing a null_blk device"
+echo "Simple cases: if the operations work"
+sudo modprobe null_blk nr_devices=1 zoned=1
+# hidden issues:
+# 1. memory allocation error of "unaligned tcache chunk detected" when the 
nr_zone=1 in zone report
+# 2. qemu-io: after report 10 zones, the program failed at double free error 
and exited.
+echo "report the first zone"
+sudo $QEMU_IO $IMG -c "zone_report 0 0 1"
+echo "report: the first 10 zones"
+sudo $QEMU_IO $IMG -c "zone_report 0 0 10"
+
+echo "open the first zone"
+sudo $QEMU_IO $IMG -c "zone_open 0 0x8"
+echo "report after:"
+sudo $QEMU_IO $IMG -c "zone_report 0 0 1"
+echo "open the last zone"
+sudo $QEMU_IO $IMG -c "zone_open 0x3e7000 0x8"
+echo "report after:"
+sudo $QEMU_IO $IMG -c "zone_report 0x3e7000 0 2"
+
+echo "close the first zone"
+sudo $QEMU_IO $IMG -c "zone_close 0 0x8"
+echo "report after:"
+sudo $QEMU_IO $IMG -c "zone_report 0 0 1"
+echo "close the last zone"
+sudo $QEMU_IO $IMG -c "zone_close 0x3e7000 0x8"
+echo "report after:"
+sudo $QEMU_IO $IMG -c "zone_report 0x3e7000 0 2"
+
+
+echo "reset the second zone"
+sudo $QEMU_IO $IMG -c "zone_reset 0x8 0x8"
+echo "After resetting a zone:"
+sudo $QEMU_IO $IMG -c "zone_report 0x8 0 5"
+
+# success, all done
+sudo rmmod null_blk
+echo "*** done"
+#rm -f $seq.full
+status=0


Caveat: I'm not that familiar with qemu-iotests.
FWIW:

Reviewed-by: Hannes Reinecke 

Cheers,

Hannes
--
Dr. Hannes ReineckeKernel Storage Architect
h...@suse.de  +49 911 74053 688
SUSE Software Solutions Germany GmbH, Maxfeldstr. 5, 90409 Nürnberg
HRB 36809 (AG Nürnberg), GF: Felix Imendörffer



Re: [RFC v3 4/5] file-posix: introduce get_sysfs_str_val for device zoned model.

2022-06-27 Thread Hannes Reinecke

On 6/27/22 02:19, Sam Li wrote:

---
  block/file-posix.c   | 60 
  include/block/block-common.h |  4 +--
  2 files changed, 62 insertions(+), 2 deletions(-)

diff --git a/block/file-posix.c b/block/file-posix.c
index 73c2cdfbca..74c0245e0f 100644
--- a/block/file-posix.c
+++ b/block/file-posix.c
@@ -1277,6 +1277,66 @@ out:
  #endif
  }
  
+/*

+ * Convert the zoned attribute file in sysfs to internal value.
+ */
+static zone_model get_sysfs_str_val(int fd, struct stat *st) {
+#ifdef CONFIG_LINUX
+char buf[32];
+char *sysfspath = NULL;
+int ret;
+int sysfd = -1;
+
+if (S_ISCHR(st->st_mode)) {
+if (ioctl(fd, SG_GET_SG_TABLESIZE, ) == 0) {
+return ret;
+}
+return -ENOTSUP;
+}
+
+if (!S_ISBLK(st->st_mode)) {
+return -ENOTSUP;
+}
+
+sysfspath = g_strdup_printf("/sys/dev/block/%u:%u/queue/zoned",
+major(st->st_rdev), minor(st->st_rdev));
+sysfd = open(sysfspath, O_RDONLY);
+if (sysfd == -1) {
+ret = -errno;
+goto out;
+}
+do {
+ret = read(sysfd, buf, sizeof(buf) - 1);
+} while (ret == -1 && errno == EINTR);


This is wrong.
read() might return a value smaller than the 'len' argument (sizeof(buf) 
-1 in your case). But in that case it's a short read, and one need to 
call 'read()' again to fetch the remaining bytes.


So the correct code would be something like:

offset = 0;
do {
ret = read(sysfd, buf + offset, sizeof(buf) - 1 + offset);
if (ret > 0)
offset += ret;
} while (ret > 0);

Not that you'd actually need it; reads from sysfs are basically never 
interrupted, so you should be able to read from an attribute in one go.

IE alternatively you can drop the 'while' loop and just call read().


+if (ret < 0) {
+ret = -errno;
+goto out;
+} else if (ret == 0) {
+ret = -EIO;
+goto out;
+}
+buf[ret] = 0;
+
+/* The file is ended with '\n' */


I'd rather check if the string ends with an '\n', and overwrite
it with a '\0'. That way you'd be insulated against any changes
to sysfs.


+if (strcmp(buf, "host-managed\n") == 0) {
+return BLK_Z_HM;
+} else if (strcmp(buf, "host-aware\n") == 0) {
+return BLK_Z_HA;
+} else {
+return -ENOTSUP;
+}
+
+out:
+if (sysfd != -1) {
+close(sysfd);
+}
+g_free(sysfspath);
+return ret;
+#else
+return -ENOTSUP;
+#endif
+}
+
  static int hdev_get_max_segments(int fd, struct stat *st) {
  int ret;
  ret = get_sysfs_long_val(fd, st, "max_segments");


And as you already set a precedent in your previous patch, I'd recommend 
split this in two patches, one introducing a generic function 
'get_sysfs_str_val()' which returns a string and another function
(eg hdev_get_zone_model()) which calls this function to fetch the device 
zoned model.



diff --git a/include/block/block-common.h b/include/block/block-common.h
index 78cddeeda5..35e00afe8e 100644
--- a/include/block/block-common.h
+++ b/include/block/block-common.h
@@ -56,8 +56,8 @@ typedef enum zone_op {
  } zone_op;
  
  typedef enum zone_model {

-BLK_Z_HM,
-BLK_Z_HA,
+BLK_Z_HM = 0x1,
+BLK_Z_HA = 0x2,
  } zone_model;
  
  typedef enum BlkZoneCondition {


Cheers,

Hannes
--
Dr. Hannes ReineckeKernel Storage Architect
h...@suse.de  +49 911 74053 688
SUSE Software Solutions Germany GmbH, Maxfeldstr. 5, 90409 Nürnberg
HRB 36809 (AG Nürnberg), GF: Felix Imendörffer



Re: [RFC v3 2/5] qemu-io: add zoned block device operations.

2022-06-27 Thread Hannes Reinecke

On 6/27/22 02:19, Sam Li wrote:

---


Good coding style would advise to add some text here what the patch does.


  block/io.c   |  21 +++
  include/block/block-io.h |  13 +
  qemu-io-cmds.c   | 121 +++
  3 files changed, 155 insertions(+)

diff --git a/block/io.c b/block/io.c
index 789e6373d5..656a1b7271 100644
--- a/block/io.c
+++ b/block/io.c
@@ -3258,6 +3258,27 @@ out:
  return co.ret;
  }
  
+int bdrv_co_zone_report(BlockDriverState *bs, int64_t offset,

+int64_t len, int64_t *nr_zones,
+BlockZoneDescriptor *zones)
+{
+if (!bs->drv->bdrv_co_zone_report) {
+return -ENOTSUP;


ENOTSUP or EOPNOTSUP?
Kevin?


+}
+
+return bs->drv->bdrv_co_zone_report(bs, offset, len, nr_zones, zones);
+}
+
+int bdrv_co_zone_mgmt(BlockDriverState *bs, enum zone_op op,
+int64_t offset, int64_t len)
+{
+if (!bs->drv->bdrv_co_zone_mgmt) {
+return -ENOTSUP;
+}
+
+return bs->drv->bdrv_co_zone_mgmt(bs, op, offset, len);
+}
+
  void *qemu_blockalign(BlockDriverState *bs, size_t size)
  {
  IO_CODE();
diff --git a/include/block/block-io.h b/include/block/block-io.h
index 62c84f0519..c85c174579 100644
--- a/include/block/block-io.h
+++ b/include/block/block-io.h
@@ -80,6 +80,13 @@ int bdrv_co_ioctl(BlockDriverState *bs, int req, void *buf);
  /* Ensure contents are flushed to disk.  */
  int coroutine_fn bdrv_co_flush(BlockDriverState *bs);
  
+/* Report zone information of zone block device. */

+int coroutine_fn bdrv_co_zone_report(BlockDriverState *bs, int64_t offset,
+ int64_t len, int64_t *nr_zones,
+ BlockZoneDescriptor *zones);
+int coroutine_fn bdrv_co_zone_mgmt(BlockDriverState *bs, zone_op op,
+int64_t offset, int64_t len);
+


There's the thing with the intendation ... please make it consistent, 
and ideally follow with whatever the remaining prototypes do.



  int bdrv_co_pdiscard(BdrvChild *child, int64_t offset, int64_t bytes);
  bool bdrv_can_write_zeroes_with_unmap(BlockDriverState *bs);
  int bdrv_block_status(BlockDriverState *bs, int64_t offset,
@@ -290,6 +297,12 @@ bdrv_readv_vmstate(BlockDriverState *bs, QEMUIOVector 
*qiov, int64_t pos);
  int generated_co_wrapper
  bdrv_writev_vmstate(BlockDriverState *bs, QEMUIOVector *qiov, int64_t pos);
  
+int generated_co_wrapper blk_zone_report(BlockBackend *blk, int64_t offset,

+ int64_t len, int64_t *nr_zones,
+ BlockZoneDescriptor *zones);
+int generated_co_wrapper blk_zone_mgmt(BlockBackend *blk, enum zone_op op,
+int64_t offset, int64_t len);
+


Again here.


  /**
   * bdrv_parent_drained_begin_single:
   *
diff --git a/qemu-io-cmds.c b/qemu-io-cmds.c
index 2f0d8ac25a..3f2592b9f5 100644
--- a/qemu-io-cmds.c
+++ b/qemu-io-cmds.c
@@ -1706,6 +1706,122 @@ static const cmdinfo_t flush_cmd = {
  .oneline= "flush all in-core file state to disk",
  };
  
+static int zone_report_f(BlockBackend *blk, int argc, char **argv)

+{
+int ret;
+int64_t offset, len, nr_zones;
+int i = 0;
+
+++optind;
+offset = cvtnum(argv[optind]);
+++optind;
+len = cvtnum(argv[optind]);
+++optind;
+nr_zones = cvtnum(argv[optind]);
+

And 'optind' is set where?
Plus do check for 'argv' overflow; before increasing 'optind' and using 
'argv[optind]' you have to validate that 'argv[optind]' is a valid pointer.



+g_autofree BlockZoneDescriptor *zones = g_new(BlockZoneDescriptor, 
nr_zones);
+ret = blk_zone_report(blk, offset, len, _zones, zones);
+while (i < nr_zones) {
+fprintf(stdout, "start: 0x%lx, len 0x%lx, cap 0x%lx, wptr 0x%lx, "
+"zcond:%u, [type: %u]\n",
+zones[i].start, zones[i].length, zones[i].cap, zones[i].wp,
+zones[i].cond, zones[i].type);
+++i;

As 'i' is a simple iterator maybe use a 'for' loop here.
But that really is a matter of preference :-)


+}
+return ret;
+}
+
+static const cmdinfo_t zone_report_cmd = {
+.name = "zone_report",
+.altname = "f",


altname 'f'?
Is that correct?


+.cfunc = zone_report_f,
+.argmin = 3,
+.argmax = 3,
+.args = "offset [offset..] len [len..] number [num..]",
+.oneline = "report a number of zones",
+};
+
+static int zone_open_f(BlockBackend *blk, int argc, char **argv)
+{
+int64_t offset, len;
+++optind;
+offset = cvtnum(argv[optind]);
+++optind;
+len = cvtnum(argv[optind]);


Same here: please check for 'argv' overflow.


+return blk_zone_mgmt(blk, zone_open, offset, len);
+}
+
+static const cmdinfo_t zone_open_cmd = {
+.name = "zone_open",
+.altname = "f",


Same here; shouldn't 'altname' be different for each function?
'zo', maybe?


+.cfunc = zone_open_f,
+.argmin = 

Re: [RFC v3 3/5] file-posix: introduce get_sysfs_long_val for zoned device information.

2022-06-27 Thread Hannes Reinecke

On 6/27/22 02:19, Sam Li wrote:

Use sysfs attribute files to get the zoned device information in case
that ioctl() commands of zone management interface won't work. It can
return long type of value like chunk_sectors, zoned_append_max_bytes,
max_open_zones, max_active_zones.
---
  block/file-posix.c | 37 +
  1 file changed, 25 insertions(+), 12 deletions(-)

diff --git a/block/file-posix.c b/block/file-posix.c
index 1b8b0d351f..73c2cdfbca 100644
--- a/block/file-posix.c
+++ b/block/file-posix.c
@@ -1216,15 +1216,19 @@ static int hdev_get_max_hw_transfer(int fd, struct stat 
*st)
  #endif
  }
  
-static int hdev_get_max_segments(int fd, struct stat *st)

-{
+/*
+ * Get zoned device information (chunk_sectors, zoned_append_max_bytes,
+ * max_open_zones, max_active_zones) through sysfs attribute files.
+ */
+static long get_sysfs_long_val(int fd, struct stat *st,
+   const char *attribute) {
  #ifdef CONFIG_LINUX
  char buf[32];
  const char *end;
  char *sysfspath = NULL;
  int ret;
  int sysfd = -1;
-long max_segments;
+long val;
  
  if (S_ISCHR(st->st_mode)) {

  if (ioctl(fd, SG_GET_SG_TABLESIZE, ) == 0) {
@@ -1237,8 +1241,9 @@ static int hdev_get_max_segments(int fd, struct stat *st)
  return -ENOTSUP;
  }
  
-sysfspath = g_strdup_printf("/sys/dev/block/%u:%u/queue/max_segments",

-major(st->st_rdev), minor(st->st_rdev));
+sysfspath = g_strdup_printf("/sys/dev/block/%u:%u/queue/%s",
+major(st->st_rdev), minor(st->st_rdev),
+attribute);
  sysfd = open(sysfspath, O_RDONLY);
  if (sysfd == -1) {
  ret = -errno;
@@ -1256,9 +1261,9 @@ static int hdev_get_max_segments(int fd, struct stat *st)
  }
  buf[ret] = 0;
  /* The file is ended with '\n', pass 'end' to accept that. */
-ret = qemu_strtol(buf, , 10, _segments);
+ret = qemu_strtol(buf, , 10, );
  if (ret == 0 && end && *end == '\n') {
-ret = max_segments;
+ret = val;
  }
  
  out:

@@ -1272,6 +1277,15 @@ out:
  #endif
  }
  
+static int hdev_get_max_segments(int fd, struct stat *st) {

+int ret;
+ret = get_sysfs_long_val(fd, st, "max_segments");
+if (ret < 0) {
+return -1;
+}
+return ret;
+}
+
  static void raw_refresh_limits(BlockDriverState *bs, Error **errp)
  {
  BDRVRawState *s = bs->opaque;
@@ -1872,6 +1886,7 @@ static int handle_aiocb_zone_report(void *opaque) {
  
  static int handle_aiocb_zone_mgmt(void *opaque) {

  RawPosixAIOData *aiocb = opaque;
+BlockDriverState *s = aiocb->bs;
  int fd = aiocb->aio_fildes;
  int64_t offset = aiocb->aio_offset;
  int64_t len = aiocb->aio_nbytes;
@@ -1884,11 +1899,9 @@ static int handle_aiocb_zone_mgmt(void *opaque) {
  int64_t zone_size_mask;
  int ret;
  
-ret = ioctl(fd, BLKGETZONESZ, _size);

-if (ret) {
-return -1;
-}
-
+g_autofree struct stat *file = g_new(struct stat, 1);
+stat(s->filename, file);
+zone_size = get_sysfs_long_val(fd, file, "chunk_sectors");
  zone_size_mask = zone_size - 1;
  if (offset & zone_size_mask) {
  error_report("offset is not the start of a zone");


Round of applause.

Reviewed-by: Hannes Reinecke 

Cheers,

Hannes
--
Dr. Hannes ReineckeKernel Storage Architect
h...@suse.de  +49 911 74053 688
SUSE Software Solutions Germany GmbH, Maxfeldstr. 5, 90409 Nürnberg
HRB 36809 (AG Nürnberg), GF: Felix Imendörffer



Re: [RFC v3 1/5] block: add block layer APIs resembling Linux ZonedBlockDevice ioctls.

2022-06-27 Thread Hannes Reinecke

On 6/27/22 02:19, Sam Li wrote:

By adding zone management operations in BlockDriver, storage
controller emulation can use the new block layer APIs including
zone_report and zone_mgmt(open, close, finish, reset).
---
  block/block-backend.c|  56 
  block/coroutines.h   |   5 +
  block/file-posix.c   | 238 +++
  include/block/block-common.h |  43 +-
  include/block/block_int-common.h |  20 +++
  5 files changed, 361 insertions(+), 1 deletion(-)

diff --git a/block/block-backend.c b/block/block-backend.c
index e0e1aff4b1..786f964d02 100644
--- a/block/block-backend.c
+++ b/block/block-backend.c
@@ -1810,6 +1810,62 @@ int blk_flush(BlockBackend *blk)
  return ret;
  }
  
+/*

+ * Return zone_report from BlockDriver. Offset can be any number within
+ * the zone size. No alignment for offset and len.
+ */
+int coroutine_fn blk_co_zone_report(BlockBackend *blk, int64_t offset,
+   int64_t len, int64_t *nr_zones,
+   BlockZoneDescriptor *zones)
+{
+int ret;
+BlockDriverState *bs;
+IO_CODE();
+
+blk_inc_in_flight(blk); /* increase before waiting */
+blk_wait_while_drained(blk);
+bs = blk_bs(blk);
+
+ret = blk_check_byte_request(blk, offset, len);
+if (ret < 0) {
+return ret;
+}
+
+bdrv_inc_in_flight(bs);
+ret = bdrv_co_zone_report(blk->root->bs, offset, len,
+  nr_zones, zones);
+bdrv_dec_in_flight(bs);
+blk_dec_in_flight(blk);
+return ret;
+}
+
+/*
+ * Return zone_mgmt from BlockDriver.
+ * Offset is the start of a zone and len is aligned to zones.
+ */
+int coroutine_fn blk_co_zone_mgmt(BlockBackend *blk, enum zone_op op,
+int64_t offset, int64_t len)
+{
+int ret;
+BlockDriverState *bs;
+IO_CODE();
+
+blk_inc_in_flight(blk);
+blk_wait_while_drained(blk);
+bs = blk_bs(blk);
+
+ret = blk_check_byte_request(blk, offset, len);
+if (ret < 0) {
+return ret;
+}
+
+bdrv_inc_in_flight(bs);
+ret = bdrv_co_zone_mgmt(blk->root->bs, op, offset, len);
+bdrv_dec_in_flight(bs);
+blk_dec_in_flight(blk);
+return ret;
+}
+
  void blk_drain(BlockBackend *blk)
  {
  BlockDriverState *bs = blk_bs(blk);
diff --git a/block/coroutines.h b/block/coroutines.h
index 830ecaa733..a114d7bc30 100644
--- a/block/coroutines.h
+++ b/block/coroutines.h
@@ -80,6 +80,11 @@ int coroutine_fn
  blk_co_do_pdiscard(BlockBackend *blk, int64_t offset, int64_t bytes);
  
  int coroutine_fn blk_co_do_flush(BlockBackend *blk);

+int coroutine_fn blk_co_zone_report(BlockBackend *blk, int64_t offset,
+int64_t len, int64_t *nr_zones,
+BlockZoneDescriptor *zones);
+int coroutine_fn blk_co_zone_mgmt(BlockBackend *blk, enum zone_op op,
+int64_t offset, int64_t len);
  
  
  /*

diff --git a/block/file-posix.c b/block/file-posix.c
index 48cd096624..1b8b0d351f 100644
--- a/block/file-posix.c
+++ b/block/file-posix.c
@@ -67,6 +67,7 @@
  #include 
  #include 
  #include 
+#include 
  #include 
  #include 
  #include 
@@ -216,6 +217,11 @@ typedef struct RawPosixAIOData {
  PreallocMode prealloc;
  Error **errp;
  } truncate;
+struct {
+int64_t *nr_zones;
+BlockZoneDescriptor *zones;
+} zone_report;
+zone_op op;
  };
  } RawPosixAIOData;
  
@@ -1801,6 +1807,135 @@ static off_t copy_file_range(int in_fd, off_t *in_off, int out_fd,

  }
  #endif
  
+/*

+ * parse_zone - Fill a zone descriptor
+ */
+static inline void parse_zone(struct BlockZoneDescriptor *zone,
+  struct blk_zone *blkz) {
+zone->start = blkz->start;
+zone->length = blkz->len;
+zone->cap = blkz->capacity;
+zone->wp = blkz->wp - blkz->start;
+zone->type = blkz->type;
+zone->cond = blkz->cond;
+}
+
+static int handle_aiocb_zone_report(void *opaque) {
+RawPosixAIOData *aiocb = opaque;
+int fd = aiocb->aio_fildes;
+int64_t *nr_zones = aiocb->zone_report.nr_zones;
+BlockZoneDescriptor *zones = aiocb->zone_report.zones;
+int64_t offset = aiocb->aio_offset;
+int64_t len = aiocb->aio_nbytes;
+
+struct blk_zone *blkz;
+int64_t rep_size, nrz;
+int ret, n = 0, i = 0;
+
+nrz = *nr_zones;
+if (len == -1) {
+return -errno;
+}
+rep_size = sizeof(struct blk_zone_report) + nrz * sizeof(struct blk_zone);
+g_autofree struct blk_zone_report *rep = g_new(struct blk_zone_report, 
nrz);
+offset = offset / 512; /* get the unit of the start sector: sector size is 
512 bytes. */
+printf("start to report zone with offset: 0x%lx\n", offset);
+
+blkz = (struct blk_zone *)(rep + 1);
+while (n < nrz) {
+memset(rep, 0, rep_size);
+rep->sector = offset;
+rep->nr_zones = nrz;
+
+ret = ioctl(fd, BLKREPORTZONE, rep);
+if 

[RFC PATCH 04/10] hw/ide/piix: Avoid the isabus global when wiring ISA interrupts for internal devices

2022-06-27 Thread Bernhard Beschow
isa_get_irq() currently always uses the "isabus" global to get the
desired qemu_irq. In order to resolve this global, we want
isa_get_irq() to determine the ISABus from its *dev parameter using
isa_bus_from_device(). As a preparation, all callers who pass NULL
as *dev need to be resolved which seems to happen in hw/ide/piix only.

This patch roughly implements the solution outlined in https://
lists.nongnu.org/archive/html/qemu-devel/2020-03/msg01707.html.

In oder to address the PIIX IDE functions being user-creatable, (see
https://lists.nongnu.org/archive/html/qemu-devel/2021-04/msg05655.html) a
backwards compatibility quirk is introduced which can be removed after some
deprecation period. The quirk consists of internal devices to opt into new
behavior where the ISA interrupt wiring is performed by the caller rather
than by the device itself. The opt-in can be performed by:

qdev_prop_set_bit(DEVICE(dev), "user-created", false);

RFC: qdev_init_gpio_in() seems to expose interrupts internal to the device.
Can this be fixed?

Signed-off-by: Bernhard Beschow 
---
 hw/i386/pc_piix.c|  3 +++
 hw/ide/piix.c| 41 +++--
 hw/isa/piix4.c   |  3 +++
 include/hw/ide/pci.h |  2 ++
 4 files changed, 43 insertions(+), 6 deletions(-)

diff --git a/hw/i386/pc_piix.c b/hw/i386/pc_piix.c
index 2d146b19c0..e24fbab334 100644
--- a/hw/i386/pc_piix.c
+++ b/hw/i386/pc_piix.c
@@ -253,8 +253,11 @@ static void pc_init1(MachineState *machine,
 PCIDevice *dev;
 
 dev = pci_new_multifunction(piix3_devfn + 1, false, "piix3-ide");
+qdev_prop_set_bit(DEVICE(dev), "user-created", false);
 pci_realize_and_unref(dev, pci_bus, _fatal);
 pci_ide_create_devs(dev);
+qdev_connect_gpio_out(DEVICE(dev), 0, x86ms->gsi[14]);
+qdev_connect_gpio_out(DEVICE(dev), 1, x86ms->gsi[15]);
 idebus[0] = qdev_get_child_bus(>qdev, "ide.0");
 idebus[1] = qdev_get_child_bus(>qdev, "ide.1");
 pc_cmos_init(pcms, idebus[0], idebus[1], rtc_state);
diff --git a/hw/ide/piix.c b/hw/ide/piix.c
index 21777ecc8b..fbf2756b47 100644
--- a/hw/ide/piix.c
+++ b/hw/ide/piix.c
@@ -103,6 +103,13 @@ static void bmdma_setup_bar(PCIIDEState *d)
 }
 }
 
+static void piix_ide_set_irq(void *opaque, int n, int level)
+{
+PCIIDEState *d = opaque;
+
+qemu_set_irq(d->isa_irqs[n], level);
+}
+
 static void piix_ide_reset(DeviceState *dev)
 {
 PCIIDEState *d = PCI_IDE(dev);
@@ -129,14 +136,14 @@ static int pci_piix_init_ports(PCIIDEState *d)
 static const struct {
 int iobase;
 int iobase2;
-int isairq;
 } port_info[] = {
-{0x1f0, 0x3f6, 14},
-{0x170, 0x376, 15},
+{0x1f0, 0x3f6},
+{0x170, 0x376},
 };
+DeviceState *dev = DEVICE(d);
 int i;
 
-{
+if (d->user_created) {
 ISABus *isa_bus;
 bool ambiguous;
 
@@ -145,13 +152,18 @@ static int pci_piix_init_ports(PCIIDEState *d)
 if (!isa_bus || ambiguous) {
 return -ENODEV;
 }
+
+d->isa_irqs[0] = isa_bus->irqs[14];
+d->isa_irqs[1] = isa_bus->irqs[15];
+} else {
+qdev_init_gpio_out(dev, d->isa_irqs, 2);
 }
 
 for (i = 0; i < 2; i++) {
-ide_bus_init(>bus[i], sizeof(d->bus[i]), DEVICE(d), i, 2);
+ide_bus_init(>bus[i], sizeof(d->bus[i]), dev, i, 2);
 ide_init_ioport(>bus[i], NULL, port_info[i].iobase,
 port_info[i].iobase2);
-ide_init2(>bus[i], isa_get_irq(NULL, port_info[i].isairq));
+ide_init2(>bus[i], qdev_get_gpio_in(dev, i));
 
 bmdma_init(>bus[i], >bmdma[i], d);
 d->bmdma[i].bus = >bus[i];
@@ -181,6 +193,14 @@ static void pci_piix_ide_realize(PCIDevice *dev, Error 
**errp)
 }
 }
 
+static void pci_piix_ide_init(Object *obj)
+{
+PCIIDEState *d = PCI_IDE(obj);
+DeviceState *dev = DEVICE(d);
+
+qdev_init_gpio_in(dev, piix_ide_set_irq, 2);
+}
+
 static void pci_piix_ide_exitfn(PCIDevice *dev)
 {
 PCIIDEState *d = PCI_IDE(dev);
@@ -192,6 +212,11 @@ static void pci_piix_ide_exitfn(PCIDevice *dev)
 }
 }
 
+static Property piix_ide_properties[] = {
+DEFINE_PROP_BOOL("user-created", PCIIDEState, user_created, true),
+DEFINE_PROP_END_OF_LIST(),
+};
+
 /* NOTE: for the PIIX3, the IRQs and IOports are hardcoded */
 static void piix3_ide_class_init(ObjectClass *klass, void *data)
 {
@@ -206,11 +231,13 @@ static void piix3_ide_class_init(ObjectClass *klass, void 
*data)
 k->class_id = PCI_CLASS_STORAGE_IDE;
 set_bit(DEVICE_CATEGORY_STORAGE, dc->categories);
 dc->hotpluggable = false;
+device_class_set_props(dc, piix_ide_properties);
 }
 
 static const TypeInfo piix3_ide_info = {
 .name  = "piix3-ide",
 .parent= TYPE_PCI_IDE,
+.instance_init = pci_piix_ide_init,
 .class_init= piix3_ide_class_init,
 };
 
@@ -228,11 +255,13 @@ static void piix4_ide_class_init(ObjectClass *klass, void 
*data)
  

[RFC PATCH 05/10] hw/isa/isa-bus: assert() if isa_get_irq() gets passed a NULL ISADevice

2022-06-27 Thread Bernhard Beschow
Now that all call-sites have been fixed to pass non-NULL ISADevices, we can
assert() on NULL ISADevices to catch regressions.

Signed-off-by: Bernhard Beschow 
---
 hw/isa/isa-bus.c | 5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)

diff --git a/hw/isa/isa-bus.c b/hw/isa/isa-bus.c
index 0537a9f2c1..9e8b5da027 100644
--- a/hw/isa/isa-bus.c
+++ b/hw/isa/isa-bus.c
@@ -81,9 +81,10 @@ void isa_bus_irqs(ISABus *bus, qemu_irq *irqs)
  */
 qemu_irq isa_get_irq(ISADevice *dev, unsigned isairq)
 {
-assert(!dev || ISA_BUS(qdev_get_parent_bus(DEVICE(dev))) == isabus);
+assert(dev);
 assert(isairq < ISA_NUM_IRQS);
-return isabus->irqs[isairq];
+
+return isa_bus_from_device(dev)->irqs[isairq];
 }
 
 void isa_connect_gpio_out(ISADevice *isadev, int gpioirq, unsigned isairq)
-- 
2.36.1




[RFC PATCH 07/10] hw/pci/pci: Introduce pci_register_portio_list()

2022-06-27 Thread Bernhard Beschow
pci_ide_init_ioport() and pci_register_portio_list() are introduced which
mirror their ISA counterparts. But rather than asking for an ISADevice, the
functions ask for PCIDevice which can be used in hw/ide/piix which fixes
having to pass a NULL ISADevice which is not avialable there.

Passing NULL as ISADevice to pci_ide_init_ioport() also causes a NULL
ISADevice to be passed to isa_register_ioport(). Currently this function
always uses the isabus global. To fix this, we'll want to determine the
ISABus using isa_bus_from_device(), so no call-site must pass a NULL
ISADevice.

Signed-off-by: Bernhard Beschow 
---
 hw/ide/ioport.c   | 14 ++
 hw/pci/pci.c  | 18 ++
 include/hw/ide/internal.h |  1 +
 include/hw/pci/pci.h  | 21 +
 4 files changed, 54 insertions(+)

diff --git a/hw/ide/ioport.c b/hw/ide/ioport.c
index ed1f34f573..69e4fa15d4 100644
--- a/hw/ide/ioport.c
+++ b/hw/ide/ioport.c
@@ -25,6 +25,7 @@
 
 #include "qemu/osdep.h"
 #include "hw/isa/isa.h"
+#include "hw/pci/pci_bus.h"
 #include "qemu/error-report.h"
 #include "qemu/timer.h"
 #include "sysemu/blockdev.h"
@@ -62,3 +63,16 @@ void isa_ide_init_ioport(IDEBus *bus, ISADevice *dev, int 
iobase, int iobase2)
  iobase2, ide_portio2_list, bus, "ide");
 }
 }
+
+void pci_ide_init_ioport(IDEBus *bus, PCIDevice *dev, int iobase, int iobase2)
+{
+assert(dev);
+
+pci_register_portio_list(dev, >portio_list,
+ iobase, ide_portio_list, bus, "ide");
+
+if (iobase2) {
+pci_register_portio_list(dev, >portio2_list,
+ iobase2, ide_portio2_list, bus, "ide");
+}
+}
diff --git a/hw/pci/pci.c b/hw/pci/pci.c
index 2f450f6a72..3046dd5477 100644
--- a/hw/pci/pci.c
+++ b/hw/pci/pci.c
@@ -1440,6 +1440,24 @@ pcibus_t pci_bar_address(PCIDevice *d,
 return new_addr;
 }
 
+void pci_register_portio_list(PCIDevice *dev,
+  PortioList *piolist, uint16_t start,
+  const MemoryRegionPortio *pio_start,
+  void *opaque, const char *name)
+{
+PCIBus *bus;
+
+assert(dev);
+assert(piolist && !piolist->owner);
+
+bus = pci_get_bus(dev);
+
+assert(bus);
+
+portio_list_init(piolist, OBJECT(dev), pio_start, opaque, name);
+portio_list_add(piolist, bus->address_space_io, start);
+}
+
 static void pci_update_mappings(PCIDevice *d)
 {
 PCIIORegion *r;
diff --git a/include/hw/ide/internal.h b/include/hw/ide/internal.h
index 86ecc04ce4..4a375d3c09 100644
--- a/include/hw/ide/internal.h
+++ b/include/hw/ide/internal.h
@@ -625,6 +625,7 @@ int ide_init_drive(IDEState *s, BlockBackend *blk, 
IDEDriveKind kind,
 void ide_init2(IDEBus *bus, qemu_irq irq);
 void ide_exit(IDEState *s);
 void isa_ide_init_ioport(IDEBus *bus, ISADevice *isa, int iobase, int iobase2);
+void pci_ide_init_ioport(IDEBus *bus, PCIDevice *isa, int iobase, int iobase2);
 void ide_register_restart_cb(IDEBus *bus);
 
 void ide_exec_cmd(IDEBus *bus, uint32_t val);
diff --git a/include/hw/pci/pci.h b/include/hw/pci/pci.h
index b54b6ef88f..91b479d542 100644
--- a/include/hw/pci/pci.h
+++ b/include/hw/pci/pci.h
@@ -522,6 +522,27 @@ void pci_setup_iommu(PCIBus *bus, PCIIOMMUFunc fn, void 
*opaque);
 pcibus_t pci_bar_address(PCIDevice *d,
  int reg, uint8_t type, pcibus_t size);
 
+/**
+ * pci_register_portio_list: Initialize a set of io ports
+ *
+ * Several ISA devices have many dis-joint I/O ports.  Worse, these I/O
+ * ports can be interleaved with I/O ports from other devices.  This
+ * function makes it easy to create multiple MemoryRegions for a single
+ * device and use the legacy portio routines.
+ *
+ * @dev: the PCIDevice against which these are registered
+ * @piolist: the PortioList associated with the io ports
+ * @start: the base I/O port against which the portio->offset is applied.
+ * @portio: the ports, sorted by offset.
+ * @opaque: passed into the portio callbacks.
+ * @name: passed into memory_region_init_io.
+ */
+void pci_register_portio_list(PCIDevice *dev,
+  PortioList *piolist,
+  uint16_t start,
+  const MemoryRegionPortio *portio,
+  void *opaque, const char *name);
+
 static inline void
 pci_set_byte(uint8_t *config, uint8_t val)
 {
-- 
2.36.1




[RFC PATCH 10/10] hw/isa/isa-bus: Resolve isabus global

2022-06-27 Thread Bernhard Beschow
Now that only isa_bus_new() accesses the isabus global it can be removed
assuming that all call sites take care of not passing the same address
spaces twice to different isa_bus_new() invocations.

Signed-off-by: Bernhard Beschow 
---
 hw/isa/isa-bus.c | 8 ++--
 1 file changed, 2 insertions(+), 6 deletions(-)

diff --git a/hw/isa/isa-bus.c b/hw/isa/isa-bus.c
index 5518db93cd..783506685d 100644
--- a/hw/isa/isa-bus.c
+++ b/hw/isa/isa-bus.c
@@ -26,8 +26,6 @@
 #include "hw/isa/isa.h"
 #include "hw/acpi/acpi_aml_interface.h"
 
-static ISABus *isabus;
-
 static char *isabus_get_fw_dev_path(DeviceState *dev);
 
 static void isa_bus_class_init(ObjectClass *klass, void *data)
@@ -53,10 +51,8 @@ static const TypeInfo isa_bus_info = {
 ISABus *isa_bus_new(DeviceState *dev, MemoryRegion* address_space,
 MemoryRegion *address_space_io, Error **errp)
 {
-if (isabus) {
-error_setg(errp, "Can't create a second ISA bus");
-return NULL;
-}
+ISABus *isabus;
+
 if (!dev) {
 dev = qdev_new("isabus-bridge");
 sysbus_realize_and_unref(SYS_BUS_DEVICE(dev), _fatal);
-- 
2.36.1




[RFC PATCH 09/10] hw/isa: Resolve unneeded usage of isabus global

2022-06-27 Thread Bernhard Beschow
Now that all call sites of these functions are fixed to pass non-NULL
ISADevices, the ISABus can be determined from the ISADevice argument.

Patch based on https://lists.nongnu.org/archive/html/qemu-devel/2021-05/
msg05785.html.

Signed-off-by: Bernhard Beschow 
---
 hw/ide/ioport.c  |  4 ++--
 hw/isa/isa-bus.c | 21 +
 include/hw/isa/isa.h |  2 +-
 3 files changed, 16 insertions(+), 11 deletions(-)

diff --git a/hw/ide/ioport.c b/hw/ide/ioport.c
index 69e4fa15d4..112726e415 100644
--- a/hw/ide/ioport.c
+++ b/hw/ide/ioport.c
@@ -53,8 +53,8 @@ static const MemoryRegionPortio ide_portio2_list[] = {
 
 void isa_ide_init_ioport(IDEBus *bus, ISADevice *dev, int iobase, int iobase2)
 {
-/* ??? Assume only ISA and PCI configurations, and that the PCI-ISA
-   bridge has been setup properly to always register with ISA.  */
+assert(dev);
+
 isa_register_portio_list(dev, >portio_list,
  iobase, ide_portio_list, bus, "ide");
 
diff --git a/hw/isa/isa-bus.c b/hw/isa/isa-bus.c
index 9e8b5da027..5518db93cd 100644
--- a/hw/isa/isa-bus.c
+++ b/hw/isa/isa-bus.c
@@ -116,6 +116,10 @@ static inline void isa_init_ioport(ISADevice *dev, 
uint16_t ioport)
 
 void isa_register_ioport(ISADevice *dev, MemoryRegion *io, uint16_t start)
 {
+ISABus *isabus;
+
+assert(dev);
+isabus = isa_bus_from_device(dev);
 memory_region_add_subregion(isabus->address_space_io, start, io);
 isa_init_ioport(dev, start);
 }
@@ -125,8 +129,13 @@ void isa_register_portio_list(ISADevice *dev,
   const MemoryRegionPortio *pio_start,
   void *opaque, const char *name)
 {
+ISABus *isabus;
+
+assert(dev);
 assert(piolist && !piolist->owner);
 
+isabus = isa_bus_from_device(dev);
+
 /* START is how we should treat DEV, regardless of the actual
contents of the portio array.  This is how the old code
actually handled e.g. the FDC device.  */
@@ -246,20 +255,16 @@ static char *isabus_get_fw_dev_path(DeviceState *dev)
 
 MemoryRegion *isa_address_space(ISADevice *dev)
 {
-if (dev) {
-return isa_bus_from_device(dev)->address_space;
-}
+assert(dev);
 
-return isabus->address_space;
+return isa_bus_from_device(dev)->address_space;
 }
 
 MemoryRegion *isa_address_space_io(ISADevice *dev)
 {
-if (dev) {
-return isa_bus_from_device(dev)->address_space_io;
-}
+assert(dev);
 
-return isabus->address_space_io;
+return isa_bus_from_device(dev)->address_space_io;
 }
 
 type_init(isabus_register_types)
diff --git a/include/hw/isa/isa.h b/include/hw/isa/isa.h
index 8dd2953211..486851e7cb 100644
--- a/include/hw/isa/isa.h
+++ b/include/hw/isa/isa.h
@@ -108,7 +108,7 @@ void isa_register_ioport(ISADevice *dev, MemoryRegion *io, 
uint16_t start);
  * function makes it easy to create multiple MemoryRegions for a single
  * device and use the legacy portio routines.
  *
- * @dev: the ISADevice against which these are registered; may be NULL.
+ * @dev: the ISADevice against which these are registered
  * @piolist: the PortioList associated with the io ports
  * @start: the base I/O port against which the portio->offset is applied.
  * @portio: the ports, sorted by offset.
-- 
2.36.1




[RFC PATCH 06/10] hw/ide/ioport: Rename ide_init_ioport() to isa_ide_init_ioport()

2022-06-27 Thread Bernhard Beschow
ide_init_ioport() takes an ISADevice* parameter which eventually gets passed
to isa_address_space_io(). Unfortunately, there is no ISADevice in hw/ide/
piix, so NULL gets passed instead. This causes isa_address_space_io() to
resort to using the isabus global - which we want to get rid of.

To resolve this, observe that hw/isa/piix* models pass PCI's IO address
space to ISA which can be used instead. The next patch therefore introduces
pci_ide_init_ioport() which takes a PCIDevice* parameter instead and is
available in hw/ide/piix.

Signed-off-by: Bernhard Beschow 
---
 hw/ide/ioport.c   | 2 +-
 hw/ide/isa.c  | 2 +-
 hw/ide/piix.c | 4 ++--
 include/hw/ide/internal.h | 2 +-
 4 files changed, 5 insertions(+), 5 deletions(-)

diff --git a/hw/ide/ioport.c b/hw/ide/ioport.c
index b613ff3bba..ed1f34f573 100644
--- a/hw/ide/ioport.c
+++ b/hw/ide/ioport.c
@@ -50,7 +50,7 @@ static const MemoryRegionPortio ide_portio2_list[] = {
 PORTIO_END_OF_LIST(),
 };
 
-void ide_init_ioport(IDEBus *bus, ISADevice *dev, int iobase, int iobase2)
+void isa_ide_init_ioport(IDEBus *bus, ISADevice *dev, int iobase, int iobase2)
 {
 /* ??? Assume only ISA and PCI configurations, and that the PCI-ISA
bridge has been setup properly to always register with ISA.  */
diff --git a/hw/ide/isa.c b/hw/ide/isa.c
index 8bedbd13f1..79ed33aefa 100644
--- a/hw/ide/isa.c
+++ b/hw/ide/isa.c
@@ -74,7 +74,7 @@ static void isa_ide_realizefn(DeviceState *dev, Error **errp)
 ISAIDEState *s = ISA_IDE(dev);
 
 ide_bus_init(>bus, sizeof(s->bus), dev, 0, 2);
-ide_init_ioport(>bus, isadev, s->iobase, s->iobase2);
+isa_ide_init_ioport(>bus, isadev, s->iobase, s->iobase2);
 s->irq = isa_get_irq(isadev, s->isairq);
 ide_init2(>bus, s->irq);
 vmstate_register(VMSTATE_IF(dev), 0, _ide_isa, s);
diff --git a/hw/ide/piix.c b/hw/ide/piix.c
index fbf2756b47..312611c61f 100644
--- a/hw/ide/piix.c
+++ b/hw/ide/piix.c
@@ -161,8 +161,8 @@ static int pci_piix_init_ports(PCIIDEState *d)
 
 for (i = 0; i < 2; i++) {
 ide_bus_init(>bus[i], sizeof(d->bus[i]), dev, i, 2);
-ide_init_ioport(>bus[i], NULL, port_info[i].iobase,
-port_info[i].iobase2);
+isa_ide_init_ioport(>bus[i], NULL, port_info[i].iobase,
+port_info[i].iobase2);
 ide_init2(>bus[i], qdev_get_gpio_in(dev, i));
 
 bmdma_init(>bus[i], >bmdma[i], d);
diff --git a/include/hw/ide/internal.h b/include/hw/ide/internal.h
index 348e7f2510..86ecc04ce4 100644
--- a/include/hw/ide/internal.h
+++ b/include/hw/ide/internal.h
@@ -624,7 +624,7 @@ int ide_init_drive(IDEState *s, BlockBackend *blk, 
IDEDriveKind kind,
int chs_trans, Error **errp);
 void ide_init2(IDEBus *bus, qemu_irq irq);
 void ide_exit(IDEState *s);
-void ide_init_ioport(IDEBus *bus, ISADevice *isa, int iobase, int iobase2);
+void isa_ide_init_ioport(IDEBus *bus, ISADevice *isa, int iobase, int iobase2);
 void ide_register_restart_cb(IDEBus *bus);
 
 void ide_exec_cmd(IDEBus *bus, uint32_t val);
-- 
2.36.1




[RFC PATCH 03/10] hw/i386/pc_piix: Allow for setting properties on "piix3-ide" before realizing it

2022-06-27 Thread Bernhard Beschow
The next patch will introduce a quirk for user-created PIIX IDE devices for
backwards compatibility. In order to opt-in to new behavior for builtin
devices a property will need to be set until a deprecation period is over.

Signed-off-by: Bernhard Beschow 
---
 hw/i386/pc_piix.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/hw/i386/pc_piix.c b/hw/i386/pc_piix.c
index 0fc2361ffe..2d146b19c0 100644
--- a/hw/i386/pc_piix.c
+++ b/hw/i386/pc_piix.c
@@ -252,7 +252,8 @@ static void pc_init1(MachineState *machine,
 if (pcmc->pci_enabled) {
 PCIDevice *dev;
 
-dev = pci_create_simple(pci_bus, piix3_devfn + 1, "piix3-ide");
+dev = pci_new_multifunction(piix3_devfn + 1, false, "piix3-ide");
+pci_realize_and_unref(dev, pci_bus, _fatal);
 pci_ide_create_devs(dev);
 idebus[0] = qdev_get_child_bus(>qdev, "ide.0");
 idebus[1] = qdev_get_child_bus(>qdev, "ide.1");
-- 
2.36.1




[RFC PATCH 08/10] hw/ide/piix: Use pci_ide_init_ioport() rather than isa_ide_init_ioport()

2022-06-27 Thread Bernhard Beschow
This should fix the last caller causing a NULL ISADev to be passed to
isa_register_portio_list() which now allows for disusing the isabus global
there.

Signed-off-by: Bernhard Beschow 
---
 hw/ide/piix.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/hw/ide/piix.c b/hw/ide/piix.c
index 312611c61f..087568ecf1 100644
--- a/hw/ide/piix.c
+++ b/hw/ide/piix.c
@@ -161,7 +161,7 @@ static int pci_piix_init_ports(PCIIDEState *d)
 
 for (i = 0; i < 2; i++) {
 ide_bus_init(>bus[i], sizeof(d->bus[i]), dev, i, 2);
-isa_ide_init_ioport(>bus[i], NULL, port_info[i].iobase,
+pci_ide_init_ioport(>bus[i], PCI_DEVICE(d), port_info[i].iobase,
 port_info[i].iobase2);
 ide_init2(>bus[i], qdev_get_gpio_in(dev, i));
 
-- 
2.36.1




[RFC PATCH 00/10] Resolve isabus global

2022-06-27 Thread Bernhard Beschow
This series resolves the global "isabus" variable and is basically a v2 of [1].
Note that the majority of the work consists of fixing ISA API calls in PIIX IDE
which implicitly rely on the usage of the isabus global.

Rather than adding an ISABus pointer in PCIIDEState as in "v1" this series uses
a qemu_irq array which is roughly the approach outlined in [2]. Moreover, this
series considers backwards compatibility for user-created PIIX IDE
"Frankensten" devices by using a temporary hack. This hack can be removed again
once a deprecation period of user-createable PIIX IDE devices is over. This
deprecation wasn't announced yet but now might be a good time.

Testing done:
* `./qemu-system-x86_64 -M x-remote -device piix3-ide` still fails gracefully 
with
`qemu-system-x86_64: -device piix3-ide: Failed to realize piix3-ide: No such 
device`
* `make check-avocado` doesn't report errors
* Booting a live image with `./qemu-system-x86_64 -M pc` works
* Booting a MIPS Malta machine [3] works

[1] https://patchew.org/QEMU/20210518215545.1793947-1-phi...@redhat.com/
[2] https://lists.nongnu.org/archive/html/qemu-devel/2020-03/msg01707.html
[3] https://people.debian.org/~aurel32/qemu/mips/

Bernhard Beschow (10):
  hw/ide/piix: Check for presence of ISABus before using it
  Revert "hw/ide: Fix crash when plugging a piix3-ide device into the
x-remote machine"
  hw/i386/pc_piix: Allow for setting properties on "piix3-ide" before
realizing it
  hw/ide/piix: Avoid the isabus global when wiring ISA interrupts for
internal devices
  hw/isa/isa-bus: assert() if isa_get_irq() gets passed a NULL ISADevice
  hw/ide/ioport: Rename ide_init_ioport() to isa_ide_init_ioport()
  hw/pci/pci: Introduce pci_register_portio_list()
  hw/ide/piix: Use pci_ide_init_ioport() rather than
isa_ide_init_ioport()
  hw/isa: Resolve unneeded usage of isabus global
  hw/isa/isa-bus: Resolve isabus global

 hw/i386/pc_piix.c |  6 +++-
 hw/ide/ioport.c   | 30 +---
 hw/ide/isa.c  |  2 +-
 hw/ide/piix.c | 59 +++
 hw/isa/isa-bus.c  | 46 ++
 hw/isa/piix4.c|  3 ++
 hw/pci/pci.c  | 18 
 include/hw/ide/internal.h |  3 +-
 include/hw/ide/pci.h  |  2 ++
 include/hw/isa/isa.h  | 15 --
 include/hw/pci/pci.h  | 21 ++
 11 files changed, 147 insertions(+), 58 deletions(-)

-- 
2.36.1




[RFC PATCH 02/10] Revert "hw/ide: Fix crash when plugging a piix3-ide device into the x-remote machine"

2022-06-27 Thread Bernhard Beschow
Now that the PIIX IDE device models check for presence of an ISABus before
using it, this fix isn't needed any longer.

This reverts commit 9405d87be25db6dff4d7b5ab48a81bbf6d083e47.

Signed-off-by: Bernhard Beschow 
---
 hw/ide/ioport.c   | 16 ++--
 hw/ide/piix.c |  9 +++--
 hw/isa/isa-bus.c  | 14 --
 include/hw/ide/internal.h |  2 +-
 include/hw/isa/isa.h  | 13 +
 5 files changed, 19 insertions(+), 35 deletions(-)

diff --git a/hw/ide/ioport.c b/hw/ide/ioport.c
index e6caa537fa..b613ff3bba 100644
--- a/hw/ide/ioport.c
+++ b/hw/ide/ioport.c
@@ -50,19 +50,15 @@ static const MemoryRegionPortio ide_portio2_list[] = {
 PORTIO_END_OF_LIST(),
 };
 
-int ide_init_ioport(IDEBus *bus, ISADevice *dev, int iobase, int iobase2)
+void ide_init_ioport(IDEBus *bus, ISADevice *dev, int iobase, int iobase2)
 {
-int ret;
-
 /* ??? Assume only ISA and PCI configurations, and that the PCI-ISA
bridge has been setup properly to always register with ISA.  */
-ret = isa_register_portio_list(dev, >portio_list,
-   iobase, ide_portio_list, bus, "ide");
+isa_register_portio_list(dev, >portio_list,
+ iobase, ide_portio_list, bus, "ide");
 
-if (ret == 0 && iobase2) {
-ret = isa_register_portio_list(dev, >portio2_list,
-   iobase2, ide_portio2_list, bus, "ide");
+if (iobase2) {
+isa_register_portio_list(dev, >portio2_list,
+ iobase2, ide_portio2_list, bus, "ide");
 }
-
-return ret;
 }
diff --git a/hw/ide/piix.c b/hw/ide/piix.c
index e8f3abc4b5..21777ecc8b 100644
--- a/hw/ide/piix.c
+++ b/hw/ide/piix.c
@@ -134,7 +134,7 @@ static int pci_piix_init_ports(PCIIDEState *d)
 {0x1f0, 0x3f6, 14},
 {0x170, 0x376, 15},
 };
-int i, ret;
+int i;
 
 {
 ISABus *isa_bus;
@@ -149,11 +149,8 @@ static int pci_piix_init_ports(PCIIDEState *d)
 
 for (i = 0; i < 2; i++) {
 ide_bus_init(>bus[i], sizeof(d->bus[i]), DEVICE(d), i, 2);
-ret = ide_init_ioport(>bus[i], NULL, port_info[i].iobase,
-  port_info[i].iobase2);
-if (ret) {
-return ret;
-}
+ide_init_ioport(>bus[i], NULL, port_info[i].iobase,
+port_info[i].iobase2);
 ide_init2(>bus[i], isa_get_irq(NULL, port_info[i].isairq));
 
 bmdma_init(>bus[i], >bmdma[i], d);
diff --git a/hw/isa/isa-bus.c b/hw/isa/isa-bus.c
index 1bee1a47f1..0537a9f2c1 100644
--- a/hw/isa/isa-bus.c
+++ b/hw/isa/isa-bus.c
@@ -119,17 +119,13 @@ void isa_register_ioport(ISADevice *dev, MemoryRegion 
*io, uint16_t start)
 isa_init_ioport(dev, start);
 }
 
-int isa_register_portio_list(ISADevice *dev,
- PortioList *piolist, uint16_t start,
- const MemoryRegionPortio *pio_start,
- void *opaque, const char *name)
+void isa_register_portio_list(ISADevice *dev,
+  PortioList *piolist, uint16_t start,
+  const MemoryRegionPortio *pio_start,
+  void *opaque, const char *name)
 {
 assert(piolist && !piolist->owner);
 
-if (!isabus) {
-return -ENODEV;
-}
-
 /* START is how we should treat DEV, regardless of the actual
contents of the portio array.  This is how the old code
actually handled e.g. the FDC device.  */
@@ -137,8 +133,6 @@ int isa_register_portio_list(ISADevice *dev,
 
 portio_list_init(piolist, OBJECT(dev), pio_start, opaque, name);
 portio_list_add(piolist, isabus->address_space_io, start);
-
-return 0;
 }
 
 ISADevice *isa_new(const char *name)
diff --git a/include/hw/ide/internal.h b/include/hw/ide/internal.h
index 97e7e59dc5..348e7f2510 100644
--- a/include/hw/ide/internal.h
+++ b/include/hw/ide/internal.h
@@ -624,7 +624,7 @@ int ide_init_drive(IDEState *s, BlockBackend *blk, 
IDEDriveKind kind,
int chs_trans, Error **errp);
 void ide_init2(IDEBus *bus, qemu_irq irq);
 void ide_exit(IDEState *s);
-int ide_init_ioport(IDEBus *bus, ISADevice *isa, int iobase, int iobase2);
+void ide_init_ioport(IDEBus *bus, ISADevice *isa, int iobase, int iobase2);
 void ide_register_restart_cb(IDEBus *bus);
 
 void ide_exec_cmd(IDEBus *bus, uint32_t val);
diff --git a/include/hw/isa/isa.h b/include/hw/isa/isa.h
index 6c8a8a92cb..8dd2953211 100644
--- a/include/hw/isa/isa.h
+++ b/include/hw/isa/isa.h
@@ -114,15 +114,12 @@ void isa_register_ioport(ISADevice *dev, MemoryRegion 
*io, uint16_t start);
  * @portio: the ports, sorted by offset.
  * @opaque: passed into the portio callbacks.
  * @name: passed into memory_region_init_io.
- *
- * Returns: 0 on success, negative error code otherwise (e.g. if the
- *  ISA bus is not available)
  */
-int isa_register_portio_list(ISADevice *dev,
-   

[RFC PATCH 01/10] hw/ide/piix: Check for presence of ISABus before using it

2022-06-27 Thread Bernhard Beschow
This is an alternative solution to commit
9405d87be25db6dff4d7b5ab48a81bbf6d083e47 'hw/ide: Fix crash when plugging a
piix3-ide device into the x-remote machine' which allows for cleaning up the
ISA API while keeping PIIX IDE functions user-createable for an arbitrarily
long deprecation period.

Signed-off-by: Bernhard Beschow 
---
 hw/ide/piix.c | 11 +++
 1 file changed, 11 insertions(+)

diff --git a/hw/ide/piix.c b/hw/ide/piix.c
index 9a9b28078e..e8f3abc4b5 100644
--- a/hw/ide/piix.c
+++ b/hw/ide/piix.c
@@ -136,6 +136,17 @@ static int pci_piix_init_ports(PCIIDEState *d)
 };
 int i, ret;
 
+{
+ISABus *isa_bus;
+bool ambiguous;
+
+isa_bus = ISA_BUS(object_resolve_path_type("", TYPE_ISA_BUS,
+   ));
+if (!isa_bus || ambiguous) {
+return -ENODEV;
+}
+}
+
 for (i = 0; i < 2; i++) {
 ide_bus_init(>bus[i], sizeof(d->bus[i]), DEVICE(d), i, 2);
 ret = ide_init_ioport(>bus[i], NULL, port_info[i].iobase,
-- 
2.36.1