date:20230609

Re: [Libguestfs] [libnbd PATCH v3 08/22] block_status: Track 64-bit extents internally

2023-06-09 Thread Eric Blake

[fixing some of my editing errors]

On Fri, Jun 09, 2023 at 10:42:25AM -0500, Eric Blake wrote:
[...]
> 
> The tl;dr summary of the above discourse:
> There are two orthogonal communications going on:
> 
> libnbd <-> server  choice of NBD_REPLY_TYPE_BLOCK_STATUS{,_EXT}
> app <-> libnbd choice of nbd_block_status() or nbd_block_status_64()
> 
> and all four combinations are easy to encounter when loading the .so
> for libnbd 1.18:
> 
> 32 x 32 (app compiled against libnbd 1.16 to server A)
> 64 x 32 (app compiled against libnbd 1.16 to server B)
> 32 x 64 (app compiled against libnbd 1.18 to server A)
> 64 x 64 (app compiled against libnbd 1.18 to server B)
> 
> and we want all four combinations to work insofar as possible.  32x32
> and 64x64 obviously work, as does 32x64 (widening the server's
> responses never fails); for 32x64 (using the 32-bit nbd_block_status()
> API to access a server's response through 64-bit
> NBD_REPLY_TYPE_BLOCK_STATUS), accessing a metacontext with large flags

for 64x32 (using the 32-bit nbd_block_status() to access a server's
response through 64-bit NBD_REPLY_TYPE_BLOCK_STATUS_EXT),

> changes from fail early to fail late; and accessing a metacontext with
> 32-bit flags but now potential 64-bit lengths obeys overall NBD
> semantics that a block status response can be truncated as long as it
> makes progress (the app shouldn't care whether it was the server or
> libnbd that did the truncation).
>
[...]
> 
> It is indeed a bug if a server replies with
> NBD_REPLY_TYPE_BLOCK_STATUS_EXT for a client that did not request
> extended headers.  But it is also a bug if a server replise with

replies

> NBD_REPLY_TYPE_BLOCK_STATUS for a client that did request extended
> headers, even if the reply does not need more than 32 bits for either
> the length or the flags.

-- 
Eric Blake, Principal Software Engineer
Red Hat, Inc.   +1-919-301-3266
Virtualization:  qemu.org | libvirt.org

[PATCH v2 09/10] block: Convert qmp_query_block() to coroutine_fn

2023-06-09 Thread Fabiano Rosas

This is another caller of bdrv_get_allocated_file_size() that needs to
be converted to a coroutine because that function will be made
asynchronous when called (indirectly) from the QMP dispatcher.

This QMP command is a candidate because it calls bdrv_do_query_node_info(),
which in turn calls bdrv_get_allocated_file_size().

We've determined bdrv_do_query_node_info() to be coroutine-safe (see
previous commits), so we can just put this QMP command in a coroutine.

Since qmp_query_block() now expects to run in a coroutine, its callers
need to be converted as well. Convert hmp_info_block(), which calls
only coroutine-safe code, including qmp_query_named_block_nodes()
which has been converted to coroutine in the previous patches.

Now that all callers of bdrv_[co_]block_device_info() are using the
coroutine version, a few things happen:

 - we can return to using bdrv_block_device_info() without a wrapper;

 - bdrv_get_allocated_file_size() can stop being mixed;

 - bdrv_co_get_allocated_file_size() needs to be put under the graph
   lock because it is being called wthout the wrapper;

 - bdrv_do_query_node_info() doesn't need to acquire the AioContext
   because it doesn't call aio_poll anymore;

Signed-off-by: Fabiano Rosas 
---
 block.c|  2 +-
 block/monitor/block-hmp-cmds.c |  2 +-
 block/qapi.c   | 18 +-
 hmp-commands-info.hx   |  1 +
 include/block/block-hmp-cmds.h |  2 +-
 include/block/block-io.h   |  2 +-
 include/block/qapi.h   | 12 
 qapi/block-core.json   |  2 +-
 8 files changed, 19 insertions(+), 22 deletions(-)

diff --git a/block.c b/block.c
index abed744b60..f94cee8930 100644
--- a/block.c
+++ b/block.c
@@ -6148,7 +6148,7 @@ BlockDeviceInfoList *bdrv_named_nodes_list(bool flat,
 
 list = NULL;
 QTAILQ_FOREACH(bs, _bdrv_states, node_list) {
-BlockDeviceInfo *info = bdrv_co_block_device_info(NULL, bs, flat, 
errp);
+BlockDeviceInfo *info = bdrv_block_device_info(NULL, bs, flat, errp);
 if (!info) {
 qapi_free_BlockDeviceInfoList(list);
 return NULL;
diff --git a/block/monitor/block-hmp-cmds.c b/block/monitor/block-hmp-cmds.c
index 26116fe831..1049f9b006 100644
--- a/block/monitor/block-hmp-cmds.c
+++ b/block/monitor/block-hmp-cmds.c
@@ -742,7 +742,7 @@ static void print_block_info(Monitor *mon, BlockInfo *info,
 }
 }
 
-void hmp_info_block(Monitor *mon, const QDict *qdict)
+void coroutine_fn hmp_info_block(Monitor *mon, const QDict *qdict)
 {
 BlockInfoList *block_list, *info;
 BlockDeviceInfoList *blockdev_list, *blockdev;
diff --git a/block/qapi.c b/block/qapi.c
index 20660e15d6..3b4bc0b782 100644
--- a/block/qapi.c
+++ b/block/qapi.c
@@ -41,10 +41,10 @@
 #include "qemu/qemu-print.h"
 #include "sysemu/block-backend.h"
 
-BlockDeviceInfo *coroutine_fn bdrv_co_block_device_info(BlockBackend *blk,
-BlockDriverState *bs,
-bool flat,
-Error **errp)
+BlockDeviceInfo *coroutine_fn bdrv_block_device_info(BlockBackend *blk,
+ BlockDriverState *bs,
+ bool flat,
+ Error **errp)
 {
 ImageInfo **p_image_info;
 ImageInfo *backing_info;
@@ -235,8 +235,6 @@ static void bdrv_do_query_node_info(BlockDriverState *bs,
 int ret;
 Error *err = NULL;
 
-aio_context_acquire(bdrv_get_aio_context(bs));
-
 size = bdrv_getlength(bs);
 if (size < 0) {
 error_setg_errno(errp, -size, "Can't get image size '%s'",
@@ -249,7 +247,9 @@ static void bdrv_do_query_node_info(BlockDriverState *bs,
 info->filename= g_strdup(bs->filename);
 info->format  = g_strdup(bdrv_get_format_name(bs));
 info->virtual_size= size;
-info->actual_size = bdrv_get_allocated_file_size(bs);
+bdrv_graph_co_rdlock();
+info->actual_size = bdrv_co_get_allocated_file_size(bs);
+bdrv_graph_co_rdunlock();
 info->has_actual_size = info->actual_size >= 0;
 if (bs->encrypted) {
 info->encrypted = true;
@@ -305,7 +305,7 @@ static void bdrv_do_query_node_info(BlockDriverState *bs,
 }
 
 out:
-aio_context_release(bdrv_get_aio_context(bs));
+return;
 }
 
 /**
@@ -668,7 +668,7 @@ bdrv_query_bds_stats(BlockDriverState *bs, bool blk_level)
 return s;
 }
 
-BlockInfoList *qmp_query_block(Error **errp)
+BlockInfoList *coroutine_fn qmp_query_block(Error **errp)
 {
 BlockInfoList *head = NULL, **p_next = 
 BlockBackend *blk;
diff --git a/hmp-commands-info.hx b/hmp-commands-info.hx
index 47d63d26db..996895f417 100644
--- a/hmp-commands-info.hx
+++ b/hmp-commands-info.hx
@@ -65,6 +65,7 @@ ERST
 .help   = "show info of one block device or all block devices

[PATCH v2 05/10] block: Convert bdrv_query_block_graph_info to coroutine

2023-06-09 Thread Fabiano Rosas

We're converting callers of bdrv_get_allocated_file_size() to run in
coroutines because that function will be made asynchronous when called
(indirectly) from the QMP dispatcher.

This function is a candidate because it calls bdrv_do_query_node_info(),
which in turn calls bdrv_get_allocated_file_size().

All the functions called from bdrv_do_query_node_info() onwards are
coroutine-safe, either have a coroutine version themselves[1] or are
mostly simple code/string manipulation[2].

1) bdrv_getlength(), bdrv_get_allocated_file_size(), bdrv_get_info(),
   bdrv_get_specific_info();

2) bdrv_refresh_filename(), bdrv_get_format_name(),
   bdrv_get_full_backing_filename(), bdrv_query_snapshot_info_list();

Signed-off-by: Fabiano Rosas 
---
 block/qapi.c | 12 +++-
 include/block/qapi.h |  6 +-
 qemu-img.c   |  2 --
 3 files changed, 12 insertions(+), 8 deletions(-)

diff --git a/block/qapi.c b/block/qapi.c
index 1cbb0935ff..a2e71edaff 100644
--- a/block/qapi.c
+++ b/block/qapi.c
@@ -375,7 +375,7 @@ fail:
 }
 
 /**
- * bdrv_query_block_graph_info:
+ * bdrv_co_query_block_graph_info:
  * @bs: root node to start from
  * @p_info: location to store image information
  * @errp: location to store error information
@@ -384,15 +384,17 @@ fail:
  *
  * @p_info will be set only on success. On error, store error in @errp.
  */
-void bdrv_query_block_graph_info(BlockDriverState *bs,
- BlockGraphInfo **p_info,
- Error **errp)
+void coroutine_fn bdrv_co_query_block_graph_info(BlockDriverState *bs,
+ BlockGraphInfo **p_info,
+ Error **errp)
 {
 BlockGraphInfo *info;
 BlockChildInfoList **children_list_tail;
 BdrvChild *c;
 ERRP_GUARD();
 
+assert_bdrv_graph_readable();
+
 info = g_new0(BlockGraphInfo, 1);
 bdrv_do_query_node_info(bs, qapi_BlockGraphInfo_base(info), errp);
 if (*errp) {
@@ -408,7 +410,7 @@ void bdrv_query_block_graph_info(BlockDriverState *bs,
 QAPI_LIST_APPEND(children_list_tail, c_info);
 
 c_info->name = g_strdup(c->name);
-bdrv_query_block_graph_info(c->bs, _info->info, errp);
+bdrv_co_query_block_graph_info(c->bs, _info->info, errp);
 if (*errp) {
 goto fail;
 }
diff --git a/include/block/qapi.h b/include/block/qapi.h
index 8663971c58..7035bcd1ae 100644
--- a/include/block/qapi.h
+++ b/include/block/qapi.h
@@ -25,6 +25,7 @@
 #ifndef BLOCK_QAPI_H
 #define BLOCK_QAPI_H
 
+#include "block/block-common.h"
 #include "block/graph-lock.h"
 #include "block/snapshot.h"
 #include "qapi/qapi-types-block-core.h"
@@ -41,7 +42,10 @@ void bdrv_query_image_info(BlockDriverState *bs,
bool flat,
bool skip_implicit_filters,
Error **errp);
-void GRAPH_RDLOCK
+void coroutine_fn GRAPH_RDLOCK
+bdrv_co_query_block_graph_info(BlockDriverState *bs, BlockGraphInfo **p_info,
+   Error **errp);
+void co_wrapper_bdrv_rdlock
 bdrv_query_block_graph_info(BlockDriverState *bs, BlockGraphInfo **p_info,
 Error **errp);
 
diff --git a/qemu-img.c b/qemu-img.c
index 27f48051b0..8066286f5e 100644
--- a/qemu-img.c
+++ b/qemu-img.c
@@ -2945,9 +2945,7 @@ static BlockGraphInfoList *collect_image_info_list(bool 
image_opts,
  * duplicate the backing chain information that we obtain by walking
  * the chain manually here.
  */
-bdrv_graph_rdlock_main_loop();
 bdrv_query_block_graph_info(bs, , );
-bdrv_graph_rdunlock_main_loop();
 
 if (err) {
 error_report_err(err);
-- 
2.35.3

[PATCH v2 01/10] block: Remove bdrv_query_block_node_info

2023-06-09 Thread Fabiano Rosas

The last call site of this function has been removed by commit
c04d0ab026 ("qemu-img: Let info print block graph").

Reviewed-by: Claudio Fontana 
Signed-off-by: Fabiano Rosas 
---
 block/qapi.c | 27 ---
 include/block/qapi.h |  3 ---
 2 files changed, 30 deletions(-)

diff --git a/block/qapi.c b/block/qapi.c
index f34f95e0ef..79bf80c503 100644
--- a/block/qapi.c
+++ b/block/qapi.c
@@ -309,33 +309,6 @@ out:
 aio_context_release(bdrv_get_aio_context(bs));
 }
 
-/**
- * bdrv_query_block_node_info:
- * @bs: block node to examine
- * @p_info: location to store node information
- * @errp: location to store error information
- *
- * Store image information about @bs in @p_info.
- *
- * @p_info will be set only on success. On error, store error in @errp.
- */
-void bdrv_query_block_node_info(BlockDriverState *bs,
-BlockNodeInfo **p_info,
-Error **errp)
-{
-BlockNodeInfo *info;
-ERRP_GUARD();
-
-info = g_new0(BlockNodeInfo, 1);
-bdrv_do_query_node_info(bs, info, errp);
-if (*errp) {
-qapi_free_BlockNodeInfo(info);
-return;
-}
-
-*p_info = info;
-}
-
 /**
  * bdrv_query_image_info:
  * @bs: block node to examine
diff --git a/include/block/qapi.h b/include/block/qapi.h
index 18d48ddb70..8663971c58 100644
--- a/include/block/qapi.h
+++ b/include/block/qapi.h
@@ -36,9 +36,6 @@ BlockDeviceInfo *bdrv_block_device_info(BlockBackend *blk,
 int bdrv_query_snapshot_info_list(BlockDriverState *bs,
   SnapshotInfoList **p_list,
   Error **errp);
-void bdrv_query_block_node_info(BlockDriverState *bs,
-BlockNodeInfo **p_info,
-Error **errp);
 void bdrv_query_image_info(BlockDriverState *bs,
ImageInfo **p_info,
bool flat,
-- 
2.35.3

[PATCH v2 06/10] block: Convert bdrv_block_device_info into co_wrapper

2023-06-09 Thread Fabiano Rosas

We're converting callers of bdrv_get_allocated_file_size() to run in
coroutines because that function will be made asynchronous when called
(indirectly) from the QMP dispatcher.

This function is a candidate because it calls bdrv_query_image_info()
-> bdrv_do_query_node_info() -> bdrv_get_allocated_file_size().

It is safe to turn this is a coroutine because the code it calls is
made up of either simple accessors and string manipulation functions
[1] or it has already been determined to be safe [2].

1) bdrv_refresh_filename(), bdrv_is_read_only(),
   blk_enable_write_cache(), bdrv_cow_bs(), blk_get_public(),
   throttle_group_get_name(), bdrv_write_threshold_get(),
   bdrv_query_dirty_bitmaps(), throttle_group_get_config(),
   bdrv_filter_or_cow_bs(), bdrv_skip_implicit_filters()

2) bdrv_do_query_node_info() (see previous commit);

Signed-off-by: Fabiano Rosas 
---
 block/qapi.c |  8 
 include/block/qapi.h | 12 
 2 files changed, 12 insertions(+), 8 deletions(-)

diff --git a/block/qapi.c b/block/qapi.c
index a2e71edaff..20660e15d6 100644
--- a/block/qapi.c
+++ b/block/qapi.c
@@ -41,10 +41,10 @@
 #include "qemu/qemu-print.h"
 #include "sysemu/block-backend.h"
 
-BlockDeviceInfo *bdrv_block_device_info(BlockBackend *blk,
-BlockDriverState *bs,
-bool flat,
-Error **errp)
+BlockDeviceInfo *coroutine_fn bdrv_co_block_device_info(BlockBackend *blk,
+BlockDriverState *bs,
+bool flat,
+Error **errp)
 {
 ImageInfo **p_image_info;
 ImageInfo *backing_info;
diff --git a/include/block/qapi.h b/include/block/qapi.h
index 7035bcd1ae..5cb0202791 100644
--- a/include/block/qapi.h
+++ b/include/block/qapi.h
@@ -30,10 +30,14 @@
 #include "block/snapshot.h"
 #include "qapi/qapi-types-block-core.h"
 
-BlockDeviceInfo *bdrv_block_device_info(BlockBackend *blk,
-BlockDriverState *bs,
-bool flat,
-Error **errp);
+BlockDeviceInfo *coroutine_fn bdrv_co_block_device_info(BlockBackend *blk,
+BlockDriverState *bs,
+bool flat,
+Error **errp);
+BlockDeviceInfo *co_wrapper bdrv_block_device_info(BlockBackend *blk,
+   BlockDriverState *bs,
+   bool flat,
+   Error **errp);
 int bdrv_query_snapshot_info_list(BlockDriverState *bs,
   SnapshotInfoList **p_list,
   Error **errp);
-- 
2.35.3

[PATCH v2 07/10] block: Convert qmp_query_named_block_nodes to coroutine

2023-06-09 Thread Fabiano Rosas

From: Lin Ma 

We're converting callers of bdrv_get_allocated_file_size() to run in
coroutines because that function will be made asynchronous when called
(indirectly) from the QMP dispatcher.

This QMP command is a candidate because it indirectly calls
bdrv_get_allocated_file_size() through bdrv_block_device_info() ->
bdrv_query_image_info() -> bdrv_query_image_info().

The previous patches have determined that bdrv_query_image_info() and
bdrv_do_query_node_info() are coroutine-safe so we can just make the
QMP command run in a coroutine.

Signed-off-by: Lin Ma 
Signed-off-by: Fabiano Rosas 
---
 block.c  | 2 +-
 blockdev.c   | 6 +++---
 qapi/block-core.json | 3 ++-
 3 files changed, 6 insertions(+), 5 deletions(-)

diff --git a/block.c b/block.c
index f94cee8930..abed744b60 100644
--- a/block.c
+++ b/block.c
@@ -6148,7 +6148,7 @@ BlockDeviceInfoList *bdrv_named_nodes_list(bool flat,
 
 list = NULL;
 QTAILQ_FOREACH(bs, _bdrv_states, node_list) {
-BlockDeviceInfo *info = bdrv_block_device_info(NULL, bs, flat, errp);
+BlockDeviceInfo *info = bdrv_co_block_device_info(NULL, bs, flat, 
errp);
 if (!info) {
 qapi_free_BlockDeviceInfoList(list);
 return NULL;
diff --git a/blockdev.c b/blockdev.c
index e6eba61484..8b5f7d06c8 100644
--- a/blockdev.c
+++ b/blockdev.c
@@ -2818,9 +2818,9 @@ void qmp_drive_backup(DriveBackup *backup, Error **errp)
 blockdev_do_action(, errp);
 }
 
-BlockDeviceInfoList *qmp_query_named_block_nodes(bool has_flat,
- bool flat,
- Error **errp)
+BlockDeviceInfoList *coroutine_fn qmp_query_named_block_nodes(bool has_flat,
+  bool flat,
+  Error **errp)
 {
 bool return_flat = has_flat && flat;
 
diff --git a/qapi/block-core.json b/qapi/block-core.json
index 5dd5f7e4b0..9d4c92f2c9 100644
--- a/qapi/block-core.json
+++ b/qapi/block-core.json
@@ -1972,7 +1972,8 @@
 { 'command': 'query-named-block-nodes',
   'returns': [ 'BlockDeviceInfo' ],
   'data': { '*flat': 'bool' },
-  'allow-preconfig': true }
+  'allow-preconfig': true,
+  'coroutine': true}
 
 ##
 # @XDbgBlockGraphNodeType:
-- 
2.35.3

[PATCH v2 00/10] block: Make raw_co_get_allocated_file_size asynchronous

2023-06-09 Thread Fabiano Rosas

Hi,

The major change from the last version is that this time I'm moving
all of the callers of bdrv_get_allocated_file_size() into
coroutines. I had to make some temporary changes to avoid asserts
while not all the callers were converted.

I tried my best to explain why I think the changes are safe. To avoid
changing too much of the code I added a change that removes the
dependency of qmp_query_block from hmp_nbd_server_start, that way I
don't need to move all of the nbd code into a coroutine as well.

Based on:
 [PATCH v2 00/11] block: Re-enable the graph lock
 https://lore.kernel.org/r/20230605085711.21261-1-kw...@redhat.com

changes:

  - fixed duplicated commit message [Lin]
  - clarified why we need to convert info-block [Claudio]
  - added my rationale of why the changes are safe [Eric]
  - converted all callers to coroutines [Kevin]
  - made hmp_nbd_server_start don't depend on qmp_query_block

CI run: https://gitlab.com/farosas/qemu/-/pipelines/895525156
===
v1:
https://lore.kernel.org/r/20230523213903.18418-1-faro...@suse.de

As discussed in another thread [1], here's an RFC addressing a VCPU
softlockup encountered when issuing QMP commands that target a disk
placed on NFS.

Since QMP commands happen with the qemu_global_mutex locked, any
command that takes too long to finish will block other threads waiting
to take the global mutex. One such thread could be a VCPU thread going
out of the guest to handle IO.

This is the case when issuing the QMP command query-block, which
eventually calls raw_co_get_allocated_file_size(). This function makes
an 'fstat' call that has been observed to take a long time (seconds)
over NFS.

NFS latency issues aside, we can improve the situation by not blocking
VCPU threads while the command is running.

Move the 'fstat' call into the thread-pool and make the necessary
adaptations to ensure raw_co_get_allocated_file_size runs in a
coroutine in the block driver aio_context.

1- Question about QMP and BQL
https://lists.gnu.org/archive/html/qemu-devel/2023-05/msg03141.html

Fabiano Rosas (8):
  block: Remove bdrv_query_block_node_info
  block: Remove unnecessary variable in bdrv_block_device_info
  block: Allow the wrapper script to see functions declared in qapi.h
  block: Temporarily mark bdrv_co_get_allocated_file_size as mixed
  block: Convert bdrv_query_block_graph_info to coroutine
  block: Convert bdrv_block_device_info into co_wrapper
  block: Don't query all block devices at hmp_nbd_server_start
  block: Convert qmp_query_block() to coroutine_fn

João Silva (1):
  block: Add a thread-pool version of fstat

Lin Ma (1):
  block: Convert qmp_query_named_block_nodes to coroutine

 block/file-posix.c | 40 +--
 block/meson.build  |  1 +
 block/monitor/block-hmp-cmds.c | 22 ++-
 block/qapi.c   | 62 +-
 blockdev.c |  6 +--
 hmp-commands-info.hx   |  1 +
 include/block/block-hmp-cmds.h |  2 +-
 include/block/qapi.h   | 17 
 include/block/raw-aio.h|  4 +-
 qapi/block-core.json   |  5 ++-
 qemu-img.c |  2 -
 scripts/block-coroutine-wrapper.py |  1 +
 12 files changed, 90 insertions(+), 73 deletions(-)

-- 
2.35.3

[PATCH v2 04/10] block: Temporarily mark bdrv_co_get_allocated_file_size as mixed

2023-06-09 Thread Fabiano Rosas

Some callers of this function are about to be converted to run in
coroutines, so allow it to be executed both inside and outside a
coroutine while we convert all the callers.

This will be reverted once all callers of bdrv_do_query_node_info run
in a coroutine.

Signed-off-by: Fabiano Rosas 
Reviewed-by: Eric Blake 
---
 include/block/block-io.h | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/include/block/block-io.h b/include/block/block-io.h
index 43af816d75..f31e25cf56 100644
--- a/include/block/block-io.h
+++ b/include/block/block-io.h
@@ -87,7 +87,7 @@ int64_t co_wrapper_mixed_bdrv_rdlock 
bdrv_getlength(BlockDriverState *bs);
 int64_t coroutine_fn GRAPH_RDLOCK
 bdrv_co_get_allocated_file_size(BlockDriverState *bs);
 
-int64_t co_wrapper_bdrv_rdlock
+int64_t co_wrapper_mixed_bdrv_rdlock
 bdrv_get_allocated_file_size(BlockDriverState *bs);
 
 BlockMeasureInfo *bdrv_measure(BlockDriver *drv, QemuOpts *opts,
-- 
2.35.3

[PATCH v2 03/10] block: Allow the wrapper script to see functions declared in qapi.h

2023-06-09 Thread Fabiano Rosas

The following patches will add co_wrapper annotations to functions
declared in qapi.h. Add that header to the set of files used by
block-coroutine-wrapper.py.

Signed-off-by: Fabiano Rosas 
---
 block/meson.build  | 1 +
 scripts/block-coroutine-wrapper.py | 1 +
 2 files changed, 2 insertions(+)

diff --git a/block/meson.build b/block/meson.build
index fb4332bd66..7ad6a396a4 100644
--- a/block/meson.build
+++ b/block/meson.build
@@ -150,6 +150,7 @@ block_gen_c = custom_target('block-gen.c',
   '../include/block/dirty-bitmap.h',
   '../include/block/block_int-io.h',
   '../include/block/block-global-state.h',
+  '../include/block/qapi.h',
   
'../include/sysemu/block-backend-global-state.h',
   '../include/sysemu/block-backend-io.h',
   'coroutines.h'
diff --git a/scripts/block-coroutine-wrapper.py 
b/scripts/block-coroutine-wrapper.py
index d4a183db61..814b62df26 100644
--- a/scripts/block-coroutine-wrapper.py
+++ b/scripts/block-coroutine-wrapper.py
@@ -44,6 +44,7 @@ def gen_header():
 #include "block/block-gen.h"
 #include "block/block_int.h"
 #include "block/dirty-bitmap.h"
+#include "block/qapi.h"
 """
 
 
-- 
2.35.3

[PATCH v2 08/10] block: Don't query all block devices at hmp_nbd_server_start

2023-06-09 Thread Fabiano Rosas

We're currently doing a full query-block just to enumerate the devices
for qmp_nbd_server_add and then discarding the BlockInfoList
afterwards. Alter hmp_nbd_server_start to instead iterate explicitly
over the block_backends list.

This allows the removal of the dependency on qmp_query_block from
hmp_nbd_server_start. This is desirable because we're about to move
qmp_query_block into a coroutine and don't need to change the NBD code
at the same time.

Signed-off-by: Fabiano Rosas 
---
 block/monitor/block-hmp-cmds.c | 20 
 1 file changed, 12 insertions(+), 8 deletions(-)

diff --git a/block/monitor/block-hmp-cmds.c b/block/monitor/block-hmp-cmds.c
index ca2599de44..26116fe831 100644
--- a/block/monitor/block-hmp-cmds.c
+++ b/block/monitor/block-hmp-cmds.c
@@ -394,7 +394,7 @@ void hmp_nbd_server_start(Monitor *mon, const QDict *qdict)
 bool writable = qdict_get_try_bool(qdict, "writable", false);
 bool all = qdict_get_try_bool(qdict, "all", false);
 Error *local_err = NULL;
-BlockInfoList *block_list, *info;
+BlockBackend *blk;
 SocketAddress *addr;
 NbdServerAddOptions export;
 
@@ -419,18 +419,24 @@ void hmp_nbd_server_start(Monitor *mon, const QDict 
*qdict)
 return;
 }
 
-/* Then try adding all block devices.  If one fails, close all and
+/*
+ * Then try adding all block devices.  If one fails, close all and
  * exit.
  */
-block_list = qmp_query_block(NULL);
+for (blk = blk_all_next(NULL); blk; blk = blk_all_next(blk)) {
+BlockDriverState *bs = blk_bs(blk);
 
-for (info = block_list; info; info = info->next) {
-if (!info->value->inserted) {
+if (!*blk_name(blk) && !blk_get_attached_dev(blk)) {
+continue;
+}
+
+bs = bdrv_skip_implicit_filters(bs);
+if (!bs || !bs->drv) {
 continue;
 }
 
 export = (NbdServerAddOptions) {
-.device = info->value->device,
+.device = g_strdup(blk_name(blk)),
 .has_writable   = true,
 .writable   = writable,
 };
@@ -443,8 +449,6 @@ void hmp_nbd_server_start(Monitor *mon, const QDict *qdict)
 }
 }
 
-qapi_free_BlockInfoList(block_list);
-
 exit:
 hmp_handle_error(mon, local_err);
 }
-- 
2.35.3

[PATCH v2 02/10] block: Remove unnecessary variable in bdrv_block_device_info

2023-06-09 Thread Fabiano Rosas

The commit 5d8813593f ("block/qapi: Let bdrv_query_image_info()
recurse") removed the loop where we set the 'bs0' variable, so now it
is just the same as 'bs'.

Signed-off-by: Fabiano Rosas 
---
 block/qapi.c | 5 ++---
 1 file changed, 2 insertions(+), 3 deletions(-)

diff --git a/block/qapi.c b/block/qapi.c
index 79bf80c503..1cbb0935ff 100644
--- a/block/qapi.c
+++ b/block/qapi.c
@@ -48,7 +48,7 @@ BlockDeviceInfo *bdrv_block_device_info(BlockBackend *blk,
 {
 ImageInfo **p_image_info;
 ImageInfo *backing_info;
-BlockDriverState *bs0, *backing;
+BlockDriverState *backing;
 BlockDeviceInfo *info;
 ERRP_GUARD();
 
@@ -145,7 +145,6 @@ BlockDeviceInfo *bdrv_block_device_info(BlockBackend *blk,
 
 info->write_threshold = bdrv_write_threshold_get(bs);
 
-bs0 = bs;
 p_image_info = >image;
 info->backing_file_depth = 0;
 
@@ -153,7 +152,7 @@ BlockDeviceInfo *bdrv_block_device_info(BlockBackend *blk,
  * Skip automatically inserted nodes that the user isn't aware of for
  * query-block (blk != NULL), but not for query-named-block-nodes
  */
-bdrv_query_image_info(bs0, p_image_info, flat, blk != NULL, errp);
+bdrv_query_image_info(bs, p_image_info, flat, blk != NULL, errp);
 if (*errp) {
 qapi_free_BlockDeviceInfo(info);
 return NULL;
-- 
2.35.3

[PATCH v2 10/10] block: Add a thread-pool version of fstat

2023-06-09 Thread Fabiano Rosas

From: João Silva 

The fstat call can take a long time to finish when running over
NFS. Add a version of it that runs in the thread pool.

Adapt one of its users, raw_co_get_allocated_file size to use the new
version. That function is called via QMP under the qemu_global_mutex
so it has a large chance of blocking VCPU threads in case it takes too
long to finish.

Reviewed-by: Claudio Fontana 
Signed-off-by: João Silva 
Signed-off-by: Fabiano Rosas 
---
 block/file-posix.c  | 40 +---
 include/block/raw-aio.h |  4 +++-
 2 files changed, 40 insertions(+), 4 deletions(-)

diff --git a/block/file-posix.c b/block/file-posix.c
index ac1ed54811..45232dc0f9 100644
--- a/block/file-posix.c
+++ b/block/file-posix.c
@@ -227,6 +227,9 @@ typedef struct RawPosixAIOData {
 struct {
 unsigned long op;
 } zone_mgmt;
+struct {
+struct stat *st;
+} fstat;
 };
 } RawPosixAIOData;
 
@@ -2614,6 +2617,34 @@ static void raw_close(BlockDriverState *bs)
 }
 }
 
+static int handle_aiocb_fstat(void *opaque)
+{
+RawPosixAIOData *aiocb = opaque;
+
+if (fstat(aiocb->aio_fildes, aiocb->fstat.st) < 0) {
+return -errno;
+}
+
+return 0;
+}
+
+static int coroutine_fn raw_co_fstat(BlockDriverState *bs, struct stat *st)
+{
+BDRVRawState *s = bs->opaque;
+RawPosixAIOData acb;
+
+acb = (RawPosixAIOData) {
+.bs = bs,
+.aio_fildes = s->fd,
+.aio_type   = QEMU_AIO_FSTAT,
+.fstat  = {
+.st = st,
+},
+};
+
+return raw_thread_pool_submit(handle_aiocb_fstat, );
+}
+
 /**
  * Truncates the given regular file @fd to @offset and, when growing, fills the
  * new space according to @prealloc.
@@ -2853,11 +2884,14 @@ static int64_t coroutine_fn 
raw_co_getlength(BlockDriverState *bs)
 static int64_t coroutine_fn raw_co_get_allocated_file_size(BlockDriverState 
*bs)
 {
 struct stat st;
-BDRVRawState *s = bs->opaque;
+int ret;
 
-if (fstat(s->fd, ) < 0) {
-return -errno;
+ret = raw_co_fstat(bs, );
+
+if (ret) {
+return ret;
 }
+
 return (int64_t)st.st_blocks * 512;
 }
 
diff --git a/include/block/raw-aio.h b/include/block/raw-aio.h
index 0f63c2800c..1f2c678461 100644
--- a/include/block/raw-aio.h
+++ b/include/block/raw-aio.h
@@ -31,6 +31,7 @@
 #define QEMU_AIO_ZONE_REPORT  0x0100
 #define QEMU_AIO_ZONE_MGMT0x0200
 #define QEMU_AIO_ZONE_APPEND  0x0400
+#define QEMU_AIO_FSTAT0x0800
 #define QEMU_AIO_TYPE_MASK \
 (QEMU_AIO_READ | \
  QEMU_AIO_WRITE | \
@@ -42,7 +43,8 @@
  QEMU_AIO_TRUNCATE | \
  QEMU_AIO_ZONE_REPORT | \
  QEMU_AIO_ZONE_MGMT | \
- QEMU_AIO_ZONE_APPEND)
+ QEMU_AIO_ZONE_APPEND | \
+ QEMU_AIO_FSTAT)
 
 /* AIO flags */
 #define QEMU_AIO_MISALIGNED   0x1000
-- 
2.35.3

[PATCH 2/5] cmd646: create separate header and QOM type for CMD646_IDE

2023-06-09 Thread Mark Cave-Ayland

This will enable CMD646-specific fields to be added to CMD6464IDEState in
future.

Signed-off-by: Mark Cave-Ayland 
---
 hw/ide/cmd646.c |  4 +++-
 include/hw/ide/cmd646.h | 38 ++
 2 files changed, 41 insertions(+), 1 deletion(-)
 create mode 100644 include/hw/ide/cmd646.h

diff --git a/hw/ide/cmd646.c b/hw/ide/cmd646.c
index 20f1e41d57..a3e227fba7 100644
--- a/hw/ide/cmd646.c
+++ b/hw/ide/cmd646.c
@@ -33,6 +33,7 @@
 #include "sysemu/reset.h"
 
 #include "hw/ide/pci.h"
+#include "hw/ide/cmd646.h"
 #include "trace.h"
 
 /* CMD646 specific */
@@ -339,9 +340,10 @@ static void cmd646_ide_class_init(ObjectClass *klass, void 
*data)
 }
 
 static const TypeInfo cmd646_ide_info = {
-.name  = "cmd646-ide",
+.name  = TYPE_CMD646_IDE,
 .parent= TYPE_PCI_IDE,
 .class_init= cmd646_ide_class_init,
+.instance_size = sizeof(CMD646IDEState),
 };
 
 static void cmd646_ide_register_types(void)
diff --git a/include/hw/ide/cmd646.h b/include/hw/ide/cmd646.h
new file mode 100644
index 00..4780b1132c
--- /dev/null
+++ b/include/hw/ide/cmd646.h
@@ -0,0 +1,38 @@
+/*
+ * QEMU IDE Emulation: PCI cmd646 support.
+ *
+ * Copyright (c) 2003 Fabrice Bellard
+ * Copyright (c) 2006 Openedhand Ltd.
+ *
+ * Permission is hereby granted, free of charge, to any person obtaining a copy
+ * of this software and associated documentation files (the "Software"), to 
deal
+ * in the Software without restriction, including without limitation the rights
+ * to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
+ * copies of the Software, and to permit persons to whom the Software is
+ * furnished to do so, subject to the following conditions:
+ *
+ * The above copyright notice and this permission notice shall be included in
+ * all copies or substantial portions of the Software.
+ *
+ * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+ * IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+ * FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL
+ * THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+ * LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING 
FROM,
+ * OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN
+ * THE SOFTWARE.
+ */
+
+#ifndef HW_IDE_CMD646_H
+#define HW_IDE_CMD646_H
+
+#define TYPE_CMD646_IDE "cmd646-ide"
+OBJECT_DECLARE_SIMPLE_TYPE(CMD646IDEState, CMD646_IDE)
+
+#include "hw/ide/pci.h"
+
+struct CMD646IDEState {
+PCIIDEState parent_obj;
+};
+
+#endif
-- 
2.30.2

[PATCH 4/5] cmd646: rename cmd646_bmdma_ops to bmdma_ops

2023-06-09 Thread Mark Cave-Ayland

This is to allow us to use the cmd646_bmdma_ops name for the CMD646
device-specific registers in the next commit.

Signed-off-by: Mark Cave-Ayland 
---
 hw/ide/cmd646.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/hw/ide/cmd646.c b/hw/ide/cmd646.c
index a3e227fba7..9103da581f 100644
--- a/hw/ide/cmd646.c
+++ b/hw/ide/cmd646.c
@@ -158,7 +158,7 @@ static void bmdma_write(void *opaque, hwaddr addr,
 }
 }
 
-static const MemoryRegionOps cmd646_bmdma_ops = {
+static const MemoryRegionOps bmdma_ops = {
 .read = bmdma_read,
 .write = bmdma_write,
 };
@@ -171,7 +171,7 @@ static void bmdma_setup_bar(PCIIDEState *d)
 memory_region_init(>bmdma_bar, OBJECT(d), "cmd646-bmdma", 16);
 for (i = 0; i < 2; i++) {
 bm = >bmdma[i];
-memory_region_init_io(>extra_io, OBJECT(d), _bmdma_ops, bm,
+memory_region_init_io(>extra_io, OBJECT(d), _ops, bm,
   "cmd646-bmdma-bus", 4);
 memory_region_add_subregion(>bmdma_bar, i * 8, >extra_io);
 memory_region_init_io(>addr_ioport, OBJECT(d),
-- 
2.30.2

[PATCH 1/5] cmd646: checkpatch fixes

2023-06-09 Thread Mark Cave-Ayland

Signed-off-by: Mark Cave-Ayland 
---
 hw/ide/cmd646.c | 13 +++--
 1 file changed, 7 insertions(+), 6 deletions(-)

diff --git a/hw/ide/cmd646.c b/hw/ide/cmd646.c
index a68357c1c5..20f1e41d57 100644
--- a/hw/ide/cmd646.c
+++ b/hw/ide/cmd646.c
@@ -96,7 +96,7 @@ static uint64_t bmdma_read(void *opaque, hwaddr addr,
 return ((uint64_t)1 << (size * 8)) - 1;
 }
 
-switch(addr & 3) {
+switch (addr & 3) {
 case 0:
 val = bm->cmd;
 break;
@@ -133,7 +133,7 @@ static void bmdma_write(void *opaque, hwaddr addr,
 }
 
 trace_bmdma_write_cmd646(addr, val);
-switch(addr & 3) {
+switch (addr & 3) {
 case 0:
 bmdma_cmd_writeb(bm, val);
 break;
@@ -144,7 +144,8 @@ static void bmdma_write(void *opaque, hwaddr addr,
 cmd646_update_irq(pci_dev);
 break;
 case 2:
-bm->status = (val & 0x60) | (bm->status & 1) | (bm->status & ~val & 
0x06);
+bm->status = (val & 0x60) | (bm->status & 1) |
+ (bm->status & ~val & 0x06);
 break;
 case 3:
 if (bm == >pci_dev->bmdma[0]) {
@@ -167,7 +168,7 @@ static void bmdma_setup_bar(PCIIDEState *d)
 int i;
 
 memory_region_init(>bmdma_bar, OBJECT(d), "cmd646-bmdma", 16);
-for(i = 0;i < 2; i++) {
+for (i = 0; i < 2; i++) {
 bm = >bmdma[i];
 memory_region_init_io(>extra_io, OBJECT(d), _bmdma_ops, bm,
   "cmd646-bmdma-bus", 4);
@@ -255,7 +256,7 @@ static void pci_cmd646_ide_realize(PCIDevice *dev, Error 
**errp)
 
 pci_conf[PCI_CLASS_PROG] = 0x8f;
 
-pci_conf[CNTRL] = CNTRL_EN_CH0; // enable IDE0
+pci_conf[CNTRL] = CNTRL_EN_CH0; /* enable IDE0 */
 if (d->secondary) {
 /* XXX: if not enabled, really disable the seconday IDE controller */
 pci_conf[CNTRL] |= CNTRL_EN_CH1; /* enable IDE1 */
@@ -289,7 +290,7 @@ static void pci_cmd646_ide_realize(PCIDevice *dev, Error 
**errp)
 pci_register_bar(dev, 4, PCI_BASE_ADDRESS_SPACE_IO, >bmdma_bar);
 
 /* TODO: RST# value should be 0 */
-pci_conf[PCI_INTERRUPT_PIN] = 0x01; // interrupt on pin 1
+pci_conf[PCI_INTERRUPT_PIN] = 0x01; /* interrupt on pin 1 */
 
 qdev_init_gpio_in(ds, cmd646_set_irq, 2);
 for (i = 0; i < 2; i++) {
-- 
2.30.2

[PATCH 5/5] cmd646: move device-specific BMDMA registers to separate memory region

2023-06-09 Thread Mark Cave-Ayland

The aim here is to eliminate any device-specific registers from the main BMDMA
bar memory region so it can potentially be moved into the shared PCI IDE device.

For each BMDMA bus create a new cmd646-bmdma-specific memory region representing
the device-specific BMDMA registers and then map them using aliases onto the
existing BMDMAState memory region.

Signed-off-by: Mark Cave-Ayland 
---
 hw/ide/cmd646.c | 111 +++-
 include/hw/ide/cmd646.h |   4 ++
 2 files changed, 90 insertions(+), 25 deletions(-)

diff --git a/hw/ide/cmd646.c b/hw/ide/cmd646.c
index 9103da581f..dd495f2e1b 100644
--- a/hw/ide/cmd646.c
+++ b/hw/ide/cmd646.c
@@ -90,7 +90,6 @@ static uint64_t bmdma_read(void *opaque, hwaddr addr,
unsigned size)
 {
 BMDMAState *bm = opaque;
-PCIDevice *pci_dev = PCI_DEVICE(bm->pci_dev);
 uint32_t val;
 
 if (size != 1) {
@@ -101,19 +100,9 @@ static uint64_t bmdma_read(void *opaque, hwaddr addr,
 case 0:
 val = bm->cmd;
 break;
-case 1:
-val = pci_dev->config[MRDMODE];
-break;
 case 2:
 val = bm->status;
 break;
-case 3:
-if (bm == >pci_dev->bmdma[0]) {
-val = pci_dev->config[UDIDETCR0];
-} else {
-val = pci_dev->config[UDIDETCR1];
-}
-break;
 default:
 val = 0xff;
 break;
@@ -127,7 +116,6 @@ static void bmdma_write(void *opaque, hwaddr addr,
 uint64_t val, unsigned size)
 {
 BMDMAState *bm = opaque;
-PCIDevice *pci_dev = PCI_DEVICE(bm->pci_dev);
 
 if (size != 1) {
 return;
@@ -138,23 +126,10 @@ static void bmdma_write(void *opaque, hwaddr addr,
 case 0:
 bmdma_cmd_writeb(bm, val);
 break;
-case 1:
-pci_dev->config[MRDMODE] =
-(pci_dev->config[MRDMODE] & ~0x30) | (val & 0x30);
-cmd646_update_dma_interrupts(pci_dev);
-cmd646_update_irq(pci_dev);
-break;
 case 2:
 bm->status = (val & 0x60) | (bm->status & 1) |
  (bm->status & ~val & 0x06);
 break;
-case 3:
-if (bm == >pci_dev->bmdma[0]) {
-pci_dev->config[UDIDETCR0] = val;
-} else {
-pci_dev->config[UDIDETCR1] = val;
-}
-break;
 }
 }
 
@@ -181,6 +156,91 @@ static void bmdma_setup_bar(PCIIDEState *d)
 }
 }
 
+static uint64_t cmd646_bmdma_read(void *opaque, hwaddr addr, unsigned size)
+{
+BMDMAState *bm = opaque;
+PCIDevice *pci_dev = PCI_DEVICE(bm->pci_dev);
+uint32_t val;
+
+if (size != 1) {
+return ((uint64_t)1 << (size * 8)) - 1;
+}
+
+switch (addr & 3) {
+case 1:
+val = pci_dev->config[MRDMODE];
+break;
+case 3:
+if (bm == >pci_dev->bmdma[0]) {
+val = pci_dev->config[UDIDETCR0];
+} else {
+val = pci_dev->config[UDIDETCR1];
+}
+break;
+default:
+val = 0xff;
+break;
+}
+
+trace_bmdma_read_cmd646(addr, val);
+return val;
+}
+
+static void cmd646_bmdma_write(void *opaque, hwaddr addr, uint64_t val,
+   unsigned size)
+{
+BMDMAState *bm = opaque;
+PCIDevice *pci_dev = PCI_DEVICE(bm->pci_dev);
+
+if (size != 1) {
+return;
+}
+
+trace_bmdma_write_cmd646(addr, val);
+switch (addr & 3) {
+case 1:
+pci_dev->config[MRDMODE] =
+(pci_dev->config[MRDMODE] & ~0x30) | (val & 0x30);
+cmd646_update_dma_interrupts(pci_dev);
+cmd646_update_irq(pci_dev);
+break;
+case 3:
+if (bm == >pci_dev->bmdma[0]) {
+pci_dev->config[UDIDETCR0] = val;
+} else {
+pci_dev->config[UDIDETCR1] = val;
+}
+break;
+}
+}
+
+static const MemoryRegionOps cmd646_bmdma_ops = {
+.read = cmd646_bmdma_read,
+.write = cmd646_bmdma_write,
+};
+
+static void cmd646_bmdma_setup(PCIIDEState *d)
+{
+CMD646IDEState *s = CMD646_IDE(d);
+BMDMAState *bm;
+int i;
+
+/* Setup aliases for device-specific BMDMA registers */
+for (i = 0; i < 2; i++) {
+bm = >bmdma[i];
+memory_region_init_io(>bmdma_mem[i], OBJECT(d), _bmdma_ops,
+  bm, "cmd646-bmdma", 4);
+memory_region_init_alias(>bmdma_mem_alias[i][0], OBJECT(d),
+ "cmd646-bmdma[1]", >bmdma_mem[i], 0x1, 1);
+memory_region_add_subregion(>extra_io, 0x1,
+>bmdma_mem_alias[i][0]);
+memory_region_init_alias(>bmdma_mem_alias[i][1], OBJECT(d),
+ "cmd646-bmdma[3]", >bmdma_mem[i], 0x3, 1);
+memory_region_add_subregion(>extra_io, 0x3,
+>bmdma_mem_alias[i][1]);
+}
+}
+
 static void cmd646_update_irq(PCIDevice *pd)
 {
 int pci_level;
@@ -289,6 +349,7 @@ static void

[PATCH 3/5] cmd646: use TYPE_CMD646_IDE instead of hardcoded "cmd646-ide" string

2023-06-09 Thread Mark Cave-Ayland

Signed-off-by: Mark Cave-Ayland 
---
 hw/alpha/dp264.c   | 4 ++--
 hw/sparc64/sun4u.c | 4 ++--
 2 files changed, 4 insertions(+), 4 deletions(-)

diff --git a/hw/alpha/dp264.c b/hw/alpha/dp264.c
index 03495e1e60..f2affecad9 100644
--- a/hw/alpha/dp264.c
+++ b/hw/alpha/dp264.c
@@ -13,7 +13,7 @@
 #include "alpha_sys.h"
 #include "qemu/error-report.h"
 #include "hw/rtc/mc146818rtc.h"
-#include "hw/ide/pci.h"
+#include "hw/ide/cmd646.h"
 #include "hw/isa/superio.h"
 #include "net/net.h"
 #include "qemu/cutils.h"
@@ -132,7 +132,7 @@ static void clipper_init(MachineState *machine)
 isa_create_simple(isa_bus, TYPE_SMC37C669_SUPERIO);
 
 /* IDE disk setup.  */
-pci_dev = pci_create_simple(pci_bus, -1, "cmd646-ide");
+pci_dev = pci_create_simple(pci_bus, -1, TYPE_CMD646_IDE);
 pci_ide_create_devs(pci_dev);
 
 /* Load PALcode.  Given that this is not "real" cpu palcode,
diff --git a/hw/sparc64/sun4u.c b/hw/sparc64/sun4u.c
index e2858a0331..66b55eabd1 100644
--- a/hw/sparc64/sun4u.c
+++ b/hw/sparc64/sun4u.c
@@ -50,7 +50,7 @@
 #include "hw/sparc/sparc64.h"
 #include "hw/nvram/fw_cfg.h"
 #include "hw/sysbus.h"
-#include "hw/ide/pci.h"
+#include "hw/ide/cmd646.h"
 #include "hw/loader.h"
 #include "hw/fw-path-provider.h"
 #include "elf.h"
@@ -673,7 +673,7 @@ static void sun4uv_init(MemoryRegion *address_space_mem,
 qemu_macaddr_default_if_unset();
 }
 
-pci_dev = pci_new(PCI_DEVFN(3, 0), "cmd646-ide");
+pci_dev = pci_new(PCI_DEVFN(3, 0), TYPE_CMD646_IDE);
 qdev_prop_set_uint32(_dev->qdev, "secondary", 1);
 pci_realize_and_unref(pci_dev, pci_busA, _fatal);
 pci_ide_create_devs(pci_dev);
-- 
2.30.2

[PATCH 0/5] cmd646: move device-specific BMDMA registers to separate memory region

2023-06-09 Thread Mark Cave-Ayland

This series follows on from a comment I made on Bernhard's PCI IDE tidy-up 
series [1]
that it should be possible to further consolidate the BMDMA registers into the 
PCI
IDE device with some minor rework to the CMD646 device.

It does this by moving the CMD646 device-specific BMDMA registers to a separate
memory region, and then aliasing the device-specific BMDMA registers from 
the existing BMDMAState memory region. With this in place it should be fairly
trivial to extend the consolidation implementation in [1].

Before "info mtree" output:
8200-820f (prio 1, i/o): cmd646-bmdma
  8200-8203 (prio 0, i/o): cmd646-bmdma-bus
  8204-8207 (prio 0, i/o): cmd646-bmdma-ioport
  8208-820b (prio 0, i/o): cmd646-bmdma-bus
  820c-820f (prio 0, i/o): cmd646-bmdma-ioport

After "info mtree" output:
8200-820f (prio 1, i/o): cmd646-bmdma
  8200-8203 (prio 0, i/o): cmd646-bmdma-bus
8201-8201 (prio 0, i/o): alias cmd646-bmdma[1] 
@cmd646-bmdma 0001-0001
8203-8203 (prio 0, i/o): alias cmd646-bmdma[3] 
@cmd646-bmdma 0003-0003
  8204-8207 (prio 0, i/o): cmd646-bmdma-ioport
  8208-820b (prio 0, i/o): cmd646-bmdma-bus
8209-8209 (prio 0, i/o): alias cmd646-bmdma[1] 
@cmd646-bmdma 0001-0001
820b-820b (prio 0, i/o): alias cmd646-bmdma[3] 
@cmd646-bmdma 0003-0003
  820c-820f (prio 0, i/o): cmd646-bmdma-ioport

The series was tested by confirming that breakpoints on the CMD646-specific
BMDMA registers were being hit and that my test Debian install ISO still
boots under qemu-system-sparc64.

Signed-off-by: Mark Cave-Ayland 

[1] https://patchew.org/QEMU/20230422150728.176512-1-shen...@gmail.com/


Mark Cave-Ayland (5):
  cmd646: checkpatch fixes
  cmd646: create separate header and QOM type for CMD646_IDE
  cmd646: use TYPE_CMD646_IDE instead of hardcoded "cmd646-ide" string
  cmd646: rename cmd646_bmdma_ops to bmdma_ops
  cmd646: move device-specific BMDMA registers to separate memory region

 hw/alpha/dp264.c|   4 +-
 hw/ide/cmd646.c | 122 ++--
 hw/sparc64/sun4u.c  |   4 +-
 include/hw/ide/cmd646.h |  42 ++
 4 files changed, 139 insertions(+), 33 deletions(-)
 create mode 100644 include/hw/ide/cmd646.h

-- 
2.30.2

Re: [Libguestfs] [libnbd PATCH v3 08/22] block_status: Track 64-bit extents internally

2023-06-09 Thread Eric Blake

On Fri, Jun 09, 2023 at 11:39:27AM +0200, Laszlo Ersek wrote:
> Fair warning: this patch has made me both very excited and very
> uncomfortable; that's the reason my style might come across as
> needlessly "combative". I don't mean to be combative. It's just how my
> thoughts have come out, and (unfortunately) I need to go pick up my son
> now, and don't have time for further rounds of polishing my style. I
> apologize in advance. Again, it's a consequence of me caring a bit too
> much perhaps, and getting worked up.

The disclaimer is much appreciated.  And I likewise hope my reply is
not seen as combative, although I readily admit it is long-winded.

> 
> On 5/25/23 15:00, Eric Blake wrote:
> > When extended headers are in use, the server is required to use a
> > response that can include 64-bit extent lengths and flags (even if it
> > chooses to keep the extent length under 4G, and even for metacontexts
> > like "base:allocation" where the flags never need more than 32 bits).
> 
> Yes. Per spec, negotiating NBD_OPT_EXTENDED_HEADERS is equivalent to the
> server sending NBD_REPLY_TYPE_BLOCK_STATUS_EXT, as opposed to
> NBD_REPLY_TYPE_BLOCK_STATUS.
> 
> > For maximum flexibility, we are better off storing 64-bit data
> > internally, with a goal of letting the client's 32-bit interface work
> > as much as possible, and for a future API addition of a 64-bit
> > interface to work even with older servers that only give 32-bit
> > results.
> 
> I agree that a 64-bit-only internal representation makes sense. Once
> corresponding code part (quoting out of order for discussion's sake) is:
> 
> > diff --git a/lib/internal.h b/lib/internal.h
> > index e4e21a4d..11186e24 100644
> > --- a/lib/internal.h
> > +++ b/lib/internal.h
> > @@ -80,7 +80,7 @@ struct export {
> >
> >  struct command_cb {
> >union {
> > -nbd_extent_callback extent;
> > +nbd_extent64_callback extent;
> >  nbd_chunk_callback chunk;
> >  nbd_list_callback list;
> >  nbd_context_callback context;
> 
> and that looks good to me.
> 
> > For backwards compatibility, a client that never negotiates a 64-bit
> > status context can be handled without errors by truncating any 64-bit
> > lengths down to just under 4G; so the old 32-bit interface will
> > continue to work in most cases.
> 
> I absolutely don't follow this. The spec is clear.
> NBD_OPT_EXTENDED_HEADERS corresponds to NBD_REPLY_TYPE_BLOCK_STATUS_EXT,
> and (!NBD_OPT_EXTENDED_HEADERS) corresponds to
> NBD_REPLY_TYPE_BLOCK_STATUS. If NBD_OPT_EXTENDED_HEADERS is negotiated
> (which means both client and server agree to it), then the client
> interface is 64-bit clean, so there's no need to truncate anything. If
> NBD_OPT_EXTENDED_HEADERS is *not* negotiated, then our internal 64-bit
> clean representation will never be filled with information that doesn't
> fit into 32-bits (due to the reply type being restricted to
> NBD_REPLY_TYPE_BLOCK_STATUS), so there is again no need to (observably)
> truncate anything (technically there would be a uint64_t to uint32_t
> conversion, but it would be value-preserving).
> 
> The spec also says that, if a client wants to query a particular
> metadata context that is known to use 64-bit values, then the client
> must first negotiate NBD_OPT_EXTENDED_HEADERS.

Maybe I can improve my wording in that part of the commit message.
But first, let's step back and see the big picture.

For the sake of argument, let's suppose there are two NBD servers in
play; server A that does not know extended headers but supports
"base:allocation" (a good set of servers today), and server B that has
implemented extended headers and has added a new metacontext (let's
call it "x-new:big") that returns 64-bit status flags, but is willing
to also accept older clients.  (I hope that someday I can get nbdkit
to be such a server, but doing that requires adding a v3 protocol for
nbdkit plugins; the current v2 plugin protocol can ONLY provide
"base:allocation", but being able to let a plugin provide its own
metacontexts will add some nice expressive power to nbdkit).

Clients compiled against libnbd 1.16 do NOT have access to
nbd_block_status_64() (the API did not exist yet); they expect that
nbd_block_status() (which returns responses constrained to 32-bit
lengths, by its API contract) will work.  When run with the .so for
libnbd 1.16 (currently named libnbd.so.0.0.0, because we have never
broken ABI compatibility across version updates - everything has been
additions-only), the client does not request extended headers (that
was not part of the code base), so we can reasonably state that if
such a client requests "x-new:big" during negotiation with server B, a
compliant server will reply with NBD_REP_ERR_EXT_HEADER_REQD; libnbd
1.16 doesn't recognize that constant, but does recognize it as an
error reply, and will gracefully proceed by telling the client it
cannot access "x-new:big", so there are never any 64-bit flag values
to worry about.  You are also right that if

[PATCH v3 4/8] hw/ide/ahci: simplify and document PxCI handling

2023-06-09 Thread Niklas Cassel

From: Niklas Cassel 

The AHCI spec states that:
For NCQ, PxCI is cleared on command queued successfully.

For non-NCQ, PxCI is cleared on command completed successfully.
(A non-NCQ command that completes with error does not clear PxCI.)

The current QEMU implementation either clears PxCI in check_cmd(),
or in ahci_cmd_done().

check_cmd() will clear PxCI for a command if handle_cmd() returns 0.
handle_cmd() will return -1 if BUSY or DRQ is set.

The QEMU implementation for NCQ commands will currently not set BUSY
or DRQ, so they will always have PxCI cleared by handle_cmd().
ahci_cmd_done() will never even get called for NCQ commands.

Non-NCQ commands are executed by ide_bus_exec_cmd().
Non-NCQ commands in QEMU are implemented either in a sync or in an async
way.

For non-NCQ commands implemented in a sync way, the command handler will
return true, and when ide_bus_exec_cmd() sees that a command handler
returns true, it will call ide_cmd_done() (which will call
ahci_cmd_done()). For a command implemented in a sync way,
ahci_cmd_done() will do nothing (since busy_slot is not set). Instead,
after ide_bus_exec_cmd() has finished, check_cmd() will clear PxCI for
these commands.

For non-NCQ commands implemented in an async way (using either aiocb or
pio_aiocb), the command handler will return false, ide_bus_exec_cmd()
will not call ide_cmd_done(), instead it is expected that the async
callback function will call ide_cmd_done() once the async command is
done. handle_cmd() will set busy_slot, if and only if BUSY or DRQ is
set, and this is checked _after_ ide_bus_exec_cmd() has returned.
handle_cmd() will return -1, so check_cmd() will not clear PxCI.
When the async callback calls ide_cmd_done() (which will call
ahci_cmd_done()), it will see that busy_slot is set, and
ahci_cmd_done() will clear PxCI.

This seems racy, since busy_slot is set _after_ ide_bus_exec_cmd() has
returned. The callback might come before busy_slot gets set. And it is
quite confusing that ahci_cmd_done() will be called for all non-NCQ
commands when the command is done, but will only clear PxCI in certain
cases, even though it will always write a D2H FIS and raise an IRQ.

Even worse, in the case where ahci_cmd_done() does not clear PxCI, it
still raises an IRQ. Host software might thus read an old PxCI value,
since PxCI is cleared (by check_cmd()) after the IRQ has been raised.

Try to simplify this by always setting busy_slot for non-NCQ commands,
such that ahci_cmd_done() will always be responsible for clearing PxCI
for non-NCQ commands.

For NCQ commands, clear PxCI when we receive the D2H FIS, but before
raising the IRQ, see AHCI 1.3.1, section 5.3.8, states RegFIS:Entry and
RegFIS:ClearCI.

Signed-off-by: Niklas Cassel 
---
 hw/ide/ahci.c | 70 ---
 1 file changed, 50 insertions(+), 20 deletions(-)

diff --git a/hw/ide/ahci.c b/hw/ide/ahci.c
index 4b272397fd..3deaf01add 100644
--- a/hw/ide/ahci.c
+++ b/hw/ide/ahci.c
@@ -41,9 +41,10 @@
 #include "trace.h"
 
 static void check_cmd(AHCIState *s, int port);
-static int handle_cmd(AHCIState *s, int port, uint8_t slot);
+static void handle_cmd(AHCIState *s, int port, uint8_t slot);
 static void ahci_reset_port(AHCIState *s, int port);
 static bool ahci_write_fis_d2h(AHCIDevice *ad, bool d2h_fis_i);
+static void ahci_clear_cmd_issue(AHCIDevice *ad, uint8_t slot);
 static void ahci_init_d2h(AHCIDevice *ad);
 static int ahci_dma_prepare_buf(const IDEDMA *dma, int32_t limit);
 static bool ahci_map_clb_address(AHCIDevice *ad);
@@ -591,9 +592,8 @@ static void check_cmd(AHCIState *s, int port)
 
 if ((pr->cmd & PORT_CMD_START) && pr->cmd_issue) {
 for (slot = 0; (slot < 32) && pr->cmd_issue; slot++) {
-if ((pr->cmd_issue & (1U << slot)) &&
-!handle_cmd(s, port, slot)) {
-pr->cmd_issue &= ~(1U << slot);
+if (pr->cmd_issue & (1U << slot)) {
+handle_cmd(s, port, slot);
 }
 }
 }
@@ -1123,6 +1123,22 @@ static void process_ncq_command(AHCIState *s, int port, 
const uint8_t *cmd_fis,
 return;
 }
 
+/*
+ * A NCQ command clears the bit in PxCI after the command has been QUEUED
+ * successfully (ERROR not set, BUSY and DRQ cleared).
+ *
+ * For NCQ commands, PxCI will always be cleared here.
+ *
+ * (Once the NCQ command is COMPLETED, the device will send a SDB FIS with
+ * the interrupt bit set, which will clear PxSACT and raise an interrupt.)
+ */
+ahci_clear_cmd_issue(ad, slot);
+
+/*
+ * In reality, for NCQ commands, PxCI is cleared after receiving a D2H FIS
+ * without the interrupt bit set, but since ahci_write_fis_d2h() can raise
+ * an IRQ on error, we need to call them in reverse order.
+ */
 ahci_write_fis_d2h(ad, false);
 
 ncq_tfs->used = 1;
@@ -1197,6 +1213,7 @@ static void handle_reg_h2d_fis(AHCIState *s, int port,
 {
 IDEState *ide_state =

Re: [PATCH v5 5/5] parallels: Image repairing in parallels_open()

2023-06-09 Thread Hanna Czenczek


On 09.06.23 15:21, Alexander Ivanov wrote:



On 6/2/23 16:59, Hanna Czenczek wrote:

On 29.05.23 17:15, Alexander Ivanov wrote:

Repair an image at opening if the image is unclean or out-of-image
corruption was detected.

Signed-off-by: Alexander Ivanov 
---
  block/parallels.c | 65 
+--

  1 file changed, 35 insertions(+), 30 deletions(-)

diff --git a/block/parallels.c b/block/parallels.c
index d64e8007d5..7bbd5cb112 100644
--- a/block/parallels.c
+++ b/block/parallels.c


[...]

@@ -1130,6 +1101,40 @@ static int parallels_open(BlockDriverState 
*bs, QDict *options, int flags,

  goto fail;
  }
  qemu_co_mutex_init(>lock);
+
+    if (le32_to_cpu(ph.inuse) == HEADER_INUSE_MAGIC) {
+    s->header_unclean = true;
+    }
+
+    for (i = 0; i < s->bat_size; i++) {
+    sector = bat2sect(s, i);
+    if (sector + s->tracks > s->data_end) {
+    s->data_end = sector + s->tracks;
+    }
+    }
+
+    /*
+ * We don't repair the image here if it's opened for checks. 
Also we don't
+ * want to change inactive images and can't change readonly 
images.

+ */
+    if ((flags & (BDRV_O_CHECK | BDRV_O_INACTIVE)) || !(flags & 
BDRV_O_RDWR)) {

+    return 0;
+    }
+
+    /*
+ * Repair the image if it's dirty or
+ * out-of-image corruption was detected.
+ */
+    if (s->data_end > file_nb_sectors || s->header_unclean) {
+    BdrvCheckResult res;
+    ret = bdrv_check(bs, , BDRV_FIX_ERRORS | BDRV_FIX_LEAKS);
+    if (ret < 0) {


Should we also verify that res->corruptions == res->corruptions_fixed 
&& res->check_errors == 0?
If ret == 0 there must be res->check_errors == 0 and res->corruptions 
== res->corruptions_fixed.


OK.




+ error_free(s->migration_blocker);


I’d move this clean-up to a new error path below, then we could even 
reuse that where migrate_add_blocker() fails.
Is this guaranteed that s->migration_blocker is NULL at the function 
parallels_open() beginning? If so it could be easy to move the clean-up,

otherwise it could lead to code complication.


Three answers here:

First, I just realized that we probably need to undo the 
migrate_add_blocker() call, too, i.e. call migrate_del_blocker() here.


Second, I’m pretty sure that s->migration_blocker must be NULL before 
the error_setg(>migration_blocker) call, because error_setg() asserts 
that the *errp passed to it is NULL.


Third, I meant to add a new path e.g.:

```
fail_blocker:
    error_free(s->migration_blocker);
fail_format:
[...]
```

And then use `goto fail_blocker;` here and in the migrate_add_blocker() 
error path, so it shouldn’t really matter whether s->migration_blocker 
is NULL before the error_setg() call.  But then again, I think the 
probably necessary migrate_del_blocker() call complicates things further.


Hanna



Anyway, not wrong as-is, just suggestion, so:

Reviewed-by: Hanna Czenczek 

+    error_setg_errno(errp, -ret, "Could not repair 
corrupted image");

+    goto fail;
+    }
+    }
+
  return 0;
    fail_format:

[PATCH v3 3/8] hw/ide/ahci: write D2H FIS when processing NCQ command

2023-06-09 Thread Niklas Cassel

From: Niklas Cassel 

The way that BUSY + PxCI is cleared for NCQ (FPDMA QUEUED) commands is
described in SATA 3.5a Gold:

11.15 FPDMA QUEUED command protocol
DFPDMAQ2: ClearInterfaceBsy
"Transmit Register Device to Host FIS with the BSY bit cleared to zero
and the DRQ bit cleared to zero and Interrupt bit cleared to zero to
mark interface ready for the next command."

PxCI is currently cleared by handle_cmd(), but we don't write the D2H
FIS to the FIS Receive Area that actually caused PxCI to be cleared.

Similar to how ahci_pio_transfer() calls ahci_write_fis_pio() with an
additional parameter to write a PIO Setup FIS without raising an IRQ,
add a parameter to ahci_write_fis_d2h() so that ahci_write_fis_d2h()
also can write the FIS to the FIS Receive Area without raising an IRQ.

Change process_ncq_command() to call ahci_write_fis_d2h() without
raising an IRQ (similar to ahci_pio_transfer()), such that the FIS
Receive Area is in sync with the PxTFD shadow register.

E.g. Linux reads status and error fields from the FIS Receive Area
directly, so it is wise to keep the FIS Receive Area and the PxTFD
shadow register in sync.

Signed-off-by: Niklas Cassel 
---
 hw/ide/ahci.c | 17 +++--
 1 file changed, 11 insertions(+), 6 deletions(-)

diff --git a/hw/ide/ahci.c b/hw/ide/ahci.c
index 48d550f633..4b272397fd 100644
--- a/hw/ide/ahci.c
+++ b/hw/ide/ahci.c
@@ -43,7 +43,7 @@
 static void check_cmd(AHCIState *s, int port);
 static int handle_cmd(AHCIState *s, int port, uint8_t slot);
 static void ahci_reset_port(AHCIState *s, int port);
-static bool ahci_write_fis_d2h(AHCIDevice *ad);
+static bool ahci_write_fis_d2h(AHCIDevice *ad, bool d2h_fis_i);
 static void ahci_init_d2h(AHCIDevice *ad);
 static int ahci_dma_prepare_buf(const IDEDMA *dma, int32_t limit);
 static bool ahci_map_clb_address(AHCIDevice *ad);
@@ -618,7 +618,7 @@ static void ahci_init_d2h(AHCIDevice *ad)
 return;
 }
 
-if (ahci_write_fis_d2h(ad)) {
+if (ahci_write_fis_d2h(ad, true)) {
 ad->init_d2h_sent = true;
 /* We're emulating receiving the first Reg H2D Fis from the device;
  * Update the SIG register, but otherwise proceed as normal. */
@@ -850,7 +850,7 @@ static void ahci_write_fis_pio(AHCIDevice *ad, uint16_t 
len, bool pio_fis_i)
 }
 }
 
-static bool ahci_write_fis_d2h(AHCIDevice *ad)
+static bool ahci_write_fis_d2h(AHCIDevice *ad, bool d2h_fis_i)
 {
 AHCIPortRegs *pr = >port_regs;
 uint8_t *d2h_fis;
@@ -864,7 +864,7 @@ static bool ahci_write_fis_d2h(AHCIDevice *ad)
 d2h_fis = >res_fis[RES_FIS_RFIS];
 
 d2h_fis[0] = SATA_FIS_TYPE_REGISTER_D2H;
-d2h_fis[1] = (1 << 6); /* interrupt bit */
+d2h_fis[1] = d2h_fis_i ? (1 << 6) : 0; /* interrupt bit */
 d2h_fis[2] = s->status;
 d2h_fis[3] = s->error;
 
@@ -890,7 +890,10 @@ static bool ahci_write_fis_d2h(AHCIDevice *ad)
 ahci_trigger_irq(ad->hba, ad, AHCI_PORT_IRQ_BIT_TFES);
 }
 
-ahci_trigger_irq(ad->hba, ad, AHCI_PORT_IRQ_BIT_DHRS);
+if (d2h_fis_i) {
+ahci_trigger_irq(ad->hba, ad, AHCI_PORT_IRQ_BIT_DHRS);
+}
+
 return true;
 }
 
@@ -1120,6 +1123,8 @@ static void process_ncq_command(AHCIState *s, int port, 
const uint8_t *cmd_fis,
 return;
 }
 
+ahci_write_fis_d2h(ad, false);
+
 ncq_tfs->used = 1;
 ncq_tfs->drive = ad;
 ncq_tfs->slot = slot;
@@ -1506,7 +1511,7 @@ static void ahci_cmd_done(const IDEDMA *dma)
 }
 
 /* update d2h status */
-ahci_write_fis_d2h(ad);
+ahci_write_fis_d2h(ad, true);
 
 if (ad->port_regs.cmd_issue && !ad->check_bh) {
 ad->check_bh = qemu_bh_new_guarded(ahci_check_cmd_bh, ad,
-- 
2.40.1

[PATCH v3 7/8] hw/ide/ahci: fix ahci_write_fis_sdb()

2023-06-09 Thread Niklas Cassel

From: Niklas Cassel 

When there is an error, we need to raise a TFES error irq, see AHCI 1.3.1,
5.3.13.1 SDB:Entry.

If ERR_STAT is set, we jump to state ERR:FatalTaskfile, which will raise
a TFES IRQ unconditionally, regardless if the I bit is set in the FIS or
not.

Thus, we should never raise a normal IRQ after having sent an error IRQ.

It is valid to signal successfully completed commands as finished in the
same SDB FIS that generates the error IRQ. The important thing is that
commands that did not complete successfully (e.g. commands that were
aborted, do not get the finished bit set).

Before this commit, there was never a TFES IRQ raised on NCQ error.

Signed-off-by: Niklas Cassel 
---
 hw/ide/ahci.c | 10 --
 1 file changed, 8 insertions(+), 2 deletions(-)

diff --git a/hw/ide/ahci.c b/hw/ide/ahci.c
index 12aaadc554..ef6c9fc378 100644
--- a/hw/ide/ahci.c
+++ b/hw/ide/ahci.c
@@ -806,8 +806,14 @@ static void ahci_write_fis_sdb(AHCIState *s, 
NCQTransferState *ncq_tfs)
 pr->scr_act &= ~ad->finished;
 ad->finished = 0;
 
-/* Trigger IRQ if interrupt bit is set (which currently, it always is) */
-if (sdb_fis->flags & 0x40) {
+/*
+ * TFES IRQ is always raised if ERR_STAT is set, regardless of I bit.
+ * If ERR_STAT is not set, trigger SDBS IRQ if interrupt bit is set
+ * (which currently, it always is).
+ */
+if (sdb_fis->status & ERR_STAT) {
+ahci_trigger_irq(s, ad, AHCI_PORT_IRQ_BIT_TFES);
+} else if (sdb_fis->flags & 0x40) {
 ahci_trigger_irq(s, ad, AHCI_PORT_IRQ_BIT_SDBS);
 }
 }
-- 
2.40.1

[PATCH v3 2/8] hw/ide/core: set ERR_STAT in unsupported command completion

2023-06-09 Thread Niklas Cassel

From: Niklas Cassel 

Currently, the first time sending an unsupported command
(e.g. READ LOG DMA EXT) will not have ERR_STAT set in the completion.
Sending the unsupported command again, will correctly have ERR_STAT set.

When ide_cmd_permitted() returns false, it calls ide_abort_command().
ide_abort_command() first calls ide_transfer_stop(), which will call
ide_transfer_halt() and ide_cmd_done(), after that ide_abort_command()
sets ERR_STAT in status.

ide_cmd_done() for AHCI will call ahci_write_fis_d2h() which writes the
current status in the FIS, and raises an IRQ. (The status here will not
have ERR_STAT set!).

Thus, we cannot call ide_transfer_stop() before setting ERR_STAT, as
ide_transfer_stop() will result in the FIS being written and an IRQ
being raised.

The reason why it works the second time, is that ERR_STAT will still
be set from the previous command, so when writing the FIS, the
completion will correctly have ERR_STAT set.

Set ERR_STAT before writing the FIS (calling cmd_done), so that we will
raise an error IRQ correctly when receiving an unsupported command.

Signed-off-by: Niklas Cassel 
---
 hw/ide/core.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/hw/ide/core.c b/hw/ide/core.c
index de48ff9f86..07971c0218 100644
--- a/hw/ide/core.c
+++ b/hw/ide/core.c
@@ -533,9 +533,9 @@ BlockAIOCB *ide_issue_trim(
 
 void ide_abort_command(IDEState *s)
 {
-ide_transfer_stop(s);
 s->status = READY_STAT | ERR_STAT;
 s->error = ABRT_ERR;
+ide_transfer_stop(s);
 }
 
 static void ide_set_retry(IDEState *s)
-- 
2.40.1

[PATCH v3 5/8] hw/ide/ahci: PxSACT and PxCI is cleared when PxCMD.ST is cleared

2023-06-09 Thread Niklas Cassel

From: Niklas Cassel 

According to AHCI 1.3.1 definition of PxSACT:
This field is cleared when PxCMD.ST is written from a '1' to a '0' by
software. This field is not cleared by a COMRESET or a software reset.

According to AHCI 1.3.1 definition of PxCI:
This field is also cleared when PxCMD.ST is written from a '1' to a '0'
by software.

Clearing PxCMD.ST is part of the error recovery procedure, see
AHCI 1.3.1, section "6.2 Error Recovery".

If we don't clear PxCI on error recovery, the previous command will
incorrectly still be marked as pending after error recovery.

Signed-off-by: Niklas Cassel 
---
 hw/ide/ahci.c | 5 +
 1 file changed, 5 insertions(+)

diff --git a/hw/ide/ahci.c b/hw/ide/ahci.c
index 3deaf01add..a31e6fa65e 100644
--- a/hw/ide/ahci.c
+++ b/hw/ide/ahci.c
@@ -329,6 +329,11 @@ static void ahci_port_write(AHCIState *s, int port, int 
offset, uint32_t val)
 ahci_check_irq(s);
 break;
 case AHCI_PORT_REG_CMD:
+if ((pr->cmd & PORT_CMD_START) && !(val & PORT_CMD_START)) {
+pr->scr_act = 0;
+pr->cmd_issue = 0;
+}
+
 /* Block any Read-only fields from being set;
  * including LIST_ON and FIS_ON.
  * The spec requires to set ICC bits to zero after the ICC change
-- 
2.40.1

Re: [PATCH 0/4] hw/nvme: tp4146 misc

2023-06-09 Thread Jesper Devantier

On Wed, May 24, 2023 at 01:19:00PM +0200, Klaus Jensen wrote:
> From: Klaus Jensen 
> 
> A set of fixes and small quality-of-life improvements for the TP4146
> ("Flexible Data Placement") support.
> 
> Klaus Jensen (4):
>   hw/nvme: fix verification of number of ruhis
>   hw/nvme: verify uniqueness of reclaim unit handle identifiers
>   hw/nvme: add placement handle list ranges
>   docs: update hw/nvme documentation for TP4146
> 
>  docs/system/devices/nvme.rst | 37 +++-
>  hw/nvme/ns.c | 55 
>  hw/nvme/subsys.c |  6 ++--
>  3 files changed, 84 insertions(+), 14 deletions(-)
> 
> -- 
> 2.40.0
> 
> 

Looks good to me :)

Reviewed-by: Jesper Wendel Devantier

[PATCH v3 8/8] hw/ide/ahci: fix broken SError handling

2023-06-09 Thread Niklas Cassel

From: Niklas Cassel 

When encountering an NCQ error, you should not write the NCQ tag to the
SError register. This is completely wrong.

The SError register has a clear definition, where each bit represents a
different error, see PxSERR definition in AHCI 1.3.1.

If we write a random value (like the NCQ tag) in SError, e.g. Linux will
read SError, and will trigger arbitrary error handling depending on the
NCQ tag that happened to be executing.

In case of success, ncq_cb() will call ncq_finish().
In case of error, ncq_cb() will call ncq_err() (which will clear
ncq_tfs->used), and then call ncq_finish(), thus using ncq_tfs->used is
sufficient to tell if finished should get set or not.

Signed-off-by: Niklas Cassel 
---
 hw/ide/ahci.c | 3 +--
 1 file changed, 1 insertion(+), 2 deletions(-)

diff --git a/hw/ide/ahci.c b/hw/ide/ahci.c
index ef6c9fc378..d0a774bc17 100644
--- a/hw/ide/ahci.c
+++ b/hw/ide/ahci.c
@@ -1012,7 +1012,6 @@ static void ncq_err(NCQTransferState *ncq_tfs)
 
 ide_state->error = ABRT_ERR;
 ide_state->status = READY_STAT | ERR_STAT;
-ncq_tfs->drive->port_regs.scr_err |= (1 << ncq_tfs->tag);
 qemu_sglist_destroy(_tfs->sglist);
 ncq_tfs->used = 0;
 }
@@ -1022,7 +1021,7 @@ static void ncq_finish(NCQTransferState *ncq_tfs)
 /* If we didn't error out, set our finished bit. Errored commands
  * do not get a bit set for the SDB FIS ACT register, nor do they
  * clear the outstanding bit in scr_act (PxSACT). */
-if (!(ncq_tfs->drive->port_regs.scr_err & (1 << ncq_tfs->tag))) {
+if (ncq_tfs->used) {
 ncq_tfs->drive->finished |= (1 << ncq_tfs->tag);
 }
 
-- 
2.40.1

[PATCH v3 0/8] misc AHCI cleanups

2023-06-09 Thread Niklas Cassel

From: Niklas Cassel 

Hello John,

Here comes some misc AHCI cleanups.

Most are related to error handling.

Please review.

Changes since v2:
-Squashed in the test commits that were sent out as a separate series into
 the patch "hw/ide/ahci: PxCI should not get cleared when ERR_STAT is set",
 and reordered some of the patches, such that each and every commit passes
 the ahci test suite as a separate unit. This way it will be possible to
 perform a git bisect without seeing any failures in the ahci test suite.


Kind regards,
Niklas

Niklas Cassel (8):
  hw/ide/ahci: remove stray backslash
  hw/ide/core: set ERR_STAT in unsupported command completion
  hw/ide/ahci: write D2H FIS when processing NCQ command
  hw/ide/ahci: simplify and document PxCI handling
  hw/ide/ahci: PxSACT and PxCI is cleared when PxCMD.ST is cleared
  hw/ide/ahci: PxCI should not get cleared when ERR_STAT is set
  hw/ide/ahci: fix ahci_write_fis_sdb()
  hw/ide/ahci: fix broken SError handling

 hw/ide/ahci.c | 112 +++---
 hw/ide/core.c |   2 +-
 tests/qtest/libqos/ahci.c | 106 +++-
 tests/qtest/libqos/ahci.h |   8 +--
 4 files changed, 164 insertions(+), 64 deletions(-)

-- 
2.40.1

Re: [PATCH v5 5/5] parallels: Image repairing in parallels_open()

2023-06-09 Thread Alexander Ivanov





On 6/2/23 16:59, Hanna Czenczek wrote:

On 29.05.23 17:15, Alexander Ivanov wrote:

Repair an image at opening if the image is unclean or out-of-image
corruption was detected.

Signed-off-by: Alexander Ivanov 
---
  block/parallels.c | 65 +--
  1 file changed, 35 insertions(+), 30 deletions(-)

diff --git a/block/parallels.c b/block/parallels.c
index d64e8007d5..7bbd5cb112 100644
--- a/block/parallels.c
+++ b/block/parallels.c


[...]

@@ -1130,6 +1101,40 @@ static int parallels_open(BlockDriverState 
*bs, QDict *options, int flags,

  goto fail;
  }
  qemu_co_mutex_init(>lock);
+
+    if (le32_to_cpu(ph.inuse) == HEADER_INUSE_MAGIC) {
+    s->header_unclean = true;
+    }
+
+    for (i = 0; i < s->bat_size; i++) {
+    sector = bat2sect(s, i);
+    if (sector + s->tracks > s->data_end) {
+    s->data_end = sector + s->tracks;
+    }
+    }
+
+    /*
+ * We don't repair the image here if it's opened for checks. 
Also we don't

+ * want to change inactive images and can't change readonly images.
+ */
+    if ((flags & (BDRV_O_CHECK | BDRV_O_INACTIVE)) || !(flags & 
BDRV_O_RDWR)) {

+    return 0;
+    }
+
+    /*
+ * Repair the image if it's dirty or
+ * out-of-image corruption was detected.
+ */
+    if (s->data_end > file_nb_sectors || s->header_unclean) {
+    BdrvCheckResult res;
+    ret = bdrv_check(bs, , BDRV_FIX_ERRORS | BDRV_FIX_LEAKS);
+    if (ret < 0) {


Should we also verify that res->corruptions == res->corruptions_fixed 
&& res->check_errors == 0?
If ret == 0 there must be res->check_errors == 0 and res->corruptions == 
res->corruptions_fixed.



+ error_free(s->migration_blocker);


I’d move this clean-up to a new error path below, then we could even 
reuse that where migrate_add_blocker() fails.
Is this guaranteed that s->migration_blocker is NULL at the function 
parallels_open() beginning? If so it could be easy to move the clean-up,

otherwise it could lead to code complication.


Anyway, not wrong as-is, just suggestion, so:

Reviewed-by: Hanna Czenczek 

+    error_setg_errno(errp, -ret, "Could not repair corrupted 
image");

+    goto fail;
+    }
+    }
+
  return 0;
    fail_format:




--
Best regards,
Alexander Ivanov

[PATCH v3 6/8] hw/ide/ahci: PxCI should not get cleared when ERR_STAT is set

2023-06-09 Thread Niklas Cassel

From: Niklas Cassel 

For NCQ, PxCI is cleared on command queued successfully.
For non-NCQ, PxCI is cleared on command completed successfully.
Successfully means ERR_STAT, BUSY and DRQ are all cleared.

A command that has ERR_STAT set, does not get to clear PxCI.
See AHCI 1.3.1, section 5.3.8, states RegFIS:Entry and RegFIS:ClearCI,
and 5.3.16.5 ERR:FatalTaskfile.

In the case of non-NCQ commands, not clearing PxCI is needed in order
for host software to be able to see which command slot that failed.

Signed-off-by: Niklas Cassel 
---
 hw/ide/ahci.c |   7 ++-
 tests/qtest/libqos/ahci.c | 106 --
 tests/qtest/libqos/ahci.h |   8 ++-
 3 files changed, 88 insertions(+), 33 deletions(-)

diff --git a/hw/ide/ahci.c b/hw/ide/ahci.c
index a31e6fa65e..12aaadc554 100644
--- a/hw/ide/ahci.c
+++ b/hw/ide/ahci.c
@@ -1523,7 +1523,8 @@ static void ahci_clear_cmd_issue(AHCIDevice *ad, uint8_t 
slot)
 {
 IDEState *ide_state = >port.ifs[0];
 
-if (!(ide_state->status & (BUSY_STAT | DRQ_STAT))) {
+if (!(ide_state->status & ERR_STAT) &&
+!(ide_state->status & (BUSY_STAT | DRQ_STAT))) {
 ad->port_regs.cmd_issue &= ~(1 << slot);
 }
 }
@@ -1532,6 +1533,7 @@ static void ahci_clear_cmd_issue(AHCIDevice *ad, uint8_t 
slot)
 static void ahci_cmd_done(const IDEDMA *dma)
 {
 AHCIDevice *ad = DO_UPCAST(AHCIDevice, dma, dma);
+IDEState *ide_state = >port.ifs[0];
 
 trace_ahci_cmd_done(ad->hba, ad->port_no);
 
@@ -1548,7 +1550,8 @@ static void ahci_cmd_done(const IDEDMA *dma)
  */
 ahci_write_fis_d2h(ad, true);
 
-if (ad->port_regs.cmd_issue && !ad->check_bh) {
+if (!(ide_state->status & ERR_STAT) &&
+ad->port_regs.cmd_issue && !ad->check_bh) {
 ad->check_bh = qemu_bh_new_guarded(ahci_check_cmd_bh, ad,
>mem_reentrancy_guard);
 qemu_bh_schedule(ad->check_bh);
diff --git a/tests/qtest/libqos/ahci.c b/tests/qtest/libqos/ahci.c
index f53f12aa99..a2c94c6e06 100644
--- a/tests/qtest/libqos/ahci.c
+++ b/tests/qtest/libqos/ahci.c
@@ -404,57 +404,110 @@ void ahci_port_clear(AHCIQState *ahci, uint8_t port)
 /**
  * Check a port for errors.
  */
-void ahci_port_check_error(AHCIQState *ahci, uint8_t port,
-   uint32_t imask, uint8_t emask)
+void ahci_port_check_error(AHCIQState *ahci, AHCICommand *cmd)
 {
+uint8_t port = cmd->port;
 uint32_t reg;
 
-/* The upper 9 bits of the IS register all indicate errors. */
-reg = ahci_px_rreg(ahci, port, AHCI_PX_IS);
-reg &= ~imask;
-reg >>= 23;
-g_assert_cmphex(reg, ==, 0);
+/* If expecting TF error, ensure that TFES is set. */
+if (cmd->errors) {
+reg = ahci_px_rreg(ahci, port, AHCI_PX_IS);
+ASSERT_BIT_SET(reg, AHCI_PX_IS_TFES);
+} else {
+/* The upper 9 bits of the IS register all indicate errors. */
+reg = ahci_px_rreg(ahci, port, AHCI_PX_IS);
+reg &= ~cmd->interrupts;
+reg >>= 23;
+g_assert_cmphex(reg, ==, 0);
+}
 
-/* The Sata Error Register should be empty. */
+/* The Sata Error Register should be empty, even when expecting TF error. 
*/
 reg = ahci_px_rreg(ahci, port, AHCI_PX_SERR);
 g_assert_cmphex(reg, ==, 0);
 
+/* If expecting TF error, and TFES was set, perform error recovery
+ * (see AHCI 1.3 section 6.2.2.1) such that we can send new commands. */
+if (cmd->errors) {
+/* This will clear PxCI. */
+ahci_px_clr(ahci, port, AHCI_PX_CMD, AHCI_PX_CMD_ST);
+
+/* The port has 500ms to disengage. */
+usleep(50);
+reg = ahci_px_rreg(ahci, port, AHCI_PX_CMD);
+ASSERT_BIT_CLEAR(reg, AHCI_PX_CMD_CR);
+
+/* Clear PxIS. */
+reg = ahci_px_rreg(ahci, port, AHCI_PX_IS);
+ahci_px_wreg(ahci, port, AHCI_PX_IS, reg);
+
+/* Check if we need to perform a COMRESET.
+ * Not implemented right now, as there is no reason why our QEMU model
+ * should need a COMRESET when expecting TF error. */
+reg = ahci_px_rreg(ahci, port, AHCI_PX_TFD);
+ASSERT_BIT_CLEAR(reg, AHCI_PX_TFD_STS_BSY | AHCI_PX_TFD_STS_DRQ);
+
+/* Enable issuing new commands. */
+ahci_px_set(ahci, port, AHCI_PX_CMD, AHCI_PX_CMD_ST);
+}
+
 /* The TFD also has two error sections. */
 reg = ahci_px_rreg(ahci, port, AHCI_PX_TFD);
-if (!emask) {
+if (!cmd->errors) {
 ASSERT_BIT_CLEAR(reg, AHCI_PX_TFD_STS_ERR);
 } else {
 ASSERT_BIT_SET(reg, AHCI_PX_TFD_STS_ERR);
 }
-ASSERT_BIT_CLEAR(reg, AHCI_PX_TFD_ERR & (~emask << 8));
-ASSERT_BIT_SET(reg, AHCI_PX_TFD_ERR & (emask << 8));
+ASSERT_BIT_CLEAR(reg, AHCI_PX_TFD_ERR & (~cmd->errors << 8));
+ASSERT_BIT_SET(reg, AHCI_PX_TFD_ERR & (cmd->errors << 8));
 }
 
-void ahci_port_check_interrupts(AHCIQState *ahci, uint8_t port,
-uint32_t intr_mask)
+void

[PATCH v3 1/8] hw/ide/ahci: remove stray backslash

2023-06-09 Thread Niklas Cassel

From: Niklas Cassel 

This backslash obviously does not belong here, so remove it.

Signed-off-by: Niklas Cassel 
Reviewed-by: Philippe Mathieu-Daudé 
Reviewed-by: John Snow 
---
 hw/ide/ahci.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/hw/ide/ahci.c b/hw/ide/ahci.c
index 4e76d6b191..48d550f633 100644
--- a/hw/ide/ahci.c
+++ b/hw/ide/ahci.c
@@ -690,7 +690,7 @@ static void ahci_reset_port(AHCIState *s, int port)
 
 s->dev[port].port_state = STATE_RUN;
 if (ide_state->drive_kind == IDE_CD) {
-ahci_set_signature(d, SATA_SIGNATURE_CDROM);\
+ahci_set_signature(d, SATA_SIGNATURE_CDROM);
 ide_state->status = SEEK_STAT | WRERR_STAT | READY_STAT;
 } else {
 ahci_set_signature(d, SATA_SIGNATURE_DISK);
-- 
2.40.1

Re: [PATCH v5 4/5] parallels: Replace fprintf by qemu_log in check

2023-06-09 Thread Peter Maydell

On Mon, 29 May 2023 at 16:16, Alexander Ivanov
 wrote:
>
> If the check is called during normal work, tracking of the check must be
> present in VM logs to have some clues if something going wrong with user's
> data.
>
> Signed-off-by: Alexander Ivanov 
> Reviewed-by: Denis V. Lunev 
> ---
>  block/parallels.c | 18 +-
>  1 file changed, 9 insertions(+), 9 deletions(-)
>
> diff --git a/block/parallels.c b/block/parallels.c
> index 9fa1f93973..d64e8007d5 100644
> --- a/block/parallels.c
> +++ b/block/parallels.c
> @@ -42,6 +42,7 @@
>  #include "qemu/bswap.h"
>  #include "qemu/bitmap.h"
>  #include "qemu/memalign.h"
> +#include "qemu/log-for-trace.h"
>  #include "migration/blocker.h"
>  #include "parallels.h"
>
> @@ -436,8 +437,8 @@ static void parallels_check_unclean(BlockDriverState *bs,
>  return;
>  }
>
> -fprintf(stderr, "%s image was not closed correctly\n",
> -fix & BDRV_FIX_ERRORS ? "Repairing" : "ERROR");
> +qemu_log("%s image was not closed correctly\n",
> + fix & BDRV_FIX_ERRORS ? "Repairing" : "ERROR");

Generally speaking you shouldn't directly call qemu_log().
Instead call qemu_log_mask() and pass it the relevant
QEMU_LOG_* constant for whichever kind of -d debug logging
this is. The raw qemu_log() function is for the odd case
of expensive logging where you want to say
 if (log category foo enabled) {
 do expensive calculation;
 qemu_log(result1);
 qemu_log(result2);
 }

But this doesn't look like the kind of logging we would usually
do with qemu_log(). Notably, the default for the qemu_log machinery
is to not output anything at all: it's only enabled if the user
explicitly turns on debug logging with a -d option.

You probably want warn_report() here.

thanks
-- PMM

Re: [PATCH v5 4/5] parallels: Replace fprintf by qemu_log in check

2023-06-09 Thread Alexander Ivanov





On 6/2/23 16:48, Hanna Czenczek wrote:

On 29.05.23 17:15, Alexander Ivanov wrote:

If the check is called during normal work, tracking of the check must be
present in VM logs to have some clues if something going wrong with 
user's

data.


I understand stderr counts as part of the VM log, doesn’t it?  I 
thought stderr is generally logged, and naïvely, it seems like the 
better fit to me, because it conveys more urgency than the standard 
log (which, judging from its callers, looks mostly like a debug log).


Hanna
We want to add some image checks to parallels_open(). It means that it 
will be used not only in qemu-img and we may lose these messages if a 
log file is specified. Stderr is not duplicated to the log file.



Signed-off-by: Alexander Ivanov 
Reviewed-by: Denis V. Lunev 
---
  block/parallels.c | 18 +-
  1 file changed, 9 insertions(+), 9 deletions(-)




--
Best regards,
Alexander Ivanov

Re: [Libguestfs] [libnbd PATCH v3 08/22] block_status: Track 64-bit extents internally

2023-06-09 Thread Laszlo Ersek

Fair warning: this patch has made me both very excited and very
uncomfortable; that's the reason my style might come across as
needlessly "combative". I don't mean to be combative. It's just how my
thoughts have come out, and (unfortunately) I need to go pick up my son
now, and don't have time for further rounds of polishing my style. I
apologize in advance. Again, it's a consequence of me caring a bit too
much perhaps, and getting worked up.

On 5/25/23 15:00, Eric Blake wrote:
> When extended headers are in use, the server is required to use a
> response that can include 64-bit extent lengths and flags (even if it
> chooses to keep the extent length under 4G, and even for metacontexts
> like "base:allocation" where the flags never need more than 32 bits).

Yes. Per spec, negotiating NBD_OPT_EXTENDED_HEADERS is equivalent to the
server sending NBD_REPLY_TYPE_BLOCK_STATUS_EXT, as opposed to
NBD_REPLY_TYPE_BLOCK_STATUS.

> For maximum flexibility, we are better off storing 64-bit data
> internally, with a goal of letting the client's 32-bit interface work
> as much as possible, and for a future API addition of a 64-bit
> interface to work even with older servers that only give 32-bit
> results.

I agree that a 64-bit-only internal representation makes sense. Once
corresponding code part (quoting out of order for discussion's sake) is:

> diff --git a/lib/internal.h b/lib/internal.h
> index e4e21a4d..11186e24 100644
> --- a/lib/internal.h
> +++ b/lib/internal.h
> @@ -80,7 +80,7 @@ struct export {
>
>  struct command_cb {
>union {
> -nbd_extent_callback extent;
> +nbd_extent64_callback extent;
>  nbd_chunk_callback chunk;
>  nbd_list_callback list;
>  nbd_context_callback context;

and that looks good to me.

> For backwards compatibility, a client that never negotiates a 64-bit
> status context can be handled without errors by truncating any 64-bit
> lengths down to just under 4G; so the old 32-bit interface will
> continue to work in most cases.

I absolutely don't follow this. The spec is clear.
NBD_OPT_EXTENDED_HEADERS corresponds to NBD_REPLY_TYPE_BLOCK_STATUS_EXT,
and (!NBD_OPT_EXTENDED_HEADERS) corresponds to
NBD_REPLY_TYPE_BLOCK_STATUS. If NBD_OPT_EXTENDED_HEADERS is negotiated
(which means both client and server agree to it), then the client
interface is 64-bit clean, so there's no need to truncate anything. If
NBD_OPT_EXTENDED_HEADERS is *not* negotiated, then our internal 64-bit
clean representation will never be filled with information that doesn't
fit into 32-bits (due to the reply type being restricted to
NBD_REPLY_TYPE_BLOCK_STATUS), so there is again no need to (observably)
truncate anything (technically there would be a uint64_t to uint32_t
conversion, but it would be value-preserving).

The spec also says that, if a client wants to query a particular
metadata context that is known to use 64-bit values, then the client
must first negotiate NBD_OPT_EXTENDED_HEADERS.

> A server should not advertise a
> metacontext that requires flags larger than 32 bits unless the client
> negotiated extended headers; but if such a client still tries to use
> the 32-bit interface, the code now reports EOVERFLOW for a server
> response with a flags value that cannot be represented without using
> the future API for 64-bit extent queries.

Again, I don't follow. Whenever the client issues an
NBD_CMD_BLOCK_STATUS command (which can only occur after negotiation):

- the client must have set at least one metacontext with
  NBD_OPT_SET_META_CONTEXT during negotiation,

- the client may or may not have agreed upon with the server on
  NBD_OPT_EXTENDED_HEADERS.

Therefore, when the server receives the NBD_CMD_BLOCK_STATUS command, it
can immediately decide whether there is *any* configured metacontext for
which *lack* of extended headers could be too restrictive. If only
32-bit metacontexts have been requested, this is not a problem. If
ext-hdrs have been negotiated, also not a problem. If there is at least
one 64-bit metacontext (regarding *either* status flags *or* lengths)
*and* ext-hdrs have not been negotiated, the server should immediately
fail (reject) the request.

Now, peeking a bit forward, I can see patch#10: "api: Add
[aio_]nbd_block_status_64". That means we can reject a 32-bit-only
nbd_block_status() API call on the client side immediately, if the
client has negotiated ext-hdrs. That is, if the client is smart enough
to ask for extended headers, it can be smart enough to use the 64-bit
API, and then it can accept everything the server throws at it. If the
client never even asks for extended headers, then it is permitted to
stick with the 32-bit-only API, but then the server must never respond
with NBD_REPLY_TYPE_BLOCK_STATUS_EXT [*], so actually 64-bit values are
never going to be present in our internal representation.

[*] If it still does, then the server is buggy, and we should realize
that as soon as we parse the generic reply magic, or the specific
reply type code.

Re: [PULL 05/17] parallels: Out of image offset in BAT leads to image inflation

2023-06-09 Thread Michael Tokarev


09.06.2023 11:54, Denis V. Lunev wrote:

On 6/7/23 17:14, Hanna Czenczek wrote:

..>>> This, and a few other parallels changes in this series:

Should these be applied to -stable too, or is it not important?


Personally, I don’t think they need to be in stable; but I’ll leave the final 
judgment to Alexander.


I do not think that this needs to go to stable too.


Ok, excellent, thank you!

/mjt

Re: [PULL 05/17] parallels: Out of image offset in BAT leads to image inflation

2023-06-09 Thread Denis V. Lunev


On 6/7/23 17:14, Hanna Czenczek wrote:

On 07.06.23 08:51, Michael Tokarev wrote:

05.06.2023 18:45, Hanna Czenczek wrote:

From: Alexander Ivanov 

data_end field in BDRVParallelsState is set to the biggest offset 
present

in BAT. If this offset is outside of the image, any further write will
create the cluster at this offset and/or the image will be truncated to
this offset on close. This is definitely not correct.

Raise an error in parallels_open() if data_end points outside the image
and it is not a check (let the check to repaire the image). Set 
data_end

to the end of the cluster with the last correct offset.


Hi!

This, and a few other parallels changes in this series:

 parallels: Out of image offset in BAT leads to image inflation
 parallels: Fix high_off calculation in parallels_co_check()
 parallels: Fix image_end_offset and data_end after out-of-image check
 parallels: Fix statistics calculation (?)

Should these be applied to -stable too, or is it not important?


Personally, I don’t think they need to be in stable; but I’ll leave 
the final judgment to Alexander.


Hanna



I do not think that this needs to go to stable too.

Thanks you in advance,
    Den

[PATCH] block: Fix pad_request's request restriction

2023-06-09 Thread Hanna Czenczek

bdrv_pad_request() relies on requests' lengths not to exceed SIZE_MAX,
which bdrv_check_qiov_request() does not guarantee.

bdrv_check_request32() however will guarantee this, and both of
bdrv_pad_request()'s callers (bdrv_co_preadv_part() and
bdrv_co_pwritev_part()) already run it before calling
bdrv_pad_request().  Therefore, bdrv_pad_request() can safely call
bdrv_check_request32() without expecting error, too.

There is one difference between bdrv_check_qiov_request() and
bdrv_check_request32(): The former takes an errp, the latter does not,
so we can no longer just pass _abort.  Instead, we need to check
the returned value.  While we do expect success (because the callers
have already run this function), an assert(ret == 0) is not much simpler
than just to return an error if it occurs, so let us handle errors by
returning them up the stack now.

Reported-by: Peter Maydell 
Fixes: 18743311b829cafc1737a5f20bc3248d5f91ee2a
   ("block: Collapse padded I/O vecs exceeding IOV_MAX")
Signed-off-by: Hanna Czenczek 
---
 block/io.c | 8 ++--
 1 file changed, 6 insertions(+), 2 deletions(-)

diff --git a/block/io.c b/block/io.c
index 30748f0b59..e43b4ad09b 100644
--- a/block/io.c
+++ b/block/io.c
@@ -1710,7 +1710,11 @@ static int bdrv_pad_request(BlockDriverState *bs,
 int sliced_niov;
 size_t sliced_head, sliced_tail;
 
-bdrv_check_qiov_request(*offset, *bytes, *qiov, *qiov_offset, 
_abort);
+/* Should have been checked by the caller already */
+ret = bdrv_check_request32(*offset, *bytes, *qiov, *qiov_offset);
+if (ret < 0) {
+return ret;
+}
 
 if (!bdrv_init_padding(bs, *offset, *bytes, write, pad)) {
 if (padded) {
@@ -1723,7 +1727,7 @@ static int bdrv_pad_request(BlockDriverState *bs,
   _head, _tail,
   _niov);
 
-/* Guaranteed by bdrv_check_qiov_request() */
+/* Guaranteed by bdrv_check_request32() */
 assert(*bytes <= SIZE_MAX);
 ret = bdrv_create_padded_qiov(bs, pad, sliced_iov, sliced_niov,
   sliced_head, *bytes);
-- 
2.40.1

Re: [PULL 02/17] block: Collapse padded I/O vecs exceeding IOV_MAX

2023-06-09 Thread Hanna Czenczek


On 08.06.23 11:52, Peter Maydell wrote:

On Mon, 5 Jun 2023 at 16:48, Hanna Czenczek  wrote:

When processing vectored guest requests that are not aligned to the
storage request alignment, we pad them by adding head and/or tail
buffers for a read-modify-write cycle.

Hi; Coverity complains (CID 1512819) that the assert added
in this commit is always true:


@@ -1573,26 +1701,34 @@ static void bdrv_padding_destroy(BdrvRequestPadding 
*pad)
  static int bdrv_pad_request(BlockDriverState *bs,
  QEMUIOVector **qiov, size_t *qiov_offset,
  int64_t *offset, int64_t *bytes,
+bool write,
  BdrvRequestPadding *pad, bool *padded,
  BdrvRequestFlags *flags)
  {
  int ret;
+struct iovec *sliced_iov;
+int sliced_niov;
+size_t sliced_head, sliced_tail;

  bdrv_check_qiov_request(*offset, *bytes, *qiov, *qiov_offset, 
_abort);

-if (!bdrv_init_padding(bs, *offset, *bytes, pad)) {
+if (!bdrv_init_padding(bs, *offset, *bytes, write, pad)) {
  if (padded) {
  *padded = false;
  }
  return 0;
  }

-ret = qemu_iovec_init_extended(>local_qiov, pad->buf, pad->head,
-   *qiov, *qiov_offset, *bytes,
-   pad->buf + pad->buf_len - pad->tail,
-   pad->tail);
+sliced_iov = qemu_iovec_slice(*qiov, *qiov_offset, *bytes,
+  _head, _tail,
+  _niov);
+
+/* Guaranteed by bdrv_check_qiov_request() */
+assert(*bytes <= SIZE_MAX);

This one. (The Coverity complaint is because SIZE_MAX for it is
UINT64_MAX and an int64_t can't possibly be bigger than that.)

Is this because the assert() is deliberately handling the case
of a 32-bit host where SIZE_MAX might not be UINT64_MAX ? Or was
the bound supposed to be SSIZE_MAX or INT64_MAX ?


It’s supposed to be SIZE_MAX, because of the subsequent 
bdrv_created_padded_qiov() call that takes a size_t.  So it is for a 
case where SIZE_MAX < UINT64_MAX, i.e. 32-bit hosts, yes.



Looking at bdrv_check_qiov_request(), it seems to check bytes
against BDRV_MAX_LENGTH, which is defined as something somewhere
near INT64_MAX. So on a 32-bit host I'm not sure that function
does guarantee that the bytes count is going to be less than
SIZE_MAX...


Ah, crap.  I was thinking of BDRV_REQUEST_MAX_BYTES, which is checked by 
bdrv_check_request32(), which both callers of bdrv_pad_request() 
run/check before calling bdrv_pad_request().  So bdrv_pad_request() 
should use the same function.


I’ll send a patch to change the bdrv_check_qiov_request() to 
bdrv_check_request32().  Because both callers (bdrv_co_preadv_part(), 
bdrv_co_pwritev_part()) already call it (the latter only for non-zero 
writes, but bdrv_pad_request() similarly is called only for non-zero 
writes), there should be no change in behavior, and the asserted 
condition should in practice already be always fulfilled (because of the 
callers checking).


Hanna

38 matches

Mail list logo