date:20211203

[libnbd PATCH 11/13] api: Add three functions for controlling extended headers

2021-12-03 Thread Eric Blake

The new NBD_OPT_EXTENDED_HEADERS feature is worth using by default,
but there may be cases where the user explicitly wants to stick with
the older 32-bit headers.  nbd_set_request_extended_headers() will let
the client override the default, nbd_get_request_extended_headers()
determines the current state of the request, and
nbd_get_extended_headers_negotiated() determines what the client and
server actually settled on.  These use
nbd_set_request_structured_headers() and friends as a template.

Note that this patch just adds the API but ignores the state variable;
the next one will then tweak the state machine to actually request
structured headers when the state variable is set.
---
 lib/internal.h |  1 +
 generator/API.ml   | 89 --
 lib/handle.c   | 23 ++
 python/t/110-defaults.py   |  3 +-
 python/t/120-set-non-defaults.py   |  4 +-
 ocaml/tests/test_110_defaults.ml   |  4 +-
 ocaml/tests/test_120_set_non_defaults.ml   |  5 +-
 golang/libnbd_110_defaults_test.go |  8 ++
 golang/libnbd_120_set_non_defaults_test.go | 12 +++
 9 files changed, 137 insertions(+), 12 deletions(-)

diff --git a/lib/internal.h b/lib/internal.h
index 97abf4f2..a579e413 100644
--- a/lib/internal.h
+++ b/lib/internal.h
@@ -107,6 +107,7 @@ struct nbd_handle {
   char *tls_psk_file;   /* PSK filename, NULL = no PSK */

   /* Extended headers. */
+  bool request_eh;  /* Whether to request extended headers */
   bool extended_headers;/* If we negotiated NBD_OPT_EXTENDED_HEADERS */

   /* Desired metadata contexts. */
diff --git a/generator/API.ml b/generator/API.ml
index 1a452a24..e45f0c86 100644
--- a/generator/API.ml
+++ b/generator/API.ml
@@ -675,6 +675,63 @@   "get_tls_psk_file", {
   };
 *)

+  "set_request_extended_headers", {
+default_call with
+args = [Bool "request"]; ret = RErr;
+permitted_states = [ Created ];
+shortdesc = "control use of extended headers";
+longdesc = "\
+By default, libnbd tries to negotiate extended headers with the
+server, as this protocol extension permits the use of 64-bit
+zero, trim, and block status actions.  However,
+for integration testing, it can be useful to clear this flag
+rather than find a way to alter the server to fail the negotiation
+request.";
+see_also = [Link "get_request_extended_headers";
+Link "set_handshake_flags"; Link "set_strict_mode";
+Link "get_extended_headers_negotiated";
+Link "zero"; Link "trim"; Link "cache";
+Link "block_status_64";
+Link "set_request_structured_replies"];
+  };
+
+  "get_request_extended_headers", {
+default_call with
+args = []; ret = RBool;
+may_set_error = false;
+shortdesc = "see if extended headers are attempted";
+longdesc = "\
+Return the state of the request extended headers flag on this
+handle.
+
+B If you want to find out if extended headers were actually
+negotiated on a particular connection use
+L instead.";
+see_also = [Link "set_request_extended_headers";
+Link "get_extended_headers_negotiated";
+Link "get_request_extended_headers"];
+  };
+
+  "get_extended_headers_negotiated", {
+default_call with
+args = []; ret = RBool;
+permitted_states = [ Negotiating; Connected; Closed ];
+shortdesc = "see if extended headers are in use";
+longdesc = "\
+After connecting you may call this to find out if the connection is
+using extended headers.  When extended headers are not in use, commands
+are limited to a 32-bit length, even when the libnbd API uses a 64-bit
+variable to express the length.  But even when extended headers are
+supported, the server may enforce other limits, visible through
+L.";
+see_also = [Link "set_request_extended_headers";
+Link "get_request_extended_headers";
+Link "zero"; Link "trim"; Link "cache";
+Link "block_status_64"; Link "get_block_size";
+Link "get_protocol";
+Link "get_structured_replies_negotiated"];
+  };
+
   "set_request_structured_replies", {
 default_call with
 args = [Bool "request"]; ret = RErr;
@@ -690,7 +747,8 @@   "set_request_structured_replies", {
 see_also = [Link "get_request_structured_replies";
 Link "set_handshake_flags"; Link "set_strict_mode";
 Link "get_structured_replies_negotiated";
-Link "can_meta_context"; Link "can_df"];
+Link "can_meta_context"; Link "can_df";
+Link "set_request_extended_headers"];
   };

   "get_request_structured_replies", {
@@ -706,7 +764,8 @@   "get_request_structured_replies", {
 negotiated on a particular connection use
 L instead.";
 see_also = [Link "set_request_structured_replies";
-Link

[libnbd PATCH 10/13] api: Add [aio_]nbd_block_status_64

2021-12-03 Thread Eric Blake

Overcome the inherent 32-bit limitation of our existing
nbd_block_status command by adding a 64-bit variant.  The command sent
to the server does not change, but the user's callback is now handed
64-bit information regardless of whether the server replies with 32-
or 64-bit extents.

Unit tests prove that the new API works in each of C, Python, OCaml,
and Go bindings.  We can also get rid of the temporary hack added to
appease the compiler in an earlier patch.
---
 generator/API.ml  | 138 +++---
 generator/OCaml.ml|   1 -
 generator/Python.ml   |   1 -
 lib/rw.c  |  48 ++--
 python/t/465-block-status-64.py   |  56 +
 ocaml/tests/Makefile.am   |   5 +-
 ocaml/tests/test_465_block_status_64.ml   |  58 +
 tests/meta-base-allocation.c  | 111 +++--
 golang/Makefile.am|   3 +-
 golang/libnbd_465_block_status_64_test.go | 119 +++
 10 files changed, 503 insertions(+), 37 deletions(-)
 create mode 100644 python/t/465-block-status-64.py
 create mode 100644 ocaml/tests/test_465_block_status_64.ml
 create mode 100644 golang/libnbd_465_block_status_64_test.go

diff --git a/generator/API.ml b/generator/API.ml
index 70ae721d..1a452a24 100644
--- a/generator/API.ml
+++ b/generator/API.ml
@@ -1071,7 +1071,7 @@   "add_meta_context", {
 During connection libnbd can negotiate zero or more metadata
 contexts with the server.  Metadata contexts are features (such
 as C<\"base:allocation\">) which describe information returned
-by the L command (for C<\"base:allocation\">
+by the L command (for C<\"base:allocation\">
 this is whether blocks of data are allocated, zero or sparse).

 This call adds one metadata context to the list to be negotiated.
@@ -1098,7 +1098,7 @@   "add_meta_context", {
 Other metadata contexts are server-specific, but include
 C<\"qemu:dirty-bitmap:...\"> and C<\"qemu:allocation-depth\"> for
 qemu-nbd (see qemu-nbd I<-B> and I<-A> options).";
-see_also = [Link "block_status"; Link "can_meta_context";
+see_also = [Link "block_status_64"; Link "can_meta_context";
 Link "get_nr_meta_contexts"; Link "get_meta_context";
 Link "clear_meta_contexts"];
   };
@@ -,14 +,14 @@   "get_nr_meta_contexts", {
 During connection libnbd can negotiate zero or more metadata
 contexts with the server.  Metadata contexts are features (such
 as C<\"base:allocation\">) which describe information returned
-by the L command (for C<\"base:allocation\">
+by the L command (for C<\"base:allocation\">
 this is whether blocks of data are allocated, zero or sparse).

 This command returns how many meta contexts have been added to
 the list to request from the server via L.
 The server is not obligated to honor all of the requests; to see
 what it actually supports, see L.";
-see_also = [Link "block_status"; Link "can_meta_context";
+see_also = [Link "block_status_64"; Link "can_meta_context";
 Link "add_meta_context"; Link "get_meta_context";
 Link "clear_meta_contexts"];
   };
@@ -1131,13 +1131,13 @@   "get_meta_context", {
 During connection libnbd can negotiate zero or more metadata
 contexts with the server.  Metadata contexts are features (such
 as C<\"base:allocation\">) which describe information returned
-by the L command (for C<\"base:allocation\">
+by the L command (for C<\"base:allocation\">
 this is whether blocks of data are allocated, zero or sparse).

 This command returns the i'th meta context request, as added by
 L, and bounded by
 L.";
-see_also = [Link "block_status"; Link "can_meta_context";
+see_also = [Link "block_status_64"; Link "can_meta_context";
 Link "add_meta_context"; Link "get_nr_meta_contexts";
 Link "clear_meta_contexts"];
   };
@@ -1151,7 +1151,7 @@   "clear_meta_contexts", {
 During connection libnbd can negotiate zero or more metadata
 contexts with the server.  Metadata contexts are features (such
 as C<\"base:allocation\">) which describe information returned
-by the L command (for C<\"base:allocation\">
+by the L command (for C<\"base:allocation\">
 this is whether blocks of data are allocated, zero or sparse).

 This command resets the list of meta contexts to request back to
@@ -1160,7 +1160,7 @@   "clear_meta_contexts", {
 negotiation mode is selected (see L), for
 altering the list of attempted contexts between subsequent export
 queries.";
-see_also = [Link "block_status"; Link "can_meta_context";
+see_also = [Link "block_status_64"; Link "can_meta_context";
 Link "add_meta_context"; Link "get_nr_meta_contexts";
 Link "get_meta_context"; Link "set_opt_mode"];
   };
@@ -1727,7 +1727,7 @@   "can_meta_context", {
 ^ non_blocking_test_call_description;
 see_also = [SectionLink "Flag calls"; Link "opt_info";

[libnbd PATCH 12/13] generator: Actually request extended headers

2021-12-03 Thread Eric Blake

This is the culmination of the previous patches preparation work for
using extended headers when possible.  The new states in the state
machine are copied extensively from our handling of
OPT_STRUCTURED_REPLY.

At the same time I posted this patch, I had patches for qemu-nbd to
support extended headers as server (nbdkit is a bit tougher).  The
interop tests still pass when using a new enough qemu-nbd, showing
that we have cross-project interoperability and therefore an extension
worth standardizing.
---
 generator/Makefile.am |  3 +-
 generator/state_machine.ml| 41 +
 .../states-newstyle-opt-extended-headers.c| 90 +++
 generator/states-newstyle-opt-starttls.c  | 10 +--
 4 files changed, 138 insertions(+), 6 deletions(-)
 create mode 100644 generator/states-newstyle-opt-extended-headers.c

diff --git a/generator/Makefile.am b/generator/Makefile.am
index 594d23cf..c889eb7f 100644
--- a/generator/Makefile.am
+++ b/generator/Makefile.am
@@ -1,5 +1,5 @@
 # nbd client library in userspace
-# Copyright (C) 2013-2020 Red Hat Inc.
+# Copyright (C) 2013-2021 Red Hat Inc.
 #
 # This library is free software; you can redistribute it and/or
 # modify it under the terms of the GNU Lesser General Public
@@ -30,6 +30,7 @@ states_code = \
states-issue-command.c \
states-magic.c \
states-newstyle-opt-export-name.c \
+   states-newstyle-opt-extended-headers.c \
states-newstyle-opt-list.c \
states-newstyle-opt-go.c \
states-newstyle-opt-meta-context.c \
diff --git a/generator/state_machine.ml b/generator/state_machine.ml
index 99652948..ad8eba5e 100644
--- a/generator/state_machine.ml
+++ b/generator/state_machine.ml
@@ -295,6 +295,7 @@ and
* NEGOTIATING after OPT_STRUCTURED_REPLY or any failed OPT_GO.
*)
   Group ("OPT_STARTTLS", newstyle_opt_starttls_state_machine);
+  Group ("OPT_EXTENDED_HEADERS", newstyle_opt_extended_headers_state_machine);
   Group ("OPT_STRUCTURED_REPLY", newstyle_opt_structured_reply_state_machine);
   Group ("OPT_META_CONTEXT", newstyle_opt_meta_context_state_machine);
   Group ("OPT_GO", newstyle_opt_go_state_machine);
@@ -432,6 +433,46 @@ and
   };
 ]

+(* Fixed newstyle NBD_OPT_EXTENDED_HEADERS option.
+ * Implementation: generator/states-newstyle-opt-extended-headers.c
+ *)
+and newstyle_opt_extended_headers_state_machine = [
+  State {
+default_state with
+name = "START";
+comment = "Try to negotiate newstyle NBD_OPT_EXTENDED_HEADERS";
+external_events = [];
+  };
+
+  State {
+default_state with
+name = "SEND";
+comment = "Send newstyle NBD_OPT_EXTENDED_HEADERS negotiation request";
+external_events = [ NotifyWrite, "" ];
+  };
+
+  State {
+default_state with
+name = "RECV_REPLY";
+comment = "Receive newstyle NBD_OPT_EXTENDED_HEADERS option reply";
+external_events = [ NotifyRead, "" ];
+  };
+
+  State {
+default_state with
+name = "RECV_REPLY_PAYLOAD";
+comment = "Receive any newstyle NBD_OPT_EXTENDED_HEADERS reply payload";
+external_events = [ NotifyRead, "" ];
+  };
+
+  State {
+default_state with
+name = "CHECK_REPLY";
+comment = "Check newstyle NBD_OPT_EXTENDED_HEADERS option reply";
+external_events = [];
+  };
+]
+
 (* Fixed newstyle NBD_OPT_STRUCTURED_REPLY option.
  * Implementation: generator/states-newstyle-opt-structured-reply.c
  *)
diff --git a/generator/states-newstyle-opt-extended-headers.c 
b/generator/states-newstyle-opt-extended-headers.c
new file mode 100644
index ..e2c9890e
--- /dev/null
+++ b/generator/states-newstyle-opt-extended-headers.c
@@ -0,0 +1,90 @@
+/* nbd client library in userspace: state machine
+ * Copyright (C) 2013-2021 Red Hat Inc.
+ *
+ * This library is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU Lesser General Public
+ * License as published by the Free Software Foundation; either
+ * version 2 of the License, or (at your option) any later version.
+ *
+ * This library is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+ * Lesser General Public License for more details.
+ *
+ * You should have received a copy of the GNU Lesser General Public
+ * License along with this library; if not, write to the Free Software
+ * Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301 USA
+ */
+
+/* State machine for negotiating NBD_OPT_EXTENDED_HEADERS. */
+
+STATE_MACHINE {
+ NEWSTYLE.OPT_EXTENDED_HEADERS.START:
+  assert (h->gflags & LIBNBD_HANDSHAKE_FLAG_FIXED_NEWSTYLE);
+  if (!h->request_eh) {
+SET_NEXT_STATE (%^OPT_STRUCTURED_REPLY.START);
+return 0;
+  }
+
+  h->sbuf.option.version = htobe64 (NBD_NEW_VERSION);
+  h->sbuf.option.option = htobe32 (NBD_OPT_EXTENDED_HEADERS);
+  h->sbuf.option.optlen = htobe32 (0);
+  h->wbuf = >sbuf;
+

[libnbd PATCH 03/13] protocol: Add definitions for extended headers

2021-12-03 Thread Eric Blake

Add the magic numbers and new structs necessary to implement the NBD
protocol extension of extended headers providing 64-bit lengths.
---
 lib/nbd-protocol.h | 61 ++
 1 file changed, 51 insertions(+), 10 deletions(-)

diff --git a/lib/nbd-protocol.h b/lib/nbd-protocol.h
index e5d6404b..7247d775 100644
--- a/lib/nbd-protocol.h
+++ b/lib/nbd-protocol.h
@@ -1,5 +1,5 @@
 /* nbdkit
- * Copyright (C) 2013-2020 Red Hat Inc.
+ * Copyright (C) 2013-2021 Red Hat Inc.
  *
  * Redistribution and use in source and binary forms, with or without
  * modification, are permitted provided that the following conditions are
@@ -124,6 +124,7 @@ struct nbd_fixed_new_option_reply {
 #define NBD_OPT_STRUCTURED_REPLY   8
 #define NBD_OPT_LIST_META_CONTEXT  9
 #define NBD_OPT_SET_META_CONTEXT   10
+#define NBD_OPT_EXTENDED_HEADERS   11

 #define NBD_REP_ERR(val) (0x8000 | (val))
 #define NBD_REP_IS_ERR(val) (!!((val) & 0x8000))
@@ -188,6 +189,13 @@ struct nbd_block_descriptor {
   uint32_t status_flags;/* block type (hole etc) */
 } NBD_ATTRIBUTE_PACKED;

+/* NBD_REPLY_TYPE_BLOCK_STATUS_EXT block descriptor. */
+struct nbd_block_descriptor_ext {
+  uint64_t length;  /* length of block */
+  uint32_t status_flags;/* block type (hole etc) */
+  uint32_t pad; /* must be zero */
+} NBD_ATTRIBUTE_PACKED;
+
 /* Request (client -> server). */
 struct nbd_request {
   uint32_t magic;   /* NBD_REQUEST_MAGIC. */
@@ -197,6 +205,14 @@ struct nbd_request {
   uint64_t offset;  /* Request offset. */
   uint32_t count;   /* Request length. */
 } NBD_ATTRIBUTE_PACKED;
+struct nbd_request_ext {
+  uint32_t magic;   /* NBD_REQUEST_EXT_MAGIC. */
+  uint16_t flags;   /* Request flags. */
+  uint16_t type;/* Request type. */
+  uint64_t handle;  /* Opaque handle. */
+  uint64_t offset;  /* Request offset. */
+  uint64_t count;   /* Request length. */
+} NBD_ATTRIBUTE_PACKED;

 /* Simple reply (server -> client). */
 struct nbd_simple_reply {
@@ -204,6 +220,13 @@ struct nbd_simple_reply {
   uint32_t error;   /* NBD_SUCCESS or one of NBD_E*. */
   uint64_t handle;  /* Opaque handle. */
 } NBD_ATTRIBUTE_PACKED;
+struct nbd_simple_reply_ext {
+  uint32_t magic;   /* NBD_SIMPLE_REPLY_EXT_MAGIC. */
+  uint32_t error;   /* NBD_SUCCESS or one of NBD_E*. */
+  uint64_t handle;  /* Opaque handle. */
+  uint64_t pad1;/* Must be 0. */
+  uint64_t pad2;/* Must be 0. */
+} NBD_ATTRIBUTE_PACKED;

 /* Structured reply (server -> client). */
 struct nbd_structured_reply {
@@ -213,6 +236,14 @@ struct nbd_structured_reply {
   uint64_t handle;  /* Opaque handle. */
   uint32_t length;  /* Length of payload which follows. */
 } NBD_ATTRIBUTE_PACKED;
+struct nbd_structured_reply_ext {
+  uint32_t magic;   /* NBD_STRUCTURED_REPLY_EXT_MAGIC. */
+  uint16_t flags;   /* NBD_REPLY_FLAG_* */
+  uint16_t type;/* NBD_REPLY_TYPE_* */
+  uint64_t handle;  /* Opaque handle. */
+  uint64_t length;  /* Length of payload which follows. */
+  uint64_t pad; /* Must be 0. */
+} NBD_ATTRIBUTE_PACKED;

 struct nbd_structured_reply_offset_data {
   uint64_t offset;  /* offset */
@@ -224,15 +255,23 @@ struct nbd_structured_reply_offset_hole {
   uint32_t length;  /* Length of hole. */
 } NBD_ATTRIBUTE_PACKED;

+struct nbd_structured_reply_offset_hole_ext {
+  uint64_t offset;
+  uint64_t length;  /* Length of hole. */
+} NBD_ATTRIBUTE_PACKED;
+
 struct nbd_structured_reply_error {
   uint32_t error;   /* NBD_E* error number */
   uint16_t len; /* Length of human readable error. */
   /* Followed by human readable error string, and possibly more structure. */
 } NBD_ATTRIBUTE_PACKED;

-#define NBD_REQUEST_MAGIC   0x25609513
-#define NBD_SIMPLE_REPLY_MAGIC  0x67446698
-#define NBD_STRUCTURED_REPLY_MAGIC  0x668e33ef
+#define NBD_REQUEST_MAGIC   0x25609513
+#define NBD_REQUEST_EXT_MAGIC   0x21e41c71
+#define NBD_SIMPLE_REPLY_MAGIC  0x67446698
+#define NBD_SIMPLE_REPLY_EXT_MAGIC  0x60d12fd6
+#define NBD_STRUCTURED_REPLY_MAGIC  0x668e33ef
+#define NBD_STRUCTURED_REPLY_EXT_MAGIC  0x6e8a278c

 /* Structured reply flags. */
 #define NBD_REPLY_FLAG_DONE (1<<0)
@@ -241,12 +280,14 @@ struct nbd_structured_reply_error {
 #define NBD_REPLY_TYPE_IS_ERR(val) (!!((val) & (1<<15)))

 /* Structured reply types. */
-#define NBD_REPLY_TYPE_NONE 0
-#define NBD_REPLY_TYPE_OFFSET_DATA  1
-#define NBD_REPLY_TYPE_OFFSET_HOLE  2
-#define NBD_REPLY_TYPE_BLOCK_STATUS 5
-#define NBD_REPLY_TYPE_ERRORNBD_REPLY_TYPE_ERR (1)
-#define NBD_REPLY_TYPE_ERROR_OFFSET NBD_REPLY_TYPE_ERR (2)
+#define

[libnbd PATCH 13/13] interop: Add test of 64-bit block status

2021-12-03 Thread Eric Blake

Prove that we can round-trip a block status request larger than 4G
through a new-enough qemu-nbd.  Also serves as a unit test of our shim
for converting internal 64-bit representation back to the older 32-bit
nbd_block_status callback interface.
---
 interop/Makefile.am |   6 ++
 interop/large-status.c  | 186 
 interop/large-status.sh |  49 +++
 .gitignore  |   1 +
 4 files changed, 242 insertions(+)
 create mode 100644 interop/large-status.c
 create mode 100755 interop/large-status.sh

diff --git a/interop/Makefile.am b/interop/Makefile.am
index 3a8d5677..96c0a0f6 100644
--- a/interop/Makefile.am
+++ b/interop/Makefile.am
@@ -20,6 +20,7 @@ include $(top_srcdir)/subdir-rules.mk
 EXTRA_DIST = \
dirty-bitmap.sh \
interop-qemu-storage-daemon.sh \
+   large-status.sh \
list-exports-nbd-config \
list-exports-test-dir/disk1 \
list-exports-test-dir/disk2 \
@@ -129,6 +130,7 @@ check_PROGRAMS += \
list-exports-qemu-nbd \
socket-activation-qemu-nbd \
dirty-bitmap \
+   large-status \
structured-read \
$(NULL)
 TESTS += \
@@ -138,6 +140,7 @@ TESTS += \
list-exports-qemu-nbd \
socket-activation-qemu-nbd \
dirty-bitmap.sh \
+   large-status.sh \
structured-read.sh \
$(NULL)

@@ -227,6 +230,9 @@ socket_activation_qemu_nbd_LDADD = 
$(top_builddir)/lib/libnbd.la
 dirty_bitmap_SOURCES = dirty-bitmap.c
 dirty_bitmap_LDADD = $(top_builddir)/lib/libnbd.la

+large_status_SOURCES = large-status.c
+large_status_LDADD = $(top_builddir)/lib/libnbd.la
+
 structured_read_SOURCES = structured-read.c
 structured_read_LDADD = $(top_builddir)/lib/libnbd.la

diff --git a/interop/large-status.c b/interop/large-status.c
new file mode 100644
index ..3cc040fe
--- /dev/null
+++ b/interop/large-status.c
@@ -0,0 +1,186 @@
+/* NBD client library in userspace
+ * Copyright (C) 2013-2021 Red Hat Inc.
+ *
+ * This library is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU Lesser General Public
+ * License as published by the Free Software Foundation; either
+ * version 2 of the License, or (at your option) any later version.
+ *
+ * This library is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+ * Lesser General Public License for more details.
+ *
+ * You should have received a copy of the GNU Lesser General Public
+ * License along with this library; if not, write to the Free Software
+ * Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301 USA
+ */
+
+/* Test 64-bit block status with qemu. */
+
+#include 
+
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+
+#include 
+
+static const char *bitmap;
+
+struct data {
+  bool req_one;/* input: true if req_one was passed to request */
+  int count;   /* input: count of expected remaining calls */
+  bool seen_base;  /* output: true if base:allocation encountered */
+  bool seen_dirty; /* output: true if qemu:dirty-bitmap encountered */
+};
+
+static int
+cb32 (void *opaque, const char *metacontext, uint64_t offset,
+  uint32_t *entries, size_t len, int *error)
+{
+  struct data *data = opaque;
+
+  assert (offset == 0);
+  assert (data->count-- > 0);
+
+  if (strcmp (metacontext, LIBNBD_CONTEXT_BASE_ALLOCATION) == 0) {
+assert (!data->seen_base);
+data->seen_base = true;
+
+/* Data block offset 0 size 64k, remainder is hole */
+assert (len == 4);
+assert (entries[0] == 65536);
+assert (entries[1] == 0);
+/* libnbd had to truncate qemu's >4G answer */
+assert (entries[2] == 4227858432);
+assert (entries[3] == (LIBNBD_STATE_HOLE|LIBNBD_STATE_ZERO));
+  }
+  else if (strcmp (metacontext, bitmap) == 0) {
+assert (!data->seen_dirty);
+data->seen_dirty = true;
+
+/* Dirty block at offset 5G-64k, remainder is clean */
+/* libnbd had to truncate qemu's >4G answer */
+assert (len == 2);
+assert (entries[0] == 4227858432);
+assert (entries[1] == 0);
+  }
+  else {
+fprintf (stderr, "unexpected context %s\n", metacontext);
+exit (EXIT_FAILURE);
+  }
+  return 0;
+}
+
+static int
+cb64 (void *opaque, const char *metacontext, uint64_t offset,
+  nbd_extent *entries, size_t len, int *error)
+{
+  struct data *data = opaque;
+
+  assert (offset == 0);
+  assert (data->count-- > 0);
+
+  if (strcmp (metacontext, LIBNBD_CONTEXT_BASE_ALLOCATION) == 0) {
+assert (!data->seen_base);
+data->seen_base = true;
+
+/* Data block offset 0 size 64k, remainder is hole */
+assert (len == 2);
+assert (entries[0].length == 65536);
+assert (entries[0].flags == 0);
+assert (entries[1].length == 5368643584ULL);
+assert (entries[1].flags == (LIBNBD_STATE_HOLE|LIBNBD_STATE_ZERO));
+  }
+  else

[libnbd PATCH 05/13] protocol: Prepare to receive 64-bit replies

2021-12-03 Thread Eric Blake

Support receiving headers for 64-bit replies if extended headers were
negotiated.  We already insist that the server not send us too much
payload in one reply, so we can exploit that and merge the 64-bit
length back into a normalized 32-bit field for the rest of the payload
length calculations.  The NBD protocol specifically made extended
simple and structured replies both occupy 32 bytes, while the handle
field is still in the same offset between all reply types.

Note that if we negotiate extended headers, but a non-compliant server
replies with a non-extended header, we will stall waiting for the
server to send more bytes rather than noticing that the magic number
is wrong.  The alternative would be to read just the first 4 bytes of
magic, then determine how many more bytes to expect; but that would
require more states and syscalls, and not worth it since the typical
server will be compliant.

At this point, h->extended_headers is permanently false (we can't
enable it until all other aspects of the protocol have likewise been
converted).
---
 lib/internal.h  |  8 +++-
 generator/states-reply-structured.c | 59 +++--
 generator/states-reply.c| 31 +++
 3 files changed, 68 insertions(+), 30 deletions(-)

diff --git a/lib/internal.h b/lib/internal.h
index 07378588..c9f84441 100644
--- a/lib/internal.h
+++ b/lib/internal.h
@@ -222,8 +222,12 @@ struct nbd_handle {
 }  __attribute__((packed)) or;
 struct nbd_export_name_option_reply export_name_reply;
 struct nbd_simple_reply simple_reply;
+struct nbd_simple_reply_ext simple_reply_ext;
 struct {
-  struct nbd_structured_reply structured_reply;
+  union {
+struct nbd_structured_reply structured_reply;
+struct nbd_structured_reply_ext structured_reply_ext;
+  } hdr;
   union {
 struct nbd_structured_reply_offset_data offset_data;
 struct nbd_structured_reply_offset_hole offset_hole;
@@ -233,7 +237,7 @@ struct nbd_handle {
   uint64_t offset; /* Only used for NBD_REPLY_TYPE_ERROR_OFFSET */
 } __attribute__((packed)) error;
   } payload;
-}  __attribute__((packed)) sr;
+} sr;
 uint16_t gflags;
 uint32_t cflags;
 uint32_t len;
diff --git a/generator/states-reply-structured.c 
b/generator/states-reply-structured.c
index 5524e000..1b675e8d 100644
--- a/generator/states-reply-structured.c
+++ b/generator/states-reply-structured.c
@@ -45,19 +45,23 @@ structured_reply_in_bounds (uint64_t offset, uint32_t 
length,

 STATE_MACHINE {
  REPLY.STRUCTURED_REPLY.START:
-  /* We've only read the simple_reply.  The structured_reply is longer,
-   * so read the remaining part.
+  /* We've only read the simple_reply.  Unless we have extended headers,
+   * the structured_reply is longer, so read the remaining part.
*/
   if (!h->structured_replies) {
 set_error (0, "server sent unexpected structured reply");
 SET_NEXT_STATE(%.DEAD);
 return 0;
   }
-  h->rbuf = >sbuf;
-  h->rbuf += sizeof h->sbuf.simple_reply;
-  h->rlen = sizeof h->sbuf.sr.structured_reply;
-  h->rlen -= sizeof h->sbuf.simple_reply;
-  SET_NEXT_STATE (%RECV_REMAINING);
+  if (h->extended_headers)
+SET_NEXT_STATE (%CHECK);
+  else {
+h->rbuf = >sbuf;
+h->rbuf += sizeof h->sbuf.simple_reply;
+h->rlen = sizeof h->sbuf.sr.hdr.structured_reply;
+h->rlen -= sizeof h->sbuf.simple_reply;
+SET_NEXT_STATE (%RECV_REMAINING);
+  }
   return 0;

  REPLY.STRUCTURED_REPLY.RECV_REMAINING:
@@ -75,12 +79,21 @@ STATE_MACHINE {
   struct command *cmd = h->reply_cmd;
   uint16_t flags, type;
   uint64_t cookie;
-  uint32_t length;
+  uint64_t length;

-  flags = be16toh (h->sbuf.sr.structured_reply.flags);
-  type = be16toh (h->sbuf.sr.structured_reply.type);
-  cookie = be64toh (h->sbuf.sr.structured_reply.handle);
-  length = be32toh (h->sbuf.sr.structured_reply.length);
+  flags = be16toh (h->sbuf.sr.hdr.structured_reply.flags);
+  type = be16toh (h->sbuf.sr.hdr.structured_reply.type);
+  cookie = be64toh (h->sbuf.sr.hdr.structured_reply.handle);
+  if (h->extended_headers) {
+length = be64toh (h->sbuf.sr.hdr.structured_reply_ext.length);
+if (h->sbuf.sr.hdr.structured_reply_ext.pad) {
+  set_error (0, "server sent non-zero padding in structured reply header");
+  SET_NEXT_STATE(%.DEAD);
+  return 0;
+}
+  }
+  else
+length = be32toh (h->sbuf.sr.hdr.structured_reply.length);

   assert (cmd);
   assert (cmd->cookie == cookie);
@@ -97,6 +110,10 @@ STATE_MACHINE {
 SET_NEXT_STATE (%.DEAD);
 return 0;
   }
+  /* For convenience, we now normalize extended replies into compact,
+   * doable since we validated length fits in 32 bits.
+   */
+  h->sbuf.sr.hdr.structured_reply.length = length;

   if (NBD_REPLY_TYPE_IS_ERR (type)) {
 if (length < sizeof h->sbuf.sr.payload.error.error) {
@@ -207,7 +224,7 @@ STATE_MACHINE {
 SET_NEXT_STATE (%.READY);
 return 0;
   case 0:
-

[libnbd PATCH 09/13] block_status: Accept 64-bit extents during block status

2021-12-03 Thread Eric Blake

Support a server giving us a 64-bit extent.  Note that the protocol
says a server should not give a 64-bit answer when extended headers
are not negotiated, but since the client's size is merely a hint, it
is possible for a server to have a 64-bit answer even when the
original query was 32 bits.  At any rate, it is just as easy for us to
always support the new chunk type as it is to complain when it is used
incorrectly by the server, and the user's 32-bit callback doesn't have
to care which size the server's result used (either the server's
result was a 32-bit value, or our shim silently truncates it, but the
user still makes progress).  Of course, until a later patch enables
extended headers negotiation, no compliant server will trigger the new
code here.

Implementation-wise, we don't care if we will be narrowing from the
server's 16-byte extent (including explicit padding) to a 12-byte
struct, or if our 'nbd_extent' type has implicit padding and is thus
also 16 bytes; either way, the order of our byte-swapping traversal is
safe.
---
 lib/internal.h  |  1 +
 generator/states-reply-structured.c | 75 +++--
 2 files changed, 60 insertions(+), 16 deletions(-)

diff --git a/lib/internal.h b/lib/internal.h
index 4800df83..97abf4f2 100644
--- a/lib/internal.h
+++ b/lib/internal.h
@@ -289,6 +289,7 @@ struct nbd_handle {
   union {
 nbd_extent *normal; /* Our 64-bit preferred internal form */
 uint32_t *narrow;   /* 32-bit form of NBD_REPLY_TYPE_BLOCK_STATUS */
+struct nbd_block_descriptor_ext *wide; /* NBD_REPLY_TYPE_BLOCK_STATUS_EXT 
*/
   } bs_entries;

   /* Commands which are waiting to be issued [meaning the request
diff --git a/generator/states-reply-structured.c 
b/generator/states-reply-structured.c
index 71c761e9..29b1c3d8 100644
--- a/generator/states-reply-structured.c
+++ b/generator/states-reply-structured.c
@@ -22,6 +22,8 @@
 #include 
 #include 

+#include "minmax.h"
+
 /* Structured reply must be completely inside the bounds of the
  * requesting command.
  */
@@ -202,7 +204,8 @@ STATE_MACHINE {
 SET_NEXT_STATE (%RECV_OFFSET_HOLE);
 return 0;
   }
-  else if (type == NBD_REPLY_TYPE_BLOCK_STATUS) {
+  else if (type == NBD_REPLY_TYPE_BLOCK_STATUS ||
+   type == NBD_REPLY_TYPE_BLOCK_STATUS_EXT) {
 if (cmd->type != NBD_CMD_BLOCK_STATUS) {
   SET_NEXT_STATE (%.DEAD);
   set_error (0, "invalid command for receiving block-status chunk, "
@@ -211,12 +214,19 @@ STATE_MACHINE {
  cmd->type);
   return 0;
 }
-/* XXX We should be able to skip the bad reply in these two cases. */
-if (length < 12 || ((length-4) & 7) != 0) {
+/* XXX We should be able to skip the bad reply in these cases. */
+if (type == NBD_REPLY_TYPE_BLOCK_STATUS &&
+(length < 12 || (length-4) % (2 * sizeof(uint32_t {
   SET_NEXT_STATE (%.DEAD);
   set_error (0, "invalid length in NBD_REPLY_TYPE_BLOCK_STATUS");
   return 0;
 }
+if (type == NBD_REPLY_TYPE_BLOCK_STATUS_EXT &&
+(length < 20 || (length-4) % sizeof(struct nbd_block_descriptor_ext))) 
{
+  SET_NEXT_STATE (%.DEAD);
+  set_error (0, "invalid length in NBD_REPLY_TYPE_BLOCK_STATUS_EXT");
+  return 0;
+}
 if (CALLBACK_IS_NULL (cmd->cb.fn.extent)) {
   SET_NEXT_STATE (%.DEAD);
   set_error (0, "not expecting NBD_REPLY_TYPE_BLOCK_STATUS here");
@@ -495,6 +505,7 @@ STATE_MACHINE {
   struct command *cmd = h->reply_cmd;
   uint32_t length;
   uint32_t count;
+  uint16_t type;

   switch (recv_into_rbuf (h)) {
   case -1: SET_NEXT_STATE (%.DEAD); return 0;
@@ -504,24 +515,33 @@ STATE_MACHINE {
 return 0;
   case 0:
 length = h->sbuf.sr.hdr.structured_reply.length; /* normalized in CHECK */
+type = be16toh (h->sbuf.sr.hdr.structured_reply.type);

 assert (cmd); /* guaranteed by CHECK */
 assert (cmd->type == NBD_CMD_BLOCK_STATUS);
 assert (length >= 12);
 length -= sizeof h->bs_contextid;
-count = length / (2 * sizeof (uint32_t));
+if (type == NBD_REPLY_TYPE_BLOCK_STATUS)
+  count = length / (2 * sizeof (uint32_t));
+else {
+  assert (type == NBD_REPLY_TYPE_BLOCK_STATUS_EXT);
+  /* XXX Insist on h->extended_headers? */
+  count = length / sizeof (struct nbd_block_descriptor_ext);
+}

-/* Read raw data into a subset of h->bs_entries, then expand it
+/* Read raw data into an overlap of h->bs_entries, then move it
  * into place later later during byte-swapping.
  */
 free (h->bs_entries.normal);
-h->bs_entries.normal = malloc (count * sizeof *h->bs_entries.normal);
+h->bs_entries.normal = malloc (MAX (count * sizeof *h->bs_entries.normal,
+length));
 if (h->bs_entries.normal == NULL) {
   SET_NEXT_STATE (%.DEAD);
   set_error (errno, "malloc");
   return 0;
 }
-h->rbuf = h->bs_entries.narrow;
+h->rbuf = type == NBD_REPLY_TYPE_BLOCK_STATUS
+  ?

[libnbd PATCH 07/13] generator: Add struct nbd_extent in prep for 64-bit extents

2021-12-03 Thread Eric Blake

The existing nbd_block_status() callback is permanently stuck with an
array of uint32_t pairs (len/2 extents), and exposing 64-bit extents
requires a new API.  Before we get there, we first need a way to
express a struct containing uint64_t length and uint32_t flags across
the various language bindings in the callback that will be used by the
new API.

For the language bindings, we have to construct an array of a similar
struct in the target language's preferred format.  The bindings for
Python and OCaml were relatively straightforward; the Golang bindings
took a bit more effort for me to write.  Temporary unused attributes
are needed to keep the compiler happy until a later patch exposes a
new API using the new callback type.
---
 generator/API.ml| 12 +++-
 generator/API.mli   |  3 ++-
 generator/C.ml  | 24 +---
 generator/GoLang.ml | 24 
 generator/OCaml.ml  | 21 +
 generator/Python.ml | 30 ++
 ocaml/helpers.c | 22 +-
 ocaml/nbd-c.h   |  3 ++-
 golang/handle.go|  6 ++
 9 files changed, 130 insertions(+), 15 deletions(-)

diff --git a/generator/API.ml b/generator/API.ml
index cf2e7543..70ae721d 100644
--- a/generator/API.ml
+++ b/generator/API.ml
@@ -42,6 +42,7 @@
 | BytesPersistOut of string * string
 | Closure of closure
 | Enum of string * enum
+| Extent64 of string
 | Fd of string
 | Flags of string * flags
 | Int of string
@@ -142,6 +143,14 @@ let extent_closure =
 "nr_entries");
  CBMutable (Int "error") ]
 }
+let extent64_closure = {
+  cbname = "extent64";
+  cbargs = [ CBString "metacontext";
+ CBUInt64 "offset";
+ CBArrayAndLen (Extent64 "entries",
+"nr_entries");
+ CBMutable (Int "error") ]
+}
 let list_closure = {
   cbname = "list";
   cbargs = [ CBString "name"; CBString "description" ]
@@ -151,7 +160,8 @@ let context_closure =
   cbargs = [ CBString "name" ]
 }
 let all_closures = [ chunk_closure; completion_closure;
- debug_closure; extent_closure; list_closure;
+ debug_closure; extent_closure; extent64_closure;
+ list_closure;
  context_closure ]

 (* Enums. *)
diff --git a/generator/API.mli b/generator/API.mli
index d284637f..922d8120 100644
--- a/generator/API.mli
+++ b/generator/API.mli
@@ -1,6 +1,6 @@
 (* hey emacs, this is OCaml code: -*- tuareg -*- *)
 (* nbd client library in userspace: the API
- * Copyright (C) 2013-2020 Red Hat Inc.
+ * Copyright (C) 2013-2021 Red Hat Inc.
  *
  * This library is free software; you can redistribute it and/or
  * modify it under the terms of the GNU Lesser General Public
@@ -52,6 +52,7 @@ and
 | BytesPersistOut of string * string
 | Closure of closure   (** function pointer + void *opaque *)
 | Enum of string * enum(** enum/union type, int in C *)
+| Extent64 of string   (** extent descriptor, struct nbd_extent in C *)
 | Fd of string (** file descriptor *)
 | Flags of string * flags  (** flags, uint32_t in C *)
 | Int of string(** small int *)
diff --git a/generator/C.ml b/generator/C.ml
index 797af531..7b0be583 100644
--- a/generator/C.ml
+++ b/generator/C.ml
@@ -1,6 +1,6 @@
 (* hey emacs, this is OCaml code: -*- tuareg -*- *)
 (* nbd client library in userspace: generate the C API and documentation
- * Copyright (C) 2013-2020 Red Hat Inc.
+ * Copyright (C) 2013-2021 Red Hat Inc.
  *
  * This library is free software; you can redistribute it and/or
  * modify it under the terms of the GNU Lesser General Public
@@ -90,6 +90,7 @@ let
 | Closure { cbname } ->
[ sprintf "%s_callback" cbname; sprintf "%s_user_data" cbname ]
 | Enum (n, _) -> [n]
+| Extent64 n -> [n]
 | Fd n -> [n]
 | Flags (n, _) -> [n]
 | Int n -> [n]
@@ -152,6 +153,9 @@ and
   | Enum (n, _) ->
  if types then pr "int ";
  pr "%s" n
+  | Extent64 n ->
+ if types then pr "nbd_extent ";
+ pr "%s" n
   | Flags (n, _) ->
  if types then pr "uint32_t ";
  pr "%s" n
@@ -238,6 +242,11 @@ and
  pr "%s, " n;
  if types then pr "size_t ";
  pr "%s" len
+  | CBArrayAndLen (Extent64 n, len) ->
+ if types then pr "nbd_extent *";
+ pr "%s, " n;
+ if types then pr "size_t ";
+ pr "%s" len
   | CBArrayAndLen _ -> assert false
   | CBBytesIn (n, len) ->
  if types then pr "const void *";
@@ -388,6 +397,13 @@ let
   pr "extern int nbd_get_errno (void);\n";
   pr "#define LIBNBD_HAVE_NBD_GET_ERRNO 1\n";
   pr "\n";
+  pr "/* This is used in the callback for nbd_block_status_64.\n";
+  pr " */\n";
+  pr "typedef struct {\n";
+  pr "  uint64_t length;\n";
+  pr "  uint32_t flags;\n";
+  pr "} nbd_extent;\n";
+  pr "\n";
   print_closure_structs ();
   List.iter (
 fun (name, { args; optargs; ret }) ->
@@

[libnbd PATCH 04/13] protocol: Prepare to send 64-bit requests

2021-12-03 Thread Eric Blake

Support sending 64-bit requests if extended headers were negotiated.

At this point, h->extended_headers is permanently false (we can't
enable it until all other aspects of the protocol have likewise been
converted).
---
 lib/internal.h  | 12 ---
 generator/states-issue-command.c| 31 +++--
 generator/states-reply-structured.c |  2 +-
 lib/rw.c| 10 --
 4 files changed, 34 insertions(+), 21 deletions(-)

diff --git a/lib/internal.h b/lib/internal.h
index 7e96e8e9..07378588 100644
--- a/lib/internal.h
+++ b/lib/internal.h
@@ -1,5 +1,5 @@
 /* nbd client library in userspace: internal definitions
- * Copyright (C) 2013-2020 Red Hat Inc.
+ * Copyright (C) 2013-2021 Red Hat Inc.
  *
  * This library is free software; you can redistribute it and/or
  * modify it under the terms of the GNU Lesser General Public
@@ -106,6 +106,9 @@ struct nbd_handle {
   char *tls_username;   /* Username, NULL = use current username */
   char *tls_psk_file;   /* PSK filename, NULL = no PSK */

+  /* Extended headers. */
+  bool extended_headers;/* If we negotiated NBD_OPT_EXTENDED_HEADERS */
+
   /* Desired metadata contexts. */
   bool request_sr;
   string_vector request_meta_contexts;
@@ -242,7 +245,10 @@ struct nbd_handle {
   /* Issuing a command must use a buffer separate from sbuf, for the
* case when we interrupt a request to service a reply.
*/
-  struct nbd_request request;
+  union {
+struct nbd_request request;
+struct nbd_request_ext request_ext;
+  } req;
   bool in_write_payload;
   bool in_write_shutdown;

@@ -347,7 +353,7 @@ struct command {
   uint16_t type;
   uint64_t cookie;
   uint64_t offset;
-  uint32_t count;
+  uint64_t count;
   void *data; /* Buffer for read/write */
   struct command_cb cb;
   enum state state; /* State to resume with on next POLLIN */
diff --git a/generator/states-issue-command.c b/generator/states-issue-command.c
index a8101144..7b1d6dc7 100644
--- a/generator/states-issue-command.c
+++ b/generator/states-issue-command.c
@@ -1,5 +1,5 @@
 /* nbd client library in userspace: state machine
- * Copyright (C) 2013-2020 Red Hat Inc.
+ * Copyright (C) 2013-2021 Red Hat Inc.
  *
  * This library is free software; you can redistribute it and/or
  * modify it under the terms of the GNU Lesser General Public
@@ -41,14 +41,23 @@ STATE_MACHINE {
 return 0;
   }

-  h->request.magic = htobe32 (NBD_REQUEST_MAGIC);
-  h->request.flags = htobe16 (cmd->flags);
-  h->request.type = htobe16 (cmd->type);
-  h->request.handle = htobe64 (cmd->cookie);
-  h->request.offset = htobe64 (cmd->offset);
-  h->request.count = htobe32 ((uint32_t) cmd->count);
-  h->wbuf = >request;
-  h->wlen = sizeof (h->request);
+  /* These fields are coincident between req.request and req.request_ext */
+  h->req.request.flags = htobe16 (cmd->flags);
+  h->req.request.type = htobe16 (cmd->type);
+  h->req.request.handle = htobe64 (cmd->cookie);
+  h->req.request.offset = htobe64 (cmd->offset);
+  if (h->extended_headers) {
+h->req.request_ext.magic = htobe32 (NBD_REQUEST_EXT_MAGIC);
+h->req.request_ext.count = htobe64 (cmd->count);
+h->wlen = sizeof (h->req.request_ext);
+  }
+  else {
+assert (cmd->count <= UINT32_MAX);
+h->req.request.magic = htobe32 (NBD_REQUEST_MAGIC);
+h->req.request.count = htobe32 (cmd->count);
+h->wlen = sizeof (h->req.request);
+  }
+  h->wbuf = >req;
   if (cmd->type == NBD_CMD_WRITE || cmd->next)
 h->wflags = MSG_MORE;
   SET_NEXT_STATE (%SEND_REQUEST);
@@ -73,7 +82,7 @@ STATE_MACHINE {

   assert (h->cmds_to_issue != NULL);
   cmd = h->cmds_to_issue;
-  assert (cmd->cookie == be64toh (h->request.handle));
+  assert (cmd->cookie == be64toh (h->req.request.handle));
   if (cmd->type == NBD_CMD_WRITE) {
 h->wbuf = cmd->data;
 h->wlen = cmd->count;
@@ -119,7 +128,7 @@ STATE_MACHINE {
   assert (!h->wlen);
   assert (h->cmds_to_issue != NULL);
   cmd = h->cmds_to_issue;
-  assert (cmd->cookie == be64toh (h->request.handle));
+  assert (cmd->cookie == be64toh (h->req.request.handle));
   h->cmds_to_issue = cmd->next;
   if (h->cmds_to_issue_tail == cmd)
 h->cmds_to_issue_tail = NULL;
diff --git a/generator/states-reply-structured.c 
b/generator/states-reply-structured.c
index e1da850d..5524e000 100644
--- a/generator/states-reply-structured.c
+++ b/generator/states-reply-structured.c
@@ -34,7 +34,7 @@ structured_reply_in_bounds (uint64_t offset, uint32_t length,
   offset + length > cmd->offset + cmd->count) {
 set_error (0, "range of structured reply is out of bounds, "
"offset=%" PRIu64 ", cmd->offset=%" PRIu64 ", "
-   "length=%" PRIu32 ", cmd->count=%" PRIu32 ": "
+   "length=%" PRIu32 ", cmd->count=%" PRIu64 ": "
"this is likely to be a bug in the NBD server",
offset, cmd->offset, length, cmd->count);
 return false;
diff --git a/lib/rw.c

[libnbd PATCH 02/13] block_status: Refactor array storage

2021-12-03 Thread Eric Blake

For 32-bit block status, we were able to cheat and use an array with
an odd number of elements, with array[0] holding the context id, and
passing [1] to the user's callback.  But once we have 64-bit
extents, we can no longer abuse array element 0 like that.  Split out
a new state to receive the context id separately from the extents
array.  No behavioral change, other than the rare possibility of
landing in the new state.
---
 lib/internal.h  |  1 +
 generator/state_machine.ml  | 11 +-
 generator/states-reply-structured.c | 58 -
 3 files changed, 51 insertions(+), 19 deletions(-)

diff --git a/lib/internal.h b/lib/internal.h
index 0e205aba..7e96e8e9 100644
--- a/lib/internal.h
+++ b/lib/internal.h
@@ -274,6 +274,7 @@ struct nbd_handle {
   size_t querynum;

   /* When receiving block status, this is used. */
+  uint32_t bs_contextid;
   uint32_t *bs_entries;

   /* Commands which are waiting to be issued [meaning the request
diff --git a/generator/state_machine.ml b/generator/state_machine.ml
index 3bc77f24..99652948 100644
--- a/generator/state_machine.ml
+++ b/generator/state_machine.ml
@@ -1,6 +1,6 @@
 (* hey emacs, this is OCaml code: -*- tuareg -*- *)
 (* nbd client library in userspace: state machine definition
- * Copyright (C) 2013-2020 Red Hat Inc.
+ * Copyright (C) 2013-2021 Red Hat Inc.
  *
  * This library is free software; you can redistribute it and/or
  * modify it under the terms of the GNU Lesser General Public
@@ -862,10 +862,17 @@ and
 external_events = [];
   };

+  State {
+default_state with
+name = "RECV_BS_CONTEXTID";
+comment = "Receive contextid of structured reply block-status payload";
+external_events = [];
+  };
+
   State {
 default_state with
 name = "RECV_BS_ENTRIES";
-comment = "Receive a structured reply block-status payload";
+comment = "Receive entries array of structured reply block-status payload";
 external_events = [];
   };

diff --git a/generator/states-reply-structured.c 
b/generator/states-reply-structured.c
index 70010474..e1da850d 100644
--- a/generator/states-reply-structured.c
+++ b/generator/states-reply-structured.c
@@ -1,5 +1,5 @@
 /* nbd client library in userspace: state machine
- * Copyright (C) 2013-2019 Red Hat Inc.
+ * Copyright (C) 2013-2021 Red Hat Inc.
  *
  * This library is free software; you can redistribute it and/or
  * modify it under the terms of the GNU Lesser General Public
@@ -185,19 +185,10 @@ STATE_MACHINE {
   set_error (0, "not expecting NBD_REPLY_TYPE_BLOCK_STATUS here");
   return 0;
 }
-/* We read the context ID followed by all the entries into a
- * single array and deal with it at the end.
- */
-free (h->bs_entries);
-h->bs_entries = malloc (length);
-if (h->bs_entries == NULL) {
-  SET_NEXT_STATE (%.DEAD);
-  set_error (errno, "malloc");
-  return 0;
-}
-h->rbuf = h->bs_entries;
-h->rlen = length;
-SET_NEXT_STATE (%RECV_BS_ENTRIES);
+/* Start by reading the context ID. */
+h->rbuf = >bs_contextid;
+h->rlen = sizeof h->bs_contextid;
+SET_NEXT_STATE (%RECV_BS_CONTEXTID);
 return 0;
   }
   else {
@@ -452,9 +443,41 @@ STATE_MACHINE {
   }
   return 0;

+ REPLY.STRUCTURED_REPLY.RECV_BS_CONTEXTID:
+  struct command *cmd = h->reply_cmd;
+  uint32_t length;
+
+  switch (recv_into_rbuf (h)) {
+  case -1: SET_NEXT_STATE (%.DEAD); return 0;
+  case 1:
+save_reply_state (h);
+SET_NEXT_STATE (%.READY);
+return 0;
+  case 0:
+length = be32toh (h->sbuf.sr.structured_reply.length);
+
+assert (cmd); /* guaranteed by CHECK */
+assert (cmd->type == NBD_CMD_BLOCK_STATUS);
+assert (length >= 12);
+length -= sizeof h->bs_contextid;
+
+free (h->bs_entries);
+h->bs_entries = malloc (length);
+if (h->bs_entries == NULL) {
+  SET_NEXT_STATE (%.DEAD);
+  set_error (errno, "malloc");
+  return 0;
+}
+h->rbuf = h->bs_entries;
+h->rlen = length;
+SET_NEXT_STATE (%RECV_BS_ENTRIES);
+  }
+  return 0;
+
  REPLY.STRUCTURED_REPLY.RECV_BS_ENTRIES:
   struct command *cmd = h->reply_cmd;
   uint32_t length;
+  uint32_t count;
   size_t i;
   uint32_t context_id;
   struct meta_context *meta_context;
@@ -473,15 +496,16 @@ STATE_MACHINE {
 assert (CALLBACK_IS_NOT_NULL (cmd->cb.fn.extent));
 assert (h->bs_entries);
 assert (length >= 12);
+count = (length - sizeof h->bs_contextid) / sizeof *h->bs_entries;

 /* Need to byte-swap the entries returned, but apart from that we
  * don't validate them.
  */
-for (i = 0; i < length/4; ++i)
+for (i = 0; i < count; ++i)
   h->bs_entries[i] = be32toh (h->bs_entries[i]);

 /* Look up the context ID. */
-context_id = h->bs_entries[0];
+context_id = be32toh (h->bs_contextid);
 for (meta_context = h->meta_contexts;
  meta_context;
  meta_context = meta_context->next)
@@ -494,7 +518,7 @@ STATE_MACHINE {

[PATCH 12/14] nbd/client: Accept 64-bit block status chunks

2021-12-03 Thread Eric Blake

Because we use NBD_CMD_FLAG_REQ_ONE with NBD_CMD_BLOCK_STATUS, a
client in narrow mode should not be able to provoke a server into
sending a block status result larger than the client's 32-bit request.
But in extended mode, a 64-bit status request must be able to handle a
64-bit status result, once a future patch enables the client
requesting extended mode.  We can also tolerate a non-compliant server
sending the new chunk even when it should not.

Signed-off-by: Eric Blake 
---
 block/nbd.c | 38 +++---
 1 file changed, 27 insertions(+), 11 deletions(-)

diff --git a/block/nbd.c b/block/nbd.c
index c5dea864ebb6..bd4a9c407bde 100644
--- a/block/nbd.c
+++ b/block/nbd.c
@@ -563,13 +563,15 @@ static int nbd_parse_offset_hole_payload(BDRVNBDState *s,
  */
 static int nbd_parse_blockstatus_payload(BDRVNBDState *s,
  NBDStructuredReplyChunk *chunk,
- uint8_t *payload, uint64_t 
orig_length,
- NBDExtent *extent, Error **errp)
+ uint8_t *payload, bool wide,
+ uint64_t orig_length,
+ NBDExtentExt *extent, Error **errp)
 {
 uint32_t context_id;
+size_t len = wide ? sizeof(*extent) : sizeof(NBDExtent);

 /* The server succeeded, so it must have sent [at least] one extent */
-if (chunk->length < sizeof(context_id) + sizeof(*extent)) {
+if (chunk->length < sizeof(context_id) + len) {
 error_setg(errp, "Protocol error: invalid payload for "
  "NBD_REPLY_TYPE_BLOCK_STATUS");
 return -EINVAL;
@@ -584,8 +586,16 @@ static int nbd_parse_blockstatus_payload(BDRVNBDState *s,
 return -EINVAL;
 }

-extent->length = payload_advance32();
-extent->flags = payload_advance32();
+if (wide) {
+extent->length = payload_advance64();
+extent->flags = payload_advance32();
+if (payload_advance32() != 0) {
+trace_nbd_parse_blockstatus_compliance("non-zero extent padding");
+}
+} else {
+extent->length = payload_advance32();
+extent->flags = payload_advance32();
+}

 if (extent->length == 0) {
 error_setg(errp, "Protocol error: server sent status chunk with "
@@ -625,7 +635,7 @@ static int nbd_parse_blockstatus_payload(BDRVNBDState *s,
  * connection; just ignore trailing extents, and clamp things to
  * the length of our request.
  */
-if (chunk->length > sizeof(context_id) + sizeof(*extent)) {
+if (chunk->length > sizeof(context_id) + len) {
 trace_nbd_parse_blockstatus_compliance("more than one extent");
 }
 if (extent->length > orig_length) {
@@ -1081,7 +1091,7 @@ static int nbd_co_receive_cmdread_reply(BDRVNBDState *s, 
uint64_t handle,

 static int nbd_co_receive_blockstatus_reply(BDRVNBDState *s,
 uint64_t handle, uint64_t length,
-NBDExtent *extent,
+NBDExtentExt *extent,
 int *request_ret, Error **errp)
 {
 NBDReplyChunkIter iter;
@@ -1098,6 +1108,11 @@ static int nbd_co_receive_blockstatus_reply(BDRVNBDState 
*s,
 assert(nbd_reply_is_structured());

 switch (chunk->type) {
+case NBD_REPLY_TYPE_BLOCK_STATUS_EXT:
+if (!s->info.extended_headers) {
+trace_nbd_extended_headers_compliance("block_status_ext");
+}
+/* fallthrough */
 case NBD_REPLY_TYPE_BLOCK_STATUS:
 if (received) {
 nbd_channel_error(s, -EINVAL);
@@ -1106,9 +1121,10 @@ static int nbd_co_receive_blockstatus_reply(BDRVNBDState 
*s,
 }
 received = true;

-ret = nbd_parse_blockstatus_payload(s, ,
-payload, length, extent,
-_err);
+ret = nbd_parse_blockstatus_payload(
+s, , payload,
+chunk->type == NBD_REPLY_TYPE_BLOCK_STATUS_EXT,
+length, extent, _err);
 if (ret < 0) {
 nbd_channel_error(s, ret);
 nbd_iter_channel_error(, ret, _err);
@@ -1337,7 +1353,7 @@ static int coroutine_fn nbd_client_co_block_status(
 int64_t *pnum, int64_t *map, BlockDriverState **file)
 {
 int ret, request_ret;
-NBDExtent extent = { 0 };
+NBDExtentExt extent = { 0 };
 BDRVNBDState *s = (BDRVNBDState *)bs->opaque;
 Error *local_err = NULL;

-- 
2.33.1

[PATCH 13/14] nbd/client: Request extended headers during negotiation

2021-12-03 Thread Eric Blake

All the pieces are in place for a client to finally request extended
headers.  Note that we must not request extended headers when qemu-nbd
is used to connect to the kernel module (as nbd.ko does not expect
them), but there is no harm in all other clients requesting them.

Extended headers do not make a difference to the information collected
during 'qemu-nbd --list', but probing for it gives us one more piece
of information in that output.  Update the iotests affected by the new
line of output.

Signed-off-by: Eric Blake 
---
 nbd/client-connection.c   |  1 +
 nbd/client.c  | 26 ---
 qemu-nbd.c|  2 ++
 tests/qemu-iotests/223.out|  4 +++
 tests/qemu-iotests/233.out|  1 +
 tests/qemu-iotests/241|  8 +++---
 tests/qemu-iotests/307|  2 +-
 tests/qemu-iotests/307.out|  5 
 .../tests/nbd-qemu-allocation.out |  1 +
 9 files changed, 41 insertions(+), 9 deletions(-)

diff --git a/nbd/client-connection.c b/nbd/client-connection.c
index 695f85575414..d8b9ae230264 100644
--- a/nbd/client-connection.c
+++ b/nbd/client-connection.c
@@ -87,6 +87,7 @@ NBDClientConnection *nbd_client_connection_new(const 
SocketAddress *saddr,

 .initial_info.request_sizes = true,
 .initial_info.structured_reply = true,
+.initial_info.extended_headers = true,
 .initial_info.base_allocation = true,
 .initial_info.x_dirty_bitmap = g_strdup(x_dirty_bitmap),
 .initial_info.name = g_strdup(export_name ?: "")
diff --git a/nbd/client.c b/nbd/client.c
index f1aa5256c8bf..0e227255d59b 100644
--- a/nbd/client.c
+++ b/nbd/client.c
@@ -882,8 +882,8 @@ static int nbd_list_meta_contexts(QIOChannel *ioc,
 static int nbd_start_negotiate(AioContext *aio_context, QIOChannel *ioc,
QCryptoTLSCreds *tlscreds,
const char *hostname, QIOChannel **outioc,
-   bool structured_reply, bool *zeroes,
-   Error **errp)
+   bool structured_reply, bool *ext_hdrs,
+   bool *zeroes, Error **errp)
 {
 ERRP_GUARD();
 uint64_t magic;
@@ -960,6 +960,15 @@ static int nbd_start_negotiate(AioContext *aio_context, 
QIOChannel *ioc,
 if (fixedNewStyle) {
 int result = 0;

+if (ext_hdrs && *ext_hdrs) {
+result = nbd_request_simple_option(ioc,
+   NBD_OPT_EXTENDED_HEADERS,
+   false, errp);
+if (result < 0) {
+return -EINVAL;
+}
+*ext_hdrs = result == 1;
+}
 if (structured_reply) {
 result = nbd_request_simple_option(ioc,
NBD_OPT_STRUCTURED_REPLY,
@@ -970,6 +979,9 @@ static int nbd_start_negotiate(AioContext *aio_context, 
QIOChannel *ioc,
 }
 return 2 + result;
 } else {
+if (ext_hdrs) {
+*ext_hdrs = false;
+}
 return 1;
 }
 } else if (magic == NBD_CLIENT_MAGIC) {
@@ -977,6 +989,9 @@ static int nbd_start_negotiate(AioContext *aio_context, 
QIOChannel *ioc,
 error_setg(errp, "Server does not support STARTTLS");
 return -EINVAL;
 }
+if (ext_hdrs) {
+*ext_hdrs = false;
+}
 return 0;
 } else {
 error_setg(errp, "Bad server magic received: 0x%" PRIx64, magic);
@@ -1030,7 +1045,8 @@ int nbd_receive_negotiate(AioContext *aio_context, 
QIOChannel *ioc,
 trace_nbd_receive_negotiate_name(info->name);

 result = nbd_start_negotiate(aio_context, ioc, tlscreds, hostname, outioc,
- info->structured_reply, , errp);
+ info->structured_reply,
+ >extended_headers, , errp);

 info->structured_reply = false;
 info->base_allocation = false;
@@ -1147,10 +1163,11 @@ int nbd_receive_export_list(QIOChannel *ioc, 
QCryptoTLSCreds *tlscreds,
 int ret = -1;
 NBDExportInfo *array = NULL;
 QIOChannel *sioc = NULL;
+bool ext_hdrs;

 *info = NULL;
 result = nbd_start_negotiate(NULL, ioc, tlscreds, hostname, , true,
- NULL, errp);
+ _hdrs, NULL, errp);
 if (tlscreds && sioc) {
 ioc = sioc;
 }
@@ -1179,6 +1196,7 @@ int nbd_receive_export_list(QIOChannel *ioc, 
QCryptoTLSCreds *tlscreds,
 array[count - 1].name = name;
 array[count - 1].description = desc;
 array[count - 1].structured_reply = result == 3;
+array[count -

[libnbd PATCH 01/13] golang: Simplify nbd_block_status callback array copy

2021-12-03 Thread Eric Blake

In the block status callback glue code, we need to copy a C uint32_t[]
into a golang []uint32.  The copy is necessary since the lifetime of
the C array is not guaranteed to outlive whatever the Go callback may
have done with what it was handed; copying ensures that the user's Go
code doesn't have to worry about lifetime issues.  But we don't have
to have quite so many casts and pointer additions: since we can assume
C.uint32_t and uint32 occupy the same amount of memory (even though
they are different types), we can exploit Go's ability to treat an
unsafe pointer as if it were an oversized array, take a slice of that
array, and then use idiomatic Go to copy from the slice.

https://github.com/golang/go/wiki/cgo#turning-c-arrays-into-go-slices
---
 generator/GoLang.ml | 15 +--
 1 file changed, 9 insertions(+), 6 deletions(-)

diff --git a/generator/GoLang.ml b/generator/GoLang.ml
index eb3aa263..d3b7dc79 100644
--- a/generator/GoLang.ml
+++ b/generator/GoLang.ml
@@ -1,6 +1,6 @@
 (* hey emacs, this is OCaml code: -*- tuareg -*- *)
 (* nbd client library in userspace: generator
- * Copyright (C) 2013-2020 Red Hat Inc.
+ * Copyright (C) 2013-2021 Red Hat Inc.
  *
  * This library is free software; you can redistribute it and/or
  * modify it under the terms of the GNU Lesser General Public
@@ -514,11 +514,14 @@ let
 /* Closures. */

 func copy_uint32_array (entries *C.uint32_t, count C.size_t) []uint32 {
-ret := make([]uint32, int (count))
-for i := 0; i < int (count); i++ {
-   entry := (*C.uint32_t) (unsafe.Pointer(uintptr(unsafe.Pointer(entries)) 
+ (unsafe.Sizeof(*entries) * uintptr(i
-   ret[i] = uint32 (*entry)
-}
+/* https://github.com/golang/go/wiki/cgo#turning-c-arrays-into-go-slices */
+unsafePtr := unsafe.Pointer(entries)
+/* Max structured reply payload is 64M, so this array size is more than
+ * sufficient for the underlying slice we want to access.
+ */
+arrayPtr := (*[1 << 20]uint32)(unsafePtr)
+ret := make([]uint32, count)
+copy(ret, arrayPtr[:count:count])
 return ret
 }
 ";
-- 
2.33.1

[libnbd PATCH 08/13] block_status: Track 64-bit extents internally

2021-12-03 Thread Eric Blake

When extended headers are in use, the server can send us 64-bit
extents, even for a 32-bit query (if the server knows the entire image
is data, for example).  For maximum flexibility, we are thus better
off storing 64-bit lengths internally, even if we have to convert it
back to 32-bit lengths when invoking the user's 32-bit callback.  The
next patch will then add a new API for letting the user access the
full 64-bit extent information.  The goal is to let both APIs work all
the time, regardless of the size extents that the server actually
answered with.

Note that when using the old nbd_block_status() API with a server that
lacks extended headers, we now add a double-conversion speed penalty
(converting the server's 32-bit answer into 64-bit internally and back
to 32-bit for the callback).  But the speed penalty will not be a
problem for applications using the new nbd_block_status_64() API (we
have to give a 64-bit answer no matter what the server answered), and
ideally the situation will become less common as more servers learn
extended headers.  So for now I chose to unconditionally use a 64-bit
internal representation; but if it turns out to have noticeable
degredation, we could tweak things to conditionally retain 32-bit
internal representation for servers lacking extended headers at the
expense of more code maintenance.

One of the trickier aspects of this patch is auditing that both the
user's extent and our malloc'd shim get cleaned up once on all
possible paths, so that there is neither a leak nor a double free.
---
 lib/internal.h  |  7 +++-
 generator/states-reply-structured.c | 31 ++-
 lib/handle.c|  4 +-
 lib/rw.c| 59 -
 4 files changed, 85 insertions(+), 16 deletions(-)

diff --git a/lib/internal.h b/lib/internal.h
index 06f3a65c..4800df83 100644
--- a/lib/internal.h
+++ b/lib/internal.h
@@ -75,7 +75,7 @@ struct export {

 struct command_cb {
   union {
-nbd_extent_callback extent;
+nbd_extent64_callback extent;
 nbd_chunk_callback chunk;
 nbd_list_callback list;
 nbd_context_callback context;
@@ -286,7 +286,10 @@ struct nbd_handle {

   /* When receiving block status, this is used. */
   uint32_t bs_contextid;
-  uint32_t *bs_entries;
+  union {
+nbd_extent *normal; /* Our 64-bit preferred internal form */
+uint32_t *narrow;   /* 32-bit form of NBD_REPLY_TYPE_BLOCK_STATUS */
+  } bs_entries;

   /* Commands which are waiting to be issued [meaning the request
* packet is sent to the server].  This is used as a simple linked
diff --git a/generator/states-reply-structured.c 
b/generator/states-reply-structured.c
index a3e0e2ac..71c761e9 100644
--- a/generator/states-reply-structured.c
+++ b/generator/states-reply-structured.c
@@ -494,6 +494,7 @@ STATE_MACHINE {
  REPLY.STRUCTURED_REPLY.RECV_BS_CONTEXTID:
   struct command *cmd = h->reply_cmd;
   uint32_t length;
+  uint32_t count;

   switch (recv_into_rbuf (h)) {
   case -1: SET_NEXT_STATE (%.DEAD); return 0;
@@ -508,15 +509,19 @@ STATE_MACHINE {
 assert (cmd->type == NBD_CMD_BLOCK_STATUS);
 assert (length >= 12);
 length -= sizeof h->bs_contextid;
+count = length / (2 * sizeof (uint32_t));

-free (h->bs_entries);
-h->bs_entries = malloc (length);
-if (h->bs_entries == NULL) {
+/* Read raw data into a subset of h->bs_entries, then expand it
+ * into place later later during byte-swapping.
+ */
+free (h->bs_entries.normal);
+h->bs_entries.normal = malloc (count * sizeof *h->bs_entries.normal);
+if (h->bs_entries.normal == NULL) {
   SET_NEXT_STATE (%.DEAD);
   set_error (errno, "malloc");
   return 0;
 }
-h->rbuf = h->bs_entries;
+h->rbuf = h->bs_entries.narrow;
 h->rlen = length;
 SET_NEXT_STATE (%RECV_BS_ENTRIES);
   }
@@ -528,6 +533,7 @@ STATE_MACHINE {
   uint32_t count;
   size_t i;
   uint32_t context_id;
+  uint32_t *raw;
   struct meta_context *meta_context;

   switch (recv_into_rbuf (h)) {
@@ -542,15 +548,20 @@ STATE_MACHINE {
 assert (cmd); /* guaranteed by CHECK */
 assert (cmd->type == NBD_CMD_BLOCK_STATUS);
 assert (CALLBACK_IS_NOT_NULL (cmd->cb.fn.extent));
-assert (h->bs_entries);
+assert (h->bs_entries.normal);
 assert (length >= 12);
-count = (length - sizeof h->bs_contextid) / sizeof *h->bs_entries;
+count = (length - sizeof h->bs_contextid) / (2 * sizeof (uint32_t));

 /* Need to byte-swap the entries returned, but apart from that we
- * don't validate them.
+ * don't validate them.  Reverse order is essential, since we are
+ * expanding in-place from narrow to wider type.
  */
-for (i = 0; i < count; ++i)
-  h->bs_entries[i] = be32toh (h->bs_entries[i]);
+raw = h->bs_entries.narrow;
+for (i = count; i > 0; ) {
+  --i;
+  h->bs_entries.normal[i].flags = be32toh (raw[i * 2 + 1]);
+  h->bs_entries.normal[i].length = be32toh

[libnbd PATCH 06/13] protocol: Accept 64-bit holes during pread

2021-12-03 Thread Eric Blake

Even though we don't allow the user to request NBD_CMD_READ with more
than 64M (and even if we did, our API signature caps us at SIZE_MAX,
which is 32 bits on a 32-bit machine), the NBD extension to allow
64-bit requests implies that for symmetry we have to be able to
support 64-bit holes over the wire.  Note that we don't have to change
the signature of the callback for nbd_pread_structured; nor is it
worth adding a counterpart to LIBNBD_READ_HOLE, because it is unlikely
that a user callback will ever need to distinguish between which size
was sent over the wire, when the value is always less than 32 bits.

While we cannot guarantee which size structured reply the server will
use, it is easy enough to handle both sizes, even for a non-compliant
server that sends wide replies when extended headers were not
negotiated.  Of course, until a later patch enables extended headers
negotiation, no compliant server will trigger the new code here.
---
 lib/internal.h  |  1 +
 generator/states-reply-structured.c | 41 +
 2 files changed, 37 insertions(+), 5 deletions(-)

diff --git a/lib/internal.h b/lib/internal.h
index c9f84441..06f3a65c 100644
--- a/lib/internal.h
+++ b/lib/internal.h
@@ -231,6 +231,7 @@ struct nbd_handle {
   union {
 struct nbd_structured_reply_offset_data offset_data;
 struct nbd_structured_reply_offset_hole offset_hole;
+struct nbd_structured_reply_offset_hole_ext offset_hole_ext;
 struct {
   struct nbd_structured_reply_error error;
   char msg[NBD_MAX_STRING]; /* Common to all error types */
diff --git a/generator/states-reply-structured.c 
b/generator/states-reply-structured.c
index 1b675e8d..a3e0e2ac 100644
--- a/generator/states-reply-structured.c
+++ b/generator/states-reply-structured.c
@@ -26,15 +26,16 @@
  * requesting command.
  */
 static bool
-structured_reply_in_bounds (uint64_t offset, uint32_t length,
+structured_reply_in_bounds (uint64_t offset, uint64_t length,
 const struct command *cmd)
 {
   if (offset < cmd->offset ||
   offset >= cmd->offset + cmd->count ||
-  offset + length > cmd->offset + cmd->count) {
+  length > cmd->offset + cmd->count ||
+  offset > cmd->offset + cmd->count - length) {
 set_error (0, "range of structured reply is out of bounds, "
"offset=%" PRIu64 ", cmd->offset=%" PRIu64 ", "
-   "length=%" PRIu32 ", cmd->count=%" PRIu64 ": "
+   "length=%" PRIu64 ", cmd->count=%" PRIu64 ": "
"this is likely to be a bug in the NBD server",
offset, cmd->offset, length, cmd->count);
 return false;
@@ -182,6 +183,25 @@ STATE_MACHINE {
 SET_NEXT_STATE (%RECV_OFFSET_HOLE);
 return 0;
   }
+  else if (type == NBD_REPLY_TYPE_OFFSET_HOLE_EXT) {
+if (cmd->type != NBD_CMD_READ) {
+  SET_NEXT_STATE (%.DEAD);
+  set_error (0, "invalid command for receiving offset-hole chunk, "
+ "cmd->type=%" PRIu16 ", "
+ "this is likely to be a bug in the server",
+ cmd->type);
+  return 0;
+}
+if (length != sizeof h->sbuf.sr.payload.offset_hole_ext) {
+  SET_NEXT_STATE (%.DEAD);
+  set_error (0, "invalid length in NBD_REPLY_TYPE_OFFSET_HOLE_EXT");
+  return 0;
+}
+h->rbuf = >sbuf.sr.payload.offset_hole_ext;
+h->rlen = sizeof h->sbuf.sr.payload.offset_hole_ext;
+SET_NEXT_STATE (%RECV_OFFSET_HOLE);
+return 0;
+  }
   else if (type == NBD_REPLY_TYPE_BLOCK_STATUS) {
 if (cmd->type != NBD_CMD_BLOCK_STATUS) {
   SET_NEXT_STATE (%.DEAD);
@@ -415,7 +435,8 @@ STATE_MACHINE {
  REPLY.STRUCTURED_REPLY.RECV_OFFSET_HOLE:
   struct command *cmd = h->reply_cmd;
   uint64_t offset;
-  uint32_t length;
+  uint64_t length;
+  uint16_t type;

   switch (recv_into_rbuf (h)) {
   case -1: SET_NEXT_STATE (%.DEAD); return 0;
@@ -425,7 +446,14 @@ STATE_MACHINE {
 return 0;
   case 0:
 offset = be64toh (h->sbuf.sr.payload.offset_hole.offset);
-length = be32toh (h->sbuf.sr.payload.offset_hole.length);
+type = be16toh (h->sbuf.sr.hdr.structured_reply.type);
+
+if (type == NBD_REPLY_TYPE_OFFSET_HOLE)
+  length = be32toh (h->sbuf.sr.payload.offset_hole.length);
+else {
+  /* XXX Insist on h->extended_headers? */
+  length = be64toh (h->sbuf.sr.payload.offset_hole_ext.length);
+}

 assert (cmd); /* guaranteed by CHECK */

@@ -443,7 +471,10 @@ STATE_MACHINE {
 /* The spec states that 0-length requests are unspecified, but
  * 0-length replies are broken. Still, it's easy enough to support
  * them as an extension, and this works even when length == 0.
+ * Although length is 64 bits, the bounds check above ensures that
+ * it is no larger than the 64M cap we put on NBD_CMD_READ.
  */
+assert (length <= SIZE_MAX);
 memset (cmd->data + offset, 0, length);
 if (CALLBACK_IS_NOT_NULL (cmd->cb.fn.chunk)) {

[libnbd PATCH 00/13] libnbd patches for NBD_OPT_EXTENDED_HEADERS

2021-12-03 Thread Eric Blake

Available here: https://repo.or.cz/libnbd/ericb.git/shortlog/refs/tags/exthdr-v1

I also want to do followup patches to teach 'nbdinfo --map' and
'nbdcopy' to utilize 64-bit extents.

Eric Blake (13):
  golang: Simplify nbd_block_status callback array copy
  block_status: Refactor array storage
  protocol: Add definitions for extended headers
  protocol: Prepare to send 64-bit requests
  protocol: Prepare to receive 64-bit replies
  protocol: Accept 64-bit holes during pread
  generator: Add struct nbd_extent in prep for 64-bit extents
  block_status: Track 64-bit extents internally
  block_status: Accept 64-bit extents during block status
  api: Add [aio_]nbd_block_status_64
  api: Add three functions for controlling extended headers
  generator: Actually request extended headers
  interop: Add test of 64-bit block status

 lib/internal.h|  31 ++-
 lib/nbd-protocol.h|  61 -
 generator/API.ml  | 237 --
 generator/API.mli |   3 +-
 generator/C.ml|  24 +-
 generator/GoLang.ml   |  35 ++-
 generator/Makefile.am |   3 +-
 generator/OCaml.ml|  20 +-
 generator/Python.ml   |  29 ++-
 generator/state_machine.ml|  52 +++-
 generator/states-issue-command.c  |  31 ++-
 .../states-newstyle-opt-extended-headers.c|  90 +++
 generator/states-newstyle-opt-starttls.c  |  10 +-
 generator/states-reply-structured.c   | 220 
 generator/states-reply.c  |  31 ++-
 lib/handle.c  |  27 +-
 lib/rw.c  | 105 +++-
 python/t/110-defaults.py  |   3 +-
 python/t/120-set-non-defaults.py  |   4 +-
 python/t/465-block-status-64.py   |  56 +
 ocaml/helpers.c   |  22 +-
 ocaml/nbd-c.h |   3 +-
 ocaml/tests/Makefile.am   |   5 +-
 ocaml/tests/test_110_defaults.ml  |   4 +-
 ocaml/tests/test_120_set_non_defaults.ml  |   5 +-
 ocaml/tests/test_465_block_status_64.ml   |  58 +
 tests/meta-base-allocation.c  | 111 +++-
 interop/Makefile.am   |   6 +
 interop/large-status.c| 186 ++
 interop/large-status.sh   |  49 
 .gitignore|   1 +
 golang/Makefile.am|   3 +-
 golang/handle.go  |   6 +
 golang/libnbd_110_defaults_test.go|   8 +
 golang/libnbd_120_set_non_defaults_test.go|  12 +
 golang/libnbd_465_block_status_64_test.go | 119 +
 36 files changed, 1511 insertions(+), 159 deletions(-)
 create mode 100644 generator/states-newstyle-opt-extended-headers.c
 create mode 100644 python/t/465-block-status-64.py
 create mode 100644 ocaml/tests/test_465_block_status_64.ml
 create mode 100644 interop/large-status.c
 create mode 100755 interop/large-status.sh
 create mode 100644 golang/libnbd_465_block_status_64_test.go

-- 
2.33.1

[PATCH 14/14] do not apply: nbd/server: Send 64-bit hole chunk

2021-12-03 Thread Eric Blake

Since we cap NBD_CMD_READ requests to 32M, we never have a reason to
send a 64-bit chunk type for a hole; but it is worth producing these
for interoperability testing of clients that want extended headers.
---
 nbd/server.c | 18 ++
 1 file changed, 14 insertions(+), 4 deletions(-)

diff --git a/nbd/server.c b/nbd/server.c
index 7e6140350797..4369a9a8ff08 100644
--- a/nbd/server.c
+++ b/nbd/server.c
@@ -2071,19 +2071,29 @@ static int coroutine_fn 
nbd_co_send_sparse_read(NBDClient *client,
 if (status & BDRV_BLOCK_ZERO) {
 NBDReply hdr;
 NBDStructuredReadHole chunk;
+NBDStructuredReadHoleExt chunk_ext;
 struct iovec iov[] = {
 {.iov_base = },
-{.iov_base = , .iov_len = sizeof(chunk)},
+{.iov_base = client->extended_headers ? _ext
+ : (void *) ,
+ .iov_len = client->extended_headers ? sizeof(chunk_ext)
+ : sizeof(chunk)},
 };

 trace_nbd_co_send_structured_read_hole(handle, offset + progress,
pnum);
 set_be_chunk(client, [0],
  final ? NBD_REPLY_FLAG_DONE : 0,
- NBD_REPLY_TYPE_OFFSET_HOLE,
+ client->extended_headers ? 
NBD_REPLY_TYPE_OFFSET_HOLE_EXT
+ : NBD_REPLY_TYPE_OFFSET_HOLE,
  handle, iov[1].iov_len);
-stq_be_p(, offset + progress);
-stl_be_p(, pnum);
+if (client->extended_headers) {
+stq_be_p(_ext.offset, offset + progress);
+stq_be_p(_ext.length, pnum);
+} else {
+stq_be_p(, offset + progress);
+stl_be_p(, pnum);
+}
 ret = nbd_co_send_iov(client, iov, 2, errp);
 } else {
 ret = blk_pread(exp->common.blk, offset + progress,
-- 
2.33.1

[PATCH 10/14] nbd/client: Initial support for extended headers

2021-12-03 Thread Eric Blake

Update the client code to be able to send an extended request, and
parse an extended header from the server.  Note that since we reject
any structured reply with a too-large payload, we can always normalize
a valid header back into the compact form, so that the caller need not
deal with two branches of a union.  Still, until a later patch lets
the client negotiate extended headers, the code added here should not
be reached.  Note that because of the different magic numbers, it is
just as easy to trace and then tolerate a non-compliant server sending
the wrong header reply as it would be to insist that the server is
compliant.

The only caller to nbd_receive_reply() always passed NULL for errp;
since we are changing the signature anyways, I decided to sink the
decision to ignore errors one layer lower.

Signed-off-by: Eric Blake 
---
 include/block/nbd.h |   2 +-
 block/nbd.c |   3 +-
 nbd/client.c| 112 +++-
 nbd/trace-events|   1 +
 4 files changed, 84 insertions(+), 34 deletions(-)

diff --git a/include/block/nbd.h b/include/block/nbd.h
index 5f9d86a86352..d489c67d98dc 100644
--- a/include/block/nbd.h
+++ b/include/block/nbd.h
@@ -366,7 +366,7 @@ int nbd_init(int fd, QIOChannelSocket *sioc, NBDExportInfo 
*info,
  Error **errp);
 int nbd_send_request(QIOChannel *ioc, NBDRequest *request, bool ext_hdr);
 int coroutine_fn nbd_receive_reply(BlockDriverState *bs, QIOChannel *ioc,
-   NBDReply *reply, Error **errp);
+   NBDReply *reply, bool ext_hdrs);
 int nbd_client(int fd);
 int nbd_disconnect(int fd);
 int nbd_errno_to_system_errno(int err);
diff --git a/block/nbd.c b/block/nbd.c
index 3e9875241bec..da5e6ac2d9a5 100644
--- a/block/nbd.c
+++ b/block/nbd.c
@@ -401,7 +401,8 @@ static coroutine_fn int nbd_receive_replies(BDRVNBDState 
*s, uint64_t handle)

 /* We are under mutex and handle is 0. We have to do the dirty work. */
 assert(s->reply.handle == 0);
-ret = nbd_receive_reply(s->bs, s->ioc, >reply, NULL);
+ret = nbd_receive_reply(s->bs, s->ioc, >reply,
+s->info.extended_headers);
 if (ret <= 0) {
 ret = ret ? ret : -EIO;
 nbd_channel_error(s, ret);
diff --git a/nbd/client.c b/nbd/client.c
index aa162b9d08d5..f1aa5256c8bf 100644
--- a/nbd/client.c
+++ b/nbd/client.c
@@ -1347,22 +1347,28 @@ int nbd_disconnect(int fd)

 int nbd_send_request(QIOChannel *ioc, NBDRequest *request, bool ext_hdr)
 {
-uint8_t buf[NBD_REQUEST_SIZE];
+uint8_t buf[NBD_REQUEST_EXT_SIZE];
+size_t len;

-assert(!ext_hdr);
-assert(request->len <= UINT32_MAX);
 trace_nbd_send_request(request->from, request->len, request->handle,
request->flags, request->type,
nbd_cmd_lookup(request->type));

-stl_be_p(buf, NBD_REQUEST_MAGIC);
+stl_be_p(buf, ext_hdr ? NBD_REQUEST_EXT_MAGIC : NBD_REQUEST_MAGIC);
 stw_be_p(buf + 4, request->flags);
 stw_be_p(buf + 6, request->type);
 stq_be_p(buf + 8, request->handle);
 stq_be_p(buf + 16, request->from);
-stl_be_p(buf + 24, request->len);
+if (ext_hdr) {
+stq_be_p(buf + 24, request->len);
+len = NBD_REQUEST_EXT_SIZE;
+} else {
+assert(request->len <= UINT32_MAX);
+stl_be_p(buf + 24, request->len);
+len = NBD_REQUEST_SIZE;
+}

-return nbd_write(ioc, buf, sizeof(buf), NULL);
+return nbd_write(ioc, buf, len, NULL);
 }

 /* nbd_receive_simple_reply
@@ -1370,49 +1376,69 @@ int nbd_send_request(QIOChannel *ioc, NBDRequest 
*request, bool ext_hdr)
  * Payload is not read (payload is possible for CMD_READ, but here we even
  * don't know whether it take place or not).
  */
-static int nbd_receive_simple_reply(QIOChannel *ioc, NBDSimpleReply *reply,
+static int nbd_receive_simple_reply(QIOChannel *ioc, NBDReply *reply,
 Error **errp)
 {
 int ret;
+size_t len;

-assert(reply->magic == NBD_SIMPLE_REPLY_MAGIC);
+if (reply->magic == NBD_SIMPLE_REPLY_MAGIC) {
+len = sizeof(reply->simple);
+} else {
+assert(reply->magic == NBD_SIMPLE_REPLY_EXT_MAGIC);
+len = sizeof(reply->simple_ext);
+}

 ret = nbd_read(ioc, (uint8_t *)reply + sizeof(reply->magic),
-   sizeof(*reply) - sizeof(reply->magic), "reply", errp);
+   len - sizeof(reply->magic), "reply", errp);
 if (ret < 0) {
 return ret;
 }

-reply->error = be32_to_cpu(reply->error);
-reply->handle = be64_to_cpu(reply->handle);
+/* error and handle occupy same space between forms */
+reply->simple.error = be32_to_cpu(reply->simple.error);
+reply->simple.handle = be64_to_cpu(reply->handle);
+if (reply->magic == NBD_SIMPLE_REPLY_EXT_MAGIC) {
+if (reply->simple_ext._pad1 || reply->simple_ext._pad2) {
+

[PATCH 11/14] nbd/client: Accept 64-bit hole chunks

2021-12-03 Thread Eric Blake

Although our read requests are sized such that servers need not send
an extended hole chunk, we still have to be prepared for it to be
fully compliant if we request extended headers.  We can also tolerate
a non-compliant server sending the new chunk even when it should not.

Signed-off-by: Eric Blake 
---
 block/nbd.c| 26 --
 block/trace-events |  1 +
 2 files changed, 21 insertions(+), 6 deletions(-)

diff --git a/block/nbd.c b/block/nbd.c
index da5e6ac2d9a5..c5dea864ebb6 100644
--- a/block/nbd.c
+++ b/block/nbd.c
@@ -518,20 +518,26 @@ static inline uint64_t payload_advance64(uint8_t 
**payload)

 static int nbd_parse_offset_hole_payload(BDRVNBDState *s,
  NBDStructuredReplyChunk *chunk,
- uint8_t *payload, uint64_t 
orig_offset,
+ uint8_t *payload, bool wide,
+ uint64_t orig_offset,
  QEMUIOVector *qiov, Error **errp)
 {
 uint64_t offset;
-uint32_t hole_size;
+uint64_t hole_size;
+size_t len = wide ? sizeof(hole_size) : sizeof(uint32_t);

-if (chunk->length != sizeof(offset) + sizeof(hole_size)) {
+if (chunk->length != sizeof(offset) + len) {
 error_setg(errp, "Protocol error: invalid payload for "
  "NBD_REPLY_TYPE_OFFSET_HOLE");
 return -EINVAL;
 }

 offset = payload_advance64();
-hole_size = payload_advance32();
+if (wide) {
+hole_size = payload_advance64();
+} else {
+hole_size = payload_advance32();
+}

 if (!hole_size || offset < orig_offset || hole_size > qiov->size ||
 offset > orig_offset + qiov->size - hole_size) {
@@ -544,6 +550,7 @@ static int nbd_parse_offset_hole_payload(BDRVNBDState *s,
 trace_nbd_structured_read_compliance("hole");
 }

+assert(hole_size <= SIZE_MAX);
 qemu_iovec_memset(qiov, offset - orig_offset, 0, hole_size);

 return 0;
@@ -1037,9 +1044,16 @@ static int nbd_co_receive_cmdread_reply(BDRVNBDState *s, 
uint64_t handle,
  * in qiov
  */
 break;
+case NBD_REPLY_TYPE_OFFSET_HOLE_EXT:
+if (!s->info.extended_headers) {
+trace_nbd_extended_headers_compliance("hole_ext");
+}
+/* fallthrough */
 case NBD_REPLY_TYPE_OFFSET_HOLE:
-ret = nbd_parse_offset_hole_payload(s, , payload,
-offset, qiov, _err);
+ret = nbd_parse_offset_hole_payload(
+s, , payload,
+chunk->type == NBD_REPLY_TYPE_OFFSET_HOLE_EXT,
+offset, qiov, _err);
 if (ret < 0) {
 nbd_channel_error(s, ret);
 nbd_iter_channel_error(, ret, _err);
diff --git a/block/trace-events b/block/trace-events
index 549090d453e7..ee65da204dde 100644
--- a/block/trace-events
+++ b/block/trace-events
@@ -168,6 +168,7 @@ iscsi_xcopy(void *src_lun, uint64_t src_off, void *dst_lun, 
uint64_t dst_off, ui
 # nbd.c
 nbd_parse_blockstatus_compliance(const char *err) "ignoring extra data from 
non-compliant server: %s"
 nbd_structured_read_compliance(const char *type) "server sent non-compliant 
unaligned read %s chunk"
+nbd_extended_headers_compliance(const char *type) "server sent non-compliant 
%s chunk without extended headers"
 nbd_read_reply_entry_fail(int ret, const char *err) "ret = %d, err: %s"
 nbd_co_request_fail(uint64_t from, uint32_t len, uint64_t handle, uint16_t 
flags, uint16_t type, const char *name, int ret, const char *err) "Request 
failed { .from = %" PRIu64", .len = %" PRIu32 ", .handle = %" PRIu64 ", .flags 
= 0x%" PRIx16 ", .type = %" PRIu16 " (%s) } ret = %d, err: %s"
 nbd_client_handshake(const char *export_name) "export '%s'"
-- 
2.33.1

[PATCH 09/14] nbd/server: Support 64-bit block status

2021-12-03 Thread Eric Blake

The previous patch handled extended headers by truncating large block
status requests from the client back to 32 bits.  But this is not
ideal; for cases where we can truly determine the status of the entire
image quickly (for example, when reporting the entire image as
non-sparse because we lack the ability to probe for holes), this
causes more network traffic for the client to iterate through 4G
chunks than for us to just report the entire image at once.  For ease
of implementation, if extended headers were negotiated, then we always
reply with 64-bit block status replies, even when the result could
have fit in the older 32-bit block status chunk (clients supporting
extended headers have to be prepared for either chunk type, so
temporarily reverting this patch proves whether a client is
compliant).

Note that we previously had some interesting size-juggling on call
chains, such as:

nbd_co_send_block_status(uint32_t length)
-> blockstatus_to_extends(uint32_t bytes)
  -> bdrv_block_status_above(bytes, _t num)
  -> nbd_extent_array_add(uint64_t num)
-> store num in 32-bit length

But we were lucky that it never overflowed: bdrv_block_status_above
never sets num larger than bytes, and we had previously been capping
'bytes' at 32 bits (either by the protocol, or in the previous patch
with an explicit truncation).  This patch adds some assertions that
ensure we continue to avoid overflowing 32 bits for a narrow client,
while fully utilizing 64-bits all the way through when the client
understands that.

Signed-off-by: Eric Blake 
---
 nbd/server.c | 72 ++--
 1 file changed, 48 insertions(+), 24 deletions(-)

diff --git a/nbd/server.c b/nbd/server.c
index 0e496f60ffbd..7e6140350797 100644
--- a/nbd/server.c
+++ b/nbd/server.c
@@ -2106,20 +2106,26 @@ static int coroutine_fn 
nbd_co_send_sparse_read(NBDClient *client,
 }

 typedef struct NBDExtentArray {
-NBDExtent *extents;
+union {
+NBDExtent *narrow;
+NBDExtentExt *extents;
+};
 unsigned int nb_alloc;
 unsigned int count;
 uint64_t total_length;
+bool extended; /* Whether 64-bit extents are allowed */
 bool can_add;
 bool converted_to_be;
 } NBDExtentArray;

-static NBDExtentArray *nbd_extent_array_new(unsigned int nb_alloc)
+static NBDExtentArray *nbd_extent_array_new(unsigned int nb_alloc,
+bool extended)
 {
 NBDExtentArray *ea = g_new0(NBDExtentArray, 1);

 ea->nb_alloc = nb_alloc;
-ea->extents = g_new(NBDExtent, nb_alloc);
+ea->extents = g_new(NBDExtentExt, nb_alloc);
+ea->extended = extended;
 ea->can_add = true;

 return ea;
@@ -2133,17 +2139,31 @@ static void nbd_extent_array_free(NBDExtentArray *ea)
 G_DEFINE_AUTOPTR_CLEANUP_FUNC(NBDExtentArray, nbd_extent_array_free);

 /* Further modifications of the array after conversion are abandoned */
-static void nbd_extent_array_convert_to_be(NBDExtentArray *ea)
+static void nbd_extent_array_convert_to_be(NBDExtentArray *ea,
+   struct iovec *iov)
 {
 int i;

 assert(!ea->converted_to_be);
+assert(iov->iov_base == ea->extents);
 ea->can_add = false;
 ea->converted_to_be = true;

-for (i = 0; i < ea->count; i++) {
-ea->extents[i].flags = cpu_to_be32(ea->extents[i].flags);
-ea->extents[i].length = cpu_to_be32(ea->extents[i].length);
+if (ea->extended) {
+for (i = 0; i < ea->count; i++) {
+ea->extents[i].length = cpu_to_be64(ea->extents[i].length);
+ea->extents[i].flags = cpu_to_be32(ea->extents[i].flags);
+assert(ea->extents[i]._pad == 0);
+}
+iov->iov_len = ea->count * sizeof(ea->extents[0]);
+} else {
+/* Conversion reduces memory usage, order of iteration matters */
+for (i = 0; i < ea->count; i++) {
+assert(ea->extents[i].length <= UINT32_MAX);
+ea->narrow[i].length = cpu_to_be32(ea->extents[i].length);
+ea->narrow[i].flags = cpu_to_be32(ea->extents[i].flags);
+}
+iov->iov_len = ea->count * sizeof(ea->narrow[0]);
 }
 }

@@ -2157,19 +2177,23 @@ static void 
nbd_extent_array_convert_to_be(NBDExtentArray *ea)
  * would result in an incorrect range reported to the client)
  */
 static int nbd_extent_array_add(NBDExtentArray *ea,
-uint32_t length, uint32_t flags)
+uint64_t length, uint32_t flags)
 {
 assert(ea->can_add);

 if (!length) {
 return 0;
 }
+if (!ea->extended) {
+assert(length <= UINT32_MAX);
+}

 /* Extend previous extent if flags are the same */
 if (ea->count > 0 && flags == ea->extents[ea->count - 1].flags) {
-uint64_t sum = (uint64_t)length + ea->extents[ea->count - 1].length;
+uint64_t sum = length + ea->extents[ea->count - 1].length;

-if (sum <= UINT32_MAX) {
+

[PATCH 05/14] nbd/server: Prepare for alternate-size headers

2021-12-03 Thread Eric Blake

An upcoming NBD extension wants to add the ability to do 64-bit
requests.  As part of that extension, the size of the reply headers
will change in order to permit a 64-bit length in the reply for
symmetry [*].  Additionally, where the reply header is currently 16
bytes for simple reply, and 20 bytes for structured reply; with the
extension enabled, both reply type headers will be 32 bytes.  Since we
are already wired up to use iovecs, it is easiest to allow for this
change in header size by splitting each structured reply across two
iovecs, one for the header (which will become variable-length in a
future patch according to client negotiation), and the other for the
payload, and removing the header from the payload struct definitions.
Interestingly, the client side code never utilized the packed types,
so only the server code needs to be updated.

[*] Note that on the surface, this is because some server might permit
a 4G+ NBD_CMD_READ and need to reply with that much data in one
transaction.  But even though the extended reply length is widened to
64 bits, we will still never send a reply payload larger than just
over 32M (the maximum buffer we allow in NBD_CMD_READ; and we cap the
number of extents we are willing to report in NBD_CMD_BLOCK_STATUS).
Where 64-bit fields really matter in the extension is in a later patch
adding 64-bit support into a counterpart for REPLY_TYPE_BLOCK_STATUS.

Signed-off-by: Eric Blake 
---
 include/block/nbd.h | 10 +++
 nbd/server.c| 64 -
 2 files changed, 45 insertions(+), 29 deletions(-)

diff --git a/include/block/nbd.h b/include/block/nbd.h
index 78d101b77488..3d0689b69367 100644
--- a/include/block/nbd.h
+++ b/include/block/nbd.h
@@ -1,5 +1,5 @@
 /*
- *  Copyright (C) 2016-2020 Red Hat, Inc.
+ *  Copyright (C) 2016-2021 Red Hat, Inc.
  *  Copyright (C) 2005  Anthony Liguori 
  *
  *  Network Block Device
@@ -95,28 +95,28 @@ typedef union NBDReply {

 /* Header of chunk for NBD_REPLY_TYPE_OFFSET_DATA */
 typedef struct NBDStructuredReadData {
-NBDStructuredReplyChunk h; /* h.length >= 9 */
+/* header's .length >= 9 */
 uint64_t offset;
 /* At least one byte of data payload follows, calculated from h.length */
 } QEMU_PACKED NBDStructuredReadData;

 /* Complete chunk for NBD_REPLY_TYPE_OFFSET_HOLE */
 typedef struct NBDStructuredReadHole {
-NBDStructuredReplyChunk h; /* h.length == 12 */
+/* header's length == 12 */
 uint64_t offset;
 uint32_t length;
 } QEMU_PACKED NBDStructuredReadHole;

 /* Header of all NBD_REPLY_TYPE_ERROR* errors */
 typedef struct NBDStructuredError {
-NBDStructuredReplyChunk h; /* h.length >= 6 */
+/* header's length >= 6 */
 uint32_t error;
 uint16_t message_length;
 } QEMU_PACKED NBDStructuredError;

 /* Header of NBD_REPLY_TYPE_BLOCK_STATUS */
 typedef struct NBDStructuredMeta {
-NBDStructuredReplyChunk h; /* h.length >= 12 (at least one extent) */
+/* header's length >= 12 (at least one extent) */
 uint32_t context_id;
 /* extents follows */
 } QEMU_PACKED NBDStructuredMeta;
diff --git a/nbd/server.c b/nbd/server.c
index f302e1cbb03e..64845542fd6b 100644
--- a/nbd/server.c
+++ b/nbd/server.c
@@ -1869,9 +1869,12 @@ static int coroutine_fn nbd_co_send_iov(NBDClient 
*client, struct iovec *iov,
 return ret;
 }

-static inline void set_be_simple_reply(NBDSimpleReply *reply, uint64_t error,
-   uint64_t handle)
+static inline void set_be_simple_reply(NBDClient *client, struct iovec *iov,
+   uint64_t error, uint64_t handle)
 {
+NBDSimpleReply *reply = iov->iov_base;
+
+iov->iov_len = sizeof(*reply);
 stl_be_p(>magic, NBD_SIMPLE_REPLY_MAGIC);
 stl_be_p(>error, error);
 stq_be_p(>handle, handle);
@@ -1884,23 +1887,27 @@ static int nbd_co_send_simple_reply(NBDClient *client,
 size_t len,
 Error **errp)
 {
-NBDSimpleReply reply;
+NBDReply hdr;
 int nbd_err = system_errno_to_nbd_errno(error);
 struct iovec iov[] = {
-{.iov_base = , .iov_len = sizeof(reply)},
+{.iov_base = },
 {.iov_base = data, .iov_len = len}
 };

 trace_nbd_co_send_simple_reply(handle, nbd_err, nbd_err_lookup(nbd_err),
len);
-set_be_simple_reply(, nbd_err, handle);
+set_be_simple_reply(client, [0], nbd_err, handle);

 return nbd_co_send_iov(client, iov, len ? 2 : 1, errp);
 }

-static inline void set_be_chunk(NBDStructuredReplyChunk *chunk, uint16_t flags,
-uint16_t type, uint64_t handle, uint32_t 
length)
+static inline void set_be_chunk(NBDClient *client, struct iovec *iov,
+uint16_t flags, uint16_t type,
+uint64_t handle, uint32_t length)
 {
+NBDStructuredReplyChunk *chunk = iov->iov_base;
+
+

[PATCH 07/14] nbd: Add types for extended headers

2021-12-03 Thread Eric Blake

Add the constants and structs necessary for later patches to start
implementing the NBD_OPT_EXTENDED_HEADERS extension in both the client
and server.  This patch does not change any existing behavior, but
merely sets the stage.

This patch does not change the status quo that neither the client nor
server use a packed-struct representation for the request header.

Signed-off-by: Eric Blake 
---
 docs/interop/nbd.txt |  1 +
 include/block/nbd.h  | 67 +++-
 nbd/common.c | 10 +--
 3 files changed, 62 insertions(+), 16 deletions(-)

diff --git a/docs/interop/nbd.txt b/docs/interop/nbd.txt
index bdb0f2a41aca..6229ea573c04 100644
--- a/docs/interop/nbd.txt
+++ b/docs/interop/nbd.txt
@@ -68,3 +68,4 @@ NBD_CMD_BLOCK_STATUS for "qemu:dirty-bitmap:", NBD_CMD_CACHE
 * 4.2: NBD_FLAG_CAN_MULTI_CONN for shareable read-only exports,
 NBD_CMD_FLAG_FAST_ZERO
 * 5.2: NBD_CMD_BLOCK_STATUS for "qemu:allocation-depth"
+* 7.0: NBD_OPT_EXTENDED_HEADERS
diff --git a/include/block/nbd.h b/include/block/nbd.h
index 732314aaba11..5f9d86a86352 100644
--- a/include/block/nbd.h
+++ b/include/block/nbd.h
@@ -69,6 +69,14 @@ typedef struct NBDSimpleReply {
 uint64_t handle;
 } QEMU_PACKED NBDSimpleReply;

+typedef struct NBDSimpleReplyExt {
+uint32_t magic;  /* NBD_SIMPLE_REPLY_EXT_MAGIC */
+uint32_t error;
+uint64_t handle;
+uint64_t _pad1;  /* Must be 0 */
+uint64_t _pad2;  /* Must be 0 */
+} QEMU_PACKED NBDSimpleReplyExt;
+
 /* Header of all structured replies */
 typedef struct NBDStructuredReplyChunk {
 uint32_t magic;  /* NBD_STRUCTURED_REPLY_MAGIC */
@@ -78,9 +86,20 @@ typedef struct NBDStructuredReplyChunk {
 uint32_t length; /* length of payload */
 } QEMU_PACKED NBDStructuredReplyChunk;

+typedef struct NBDStructuredReplyChunkExt {
+uint32_t magic;  /* NBD_STRUCTURED_REPLY_EXT_MAGIC */
+uint16_t flags;  /* combination of NBD_REPLY_FLAG_* */
+uint16_t type;   /* NBD_REPLY_TYPE_* */
+uint64_t handle; /* request handle */
+uint64_t length; /* length of payload */
+uint64_t _pad;   /* Must be 0 */
+} QEMU_PACKED NBDStructuredReplyChunkExt;
+
 typedef union NBDReply {
 NBDSimpleReply simple;
+NBDSimpleReplyExt simple_ext;
 NBDStructuredReplyChunk structured;
+NBDStructuredReplyChunkExt structured_ext;
 struct {
 /* @magic and @handle fields have the same offset and size both in
  * simple reply and structured reply chunk, so let them be accessible
@@ -106,6 +125,13 @@ typedef struct NBDStructuredReadHole {
 uint32_t length;
 } QEMU_PACKED NBDStructuredReadHole;

+/* Complete chunk for NBD_REPLY_TYPE_OFFSET_HOLE_EXT */
+typedef struct NBDStructuredReadHoleExt {
+/* header's length == 16 */
+uint64_t offset;
+uint64_t length;
+} QEMU_PACKED NBDStructuredReadHoleExt;
+
 /* Header of all NBD_REPLY_TYPE_ERROR* errors */
 typedef struct NBDStructuredError {
 /* header's length >= 6 */
@@ -113,19 +139,26 @@ typedef struct NBDStructuredError {
 uint16_t message_length;
 } QEMU_PACKED NBDStructuredError;

-/* Header of NBD_REPLY_TYPE_BLOCK_STATUS */
+/* Header of NBD_REPLY_TYPE_BLOCK_STATUS, NBD_REPLY_TYPE_BLOCK_STATUS_EXT */
 typedef struct NBDStructuredMeta {
-/* header's length >= 12 (at least one extent) */
+/* header's length >= 12 narrow, or >= 20 extended (at least one extent) */
 uint32_t context_id;
-/* extents follows */
+/* extents[] follows: NBDExtent for narrow, NBDExtentExt for extended */
 } QEMU_PACKED NBDStructuredMeta;

-/* Extent chunk for NBD_REPLY_TYPE_BLOCK_STATUS */
+/* Extent array for NBD_REPLY_TYPE_BLOCK_STATUS */
 typedef struct NBDExtent {
 uint32_t length;
 uint32_t flags; /* NBD_STATE_* */
 } QEMU_PACKED NBDExtent;

+/* Extent array for NBD_REPLY_TYPE_BLOCK_STATUS_EXT */
+typedef struct NBDExtentExt {
+uint64_t length;
+uint32_t flags; /* NBD_STATE_* */
+uint32_t _pad;  /* Must be 0 */
+} QEMU_PACKED NBDExtentExt;
+
 /* Transmission (export) flags: sent from server to client during handshake,
but describe what will happen during transmission */
 enum {
@@ -178,6 +211,7 @@ enum {
 #define NBD_OPT_STRUCTURED_REPLY  (8)
 #define NBD_OPT_LIST_META_CONTEXT (9)
 #define NBD_OPT_SET_META_CONTEXT  (10)
+#define NBD_OPT_EXTENDED_HEADERS  (11)

 /* Option reply types. */
 #define NBD_REP_ERR(value) ((UINT32_C(1) << 31) | (value))
@@ -234,12 +268,15 @@ enum {
  */
 #define NBD_MAX_STRING_SIZE 4096

-/* Transmission request structure */
+/* Two types of request structures, a given client will only use 1 */
 #define NBD_REQUEST_MAGIC   0x25609513
+#define NBD_REQUEST_EXT_MAGIC   0x21e41c71

-/* Two types of reply structures */
-#define NBD_SIMPLE_REPLY_MAGIC  0x67446698
-#define NBD_STRUCTURED_REPLY_MAGIC  0x668e33ef
+/* Four types of reply structures, a given client will only use 2 */
+#define NBD_SIMPLE_REPLY_MAGIC  0x67446698
+#define NBD_STRUCTURED_REPLY_MAGIC  0x668e33ef
+#define

[PATCH 08/14] nbd/server: Initial support for extended headers

2021-12-03 Thread Eric Blake

We have no reason to send NBD_REPLY_TYPE_OFFSET_HOLE_EXT since we
already cap NBD_CMD_READ to 32M.  For NBD_CMD_WRITE_ZEROES and
NBD_CMD_TRIM, the block layer already supports 64-bit operations
without any effort on our part.  For NBD_CMD_BLOCK_STATUS, the
client's length is a hint; the easiest approach is to truncate our
answer back to 32 bits, letting us delay the effort of implementing
NBD_REPLY_TYPE_BLOCK_STATUS_EXT to a separate patch.

Signed-off-by: Eric Blake 
---
 nbd/nbd-internal.h |   5 ++-
 nbd/server.c   | 106 ++---
 2 files changed, 85 insertions(+), 26 deletions(-)

diff --git a/nbd/nbd-internal.h b/nbd/nbd-internal.h
index 0016793ff4b1..875b6204c28c 100644
--- a/nbd/nbd-internal.h
+++ b/nbd/nbd-internal.h
@@ -35,8 +35,11 @@
  * https://github.com/yoe/nbd/blob/master/doc/proto.md
  */

-/* Size of all NBD_OPT_*, without payload */
+/* Size of all compact NBD_CMD_*, without payload */
 #define NBD_REQUEST_SIZE(4 + 2 + 2 + 8 + 8 + 4)
+/* Size of all extended NBD_CMD_*, without payload */
+#define NBD_REQUEST_EXT_SIZE(4 + 2 + 2 + 8 + 8 + 8)
+
 /* Size of all NBD_REP_* sent in answer to most NBD_OPT_*, without payload */
 #define NBD_REPLY_SIZE  (4 + 4 + 8)
 /* Size of reply to NBD_OPT_EXPORT_NAME */
diff --git a/nbd/server.c b/nbd/server.c
index 4306fa7b426c..0e496f60ffbd 100644
--- a/nbd/server.c
+++ b/nbd/server.c
@@ -142,6 +142,7 @@ struct NBDClient {
 uint32_t check_align; /* If non-zero, check for aligned client requests */

 bool structured_reply;
+bool extended_headers;
 NBDExportMetaContexts export_meta;

 uint32_t opt; /* Current option being negotiated */
@@ -1275,6 +1276,19 @@ static int nbd_negotiate_options(NBDClient *client, 
Error **errp)
  errp);
 break;

+case NBD_OPT_EXTENDED_HEADERS:
+if (length) {
+ret = nbd_reject_length(client, false, errp);
+} else if (client->extended_headers) {
+ret = nbd_negotiate_send_rep_err(
+client, NBD_REP_ERR_INVALID, errp,
+"extended headers already negotiated");
+} else {
+ret = nbd_negotiate_send_rep(client, NBD_REP_ACK, errp);
+client->extended_headers = true;
+}
+break;
+
 default:
 ret = nbd_opt_drop(client, NBD_REP_ERR_UNSUP, errp,
"Unsupported option %" PRIu32 " (%s)",
@@ -1410,11 +1424,13 @@ nbd_read_eof(NBDClient *client, void *buffer, size_t 
size, Error **errp)
 static int nbd_receive_request(NBDClient *client, NBDRequest *request,
Error **errp)
 {
-uint8_t buf[NBD_REQUEST_SIZE];
-uint32_t magic;
+uint8_t buf[NBD_REQUEST_EXT_SIZE];
+uint32_t magic, expect;
 int ret;
+size_t size = client->extended_headers ? NBD_REQUEST_EXT_SIZE
+: NBD_REQUEST_SIZE;

-ret = nbd_read_eof(client, buf, sizeof(buf), errp);
+ret = nbd_read_eof(client, buf, size, errp);
 if (ret < 0) {
 return ret;
 }
@@ -1422,13 +1438,21 @@ static int nbd_receive_request(NBDClient *client, 
NBDRequest *request,
 return -EIO;
 }

-/* Request
-   [ 0 ..  3]   magic   (NBD_REQUEST_MAGIC)
-   [ 4 ..  5]   flags   (NBD_CMD_FLAG_FUA, ...)
-   [ 6 ..  7]   type(NBD_CMD_READ, ...)
-   [ 8 .. 15]   handle
-   [16 .. 23]   from
-   [24 .. 27]   len
+/*
+ * Compact request
+ *  [ 0 ..  3]   magic   (NBD_REQUEST_MAGIC)
+ *  [ 4 ..  5]   flags   (NBD_CMD_FLAG_FUA, ...)
+ *  [ 6 ..  7]   type(NBD_CMD_READ, ...)
+ *  [ 8 .. 15]   handle
+ *  [16 .. 23]   from
+ *  [24 .. 27]   len
+ * Extended request
+ *  [ 0 ..  3]   magic   (NBD_REQUEST_EXT_MAGIC)
+ *  [ 4 ..  5]   flags   (NBD_CMD_FLAG_FUA, ...)
+ *  [ 6 ..  7]   type(NBD_CMD_READ, ...)
+ *  [ 8 .. 15]   handle
+ *  [16 .. 23]   from
+ *  [24 .. 31]   len
  */

 magic = ldl_be_p(buf);
@@ -1436,12 +1460,18 @@ static int nbd_receive_request(NBDClient *client, 
NBDRequest *request,
 request->type   = lduw_be_p(buf + 6);
 request->handle = ldq_be_p(buf + 8);
 request->from   = ldq_be_p(buf + 16);
-request->len= ldl_be_p(buf + 24); /* widen 32 to 64 bits */
+if (client->extended_headers) {
+request->len = ldq_be_p(buf + 24);
+expect = NBD_REQUEST_EXT_MAGIC;
+} else {
+request->len = ldl_be_p(buf + 24); /* widen 32 to 64 bits */
+expect = NBD_REQUEST_MAGIC;
+}

 trace_nbd_receive_request(magic, request->flags, request->type,
   request->from, request->len);

-if (magic != NBD_REQUEST_MAGIC) {
+if (magic != expect) {
 error_setg(errp, "invalid magic (got 0x%" PRIx32 ")", magic);

[PATCH 06/14] nbd: Prepare for 64-bit requests

2021-12-03 Thread Eric Blake

Widen the length field of NBDRequest to 64-bits, although we can
assert that all current uses are still under 32 bits.  Move the
request magic number to nbd.h, to live alongside the reply magic
number.  Add a bool that will eventually track whether the client
successfully negotiated extended headers with the server, allowing the
nbd driver to pass larger requests along where possible; although in
this patch it always remains false for no semantic change yet.

Signed-off-by: Eric Blake 
---
 include/block/nbd.h | 19 +++
 nbd/nbd-internal.h  |  3 +--
 block/nbd.c | 35 ---
 nbd/client.c|  8 +---
 nbd/server.c| 11 ---
 nbd/trace-events|  8 
 6 files changed, 53 insertions(+), 31 deletions(-)

diff --git a/include/block/nbd.h b/include/block/nbd.h
index 3d0689b69367..732314aaba11 100644
--- a/include/block/nbd.h
+++ b/include/block/nbd.h
@@ -52,17 +52,16 @@ typedef struct NBDOptionReplyMetaContext {

 /* Transmission phase structs
  *
- * Note: these are _NOT_ the same as the network representation of an NBD
- * request and reply!
+ * Note: NBDRequest is _NOT_ the same as the network representation of an NBD
+ * request!
  */
-struct NBDRequest {
+typedef struct NBDRequest {
 uint64_t handle;
 uint64_t from;
-uint32_t len;
+uint64_t len;   /* Must fit 32 bits unless extended headers negotiated */
 uint16_t flags; /* NBD_CMD_FLAG_* */
-uint16_t type; /* NBD_CMD_* */
-};
-typedef struct NBDRequest NBDRequest;
+uint16_t type;  /* NBD_CMD_* */
+} NBDRequest;

 typedef struct NBDSimpleReply {
 uint32_t magic;  /* NBD_SIMPLE_REPLY_MAGIC */
@@ -235,6 +234,9 @@ enum {
  */
 #define NBD_MAX_STRING_SIZE 4096

+/* Transmission request structure */
+#define NBD_REQUEST_MAGIC   0x25609513
+
 /* Two types of reply structures */
 #define NBD_SIMPLE_REPLY_MAGIC  0x67446698
 #define NBD_STRUCTURED_REPLY_MAGIC  0x668e33ef
@@ -293,6 +295,7 @@ struct NBDExportInfo {
 /* In-out fields, set by client before nbd_receive_negotiate() and
  * updated by server results during nbd_receive_negotiate() */
 bool structured_reply;
+bool extended_headers;
 bool base_allocation; /* base:allocation context for NBD_CMD_BLOCK_STATUS 
*/

 /* Set by server results during nbd_receive_negotiate() and
@@ -322,7 +325,7 @@ int nbd_receive_export_list(QIOChannel *ioc, 
QCryptoTLSCreds *tlscreds,
 Error **errp);
 int nbd_init(int fd, QIOChannelSocket *sioc, NBDExportInfo *info,
  Error **errp);
-int nbd_send_request(QIOChannel *ioc, NBDRequest *request);
+int nbd_send_request(QIOChannel *ioc, NBDRequest *request, bool ext_hdr);
 int coroutine_fn nbd_receive_reply(BlockDriverState *bs, QIOChannel *ioc,
NBDReply *reply, Error **errp);
 int nbd_client(int fd);
diff --git a/nbd/nbd-internal.h b/nbd/nbd-internal.h
index 1b2141ab4b2d..0016793ff4b1 100644
--- a/nbd/nbd-internal.h
+++ b/nbd/nbd-internal.h
@@ -1,7 +1,7 @@
 /*
  * NBD Internal Declarations
  *
- * Copyright (C) 2016 Red Hat, Inc.
+ * Copyright (C) 2016-2021 Red Hat, Inc.
  *
  * This work is licensed under the terms of the GNU GPL, version 2 or later.
  * See the COPYING file in the top-level directory.
@@ -45,7 +45,6 @@
 #define NBD_OLDSTYLE_NEGOTIATE_SIZE (8 + 8 + 8 + 4 + 124)

 #define NBD_INIT_MAGIC  0x4e42444d41474943LL /* ASCII "NBDMAGIC" */
-#define NBD_REQUEST_MAGIC   0x25609513
 #define NBD_OPTS_MAGIC  0x49484156454F5054LL /* ASCII "IHAVEOPT" */
 #define NBD_CLIENT_MAGIC0x420281861253LL
 #define NBD_REP_MAGIC   0x0003e889045565a9LL
diff --git a/block/nbd.c b/block/nbd.c
index 5ef462db1b7f..3e9875241bec 100644
--- a/block/nbd.c
+++ b/block/nbd.c
@@ -2,7 +2,7 @@
  * QEMU Block driver for  NBD
  *
  * Copyright (c) 2019 Virtuozzo International GmbH.
- * Copyright (C) 2016 Red Hat, Inc.
+ * Copyright (C) 2016-2021 Red Hat, Inc.
  * Copyright (C) 2008 Bull S.A.S.
  * Author: Laurent Vivier 
  *
@@ -300,7 +300,7 @@ int coroutine_fn 
nbd_co_do_establish_connection(BlockDriverState *bs,
  */
 NBDRequest request = { .type = NBD_CMD_DISC };

-nbd_send_request(s->ioc, );
+nbd_send_request(s->ioc, , s->info.extended_headers);

 yank_unregister_function(BLOCKDEV_YANK_INSTANCE(s->bs->node_name),
  nbd_yank, bs);
@@ -470,7 +470,7 @@ static int nbd_co_send_request(BlockDriverState *bs,

 if (qiov) {
 qio_channel_set_cork(s->ioc, true);
-rc = nbd_send_request(s->ioc, request);
+rc = nbd_send_request(s->ioc, request, s->info.extended_headers);
 if (nbd_client_connected(s) && rc >= 0) {
 if (qio_channel_writev_all(s->ioc, qiov->iov, qiov->niov,
NULL) < 0) {
@@ -481,7 +481,7 @@ static int nbd_co_send_request(BlockDriverState *bs,
 }

[PATCH 04/14] nbd/client: Add safety check on chunk payload length

2021-12-03 Thread Eric Blake

Our existing use of structured replies either reads into a qiov capped
at 32M (NBD_CMD_READ) or caps allocation to 1000 bytes (see
NBD_MAX_MALLOC_PAYLOAD in block/nbd.c).  But the existing length
checks are rather late; if we encounter a buggy (or malicious) server
that sends a super-large payload length, we should drop the connection
right then rather than assuming the layer on top will be careful.
This becomes more important when we permit 64-bit lengths which are
even more likely to have the potential for attempted denial of service
abuse.

Signed-off-by: Eric Blake 
---
 nbd/client.c | 12 
 1 file changed, 12 insertions(+)

diff --git a/nbd/client.c b/nbd/client.c
index 30d5383cb195..8f137c2320bb 100644
--- a/nbd/client.c
+++ b/nbd/client.c
@@ -1412,6 +1412,18 @@ static int nbd_receive_structured_reply_chunk(QIOChannel 
*ioc,
 chunk->handle = be64_to_cpu(chunk->handle);
 chunk->length = be32_to_cpu(chunk->length);

+/*
+ * Because we use BLOCK_STATUS with REQ_ONE, and cap READ requests
+ * at 32M, no valid server should send us payload larger than
+ * this.  Even if we stopped using REQ_ONE, sane servers will cap
+ * the number of extents they return for block status.
+ */
+if (chunk->length > NBD_MAX_BUFFER_SIZE + sizeof(NBDStructuredReadData)) {
+error_setg(errp, "server chunk %" PRIu32 " (%s) payload is too long",
+   chunk->type, nbd_rep_lookup(chunk->type));
+return -EINVAL;
+}
+
 return 0;
 }

-- 
2.33.1

[PATCH 01/14] nbd/server: Minor cleanups

2021-12-03 Thread Eric Blake

Spelling fixes, grammar improvements and consistent spacing, noticed
while preparing other patches in this file.

Signed-off-by: Eric Blake 
---
 nbd/server.c | 13 ++---
 1 file changed, 6 insertions(+), 7 deletions(-)

diff --git a/nbd/server.c b/nbd/server.c
index 4630dd732250..f302e1cbb03e 100644
--- a/nbd/server.c
+++ b/nbd/server.c
@@ -2085,11 +2085,10 @@ static void 
nbd_extent_array_convert_to_be(NBDExtentArray *ea)
  * Add extent to NBDExtentArray. If extent can't be added (no available space),
  * return -1.
  * For safety, when returning -1 for the first time, .can_add is set to false,
- * further call to nbd_extent_array_add() will crash.
- * (to avoid the situation, when after failing to add an extent (returned -1),
- * user miss this failure and add another extent, which is successfully added
- * (array is full, but new extent may be squashed into the last one), then we
- * have invalid array with skipped extent)
+ * and further calls to nbd_extent_array_add() will crash.
+ * (this avoids the situation where a caller ignores failure to add one extent,
+ * where adding another extent that would squash into the last array entry
+ * would result in an incorrect range reported to the client)
  */
 static int nbd_extent_array_add(NBDExtentArray *ea,
 uint32_t length, uint32_t flags)
@@ -2288,7 +2287,7 @@ static int nbd_co_receive_request(NBDRequestData *req, 
NBDRequest *request,
 assert(client->recv_coroutine == qemu_coroutine_self());
 ret = nbd_receive_request(client, request, errp);
 if (ret < 0) {
-return  ret;
+return ret;
 }

 trace_nbd_co_receive_request_decode_type(request->handle, request->type,
@@ -2648,7 +2647,7 @@ static coroutine_fn void nbd_trip(void *opaque)
 }

 if (ret < 0) {
-/* It wans't -EIO, so, according to nbd_co_receive_request()
+/* It wasn't -EIO, so, according to nbd_co_receive_request()
  * semantics, we should return the error to the client. */
 Error *export_err = local_err;

-- 
2.33.1

[PATCH 00/14] qemu patches for NBD_OPT_EXTENDED_HEADERS

2021-12-03 Thread Eric Blake

Available at https://repo.or.cz/qemu/ericb.git/shortlog/refs/tags/exthdr-v1

Patch 14 is optional; I'm including it now because I tested with it,
but I'm also okay with dropping it based on RFC discussion.

Eric Blake (14):
  nbd/server: Minor cleanups
  qemu-io: Utilize 64-bit status during map
  qemu-io: Allow larger write zeroes under no fallback
  nbd/client: Add safety check on chunk payload length
  nbd/server: Prepare for alternate-size headers
  nbd: Prepare for 64-bit requests
  nbd: Add types for extended headers
  nbd/server: Initial support for extended headers
  nbd/server: Support 64-bit block status
  nbd/client: Initial support for extended headers
  nbd/client: Accept 64-bit hole chunks
  nbd/client: Accept 64-bit block status chunks
  nbd/client: Request extended headers during negotiation
  do not apply: nbd/server: Send 64-bit hole chunk

 docs/interop/nbd.txt  |   1 +
 include/block/nbd.h   |  94 +--
 nbd/nbd-internal.h|   8 +-
 block/nbd.c   | 102 +--
 nbd/client-connection.c   |   1 +
 nbd/client.c  | 150 +++---
 nbd/common.c  |  10 +-
 nbd/server.c  | 262 +-
 qemu-io-cmds.c|  16 +-
 qemu-nbd.c|   2 +
 block/trace-events|   1 +
 nbd/trace-events  |   9 +-
 tests/qemu-iotests/223.out|   4 +
 tests/qemu-iotests/233.out|   1 +
 tests/qemu-iotests/241|   8 +-
 tests/qemu-iotests/307|   2 +-
 tests/qemu-iotests/307.out|   5 +
 .../tests/nbd-qemu-allocation.out |   1 +
 18 files changed, 486 insertions(+), 191 deletions(-)

-- 
2.33.1

[PATCH 03/14] qemu-io: Allow larger write zeroes under no fallback

2021-12-03 Thread Eric Blake

When writing zeroes can fall back to a slow write, permitting an
overly large request can become an amplification denial of service
attack in triggering a large amount of work from a small request.  But
the whole point of the no fallback flag is to quickly determine if
writing an entire device to zero can be done quickly (such as when it
is already known that the device started with zero contents); in those
cases, artificially capping things at 2G in qemu-io itself doesn't
help us.

Signed-off-by: Eric Blake 
---
 qemu-io-cmds.c | 9 +++--
 1 file changed, 3 insertions(+), 6 deletions(-)

diff --git a/qemu-io-cmds.c b/qemu-io-cmds.c
index 954955c12fb9..45a957093369 100644
--- a/qemu-io-cmds.c
+++ b/qemu-io-cmds.c
@@ -603,10 +603,6 @@ static int do_co_pwrite_zeroes(BlockBackend *blk, int64_t 
offset,
 .done   = false,
 };

-if (bytes > INT_MAX) {
-return -ERANGE;
-}
-
 co = qemu_coroutine_create(co_pwrite_zeroes_entry, );
 bdrv_coroutine_enter(blk_bs(blk), co);
 while (!data.done) {
@@ -1160,8 +1156,9 @@ static int write_f(BlockBackend *blk, int argc, char 
**argv)
 if (count < 0) {
 print_cvtnum_err(count, argv[optind]);
 return count;
-} else if (count > BDRV_REQUEST_MAX_BYTES) {
-printf("length cannot exceed %" PRIu64 ", given %s\n",
+} else if (count > BDRV_REQUEST_MAX_BYTES &&
+   !(flags & BDRV_REQ_NO_FALLBACK)) {
+printf("length cannot exceed %" PRIu64 " without -n, given %s\n",
(uint64_t)BDRV_REQUEST_MAX_BYTES, argv[optind]);
 return -EINVAL;
 }
-- 
2.33.1

[PATCH 02/14] qemu-io: Utilize 64-bit status during map

2021-12-03 Thread Eric Blake

The block layer has supported 64-bit block status from drivers since
commit 86a3d5c688 ("block: Add .bdrv_co_block_status() callback",
v2.12) and friends, with individual driver callbacks responsible for
capping things where necessary.  Artificially capping things below 2G
in the qemu-io 'map' command, added in commit d6a644bbfe ("block: Make
bdrv_is_allocated() byte-based", v2.10) is thus no longer necessary.

One way to test this is with qemu-nbd as server on a raw file larger
than 4G (the entire file should show as allocated), plus 'qemu-io -f
raw -c map nbd://localhost --trace=nbd_\*' as client.  Prior to this
patch, the NBD_CMD_BLOCK_STATUS requests are fragmented at 0x7e00
distances; with this patch, the fragmenting changes to 0x7fff
(since the NBD protocol is currently still limited to 32-bit
transactions - see block/nbd.c:nbd_client_co_block_status).  Then in
later patches, once I add an NBD extension for a 64-bit block status,
the same map command completes with just one NBD_CMD_BLOCK_STATUS.

Signed-off-by: Eric Blake 
---
 qemu-io-cmds.c | 7 ++-
 1 file changed, 2 insertions(+), 5 deletions(-)

diff --git a/qemu-io-cmds.c b/qemu-io-cmds.c
index 46593d632d8f..954955c12fb9 100644
--- a/qemu-io-cmds.c
+++ b/qemu-io-cmds.c
@@ -1993,11 +1993,9 @@ static int map_is_allocated(BlockDriverState *bs, 
int64_t offset,
 int64_t bytes, int64_t *pnum)
 {
 int64_t num;
-int num_checked;
 int ret, firstret;

-num_checked = MIN(bytes, BDRV_REQUEST_MAX_BYTES);
-ret = bdrv_is_allocated(bs, offset, num_checked, );
+ret = bdrv_is_allocated(bs, offset, bytes, );
 if (ret < 0) {
 return ret;
 }
@@ -2009,8 +2007,7 @@ static int map_is_allocated(BlockDriverState *bs, int64_t 
offset,
 offset += num;
 bytes -= num;

-num_checked = MIN(bytes, BDRV_REQUEST_MAX_BYTES);
-ret = bdrv_is_allocated(bs, offset, num_checked, );
+ret = bdrv_is_allocated(bs, offset, bytes, );
 if (ret == firstret && num) {
 *pnum += num;
 } else {
-- 
2.33.1

[PATCH] spec: Add NBD_OPT_EXTENDED_HEADERS

2021-12-03 Thread Eric Blake

Add a new negotiation feature where the client and server agree to use
larger packet headers on every packet sent during transmission phase.
This has two purposes: first, it makes it possible to perform
operations like trim, write zeroes, and block status on more than 2^32
bytes in a single command; this in turn requires that some structured
replies from the server also be extended to match.  The wording chosen
here is careful to permit a server to use either flavor in its reply
(that is, a request less than 32-bits can trigger an extended reply,
and conversely a request larger than 32-bits can trigger a compact
reply).

Second, when structured replies are active, clients have to deal with
the difference between 16- and 20-byte headers of simple
vs. structured replies, which impacts performance if the client must
perform multiple syscalls to first read the magic before knowing how
many additional bytes to read.  In extended header mode, all headers
are the same width, so the client can read a full header before
deciding whether the header describes a simple or structured reply.
Similarly, by having extended mode use a power-of-2 sizing, it becomes
easier to manipulate headers within a single cache line, even if it
requires padding bytes sent over the wire.  However, note that this
change only affects the headers; as data payloads can still be
unaligned (for example, a client performing 1-byte reads or writes),
we would need to negotiate yet another extension if we wanted to
ensure that all NBD transmission packets started on an 8-byte boundary
after option haggling has completed.

This spec addition was done in parallel with a proof of concept
implementation in qemu (server and client) and libnbd (client), and I
also have plans to implement it in nbdkit (server).

Signed-off-by: Eric Blake 
---

Available at https://repo.or.cz/nbd/ericb.git/shortlog/refs/tags/exthdr-v1

 doc/proto.md | 218 +--
 1 file changed, 177 insertions(+), 41 deletions(-)

diff --git a/doc/proto.md b/doc/proto.md
index 3a877a9..46560b6 100644
--- a/doc/proto.md
+++ b/doc/proto.md
@@ -295,6 +295,21 @@ reply is also problematic for error handling of the 
`NBD_CMD_READ`
 request.  Therefore, structured replies can be used to create a
 a context-free server stream; see below.

+The results of client negotiation also determine whether the client
+and server will utilize only compact requests and replies, or whether
+both sides will use only extended packets.  Compact messages are the
+default, but inherently limit single transactions to a 32-bit window
+starting at a 64-bit offset.  Extended messages make it possible to
+perform 64-bit transactions (although typically only for commands that
+do not include a data payload).  Furthermore, when structured replies
+have been negotiated, compact messages require the client to perform
+partial reads to determine which reply packet style (simple or
+structured) is on the wire before knowing the length of the rest of
+the reply, which can reduce client performance.  With extended
+messages, all packet headers have a fixed length of 32 bytes, and
+although this results in more traffic over the network due to padding,
+the resulting layout is friendlier for performance.
+
 Replies need not be sent in the same order as requests (i.e., requests
 may be handled by the server asynchronously), and structured reply
 chunks from one request may be interleaved with reply messages from
@@ -343,7 +358,9 @@ may be useful.

  Request message

-The request message, sent by the client, looks as follows:
+The compact request message, sent by the client when extended
+transactions are not negotiated using `NBD_OPT_EXTENDED_HEADERS`,
+looks as follows:

 C: 32 bits, 0x25609513, magic (`NBD_REQUEST_MAGIC`)  
 C: 16 bits, command flags  
@@ -353,14 +370,26 @@ C: 64 bits, offset (unsigned)
 C: 32 bits, length (unsigned)  
 C: (*length* bytes of data if the request is of type `NBD_CMD_WRITE`)  

+If negotiation agreed on extended transactions with
+`NBD_OPT_EXTENDED_HEADERS`, the client instead uses extended requests:
+
+C: 32 bits, 0x21e41c71, magic (`NBD_REQUEST_EXT_MAGIC`)  
+C: 16 bits, command flags  
+C: 16 bits, type  
+C: 64 bits, handle  
+C: 64 bits, offset (unsigned)  
+C: 64 bits, length (unsigned)  
+C: (*length* bytes of data if the request is of type `NBD_CMD_WRITE`)  
+
  Simple reply message

 The simple reply message MUST be sent by the server in response to all
 requests if structured replies have not been negotiated using
-`NBD_OPT_STRUCTURED_REPLY`. If structured replies have been negotiated, a 
simple
-reply MAY be used as a reply to any request other than `NBD_CMD_READ`,
-but only if the reply has no data payload.  The message looks as
-follows:
+`NBD_OPT_STRUCTURED_REPLY`. If structured replies have been
+negotiated, a simple reply MAY be used as a reply to any request other
+than `NBD_CMD_READ`, but only if the reply has no data payload.

RFC for NBD protocol extension: extended headers

2021-12-03 Thread Eric Blake

In response to this mail, I will be cross-posting a series of patches
to multiple projects as a proof-of-concept implementation and request
for comments on a new NBD protocol extension, called
NBD_OPT_EXTENDED_HEADERS.  With this in place, it will be possible for
clients to request 64-bit zero, trim, cache, and block status
operations when supported by the server.

Not yet complete: an implementation of this in nbdkit.  I also plan to
tweak libnbd's 'nbdinfo --map' and 'nbdcopy' to take advantage of the
larger block status results.  Also, with 64-bit commands, we may want
to also make it easier to let servers advertise an actual maximum size
they are willing to accept for the commands in question (for example,
a server may be happy with a full 64-bit block status, but still want
to limit non-fast zero and cache to a smaller limit to avoid denial of
service).

-- 
Eric Blake, Principal Software Engineer
Red Hat, Inc.   +1-919-301-3266
Virtualization:  qemu.org | libvirt.org

Re: [RFC PATCH 2/2] qemu-img convert: Fix sparseness detection

2021-12-03 Thread Vladimir Sementsov-Ogievskiy


03.12.2021 14:17, Peter Lieven wrote:

Am 19.05.21 um 18:48 schrieb Kevin Wolf:

Am 19.05.2021 um 15:24 hat Peter Lieven geschrieben:

Am 20.04.21 um 18:52 schrieb Vladimir Sementsov-Ogievskiy:

20.04.2021 18:04, Kevin Wolf wrote:

Am 20.04.2021 um 16:31 hat Vladimir Sementsov-Ogievskiy geschrieben:

15.04.2021 18:22, Kevin Wolf wrote:

In order to avoid RMW cycles, is_allocated_sectors() treats zeroed areas
like non-zero data if the end of the checked area isn't aligned. This
can improve the efficiency of the conversion and was introduced in
commit 8dcd3c9b91a.

However, it comes with a correctness problem: qemu-img convert is
supposed to sparsify areas that contain only zeros, which it doesn't do
any more. It turns out that this even happens when not only the
unaligned area is zeroed, but also the blocks before and after it. In
the bug report, conversion of a fragmented 10G image containing only
zeros resulted in an image consuming 2.82 GiB even though the expected
size is only 4 KiB.

As a tradeoff between both, let's ignore zeroed sectors only after
non-zero data to fix the alignment, but if we're only looking at zeros,
keep them as such, even if it may mean additional RMW cycles.


Hmm.. If I understand correctly, we are going to do unaligned
write-zero. And that helps.

This can happen (mostly raw images on block devices, I think?), but
usually it just means skipping the write because we know that the target
image is already zeroed.

What it does mean is that if the next part is data, we'll have an
unaligned data write.


Doesn't that mean that alignment is wrongly detected?

The problem is that you can have bdrv_block_status_above() return the
same allocation status multiple times in a row, but *pnum can be
unaligned for the conversion.

We only look at a single range returned by it when detecting the
alignment, so it could be that we have zero buffers for both 0-11 and
12-16 and detect two misaligned ranges, when both together are a
perfectly aligned zeroed range.

In theory we could try to do some lookahead and merge ranges where
possible, which should give us the perfect result, but it would make the
code considerably more complicated. (Whether we want to merge them
doesn't only depend on the block status, but possibly also on the
content of a DATA range.)

Kevin


Oh, I understand now the problem, thanks for explanation.

Hmm, yes that means, that if the whole buf is zero, is_allocated_sectors must not align 
it down, to be possibly "merged" with next chunk if it is zero too.

But it's still good to align zeroes down, if data starts somewhere inside the 
buf, isn't it?

what about something like this:

diff --git a/qemu-img.c b/qemu-img.c
index babb5573ab..d1704584a0 100644
--- a/qemu-img.c
+++ b/qemu-img.c
@@ -1167,19 +1167,39 @@ static int is_allocated_sectors(const uint8_t *buf, int 
n, int *pnum,
  }
  }
  
+    if (i == n) {

+    /*
+ * The whole buf is the same.
+ *
+ * if it's data, just return it. It's the old behavior.
+ *
+ * if it's zero, just return too. It will work good if target is alredy
+ * zeroed. And if next chunk is zero too we'll have no RMW and no 
reason
+ * to write data.
+ */
+    *pnum = i;
+    return !is_zero;
+    }
+
  tail = (sector_num + i) & (alignment - 1);
  if (tail) {
  if (is_zero && i <= tail) {
-    /* treat unallocated areas which only consist
- * of a small tail as allocated. */
+    /*
+ * For sure next sector after i is data, and it will rewrite this
+ * tail anyway due to RMW. So, let's just write data now.
+ */
  is_zero = false;
  }
  if (!is_zero) {
-    /* align up end offset of allocated areas. */
+    /* If possible, align up end offset of allocated areas. */
  i += alignment - tail;
  i = MIN(i, n);
  } else {
-    /* align down end offset of zero areas. */
+    /*
+ * For sure next sector after i is data, and it will rewrite this
+ * tail anyway due to RMW. Better is avoid RMW and write zeroes up
+ * to aligned bound.
+ */
  i -= tail;
  }
  }

I think we forgot to follow up on this. Has anyone tested this
suggestion?

Otherwise, I would try to rerun the tests I did with the my old and
Kevins suggestion.

I noticed earlier this week that these patches are still in my
development branch, but didn't actually pick it up again yet. So feel
free to try it out.



It seems this time I forgot to follow up. Is this topic still open?



Most probably yes :) I now checked, that my proposed diff is still applicable 
to master and don't break compilation. So, if you have some test, you can check 
if it works better with the change.

--
Best regards,
Vladimir

[PATCH 07/14] block: document connection between child roles and bs->backing/bs->file

2021-12-03 Thread Vladimir Sementsov-Ogievskiy

Make the informal rules formal. In further commit we'll add
corresponding assertions.

Signed-off-by: Vladimir Sementsov-Ogievskiy 
---
 include/block/block.h | 42 ++
 1 file changed, 42 insertions(+)

diff --git a/include/block/block.h b/include/block/block.h
index f885f113ef..8a3278d4b6 100644
--- a/include/block/block.h
+++ b/include/block/block.h
@@ -290,6 +290,48 @@ enum {
  *
  * At least one of DATA, METADATA, FILTERED, or COW must be set for
  * every child.
+ *
+ *
+ * = Connection with bs->children, bs->file and bs->backing fields =
+ *
+ * 1. Filters
+ *
+ * Filter drivers has drv->is_filter = true.
+ *
+ * Filter driver has exactly one FILTERED|PRIMARY child, any may have other
+ * children which must not have these bits (the example is copy-before-write
+ * filter that also has target DATA child).
+ *
+ * Filter driver never has COW children.
+ *
+ * For all filters except for mirror_top and commit_top, the filtered child is
+ * linked in bs->file, bs->backing is NULL.
+ *
+ * For mirror_top and commit_top filtered child is linked in bs->backing and
+ * their bs->file is NULL. These two filters has drv->filtered_child_is_backing
+ * = true.
+ *
+ * 2. "raw" driver (block/raw-format.c)
+ *
+ * Formally it's not a filter (drv->is_filter = false)
+ *
+ * bs->backing is always NULL
+ *
+ * Only has one child, linked in bs->file. It's role is either FILTERED|PRIMARY
+ * (like filter) either DATA|PRIMARY depending on options.
+ *
+ * 3. Other drivers
+ *
+ * Doesn't have any FILTERED children.
+ *
+ * May have at most one COW child. In this case it's linked in bs->backing.
+ * Otherwise bs->backing is NULL. COW child is never PRIMARY.
+ *
+ * May have at most one PRIMARY child. In this case it's linked in bs->file.
+ * Otherwise bs->file is NULL.
+ *
+ * May also have some other children that don't have neither PRIMARY nor COW
+ * bits set.
  */
 enum BdrvChildRoleBits {
 /*
-- 
2.31.1

[PATCH 14/14] block/snapshot: drop indirection around bdrv_snapshot_fallback_ptr

2021-12-03 Thread Vladimir Sementsov-Ogievskiy

Now the indirection is not actually used, we can safely reduce it to
simple pointer.

Signed-off-by: Vladimir Sementsov-Ogievskiy 
---
 block/snapshot.c | 39 +--
 1 file changed, 17 insertions(+), 22 deletions(-)

diff --git a/block/snapshot.c b/block/snapshot.c
index cb184d70b4..e32f9cb2ad 100644
--- a/block/snapshot.c
+++ b/block/snapshot.c
@@ -148,34 +148,29 @@ bool bdrv_snapshot_find_by_id_and_name(BlockDriverState 
*bs,
 }
 
 /**
- * Return a pointer to the child BDS pointer to which we can fall
+ * Return a pointer to child of given BDS to which we can fall
  * back if the given BDS does not support snapshots.
  * Return NULL if there is no BDS to (safely) fall back to.
- *
- * We need to return an indirect pointer because bdrv_snapshot_goto()
- * has to modify the BdrvChild pointer.
  */
-static BdrvChild **bdrv_snapshot_fallback_ptr(BlockDriverState *bs)
+static BdrvChild *bdrv_snapshot_fallback_ptr(BlockDriverState *bs)
 {
-BdrvChild **fallback;
-BdrvChild *child = bdrv_primary_child(bs);
+BdrvChild *fallback = bdrv_primary_child(bs);
+BdrvChild *child;
 
 /* We allow fallback only to primary child */
-if (!child) {
+if (!fallback) {
 return NULL;
 }
-fallback = (child == bs->file ? >file : >backing);
-assert(*fallback == child);
 
 /*
  * Check that there are no other children that would need to be
  * snapshotted.  If there are, it is not safe to fall back to
- * *fallback.
+ * fallback.
  */
 QLIST_FOREACH(child, >children, next) {
 if (child->role & (BDRV_CHILD_DATA | BDRV_CHILD_METADATA |
BDRV_CHILD_FILTERED) &&
-child != *fallback)
+child != fallback)
 {
 return NULL;
 }
@@ -186,8 +181,8 @@ static BdrvChild 
**bdrv_snapshot_fallback_ptr(BlockDriverState *bs)
 
 static BlockDriverState *bdrv_snapshot_fallback(BlockDriverState *bs)
 {
-BdrvChild **child_ptr = bdrv_snapshot_fallback_ptr(bs);
-return child_ptr ? (*child_ptr)->bs : NULL;
+BdrvChild *child_ptr = bdrv_snapshot_fallback_ptr(bs);
+return child_ptr ? child_ptr->bs : NULL;
 }
 
 int bdrv_can_snapshot(BlockDriverState *bs)
@@ -230,7 +225,7 @@ int bdrv_snapshot_goto(BlockDriverState *bs,
Error **errp)
 {
 BlockDriver *drv = bs->drv;
-BdrvChild **fallback_ptr;
+BdrvChild *fallback;
 int ret, open_ret;
 
 if (!drv) {
@@ -251,13 +246,13 @@ int bdrv_snapshot_goto(BlockDriverState *bs,
 return ret;
 }
 
-fallback_ptr = bdrv_snapshot_fallback_ptr(bs);
-if (fallback_ptr) {
+fallback = bdrv_snapshot_fallback_ptr(bs);
+if (fallback) {
 QDict *options;
 QDict *file_options;
 Error *local_err = NULL;
-BlockDriverState *fallback_bs = (*fallback_ptr)->bs;
-char *subqdict_prefix = g_strdup_printf("%s.", (*fallback_ptr)->name);
+BlockDriverState *fallback_bs = fallback->bs;
+char *subqdict_prefix = g_strdup_printf("%s.", fallback->name);
 
 options = qdict_clone_shallow(bs->options);
 
@@ -268,8 +263,8 @@ int bdrv_snapshot_goto(BlockDriverState *bs,
 qobject_unref(file_options);
 g_free(subqdict_prefix);
 
-/* Force .bdrv_open() below to re-attach fallback_bs on *fallback_ptr 
*/
-qdict_put_str(options, (*fallback_ptr)->name,
+/* Force .bdrv_open() below to re-attach fallback_bs on fallback */
+qdict_put_str(options, fallback->name,
   bdrv_get_node_name(fallback_bs));
 
 /* Now close bs, apply the snapshot on fallback_bs, and re-open bs */
@@ -278,7 +273,7 @@ int bdrv_snapshot_goto(BlockDriverState *bs,
 }
 
 /* .bdrv_open() will re-attach it */
-bdrv_unref_child(bs, *fallback_ptr);
+bdrv_unref_child(bs, fallback);
 
 ret = bdrv_snapshot_goto(fallback_bs, snapshot_id, errp);
 open_ret = drv->bdrv_open(bs, options, bs->open_flags, _err);
-- 
2.31.1

[PATCH 08/14] block/snapshot: stress that we fallback to primary child

2021-12-03 Thread Vladimir Sementsov-Ogievskiy

Actually what we chose is a primary child. Let's stress it in the code.

We are going to drop indirect pointer logic here in future. Actually
this commit simplifies the future work: we drop use of indirection in
the assertion now.

Signed-off-by: Vladimir Sementsov-Ogievskiy 
---
 block/snapshot.c | 30 ++
 1 file changed, 10 insertions(+), 20 deletions(-)

diff --git a/block/snapshot.c b/block/snapshot.c
index ccacda8bd5..12fa0e3904 100644
--- a/block/snapshot.c
+++ b/block/snapshot.c
@@ -158,21 +158,14 @@ bool bdrv_snapshot_find_by_id_and_name(BlockDriverState 
*bs,
 static BdrvChild **bdrv_snapshot_fallback_ptr(BlockDriverState *bs)
 {
 BdrvChild **fallback;
-BdrvChild *child;
+BdrvChild *child = bdrv_primary_child(bs);
 
-/*
- * The only BdrvChild pointers that are safe to modify (and which
- * we can thus return a reference to) are bs->file and
- * bs->backing.
- */
-fallback = >file;
-if (!*fallback && bs->drv && bs->drv->is_filter) {
-fallback = >backing;
-}
-
-if (!*fallback) {
+/* We allow fallback only to primary child */
+if (!child) {
 return NULL;
 }
+fallback = (child == bs->file ? >file : >backing);
+assert(*fallback == child);
 
 /*
  * Check that there are no other children that would need to be
@@ -300,15 +293,12 @@ int bdrv_snapshot_goto(BlockDriverState *bs,
 }
 
 /*
- * fallback_ptr is >file or >backing.  *fallback_ptr
- * was closed above and set to NULL, but the .bdrv_open() call
- * has opened it again, because we set the respective option
- * (with the qdict_put_str() call above).
- * Assert that .bdrv_open() has attached some child on
- * *fallback_ptr, and that it has attached the one we wanted
- * it to (i.e., fallback_bs).
+ * fallback was a primary child. It was closed above and set to NULL,
+ * but the .bdrv_open() call has opened it again, because we set the
+ * respective option (with the qdict_put_str() call above).
+ * Assert that .bdrv_open() has attached some BDS as primary child.
  */
-assert(*fallback_ptr && fallback_bs == (*fallback_ptr)->bs);
+assert(bdrv_primary_bs(bs) == fallback_bs);
 bdrv_unref(fallback_bs);
 return ret;
 }
-- 
2.31.1

[PATCH 06/14] test-bdrv-graph-mod: fix filters to be filters

2021-12-03 Thread Vladimir Sementsov-Ogievskiy

bdrv_pass_through is used as filter, even all node variables has
corresponding names. We want to append it, so it should be
backing-child-based filter like mirror_top.
So, in test_update_perm_tree, first child should be DATA, as we don't
want filters with two filtered children.

bdrv_exclusive_writer is used as a filter once. So it should be filter
anyway. We want to append it, so it should be backing-child-based
fitler too.

Make all FILTERED children to be PRIMARY as well. We are going to force
this rule by assertion soon.

Signed-off-by: Vladimir Sementsov-Ogievskiy 
---
 include/block/block_int.h|  5 +++--
 tests/unit/test-bdrv-graph-mod.c | 24 +---
 2 files changed, 20 insertions(+), 9 deletions(-)

diff --git a/include/block/block_int.h b/include/block/block_int.h
index 9c06f8816e..919e33de5f 100644
--- a/include/block/block_int.h
+++ b/include/block/block_int.h
@@ -121,8 +121,9 @@ struct BlockDriver {
 /*
  * Only make sense for filter drivers, for others must be false.
  * If true, filtered child is bs->backing. Otherwise it's bs->file.
- * Only two internal filters use bs->backing as filtered child and has this
- * field set to true: mirror_top and commit_top.
+ * Two internal filters use bs->backing as filtered child and has this
+ * field set to true: mirror_top and commit_top. There also two such test
+ * filters in tests/unit/test-bdrv-graph-mod.c.
  *
  * Never create any more such filters!
  *
diff --git a/tests/unit/test-bdrv-graph-mod.c b/tests/unit/test-bdrv-graph-mod.c
index 40795d3c04..7265971013 100644
--- a/tests/unit/test-bdrv-graph-mod.c
+++ b/tests/unit/test-bdrv-graph-mod.c
@@ -26,6 +26,8 @@
 
 static BlockDriver bdrv_pass_through = {
 .format_name = "pass-through",
+.is_filter = true,
+.filtered_child_is_backing = true,
 .bdrv_child_perm = bdrv_default_perms,
 };
 
@@ -57,6 +59,8 @@ static void exclusive_write_perms(BlockDriverState *bs, 
BdrvChild *c,
 
 static BlockDriver bdrv_exclusive_writer = {
 .format_name = "exclusive-writer",
+.is_filter = true,
+.filtered_child_is_backing = true,
 .bdrv_child_perm = exclusive_write_perms,
 };
 
@@ -134,7 +138,7 @@ static void test_update_perm_tree(void)
 blk_insert_bs(root, bs, _abort);
 
 bdrv_attach_child(filter, bs, "child", _of_bds,
-  BDRV_CHILD_FILTERED | BDRV_CHILD_PRIMARY, _abort);
+  BDRV_CHILD_DATA, _abort);
 
 ret = bdrv_append(filter, bs, NULL);
 g_assert_cmpint(ret, <, 0);
@@ -228,11 +232,14 @@ static void test_parallel_exclusive_write(void)
  */
 bdrv_ref(base);
 
-bdrv_attach_child(top, fl1, "backing", _of_bds, BDRV_CHILD_DATA,
+bdrv_attach_child(top, fl1, "backing", _of_bds,
+  BDRV_CHILD_FILTERED | BDRV_CHILD_PRIMARY,
   _abort);
-bdrv_attach_child(fl1, base, "backing", _of_bds, BDRV_CHILD_FILTERED,
+bdrv_attach_child(fl1, base, "backing", _of_bds,
+  BDRV_CHILD_FILTERED | BDRV_CHILD_PRIMARY,
   _abort);
-bdrv_attach_child(fl2, base, "backing", _of_bds, BDRV_CHILD_FILTERED,
+bdrv_attach_child(fl2, base, "backing", _of_bds,
+  BDRV_CHILD_FILTERED | BDRV_CHILD_PRIMARY,
   _abort);
 
 bdrv_replace_node(fl1, fl2, _abort);
@@ -344,9 +351,11 @@ static void test_parallel_perm_update(void)
   BDRV_CHILD_DATA, _abort);
 c_fl2 = bdrv_attach_child(ws, fl2, "second", _of_bds,
   BDRV_CHILD_DATA, _abort);
-bdrv_attach_child(fl1, base, "backing", _of_bds, BDRV_CHILD_FILTERED,
+bdrv_attach_child(fl1, base, "backing", _of_bds,
+  BDRV_CHILD_FILTERED | BDRV_CHILD_PRIMARY,
   _abort);
-bdrv_attach_child(fl2, base, "backing", _of_bds, BDRV_CHILD_FILTERED,
+bdrv_attach_child(fl2, base, "backing", _of_bds,
+  BDRV_CHILD_FILTERED | BDRV_CHILD_PRIMARY,
   _abort);
 
 /* Select fl1 as first child to be active */
@@ -397,7 +406,8 @@ static void test_append_greedy_filter(void)
 BlockDriverState *base = no_perm_node("base");
 BlockDriverState *fl = exclusive_writer_node("fl1");
 
-bdrv_attach_child(top, base, "backing", _of_bds, BDRV_CHILD_COW,
+bdrv_attach_child(top, base, "backing", _of_bds,
+  BDRV_CHILD_FILTERED | BDRV_CHILD_PRIMARY,
   _abort);
 
 bdrv_append(fl, base, _abort);
-- 
2.31.1

[PATCH 12/14] Revert "block: Pass BdrvChild ** to replace_child_noperm"

2021-12-03 Thread Vladimir Sementsov-Ogievskiy

That's a preparation to previously reverted
"block: Let replace_child_noperm free children". Drop it too, we don't
need it for a new approach.

This reverts commit be64bbb0149748f3999c49b13976aafb8330ea86.

Signed-off-by: Vladimir Sementsov-Ogievskiy 
---
 block.c | 23 +++
 1 file changed, 11 insertions(+), 12 deletions(-)

diff --git a/block.c b/block.c
index 2ba95f71b9..d57d7a80ab 100644
--- a/block.c
+++ b/block.c
@@ -87,7 +87,7 @@ static BlockDriverState *bdrv_open_inherit(const char 
*filename,
 static bool bdrv_recurse_has_child(BlockDriverState *bs,
BlockDriverState *child);
 
-static void bdrv_replace_child_noperm(BdrvChild **child,
+static void bdrv_replace_child_noperm(BdrvChild *child,
   BlockDriverState *new_bs);
 static void bdrv_remove_file_or_backing_child(BlockDriverState *bs,
   BdrvChild *child,
@@ -2270,7 +2270,7 @@ static void bdrv_replace_child_abort(void *opaque)
 BlockDriverState *new_bs = s->child->bs;
 
 /* old_bs reference is transparently moved from @s to @s->child */
-bdrv_replace_child_noperm(>child, s->old_bs);
+bdrv_replace_child_noperm(s->child, s->old_bs);
 bdrv_unref(new_bs);
 }
 
@@ -2300,7 +2300,7 @@ static void bdrv_replace_child_tran(BdrvChild *child, 
BlockDriverState *new_bs,
 if (new_bs) {
 bdrv_ref(new_bs);
 }
-bdrv_replace_child_noperm(, new_bs);
+bdrv_replace_child_noperm(child, new_bs);
 /* old_bs reference is transparently moved from @child to @s */
 }
 
@@ -2672,10 +2672,9 @@ uint64_t bdrv_qapi_perm_to_blk_perm(BlockPermission 
qapi_perm)
 return permissions[qapi_perm];
 }
 
-static void bdrv_replace_child_noperm(BdrvChild **childp,
+static void bdrv_replace_child_noperm(BdrvChild *child,
   BlockDriverState *new_bs)
 {
-BdrvChild *child = *childp;
 BlockDriverState *old_bs = child->bs;
 int new_bs_quiesce_counter;
 int drain_saldo;
@@ -2768,7 +2767,7 @@ static void bdrv_attach_child_common_abort(void *opaque)
 BdrvChild *child = *s->child;
 BlockDriverState *bs = child->bs;
 
-bdrv_replace_child_noperm(s->child, NULL);
+bdrv_replace_child_noperm(child, NULL);
 
 if (bdrv_get_aio_context(bs) != s->old_child_ctx) {
 bdrv_try_set_aio_context(bs, s->old_child_ctx, _abort);
@@ -2868,7 +2867,7 @@ static int bdrv_attach_child_common(BlockDriverState 
*child_bs,
 }
 
 bdrv_ref(child_bs);
-bdrv_replace_child_noperm(_child, child_bs);
+bdrv_replace_child_noperm(new_child, child_bs);
 
 *child = new_child;
 
@@ -2923,12 +2922,12 @@ static int bdrv_attach_child_noperm(BlockDriverState 
*parent_bs,
 return 0;
 }
 
-static void bdrv_detach_child(BdrvChild **childp)
+static void bdrv_detach_child(BdrvChild *child)
 {
-BlockDriverState *old_bs = (*childp)->bs;
+BlockDriverState *old_bs = child->bs;
 
-bdrv_replace_child_noperm(childp, NULL);
-bdrv_child_free(*childp);
+bdrv_replace_child_noperm(child, NULL);
+bdrv_child_free(child);
 
 if (old_bs) {
 /*
@@ -3034,7 +3033,7 @@ void bdrv_root_unref_child(BdrvChild *child)
 BlockDriverState *child_bs;
 
 child_bs = child->bs;
-bdrv_detach_child();
+bdrv_detach_child(child);
 bdrv_unref(child_bs);
 }
 
-- 
2.31.1

[PATCH 04/14] test-bdrv-graph-mod: update test_parallel_perm_update test case

2021-12-03 Thread Vladimir Sementsov-Ogievskiy

test_parallel_perm_update() does two things that we are going to
restrict in the near future:

1. It updates bs->file field by hand. bs->file will be managed
   automatically by generic code (together with bs->children list).

   Let's better refactor our "tricky" bds to have own state where one
   of children is linked as "selected".
   This also looks less "tricky", so avoid using this word.

2. It create FILTERED children that are not PRIMARY. Except for tests
   all FILTERED children in the Qemu block layer are always PRIMARY as
   well.  We are going to formalize this rule, so let's better use DATA
   children here.

While being here, update the picture to better correspond to the test
code.

Signed-off-by: Vladimir Sementsov-Ogievskiy 
---
 tests/unit/test-bdrv-graph-mod.c | 70 
 1 file changed, 44 insertions(+), 26 deletions(-)

diff --git a/tests/unit/test-bdrv-graph-mod.c b/tests/unit/test-bdrv-graph-mod.c
index a6e3bb79be..40795d3c04 100644
--- a/tests/unit/test-bdrv-graph-mod.c
+++ b/tests/unit/test-bdrv-graph-mod.c
@@ -241,13 +241,26 @@ static void test_parallel_exclusive_write(void)
 bdrv_unref(top);
 }
 
-static void write_to_file_perms(BlockDriverState *bs, BdrvChild *c,
- BdrvChildRole role,
- BlockReopenQueue *reopen_queue,
- uint64_t perm, uint64_t shared,
- uint64_t *nperm, uint64_t *nshared)
+/*
+ * write-to-selected node may have several DATA children, one of them may be
+ * "selected". Exclusive write permission is taken on selected child.
+ *
+ * We don't realize write handler itself, as we need only to test how 
permission
+ * update works.
+ */
+typedef struct BDRVWriteToSelectedState {
+BdrvChild *selected;
+} BDRVWriteToSelectedState;
+
+static void write_to_selected_perms(BlockDriverState *bs, BdrvChild *c,
+BdrvChildRole role,
+BlockReopenQueue *reopen_queue,
+uint64_t perm, uint64_t shared,
+uint64_t *nperm, uint64_t *nshared)
 {
-if (bs->file && c == bs->file) {
+BDRVWriteToSelectedState *s = bs->opaque;
+
+if (s->selected && c == s->selected) {
 *nperm = BLK_PERM_WRITE;
 *nshared = BLK_PERM_ALL & ~BLK_PERM_WRITE;
 } else {
@@ -256,9 +269,10 @@ static void write_to_file_perms(BlockDriverState *bs, 
BdrvChild *c,
 }
 }
 
-static BlockDriver bdrv_write_to_file = {
-.format_name = "tricky-perm",
-.bdrv_child_perm = write_to_file_perms,
+static BlockDriver bdrv_write_to_selected = {
+.format_name = "write-to-selected",
+.instance_size = sizeof(BDRVWriteToSelectedState),
+.bdrv_child_perm = write_to_selected_perms,
 };
 
 
@@ -266,15 +280,18 @@ static BlockDriver bdrv_write_to_file = {
  * The following test shows that topological-sort order is required for
  * permission update, simple DFS is not enough.
  *
- * Consider the block driver which has two filter children: one active
- * with exclusive write access and one inactive with no specific
- * permissions.
+ * Consider the block driver (write-to-selected) which has two children: one is
+ * selected so we have exclusive write access to it and for the other one we
+ * don't need any specific permissions.
  *
  * And, these two children has a common base child, like this:
+ *   (additional "top" on top is used in test just because the only public
+ *function to update permission should get a specific child to update.
+ *Making bdrv_refresh_perms() public just for this test doesn't worth it)
  *
- * ┌─┐ ┌──┐
- * │ fl2 │ ◀── │ top  │
- * └─┘ └──┘
+ * ┌─┐ ┌───┐ ┌─┐
+ * │ fl2 │ ◀── │ write-to-selected │ ◀── │ top │
+ * └─┘ └───┘ └─┘
  *   │   │
  *   │   │ w
  *   │   ▼
@@ -290,7 +307,7 @@ static BlockDriver bdrv_write_to_file = {
  *
  * So, exclusive write is propagated.
  *
- * Assume, we want to make fl2 active instead of fl1.
+ * Assume, we want to select fl2  instead of fl1.
  * So, we set some option for top driver and do permission update.
  *
  * With simple DFS, if permission update goes first through
@@ -306,9 +323,10 @@ static BlockDriver bdrv_write_to_file = {
 static void test_parallel_perm_update(void)
 {
 BlockDriverState *top = no_perm_node("top");
-BlockDriverState *tricky =
-bdrv_new_open_driver(_write_to_file, "tricky", BDRV_O_RDWR,
+BlockDriverState *ws =
+bdrv_new_open_driver(_write_to_selected, "ws", BDRV_O_RDWR,
  _abort);
+BDRVWriteToSelectedState *s = ws->opaque;
 BlockDriverState *base = no_perm_node("base");
 BlockDriverState *fl1 = pass_through_node("fl1");
 BlockDriverState *fl2 = pass_through_node("fl2");
@@ -320,33 +338,33

[PATCH 03/14] block/blklogwrites: don't care to remove bs->file child on failure

2021-12-03 Thread Vladimir Sementsov-Ogievskiy

We don't need to remove bs->file, generic layer takes care of it. No
other driver cares to remove bs->file on failure by hand.

Signed-off-by: Vladimir Sementsov-Ogievskiy 
---
 block/blklogwrites.c | 4 
 1 file changed, 4 deletions(-)

diff --git a/block/blklogwrites.c b/block/blklogwrites.c
index f66a617eb3..7d25df97cc 100644
--- a/block/blklogwrites.c
+++ b/block/blklogwrites.c
@@ -254,10 +254,6 @@ fail_log:
 s->log_file = NULL;
 }
 fail:
-if (ret < 0) {
-bdrv_unref_child(bs, bs->file);
-bs->file = NULL;
-}
 qemu_opts_del(opts);
 return ret;
 }
-- 
2.31.1

[PATCH 02/14] block: introduce bdrv_open_file_child() helper

2021-12-03 Thread Vladimir Sementsov-Ogievskiy

Almost all drivers call bdrv_open_child() similarly. Let's create a
helper for this.

The only not updated driver that call bdrv_open_child() to set
bs->file is raw-format, as it sometimes want to have filtered child but
don't set drv->is_filter to true.

Possibly we should implement drv->is_filter_func() handler, to consider
raw-format as filter when it works as filter.. But it's another story.

Note also, that we decrease assignments to bs->file in code: it helps
us restrict modifying this field in further commit.

Signed-off-by: Vladimir Sementsov-Ogievskiy 
---
 include/block/block.h |  3 +++
 block.c   | 21 +
 block/blkdebug.c  |  9 +++--
 block/blklogwrites.c  |  7 ++-
 block/blkreplay.c |  7 ++-
 block/blkverify.c |  9 +++--
 block/bochs.c |  7 +++
 block/cloop.c |  7 +++
 block/copy-before-write.c |  9 -
 block/copy-on-read.c  |  9 -
 block/crypto.c| 11 ++-
 block/dmg.c   |  7 +++
 block/filter-compress.c   |  6 ++
 block/parallels.c |  7 +++
 block/preallocate.c   |  9 -
 block/qcow.c  |  6 ++
 block/qcow2.c |  8 
 block/qed.c   |  8 
 block/replication.c   |  8 +++-
 block/throttle.c  |  8 +++-
 block/vdi.c   |  7 +++
 block/vhdx.c  |  7 +++
 block/vmdk.c  |  7 +++
 block/vpc.c   |  7 +++
 24 files changed, 94 insertions(+), 100 deletions(-)

diff --git a/include/block/block.h b/include/block/block.h
index e5dd22b034..f885f113ef 100644
--- a/include/block/block.h
+++ b/include/block/block.h
@@ -376,6 +376,9 @@ BdrvChild *bdrv_open_child(const char *filename,
const BdrvChildClass *child_class,
BdrvChildRole child_role,
bool allow_none, Error **errp);
+int bdrv_open_file_child(const char *filename,
+ QDict *options, const char *bdref_key,
+ BlockDriverState *parent, Error **errp);
 BlockDriverState *bdrv_open_blockdev_ref(BlockdevRef *ref, Error **errp);
 int bdrv_set_backing_hd(BlockDriverState *bs, BlockDriverState *backing_hd,
 Error **errp);
diff --git a/block.c b/block.c
index 0ac5b163d2..a97720c5eb 100644
--- a/block.c
+++ b/block.c
@@ -3546,6 +3546,27 @@ BdrvChild *bdrv_open_child(const char *filename,
  errp);
 }
 
+/*
+ * Wrapper on bdrv_open_child() for most popular case: open primary child of 
bs.
+ */
+int bdrv_open_file_child(const char *filename,
+ QDict *options, const char *bdref_key,
+ BlockDriverState *parent, Error **errp)
+{
+BdrvChildRole role;
+
+/* commit_top and mirror_top don't use this function */
+assert(!parent->drv->filtered_child_is_backing);
+
+role = parent->drv->is_filter ?
+(BDRV_CHILD_FILTERED | BDRV_CHILD_PRIMARY) : BDRV_CHILD_IMAGE;
+
+parent->file = bdrv_open_child(filename, options, bdref_key, parent,
+   _of_bds, role, false, errp);
+
+return parent->file ? 0 : -EINVAL;
+}
+
 /*
  * TODO Future callers may need to specify parent/child_class in order for
  * option inheritance to work. Existing callers use it for the root node.
diff --git a/block/blkdebug.c b/block/blkdebug.c
index bbf2948703..5fcfc8ac6f 100644
--- a/block/blkdebug.c
+++ b/block/blkdebug.c
@@ -503,12 +503,9 @@ static int blkdebug_open(BlockDriverState *bs, QDict 
*options, int flags,
 }
 
 /* Open the image file */
-bs->file = bdrv_open_child(qemu_opt_get(opts, "x-image"), options, "image",
-   bs, _of_bds,
-   BDRV_CHILD_FILTERED | BDRV_CHILD_PRIMARY,
-   false, errp);
-if (!bs->file) {
-ret = -EINVAL;
+ret = bdrv_open_file_child(qemu_opt_get(opts, "x-image"), options, "image",
+   bs, errp);
+if (ret < 0) {
 goto out;
 }
 
diff --git a/block/blklogwrites.c b/block/blklogwrites.c
index f7a251e91f..f66a617eb3 100644
--- a/block/blklogwrites.c
+++ b/block/blklogwrites.c
@@ -155,11 +155,8 @@ static int blk_log_writes_open(BlockDriverState *bs, QDict 
*options, int flags,
 }
 
 /* Open the file */
-bs->file = bdrv_open_child(NULL, options, "file", bs, _of_bds,
-   BDRV_CHILD_FILTERED | BDRV_CHILD_PRIMARY, false,
-   errp);
-if (!bs->file) {
-ret = -EINVAL;
+ret = bdrv_open_file_child(NULL, options, "file", bs, errp);
+if (ret < 0) {
 goto fail;
 }
 
diff --git a/block/blkreplay.c b/block/blkreplay.c
index dcbe780ddb..76a0b8d12a 100644
--- a/block/blkreplay.c
+++ b/block/blkreplay.c
@@ -26,11 +26,8 @@ static int

[PATCH 13/14] block: Manipulate bs->file / bs->backing pointers in .attach/.detach

2021-12-03 Thread Vladimir Sementsov-Ogievskiy

bs->file and bs->backing are a kind of duplication of part of
bs->children. But very useful diplication, so let's not drop them at
all:)

We should manage bs->file and bs->backing in same place, where we
manage bs->children, to keep them in sync.

Moreover, generic io paths are unprepared to BdrvChild without a bs, so
it's double good to clear bs->file / bs->backing when we detach the
child.

Detach is simple: if we detach bs->file or bs->backing child, just
set corresponding field to NULL.

Attach is a bit more complicated. But we still can precisely detect
should we set one of bs->file / bs->backing or not:

- if role is BDRV_CHILD_COW, we definitely deal with bs->backing
- else, if role is BDRV_CHILD_FILTERED (it must be also
  BDRV_CHILD_PRIMARY), it's a filtered child. Use
  bs->drv->filtered_child_is_backing to chose the pointer field to
  modify.
- else, if role is BDRV_CHILD_PRIMARY, we deal with bs->file
- in all other cases, it's neither bs->backing nor bs->file. It's some
  other child and we shouldn't care

OK. This change brings one more good thing: we can (and should) get rid
of all indirect pointers in the block-graph-change transactions:

bdrv_attach_child_common() stores BdrvChild** into transaction to clear
it on abort.

bdrv_attach_child_common() has two callers: bdrv_attach_child_noperm()
just pass-through this feature, bdrv_root_attach_child() doesn't need
the feature.

Look at bdrv_attach_child_noperm() callers:
  - bdrv_attach_child() doesn't need the feature
  - bdrv_set_file_or_backing_noperm() uses the feature to manage
bs->file and bs->backing, we don't want it anymore
  - bdrv_append() uses the feature to manage bs->backing, again we
don't want it anymore

So, we should drop this stuff! Great!

We still keep BdrvChild** argument to return the child and int return
value, and not move to simply returning BdrvChild*, as we don't want to
lose int return values.

However we don't require *@child to be NULL anymore, and even allow
@child to be NULL, if caller don't need the new child pointer.

Finally, we now set .file / .backing automatically in generic code and
want to restring setting them by hand outside of .attach/.detach.
So, this patch cleanups all remaining places where they were set.
To find such places I use:

  git grep '\->file ='
  git grep '\->backing ='
  git grep '&.*\'
  git grep '&.*\'

Signed-off-by: Vladimir Sementsov-Ogievskiy 
---
 include/block/block_int.h|  15 +++-
 block.c  | 155 ---
 block/raw-format.c   |   4 +-
 block/snapshot.c |   1 -
 tests/unit/test-bdrv-drain.c |  10 +--
 5 files changed, 88 insertions(+), 97 deletions(-)

diff --git a/include/block/block_int.h b/include/block/block_int.h
index 919e33de5f..4ea800e589 100644
--- a/include/block/block_int.h
+++ b/include/block/block_int.h
@@ -937,9 +937,6 @@ struct BlockDriverState {
 QDict *full_open_options;
 char exact_filename[PATH_MAX];
 
-BdrvChild *backing;
-BdrvChild *file;
-
 /* I/O Limits */
 BlockLimits bl;
 
@@ -992,7 +989,19 @@ struct BlockDriverState {
  * which can affect this node by changing these defaults). This is always a
  * parent node of this node. */
 BlockDriverState *inherits_from;
+
+/*
+ * @backing and @file are some of @children or NULL. All these three fields
+ * (@file, @backing and @children) are modified only in
+ * bdrv_child_cb_attach() and bdrv_child_cb_detach().
+ *
+ * See also comment in include/block/block.h, to learn how backing and file
+ * are connected with BdrvChildRole.
+ */
 QLIST_HEAD(, BdrvChild) children;
+BdrvChild *backing;
+BdrvChild *file;
+
 QLIST_HEAD(, BdrvChild) parents;
 
 QDict *options;
diff --git a/block.c b/block.c
index d57d7a80ab..0c6bbc9b0b 100644
--- a/block.c
+++ b/block.c
@@ -1388,9 +1388,33 @@ static void bdrv_child_cb_attach(BdrvChild *child)
 BlockDriverState *bs = child->opaque;
 
 QLIST_INSERT_HEAD(>children, child, next);
+if (bs->drv->is_filter | (child->role & BDRV_CHILD_FILTERED)) {
+/*
+ * Here we handle filters and block/raw-format.c when it behave like
+ * filter.
+ */
+assert(!(child->role & BDRV_CHILD_COW));
+if (child->role & (BDRV_CHILD_PRIMARY | BDRV_CHILD_FILTERED)) {
+assert(child->role & BDRV_CHILD_PRIMARY);
+assert(child->role & BDRV_CHILD_FILTERED);
+assert(!bs->backing);
+assert(!bs->file);
 
-if (child->role & BDRV_CHILD_COW) {
+if (bs->drv->filtered_child_is_backing) {
+bs->backing = child;
+} else {
+bs->file = child;
+}
+}
+} else if (child->role & BDRV_CHILD_COW) {
+assert(bs->drv->supports_backing);
+assert(!(child->role & BDRV_CHILD_PRIMARY));
+assert(!bs->backing);
+bs->backing = child;

[PATCH 05/14] tests-bdrv-drain: bdrv_replace_test driver: declare supports_backing

2021-12-03 Thread Vladimir Sementsov-Ogievskiy

We do add COW child to the node.  In future we are going to forbid
adding COW child to the node that doesn't support backing. So, fix it
here now.

Don't worry about setting bs->backing itself: it further commit we'll
update the block-layer to automatically set/unset this field in generic
code.

Signed-off-by: Vladimir Sementsov-Ogievskiy 
---
 tests/unit/test-bdrv-drain.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/tests/unit/test-bdrv-drain.c b/tests/unit/test-bdrv-drain.c
index 2d3c17e566..45edbd867f 100644
--- a/tests/unit/test-bdrv-drain.c
+++ b/tests/unit/test-bdrv-drain.c
@@ -1944,6 +1944,7 @@ static void coroutine_fn 
bdrv_replace_test_co_drain_end(BlockDriverState *bs)
 static BlockDriver bdrv_replace_test = {
 .format_name= "replace_test",
 .instance_size  = sizeof(BDRVReplaceTestState),
+.supports_backing   = true,
 
 .bdrv_close = bdrv_replace_test_close,
 .bdrv_co_preadv = bdrv_replace_test_co_preadv,
-- 
2.31.1

[PATCH 11/14] Revert "block: Restructure remove_file_or_backing_child()"

2021-12-03 Thread Vladimir Sementsov-Ogievskiy

That's a preparation to previously reverted
"block: Let replace_child_noperm free children". Drop it too, we don't
need it for a new approach.

This reverts commit 562bda8bb41879eeda0bd484dd3d55134579b28e.

Signed-off-by: Vladimir Sementsov-Ogievskiy 
---
 block.c | 21 +
 1 file changed, 9 insertions(+), 12 deletions(-)

diff --git a/block.c b/block.c
index 3eec53796b..2ba95f71b9 100644
--- a/block.c
+++ b/block.c
@@ -4908,33 +4908,30 @@ static void 
bdrv_remove_file_or_backing_child(BlockDriverState *bs,
   BdrvChild *child,
   Transaction *tran)
 {
-BdrvChild **childp;
 BdrvRemoveFilterOrCowChild *s;
 
+assert(child == bs->backing || child == bs->file);
+
 if (!child) {
 return;
 }
 
-if (child == bs->backing) {
-childp = >backing;
-} else if (child == bs->file) {
-childp = >file;
-} else {
-g_assert_not_reached();
-}
-
 if (child->bs) {
-bdrv_replace_child_tran(*childp, NULL, tran);
+bdrv_replace_child_tran(child, NULL, tran);
 }
 
 s = g_new(BdrvRemoveFilterOrCowChild, 1);
 *s = (BdrvRemoveFilterOrCowChild) {
 .child = child,
-.is_backing = (childp == >backing),
+.is_backing = (child == bs->backing),
 };
 tran_add(tran, _remove_filter_or_cow_child_drv, s);
 
-*childp = NULL;
+if (s->is_backing) {
+bs->backing = NULL;
+} else {
+bs->file = NULL;
+}
 }
 
 /*
-- 
2.31.1

[PATCH 09/14] Revert "block: Let replace_child_noperm free children"

2021-12-03 Thread Vladimir Sementsov-Ogievskiy

We are going to reimplement this behavior (clear bs->file / bs->backing
pointers automatically when child->bs is cleared) in a nicer way.

This reverts commit b0a9f6fed3d80de610dcd04a7e66f9f30a04174f.

Signed-off-by: Vladimir Sementsov-Ogievskiy 
---
 block.c | 102 +---
 1 file changed, 23 insertions(+), 79 deletions(-)

diff --git a/block.c b/block.c
index a97720c5eb..69c20c729a 100644
--- a/block.c
+++ b/block.c
@@ -87,10 +87,8 @@ static BlockDriverState *bdrv_open_inherit(const char 
*filename,
 static bool bdrv_recurse_has_child(BlockDriverState *bs,
BlockDriverState *child);
 
-static void bdrv_child_free(BdrvChild *child);
 static void bdrv_replace_child_noperm(BdrvChild **child,
-  BlockDriverState *new_bs,
-  bool free_empty_child);
+  BlockDriverState *new_bs);
 static void bdrv_remove_file_or_backing_child(BlockDriverState *bs,
   BdrvChild *child,
   Transaction *tran);
@@ -2258,16 +2256,12 @@ typedef struct BdrvReplaceChildState {
 BdrvChild *child;
 BdrvChild **childp;
 BlockDriverState *old_bs;
-bool free_empty_child;
 } BdrvReplaceChildState;
 
 static void bdrv_replace_child_commit(void *opaque)
 {
 BdrvReplaceChildState *s = opaque;
 
-if (s->free_empty_child && !s->child->bs) {
-bdrv_child_free(s->child);
-}
 bdrv_unref(s->old_bs);
 }
 
@@ -2284,26 +2278,22 @@ static void bdrv_replace_child_abort(void *opaque)
  * modify the BdrvChild * pointer we indirectly pass to it, i.e. it
  * will not modify s->child.  From that perspective, it does not matter
  * whether we pass s->childp or >child.
+ * (TODO: Right now, bdrv_replace_child_noperm() never modifies that
+ * pointer anyway (though it will in the future), so at this point it
+ * absolutely does not matter whether we pass s->childp or >child.)
  * (2) If new_bs is not NULL, s->childp will be NULL.  We then cannot use
  * it here.
  * (3) If new_bs is NULL, *s->childp will have been NULLed by
  * bdrv_replace_child_tran()'s bdrv_replace_child_noperm() call, and we
  * must not pass a NULL *s->childp here.
+ * (TODO: In its current state, bdrv_replace_child_noperm() will not
+ * have NULLed *s->childp, so this does not apply yet.  It will in the
+ * future.)
  *
  * So whether new_bs was NULL or not, we cannot pass s->childp here; and in
  * any case, there is no reason to pass it anyway.
  */
-bdrv_replace_child_noperm(>child, s->old_bs, true);
-/*
- * The child was pre-existing, so s->old_bs must be non-NULL, and
- * s->child thus must not have been freed
- */
-assert(s->child != NULL);
-if (!new_bs) {
-/* As described above, *s->childp was cleared, so restore it */
-assert(s->childp != NULL);
-*s->childp = s->child;
-}
+bdrv_replace_child_noperm(>child, s->old_bs);
 bdrv_unref(new_bs);
 }
 
@@ -2320,44 +2310,30 @@ static TransactionActionDrv bdrv_replace_child_drv = {
  *
  * The function doesn't update permissions, caller is responsible for this.
  *
- * (*childp)->bs must not be NULL.
- *
  * Note that if new_bs == NULL, @childp is stored in a state object attached
  * to @tran, so that the old child can be reinstated in the abort handler.
  * Therefore, if @new_bs can be NULL, @childp must stay valid until the
  * transaction is committed or aborted.
  *
- * If @free_empty_child is true and @new_bs is NULL, the BdrvChild is
- * freed (on commit).  @free_empty_child should only be false if the
- * caller will free the BDrvChild themselves (which may be important
- * if this is in turn called in another transactional context).
+ * (TODO: The reinstating does not happen yet, but it will once
+ * bdrv_replace_child_noperm() NULLs *childp when new_bs is NULL.)
  */
 static void bdrv_replace_child_tran(BdrvChild **childp,
 BlockDriverState *new_bs,
-Transaction *tran,
-bool free_empty_child)
+Transaction *tran)
 {
 BdrvReplaceChildState *s = g_new(BdrvReplaceChildState, 1);
 *s = (BdrvReplaceChildState) {
 .child = *childp,
 .childp = new_bs == NULL ? childp : NULL,
 .old_bs = (*childp)->bs,
-.free_empty_child = free_empty_child,
 };
 tran_add(tran, _replace_child_drv, s);
 
-/* The abort handler relies on this */
-assert(s->old_bs != NULL);
-
 if (new_bs) {
 bdrv_ref(new_bs);
 }
-/*
- * Pass free_empty_child=false, we will free the child (if
- * necessary) in bdrv_replace_child_commit() (if our
- *

[PATCH 10/14] Revert "block: Let replace_child_tran keep indirect pointer"

2021-12-03 Thread Vladimir Sementsov-Ogievskiy

That's a preparation to previously reverted
"block: Let replace_child_noperm free children". Drop it too, we don't
need it for a new approach.

This reverts commit 82b54cf51656bf3cd5ed1ac549e8a1085a0e3290.

Signed-off-by: Vladimir Sementsov-Ogievskiy 
---
 block.c | 83 +++--
 1 file changed, 10 insertions(+), 73 deletions(-)

diff --git a/block.c b/block.c
index 69c20c729a..3eec53796b 100644
--- a/block.c
+++ b/block.c
@@ -2254,7 +2254,6 @@ static int bdrv_drv_set_perm(BlockDriverState *bs, 
uint64_t perm,
 
 typedef struct BdrvReplaceChildState {
 BdrvChild *child;
-BdrvChild **childp;
 BlockDriverState *old_bs;
 } BdrvReplaceChildState;
 
@@ -2270,29 +2269,7 @@ static void bdrv_replace_child_abort(void *opaque)
 BdrvReplaceChildState *s = opaque;
 BlockDriverState *new_bs = s->child->bs;
 
-/*
- * old_bs reference is transparently moved from @s to s->child.
- *
- * Pass >child here instead of s->childp, because:
- * (1) s->old_bs must be non-NULL, so bdrv_replace_child_noperm() will not
- * modify the BdrvChild * pointer we indirectly pass to it, i.e. it
- * will not modify s->child.  From that perspective, it does not matter
- * whether we pass s->childp or >child.
- * (TODO: Right now, bdrv_replace_child_noperm() never modifies that
- * pointer anyway (though it will in the future), so at this point it
- * absolutely does not matter whether we pass s->childp or >child.)
- * (2) If new_bs is not NULL, s->childp will be NULL.  We then cannot use
- * it here.
- * (3) If new_bs is NULL, *s->childp will have been NULLed by
- * bdrv_replace_child_tran()'s bdrv_replace_child_noperm() call, and we
- * must not pass a NULL *s->childp here.
- * (TODO: In its current state, bdrv_replace_child_noperm() will not
- * have NULLed *s->childp, so this does not apply yet.  It will in the
- * future.)
- *
- * So whether new_bs was NULL or not, we cannot pass s->childp here; and in
- * any case, there is no reason to pass it anyway.
- */
+/* old_bs reference is transparently moved from @s to @s->child */
 bdrv_replace_child_noperm(>child, s->old_bs);
 bdrv_unref(new_bs);
 }
@@ -2309,32 +2286,22 @@ static TransactionActionDrv bdrv_replace_child_drv = {
  * Note: real unref of old_bs is done only on commit.
  *
  * The function doesn't update permissions, caller is responsible for this.
- *
- * Note that if new_bs == NULL, @childp is stored in a state object attached
- * to @tran, so that the old child can be reinstated in the abort handler.
- * Therefore, if @new_bs can be NULL, @childp must stay valid until the
- * transaction is committed or aborted.
- *
- * (TODO: The reinstating does not happen yet, but it will once
- * bdrv_replace_child_noperm() NULLs *childp when new_bs is NULL.)
  */
-static void bdrv_replace_child_tran(BdrvChild **childp,
-BlockDriverState *new_bs,
+static void bdrv_replace_child_tran(BdrvChild *child, BlockDriverState *new_bs,
 Transaction *tran)
 {
 BdrvReplaceChildState *s = g_new(BdrvReplaceChildState, 1);
 *s = (BdrvReplaceChildState) {
-.child = *childp,
-.childp = new_bs == NULL ? childp : NULL,
-.old_bs = (*childp)->bs,
+.child = child,
+.old_bs = child->bs,
 };
 tran_add(tran, _replace_child_drv, s);
 
 if (new_bs) {
 bdrv_ref(new_bs);
 }
-bdrv_replace_child_noperm(childp, new_bs);
-/* old_bs reference is transparently moved from *childp to @s */
+bdrv_replace_child_noperm(, new_bs);
+/* old_bs reference is transparently moved from @child to @s */
 }
 
 /*
@@ -4898,7 +4865,6 @@ static bool should_update_child(BdrvChild *c, 
BlockDriverState *to)
 
 typedef struct BdrvRemoveFilterOrCowChild {
 BdrvChild *child;
-BlockDriverState *bs;
 bool is_backing;
 } BdrvRemoveFilterOrCowChild;
 
@@ -4928,19 +4894,10 @@ static void bdrv_remove_filter_or_cow_child_commit(void 
*opaque)
 bdrv_child_free(s->child);
 }
 
-static void bdrv_remove_filter_or_cow_child_clean(void *opaque)
-{
-BdrvRemoveFilterOrCowChild *s = opaque;
-
-/* Drop the bs reference after the transaction is done */
-bdrv_unref(s->bs);
-g_free(s);
-}
-
 static TransactionActionDrv bdrv_remove_filter_or_cow_child_drv = {
 .abort = bdrv_remove_filter_or_cow_child_abort,
 .commit = bdrv_remove_filter_or_cow_child_commit,
-.clean = bdrv_remove_filter_or_cow_child_clean,
+.clean = g_free,
 };
 
 /*
@@ -4958,11 +4915,6 @@ static void 
bdrv_remove_file_or_backing_child(BlockDriverState *bs,
 return;
 }
 
-/*
- * Keep a reference to @bs so @childp will stay valid throughout the
- * transaction (required by bdrv_replace_child_tran())
- */
-bdrv_ref(bs);
 if (child == bs->backing) {

[PATCH 00/14] block: cleanup backing and file handling

2021-12-03 Thread Vladimir Sementsov-Ogievskiy

Hi all!

I started this as a follow-up to
"block: Attempt on fixing 030-reported errors" by Hanna.

In Hanna's series I really like how bs->children handling moved to
.attach/.detach handlers.

.file and .backing are kind of "shortcuts" or "links" to some elementes
of this list, they duplicate the information. So they should be updated
in the same place to be in sync.

On the way to this target, I do also the following cleanups:

 - establish, which restrictions we have on how much children of
 different roles should node has, and which of the should be linked in
 .file / .backing. Add documentation and assertions.

 - drop all the complicated logic around passing pointer to bs->backing
 / bs->file  (BdrvChild **c), so that the field be automatically
 updated. Now they are natively automatically updated in
 .attach/.detach, so the rest of the code becomes simpler.

 - now .file / .backing are updated ONLY in .attach / .detach, no other
 code modify these fields

Vladimir Sementsov-Ogievskiy (14):
  block: BlockDriver: add .filtered_child_is_backing field
  block: introduce bdrv_open_file_child() helper
  block/blklogwrites: don't care to remove bs->file child on failure
  test-bdrv-graph-mod: update test_parallel_perm_update test case
  tests-bdrv-drain: bdrv_replace_test driver: declare supports_backing
  test-bdrv-graph-mod: fix filters to be filters
  block: document connection between child roles and
bs->backing/bs->file
  block/snapshot: stress that we fallback to primary child
  Revert "block: Let replace_child_noperm free children"
  Revert "block: Let replace_child_tran keep indirect pointer"
  Revert "block: Restructure remove_file_or_backing_child()"
  Revert "block: Pass BdrvChild ** to replace_child_noperm"
  block: Manipulate bs->file / bs->backing pointers in .attach/.detach
  block/snapshot: drop indirection around bdrv_snapshot_fallback_ptr

 include/block/block.h|  45 +
 include/block/block_int.h|  30 ++-
 block.c  | 335 ++-
 block/blkdebug.c |   9 +-
 block/blklogwrites.c |  11 +-
 block/blkreplay.c|   7 +-
 block/blkverify.c|   9 +-
 block/bochs.c|   7 +-
 block/cloop.c|   7 +-
 block/commit.c   |   1 +
 block/copy-before-write.c|   9 +-
 block/copy-on-read.c |   9 +-
 block/crypto.c   |  11 +-
 block/dmg.c  |   7 +-
 block/filter-compress.c  |   6 +-
 block/mirror.c   |   1 +
 block/parallels.c|   7 +-
 block/preallocate.c  |   9 +-
 block/qcow.c |   6 +-
 block/qcow2.c|   8 +-
 block/qed.c  |   8 +-
 block/raw-format.c   |   4 +-
 block/replication.c  |   8 +-
 block/snapshot.c |  60 ++
 block/throttle.c |   8 +-
 block/vdi.c  |   7 +-
 block/vhdx.c |   7 +-
 block/vmdk.c |   7 +-
 block/vpc.c  |   7 +-
 tests/unit/test-bdrv-drain.c |  11 +-
 tests/unit/test-bdrv-graph-mod.c |  94 ++---
 31 files changed, 343 insertions(+), 412 deletions(-)

-- 
2.31.1

[PATCH 01/14] block: BlockDriver: add .filtered_child_is_backing field

2021-12-03 Thread Vladimir Sementsov-Ogievskiy

Unfortunately not all filters use .file child as filtered child. Two
exclusions are mirror_top and commit_top. Happily they both are private
filters. Bad thing is that this inconsistency is observable through qmp
commands query-block / query-named-block-nodes. So, could we just
change mirror_top and commit_top to use file child as all other filter
driver is an open question. Probably, we could do that with some kind
of deprecation period, but how to warn users during it?

For now, let's just add a field so we can distinguish them in generic
code, it will be used in further commits.

Signed-off-by: Vladimir Sementsov-Ogievskiy 
---
 include/block/block_int.h | 14 ++
 block/commit.c|  1 +
 block/mirror.c|  1 +
 3 files changed, 16 insertions(+)

diff --git a/include/block/block_int.h b/include/block/block_int.h
index f4c75e8ba9..9c06f8816e 100644
--- a/include/block/block_int.h
+++ b/include/block/block_int.h
@@ -117,6 +117,20 @@ struct BlockDriver {
  * (And this filtered child must then be bs->file or bs->backing.)
  */
 bool is_filter;
+
+/*
+ * Only make sense for filter drivers, for others must be false.
+ * If true, filtered child is bs->backing. Otherwise it's bs->file.
+ * Only two internal filters use bs->backing as filtered child and has this
+ * field set to true: mirror_top and commit_top.
+ *
+ * Never create any more such filters!
+ *
+ * TODO: imagine how to deprecate this behavior and make all filters work
+ * similarly using bs->file as filtered child.
+ */
+bool filtered_child_is_backing;
+
 /*
  * Set to true if the BlockDriver is a format driver.  Format nodes
  * generally do not expect their children to be other format nodes
diff --git a/block/commit.c b/block/commit.c
index 10cc5ff451..23d60aebf4 100644
--- a/block/commit.c
+++ b/block/commit.c
@@ -237,6 +237,7 @@ static BlockDriver bdrv_commit_top = {
 .bdrv_child_perm= bdrv_commit_top_child_perm,
 
 .is_filter  = true,
+.filtered_child_is_backing  = true,
 };
 
 void commit_start(const char *job_id, BlockDriverState *bs,
diff --git a/block/mirror.c b/block/mirror.c
index efec2c7674..22e2b7b110 100644
--- a/block/mirror.c
+++ b/block/mirror.c
@@ -1587,6 +1587,7 @@ static BlockDriver bdrv_mirror_top = {
 .bdrv_child_perm= bdrv_mirror_top_child_perm,
 
 .is_filter  = true,
+.filtered_child_is_backing  = true,
 };
 
 static BlockJob *mirror_start_job(
-- 
2.31.1

[PATCH v4 12/19] iotests: massive use _qcow2_dump_header

2021-12-03 Thread Vladimir Sementsov-Ogievskiy

We are going to add filtering in _qcow2_dump_header and want all tests
use it.

The patch is generated by commands:
  cd tests/qemu-iotests
  sed -ie 's/$PYTHON qcow2.py "$TEST_IMG" dump-header\($\| 
\)/_qcow2_dump_header\1/' ??? tests/*

(the difficulty is to avoid converting dump-header-exts)

Signed-off-by: Vladimir Sementsov-Ogievskiy 
Reviewed-by: Max Reitz 
---
 tests/qemu-iotests/031 |  6 +++---
 tests/qemu-iotests/036 |  6 +++---
 tests/qemu-iotests/039 | 20 ++--
 tests/qemu-iotests/060 | 20 ++--
 tests/qemu-iotests/061 | 36 ++--
 tests/qemu-iotests/137 |  2 +-
 tests/qemu-iotests/287 |  8 
 7 files changed, 49 insertions(+), 49 deletions(-)

diff --git a/tests/qemu-iotests/031 b/tests/qemu-iotests/031
index 58b57a0ef2..648112f796 100755
--- a/tests/qemu-iotests/031
+++ b/tests/qemu-iotests/031
@@ -58,21 +58,21 @@ for compat in "compat=0.10" "compat=1.1"; do
 echo
 _make_test_img -o $compat 64M
 $PYTHON qcow2.py "$TEST_IMG" add-header-ext 0x12345678 "This is a test 
header extension"
-$PYTHON qcow2.py "$TEST_IMG" dump-header
+_qcow2_dump_header
 _check_test_img
 
 echo
 echo === Rewrite header with no backing file ===
 echo
 $QEMU_IMG rebase -u -b "" "$TEST_IMG"
-$PYTHON qcow2.py "$TEST_IMG" dump-header
+_qcow2_dump_header
 _check_test_img
 
 echo
 echo === Add a backing file and format ===
 echo
 $QEMU_IMG rebase -u -b "/some/backing/file/path" -F host_device "$TEST_IMG"
-$PYTHON qcow2.py "$TEST_IMG" dump-header
+_qcow2_dump_header
 done
 
 # success, all done
diff --git a/tests/qemu-iotests/036 b/tests/qemu-iotests/036
index 5e567012a8..f703605e44 100755
--- a/tests/qemu-iotests/036
+++ b/tests/qemu-iotests/036
@@ -58,7 +58,7 @@ $PYTHON qcow2.py "$TEST_IMG" set-feature-bit incompatible 63
 
 # Without feature table
 $PYTHON qcow2.py "$TEST_IMG" del-header-ext 0x6803f857
-$PYTHON qcow2.py "$TEST_IMG" dump-header | grep features
+_qcow2_dump_header | grep features
 $PYTHON qcow2.py "$TEST_IMG" dump-header-exts
 _img_info
 
@@ -107,7 +107,7 @@ echo === Create image with unknown autoclear feature bit ===
 echo
 _make_test_img 64M
 $PYTHON qcow2.py "$TEST_IMG" set-feature-bit autoclear 63
-$PYTHON qcow2.py "$TEST_IMG" dump-header | grep features
+_qcow2_dump_header | grep features
 $PYTHON qcow2.py "$TEST_IMG" dump-header-exts
 
 echo
@@ -115,7 +115,7 @@ echo === Repair image ===
 echo
 _check_test_img -r all
 
-$PYTHON qcow2.py "$TEST_IMG" dump-header | grep features
+_qcow2_dump_header | grep features
 $PYTHON qcow2.py "$TEST_IMG" dump-header-exts
 
 # success, all done
diff --git a/tests/qemu-iotests/039 b/tests/qemu-iotests/039
index 12b2c7fa7b..8e783a8380 100755
--- a/tests/qemu-iotests/039
+++ b/tests/qemu-iotests/039
@@ -59,7 +59,7 @@ _make_test_img -o "compat=1.1,lazy_refcounts=on" $size
 $QEMU_IO -c "write -P 0x5a 0 512" "$TEST_IMG" | _filter_qemu_io
 
 # The dirty bit must not be set
-$PYTHON qcow2.py "$TEST_IMG" dump-header | grep incompatible_features
+_qcow2_dump_header | grep incompatible_features
 _check_test_img
 
 echo
@@ -73,7 +73,7 @@ $QEMU_IO -c "write -P 0x5a 0 512" \
 | _filter_qemu_io
 
 # The dirty bit must be set
-$PYTHON qcow2.py "$TEST_IMG" dump-header | grep incompatible_features
+_qcow2_dump_header | grep incompatible_features
 _check_test_img
 
 echo
@@ -82,7 +82,7 @@ echo "== Read-only access must still work =="
 $QEMU_IO -r -c "read -P 0x5a 0 512" "$TEST_IMG" | _filter_qemu_io
 
 # The dirty bit must be set
-$PYTHON qcow2.py "$TEST_IMG" dump-header | grep incompatible_features
+_qcow2_dump_header | grep incompatible_features
 
 echo
 echo "== Repairing the image file must succeed =="
@@ -90,7 +90,7 @@ echo "== Repairing the image file must succeed =="
 _check_test_img -r all
 
 # The dirty bit must not be set
-$PYTHON qcow2.py "$TEST_IMG" dump-header | grep incompatible_features
+_qcow2_dump_header | grep incompatible_features
 
 echo
 echo "== Data should still be accessible after repair =="
@@ -108,12 +108,12 @@ $QEMU_IO -c "write -P 0x5a 0 512" \
 | _filter_qemu_io
 
 # The dirty bit must be set
-$PYTHON qcow2.py "$TEST_IMG" dump-header | grep incompatible_features
+_qcow2_dump_header | grep incompatible_features
 
 $QEMU_IO -c "write 0 512" "$TEST_IMG" | _filter_qemu_io
 
 # The dirty bit must not be set
-$PYTHON qcow2.py "$TEST_IMG" dump-header | grep incompatible_features
+_qcow2_dump_header | grep incompatible_features
 
 echo
 echo "== Creating an image file with lazy_refcounts=off =="
@@ -126,7 +126,7 @@ $QEMU_IO -c "write -P 0x5a 0 512" \
 | _filter_qemu_io
 
 # The dirty bit must not be set since lazy_refcounts=off
-$PYTHON qcow2.py "$TEST_IMG" dump-header | grep incompatible_features
+_qcow2_dump_header | grep incompatible_features
 _check_test_img
 
 echo
@@ -141,7 +141,7 @@ $QEMU_IO -c "write 0 512" "$TEST_IMG" | _filter_qemu_io
 $QEMU_IMG commit "$TEST_IMG"
 
 # The dirty bit must not be set

[PATCH v4 01/19] iotests.py: img_info_log(): rename imgopts argument

2021-12-03 Thread Vladimir Sementsov-Ogievskiy

We are going to support IMGOPTS environment variable like in bash
tests. Corresponding global variable in iotests.py should be called
imgopts. So to not interfere with function argument, rename it in
advance.

Signed-off-by: Vladimir Sementsov-Ogievskiy 
Reviewed-by: Max Reitz 
---
 tests/qemu-iotests/210| 8 
 tests/qemu-iotests/iotests.py | 5 +++--
 2 files changed, 7 insertions(+), 6 deletions(-)

diff --git a/tests/qemu-iotests/210 b/tests/qemu-iotests/210
index a4dcc5fe59..10b0a0b87c 100755
--- a/tests/qemu-iotests/210
+++ b/tests/qemu-iotests/210
@@ -62,7 +62,7 @@ with iotests.FilePath('t.luks') as disk_path, \
 'driver=luks,file.driver=file,file.filename=%s,key-secret=keysec0' % 
(disk_path),
 filter_path=disk_path,
 extra_args=['--object', 'secret,id=keysec0,data=foo'],
-imgopts=True)
+use_image_opts=True)
 
 #
 # Successful image creation (with non-default options)
@@ -96,7 +96,7 @@ with iotests.FilePath('t.luks') as disk_path, \
 'driver=luks,file.driver=file,file.filename=%s,key-secret=keysec0' % 
(disk_path),
 filter_path=disk_path,
 extra_args=['--object', 'secret,id=keysec0,data=foo'],
-imgopts=True)
+use_image_opts=True)
 
 #
 # Invalid BlockdevRef
@@ -132,7 +132,7 @@ with iotests.FilePath('t.luks') as disk_path, \
 'driver=luks,file.driver=file,file.filename=%s,key-secret=keysec0' % 
(disk_path),
 filter_path=disk_path,
 extra_args=['--object', 'secret,id=keysec0,data=foo'],
-imgopts=True)
+use_image_opts=True)
 
 #
 # Invalid sizes
@@ -176,4 +176,4 @@ with iotests.FilePath('t.luks') as disk_path, \
 'driver=luks,file.driver=file,file.filename=%s,key-secret=keysec0' % 
(disk_path),
 filter_path=disk_path,
 extra_args=['--object', 'secret,id=keysec0,data=foo'],
-imgopts=True)
+use_image_opts=True)
diff --git a/tests/qemu-iotests/iotests.py b/tests/qemu-iotests/iotests.py
index 83bfedb902..9b374dad3d 100644
--- a/tests/qemu-iotests/iotests.py
+++ b/tests/qemu-iotests/iotests.py
@@ -219,9 +219,10 @@ def qemu_img_log(*args):
 log(result, filters=[filter_testfiles])
 return result
 
-def img_info_log(filename, filter_path=None, imgopts=False, extra_args=()):
+def img_info_log(filename, filter_path=None, use_image_opts=False,
+ extra_args=()):
 args = ['info']
-if imgopts:
+if use_image_opts:
 args.append('--image-opts')
 else:
 args += ['-f', imgfmt]
-- 
2.31.1

[PATCH v4 06/19] iotest 065: explicit compression type

2021-12-03 Thread Vladimir Sementsov-Ogievskiy

The test checks different options. It of course fails if set
IMGOPTS='compression_type=zstd'. So, let's be explicit in what
compression type we want and independent of IMGOPTS. Test both existing
compression types.

Signed-off-by: Vladimir Sementsov-Ogievskiy 
Reviewed-by: Hanna Reitz 
---
 tests/qemu-iotests/065 | 16 
 1 file changed, 8 insertions(+), 8 deletions(-)

diff --git a/tests/qemu-iotests/065 b/tests/qemu-iotests/065
index 3c2ca27627..4b3c5c6c8c 100755
--- a/tests/qemu-iotests/065
+++ b/tests/qemu-iotests/065
@@ -88,7 +88,7 @@ class TestQMP(TestImageInfoSpecific):
 
 class TestQCow2(TestQemuImgInfo):
 '''Testing a qcow2 version 2 image'''
-img_options = 'compat=0.10'
+img_options = 'compat=0.10,compression_type=zlib'
 json_compare = { 'compat': '0.10', 'refcount-bits': 16,
  'compression-type': 'zlib' }
 human_compare = [ 'compat: 0.10', 'compression type: zlib',
@@ -96,17 +96,17 @@ class TestQCow2(TestQemuImgInfo):
 
 class TestQCow3NotLazy(TestQemuImgInfo):
 '''Testing a qcow2 version 3 image with lazy refcounts disabled'''
-img_options = 'compat=1.1,lazy_refcounts=off'
+img_options = 'compat=1.1,lazy_refcounts=off,compression_type=zstd'
 json_compare = { 'compat': '1.1', 'lazy-refcounts': False,
  'refcount-bits': 16, 'corrupt': False,
- 'compression-type': 'zlib', 'extended-l2': False }
-human_compare = [ 'compat: 1.1', 'compression type: zlib',
+ 'compression-type': 'zstd', 'extended-l2': False }
+human_compare = [ 'compat: 1.1', 'compression type: zstd',
   'lazy refcounts: false', 'refcount bits: 16',
   'corrupt: false', 'extended l2: false' ]
 
 class TestQCow3Lazy(TestQemuImgInfo):
 '''Testing a qcow2 version 3 image with lazy refcounts enabled'''
-img_options = 'compat=1.1,lazy_refcounts=on'
+img_options = 'compat=1.1,lazy_refcounts=on,compression_type=zlib'
 json_compare = { 'compat': '1.1', 'lazy-refcounts': True,
  'refcount-bits': 16, 'corrupt': False,
  'compression-type': 'zlib', 'extended-l2': False }
@@ -117,7 +117,7 @@ class TestQCow3Lazy(TestQemuImgInfo):
 class TestQCow3NotLazyQMP(TestQMP):
 '''Testing a qcow2 version 3 image with lazy refcounts disabled, opening
with lazy refcounts enabled'''
-img_options = 'compat=1.1,lazy_refcounts=off'
+img_options = 'compat=1.1,lazy_refcounts=off,compression_type=zlib'
 qemu_options = 'lazy-refcounts=on'
 compare = { 'compat': '1.1', 'lazy-refcounts': False,
 'refcount-bits': 16, 'corrupt': False,
@@ -127,11 +127,11 @@ class TestQCow3NotLazyQMP(TestQMP):
 class TestQCow3LazyQMP(TestQMP):
 '''Testing a qcow2 version 3 image with lazy refcounts enabled, opening
with lazy refcounts disabled'''
-img_options = 'compat=1.1,lazy_refcounts=on'
+img_options = 'compat=1.1,lazy_refcounts=on,compression_type=zstd'
 qemu_options = 'lazy-refcounts=off'
 compare = { 'compat': '1.1', 'lazy-refcounts': True,
 'refcount-bits': 16, 'corrupt': False,
-'compression-type': 'zlib', 'extended-l2': False }
+'compression-type': 'zstd', 'extended-l2': False }
 
 TestImageInfoSpecific = None
 TestQemuImgInfo = None
-- 
2.31.1

[PATCH v4 15/19] iotests 60: more accurate set dirty bit in qcow2 header

2021-12-03 Thread Vladimir Sementsov-Ogievskiy

Don't touch other incompatible bits, like compression-type. This makes
the test pass with IMGOPTS='compression_type=zstd'.

Signed-off-by: Vladimir Sementsov-Ogievskiy 
Reviewed-by: Max Reitz 
---
 tests/qemu-iotests/060 | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/tests/qemu-iotests/060 b/tests/qemu-iotests/060
index d1e3204d4e..df87d600f7 100755
--- a/tests/qemu-iotests/060
+++ b/tests/qemu-iotests/060
@@ -326,7 +326,7 @@ _make_test_img 64M
 # Let the refblock appear unaligned
 poke_file "$TEST_IMG" "$rt_offset""\x00\x00\x00\x00\xff\xff\x2a\x00"
 # Mark the image dirty, thus forcing an automatic check when opening it
-poke_file "$TEST_IMG" 72 "\x00\x00\x00\x00\x00\x00\x00\x01"
+$PYTHON qcow2.py "$TEST_IMG" set-feature-bit incompatible 0
 # Open the image (qemu should refuse to do so)
 $QEMU_IO -c close "$TEST_IMG" 2>&1 | _filter_testdir | _filter_imgfmt
 
-- 
2.31.1

[PATCH v4 16/19] iotest 214: explicit compression type

2021-12-03 Thread Vladimir Sementsov-Ogievskiy

The test-case "Corrupted size field in compressed cluster descriptor"
heavily depends on zlib compression type. So, make it explicit. This
way test passes with IMGOPTS='compression_type=zstd'.

Signed-off-by: Vladimir Sementsov-Ogievskiy 
Reviewed-by: Max Reitz 
---
 tests/qemu-iotests/214 | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/tests/qemu-iotests/214 b/tests/qemu-iotests/214
index 0889089d81..c66e246ba2 100755
--- a/tests/qemu-iotests/214
+++ b/tests/qemu-iotests/214
@@ -51,7 +51,7 @@ echo
 # The L2 entries of the two compressed clusters are located at
 # 0x80 and 0x88, their original values are 0x400800a0
 # and 0x400800a00802 (5 sectors for compressed data each).
-_make_test_img 8M -o cluster_size=2M
+_make_test_img 8M -o cluster_size=2M,compression_type=zlib
 $QEMU_IO -c "write -c -P 0x11 0 2M" -c "write -c -P 0x11 2M 2M" "$TEST_IMG" \
  2>&1 | _filter_qemu_io | _filter_testdir
 
-- 
2.31.1

[PATCH v4 13/19] iotest 39: use _qcow2_dump_header

2021-12-03 Thread Vladimir Sementsov-Ogievskiy

_qcow2_dump_header has filter for compression type, so this change
makes test pass with IMGOPTS='compression_type=zstd'.

Signed-off-by: Vladimir Sementsov-Ogievskiy 
Reviewed-by: Max Reitz 
---
 tests/qemu-iotests/039 | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/tests/qemu-iotests/039 b/tests/qemu-iotests/039
index 8e783a8380..00d379cde2 100755
--- a/tests/qemu-iotests/039
+++ b/tests/qemu-iotests/039
@@ -142,7 +142,7 @@ $QEMU_IMG commit "$TEST_IMG"
 
 # The dirty bit must not be set
 _qcow2_dump_header | grep incompatible_features
-$PYTHON qcow2.py "$TEST_IMG".base dump-header | grep incompatible_features
+_qcow2_dump_header "$TEST_IMG".base | grep incompatible_features
 
 _check_test_img
 TEST_IMG="$TEST_IMG".base _check_test_img
-- 
2.31.1

[PATCH v4 14/19] iotests: bash tests: filter compression type

2021-12-03 Thread Vladimir Sementsov-Ogievskiy

We want iotests pass with both the default zlib compression and with
IMGOPTS='compression_type=zstd'.

Actually the only test that is interested in real compression type in
test output is 287 (test for qcow2 compression type), so implement
specific option for it.

Signed-off-by: Vladimir Sementsov-Ogievskiy 
Reviewed-by: Hanna Reitz 
---
 tests/qemu-iotests/060.out   |  2 +-
 tests/qemu-iotests/061.out   | 12 ++--
 tests/qemu-iotests/082.out   | 14 +++---
 tests/qemu-iotests/198.out   |  4 ++--
 tests/qemu-iotests/287   |  8 
 tests/qemu-iotests/common.filter |  8 
 tests/qemu-iotests/common.rc | 14 +-
 7 files changed, 41 insertions(+), 21 deletions(-)

diff --git a/tests/qemu-iotests/060.out b/tests/qemu-iotests/060.out
index b74540bafb..329977d9b9 100644
--- a/tests/qemu-iotests/060.out
+++ b/tests/qemu-iotests/060.out
@@ -17,7 +17,7 @@ virtual size: 64 MiB (67108864 bytes)
 cluster_size: 65536
 Format specific information:
 compat: 1.1
-compression type: zlib
+compression type: COMPRESSION_TYPE
 lazy refcounts: false
 refcount bits: 16
 corrupt: true
diff --git a/tests/qemu-iotests/061.out b/tests/qemu-iotests/061.out
index 7ecbd4dea8..139fc68177 100644
--- a/tests/qemu-iotests/061.out
+++ b/tests/qemu-iotests/061.out
@@ -525,7 +525,7 @@ virtual size: 64 MiB (67108864 bytes)
 cluster_size: 65536
 Format specific information:
 compat: 1.1
-compression type: zlib
+compression type: COMPRESSION_TYPE
 lazy refcounts: false
 refcount bits: 16
 data file: TEST_DIR/t.IMGFMT.data
@@ -552,7 +552,7 @@ virtual size: 64 MiB (67108864 bytes)
 cluster_size: 65536
 Format specific information:
 compat: 1.1
-compression type: zlib
+compression type: COMPRESSION_TYPE
 lazy refcounts: false
 refcount bits: 16
 data file: foo
@@ -567,7 +567,7 @@ virtual size: 64 MiB (67108864 bytes)
 cluster_size: 65536
 Format specific information:
 compat: 1.1
-compression type: zlib
+compression type: COMPRESSION_TYPE
 lazy refcounts: false
 refcount bits: 16
 data file raw: false
@@ -583,7 +583,7 @@ virtual size: 64 MiB (67108864 bytes)
 cluster_size: 65536
 Format specific information:
 compat: 1.1
-compression type: zlib
+compression type: COMPRESSION_TYPE
 lazy refcounts: false
 refcount bits: 16
 data file: TEST_DIR/t.IMGFMT.data
@@ -597,7 +597,7 @@ virtual size: 64 MiB (67108864 bytes)
 cluster_size: 65536
 Format specific information:
 compat: 1.1
-compression type: zlib
+compression type: COMPRESSION_TYPE
 lazy refcounts: false
 refcount bits: 16
 data file: TEST_DIR/t.IMGFMT.data
@@ -612,7 +612,7 @@ virtual size: 64 MiB (67108864 bytes)
 cluster_size: 65536
 Format specific information:
 compat: 1.1
-compression type: zlib
+compression type: COMPRESSION_TYPE
 lazy refcounts: false
 refcount bits: 16
 data file: TEST_DIR/t.IMGFMT.data
diff --git a/tests/qemu-iotests/082.out b/tests/qemu-iotests/082.out
index 077ed0f2c7..d0dd333117 100644
--- a/tests/qemu-iotests/082.out
+++ b/tests/qemu-iotests/082.out
@@ -17,7 +17,7 @@ virtual size: 128 MiB (134217728 bytes)
 cluster_size: 4096
 Format specific information:
 compat: 1.1
-compression type: zlib
+compression type: COMPRESSION_TYPE
 lazy refcounts: true
 refcount bits: 16
 corrupt: false
@@ -31,7 +31,7 @@ virtual size: 128 MiB (134217728 bytes)
 cluster_size: 8192
 Format specific information:
 compat: 1.1
-compression type: zlib
+compression type: COMPRESSION_TYPE
 lazy refcounts: true
 refcount bits: 16
 corrupt: false
@@ -329,7 +329,7 @@ virtual size: 128 MiB (134217728 bytes)
 cluster_size: 4096
 Format specific information:
 compat: 1.1
-compression type: zlib
+compression type: COMPRESSION_TYPE
 lazy refcounts: true
 refcount bits: 16
 corrupt: false
@@ -342,7 +342,7 @@ virtual size: 128 MiB (134217728 bytes)
 cluster_size: 8192
 Format specific information:
 compat: 1.1
-compression type: zlib
+compression type: COMPRESSION_TYPE
 lazy refcounts: true
 refcount bits: 16
 corrupt: false
@@ -639,7 +639,7 @@ virtual size: 128 MiB (134217728 bytes)
 cluster_size: 65536
 Format specific information:
 compat: 1.1
-compression type: zlib
+compression type: COMPRESSION_TYPE
 lazy refcounts: true
 refcount bits: 16
 corrupt: false
@@ -652,7 +652,7 @@ virtual size: 130 MiB (136314880 bytes)
 cluster_size: 65536
 Format specific information:
 compat: 1.1
-compression type: zlib
+compression type: COMPRESSION_TYPE
 lazy refcounts: false
 refcount bits: 16
 corrupt: false
@@ -665,7 +665,7 @@ virtual size: 132 MiB (138412032 bytes)
 cluster_size: 65536
 Format specific information:
 compat: 1.1
-compression type: zlib
+compression type: COMPRESSION_TYPE
 lazy refcounts: true
 refcount

[PATCH v4 04/19] iotests.py: rewrite default luks support in qemu_img

2021-12-03 Thread Vladimir Sementsov-Ogievskiy

Move the logic to more generic qemu_img_pipe_and_status(). Also behave
better when we have several -o options. And reuse argument parser of
course.

Signed-off-by: Vladimir Sementsov-Ogievskiy 
Reviewed-by: Hanna Reitz 
---
 tests/qemu-iotests/iotests.py | 36 +--
 1 file changed, 17 insertions(+), 19 deletions(-)

diff --git a/tests/qemu-iotests/iotests.py b/tests/qemu-iotests/iotests.py
index 4846ba7c60..055b162531 100644
--- a/tests/qemu-iotests/iotests.py
+++ b/tests/qemu-iotests/iotests.py
@@ -160,9 +160,13 @@ def qemu_img_create_prepare_args(args: List[str]) -> 
List[str]:
 args = args[1:]
 
 p = argparse.ArgumentParser(allow_abbrev=False)
+# -o option may be specified several times
+p.add_argument('-o', action='append', default=[])
 p.add_argument('-f')
 parsed, remaining = p.parse_known_args(args)
 
+opts_list = parsed.o
+
 result = ['create']
 if parsed.f is not None:
 result += ['-f', parsed.f]
@@ -171,8 +175,18 @@ def qemu_img_create_prepare_args(args: List[str]) -> 
List[str]:
 # like extended_l2 or compression_type for qcow2. Test may want to create
 # additional images in other formats that doesn't support these options.
 # So, use IMGOPTS only for images created in imgfmt format.
-if parsed.f == imgfmt and 'IMGOPTS' in os.environ:
-result += ['-o', os.environ['IMGOPTS']]
+imgopts = os.environ.get('IMGOPTS')
+if imgopts and parsed.f == imgfmt:
+opts_list.insert(0, imgopts)
+
+# default luks support
+if parsed.f == 'luks' and \
+all('key-secret' not in opts for opts in opts_list):
+result += ['--object', luks_default_secret_object]
+opts_list.append(luks_default_key_secret_opt)
+
+for opts in opts_list:
+result += ['-o', opts]
 
 result += remaining
 
@@ -203,23 +217,7 @@ def ordered_qmp(qmsg, conv_keys=True):
 return qmsg
 
 def qemu_img_create(*args):
-args = list(args)
-
-# default luks support
-if '-f' in args and args[args.index('-f') + 1] == 'luks':
-if '-o' in args:
-i = args.index('-o')
-if 'key-secret' not in args[i + 1]:
-args[i + 1].append(luks_default_key_secret_opt)
-args.insert(i + 2, '--object')
-args.insert(i + 3, luks_default_secret_object)
-else:
-args = ['-o', luks_default_key_secret_opt,
-'--object', luks_default_secret_object] + args
-
-args.insert(0, 'create')
-
-return qemu_img(*args)
+return qemu_img('create', *args)
 
 def qemu_img_measure(*args):
 return json.loads(qemu_img_pipe("measure", "--output", "json", *args))
-- 
2.31.1

[PATCH v4 19/19] iotests: specify some unsupported_imgopts for python iotests

2021-12-03 Thread Vladimir Sementsov-Ogievskiy

We support IMGOPTS for python iotests now. Still a lot of tests are
unprepared to common IMGOPTS that are used with bash iotests. So we
should define corresponding unsupported_imgopts.

Signed-off-by: Vladimir Sementsov-Ogievskiy 
---
 tests/qemu-iotests/044 | 3 ++-
 tests/qemu-iotests/065 | 3 ++-
 tests/qemu-iotests/163 | 3 ++-
 tests/qemu-iotests/165 | 3 ++-
 tests/qemu-iotests/196 | 3 ++-
 tests/qemu-iotests/242 | 3 ++-
 tests/qemu-iotests/246 | 3 ++-
 tests/qemu-iotests/254 | 3 ++-
 tests/qemu-iotests/260 | 4 ++--
 tests/qemu-iotests/274 | 3 ++-
 tests/qemu-iotests/281 | 3 ++-
 tests/qemu-iotests/303 | 3 ++-
 tests/qemu-iotests/tests/migrate-bitmaps-postcopy-test | 3 ++-
 tests/qemu-iotests/tests/migrate-bitmaps-test  | 3 ++-
 tests/qemu-iotests/tests/migrate-during-backup | 3 ++-
 tests/qemu-iotests/tests/remove-bitmap-from-backing| 3 ++-
 16 files changed, 32 insertions(+), 17 deletions(-)

diff --git a/tests/qemu-iotests/044 b/tests/qemu-iotests/044
index 714329eb16..a5ee9a7ded 100755
--- a/tests/qemu-iotests/044
+++ b/tests/qemu-iotests/044
@@ -118,4 +118,5 @@ class TestRefcountTableGrowth(iotests.QMPTestCase):
 if __name__ == '__main__':
 iotests.activate_logging()
 iotests.main(supported_fmts=['qcow2'],
- supported_protocols=['file'])
+ supported_protocols=['file'],
+ unsupported_imgopts=['refcount_bits'])
diff --git a/tests/qemu-iotests/065 b/tests/qemu-iotests/065
index 4b3c5c6c8c..f7c1b68dad 100755
--- a/tests/qemu-iotests/065
+++ b/tests/qemu-iotests/065
@@ -139,4 +139,5 @@ TestQMP = None
 
 if __name__ == '__main__':
 iotests.main(supported_fmts=['qcow2'],
- supported_protocols=['file'])
+ supported_protocols=['file'],
+ unsupported_imgopts=['refcount_bits'])
diff --git a/tests/qemu-iotests/163 b/tests/qemu-iotests/163
index dedce8ef43..0b00df519c 100755
--- a/tests/qemu-iotests/163
+++ b/tests/qemu-iotests/163
@@ -169,4 +169,5 @@ ShrinkBaseClass = None
 
 if __name__ == '__main__':
 iotests.main(supported_fmts=['raw', 'qcow2'],
- supported_protocols=['file'])
+ supported_protocols=['file'],
+ unsupported_imgopts=['compat=0.10'])
diff --git a/tests/qemu-iotests/165 b/tests/qemu-iotests/165
index ce499946b8..144cf4a7de 100755
--- a/tests/qemu-iotests/165
+++ b/tests/qemu-iotests/165
@@ -157,4 +157,5 @@ class TestPersistentDirtyBitmap(iotests.QMPTestCase):
 
 if __name__ == '__main__':
 iotests.main(supported_fmts=['qcow2'],
- supported_protocols=['file'])
+ supported_protocols=['file'],
+ unsupported_imgopts=['compat=0.10'])
diff --git a/tests/qemu-iotests/196 b/tests/qemu-iotests/196
index 2451515094..fe93ebd760 100755
--- a/tests/qemu-iotests/196
+++ b/tests/qemu-iotests/196
@@ -65,4 +65,5 @@ class TestInvalidateAutoclear(iotests.QMPTestCase):
 
 if __name__ == '__main__':
 iotests.main(supported_fmts=['qcow2'],
- supported_protocols=['file'])
+ supported_protocols=['file'],
+ unsupported_imgopts=['compat=0.10'])
diff --git a/tests/qemu-iotests/242 b/tests/qemu-iotests/242
index a9b27668c2..8cdf02e8d9 100755
--- a/tests/qemu-iotests/242
+++ b/tests/qemu-iotests/242
@@ -26,7 +26,8 @@ from iotests import qemu_img_create, qemu_io, qemu_img_pipe, \
 file_path, img_info_log, log, filter_qemu_io
 
 iotests.script_initialize(supported_fmts=['qcow2'],
-  supported_protocols=['file'])
+  supported_protocols=['file'],
+  unsupported_imgopts=['refcount_bits', 'compat=0.10'])
 
 disk = file_path('disk')
 chunk = 256 * 1024
diff --git a/tests/qemu-iotests/246 b/tests/qemu-iotests/246
index 5932a0e8a9..a6fef31e1a 100755
--- a/tests/qemu-iotests/246
+++ b/tests/qemu-iotests/246
@@ -23,7 +23,8 @@
 import iotests
 from iotests import log
 
-iotests.script_initialize(supported_fmts=['qcow2'])
+iotests.script_initialize(supported_fmts=['qcow2'],
+  unsupported_imgopts=['compat=0.10'])
 size = 64 * 1024 * 1024 * 1024
 gran_small = 32 * 1024
 gran_large = 128 * 1024
diff --git a/tests/qemu-iotests/254 b/tests/qemu-iotests/254
index 108bf5f894..5f08541214 100755
--- a/tests/qemu-iotests/254
+++ b/tests/qemu-iotests/254
@@ -22,7 +22,8 @@
 import iotests
 from iotests import qemu_img_create, file_path, log
 
-iotests.script_initialize(supported_fmts=['qcow2'])
+iotests.script_initialize(supported_fmts=['qcow2'],
+

[PATCH v4 10/19] qcow2: simple case support for downgrading of qcow2 images with zstd

2021-12-03 Thread Vladimir Sementsov-Ogievskiy

If image doesn't have any compressed cluster we can easily switch to
zlib compression, which may allow to downgrade the image.

That's mostly needed to support IMGOPTS='compression_type=zstd' in some
iotests which do qcow2 downgrade.

While being here also fix checkpatch complain against '#' in printf
formatting.

Signed-off-by: Vladimir Sementsov-Ogievskiy 
Reviewed-by: Max Reitz 
---
 block/qcow2.c | 58 +--
 1 file changed, 56 insertions(+), 2 deletions(-)

diff --git a/block/qcow2.c b/block/qcow2.c
index d509016756..c8115e1cba 100644
--- a/block/qcow2.c
+++ b/block/qcow2.c
@@ -5279,6 +5279,38 @@ static int qcow2_load_vmstate(BlockDriverState *bs, 
QEMUIOVector *qiov,
 return bs->drv->bdrv_co_preadv_part(bs, offset, qiov->size, qiov, 0, 0);
 }
 
+static int qcow2_has_compressed_clusters(BlockDriverState *bs)
+{
+int64_t offset = 0;
+int64_t bytes = bdrv_getlength(bs);
+
+if (bytes < 0) {
+return bytes;
+}
+
+while (bytes != 0) {
+int ret;
+QCow2SubclusterType type;
+unsigned int cur_bytes = MIN(INT_MAX, bytes);
+uint64_t host_offset;
+
+ret = qcow2_get_host_offset(bs, offset, _bytes, _offset,
+);
+if (ret < 0) {
+return ret;
+}
+
+if (type == QCOW2_SUBCLUSTER_COMPRESSED) {
+return 1;
+}
+
+offset += cur_bytes;
+bytes -= cur_bytes;
+}
+
+return 0;
+}
+
 /*
  * Downgrades an image's version. To achieve this, any incompatible features
  * have to be removed.
@@ -5336,9 +5368,10 @@ static int qcow2_downgrade(BlockDriverState *bs, int 
target_version,
  * the first place; if that happens nonetheless, returning -ENOTSUP is the
  * best thing to do anyway */
 
-if (s->incompatible_features) {
+if (s->incompatible_features & ~QCOW2_INCOMPAT_COMPRESSION) {
 error_setg(errp, "Cannot downgrade an image with incompatible features 
"
-   "%#" PRIx64 " set", s->incompatible_features);
+   "0x%" PRIx64 " set",
+   s->incompatible_features & ~QCOW2_INCOMPAT_COMPRESSION);
 return -ENOTSUP;
 }
 
@@ -5356,6 +5389,27 @@ static int qcow2_downgrade(BlockDriverState *bs, int 
target_version,
 return ret;
 }
 
+if (s->incompatible_features & QCOW2_INCOMPAT_COMPRESSION) {
+ret = qcow2_has_compressed_clusters(bs);
+if (ret < 0) {
+error_setg(errp, "Failed to check block status");
+return -EINVAL;
+}
+if (ret) {
+error_setg(errp, "Cannot downgrade an image with zstd compression "
+   "type and existing compressed clusters");
+return -ENOTSUP;
+}
+/*
+ * No compressed clusters for now, so just chose default zlib
+ * compression.
+ */
+s->incompatible_features &= ~QCOW2_INCOMPAT_COMPRESSION;
+s->compression_type = QCOW2_COMPRESSION_TYPE_ZLIB;
+}
+
+assert(s->incompatible_features == 0);
+
 s->qcow_version = target_version;
 ret = qcow2_update_header(bs);
 if (ret < 0) {
-- 
2.31.1

[PATCH v4 11/19] iotests/common.rc: introduce _qcow2_dump_header helper

2021-12-03 Thread Vladimir Sementsov-Ogievskiy

We'll use it in tests instead of explicit qcow2.py. Then we are going
to add some filtering in _qcow2_dump_header.

Signed-off-by: Vladimir Sementsov-Ogievskiy 
Reviewed-by: Max Reitz 
---
 tests/qemu-iotests/common.rc | 10 ++
 1 file changed, 10 insertions(+)

diff --git a/tests/qemu-iotests/common.rc b/tests/qemu-iotests/common.rc
index d8582454de..5dea310ea0 100644
--- a/tests/qemu-iotests/common.rc
+++ b/tests/qemu-iotests/common.rc
@@ -996,5 +996,15 @@ _require_one_device_of()
 _notrun "$* not available"
 }
 
+_qcow2_dump_header()
+{
+img="$1"
+if [ -z "$img" ]; then
+img="$TEST_IMG"
+fi
+
+$PYTHON qcow2.py "$img" dump-header
+}
+
 # make sure this script returns success
 true
-- 
2.31.1

[PATCH v4 09/19] iotest 302: use img_info_log() helper

2021-12-03 Thread Vladimir Sementsov-Ogievskiy

Instead of qemu_img_log("info", ..) use generic helper img_info_log().

img_info_log() has smarter logic. For example it use filter_img_info()
to filter output, which in turns filter a compression type. So it will
help us in future when we implement a possibility to use zstd
compression by default (with help of some runtime config file or maybe
build option). For now to test you should recompile qemu with a small
addition into block/qcow2.c before
"if (qcow2_opts->has_compression_type":

if (!qcow2_opts->has_compression_type && version >= 3) {
qcow2_opts->has_compression_type = true;
qcow2_opts->compression_type = QCOW2_COMPRESSION_TYPE_ZSTD;
}

Signed-off-by: Vladimir Sementsov-Ogievskiy 
Reviewed-by: Max Reitz 
---
 tests/qemu-iotests/302 | 4 +++-
 tests/qemu-iotests/302.out | 7 +++
 2 files changed, 6 insertions(+), 5 deletions(-)

diff --git a/tests/qemu-iotests/302 b/tests/qemu-iotests/302
index 5695af4914..a6d79e727b 100755
--- a/tests/qemu-iotests/302
+++ b/tests/qemu-iotests/302
@@ -34,6 +34,7 @@ from iotests import (
 qemu_img_measure,
 qemu_io,
 qemu_nbd_popen,
+img_info_log,
 )
 
 iotests.script_initialize(supported_fmts=["qcow2"])
@@ -88,6 +89,7 @@ with tarfile.open(tar_file, "w") as tar:
 tar_file):
 
 iotests.log("=== Target image info ===")
+# Not img_info_log as it enforces imgfmt, but now we print info on raw
 qemu_img_log("info", nbd_uri)
 
 qemu_img(
@@ -99,7 +101,7 @@ with tarfile.open(tar_file, "w") as tar:
 nbd_uri)
 
 iotests.log("=== Converted image info ===")
-qemu_img_log("info", nbd_uri)
+img_info_log(nbd_uri)
 
 iotests.log("=== Converted image check ===")
 qemu_img_log("check", nbd_uri)
diff --git a/tests/qemu-iotests/302.out b/tests/qemu-iotests/302.out
index e2f6077e83..3e7c281b91 100644
--- a/tests/qemu-iotests/302.out
+++ b/tests/qemu-iotests/302.out
@@ -6,14 +6,13 @@ virtual size: 448 KiB (458752 bytes)
 disk size: unavailable
 
 === Converted image info ===
-image: nbd+unix:///exp?socket=SOCK_DIR/PID-nbd-sock
-file format: qcow2
+image: TEST_IMG
+file format: IMGFMT
 virtual size: 1 GiB (1073741824 bytes)
-disk size: unavailable
 cluster_size: 65536
 Format specific information:
 compat: 1.1
-compression type: zlib
+compression type: COMPRESSION_TYPE
 lazy refcounts: false
 refcount bits: 16
 corrupt: false
-- 
2.31.1

[PATCH v4 07/19] iotests.py: filter out successful output of qemu-img create

2021-12-03 Thread Vladimir Sementsov-Ogievskiy

The only "feature" of this "Formatting ..." line is that we have to
update it every time we add new option. Let's drop it.

Signed-off-by: Vladimir Sementsov-Ogievskiy 
Reviewed-by: Hanna Reitz 
---
 tests/qemu-iotests/149.out| 21 -
 tests/qemu-iotests/237.out|  3 ---
 tests/qemu-iotests/255.out|  4 
 tests/qemu-iotests/274.out| 29 -
 tests/qemu-iotests/280.out|  1 -
 tests/qemu-iotests/296.out| 10 +++---
 tests/qemu-iotests/iotests.py | 10 --
 7 files changed, 11 insertions(+), 67 deletions(-)

diff --git a/tests/qemu-iotests/149.out b/tests/qemu-iotests/149.out
index 6877ab6c4a..ab879596ce 100644
--- a/tests/qemu-iotests/149.out
+++ b/tests/qemu-iotests/149.out
@@ -61,7 +61,6 @@ unlink TEST_DIR/luks-aes-256-xts-plain64-sha1.img
 # = qemu-img aes-256-xts-plain64-sha1 =
 # Create image
 qemu-img create -f luks --object secret,id=sec0,data=MTIzNDU2,format=base64 -o 
key-secret=sec0,iter-time=10,cipher-alg=aes-256,cipher-mode=xts,ivgen-alg=plain64,hash-alg=sha1
 TEST_DIR/luks-aes-256-xts-plain64-sha1.img 4194304M
-Formatting 'TEST_DIR/luks-aes-256-xts-plain64-sha1.img', fmt=luks 
size=4398046511104 key-secret=sec0 cipher-alg=aes-256 cipher-mode=xts 
ivgen-alg=plain64 hash-alg=sha1 iter-time=10
 
 # Open dev
 sudo cryptsetup -q -v luksOpen TEST_DIR/luks-aes-256-xts-plain64-sha1.img 
qiotest-145-aes-256-xts-plain64-sha1
@@ -181,7 +180,6 @@ unlink TEST_DIR/luks-twofish-256-xts-plain64-sha1.img
 # = qemu-img twofish-256-xts-plain64-sha1 =
 # Create image
 qemu-img create -f luks --object secret,id=sec0,data=MTIzNDU2,format=base64 -o 
key-secret=sec0,iter-time=10,cipher-alg=twofish-256,cipher-mode=xts,ivgen-alg=plain64,hash-alg=sha1
 TEST_DIR/luks-twofish-256-xts-plain64-sha1.img 4194304M
-Formatting 'TEST_DIR/luks-twofish-256-xts-plain64-sha1.img', fmt=luks 
size=4398046511104 key-secret=sec0 cipher-alg=twofish-256 cipher-mode=xts 
ivgen-alg=plain64 hash-alg=sha1 iter-time=10
 
 # Open dev
 sudo cryptsetup -q -v luksOpen TEST_DIR/luks-twofish-256-xts-plain64-sha1.img 
qiotest-145-twofish-256-xts-plain64-sha1
@@ -301,7 +299,6 @@ unlink TEST_DIR/luks-serpent-256-xts-plain64-sha1.img
 # = qemu-img serpent-256-xts-plain64-sha1 =
 # Create image
 qemu-img create -f luks --object secret,id=sec0,data=MTIzNDU2,format=base64 -o 
key-secret=sec0,iter-time=10,cipher-alg=serpent-256,cipher-mode=xts,ivgen-alg=plain64,hash-alg=sha1
 TEST_DIR/luks-serpent-256-xts-plain64-sha1.img 4194304M
-Formatting 'TEST_DIR/luks-serpent-256-xts-plain64-sha1.img', fmt=luks 
size=4398046511104 key-secret=sec0 cipher-alg=serpent-256 cipher-mode=xts 
ivgen-alg=plain64 hash-alg=sha1 iter-time=10
 
 # Open dev
 sudo cryptsetup -q -v luksOpen TEST_DIR/luks-serpent-256-xts-plain64-sha1.img 
qiotest-145-serpent-256-xts-plain64-sha1
@@ -421,7 +418,6 @@ unlink TEST_DIR/luks-cast5-128-cbc-plain64-sha1.img
 # = qemu-img cast5-128-cbc-plain64-sha1 =
 # Create image
 qemu-img create -f luks --object secret,id=sec0,data=MTIzNDU2,format=base64 -o 
key-secret=sec0,iter-time=10,cipher-alg=cast5-128,cipher-mode=cbc,ivgen-alg=plain64,hash-alg=sha1
 TEST_DIR/luks-cast5-128-cbc-plain64-sha1.img 4194304M
-Formatting 'TEST_DIR/luks-cast5-128-cbc-plain64-sha1.img', fmt=luks 
size=4398046511104 key-secret=sec0 cipher-alg=cast5-128 cipher-mode=cbc 
ivgen-alg=plain64 hash-alg=sha1 iter-time=10
 
 # Open dev
 sudo cryptsetup -q -v luksOpen TEST_DIR/luks-cast5-128-cbc-plain64-sha1.img 
qiotest-145-cast5-128-cbc-plain64-sha1
@@ -542,7 +538,6 @@ unlink TEST_DIR/luks-aes-256-cbc-plain-sha1.img
 # = qemu-img aes-256-cbc-plain-sha1 =
 # Create image
 qemu-img create -f luks --object secret,id=sec0,data=MTIzNDU2,format=base64 -o 
key-secret=sec0,iter-time=10,cipher-alg=aes-256,cipher-mode=cbc,ivgen-alg=plain,hash-alg=sha1
 TEST_DIR/luks-aes-256-cbc-plain-sha1.img 4194304M
-Formatting 'TEST_DIR/luks-aes-256-cbc-plain-sha1.img', fmt=luks 
size=4398046511104 key-secret=sec0 cipher-alg=aes-256 cipher-mode=cbc 
ivgen-alg=plain hash-alg=sha1 iter-time=10
 
 # Open dev
 sudo cryptsetup -q -v luksOpen TEST_DIR/luks-aes-256-cbc-plain-sha1.img 
qiotest-145-aes-256-cbc-plain-sha1
@@ -662,7 +657,6 @@ unlink TEST_DIR/luks-aes-256-cbc-plain64-sha1.img
 # = qemu-img aes-256-cbc-plain64-sha1 =
 # Create image
 qemu-img create -f luks --object secret,id=sec0,data=MTIzNDU2,format=base64 -o 
key-secret=sec0,iter-time=10,cipher-alg=aes-256,cipher-mode=cbc,ivgen-alg=plain64,hash-alg=sha1
 TEST_DIR/luks-aes-256-cbc-plain64-sha1.img 4194304M
-Formatting 'TEST_DIR/luks-aes-256-cbc-plain64-sha1.img', fmt=luks 
size=4398046511104 key-secret=sec0 cipher-alg=aes-256 cipher-mode=cbc 
ivgen-alg=plain64 hash-alg=sha1 iter-time=10
 
 # Open dev
 sudo cryptsetup -q -v luksOpen TEST_DIR/luks-aes-256-cbc-plain64-sha1.img

[PATCH v4 17/19] iotests: declare lack of support for compresion_type in IMGOPTS

2021-12-03 Thread Vladimir Sementsov-Ogievskiy

compression_type can't be used if we want to create image with
compat=0.10. So, skip these tests, not many of them.

Signed-off-by: Vladimir Sementsov-Ogievskiy 
Reviewed-by: Hanna Reitz 
---
 tests/qemu-iotests/031 | 5 +++--
 tests/qemu-iotests/051 | 5 +++--
 tests/qemu-iotests/061 | 6 +-
 tests/qemu-iotests/112 | 3 ++-
 tests/qemu-iotests/290 | 2 +-
 5 files changed, 14 insertions(+), 7 deletions(-)

diff --git a/tests/qemu-iotests/031 b/tests/qemu-iotests/031
index 648112f796..ee587b1606 100755
--- a/tests/qemu-iotests/031
+++ b/tests/qemu-iotests/031
@@ -42,8 +42,9 @@ trap "_cleanup; exit \$status" 0 1 2 3 15
 _supported_fmt qcow2
 _supported_proto file fuse
 # We want to test compat=0.10, which does not support external data
-# files or refcount widths other than 16
-_unsupported_imgopts data_file 'refcount_bits=\([^1]\|.\([^6]\|$\)\)'
+# files or refcount widths other than 16 or compression type
+_unsupported_imgopts data_file compression_type \
+'refcount_bits=\([^1]\|.\([^6]\|$\)\)'
 
 CLUSTER_SIZE=65536
 
diff --git a/tests/qemu-iotests/051 b/tests/qemu-iotests/051
index 1d2fa93a11..da93587039 100755
--- a/tests/qemu-iotests/051
+++ b/tests/qemu-iotests/051
@@ -41,8 +41,9 @@ _supported_fmt qcow2
 _supported_proto file
 # A compat=0.10 image is created in this test which does not support anything
 # other than refcount_bits=16;
-# it also will not support an external data file
-_unsupported_imgopts 'refcount_bits=\([^1]\|.\([^6]\|$\)\)' data_file
+# it also will not support an external data file and compression type
+_unsupported_imgopts 'refcount_bits=\([^1]\|.\([^6]\|$\)\)' data_file \
+compression_type
 _require_drivers nbd
 
 do_run_qemu()
diff --git a/tests/qemu-iotests/061 b/tests/qemu-iotests/061
index 70edf1a163..513fbec14c 100755
--- a/tests/qemu-iotests/061
+++ b/tests/qemu-iotests/061
@@ -48,7 +48,11 @@ _supported_os Linux
 # not work with it;
 # we have explicit tests for various cluster sizes, the remaining tests
 # require the default 64k cluster
-_unsupported_imgopts 'refcount_bits=\([^1]\|.\([^6]\|$\)\)' data_file 
cluster_size
+# we don't have explicit tests for zstd qcow2 compression type, as zstd may be
+# not compiled in. And we can't create compat images with comression type
+# extension
+_unsupported_imgopts 'refcount_bits=\([^1]\|.\([^6]\|$\)\)' data_file \
+cluster_size compression_type
 
 echo
 echo "=== Testing version downgrade with zero expansion ==="
diff --git a/tests/qemu-iotests/112 b/tests/qemu-iotests/112
index 07ac74fb2c..5333212993 100755
--- a/tests/qemu-iotests/112
+++ b/tests/qemu-iotests/112
@@ -43,7 +43,8 @@ _supported_proto file fuse
 # This test will set refcount_bits on its own which would conflict with the
 # manual setting; compat will be overridden as well;
 # and external data files do not work well with our refcount testing
-_unsupported_imgopts refcount_bits 'compat=0.10' data_file
+# also, compression type is not supported with compat=0.10 used in test
+_unsupported_imgopts refcount_bits 'compat=0.10' data_file compression_type
 
 print_refcount_bits()
 {
diff --git a/tests/qemu-iotests/290 b/tests/qemu-iotests/290
index ed80da2685..776b59de1b 100755
--- a/tests/qemu-iotests/290
+++ b/tests/qemu-iotests/290
@@ -41,7 +41,7 @@ trap "_cleanup; exit \$status" 0 1 2 3 15
 _supported_fmt qcow2
 _supported_proto file fuse
 _supported_os Linux
-_unsupported_imgopts 'compat=0.10' refcount_bits data_file
+_unsupported_imgopts 'compat=0.10' refcount_bits data_file compression_type
 
 echo
 echo "### Test 'qemu-io -c discard' on a QCOW2 image without a backing file"
-- 
2.31.1

[PATCH v4 05/19] iotest 303: explicit compression type

2021-12-03 Thread Vladimir Sementsov-Ogievskiy

The test prints qcow2 header fields which depends on chosen compression
type. So, let's be explicit in what compression type we want and
independent of IMGOPTS. Test both existing compression types.

Signed-off-by: Vladimir Sementsov-Ogievskiy 
Reviewed-by: Max Reitz 
---
 tests/qemu-iotests/303 | 23 +++
 tests/qemu-iotests/303.out | 30 +-
 2 files changed, 44 insertions(+), 9 deletions(-)

diff --git a/tests/qemu-iotests/303 b/tests/qemu-iotests/303
index 425544c064..9dee2bdfb8 100755
--- a/tests/qemu-iotests/303
+++ b/tests/qemu-iotests/303
@@ -53,12 +53,19 @@ def add_bitmap(num, begin, end, disabled):
 log('')
 
 
-qemu_img_create('-f', iotests.imgfmt, disk, '10M')
+def test(compression_type: str, json_output: bool) -> None:
+qemu_img_create('-f', iotests.imgfmt,
+'-o', f'compression_type={compression_type}',
+disk, '10M')
+add_bitmap(1, 0, 6, False)
+add_bitmap(2, 6, 8, True)
 
-add_bitmap(1, 0, 6, False)
-add_bitmap(2, 6, 8, True)
-dump = ['./qcow2.py', disk, 'dump-header']
-subprocess.run(dump)
-# Dump the metadata in JSON format
-dump.append('-j')
-subprocess.run(dump)
+cmd = ['./qcow2.py', disk, 'dump-header']
+if json_output:
+cmd.append('-j')
+
+subprocess.run(cmd)
+
+
+test('zlib', False)
+test('zstd', True)
diff --git a/tests/qemu-iotests/303.out b/tests/qemu-iotests/303.out
index 7c16998587..b3c70827b7 100644
--- a/tests/qemu-iotests/303.out
+++ b/tests/qemu-iotests/303.out
@@ -80,6 +80,34 @@ extra_data_size   0
 Bitmap table   typesize offset
 0  all-zeroes  00
 
+Add bitmap 1
+wrote 1048576/1048576 bytes at offset 0
+1 MiB, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
+
+wrote 1048576/1048576 bytes at offset 1048576
+1 MiB, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
+
+wrote 1048576/1048576 bytes at offset 2097152
+1 MiB, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
+
+wrote 1048576/1048576 bytes at offset 3145728
+1 MiB, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
+
+wrote 1048576/1048576 bytes at offset 4194304
+1 MiB, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
+
+wrote 1048576/1048576 bytes at offset 5242880
+1 MiB, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
+
+
+Add bitmap 2
+wrote 1048576/1048576 bytes at offset 6291456
+1 MiB, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
+
+wrote 1048576/1048576 bytes at offset 7340032
+1 MiB, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
+
+
 {
 "magic": 1363560955,
 "version": 3,
@@ -94,7 +122,7 @@ Bitmap table   typesize offset
 "refcount_table_clusters": 1,
 "nb_snapshots": 0,
 "snapshot_offset": 0,
-"incompatible_features": 0,
+"incompatible_features": 8,
 "compatible_features": 0,
 "autoclear_features": 1,
 "refcount_order": 4,
-- 
2.31.1

[PATCH v4 02/19] iotests.py: qemu_img*("create"): support IMGOPTS='compression_type=zstd'

2021-12-03 Thread Vladimir Sementsov-Ogievskiy

Adding support of IMGOPTS (like in bash tests) allows user to pass a
lot of different options. Still, some may require additional logic.

Now we want compression_type option, so add some smart logic around it:
ignore compression_type=zstd in IMGOPTS, if test want qcow2 in
compatibility mode. As well, ignore compression_type for non-qcow2
formats.

Note that we may instead add support only to qemu_img_create(), but
that works bad:

1. We'll have to update a lot of tests to use qemu_img_create instead
   of qemu_img('create'). (still, we may want do it anyway, but no
   reason to create a dependancy between task of supporting IMGOPTS and
   updating a lot of tests)

2. Some tests use qemu_img_pipe('create', ..) - even more work on
   updating

3. Even if we update all tests to go through qemu_img_create, we'll
   need a way to avoid creating new tests using qemu_img*('create') -
   add assertions.. That doesn't seem good.

So, let's add support of IMGOPTS to most generic
qemu_img_pipe_and_status().

Signed-off-by: Vladimir Sementsov-Ogievskiy 
Reviewed-by: Hanna Reitz 
---
 tests/qemu-iotests/iotests.py | 27 ++-
 1 file changed, 26 insertions(+), 1 deletion(-)

diff --git a/tests/qemu-iotests/iotests.py b/tests/qemu-iotests/iotests.py
index 9b374dad3d..a7a257ee2f 100644
--- a/tests/qemu-iotests/iotests.py
+++ b/tests/qemu-iotests/iotests.py
@@ -16,6 +16,7 @@
 # along with this program.  If not, see .
 #
 
+import argparse
 import atexit
 import bz2
 from collections import OrderedDict
@@ -153,11 +154,35 @@ def qemu_tool_pipe_and_status(tool: str, args: 
Sequence[str],
{-subp.returncode}: {cmd}\n')
 return (output, subp.returncode)
 
+def qemu_img_create_prepare_args(args: List[str]) -> List[str]:
+if not args or args[0] != 'create':
+return list(args)
+args = args[1:]
+
+p = argparse.ArgumentParser(allow_abbrev=False)
+p.add_argument('-f')
+parsed, remaining = p.parse_known_args(args)
+
+result = ['create']
+if parsed.f is not None:
+result += ['-f', parsed.f]
+
+# IMGOPTS most probably contain options specific for the selected format,
+# like extended_l2 or compression_type for qcow2. Test may want to create
+# additional images in other formats that doesn't support these options.
+# So, use IMGOPTS only for images created in imgfmt format.
+if parsed.f == imgfmt and 'IMGOPTS' in os.environ:
+result += ['-o', os.environ['IMGOPTS']]
+
+result += remaining
+
+return result
+
 def qemu_img_pipe_and_status(*args: str) -> Tuple[str, int]:
 """
 Run qemu-img and return both its output and its exit code
 """
-full_args = qemu_img_args + list(args)
+full_args = qemu_img_args + qemu_img_create_prepare_args(list(args))
 return qemu_tool_pipe_and_status('qemu-img', full_args)
 
 def qemu_img(*args: str) -> int:
-- 
2.31.1

[PATCH v4 08/19] iotests.py: filter compression type out

2021-12-03 Thread Vladimir Sementsov-Ogievskiy

We want iotests pass with both the default zlib compression and with
IMGOPTS='compression_type=zstd'.

Actually the only test that is interested in real compression type in
test output is 287 (test for qcow2 compression type) and it's in bash.
So for now we can safely filter out compression type in all qcow2
tests.

Signed-off-by: Vladimir Sementsov-Ogievskiy 
Reviewed-by: Hanna Reitz 
---
 tests/qemu-iotests/206.out| 10 +-
 tests/qemu-iotests/242.out| 10 +-
 tests/qemu-iotests/274.out| 10 +-
 tests/qemu-iotests/iotests.py |  2 ++
 4 files changed, 17 insertions(+), 15 deletions(-)

diff --git a/tests/qemu-iotests/206.out b/tests/qemu-iotests/206.out
index 80cd274223..7e95694777 100644
--- a/tests/qemu-iotests/206.out
+++ b/tests/qemu-iotests/206.out
@@ -18,7 +18,7 @@ virtual size: 128 MiB (134217728 bytes)
 cluster_size: 65536
 Format specific information:
 compat: 1.1
-compression type: zlib
+compression type: COMPRESSION_TYPE
 lazy refcounts: false
 refcount bits: 16
 corrupt: false
@@ -42,7 +42,7 @@ virtual size: 64 MiB (67108864 bytes)
 cluster_size: 65536
 Format specific information:
 compat: 1.1
-compression type: zlib
+compression type: COMPRESSION_TYPE
 lazy refcounts: false
 refcount bits: 16
 corrupt: false
@@ -66,7 +66,7 @@ virtual size: 32 MiB (33554432 bytes)
 cluster_size: 2097152
 Format specific information:
 compat: 1.1
-compression type: zlib
+compression type: COMPRESSION_TYPE
 lazy refcounts: true
 refcount bits: 1
 corrupt: false
@@ -92,7 +92,7 @@ backing file: TEST_IMG.base
 backing file format: IMGFMT
 Format specific information:
 compat: 0.10
-compression type: zlib
+compression type: COMPRESSION_TYPE
 refcount bits: 16
 
 === Successful image creation (encrypted) ===
@@ -109,7 +109,7 @@ encrypted: yes
 cluster_size: 65536
 Format specific information:
 compat: 1.1
-compression type: zlib
+compression type: COMPRESSION_TYPE
 lazy refcounts: false
 refcount bits: 16
 encrypt:
diff --git a/tests/qemu-iotests/242.out b/tests/qemu-iotests/242.out
index 3759c99284..ce231424a7 100644
--- a/tests/qemu-iotests/242.out
+++ b/tests/qemu-iotests/242.out
@@ -12,7 +12,7 @@ virtual size: 1 MiB (1048576 bytes)
 cluster_size: 65536
 Format specific information:
 compat: 1.1
-compression type: zlib
+compression type: COMPRESSION_TYPE
 lazy refcounts: false
 refcount bits: 16
 corrupt: false
@@ -34,7 +34,7 @@ virtual size: 1 MiB (1048576 bytes)
 cluster_size: 65536
 Format specific information:
 compat: 1.1
-compression type: zlib
+compression type: COMPRESSION_TYPE
 lazy refcounts: false
 bitmaps:
 [0]:
@@ -68,7 +68,7 @@ virtual size: 1 MiB (1048576 bytes)
 cluster_size: 65536
 Format specific information:
 compat: 1.1
-compression type: zlib
+compression type: COMPRESSION_TYPE
 lazy refcounts: false
 bitmaps:
 [0]:
@@ -110,7 +110,7 @@ virtual size: 1 MiB (1048576 bytes)
 cluster_size: 65536
 Format specific information:
 compat: 1.1
-compression type: zlib
+compression type: COMPRESSION_TYPE
 lazy refcounts: false
 bitmaps:
 [0]:
@@ -161,7 +161,7 @@ virtual size: 1 MiB (1048576 bytes)
 cluster_size: 65536
 Format specific information:
 compat: 1.1
-compression type: zlib
+compression type: COMPRESSION_TYPE
 lazy refcounts: false
 bitmaps:
 [0]:
diff --git a/tests/qemu-iotests/274.out b/tests/qemu-iotests/274.out
index 1d2928e14d..1ce40d839a 100644
--- a/tests/qemu-iotests/274.out
+++ b/tests/qemu-iotests/274.out
@@ -50,7 +50,7 @@ backing file: TEST_DIR/PID-base
 backing file format: IMGFMT
 Format specific information:
 compat: 1.1
-compression type: zlib
+compression type: COMPRESSION_TYPE
 lazy refcounts: false
 refcount bits: 16
 corrupt: false
@@ -79,7 +79,7 @@ backing file: TEST_DIR/PID-base
 backing file format: IMGFMT
 Format specific information:
 compat: 1.1
-compression type: zlib
+compression type: COMPRESSION_TYPE
 lazy refcounts: false
 refcount bits: 16
 corrupt: false
@@ -114,7 +114,7 @@ backing file: TEST_DIR/PID-base
 backing file format: IMGFMT
 Format specific information:
 compat: 1.1
-compression type: zlib
+compression type: COMPRESSION_TYPE
 lazy refcounts: false
 refcount bits: 16
 corrupt: false
@@ -141,7 +141,7 @@ virtual size: 2 MiB (2097152 bytes)
 cluster_size: 65536
 Format specific information:
 compat: 1.1
-compression type: zlib
+compression type: COMPRESSION_TYPE
 lazy refcounts: false
 refcount bits: 16
 corrupt: false
@@ -176,7 +176,7 @@ backing file: TEST_DIR/PID-base
 backing file format: IMGFMT
 Format specific information:
 compat: 1.1
-compression type: zlib
+compression type: COMPRESSION_TYPE
 lazy refcounts: false
 refcount bits: 16
 corrupt: false
diff

[PATCH v4 00/19] iotests: support zstd

2021-12-03 Thread Vladimir Sementsov-Ogievskiy

These series makes tests pass with

   IMGOPTS='compression_type=zstd'

Also, python iotests start to support IMGOPTS (they didn't before).

v4: 18,19: new: add unsupported_imgopts as suggested by Hanna.

Vladimir Sementsov-Ogievskiy (19):
  iotests.py: img_info_log(): rename imgopts argument
  iotests.py: qemu_img*("create"): support
IMGOPTS='compression_type=zstd'
  iotests: drop qemu_img_verbose() helper
  iotests.py: rewrite default luks support in qemu_img
  iotest 303: explicit compression type
  iotest 065: explicit compression type
  iotests.py: filter out successful output of qemu-img create
  iotests.py: filter compression type out
  iotest 302: use img_info_log() helper
  qcow2: simple case support for downgrading of qcow2 images with zstd
  iotests/common.rc: introduce _qcow2_dump_header helper
  iotests: massive use _qcow2_dump_header
  iotest 39: use _qcow2_dump_header
  iotests: bash tests: filter compression type
  iotests 60: more accurate set dirty bit in qcow2 header
  iotest 214: explicit compression type
  iotests: declare lack of support for compresion_type in IMGOPTS
  iotests.py: implement unsupported_imgopts
  iotests: specify some unsupported_imgopts for python iotests

 block/qcow2.c | 58 ++-
 tests/qemu-iotests/031| 11 ++-
 tests/qemu-iotests/036|  6 +-
 tests/qemu-iotests/039| 22 ++---
 tests/qemu-iotests/044|  8 +-
 tests/qemu-iotests/044.out|  1 +
 tests/qemu-iotests/051|  5 +-
 tests/qemu-iotests/060| 22 ++---
 tests/qemu-iotests/060.out|  2 +-
 tests/qemu-iotests/061| 42 
 tests/qemu-iotests/061.out| 12 +--
 tests/qemu-iotests/065| 19 ++--
 tests/qemu-iotests/082.out| 14 +--
 tests/qemu-iotests/112|  3 +-
 tests/qemu-iotests/137|  2 +-
 tests/qemu-iotests/149.out| 21 
 tests/qemu-iotests/163|  3 +-
 tests/qemu-iotests/165|  3 +-
 tests/qemu-iotests/196|  3 +-
 tests/qemu-iotests/198.out|  4 +-
 tests/qemu-iotests/206.out| 10 +-
 tests/qemu-iotests/209|  7 +-
 tests/qemu-iotests/209.out|  2 +
 tests/qemu-iotests/210|  8 +-
 tests/qemu-iotests/214|  2 +-
 tests/qemu-iotests/237.out|  3 -
 tests/qemu-iotests/242|  3 +-
 tests/qemu-iotests/242.out| 10 +-
 tests/qemu-iotests/246|  3 +-
 tests/qemu-iotests/254|  3 +-
 tests/qemu-iotests/255.out|  4 -
 tests/qemu-iotests/260|  4 +-
 tests/qemu-iotests/274|  3 +-
 tests/qemu-iotests/274.out| 39 +---
 tests/qemu-iotests/280.out|  1 -
 tests/qemu-iotests/281|  3 +-
 tests/qemu-iotests/287|  8 +-
 tests/qemu-iotests/290|  2 +-
 tests/qemu-iotests/296.out| 10 +-
 tests/qemu-iotests/302|  4 +-
 tests/qemu-iotests/302.out|  7 +-
 tests/qemu-iotests/303| 26 +++--
 tests/qemu-iotests/303.out| 30 +-
 tests/qemu-iotests/common.filter  |  8 ++
 tests/qemu-iotests/common.rc  | 22 +
 tests/qemu-iotests/iotests.py | 99 +--
 .../tests/migrate-bitmaps-postcopy-test   |  3 +-
 tests/qemu-iotests/tests/migrate-bitmaps-test |  3 +-
 .../qemu-iotests/tests/migrate-during-backup  |  3 +-
 .../tests/remove-bitmap-from-backing  |  3 +-
 50 files changed, 358 insertions(+), 236 deletions(-)

-- 
2.31.1

[PATCH v4 18/19] iotests.py: implement unsupported_imgopts

2021-12-03 Thread Vladimir Sementsov-Ogievskiy

We have added support for some addition IMGOPTS in python iotests like
in bash iotests. Similarly to bash iotests, we want a way to skip some
tests which can't work with specific IMGOPTS.

Globally for python iotests we now don't support things like
'data_file=$TEST_IMG.ext_data_file' in IMGOPTS, so, forbid this
globally in iotests.py.

Suggested-by: Hanna Reitz 
Signed-off-by: Vladimir Sementsov-Ogievskiy 
---
 tests/qemu-iotests/iotests.py | 15 ++-
 1 file changed, 14 insertions(+), 1 deletion(-)

diff --git a/tests/qemu-iotests/iotests.py b/tests/qemu-iotests/iotests.py
index 903686b402..2cbb13d67a 100644
--- a/tests/qemu-iotests/iotests.py
+++ b/tests/qemu-iotests/iotests.py
@@ -1238,6 +1238,17 @@ def _verify_virtio_scsi_pci_or_ccw() -> None:
 notrun('Missing virtio-scsi-pci or virtio-scsi-ccw in QEMU binary')
 
 
+def _verify_imgopts(unsupported: Sequence[str] = ()) -> None:
+imgopts = os.environ.get('IMGOPTS')
+# One of usage examples for IMGOPTS is "data_file=$TEST_IMG.ext_data_file"
+# but it supported only for bash tests. We don't have a concept of global
+# TEST_IMG in iotests.py, not saying about somehow parsing $variables.
+# So, for simplicity let's just not support any IMGOPTS with '$' inside.
+unsup = list(unsupported) + ['$']
+if imgopts and any(x in imgopts for x in unsup):
+notrun(f'not suitable for this imgopts: {imgopts}')
+
+
 def supports_quorum():
 return 'quorum' in qemu_img_pipe('--help')
 
@@ -1414,7 +1425,8 @@ def execute_setup_common(supported_fmts: Sequence[str] = 
(),
  unsupported_fmts: Sequence[str] = (),
  supported_protocols: Sequence[str] = (),
  unsupported_protocols: Sequence[str] = (),
- required_fmts: Sequence[str] = ()) -> bool:
+ required_fmts: Sequence[str] = (),
+ unsupported_imgopts: Sequence[str] = ()) -> bool:
 """
 Perform necessary setup for either script-style or unittest-style tests.
 
@@ -1434,6 +1446,7 @@ def execute_setup_common(supported_fmts: Sequence[str] = 
(),
 _verify_aio_mode(supported_aio_modes)
 _verify_formats(required_fmts)
 _verify_virtio_blk()
+_verify_imgopts(unsupported_imgopts)
 
 return debug
 
-- 
2.31.1

[PATCH v4 03/19] iotests: drop qemu_img_verbose() helper

2021-12-03 Thread Vladimir Sementsov-Ogievskiy

qemu_img_verbose() has a drawback of not going through generic
qemu_img_pipe_and_status(). qemu_img_verbose() is not very popular, so
update the only two users to qemu_img_log() and drop qemu_img_verbose()
at all.

Signed-off-by: Vladimir Sementsov-Ogievskiy 
Reviewed-by: Hanna Reitz 
---
 tests/qemu-iotests/044| 5 +++--
 tests/qemu-iotests/044.out| 1 +
 tests/qemu-iotests/209| 7 ---
 tests/qemu-iotests/209.out| 2 ++
 tests/qemu-iotests/iotests.py | 8 
 5 files changed, 10 insertions(+), 13 deletions(-)

diff --git a/tests/qemu-iotests/044 b/tests/qemu-iotests/044
index 64b18eb7c8..714329eb16 100755
--- a/tests/qemu-iotests/044
+++ b/tests/qemu-iotests/044
@@ -24,7 +24,7 @@ import os
 import qcow2
 from qcow2 import QcowHeader
 import iotests
-from iotests import qemu_img, qemu_img_verbose, qemu_io
+from iotests import qemu_img, qemu_img_log, qemu_io
 import struct
 import subprocess
 import sys
@@ -112,9 +112,10 @@ class TestRefcountTableGrowth(iotests.QMPTestCase):
 
 def test_grow_refcount_table(self):
 qemu_io('-c', 'write 3800M 1M', test_img)
-qemu_img_verbose('check' , test_img)
+qemu_img_log('check' , test_img)
 pass
 
 if __name__ == '__main__':
+iotests.activate_logging()
 iotests.main(supported_fmts=['qcow2'],
  supported_protocols=['file'])
diff --git a/tests/qemu-iotests/044.out b/tests/qemu-iotests/044.out
index 703cf3dee1..ff663b17d7 100644
--- a/tests/qemu-iotests/044.out
+++ b/tests/qemu-iotests/044.out
@@ -1,6 +1,7 @@
 No errors were found on the image.
 7292415/33554432 = 21.73% allocated, 0.00% fragmented, 0.00% compressed 
clusters
 Image end offset: 4296217088
+
 .
 --
 Ran 1 tests
diff --git a/tests/qemu-iotests/209 b/tests/qemu-iotests/209
index ff7efea11b..f6ad08ec42 100755
--- a/tests/qemu-iotests/209
+++ b/tests/qemu-iotests/209
@@ -20,8 +20,8 @@
 #
 
 import iotests
-from iotests import qemu_img_create, qemu_io, qemu_img_verbose, qemu_nbd, \
-file_path
+from iotests import qemu_img_create, qemu_io, qemu_img_log, qemu_nbd, \
+file_path, log
 
 iotests.script_initialize(supported_fmts=['qcow2'])
 
@@ -33,4 +33,5 @@ qemu_img_create('-f', iotests.imgfmt, disk, '1M')
 qemu_io('-f', iotests.imgfmt, '-c', 'write 0 512K', disk)
 
 qemu_nbd('-k', nbd_sock, '-x', 'exp', '-f', iotests.imgfmt, disk)
-qemu_img_verbose('map', '-f', 'raw', '--output=json', nbd_uri)
+qemu_img_log('map', '-f', 'raw', '--output=json', nbd_uri)
+log('done.')  # avoid new line at the end of output file
diff --git a/tests/qemu-iotests/209.out b/tests/qemu-iotests/209.out
index f27be3fa7b..515906ac7a 100644
--- a/tests/qemu-iotests/209.out
+++ b/tests/qemu-iotests/209.out
@@ -1,2 +1,4 @@
 [{ "start": 0, "length": 524288, "depth": 0, "present": true, "zero": false, 
"data": true, "offset": 0},
 { "start": 524288, "length": 524288, "depth": 0, "present": true, "zero": 
true, "data": false, "offset": 524288}]
+
+done.
diff --git a/tests/qemu-iotests/iotests.py b/tests/qemu-iotests/iotests.py
index a7a257ee2f..4846ba7c60 100644
--- a/tests/qemu-iotests/iotests.py
+++ b/tests/qemu-iotests/iotests.py
@@ -227,14 +227,6 @@ def qemu_img_measure(*args):
 def qemu_img_check(*args):
 return json.loads(qemu_img_pipe("check", "--output", "json", *args))
 
-def qemu_img_verbose(*args):
-'''Run qemu-img without suppressing its output and return the exit code'''
-exitcode = subprocess.call(qemu_img_args + list(args))
-if exitcode < 0:
-sys.stderr.write('qemu-img received signal %i: %s\n'
- % (-exitcode, ' '.join(qemu_img_args + list(args
-return exitcode
-
 def qemu_img_pipe(*args: str) -> str:
 '''Run qemu-img and return its output'''
 return qemu_img_pipe_and_status(*args)[0]
-- 
2.31.1

[PATCH 2/3] iotests/testrunner.py: move updating last_elapsed to run_tests

2021-12-03 Thread Vladimir Sementsov-Ogievskiy

We are going to use do_run_test() in multiprocessing environment, where
we'll not be able to change original runner object.

Happily, the only thing we change is that last_elapsed and it's simple
to do it in run_tests() instead. All other accesses to self in
do_runt_test() and in run_test() are read-only.

Signed-off-by: Vladimir Sementsov-Ogievskiy 
---
 tests/qemu-iotests/testrunner.py | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/tests/qemu-iotests/testrunner.py b/tests/qemu-iotests/testrunner.py
index fa842252d3..a9f2feb58c 100644
--- a/tests/qemu-iotests/testrunner.py
+++ b/tests/qemu-iotests/testrunner.py
@@ -287,7 +287,6 @@ def do_run_test(self, test: str) -> TestResult:
   diff=diff, casenotrun=casenotrun)
 else:
 f_bad.unlink()
-self.last_elapsed.update(test, elapsed)
 return TestResult(status='pass', elapsed=elapsed,
   casenotrun=casenotrun)
 
@@ -353,6 +352,9 @@ def run_tests(self, tests: List[str]) -> bool:
 print('\n'.join(res.diff))
 elif res.status == 'not run':
 notrun.append(name)
+elif res.status == 'pass':
+assert res.elapsed is not None
+self.last_elapsed.update(t, res.elapsed)
 
 sys.stdout.flush()
 if res.interrupted:
-- 
2.31.1

[PATCH 3/3] iotests: check: multiprocessing support

2021-12-03 Thread Vladimir Sementsov-Ogievskiy

Add -j  parameter, to run tests in several jobs simultaneously.
For realization - simply utilize multiprocessing.Pool class.

Notes:

1. Of course, tests can't run simultaneously in same TEST_DIR. So,
   use subdirectories TEST_DIR/testname/ and SOCK_DIR/testname/
   instead of simply TEST_DIR and SOCK_DIR

2. multiprocessing.Pool.starmap function doesn't support passing
   context managers, so we can't simply pass "self". Happily, we need
   self only for read-only access, and it just works if it is defined
   in global space. So, add a temporary link TestRunner.shared_self
   during run_tests().

Signed-off-by: Vladimir Sementsov-Ogievskiy 
---
 tests/qemu-iotests/check |  4 +-
 tests/qemu-iotests/testrunner.py | 69 
 2 files changed, 64 insertions(+), 9 deletions(-)

diff --git a/tests/qemu-iotests/check b/tests/qemu-iotests/check
index 43a4b694cc..0c27721a41 100755
--- a/tests/qemu-iotests/check
+++ b/tests/qemu-iotests/check
@@ -34,6 +34,8 @@ def make_argparser() -> argparse.ArgumentParser:
help='show me, do not run tests')
 p.add_argument('-makecheck', action='store_true',
help='pretty print output for make check')
+p.add_argument('-j', dest='jobs', type=int, default=1,
+   help='run tests in multiple parallel jobs')
 
 p.add_argument('-d', dest='debug', action='store_true', help='debug')
 p.add_argument('-p', dest='print', action='store_true',
@@ -165,6 +167,6 @@ if __name__ == '__main__':
 with TestRunner(env, makecheck=args.makecheck,
 color=args.color) as tr:
 paths = [os.path.join(env.source_iotests, t) for t in tests]
-ok = tr.run_tests(paths)
+ok = tr.run_tests(paths, args.jobs)
 if not ok:
 sys.exit(1)
diff --git a/tests/qemu-iotests/testrunner.py b/tests/qemu-iotests/testrunner.py
index a9f2feb58c..0feaa396d0 100644
--- a/tests/qemu-iotests/testrunner.py
+++ b/tests/qemu-iotests/testrunner.py
@@ -26,6 +26,7 @@
 import json
 import termios
 import sys
+from multiprocessing import Pool
 from contextlib import contextmanager
 from typing import List, Optional, Iterator, Any, Sequence, Dict, \
 ContextManager
@@ -126,6 +127,31 @@ def __init__(self, status: str, description: str = '',
 
 
 class TestRunner(ContextManager['TestRunner']):
+shared_self = None
+
+@staticmethod
+def proc_run_test(test: str, test_field_width: int) -> TestResult:
+# We are in a subprocess, we can't change the runner object!
+runner = TestRunner.shared_self
+assert runner is not None
+return runner.run_test(test, test_field_width, mp=True)
+
+def run_tests_pool(self, tests: List[str],
+   test_field_width: int, jobs: int) -> List[TestResult]:
+
+# passing self directly to Pool.starmap() just doesn't work, because
+# it's a context manager.
+assert TestRunner.shared_self is None
+TestRunner.shared_self = self
+
+with Pool(jobs) as p:
+results = p.starmap(self.proc_run_test,
+zip(tests, [test_field_width] * len(tests)))
+
+TestRunner.shared_self = None
+
+return results
+
 def __init__(self, env: TestEnv, makecheck: bool = False,
  color: str = 'auto') -> None:
 self.env = env
@@ -219,11 +245,16 @@ def find_reference(self, test: str) -> str:
 
 return f'{test}.out'
 
-def do_run_test(self, test: str) -> TestResult:
+def do_run_test(self, test: str, mp: bool) -> TestResult:
 """
 Run one test
 
 :param test: test file path
+:param mp: if true, we are in a multiprocessing environment, use
+   personal subdirectories for test run
+
+Note: this method may be called from subprocess, so it does not
+change ``self`` object in any way!
 """
 
 f_test = Path(test)
@@ -249,6 +280,12 @@ def do_run_test(self, test: str) -> TestResult:
 
 args = [str(f_test.resolve())]
 env = self.env.prepare_subprocess(args)
+if mp:
+# Split test directories, so that tests running in parallel don't
+# break each other.
+for d in ['TEST_DIR', 'SOCK_DIR']:
+env[d] = os.path.join(env[d], f_test.name)
+Path(env[d]).mkdir(parents=True, exist_ok=True)
 
 t0 = time.time()
 with f_bad.open('w', encoding="utf-8") as f:
@@ -291,23 +328,32 @@ def do_run_test(self, test: str) -> TestResult:
   casenotrun=casenotrun)
 
 def run_test(self, test: str,
- test_field_width: Optional[int] = None) -> TestResult:
+ test_field_width: Optional[int] = None,
+ mp: bool = False) -> TestResult:
 """
 Run one test and print short status
 
 :param test: test file path

[PATCH 1/3] iotests/testrunner.py: add doc string for run_test()

2021-12-03 Thread Vladimir Sementsov-Ogievskiy

We are going to modify these methods and will add more documentation in
further commit. As a preparation add basic documentation.

Signed-off-by: Vladimir Sementsov-Ogievskiy 
---
 tests/qemu-iotests/testrunner.py | 13 +
 1 file changed, 13 insertions(+)

diff --git a/tests/qemu-iotests/testrunner.py b/tests/qemu-iotests/testrunner.py
index 0e29c2fddd..fa842252d3 100644
--- a/tests/qemu-iotests/testrunner.py
+++ b/tests/qemu-iotests/testrunner.py
@@ -220,6 +220,12 @@ def find_reference(self, test: str) -> str:
 return f'{test}.out'
 
 def do_run_test(self, test: str) -> TestResult:
+"""
+Run one test
+
+:param test: test file path
+"""
+
 f_test = Path(test)
 f_bad = Path(f_test.name + '.out.bad')
 f_notrun = Path(f_test.name + '.notrun')
@@ -287,6 +293,13 @@ def do_run_test(self, test: str) -> TestResult:
 
 def run_test(self, test: str,
  test_field_width: Optional[int] = None) -> TestResult:
+"""
+Run one test and print short status
+
+:param test: test file path
+:param test_field_width: width for first field of status format
+"""
+
 last_el = self.last_elapsed.get(test)
 start = datetime.datetime.now().strftime('%H:%M:%S')
 
-- 
2.31.1

[PATCH 0/3] iotests: multiprocessing!!

2021-12-03 Thread Vladimir Sementsov-Ogievskiy

Hi all!

Finally, I can not stand it any longer. So, I'm happy to present
multiprocessing support for iotests test runner.

testing on tmpfs:

Before:

time check -qcow2
...
real12m28.095s
user9m53.398s
sys 2m55.548s

After:

time check -qcow2 -j 12
...
real2m12.136s
user12m40.948s
sys 4m7.449s


Hmm, seems -j 6 should be enough. I have 6 cores, 2 threads per core.
Anyway, that's so fast!

Vladimir Sementsov-Ogievskiy (3):
  iotests/testrunner.py: add doc string for run_test()
  iotests/testrunner.py: move updating last_elapsed to run_tests
  iotests: check: multiprocessing support

 tests/qemu-iotests/check |  4 +-
 tests/qemu-iotests/testrunner.py | 86 
 2 files changed, 80 insertions(+), 10 deletions(-)

-- 
2.31.1

Re: [RFC PATCH 2/2] qemu-img convert: Fix sparseness detection

2021-12-03 Thread Peter Lieven

Am 19.05.21 um 18:48 schrieb Kevin Wolf:
> Am 19.05.2021 um 15:24 hat Peter Lieven geschrieben:
>> Am 20.04.21 um 18:52 schrieb Vladimir Sementsov-Ogievskiy:
>>> 20.04.2021 18:04, Kevin Wolf wrote:
 Am 20.04.2021 um 16:31 hat Vladimir Sementsov-Ogievskiy geschrieben:
> 15.04.2021 18:22, Kevin Wolf wrote:
>> In order to avoid RMW cycles, is_allocated_sectors() treats zeroed areas
>> like non-zero data if the end of the checked area isn't aligned. This
>> can improve the efficiency of the conversion and was introduced in
>> commit 8dcd3c9b91a.
>>
>> However, it comes with a correctness problem: qemu-img convert is
>> supposed to sparsify areas that contain only zeros, which it doesn't do
>> any more. It turns out that this even happens when not only the
>> unaligned area is zeroed, but also the blocks before and after it. In
>> the bug report, conversion of a fragmented 10G image containing only
>> zeros resulted in an image consuming 2.82 GiB even though the expected
>> size is only 4 KiB.
>>
>> As a tradeoff between both, let's ignore zeroed sectors only after
>> non-zero data to fix the alignment, but if we're only looking at zeros,
>> keep them as such, even if it may mean additional RMW cycles.
>>
> Hmm.. If I understand correctly, we are going to do unaligned
> write-zero. And that helps.
 This can happen (mostly raw images on block devices, I think?), but
 usually it just means skipping the write because we know that the target
 image is already zeroed.

 What it does mean is that if the next part is data, we'll have an
 unaligned data write.

> Doesn't that mean that alignment is wrongly detected?
 The problem is that you can have bdrv_block_status_above() return the
 same allocation status multiple times in a row, but *pnum can be
 unaligned for the conversion.

 We only look at a single range returned by it when detecting the
 alignment, so it could be that we have zero buffers for both 0-11 and
 12-16 and detect two misaligned ranges, when both together are a
 perfectly aligned zeroed range.

 In theory we could try to do some lookahead and merge ranges where
 possible, which should give us the perfect result, but it would make the
 code considerably more complicated. (Whether we want to merge them
 doesn't only depend on the block status, but possibly also on the
 content of a DATA range.)

 Kevin

>>> Oh, I understand now the problem, thanks for explanation.
>>>
>>> Hmm, yes that means, that if the whole buf is zero, is_allocated_sectors 
>>> must not align it down, to be possibly "merged" with next chunk if it is 
>>> zero too.
>>>
>>> But it's still good to align zeroes down, if data starts somewhere inside 
>>> the buf, isn't it?
>>>
>>> what about something like this:
>>>
>>> diff --git a/qemu-img.c b/qemu-img.c
>>> index babb5573ab..d1704584a0 100644
>>> --- a/qemu-img.c
>>> +++ b/qemu-img.c
>>> @@ -1167,19 +1167,39 @@ static int is_allocated_sectors(const uint8_t *buf, 
>>> int n, int *pnum,
>>>  }
>>>  }
>>>  
>>> +    if (i == n) {
>>> +    /*
>>> + * The whole buf is the same.
>>> + *
>>> + * if it's data, just return it. It's the old behavior.
>>> + *
>>> + * if it's zero, just return too. It will work good if target is 
>>> alredy
>>> + * zeroed. And if next chunk is zero too we'll have no RMW and no 
>>> reason
>>> + * to write data.
>>> + */
>>> +    *pnum = i;
>>> +    return !is_zero;
>>> +    }
>>> +
>>>  tail = (sector_num + i) & (alignment - 1);
>>>  if (tail) {
>>>  if (is_zero && i <= tail) {
>>> -    /* treat unallocated areas which only consist
>>> - * of a small tail as allocated. */
>>> +    /*
>>> + * For sure next sector after i is data, and it will rewrite 
>>> this
>>> + * tail anyway due to RMW. So, let's just write data now.
>>> + */
>>>  is_zero = false;
>>>  }
>>>  if (!is_zero) {
>>> -    /* align up end offset of allocated areas. */
>>> +    /* If possible, align up end offset of allocated areas. */
>>>  i += alignment - tail;
>>>  i = MIN(i, n);
>>>  } else {
>>> -    /* align down end offset of zero areas. */
>>> +    /*
>>> + * For sure next sector after i is data, and it will rewrite 
>>> this
>>> + * tail anyway due to RMW. Better is avoid RMW and write 
>>> zeroes up
>>> + * to aligned bound.
>>> + */
>>>  i -= tail;
>>>  }
>>>  }
>> I think we forgot to follow up on this. Has anyone tested this
>> suggestion?
>>
>> Otherwise, I would try to rerun the tests I did with the my old and
>> Kevins suggestion.
> I noticed earlier this week that these

72 matches

Mail list logo