Re: [PATCH v2 01/16] migration: Don't use INT64_MAX for unlimited rate

2023-05-15 Thread Harsh Prateek Bora




On 5/16/23 01:26, Juan Quintela wrote:

Define and use RATE_LIMIT_MAX instead.

Signed-off-by: Juan Quintela 
---
  migration/migration-stats.h | 6 ++
  migration/migration.c   | 4 ++--
  migration/qemu-file.c   | 6 +-
  3 files changed, 13 insertions(+), 3 deletions(-)

diff --git a/migration/migration-stats.h b/migration/migration-stats.h
index cf8a4f0410..e782f1b0df 100644
--- a/migration/migration-stats.h
+++ b/migration/migration-stats.h
@@ -15,6 +15,12 @@
  
  #include "qemu/stats64.h"
  
+/*
+ * If rate_limit_max is 0, there is special code to remove the rate
+ * limit.
+ */

.. there is a check in qemu_file_rate_limit() ..

+#define RATE_LIMIT_MAX 0
+

Reviewed-by: Harsh Prateek Bora 


  /*
   * These are the ram migration statistic counters.  It is loosely
   * based on MigrationStats.  We change to Stat64 any counter that
diff --git a/migration/migration.c b/migration/migration.c
index 039bba4804..c41c7491bb 100644
--- a/migration/migration.c
+++ b/migration/migration.c
@@ -2304,7 +2304,7 @@ static void migration_completion(MigrationState *s)
   * them if migration fails or is cancelled.
   */
  s->block_inactive = !migrate_colo();
-qemu_file_set_rate_limit(s->to_dst_file, INT64_MAX);
+qemu_file_set_rate_limit(s->to_dst_file, RATE_LIMIT_MAX);
  ret = qemu_savevm_state_complete_precopy(s->to_dst_file, false,
                                           s->block_inactive);
  }
@@ -3048,7 +3048,7 @@ static void *bg_migration_thread(void *opaque)
  rcu_register_thread();
  object_ref(OBJECT(s));
  
-qemu_file_set_rate_limit(s->to_dst_file, INT64_MAX);
+qemu_file_set_rate_limit(s->to_dst_file, RATE_LIMIT_MAX);
  
  setup_start = qemu_clock_get_ms(QEMU_CLOCK_HOST);

  /*
diff --git a/migration/qemu-file.c b/migration/qemu-file.c
index 597054759d..4bc875b452 100644
--- a/migration/qemu-file.c
+++ b/migration/qemu-file.c
@@ -27,6 +27,7 @@
  #include "qemu/error-report.h"
  #include "qemu/iov.h"
  #include "migration.h"
+#include "migration-stats.h"
  #include "qemu-file.h"
  #include "trace.h"
  #include "options.h"
@@ -732,7 +733,10 @@ int qemu_file_rate_limit(QEMUFile *f)
  if (qemu_file_get_error(f)) {
  return 1;
  }
-if (f->rate_limit_max > 0 && f->rate_limit_used > f->rate_limit_max) {
+if (f->rate_limit_max == RATE_LIMIT_MAX) {
+return 0;
+}
+if (f->rate_limit_used > f->rate_limit_max) {
  return 1;
  }
  return 0;
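
In other words, the check now reduces to the following standalone
sketch (hypothetical MockFile type, not QEMU code): a rate_limit_max of
RATE_LIMIT_MAX (0) disables throttling entirely.

    #include <stdbool.h>
    #include <stdint.h>

    #define RATE_LIMIT_MAX 0   /* 0 means "no rate limit", as in the patch */

    typedef struct {
        uint64_t rate_limit_max;    /* bytes allowed per cycle, 0 = unlimited */
        uint64_t rate_limit_used;   /* bytes sent in the current cycle */
    } MockFile;

    static bool rate_limit_exceeded(const MockFile *f)
    {
        if (f->rate_limit_max == RATE_LIMIT_MAX) {
            return false;                      /* unlimited: never throttle */
        }
        return f->rate_limit_used > f->rate_limit_max;
    }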




Re: [PULL v2 00/16] Block patches

2023-05-15 Thread Richard Henderson

On 5/15/23 09:04, Stefan Hajnoczi wrote:

The following changes since commit 8844bb8d896595ee1d25d21c770e6e6f29803097:

   Merge tag 'or1k-pull-request-20230513' of https://github.com/stffrdhrn/qemu into staging (2023-05-13 11:23:14 +0100)

are available in the Git repository at:

   https://gitlab.com/stefanha/qemu.git  tags/block-pull-request

for you to fetch changes up to 01562fee5f3ad4506d57dbcf4b1903b565eceec7:

   docs/zoned-storage:add zoned emulation use case (2023-05-15 08:19:04 -0400)


Pull request

This pull request contains Sam Li's zoned storage support in the QEMU
block layer and virtio-blk emulation.

v2:
- Sam fixed the CI failures. CI passes for me now. [Richard]


Applied, thanks.  Please update https://wiki.qemu.org/ChangeLog/8.1 as 
appropriate.


r~




Re: [PATCH 1/8] block: Call .bdrv_co_create(_opts) unlocked

2023-05-15 Thread Eric Blake


On Mon, May 15, 2023 at 06:19:41PM +0200, Kevin Wolf wrote:
> > > @@ -3724,8 +3726,10 @@ qcow2_co_create(BlockdevCreateOptions *create_options, Error **errp)
> > >  goto out;
> > >  }
> > >  
> > > +bdrv_graph_co_rdlock();
> > >  ret = qcow2_alloc_clusters(blk_bs(blk), 3 * cluster_size);
> > >  if (ret < 0) {
> > > +bdrv_graph_co_rdunlock();
> > >  error_setg_errno(errp, -ret, "Could not allocate clusters for qcow2 "
> > >   "header and refcount table");
> > >  goto out;
> > > @@ -3743,6 +3747,8 @@ qcow2_co_create(BlockdevCreateOptions *create_options, Error **errp)
> > >  
> > >  /* Create a full header (including things like feature table) */
> > >  ret = qcow2_update_header(blk_bs(blk));
> > > +bdrv_graph_co_rdunlock();
> > > +
> > 
> > If we ever inject any 'goto out' in the elided lines, we're in
> > trouble.  Would this be any safer by wrapping the intervening
> > statements in a scope-guarded lock?
> 
> TSA doesn't understand these guards, which is why they are only
> annotated as assertions (I think we talked about this in my previous
> series), at the cost of leaving unlocking unchecked. So in cases where
> the scope isn't the full function, individual calls are better at the
> moment. Once clang implements support for __attribute__((cleanup)), we
> can maybe change this.

Yeah, LOTS of people are waiting on clang __attribute__((cleanup))
analysis sanity ;)

> 
> Of course, TSA solves the very maintenance problem you're concerned
> about: With a 'goto out' added, compilation on clang fails because it
> sees that there is a code path that doesn't unlock. So at least it makes
> the compromise not terrible.
> 
> For example, if I comment out the unlock in the error case in the first,
> this is what I get:
> 
> ../block/qcow2.c:3825:5: error: mutex 'graph_lock' is not held on every path through here [-Werror,-Wthread-safety-analysis]
> blk_co_unref(blk);
> ^
> ../block/qcow2.c:3735:5: note: mutex acquired here
> bdrv_graph_co_rdlock();
> ^
> 1 error generated.

I'm sold!  The very reason you can't use a cleanup scope-guard
(because clang can't see through the annotation) is also the reason
why clang is able to flag your error if you don't properly clean up by
hand.  So while it is more tedious to maintain, we've finally got
compiler assistance to something that used to be human-only, which is
a step forwards even if it is more to type in the short term.

Thanks for testing that.
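
For readers unfamiliar with the analysis discussed above, a minimal
sketch of the clang thread-safety pattern (hypothetical Lock type;
QEMU wraps these attributes in its own TSA_* macros). Compiled with
"clang -Wthread-safety -c demo.c", the early return below is flagged
because the lock is not released on that path:

    #include <pthread.h>

    struct __attribute__((capability("mutex"))) Lock {
        pthread_mutex_t m;
    };

    /* Annotated interface; the pthread wrapper definitions (marked
     * no_thread_safety_analysis) are omitted in this sketch. */
    void lock_acquire(struct Lock *l) __attribute__((acquire_capability(*l)));
    void lock_release(struct Lock *l) __attribute__((release_capability(*l)));

    int work(struct Lock *l, int fail)
    {
        lock_acquire(l);
        if (fail) {
            return -1;   /* flagged: mutex still held at end of function */
        }
        lock_release(l);
        return 0;
    }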

-- 
Eric Blake, Principal Software Engineer
Red Hat, Inc.   +1-919-301-3266
Virtualization:  qemu.org | libvirt.org




Re: [Libguestfs] [PATCH v3 00/14] qemu patches for 64-bit NBD extensions

2023-05-15 Thread Eric Blake


Adding qemu-block for the cover letter (not sure how I missed that the
first time).

On Mon, May 15, 2023 at 02:53:29PM -0500, Eric Blake wrote:
> 
> v2 was here:
> https://lists.gnu.org/archive/html/qemu-devel/2022-11/msg02340.html
> 
> Since then:
>  - upstream NBD has accepted the extension on a branch; once multiple
>implementations interoperate based on that spec, it will be promoted
>to mainline (my plan: qemu with this series, libnbd nearly ready to
>go, nbdkit a bit further out)
>  - rebase to block changes in meantime
>  - drop RFC patches for 64-bit NBD_CMD_READ (NBD spec did not take them)
>  - per upstream spec decision, extended headers now mandates use of
>NBD_REPLY_TYPE_BLOCK_STATUS_EXT rather than server choice based on
>reply size, which in turn required rearranging server patches a bit
>  - other changes that I noticed while testing with parallel changes
>being added to libnbd (link to those patches to follow in the next
>week or so)

If it helps review, I compared to my v2 posting as follows:

001/14:[0007] [FC] 'nbd/client: Use smarter assert'
002/14:[] [--] 'nbd/client: Add safety check on chunk payload length'
003/14:[] [-C] 'nbd/server: Prepare for alternate-size headers'
004/14:[0099] [FC] 'nbd: Prepare for 64-bit request effect lengths'
005/14:[0002] [FC] 'nbd: Add types for extended headers'
006/14:[0012] [FC] 'nbd/server: Refactor handling of request payload'
007/14:[0026] [FC] 'nbd/server: Refactor to pass full request around'
008/14:[0052] [FC] 'nbd/server: Support 64-bit block status'
009/14:[0032] [FC] 'nbd/server: Initial support for extended headers'
010/14:[0020] [FC] 'nbd/client: Initial support for extended headers'
011/14:[0015] [FC] 'nbd/client: Accept 64-bit block status chunks'
012/14:[0042] [FC] 'nbd/client: Request extended headers during negotiation'
013/14:[0005] [FC] 'nbd/server: Prepare for per-request filtering of BLOCK_STATUS'
014/14:[0004] [FC] 'nbd/server: Add FLAG_PAYLOAD support to CMD_BLOCK_STATUS'

-- 
Eric Blake, Principal Software Engineer
Red Hat, Inc.   +1-919-301-3266
Virtualization:  qemu.org | libvirt.org




Re: [PULL 00/11] Migration 20230515 patches

2023-05-15 Thread Richard Henderson

On 5/15/23 05:33, Juan Quintela wrote:

The following changes since commit 8844bb8d896595ee1d25d21c770e6e6f29803097:

   Merge tag 'or1k-pull-request-20230513' of https://github.com/stffrdhrn/qemu into staging (2023-05-13 11:23:14 +0100)

are available in the Git repository at:

   https://gitlab.com/juan.quintela/qemu.git tags/migration-20230515-pull-request

for you to fetch changes up to 6da835d42a2163b43578ae745bc613b06dd5d23c:

   qemu-file: Remove total from qemu_file_total_transferred_*() (2023-05-15 13:46:14 +0200)


Migration Pull request 20230515

Hi

On this PULL:
- use xxHash to calculate dirty_rate (andrei)
- Create qemu_target_pages_to_MiB() and use them (quintela)
- make dirtyrate target independent (quintela)
- Merge 5 patches from atomic counters series (quintela)

Please apply.


Applied, thanks.  Please update https://wiki.qemu.org/ChangeLog/8.1 as 
appropriate.


r~




[PATCH v2 08/16] migration: Use migration_transferred_bytes() to calculate rate_limit

2023-05-15 Thread Juan Quintela
Signed-off-by: Juan Quintela 
Reviewed-by: Cédric Le Goater 
---
 migration/migration-stats.h | 8 +++-
 migration/migration-stats.c | 7 +--
 migration/migration.c   | 2 +-
 3 files changed, 13 insertions(+), 4 deletions(-)

diff --git a/migration/migration-stats.h b/migration/migration-stats.h
index 91fda378d3..f1465c2ebe 100644
--- a/migration/migration-stats.h
+++ b/migration/migration-stats.h
@@ -81,6 +81,10 @@ typedef struct {
  * Number of bytes sent during precopy stage.
  */
 Stat64 precopy_bytes;
+/*
+ * Amount of transferred data at the start of current cycle.
+ */
+Stat64 rate_limit_start;
 /*
  * Maximum amount of data we can send in a cycle.
  */
@@ -136,8 +140,10 @@ uint64_t migration_rate_get(void);
  * migration_rate_reset: Reset the rate limit counter.
  *
  * This is called when we know we start a new transfer cycle.
+ *
+ * @f: QEMUFile used for main migration channel
  */
-void migration_rate_reset(void);
+void migration_rate_reset(QEMUFile *f);
 
 /**
  * migration_rate_set: Set the maximum amount that can be transferred.
diff --git a/migration/migration-stats.c b/migration/migration-stats.c
index 301392d208..da2bb69a15 100644
--- a/migration/migration-stats.c
+++ b/migration/migration-stats.c
@@ -31,7 +31,9 @@ bool migration_rate_exceeded(QEMUFile *f)
 return true;
 }
 
-uint64_t rate_limit_used = stat64_get(&mig_stats.rate_limit_used);
+uint64_t rate_limit_start = stat64_get(&mig_stats.rate_limit_start);
+uint64_t rate_limit_current = migration_transferred_bytes(f);
+uint64_t rate_limit_used = rate_limit_current - rate_limit_start;
 uint64_t rate_limit_max = stat64_get(&mig_stats.rate_limit_max);
 
 if (rate_limit_max == RATE_LIMIT_MAX) {
@@ -58,9 +60,10 @@ void migration_rate_set(uint64_t limit)
 stat64_set(&mig_stats.rate_limit_max, limit / XFER_LIMIT_RATIO);
 }
 
-void migration_rate_reset(void)
+void migration_rate_reset(QEMUFile *f)
 {
 stat64_set(&mig_stats.rate_limit_used, 0);
+stat64_set(&mig_stats.rate_limit_start, migration_transferred_bytes(f));
 }
 
 void migration_rate_account(uint64_t len)
diff --git a/migration/migration.c b/migration/migration.c
index 39ff538046..e48dd593ed 100644
--- a/migration/migration.c
+++ b/migration/migration.c
@@ -2691,7 +2691,7 @@ static void migration_update_counters(MigrationState *s,
 stat64_get(&mig_stats.dirty_bytes_last_sync) / bandwidth;
 }
 
-migration_rate_reset();
+migration_rate_reset(s->to_dst_file);
 
 update_iteration_initial_status(s);
 
-- 
2.40.1
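
A condensed sketch of the accounting scheme this patch introduces
(plain variables standing in for the Stat64 atomics; illustrative, not
the QEMU implementation): instead of incrementing a "used" counter on
every write, snapshot the transferred total at the start of each cycle
and subtract.

    #include <stdbool.h>
    #include <stdint.h>

    static uint64_t transferred_total;   /* monotonic, updated by the senders */
    static uint64_t rate_limit_start;    /* snapshot taken at cycle start */
    static uint64_t rate_limit_max;      /* budget per cycle, 0 = unlimited */

    static void rate_reset(void)
    {
        rate_limit_start = transferred_total;    /* new cycle starts here */
    }

    static bool rate_exceeded(void)
    {
        uint64_t used = transferred_total - rate_limit_start;
        return rate_limit_max != 0 && used > rate_limit_max;
    }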




[PATCH v2 10/16] migration: Don't abuse qemu_file transferred for RDMA

2023-05-15 Thread Juan Quintela
Just create a variable for it, the same way that multifd does.  This
way it is safe to use from other threads, etc.

Signed-off-by: Juan Quintela 
---
 migration/migration-stats.h |  4 
 migration/migration-stats.c |  5 +++--
 migration/rdma.c| 22 --
 migration/trace-events  |  2 +-
 4 files changed, 28 insertions(+), 5 deletions(-)

diff --git a/migration/migration-stats.h b/migration/migration-stats.h
index 9568b5b473..2e3e894307 100644
--- a/migration/migration-stats.h
+++ b/migration/migration-stats.h
@@ -89,6 +89,10 @@ typedef struct {
  * Maximum amount of data we can send in a cycle.
  */
 Stat64 rate_limit_max;
+/*
+ * Number of bytes sent through RDMA.
+ */
+Stat64 rdma_bytes;
 /*
  * How long the setup stage has taken.
  */
diff --git a/migration/migration-stats.c b/migration/migration-stats.c
index abf2d38b18..4d8e9f93b7 100644
--- a/migration/migration-stats.c
+++ b/migration/migration-stats.c
@@ -68,8 +68,9 @@ void migration_rate_reset(QEMUFile *f)
 uint64_t migration_transferred_bytes(QEMUFile *f)
 {
 uint64_t multifd = stat64_get(&mig_stats.multifd_bytes);
+uint64_t rdma = stat64_get(&mig_stats.rdma_bytes);
 uint64_t qemu_file = qemu_file_transferred(f);
 
-trace_migration_transferred_bytes(qemu_file, multifd);
-return qemu_file + multifd;
+trace_migration_transferred_bytes(qemu_file, multifd, rdma);
+return qemu_file + multifd + rdma;
 }
diff --git a/migration/rdma.c b/migration/rdma.c
index 2e4dcff1c9..074456f9df 100644
--- a/migration/rdma.c
+++ b/migration/rdma.c
@@ -2122,9 +2122,18 @@ retry:
 return -EIO;
 }
 
+/*
+ * TODO: Here we are sending something, but we are not
+ * accounting for anything transferred.  The following is wrong:
+ *
+ * stat64_add(&mig_stats.rdma_bytes, sge.length);
+ *
+ * because we are using some kind of compression.  I
+ * would think that head.len would be the more similar
+ * thing to a correct value.
+ */
 stat64_add(&mig_stats.zero_pages,
sge.length / qemu_target_page_size());
-
 return 1;
 }
 
@@ -2232,8 +2241,17 @@ retry:
 
 set_bit(chunk, block->transit_bitmap);
 stat64_add(&mig_stats.normal_pages, sge.length / qemu_target_page_size());
+/*
+ * We are adding to transferred the amount of data written, but no
+ * overhead at all.  I will assume that RDMA is magical and doesn't
+ * need to transfer (at least) the addresses where it wants to
+ * write the pages.  Here it looks like it should be something
+ * like:
+ * sizeof(send_wr) + sge.length
+ * but this being RDMA, who knows.
+ */
+stat64_add(&mig_stats.rdma_bytes, sge.length);
 ram_transferred_add(sge.length);
-qemu_file_credit_transfer(f, sge.length);
 rdma->total_writes++;
 
 return 0;
diff --git a/migration/trace-events b/migration/trace-events
index cdaef7a1ea..54ae5653fd 100644
--- a/migration/trace-events
+++ b/migration/trace-events
@@ -187,7 +187,7 @@ process_incoming_migration_co_postcopy_end_main(void) ""
 postcopy_preempt_enabled(bool value) "%d"
 
 # migration-stats
-migration_transferred_bytes(uint64_t qemu_file, uint64_t multifd) "qemu_file %" PRIu64 " multifd %" PRIu64
+migration_transferred_bytes(uint64_t qemu_file, uint64_t multifd, uint64_t rdma) "qemu_file %" PRIu64 " multifd %" PRIu64 " RDMA %" PRIu64
 
 # channel.c
 migration_set_incoming_channel(void *ioc, const char *ioctype) "ioc=%p 
ioctype=%s"
-- 
2.40.1




[PATCH v2 12/16] migration/rdma: Remove QEMUFile parameter when not used

2023-05-15 Thread Juan Quintela
Signed-off-by: Juan Quintela 
---
 migration/rdma.c | 23 +++
 1 file changed, 11 insertions(+), 12 deletions(-)

diff --git a/migration/rdma.c b/migration/rdma.c
index 074456f9df..416dec00a2 100644
--- a/migration/rdma.c
+++ b/migration/rdma.c
@@ -2027,7 +2027,7 @@ static int qemu_rdma_exchange_recv(RDMAContext *rdma, RDMAControlHeader *head,
  * If we're using dynamic registration on the dest-side, we have to
  * send a registration command first.
  */
-static int qemu_rdma_write_one(QEMUFile *f, RDMAContext *rdma,
+static int qemu_rdma_write_one(RDMAContext *rdma,
int current_index, uint64_t current_addr,
uint64_t length)
 {
@@ -2263,7 +2263,7 @@ retry:
  * We support sending out multiple chunks at the same time.
  * Not all of them need to get signaled in the completion queue.
  */
-static int qemu_rdma_write_flush(QEMUFile *f, RDMAContext *rdma)
+static int qemu_rdma_write_flush(RDMAContext *rdma)
 {
 int ret;
 
@@ -2271,7 +2271,7 @@ static int qemu_rdma_write_flush(QEMUFile *f, RDMAContext *rdma)
 return 0;
 }
 
-ret = qemu_rdma_write_one(f, rdma,
+ret = qemu_rdma_write_one(rdma,
 rdma->current_index, rdma->current_addr, rdma->current_length);
 
 if (ret < 0) {
@@ -2344,7 +2344,7 @@ static inline int qemu_rdma_buffer_mergable(RDMAContext *rdma,
  *and only require that a batch gets acknowledged in the completion
  *queue instead of each individual chunk.
  */
-static int qemu_rdma_write(QEMUFile *f, RDMAContext *rdma,
+static int qemu_rdma_write(RDMAContext *rdma,
uint64_t block_offset, uint64_t offset,
uint64_t len)
 {
@@ -2355,7 +2355,7 @@ static int qemu_rdma_write(QEMUFile *f, RDMAContext *rdma,
 
 /* If we cannot merge it, we flush the current buffer first. */
 if (!qemu_rdma_buffer_mergable(rdma, current_addr, len)) {
-ret = qemu_rdma_write_flush(f, rdma);
+ret = qemu_rdma_write_flush(rdma);
 if (ret) {
 return ret;
 }
@@ -2377,7 +2377,7 @@ static int qemu_rdma_write(QEMUFile *f, RDMAContext *rdma,
 
 /* flush it if buffer is too large */
 if (rdma->current_length >= RDMA_MERGE_MAX) {
-return qemu_rdma_write_flush(f, rdma);
+return qemu_rdma_write_flush(rdma);
 }
 
 return 0;
@@ -2798,7 +2798,6 @@ static ssize_t qio_channel_rdma_writev(QIOChannel *ioc,
Error **errp)
 {
 QIOChannelRDMA *rioc = QIO_CHANNEL_RDMA(ioc);
-QEMUFile *f = rioc->file;
 RDMAContext *rdma;
 int ret;
 ssize_t done = 0;
@@ -2819,7 +2818,7 @@ static ssize_t qio_channel_rdma_writev(QIOChannel *ioc,
  * Push out any writes that
  * we're queued up for VM's ram.
  */
-ret = qemu_rdma_write_flush(f, rdma);
+ret = qemu_rdma_write_flush(rdma);
 if (ret < 0) {
 rdma->error_state = ret;
 error_setg(errp, "qemu_rdma_write_flush returned %d", ret);
@@ -2958,11 +2957,11 @@ static ssize_t qio_channel_rdma_readv(QIOChannel *ioc,
 /*
  * Block until all the outstanding chunks have been delivered by the hardware.
  */
-static int qemu_rdma_drain_cq(QEMUFile *f, RDMAContext *rdma)
+static int qemu_rdma_drain_cq(RDMAContext *rdma)
 {
 int ret;
 
-if (qemu_rdma_write_flush(f, rdma) < 0) {
+if (qemu_rdma_write_flush(rdma) < 0) {
 return -EIO;
 }
 
@@ -3272,7 +3271,7 @@ static size_t qemu_rdma_save_page(QEMUFile *f,
  * is full, or the page doesn't belong to the current chunk,
  * an actual RDMA write will occur and a new chunk will be formed.
  */
-ret = qemu_rdma_write(f, rdma, block_offset, offset, size);
+ret = qemu_rdma_write(rdma, block_offset, offset, size);
 if (ret < 0) {
 error_report("rdma migration: write error! %d", ret);
 goto err;
@@ -3927,7 +3926,7 @@ static int qemu_rdma_registration_stop(QEMUFile *f,
 CHECK_ERROR_STATE();
 
 qemu_fflush(f);
-ret = qemu_rdma_drain_cq(f, rdma);
+ret = qemu_rdma_drain_cq(rdma);
 
 if (ret < 0) {
 goto err;
-- 
2.40.1




[PATCH v2 14/16] migration: Remove unused qemu_file_credit_transfer()

2023-05-15 Thread Juan Quintela
After this change, nothing abuses QEMUFile to account for data
transferred during migration.

Signed-off-by: Juan Quintela 
---
 migration/qemu-file.h | 8 
 migration/qemu-file.c | 5 -
 2 files changed, 13 deletions(-)

diff --git a/migration/qemu-file.h b/migration/qemu-file.h
index e649718492..37f42315c7 100644
--- a/migration/qemu-file.h
+++ b/migration/qemu-file.h
@@ -122,14 +122,6 @@ bool qemu_file_buffer_empty(QEMUFile *file);
  */
 int coroutine_mixed_fn qemu_peek_byte(QEMUFile *f, int offset);
 void qemu_file_skip(QEMUFile *f, int size);
-/*
- * qemu_file_credit_transfer:
- *
- * Report on a number of bytes that have been transferred
- * out of band from the main file object I/O methods. This
- * accounting information tracks the total migration traffic.
- */
-void qemu_file_credit_transfer(QEMUFile *f, size_t size);
 int qemu_file_get_error_obj(QEMUFile *f, Error **errp);
 int qemu_file_get_error_obj_any(QEMUFile *f1, QEMUFile *f2, Error **errp);
 void qemu_file_set_error_obj(QEMUFile *f, int ret, Error *err);
diff --git a/migration/qemu-file.c b/migration/qemu-file.c
index 23a21e2331..72e130631d 100644
--- a/migration/qemu-file.c
+++ b/migration/qemu-file.c
@@ -411,11 +411,6 @@ static ssize_t coroutine_mixed_fn qemu_fill_buffer(QEMUFile *f)
 return len;
 }
 
-void qemu_file_credit_transfer(QEMUFile *f, size_t size)
-{
-f->total_transferred += size;
-}
-
 /** Closes the file
  *
  * Returns negative error value if any error happened on previous operations or
-- 
2.40.1




[PATCH v2 04/16] qemu-file: Account for rate_limit usage on qemu_fflush()

2023-05-15 Thread Juan Quintela
That is the moment we know we have transferred something.

Signed-off-by: Juan Quintela 
Reviewed-by: Cédric Le Goater 
---
 migration/qemu-file.c | 7 +++
 1 file changed, 3 insertions(+), 4 deletions(-)

diff --git a/migration/qemu-file.c b/migration/qemu-file.c
index 4bc875b452..956bd2a580 100644
--- a/migration/qemu-file.c
+++ b/migration/qemu-file.c
@@ -302,7 +302,9 @@ void qemu_fflush(QEMUFile *f)
 &local_error) < 0) {
 qemu_file_set_error_obj(f, -EIO, local_error);
 } else {
-f->total_transferred += iov_size(f->iov, f->iovcnt);
+uint64_t size = iov_size(f->iov, f->iovcnt);
+qemu_file_acct_rate_limit(f, size);
+f->total_transferred += size;
 }
 
 qemu_iovec_release_ram(f);
@@ -519,7 +521,6 @@ void qemu_put_buffer_async(QEMUFile *f, const uint8_t *buf, size_t size,
 return;
 }
 
-f->rate_limit_used += size;
 add_to_iovec(f, buf, size, may_free);
 }
 
@@ -537,7 +538,6 @@ void qemu_put_buffer(QEMUFile *f, const uint8_t *buf, size_t size)
 l = size;
 }
 memcpy(f->buf + f->buf_index, buf, l);
-f->rate_limit_used += l;
 add_buf_to_iovec(f, l);
 if (qemu_file_get_error(f)) {
 break;
@@ -554,7 +554,6 @@ void qemu_put_byte(QEMUFile *f, int v)
 }
 
 f->buf[f->buf_index] = v;
-f->rate_limit_used++;
 add_buf_to_iovec(f, 1);
 }
 
-- 
2.40.1




[PATCH v2 15/16] migration/rdma: Simplify the function that saves a page

2023-05-15 Thread Juan Quintela
When we send a page through QEMUFile hooks (RDMA) there are three
possibilities:
- We are not using RDMA.  It returns RAM_SAVE_CONTROL_NOT_SUPP and
  control_save_page() returns false to let anything else proceed.
- There is an error while we are using RDMA.  Then we return a negative
  value, and control_save_page() needs to return true.
- Everything goes well and RDMA starts sending the page
  asynchronously.  It returns RAM_SAVE_CONTROL_DELAYED and we need to
  return 1 for ram_save_page_legacy.

Clear?

I know, I know, the interface is as bad as it gets.  I think that now
it is a bit clearer, but this needs to be done some other way.
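
A sketch of the resulting dispatch (the RAM_SAVE_CONTROL_* constants
come from the series; the helper's shape is illustrative, not the
exact QEMU code):

    #include <stdbool.h>

    #define RAM_SAVE_CONTROL_NOT_SUPP -1000
    #define RAM_SAVE_CONTROL_DELAYED  -2000

    /* Returns false when the caller should fall through to the normal
     * page-saving path; true when RDMA handled (or failed) the page. */
    static bool dispatch_save_result(int ret, int *pages)
    {
        if (ret == RAM_SAVE_CONTROL_NOT_SUPP) {
            return false;         /* not using RDMA: let others proceed */
        }
        if (ret == RAM_SAVE_CONTROL_DELAYED) {
            *pages = 1;           /* page queued asynchronously */
            return true;
        }
        *pages = ret;             /* ret < 0: RDMA error, caller sees it */
        return true;
    }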

Signed-off-by: Juan Quintela 
---
 migration/qemu-file.h | 14 ++
 migration/qemu-file.c | 12 ++--
 migration/ram.c   | 10 +++---
 migration/rdma.c  | 19 +++
 4 files changed, 18 insertions(+), 37 deletions(-)

diff --git a/migration/qemu-file.h b/migration/qemu-file.h
index 37f42315c7..ed77996201 100644
--- a/migration/qemu-file.h
+++ b/migration/qemu-file.h
@@ -49,11 +49,10 @@ typedef int (QEMURamHookFunc)(QEMUFile *f, uint64_t flags, 
void *data);
  * This function allows override of where the RAM page
  * is saved (such as RDMA, for example.)
  */
-typedef size_t (QEMURamSaveFunc)(QEMUFile *f,
- ram_addr_t block_offset,
- ram_addr_t offset,
- size_t size,
- uint64_t *bytes_sent);
+typedef int (QEMURamSaveFunc)(QEMUFile *f,
+  ram_addr_t block_offset,
+  ram_addr_t offset,
+  size_t size);
 
 typedef struct QEMUFileHooks {
 QEMURamHookFunc *before_ram_iterate;
@@ -146,9 +145,8 @@ void ram_control_load_hook(QEMUFile *f, uint64_t flags, void *data);
 #define RAM_SAVE_CONTROL_NOT_SUPP -1000
 #define RAM_SAVE_CONTROL_DELAYED  -2000
 
-size_t ram_control_save_page(QEMUFile *f, ram_addr_t block_offset,
- ram_addr_t offset, size_t size,
- uint64_t *bytes_sent);
+int ram_control_save_page(QEMUFile *f, ram_addr_t block_offset,
+  ram_addr_t offset, size_t size);
 QIOChannel *qemu_file_get_ioc(QEMUFile *file);
 
 #endif
diff --git a/migration/qemu-file.c b/migration/qemu-file.c
index 72e130631d..32ef5e9651 100644
--- a/migration/qemu-file.c
+++ b/migration/qemu-file.c
@@ -336,14 +336,14 @@ void ram_control_load_hook(QEMUFile *f, uint64_t flags, void *data)
 }
 }
 
-size_t ram_control_save_page(QEMUFile *f, ram_addr_t block_offset,
- ram_addr_t offset, size_t size,
- uint64_t *bytes_sent)
+int ram_control_save_page(QEMUFile *f, ram_addr_t block_offset,
+  ram_addr_t offset, size_t size)
 {
 if (f->hooks && f->hooks->save_page) {
-int ret = f->hooks->save_page(f, block_offset,
-  offset, size, bytes_sent);
-
+int ret = f->hooks->save_page(f, block_offset, offset, size);
+/*
+ * RAM_SAVE_CONTROL_* are negative values
+ */
 if (ret != RAM_SAVE_CONTROL_DELAYED &&
 ret != RAM_SAVE_CONTROL_NOT_SUPP) {
 if (ret < 0) {
diff --git a/migration/ram.c b/migration/ram.c
index 2d3927a15f..f9fcbb3bb8 100644
--- a/migration/ram.c
+++ b/migration/ram.c
@@ -1173,23 +1173,19 @@ static int save_zero_page(PageSearchStatus *pss, QEMUFile *f, RAMBlock *block,
 static bool control_save_page(PageSearchStatus *pss, RAMBlock *block,
   ram_addr_t offset, int *pages)
 {
-uint64_t bytes_xmit = 0;
 int ret;
 
-*pages = -1;
 ret = ram_control_save_page(pss->pss_channel, block->offset, offset,
-TARGET_PAGE_SIZE, &bytes_xmit);
+TARGET_PAGE_SIZE);
 if (ret == RAM_SAVE_CONTROL_NOT_SUPP) {
 return false;
 }
 
-if (bytes_xmit) {
-*pages = 1;
-}
-
 if (ret == RAM_SAVE_CONTROL_DELAYED) {
+*pages = 1;
 return true;
 }
+*pages = ret;
 return true;
 }
 
diff --git a/migration/rdma.c b/migration/rdma.c
index 416dec00a2..12d3c23fdc 100644
--- a/migration/rdma.c
+++ b/migration/rdma.c
@@ -3239,13 +3239,12 @@ qio_channel_rdma_shutdown(QIOChannel *ioc,
  *
  *@size : Number of bytes to transfer
  *
- *@bytes_sent : User-specified pointer to indicate how many bytes were
+ *@pages_sent : User-specified pointer to indicate how many pages were
 *  sent. Usually, this will not be more than a few bytes of
 *  the protocol because most transfers are sent asynchronously.
  */
-static size_t qemu_rdma_save_page(QEMUFile *f,
-  ram_addr_t block_offset, ram_addr_t offset,
-  size_t size, uint64_t *bytes_sent)
+static int 

[PATCH v2 16/16] migration/multifd: Compute transferred bytes correctly

2023-05-15 Thread Juan Quintela
In the past, we had to put all the operations related with sizes in
the main thread because qemu_file was not thread safe.  As now all
counters are atomic, we can update the counters just after we do the
write.  As an additional bonus, we are able to use the right value
for the compression methods.  Until now we were assuming that there
was no compression at all.
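
The pattern, sketched with C11 atomics standing in for QEMU's Stat64
(illustrative, not the QEMU code): each sender thread accounts exactly
the bytes its own write produced, immediately after the write, so
compressed sizes are counted rather than assumed.

    #include <stdatomic.h>
    #include <stddef.h>
    #include <stdint.h>

    static _Atomic uint64_t multifd_bytes;   /* stand-in for a mig_stats counter */

    /* Called by a sender thread right after a successful write of len
     * bytes; no main-thread bookkeeping needed because the add is atomic. */
    static void account_sent(size_t len)
    {
        atomic_fetch_add(&multifd_bytes, (uint64_t)len);
    }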

Signed-off-by: Juan Quintela 
---
 migration/multifd.c | 13 -
 1 file changed, 8 insertions(+), 5 deletions(-)

diff --git a/migration/multifd.c b/migration/multifd.c
index aabf9b6d98..0bf5958a9c 100644
--- a/migration/multifd.c
+++ b/migration/multifd.c
@@ -175,6 +175,7 @@ void multifd_register_ops(int method, MultiFDMethods *ops)
 static int multifd_send_initial_packet(MultiFDSendParams *p, Error **errp)
 {
 MultiFDInit_t msg = {};
+size_t size = sizeof(msg);
 int ret;
 
 msg.magic = cpu_to_be32(MULTIFD_MAGIC);
@@ -182,10 +183,12 @@ static int multifd_send_initial_packet(MultiFDSendParams *p, Error **errp)
 msg.id = p->id;
 memcpy(msg.uuid, &qemu_uuid.data, sizeof(msg.uuid));
 
-ret = qio_channel_write_all(p->c, (char *)&msg, sizeof(msg), errp);
+ret = qio_channel_write_all(p->c, (char *)&msg, size, errp);
 if (ret != 0) {
 return -1;
 }
+stat64_add(&mig_stats.multifd_bytes, size);
+stat64_add(&mig_stats.transferred, size);
 return 0;
 }
 
@@ -395,7 +398,6 @@ static int multifd_send_pages(QEMUFile *f)
 static int next_channel;
 MultiFDSendParams *p = NULL; /* make happy gcc */
 MultiFDPages_t *pages = multifd_send_state->pages;
-uint64_t transferred;
 
 if (qatomic_read(&multifd_send_state->exiting)) {
 return -1;
@@ -430,10 +432,7 @@ static int multifd_send_pages(QEMUFile *f)
 p->packet_num = multifd_send_state->packet_num++;
 multifd_send_state->pages = p->pages;
 p->pages = pages;
-transferred = ((uint64_t) pages->num) * p->page_size + p->packet_len;
 qemu_mutex_unlock(&p->mutex);
-stat64_add(&mig_stats.transferred, transferred);
-stat64_add(&mig_stats.multifd_bytes, transferred);
 qemu_sem_post(&p->sem);
 
 return 1;
@@ -715,6 +714,8 @@ static void *multifd_send_thread(void *opaque)
 if (ret != 0) {
 break;
 }
+stat64_add(&mig_stats.multifd_bytes, p->packet_len);
+stat64_add(&mig_stats.transferred, p->packet_len);
 } else {
 /* Send header using the same writev call */
 p->iov[0].iov_len = p->packet_len;
@@ -727,6 +728,8 @@ static void *multifd_send_thread(void *opaque)
 break;
 }
 
+stat64_add(&mig_stats.multifd_bytes, p->next_packet_size);
+stat64_add(&mig_stats.transferred, p->next_packet_size);
 qemu_mutex_lock(&p->mutex);
 p->pending_job--;
 qemu_mutex_unlock(&p->mutex);
-- 
2.40.1




[PATCH v2 09/16] migration: We don't need the field rate_limit_used anymore

2023-05-15 Thread Juan Quintela
Since the previous commit, we calculate how much data we have sent
with migration_transferred_bytes(), so there is no need to maintain
this counter and remember to always update it.

Signed-off-by: Juan Quintela 
Reviewed-by: Cédric Le Goater 
---
 migration/migration-stats.h | 14 --
 migration/migration-stats.c |  6 --
 migration/multifd.c |  1 -
 migration/qemu-file.c   |  4 
 4 files changed, 25 deletions(-)

diff --git a/migration/migration-stats.h b/migration/migration-stats.h
index f1465c2ebe..9568b5b473 100644
--- a/migration/migration-stats.h
+++ b/migration/migration-stats.h
@@ -89,10 +89,6 @@ typedef struct {
  * Maximum amount of data we can send in a cycle.
  */
 Stat64 rate_limit_max;
-/*
- * Amount of data we have sent in the current cycle.
- */
-Stat64 rate_limit_used;
 /*
  * How long the setup stage has taken.
  */
@@ -119,16 +115,6 @@ extern MigrationAtomicStats mig_stats;
  */
 void migration_time_since(MigrationAtomicStats *stats, int64_t since);
 
-/**
- * migration_rate_account: Increase the number of bytes transferred.
- *
- * Report on a number of bytes that have been transferred and need to
- * be applied to the rate limiting calculations.
- *
- * @len: amount of bytes transferred
- */
-void migration_rate_account(uint64_t len);
-
 /**
  * migration_rate_get: Get the maximum amount that can be transferred.
  *
diff --git a/migration/migration-stats.c b/migration/migration-stats.c
index da2bb69a15..abf2d38b18 100644
--- a/migration/migration-stats.c
+++ b/migration/migration-stats.c
@@ -62,15 +62,9 @@ void migration_rate_set(uint64_t limit)
 
 void migration_rate_reset(QEMUFile *f)
 {
-stat64_set(&mig_stats.rate_limit_used, 0);
 stat64_set(&mig_stats.rate_limit_start, migration_transferred_bytes(f));
 }
 
-void migration_rate_account(uint64_t len)
-{
-stat64_add(&mig_stats.rate_limit_used, len);
-}
-
 uint64_t migration_transferred_bytes(QEMUFile *f)
 {
 uint64_t multifd = stat64_get(&mig_stats.multifd_bytes);
diff --git a/migration/multifd.c b/migration/multifd.c
index 5052091ce2..aabf9b6d98 100644
--- a/migration/multifd.c
+++ b/migration/multifd.c
@@ -431,7 +431,6 @@ static int multifd_send_pages(QEMUFile *f)
 multifd_send_state->pages = p->pages;
 p->pages = pages;
 transferred = ((uint64_t) pages->num) * p->page_size + p->packet_len;
-migration_rate_account(transferred);
 qemu_mutex_unlock(&p->mutex);
 stat64_add(&mig_stats.transferred, transferred);
 stat64_add(&mig_stats.multifd_bytes, transferred);
diff --git a/migration/qemu-file.c b/migration/qemu-file.c
index 9c67b52fe0..acc282654a 100644
--- a/migration/qemu-file.c
+++ b/migration/qemu-file.c
@@ -292,7 +292,6 @@ void qemu_fflush(QEMUFile *f)
 qemu_file_set_error_obj(f, -EIO, local_error);
 } else {
 uint64_t size = iov_size(f->iov, f->iovcnt);
-migration_rate_account(size);
 f->total_transferred += size;
 }
 
@@ -344,9 +343,6 @@ size_t ram_control_save_page(QEMUFile *f, ram_addr_t block_offset,
 if (f->hooks && f->hooks->save_page) {
 int ret = f->hooks->save_page(f, block_offset,
   offset, size, bytes_sent);
-if (ret != RAM_SAVE_CONTROL_NOT_SUPP) {
-migration_rate_account(size);
-}
 
 if (ret != RAM_SAVE_CONTROL_DELAYED &&
 ret != RAM_SAVE_CONTROL_NOT_SUPP) {
-- 
2.40.1




[PATCH v2 13/16] migration/rdma: Don't use imaginary transfers

2023-05-15 Thread Juan Quintela
The RDMA protocol is completely asynchronous, so in
qemu_rdma_save_page() they "invent" that a byte has been transferred.
And then they call qemu_file_credit_transfer() and
ram_transferred_add() with that byte.
Just remove those calls, as nothing has been sent.

Signed-off-by: Juan Quintela 
---
 migration/qemu-file.c | 5 +
 migration/ram.c   | 1 -
 2 files changed, 1 insertion(+), 5 deletions(-)

diff --git a/migration/qemu-file.c b/migration/qemu-file.c
index acc282654a..23a21e2331 100644
--- a/migration/qemu-file.c
+++ b/migration/qemu-file.c
@@ -346,13 +346,10 @@ size_t ram_control_save_page(QEMUFile *f, ram_addr_t block_offset,
 
 if (ret != RAM_SAVE_CONTROL_DELAYED &&
 ret != RAM_SAVE_CONTROL_NOT_SUPP) {
-if (bytes_sent && *bytes_sent > 0) {
-qemu_file_credit_transfer(f, *bytes_sent);
-} else if (ret < 0) {
+if (ret < 0) {
 qemu_file_set_error(f, ret);
 }
 }
-
 return ret;
 }
 
diff --git a/migration/ram.c b/migration/ram.c
index 67ed49b387..2d3927a15f 100644
--- a/migration/ram.c
+++ b/migration/ram.c
@@ -1184,7 +1184,6 @@ static bool control_save_page(PageSearchStatus *pss, RAMBlock *block,
 }
 
 if (bytes_xmit) {
-ram_transferred_add(bytes_xmit);
 *pages = 1;
 }
 
-- 
2.40.1




[PATCH v2 11/16] migration/RDMA: It is accounting for zero/normal pages in two places

2023-05-15 Thread Juan Quintela
Remove the one in control_save_page().

Signed-off-by: Juan Quintela 
---
 migration/ram.c | 7 ---
 1 file changed, 7 deletions(-)

diff --git a/migration/ram.c b/migration/ram.c
index a706edecc0..67ed49b387 100644
--- a/migration/ram.c
+++ b/migration/ram.c
@@ -1191,13 +1191,6 @@ static bool control_save_page(PageSearchStatus *pss, RAMBlock *block,
 if (ret == RAM_SAVE_CONTROL_DELAYED) {
 return true;
 }
-
-if (bytes_xmit > 0) {
-stat64_add(&mig_stats.normal_pages, 1);
-} else if (bytes_xmit == 0) {
-stat64_add(&mig_stats.zero_pages, 1);
-}
-
 return true;
 }
 
-- 
2.40.1




[PATCH v2 00/16] Migration: More migration atomic counters

2023-05-15 Thread Juan Quintela
Hi

In this v2 series:
- More documentation here and there.
- Fix migration_rate_set() to really divide by XFER_LIMIT_RATIO
- All reviewed patches are in Migration PULL request 20230515
- There are later reviewed patches, but that depend on the first ones
  that are still not reviewed.

Please review.

Thanks, Juan.

Subject: [PULL 00/11] Migration 20230515 patches
Based-on: Message-Id: <20230515123334.58995-1-quint...@redhat.com>

[v1]
In this series:
- play with rate limit
  * document that a value of 0 means no rate-limit
  * change all users of INT64_MAX to use 0
  * Make sure that transferred value is right
This gets transferred == multifd_bytes + qemu_file_transferred()
until the completion stage.  Changing all devices is overkill and not
useful.
  * Move all rate_limit calculations to use atomics instead of 
qemu_file_transferred().
Use atomics for rate_limit.
  * RDMA
Adjust counters here and there
Change the "imaginary" 1 byte transfer to say if it has sent a page or not.
More cleanups due to this changes
  * multifd: Adjust the number of transferred bytes in the right place
and the right amount.
Right place: just after the write, which the atomic counters now allow.
Right amount: now that we are in the right place, we can also do it
right for compression.

Please review.

ToDo: Best described as ToSend:
- qemu_file_transferred() is based on atomics on my branch
- transferred atomic is not needed anymore

ToDo before my next send:

- downtime_bytes, precopy_bytes and postcopy_bytes should be based on
  migration_transferred_bytes and not need a counter of their own.

With that my cleanup would be finished, moving from:
- total_transferred in QEMUFile (not atomic)
- rate_limit_used in QEMUFile (not atomic)
- multifd_bytes in mig_stats
- transferred in mig_stats (not updated everywhere needed, the
  following ones are based on this one)
- downtime_bytes in mig_stats
- precopy_bytes in mig_stats
- postcopy_bytes in mig_stats

To just:
- qemu_file_transferred in mig_stats
- multifd_bytes in mig_stats
- rdma_bytes in mig_stats

And for each transfer, we only update one of the three, everything
else is derived from these three values.
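
Sketched in C, the end state described above (names from the series;
the point is the derivation, not the exact code):

    #include <stdint.h>

    /* Only three counters are ever written, one per transport: */
    static uint64_t qemu_file_bytes;   /* main channel */
    static uint64_t multifd_bytes;     /* multifd channels */
    static uint64_t rdma_bytes;        /* RDMA */

    /* Everything else is derived from them: */
    static uint64_t migration_transferred_bytes_sketch(void)
    {
        return qemu_file_bytes + multifd_bytes + rdma_bytes;
    }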

Later, Juan.

Juan Quintela (16):
  migration: Don't use INT64_MAX for unlimited rate
  migration: Correct transferred bytes value
  migration: Move setup_time to mig_stats
  qemu-file: Account for rate_limit usage on qemu_fflush()
  migration: Move rate_limit_max and rate_limit_used to migration_stats
  migration: Move migration_total_bytes() to migration-stats.c
  migration: Add a trace for migration_transferred_bytes
  migration: Use migration_transferred_bytes() to calculate rate_limit
  migration: We don't need the field rate_limit_used anymore
  migration: Don't abuse qemu_file transferred for RDMA
  migration/RDMA: It is accounting for zero/normal pages in two places
  migration/rdma: Remove QEMUFile parameter when not used
  migration/rdma: Don't use imaginary transfers
  migration: Remove unused qemu_file_credit_transfer()
  migration/rdma: Simplify the function that saves a page
  migration/multifd: Compute transferred bytes correctly

 include/migration/qemu-file-types.h | 12 -
 migration/migration-stats.h | 73 +++
 migration/migration.h   |  1 -
 migration/options.h |  7 ---
 migration/qemu-file.h   | 33 +++--
 hw/ppc/spapr.c  |  4 +-
 hw/s390x/s390-stattrib.c|  2 +-
 migration/block-dirty-bitmap.c  |  2 +-
 migration/block.c   |  5 +-
 migration/migration-stats.c | 59 ++
 migration/migration.c   | 36 ++
 migration/multifd.c | 14 +++---
 migration/options.c |  7 ++-
 migration/qemu-file.c   | 77 -
 migration/ram.c | 34 +++--
 migration/rdma.c| 64 +---
 migration/savevm.c  | 21 ++--
 migration/vmstate.c |  3 ++
 migration/meson.build   |  2 +-
 migration/trace-events  |  3 ++
 20 files changed, 268 insertions(+), 191 deletions(-)

-- 
2.40.1




[PATCH v2 03/16] migration: Move setup_time to mig_stats

2023-05-15 Thread Juan Quintela
It is a value that needs to be reset each time migration is cancelled.
While there, create migration_time_since() to calculate how much time
has passed since a given time in the past.

Signed-off-by: Juan Quintela 

---

Rename to migration_time_since (cédric)
---
 migration/migration-stats.h | 13 +
 migration/migration.h   |  1 -
 migration/migration-stats.c |  7 +++
 migration/migration.c   |  9 -
 4 files changed, 24 insertions(+), 6 deletions(-)

diff --git a/migration/migration-stats.h b/migration/migration-stats.h
index e782f1b0df..21402af9e4 100644
--- a/migration/migration-stats.h
+++ b/migration/migration-stats.h
@@ -75,6 +75,10 @@ typedef struct {
  * Number of bytes sent during precopy stage.
  */
 Stat64 precopy_bytes;
+/*
+ * How long the setup stage has taken.
+ */
+Stat64 setup_time;
 /*
  * Total number of bytes transferred.
  */
@@ -87,4 +91,13 @@ typedef struct {
 
 extern MigrationAtomicStats mig_stats;
 
+/**
+ * migration_time_since: Calculate how much time has passed
+ *
+ * @stats: migration stats
+ * @since: reference time to calculate from
+ *
+ * Returns: Nothing.  The elapsed time is stored in @stats->setup_time.
+ */
+void migration_time_since(MigrationAtomicStats *stats, int64_t since);
 #endif
diff --git a/migration/migration.h b/migration/migration.h
index 48a46123a0..27aa3b1035 100644
--- a/migration/migration.h
+++ b/migration/migration.h
@@ -316,7 +316,6 @@ struct MigrationState {
 int64_t downtime;
 int64_t expected_downtime;
 bool capabilities[MIGRATION_CAPABILITY__MAX];
-int64_t setup_time;
 /*
  * Whether guest was running when we enter the completion stage.
  * If migration is interrupted by any reason, we need to continue
diff --git a/migration/migration-stats.c b/migration/migration-stats.c
index 2f2cea965c..3431453c90 100644
--- a/migration/migration-stats.c
+++ b/migration/migration-stats.c
@@ -12,6 +12,13 @@
 
 #include "qemu/osdep.h"
 #include "qemu/stats64.h"
+#include "qemu/timer.h"
 #include "migration-stats.h"
 
 MigrationAtomicStats mig_stats;
+
+void migration_time_since(MigrationAtomicStats *stats, int64_t since)
+{
+int64_t now = qemu_clock_get_ms(QEMU_CLOCK_HOST);
+stat64_set(&stats->setup_time, now - since);
+}
diff --git a/migration/migration.c b/migration/migration.c
index c41c7491bb..e9466273bb 100644
--- a/migration/migration.c
+++ b/migration/migration.c
@@ -887,7 +887,7 @@ static void populate_time_info(MigrationInfo *info, MigrationState *s)
 {
 info->has_status = true;
 info->has_setup_time = true;
-info->setup_time = s->setup_time;
+info->setup_time = stat64_get(&mig_stats.setup_time);
 
 if (s->state == MIGRATION_STATUS_COMPLETED) {
 info->has_total_time = true;
@@ -1390,7 +1390,6 @@ void migrate_init(MigrationState *s)
 s->pages_per_second = 0.0;
 s->downtime = 0;
 s->expected_downtime = 0;
-s->setup_time = 0;
 s->start_postcopy = false;
 s->postcopy_after_devices = false;
 s->migration_thread_running = false;
@@ -2647,7 +2646,7 @@ static void migration_calculate_complete(MigrationState *s)
 s->downtime = end_time - s->downtime_start;
 }
 
-transfer_time = s->total_time - s->setup_time;
+transfer_time = s->total_time - stat64_get(&mig_stats.setup_time);
 if (transfer_time) {
 s->mbps = ((double) bytes * 8.0) / transfer_time / 1000;
 }
@@ -2969,7 +2968,7 @@ static void *migration_thread(void *opaque)
 qemu_savevm_wait_unplug(s, MIGRATION_STATUS_SETUP,
MIGRATION_STATUS_ACTIVE);
 
-s->setup_time = qemu_clock_get_ms(QEMU_CLOCK_HOST) - setup_start;
+migration_time_since(&mig_stats, setup_start);
 
 trace_migration_thread_setup_complete();
 
@@ -3081,7 +3080,7 @@ static void *bg_migration_thread(void *opaque)
 qemu_savevm_wait_unplug(s, MIGRATION_STATUS_SETUP,
MIGRATION_STATUS_ACTIVE);
 
-s->setup_time = qemu_clock_get_ms(QEMU_CLOCK_HOST) - setup_start;
+migration_time_since(&mig_stats, setup_start);
 
 trace_migration_thread_setup_complete();
 s->downtime_start = qemu_clock_get_ms(QEMU_CLOCK_REALTIME);
-- 
2.40.1




[PATCH v2 07/16] migration: Add a trace for migration_transferred_bytes

2023-05-15 Thread Juan Quintela
Signed-off-by: Juan Quintela 
Reviewed-by: Cédric Le Goater 
---
 migration/migration-stats.c | 8 ++--
 migration/trace-events  | 3 +++
 2 files changed, 9 insertions(+), 2 deletions(-)

diff --git a/migration/migration-stats.c b/migration/migration-stats.c
index 9bd97caa23..301392d208 100644
--- a/migration/migration-stats.c
+++ b/migration/migration-stats.c
@@ -14,6 +14,7 @@
 #include "qemu/stats64.h"
 #include "qemu/timer.h"
 #include "qemu-file.h"
+#include "trace.h"
 #include "migration-stats.h"
 
 MigrationAtomicStats mig_stats;
@@ -69,6 +70,9 @@ void migration_rate_account(uint64_t len)
 
 uint64_t migration_transferred_bytes(QEMUFile *f)
 {
-return qemu_file_transferred(f) + stat64_get(&mig_stats.multifd_bytes);
-}
+uint64_t multifd = stat64_get(&mig_stats.multifd_bytes);
+uint64_t qemu_file = qemu_file_transferred(f);
 
+trace_migration_transferred_bytes(qemu_file, multifd);
+return qemu_file + multifd;
+}
diff --git a/migration/trace-events b/migration/trace-events
index f39818c329..cdaef7a1ea 100644
--- a/migration/trace-events
+++ b/migration/trace-events
@@ -186,6 +186,9 @@ process_incoming_migration_co_end(int ret, int ps) "ret=%d postcopy-state=%d"
 process_incoming_migration_co_postcopy_end_main(void) ""
 postcopy_preempt_enabled(bool value) "%d"
 
+# migration-stats
+migration_transferred_bytes(uint64_t qemu_file, uint64_t multifd) "qemu_file %" PRIu64 " multifd %" PRIu64
+
 # channel.c
 migration_set_incoming_channel(void *ioc, const char *ioctype) "ioc=%p ioctype=%s"
 migration_set_outgoing_channel(void *ioc, const char *ioctype, const char *hostname, void *err)  "ioc=%p ioctype=%s hostname=%s err=%p"
-- 
2.40.1




[PATCH v2 02/16] migration: Correct transferred bytes value

2023-05-15 Thread Juan Quintela
We forgot in several places to add to the transferred amount of data.
With these fixes I get:

   qemu_file_transferred() + multifd_bytes == transferred

The only place where this is not true is during device sending.  But
going all through the full tree searching for devices that use
QEMUFile directly is a bit too much.

Multifd, precopy and xbzrle work as expected.  Postcopy still misses 35
bytes, but searching for them is getting complicated, so I stop here.
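
The byte counts in the hunks below just mirror the wire format:
qemu_put_byte() is 1 byte, qemu_put_be32() is 4, qemu_put_be64() is 8,
and qemu_put_buffer() is its length.  A standalone sketch of the same
arithmetic (hypothetical helper mirroring one block entry, not a quote
from the patch):

    #include <stdint.h>
    #include <string.h>

    static uint64_t wire_size_of_block_entry(const char *idstr,
                                             int with_page_size)
    {
        uint64_t size = 0;
        size += 1;                  /* qemu_put_byte(strlen(idstr)) */
        size += strlen(idstr);      /* qemu_put_buffer(idstr) */
        size += 8;                  /* qemu_put_be64(used_length) */
        if (with_page_size) {
            size += 8;              /* qemu_put_be64(page_size) */
        }
        return size;                /* what gets added to mig_stats.transferred */
    }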

Signed-off-by: Juan Quintela 
---
 migration/ram.c   | 14 ++
 migration/savevm.c| 19 +--
 migration/vmstate.c   |  3 +++
 migration/meson.build |  2 +-
 4 files changed, 35 insertions(+), 3 deletions(-)

diff --git a/migration/ram.c b/migration/ram.c
index f69d8d42b0..fd5a8db0f8 100644
--- a/migration/ram.c
+++ b/migration/ram.c
@@ -337,6 +337,7 @@ int64_t ramblock_recv_bitmap_send(QEMUFile *file,
 
 g_free(le_bitmap);
 
+stat64_add(&mig_stats.transferred, 8 + size + 8);
 if (qemu_file_get_error(file)) {
 return qemu_file_get_error(file);
 }
@@ -1392,6 +1393,7 @@ static int find_dirty_block(RAMState *rs, PageSearchStatus *pss)
 return ret;
 }
 qemu_put_be64(f, RAM_SAVE_FLAG_MULTIFD_FLUSH);
+stat64_add(&mig_stats.transferred, 8);
 qemu_fflush(f);
 }
 /*
@@ -3020,6 +3022,7 @@ static int ram_save_setup(QEMUFile *f, void *opaque)
 RAMState **rsp = opaque;
 RAMBlock *block;
 int ret;
+size_t size = 0;
 
 if (compress_threads_save_setup()) {
 return -1;
@@ -3038,16 +3041,20 @@ static int ram_save_setup(QEMUFile *f, void *opaque)
 qemu_put_be64(f, ram_bytes_total_with_ignored()
  | RAM_SAVE_FLAG_MEM_SIZE);
 
+size += 8;
 RAMBLOCK_FOREACH_MIGRATABLE(block) {
 qemu_put_byte(f, strlen(block->idstr));
 qemu_put_buffer(f, (uint8_t *)block->idstr, strlen(block->idstr));
 qemu_put_be64(f, block->used_length);
+size += 1 + strlen(block->idstr) + 8;
 if (migrate_postcopy_ram() && block->page_size !=
   qemu_host_page_size) {
 qemu_put_be64(f, block->page_size);
+size += 8;
 }
 if (migrate_ignore_shared()) {
 qemu_put_be64(f, block->mr->addr);
+size += 8;
 }
 }
 }
@@ -3064,11 +3071,14 @@ static int ram_save_setup(QEMUFile *f, void *opaque)
 
 if (!migrate_multifd_flush_after_each_section()) {
 qemu_put_be64(f, RAM_SAVE_FLAG_MULTIFD_FLUSH);
+size += 8;
 }
 
 qemu_put_be64(f, RAM_SAVE_FLAG_EOS);
+size += 8;
 qemu_fflush(f);
 
+stat64_add(&mig_stats.transferred, size);
 return 0;
 }
 
@@ -3209,6 +3219,7 @@ static int ram_save_complete(QEMUFile *f, void *opaque)
 RAMState **temp = opaque;
 RAMState *rs = *temp;
 int ret = 0;
+size_t size = 0;
 
 rs->last_stage = !migration_in_colo_state();
 
@@ -3253,8 +3264,11 @@ static int ram_save_complete(QEMUFile *f, void *opaque)
 
 if (!migrate_multifd_flush_after_each_section()) {
 qemu_put_be64(f, RAM_SAVE_FLAG_MULTIFD_FLUSH);
+size += 8;
 }
 qemu_put_be64(f, RAM_SAVE_FLAG_EOS);
+size += 8;
+stat64_add(&mig_stats.transferred, size);
 qemu_fflush(f);
 
 return 0;
diff --git a/migration/savevm.c b/migration/savevm.c
index e33788343a..c7af9050c2 100644
--- a/migration/savevm.c
+++ b/migration/savevm.c
@@ -952,6 +952,7 @@ static void save_section_header(QEMUFile *f, SaveStateEntry *se,
 qemu_put_byte(f, section_type);
 qemu_put_be32(f, se->section_id);
 
+size_t size = 1 + 4;
 if (section_type == QEMU_VM_SECTION_FULL ||
 section_type == QEMU_VM_SECTION_START) {
 /* ID string */
@@ -961,7 +962,9 @@ static void save_section_header(QEMUFile *f, SaveStateEntry *se,
 
 qemu_put_be32(f, se->instance_id);
 qemu_put_be32(f, se->version_id);
+size += 1 + len + 4 + 4;
 }
+stat64_add(&mig_stats.transferred, size);
 }
 
 /*
@@ -973,6 +976,7 @@ static void save_section_footer(QEMUFile *f, SaveStateEntry *se)
 if (migrate_get_current()->send_section_footer) {
 qemu_put_byte(f, QEMU_VM_SECTION_FOOTER);
 qemu_put_be32(f, se->section_id);
+stat64_add(&mig_stats.transferred, 1 + 4);
 }
 }
 
@@ -1032,6 +1036,7 @@ static void qemu_savevm_command_send(QEMUFile *f,
 qemu_put_be16(f, (uint16_t)command);
 qemu_put_be16(f, len);
 qemu_put_buffer(f, data, len);
+stat64_add(&mig_stats.transferred, 1 + 2 + 2 + len);
 qemu_fflush(f);
 }
 
@@ -1212,11 +1217,13 @@ void qemu_savevm_state_header(QEMUFile *f)
 trace_savevm_state_header();
 qemu_put_be32(f, QEMU_VM_FILE_MAGIC);
 qemu_put_be32(f, QEMU_VM_FILE_VERSION);
-
+size_t size = 4 + 4;
 if (migrate_get_current()->send_configuration) {
 

[PATCH v2 06/16] migration: Move migration_total_bytes() to migration-stats.c

2023-05-15 Thread Juan Quintela
Once there, rename it to migration_transferred_bytes() and pass a
QEMUFile instead of a migration object.

Signed-off-by: Juan Quintela 
Reviewed-by: Cédric Le Goater 
---
 migration/migration-stats.h | 11 +++
 migration/migration-stats.c |  6 ++
 migration/migration.c   | 13 +++--
 3 files changed, 20 insertions(+), 10 deletions(-)

diff --git a/migration/migration-stats.h b/migration/migration-stats.h
index e39c083245..91fda378d3 100644
--- a/migration/migration-stats.h
+++ b/migration/migration-stats.h
@@ -147,4 +147,15 @@ void migration_rate_reset(void);
  * @new_rate: new maximum amount
  */
 void migration_rate_set(uint64_t new_rate);
+
+/**
+ * migration_transferred_bytes: Return number of bytes transferred
+ *
+ * @f: QEMUFile used for main migration channel
+ *
+ * Returns how many bytes we have transferred since the beginning of
+ * the migration.  It accounts for bytes sent through any migration
+ * channel: multifd, qemu_file, rdma, ...
+ */
+uint64_t migration_transferred_bytes(QEMUFile *f);
 #endif
diff --git a/migration/migration-stats.c b/migration/migration-stats.c
index 1b16edae7d..9bd97caa23 100644
--- a/migration/migration-stats.c
+++ b/migration/migration-stats.c
@@ -66,3 +66,9 @@ void migration_rate_account(uint64_t len)
 {
 stat64_add(_stats.rate_limit_used, len);
 }
+
+uint64_t migration_transferred_bytes(QEMUFile *f)
+{
+return qemu_file_transferred(f) + stat64_get(&mig_stats.multifd_bytes);
+}
+
diff --git a/migration/migration.c b/migration/migration.c
index 594709dbbc..39ff538046 100644
--- a/migration/migration.c
+++ b/migration/migration.c
@@ -2624,16 +2624,9 @@ static MigThrError migration_detect_error(MigrationState *s)
 }
 }
 
-/* How many bytes have we transferred since the beginning of the migration */
-static uint64_t migration_total_bytes(MigrationState *s)
-{
-return qemu_file_transferred(s->to_dst_file) +
-stat64_get(&mig_stats.multifd_bytes);
-}
-
 static void migration_calculate_complete(MigrationState *s)
 {
-uint64_t bytes = migration_total_bytes(s);
+uint64_t bytes = migration_transferred_bytes(s->to_dst_file);
 int64_t end_time = qemu_clock_get_ms(QEMU_CLOCK_REALTIME);
 int64_t transfer_time;
 
@@ -2659,7 +2652,7 @@ static void update_iteration_initial_status(MigrationState *s)
  * wrong speed calculation.
  */
 s->iteration_start_time = qemu_clock_get_ms(QEMU_CLOCK_REALTIME);
-s->iteration_initial_bytes = migration_total_bytes(s);
+s->iteration_initial_bytes = migration_transferred_bytes(s->to_dst_file);
 s->iteration_initial_pages = ram_get_total_transferred_pages();
 }
 
@@ -2674,7 +2667,7 @@ static void migration_update_counters(MigrationState *s,
 return;
 }
 
-current_bytes = migration_total_bytes(s);
+current_bytes = migration_transferred_bytes(s->to_dst_file);
 transferred = current_bytes - s->iteration_initial_bytes;
 time_spent = current_time - s->iteration_start_time;
 bandwidth = (double)transferred / time_spent;
-- 
2.40.1




[PATCH v2 05/16] migration: Move rate_limit_max and rate_limit_used to migration_stats

2023-05-15 Thread Juan Quintela
This way we can make them atomic and use these functions from any
place.  I also moved all functions that use rate_limit to
migration-stats.

Functions got renamed; they are not qemu_file anymore.

qemu_file_rate_limit -> migration_rate_exceeded
qemu_file_set_rate_limit -> migration_rate_set
qemu_file_get_rate_limit -> migration_rate_get
qemu_file_reset_rate_limit -> migration_rate_reset
qemu_file_acct_rate_limit -> migration_rate_account.
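
On the XFER_LIMIT_RATIO fix mentioned in the changelog below: with
BUFFER_DELAY at 100 ms there are 1000 / 100 = 10 throttling cycles per
second, so a user-visible bytes-per-second limit has to be divided by
10 to get the per-cycle budget.  A sketch:

    #include <stdint.h>

    #define BUFFER_DELAY 100                         /* ms per throttled cycle */
    #define XFER_LIMIT_RATIO (1000 / BUFFER_DELAY)   /* cycles per second: 10 */

    /* e.g. a 100 MB/s limit becomes a 10 MB budget per 100 ms cycle */
    static uint64_t per_cycle_budget(uint64_t bytes_per_second)
    {
        return bytes_per_second / XFER_LIMIT_RATIO;
    }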

Signed-off-by: Juan Quintela 
Reviewed-by: Harsh Prateek Bora 

---

s/this/these/ (harsh)
If you have any good suggestion for better names, I am all ears.
Fix missing division by XFER_LIMIT_RATIO in migration_rate_set() (quintela)
---
 include/migration/qemu-file-types.h | 12 ++-
 migration/migration-stats.h | 47 ++
 migration/options.h |  7 
 migration/qemu-file.h   | 11 --
 hw/ppc/spapr.c  |  4 +--
 hw/s390x/s390-stattrib.c|  2 +-
 migration/block-dirty-bitmap.c  |  2 +-
 migration/block.c   |  5 +--
 migration/migration-stats.c | 44 
 migration/migration.c   | 14 
 migration/multifd.c |  2 +-
 migration/options.c |  7 ++--
 migration/qemu-file.c   | 52 ++---
 migration/ram.c |  2 +-
 migration/savevm.c  |  2 +-
 15 files changed, 124 insertions(+), 89 deletions(-)

diff --git a/include/migration/qemu-file-types.h 
b/include/migration/qemu-file-types.h
index 1436f9ce92..9ba163f333 100644
--- a/include/migration/qemu-file-types.h
+++ b/include/migration/qemu-file-types.h
@@ -165,6 +165,16 @@ size_t coroutine_mixed_fn qemu_get_counted_string(QEMUFile *f, char buf[256]);
 
 void qemu_put_counted_string(QEMUFile *f, const char *name);
 
-int qemu_file_rate_limit(QEMUFile *f);
+/**
+ * migration_rate_exceeded: Check if we have exceeded rate for this interval
+ *
+ * Checks if we have already transferred more data that we are allowed
+ * in the current interval.
+ *
+ * @f: QEMUFile used for main migration channel
+ *
+ * Returns if we should stop sending data for this interval.
+ */
+bool migration_rate_exceeded(QEMUFile *f);
 
 #endif
diff --git a/migration/migration-stats.h b/migration/migration-stats.h
index 21402af9e4..e39c083245 100644
--- a/migration/migration-stats.h
+++ b/migration/migration-stats.h
@@ -15,6 +15,12 @@
 
 #include "qemu/stats64.h"
 
+/*
+ * Amount of time to allocate to each "chunk" of bandwidth-throttled
+ * data.
+ */
+#define BUFFER_DELAY 100
+
 /*
  * If rate_limit_max is 0, there is special code to remove the rate
  * limit.
@@ -75,6 +81,14 @@ typedef struct {
  * Number of bytes sent during precopy stage.
  */
 Stat64 precopy_bytes;
+/*
+ * Maximum amount of data we can send in a cycle.
+ */
+Stat64 rate_limit_max;
+/*
+ * Amount of data we have sent in the current cycle.
+ */
+Stat64 rate_limit_used;
 /*
  * How long the setup stage has taken.
  */
@@ -100,4 +114,37 @@ extern MigrationAtomicStats mig_stats;
  * Returns: Nothing.  The time is stored in val.
  */
 void migration_time_since(MigrationAtomicStats *stats, int64_t since);
+
+/**
+ * migration_rate_account: Increase the number of bytes transferred.
+ *
+ * Report on a number of bytes that have been transferred and need to
+ * be applied to the rate limiting calculations.
+ *
+ * @len: amount of bytes transferred
+ */
+void migration_rate_account(uint64_t len);
+
+/**
+ * migration_rate_get: Get the maximum amount that can be transferred.
+ *
+ * Returns the maximum number of bytes that can be transferred in a cycle.
+ */
+uint64_t migration_rate_get(void);
+
+/**
+ * migration_rate_reset: Reset the rate limit counter.
+ *
+ * This is called when we know we start a new transfer cycle.
+ */
+void migration_rate_reset(void);
+
+/**
+ * migration_rate_set: Set the maximum amount that can be transferred.
+ *
+ * Sets the maximum amount of bytes that can be transferred in one cycle.
+ *
+ * @new_rate: new maximum amount
+ */
+void migration_rate_set(uint64_t new_rate);
 #endif
diff --git a/migration/options.h b/migration/options.h
index 5cca3326d6..45991af3c2 100644
--- a/migration/options.h
+++ b/migration/options.h
@@ -17,13 +17,6 @@
 #include "hw/qdev-properties.h"
 #include "hw/qdev-properties-system.h"
 
-/* constants */
-
-/* Amount of time to allocate to each "chunk" of bandwidth-throttled
- * data. */
-#define BUFFER_DELAY 100
-#define XFER_LIMIT_RATIO (1000 / BUFFER_DELAY)
-
 /* migration properties */
 
 extern Property migration_properties[];
diff --git a/migration/qemu-file.h b/migration/qemu-file.h
index bcc39081f2..e649718492 100644
--- a/migration/qemu-file.h
+++ b/migration/qemu-file.h
@@ -130,17 +130,6 @@ void qemu_file_skip(QEMUFile *f, int size);
  * accounting information tracks the total migration traffic.
  */
 void 

[PATCH v2 01/16] migration: Don't use INT64_MAX for unlimited rate

2023-05-15 Thread Juan Quintela
Define and use RATE_LIMIT_MAX instead.

Signed-off-by: Juan Quintela 
---
 migration/migration-stats.h | 6 ++
 migration/migration.c   | 4 ++--
 migration/qemu-file.c   | 6 +-
 3 files changed, 13 insertions(+), 3 deletions(-)

diff --git a/migration/migration-stats.h b/migration/migration-stats.h
index cf8a4f0410..e782f1b0df 100644
--- a/migration/migration-stats.h
+++ b/migration/migration-stats.h
@@ -15,6 +15,12 @@
 
 #include "qemu/stats64.h"
 
+/*
+ * If rate_limit_max is 0, there is special code to remove the rate
+ * limit.
+ */
+#define RATE_LIMIT_MAX 0
+
 /*
  * These are the ram migration statistic counters.  It is loosely
  * based on MigrationStats.  We change to Stat64 any counter that
diff --git a/migration/migration.c b/migration/migration.c
index 039bba4804..c41c7491bb 100644
--- a/migration/migration.c
+++ b/migration/migration.c
@@ -2304,7 +2304,7 @@ static void migration_completion(MigrationState *s)
  * them if migration fails or is cancelled.
  */
 s->block_inactive = !migrate_colo();
-qemu_file_set_rate_limit(s->to_dst_file, INT64_MAX);
+qemu_file_set_rate_limit(s->to_dst_file, RATE_LIMIT_MAX);
 ret = qemu_savevm_state_complete_precopy(s->to_dst_file, false,
  s->block_inactive);
 }
@@ -3048,7 +3048,7 @@ static void *bg_migration_thread(void *opaque)
 rcu_register_thread();
 object_ref(OBJECT(s));
 
-qemu_file_set_rate_limit(s->to_dst_file, INT64_MAX);
+qemu_file_set_rate_limit(s->to_dst_file, RATE_LIMIT_MAX);
 
 setup_start = qemu_clock_get_ms(QEMU_CLOCK_HOST);
 /*
diff --git a/migration/qemu-file.c b/migration/qemu-file.c
index 597054759d..4bc875b452 100644
--- a/migration/qemu-file.c
+++ b/migration/qemu-file.c
@@ -27,6 +27,7 @@
 #include "qemu/error-report.h"
 #include "qemu/iov.h"
 #include "migration.h"
+#include "migration-stats.h"
 #include "qemu-file.h"
 #include "trace.h"
 #include "options.h"
@@ -732,7 +733,10 @@ int qemu_file_rate_limit(QEMUFile *f)
 if (qemu_file_get_error(f)) {
 return 1;
 }
-if (f->rate_limit_max > 0 && f->rate_limit_used > f->rate_limit_max) {
+if (f->rate_limit_max == RATE_LIMIT_MAX) {
+return 0;
+}
+if (f->rate_limit_used > f->rate_limit_max) {
 return 1;
 }
 return 0;
-- 
2.40.1
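
A compact restatement of the check this patch introduces, as a stand-alone program; FileStub is a hypothetical stand-in for the QEMUFile fields involved:

#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

#define RATE_LIMIT_MAX 0   /* 0 means "no limit", as the patch defines it */

typedef struct {           /* stand-in for QEMUFile's two rate-limit fields */
    uint64_t rate_limit_max;
    uint64_t rate_limit_used;
} FileStub;

/* Mirrors qemu_file_rate_limit() after the patch: true = stop sending. */
static bool rate_limit_exceeded(const FileStub *f)
{
    if (f->rate_limit_max == RATE_LIMIT_MAX) {
        return false;      /* unlimited; replaces the old INT64_MAX case */
    }
    return f->rate_limit_used > f->rate_limit_max;
}

int main(void)
{
    FileStub f = { .rate_limit_max = RATE_LIMIT_MAX,
                   .rate_limit_used = 1u << 30 };
    printf("unlimited: %d\n", rate_limit_exceeded(&f));   /* prints 0 */
    f.rate_limit_max = 4096;
    printf("limited:   %d\n", rate_limit_exceeded(&f));   /* prints 1 */
    return 0;
}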




[PATCH v3 05/14] nbd: Add types for extended headers

2023-05-15 Thread Eric Blake
Add the constants and structs necessary for later patches to start
implementing the NBD_OPT_EXTENDED_HEADERS extension in both the client
and server, matching recent commit e6f3b94a934 in the upstream nbd
project.  This patch does not change any existing behavior, but merely
sets the stage.

This patch does not change the status quo that neither the client nor
server use a packed-struct representation for the request header.

Signed-off-by: Eric Blake 
---
 docs/interop/nbd.txt |  1 +
 include/block/nbd.h  | 74 
 nbd/common.c | 10 +-
 3 files changed, 65 insertions(+), 20 deletions(-)

diff --git a/docs/interop/nbd.txt b/docs/interop/nbd.txt
index f5ca25174a6..abaf4c28a96 100644
--- a/docs/interop/nbd.txt
+++ b/docs/interop/nbd.txt
@@ -69,3 +69,4 @@ NBD_CMD_BLOCK_STATUS for "qemu:dirty-bitmap:", NBD_CMD_CACHE
 NBD_CMD_FLAG_FAST_ZERO
 * 5.2: NBD_CMD_BLOCK_STATUS for "qemu:allocation-depth"
 * 7.1: NBD_FLAG_CAN_MULTI_CONN for shareable writable exports
+* 8.1: NBD_OPT_EXTENDED_HEADERS
diff --git a/include/block/nbd.h b/include/block/nbd.h
index 50626ab2744..d753fb8006f 100644
--- a/include/block/nbd.h
+++ b/include/block/nbd.h
@@ -87,13 +87,24 @@ typedef struct NBDStructuredReplyChunk {
 uint32_t length; /* length of payload */
 } QEMU_PACKED NBDStructuredReplyChunk;

+typedef struct NBDExtendedReplyChunk {
+uint32_t magic;  /* NBD_EXTENDED_REPLY_MAGIC */
+uint16_t flags;  /* combination of NBD_REPLY_FLAG_* */
+uint16_t type;   /* NBD_REPLY_TYPE_* */
+uint64_t handle; /* request handle */
+uint64_t offset; /* request offset */
+uint64_t length; /* length of payload */
+} QEMU_PACKED NBDExtendedReplyChunk;
+
 typedef union NBDReply {
 NBDSimpleReply simple;
 NBDStructuredReplyChunk structured;
+NBDExtendedReplyChunk extended;
 struct {
-/* @magic and @handle fields have the same offset and size both in
- * simple reply and structured reply chunk, so let them be accessible
- * without ".simple." or ".structured." specification
+/*
+ * @magic and @handle fields have the same offset and size in all
+ * forms of replies, so let them be accessible without ".simple.",
+ * ".structured.", or ".extended." specifications.
  */
 uint32_t magic;
 uint32_t _skip;
@@ -126,15 +137,29 @@ typedef struct NBDStructuredError {
 typedef struct NBDStructuredMeta {
 /* header's length >= 12 (at least one extent) */
 uint32_t context_id;
-/* extents follows */
+/* NBDExtent extents[] follows, array length implied by header */
 } QEMU_PACKED NBDStructuredMeta;

-/* Extent chunk for NBD_REPLY_TYPE_BLOCK_STATUS */
+/* Extent array for NBD_REPLY_TYPE_BLOCK_STATUS */
 typedef struct NBDExtent {
 uint32_t length;
 uint32_t flags; /* NBD_STATE_* */
 } QEMU_PACKED NBDExtent;

+/* Header of NBD_REPLY_TYPE_BLOCK_STATUS_EXT */
+typedef struct NBDStructuredMetaExt {
+/* header's length >= 24 (at least one extent) */
+uint32_t context_id;
+uint32_t count; /* header length must be count * 16 + 8 */
+/* NBDExtentExt extents[count] follows */
+} QEMU_PACKED NBDStructuredMetaExt;
+
+/* Extent array for NBD_REPLY_TYPE_BLOCK_STATUS_EXT */
+typedef struct NBDExtentExt {
+uint64_t length;
+uint64_t flags; /* NBD_STATE_* */
+} QEMU_PACKED NBDExtentExt;
+
 /* Transmission (export) flags: sent from server to client during handshake,
but describe what will happen during transmission */
 enum {
@@ -187,6 +212,7 @@ enum {
 #define NBD_OPT_STRUCTURED_REPLY  (8)
 #define NBD_OPT_LIST_META_CONTEXT (9)
 #define NBD_OPT_SET_META_CONTEXT  (10)
+#define NBD_OPT_EXTENDED_HEADERS  (11)

 /* Option reply types. */
 #define NBD_REP_ERR(value) ((UINT32_C(1) << 31) | (value))
@@ -204,6 +230,8 @@ enum {
 #define NBD_REP_ERR_UNKNOWN NBD_REP_ERR(6)  /* Export unknown */
 #define NBD_REP_ERR_SHUTDOWNNBD_REP_ERR(7)  /* Server shutting down */
 #define NBD_REP_ERR_BLOCK_SIZE_REQD NBD_REP_ERR(8)  /* Need INFO_BLOCK_SIZE */
+#define NBD_REP_ERR_TOO_BIG NBD_REP_ERR(9)  /* Payload size overflow */
+#define NBD_REP_ERR_EXT_HEADER_REQD NBD_REP_ERR(10) /* Need extended headers */

 /* Info types, used during NBD_REP_INFO */
 #define NBD_INFO_EXPORT 0
@@ -212,12 +240,14 @@ enum {
 #define NBD_INFO_BLOCK_SIZE 3

 /* Request flags, sent from client to server during transmission phase */
-#define NBD_CMD_FLAG_FUA(1 << 0) /* 'force unit access' during write */
-#define NBD_CMD_FLAG_NO_HOLE(1 << 1) /* don't punch hole on zero run */
-#define NBD_CMD_FLAG_DF (1 << 2) /* don't fragment structured read */
-#define NBD_CMD_FLAG_REQ_ONE(1 << 3) /* only one extent in BLOCK_STATUS
-  * reply chunk */
-#define NBD_CMD_FLAG_FAST_ZERO  (1 << 4) /* fail if WRITE_ZEROES is not fast */
+#define NBD_CMD_FLAG_FUA (1 << 0) /* 'force unit access' during write */
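
Since the structured header is 20 bytes and the extended one 32, a quick stand-alone size check can be built from the struct layouts above; the structs are re-declared locally and __attribute__((packed)) stands in for QEMU_PACKED (GCC/Clang assumed):

#include <assert.h>
#include <stdint.h>

typedef struct __attribute__((packed)) {   /* mirrors NBDStructuredReplyChunk */
    uint32_t magic; uint16_t flags; uint16_t type;
    uint64_t handle; uint32_t length;
} StructuredReplyChunk;

typedef struct __attribute__((packed)) {   /* mirrors NBDExtendedReplyChunk */
    uint32_t magic; uint16_t flags; uint16_t type;
    uint64_t handle; uint64_t offset; uint64_t length;
} ExtendedReplyChunk;

static_assert(sizeof(StructuredReplyChunk) == 20, "structured header: 20 bytes");
static_assert(sizeof(ExtendedReplyChunk) == 32, "extended header: 32 bytes");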

[PATCH v3 10/14] nbd/client: Initial support for extended headers

2023-05-15 Thread Eric Blake
Update the client code to be able to send an extended request, and
parse an extended header from the server.  Note that since we reject
any structured reply with a too-large payload, we can always normalize
a valid header back into the compact form, so that the caller need not
deal with two branches of a union.  Still, until a later patch lets
the client negotiate extended headers, the code added here should not
be reached.  Note that because of the different magic numbers, it is
just as easy to trace and then tolerate a non-compliant server sending
the wrong header reply as it would be to insist that the server is
compliant.

The only caller to nbd_receive_reply() always passed NULL for errp;
since we are changing the signature anyways, I decided to sink the
decision to ignore errors one layer lower.

Signed-off-by: Eric Blake 
---
 include/block/nbd.h |  2 +-
 block/nbd.c |  3 +-
 nbd/client.c| 86 +++--
 nbd/trace-events|  1 +
 4 files changed, 63 insertions(+), 29 deletions(-)

diff --git a/include/block/nbd.h b/include/block/nbd.h
index d753fb8006f..865bb4ee2e1 100644
--- a/include/block/nbd.h
+++ b/include/block/nbd.h
@@ -371,7 +371,7 @@ int nbd_init(int fd, QIOChannelSocket *sioc, NBDExportInfo *info,
  Error **errp);
 int nbd_send_request(QIOChannel *ioc, NBDRequest *request, NBDHeaderStyle hdr);
 int coroutine_fn nbd_receive_reply(BlockDriverState *bs, QIOChannel *ioc,
-   NBDReply *reply, Error **errp);
+   NBDReply *reply, NBDHeaderStyle hdr);
 int nbd_client(int fd);
 int nbd_disconnect(int fd);
 int nbd_errno_to_system_errno(int err);
diff --git a/block/nbd.c b/block/nbd.c
index 6ad6a4f5ecd..d6caea44928 100644
--- a/block/nbd.c
+++ b/block/nbd.c
@@ -458,7 +458,8 @@ static coroutine_fn int nbd_receive_replies(BDRVNBDState *s, uint64_t handle)

 /* We are under mutex and handle is 0. We have to do the dirty work. */
 assert(s->reply.handle == 0);
-ret = nbd_receive_reply(s->bs, s->ioc, &s->reply, NULL);
+ret = nbd_receive_reply(s->bs, s->ioc, &s->reply,
+s->info.header_style);
 if (ret <= 0) {
 ret = ret ? ret : -EIO;
 nbd_channel_error(s, ret);
diff --git a/nbd/client.c b/nbd/client.c
index 17d1f57da60..e5db3c8b79d 100644
--- a/nbd/client.c
+++ b/nbd/client.c
@@ -1350,22 +1350,29 @@ int nbd_disconnect(int fd)

 int nbd_send_request(QIOChannel *ioc, NBDRequest *request, NBDHeaderStyle hdr)
 {
-uint8_t buf[NBD_REQUEST_SIZE];
+uint8_t buf[NBD_EXTENDED_REQUEST_SIZE];
+size_t len;

-assert(hdr < NBD_HEADER_EXTENDED);
-assert(request->len <= UINT32_MAX);
 trace_nbd_send_request(request->from, request->len, request->handle,
request->flags, request->type,
nbd_cmd_lookup(request->type));

-stl_be_p(buf, NBD_REQUEST_MAGIC);
 stw_be_p(buf + 4, request->flags);
 stw_be_p(buf + 6, request->type);
 stq_be_p(buf + 8, request->handle);
 stq_be_p(buf + 16, request->from);
-stl_be_p(buf + 24, request->len);
+if (hdr >= NBD_HEADER_EXTENDED) {
+stl_be_p(buf, NBD_EXTENDED_REQUEST_MAGIC);
+stq_be_p(buf + 24, request->len);
+len = NBD_EXTENDED_REQUEST_SIZE;
+} else {
+assert(request->len <= UINT32_MAX);
+stl_be_p(buf, NBD_REQUEST_MAGIC);
+stl_be_p(buf + 24, request->len);
+len = NBD_REQUEST_SIZE;
+}

-return nbd_write(ioc, buf, sizeof(buf), NULL);
+return nbd_write(ioc, buf, len, NULL);
 }

 /* nbd_receive_simple_reply
@@ -1394,28 +1401,34 @@ static int nbd_receive_simple_reply(QIOChannel *ioc, NBDSimpleReply *reply,

 /* nbd_receive_structured_reply_chunk
  * Read structured reply chunk except magic field (which should be already
- * read).
+ * read).  Normalize into the compact form.
  * Payload is not read.
  */
-static int nbd_receive_structured_reply_chunk(QIOChannel *ioc,
-  NBDStructuredReplyChunk *chunk,
+static int nbd_receive_structured_reply_chunk(QIOChannel *ioc, NBDReply *chunk,
   Error **errp)
 {
 int ret;
+size_t len;
+uint64_t payload_len;

-assert(chunk->magic == NBD_STRUCTURED_REPLY_MAGIC);
+if (chunk->magic == NBD_STRUCTURED_REPLY_MAGIC) {
+len = sizeof(chunk->structured);
+} else {
+assert(chunk->magic == NBD_EXTENDED_REPLY_MAGIC);
+len = sizeof(chunk->extended);
+}

 ret = nbd_read(ioc, (uint8_t *)chunk + sizeof(chunk->magic),
-   sizeof(*chunk) - sizeof(chunk->magic), "structured chunk",
+   len - sizeof(chunk->magic), "structured chunk",
errp);
 if (ret < 0) {
 return ret;
 }

-chunk->flags = be16_to_cpu(chunk->flags);
-chunk->type = be16_to_cpu(chunk->type);
-
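
The compact and extended request encodings in nbd_send_request() above differ only in the magic number and in the width of the trailing length field (28 versus 32 bytes on the wire). A stand-alone sketch of the same layout; the be*p helpers are local stand-ins for stw_be_p/stl_be_p/stq_be_p, and the extended magic value is taken from the upstream NBD spec rather than from this excerpt:

#include <stddef.h>
#include <stdint.h>

#define NBD_REQUEST_MAGIC          0x25609513
/* Value per the upstream NBD extension spec; it is not shown in this hunk. */
#define NBD_EXTENDED_REQUEST_MAGIC 0x21e41c71

/* Local big-endian stores standing in for stw_be_p/stl_be_p/stq_be_p. */
static void be16p(uint8_t *p, uint16_t v) { p[0] = v >> 8; p[1] = (uint8_t)v; }
static void be32p(uint8_t *p, uint32_t v) { be16p(p, v >> 16); be16p(p + 2, (uint16_t)v); }
static void be64p(uint8_t *p, uint64_t v) { be32p(p, v >> 32); be32p(p + 4, (uint32_t)v); }

/* Encode one request; returns bytes used: 28 compact, 32 extended. */
static size_t encode_request(uint8_t buf[32], int extended, uint16_t flags,
                             uint16_t type, uint64_t handle,
                             uint64_t from, uint64_t len)
{
    be16p(buf + 4, flags);
    be16p(buf + 6, type);
    be64p(buf + 8, handle);
    be64p(buf + 16, from);
    if (extended) {
        be32p(buf, NBD_EXTENDED_REQUEST_MAGIC);
        be64p(buf + 24, len);
        return 32;
    }
    be32p(buf, NBD_REQUEST_MAGIC);
    be32p(buf + 24, (uint32_t)len);   /* caller asserts len <= UINT32_MAX */
    return 28;
}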

[PATCH v3 07/14] nbd/server: Refactor to pass full request around

2023-05-15 Thread Eric Blake
Part of NBD's 64-bit headers extension involves passing the client's
requested offset back as part of the reply header (one reason for this
change: converting absolute offsets stored in
NBD_REPLY_TYPE_OFFSET_DATA to relative offsets within the buffer is
easier if the absolute offset of the buffer is also available).  This
is a refactoring patch to pass the full request around the reply
stack, rather than just the handle, so that later patches can then
access request->from when extended headers are active.  But for this
patch, there are no semantic changes.

Signed-off-by: Eric Blake 
---
 nbd/server.c | 117 +++
 1 file changed, 61 insertions(+), 56 deletions(-)

diff --git a/nbd/server.c b/nbd/server.c
index 5812a773ace..ffab51efd26 100644
--- a/nbd/server.c
+++ b/nbd/server.c
@@ -1887,18 +1887,18 @@ static int coroutine_fn nbd_co_send_iov(NBDClient *client, struct iovec *iov,
 }

 static inline void set_be_simple_reply(NBDClient *client, struct iovec *iov,
-   uint64_t error, uint64_t handle)
+   uint64_t error, NBDRequest *request)
 {
 NBDSimpleReply *reply = iov->iov_base;

 iov->iov_len = sizeof(*reply);
 stl_be_p(&reply->magic, NBD_SIMPLE_REPLY_MAGIC);
 stl_be_p(&reply->error, error);
-stq_be_p(&reply->handle, handle);
+stq_be_p(&reply->handle, request->handle);
 }

 static int coroutine_fn nbd_co_send_simple_reply(NBDClient *client,
- uint64_t handle,
+ NBDRequest *request,
  uint32_t error,
  void *data,
  size_t len,
@@ -1911,16 +1911,16 @@ static int coroutine_fn nbd_co_send_simple_reply(NBDClient *client,
 {.iov_base = data, .iov_len = len}
 };

-trace_nbd_co_send_simple_reply(handle, nbd_err, nbd_err_lookup(nbd_err),
-   len);
-set_be_simple_reply(client, &iov[0], nbd_err, handle);
+trace_nbd_co_send_simple_reply(request->handle, nbd_err,
+   nbd_err_lookup(nbd_err), len);
+set_be_simple_reply(client, &iov[0], nbd_err, request);

 return nbd_co_send_iov(client, iov, len ? 2 : 1, errp);
 }

 static inline void set_be_chunk(NBDClient *client, struct iovec *iov,
 uint16_t flags, uint16_t type,
-uint64_t handle, uint32_t length)
+NBDRequest *request, uint32_t length)
 {
 NBDStructuredReplyChunk *chunk = iov->iov_base;

@@ -1928,12 +1928,12 @@ static inline void set_be_chunk(NBDClient *client, struct iovec *iov,
 stl_be_p(&chunk->magic, NBD_STRUCTURED_REPLY_MAGIC);
 stw_be_p(&chunk->flags, flags);
 stw_be_p(&chunk->type, type);
-stq_be_p(&chunk->handle, handle);
+stq_be_p(&chunk->handle, request->handle);
 stl_be_p(&chunk->length, length);
 }

 static int coroutine_fn nbd_co_send_structured_done(NBDClient *client,
-uint64_t handle,
+NBDRequest *request,
 Error **errp)
 {
 NBDReply hdr;
@@ -1941,15 +1941,15 @@ static int coroutine_fn nbd_co_send_structured_done(NBDClient *client,
 {.iov_base = &hdr},
 };

-trace_nbd_co_send_structured_done(handle);
+trace_nbd_co_send_structured_done(request->handle);
 set_be_chunk(client, &iov[0], NBD_REPLY_FLAG_DONE,
- NBD_REPLY_TYPE_NONE, handle, 0);
+ NBD_REPLY_TYPE_NONE, request, 0);

 return nbd_co_send_iov(client, iov, 1, errp);
 }

 static int coroutine_fn nbd_co_send_structured_read(NBDClient *client,
-uint64_t handle,
+NBDRequest *request,
 uint64_t offset,
 void *data,
 size_t size,
@@ -1965,16 +1965,16 @@ static int coroutine_fn nbd_co_send_structured_read(NBDClient *client,
 };

 assert(size);
-trace_nbd_co_send_structured_read(handle, offset, data, size);
+trace_nbd_co_send_structured_read(request->handle, offset, data, size);
 set_be_chunk(client, &iov[0], final ? NBD_REPLY_FLAG_DONE : 0,
- NBD_REPLY_TYPE_OFFSET_DATA, handle, iov[1].iov_len + size);
+ NBD_REPLY_TYPE_OFFSET_DATA, request, iov[1].iov_len + size);
 stq_be_p(&chunk.offset, offset);

 return nbd_co_send_iov(client, iov, 3, errp);
 }

 static int coroutine_fn nbd_co_send_structured_error(NBDClient *client,
- uint64_t handle,
+ NBDRequest *request,
  

[PATCH v3 12/14] nbd/client: Request extended headers during negotiation

2023-05-15 Thread Eric Blake
All the pieces are in place for a client to finally request extended
headers.  Note that we must not request extended headers when qemu-nbd
is used to connect to the kernel module (as nbd.ko does not expect
them), but there is no harm in all other clients requesting them.

Extended headers are not essential to the information collected during
'qemu-nbd --list', but probing for it gives us one more piece of
information in that output.  Update the iotests affected by the new
line of output.

Signed-off-by: Eric Blake 
---
 block/nbd.c   |  5 +--
 nbd/client-connection.c   |  2 +-
 nbd/client.c  | 38 ---
 qemu-nbd.c|  3 ++
 tests/qemu-iotests/223.out|  6 +++
 tests/qemu-iotests/233.out|  5 +++
 tests/qemu-iotests/241.out|  3 ++
 tests/qemu-iotests/307.out|  5 +++
 .../tests/nbd-qemu-allocation.out |  1 +
 9 files changed, 51 insertions(+), 17 deletions(-)

diff --git a/block/nbd.c b/block/nbd.c
index 150dfe7170c..db107ff0806 100644
--- a/block/nbd.c
+++ b/block/nbd.c
@@ -1146,10 +1146,9 @@ static int coroutine_fn nbd_co_receive_blockstatus_reply(BDRVNBDState *s,

 switch (chunk->type) {
 case NBD_REPLY_TYPE_BLOCK_STATUS_EXT:
-wide = true;
-/* fallthrough */
 case NBD_REPLY_TYPE_BLOCK_STATUS:
-if (s->info.extended_headers != wide) {
+wide = chunk->type == NBD_REPLY_TYPE_BLOCK_STATUS_EXT;
+if ((s->info.header_style == NBD_HEADER_EXTENDED) != wide) {
 trace_nbd_extended_headers_compliance("block_status");
 }
 if (received) {
diff --git a/nbd/client-connection.c b/nbd/client-connection.c
index 62d75af0bb3..8e0606cadf0 100644
--- a/nbd/client-connection.c
+++ b/nbd/client-connection.c
@@ -93,7 +93,7 @@ NBDClientConnection *nbd_client_connection_new(const SocketAddress *saddr,
 .do_negotiation = do_negotiation,

 .initial_info.request_sizes = true,
-.initial_info.header_style = NBD_HEADER_STRUCTURED,
+.initial_info.header_style = NBD_HEADER_EXTENDED,
 .initial_info.base_allocation = true,
 .initial_info.x_dirty_bitmap = g_strdup(x_dirty_bitmap),
 .initial_info.name = g_strdup(export_name ?: "")
diff --git a/nbd/client.c b/nbd/client.c
index e5db3c8b79d..7edddfd2f83 100644
--- a/nbd/client.c
+++ b/nbd/client.c
@@ -879,11 +879,12 @@ static int nbd_list_meta_contexts(QIOChannel *ioc,
  *  1: server is newstyle, but can only accept EXPORT_NAME
  *  2: server is newstyle, but lacks structured replies
  *  3: server is newstyle and set up for structured replies
+ *  4: server is newstyle and set up for extended headers
  */
 static int nbd_start_negotiate(AioContext *aio_context, QIOChannel *ioc,
QCryptoTLSCreds *tlscreds,
const char *hostname, QIOChannel **outioc,
-   bool structured_reply, bool *zeroes,
+   NBDHeaderStyle style, bool *zeroes,
Error **errp)
 {
 ERRP_GUARD();
@@ -961,15 +962,23 @@ static int nbd_start_negotiate(AioContext *aio_context, QIOChannel *ioc,
 if (fixedNewStyle) {
 int result = 0;

-if (structured_reply) {
+if (style >= NBD_HEADER_EXTENDED) {
+result = nbd_request_simple_option(ioc,
+   NBD_OPT_EXTENDED_HEADERS,
+   false, errp);
+if (result) {
+return result < 0 ? -EINVAL : 4;
+}
+}
+if (style >= NBD_HEADER_STRUCTURED) {
 result = nbd_request_simple_option(ioc,
NBD_OPT_STRUCTURED_REPLY,
false, errp);
-if (result < 0) {
-return -EINVAL;
+if (result) {
+return result < 0 ? -EINVAL : 3;
 }
 }
-return 2 + result;
+return 2;
 } else {
 return 1;
 }
@@ -1031,8 +1040,7 @@ int nbd_receive_negotiate(AioContext *aio_context, QIOChannel *ioc,
 trace_nbd_receive_negotiate_name(info->name);

 result = nbd_start_negotiate(aio_context, ioc, tlscreds, hostname, outioc,
- info->header_style >= NBD_HEADER_STRUCTURED,
- &zeroes, errp);
+ info->header_style, &zeroes, errp);

 info->header_style = NBD_HEADER_SIMPLE;
 info->base_allocation = false;
@@ -1041,8 +1049,10 @@ int nbd_receive_negotiate(AioContext *aio_context, 
QIOChannel 
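
A sketch of how a caller might fold nbd_start_negotiate()'s documented return codes (1..4, per the comment in the hunk above) back onto a header style; the enum is a local stand-in for NBDHeaderStyle and this compiles as a standalone translation unit:

typedef enum { HDR_SIMPLE, HDR_STRUCTURED, HDR_EXTENDED } HeaderStyle;

/* Map nbd_start_negotiate()'s documented return codes to a style. */
static HeaderStyle style_from_result(int result)
{
    switch (result) {
    case 4:  return HDR_EXTENDED;     /* NBD_OPT_EXTENDED_HEADERS accepted */
    case 3:  return HDR_STRUCTURED;   /* NBD_OPT_STRUCTURED_REPLY accepted */
    default: return HDR_SIMPLE;       /* 1 or 2: simple replies only */
    }
}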

[PATCH v3 13/14] nbd/server: Prepare for per-request filtering of BLOCK_STATUS

2023-05-15 Thread Eric Blake
The next commit will add support for the new addition of
NBD_CMD_FLAG_PAYLOAD during NBD_CMD_BLOCK_STATUS, where the client can
request that the server only return a subset of negotiated contexts,
rather than all contexts.  To make that task easier, this patch
populates the list of contexts to return on a per-command basis (for
now, identical to the full set of negotiated contexts).

Signed-off-by: Eric Blake 
---
 include/block/nbd.h |  15 ++
 nbd/server.c| 108 +++-
 2 files changed, 72 insertions(+), 51 deletions(-)

diff --git a/include/block/nbd.h b/include/block/nbd.h
index 865bb4ee2e1..6696d61bd59 100644
--- a/include/block/nbd.h
+++ b/include/block/nbd.h
@@ -60,6 +60,20 @@ typedef enum NBDHeaderStyle {
 NBD_HEADER_EXTENDED,/* NBD_OPT_EXTENDED_HEADERS negotiated */
 } NBDHeaderStyle;

+/*
+ * NBDMetaContexts represents a list of meta contexts in use, as
+ * selected by NBD_OPT_SET_META_CONTEXT. Also used for
+ * NBD_OPT_LIST_META_CONTEXT, and payload filtering in
+ * NBD_CMD_BLOCK_STATUS.
+ */
+typedef struct NBDMetaContexts {
+size_t count; /* number of negotiated contexts */
+bool base_allocation; /* export base:allocation context (block status) */
+bool allocation_depth; /* export qemu:allocation-depth */
+size_t nr_bitmaps; /* Length of bitmaps array */
+bool *bitmaps; /* export qemu:dirty-bitmap: */
+} NBDMetaContexts;
+
 /*
  * Note: NBDRequest is _NOT_ the same as the network representation of an NBD
  * request!
@@ -70,6 +84,7 @@ typedef struct NBDRequest {
 uint64_t len;   /* Effect length; 32 bit limit without extended headers */
 uint16_t flags; /* NBD_CMD_FLAG_* */
 uint16_t type;  /* NBD_CMD_* */
+NBDMetaContexts contexts; /* Used by NBD_CMD_BLOCK_STATUS */
 } NBDRequest;

 typedef struct NBDSimpleReply {
diff --git a/nbd/server.c b/nbd/server.c
index 6475a76c1f0..db550c82cd2 100644
--- a/nbd/server.c
+++ b/nbd/server.c
@@ -105,20 +105,6 @@ struct NBDExport {

 static QTAILQ_HEAD(, NBDExport) exports = QTAILQ_HEAD_INITIALIZER(exports);

-/* NBDExportMetaContexts represents a list of contexts to be exported,
- * as selected by NBD_OPT_SET_META_CONTEXT. Also used for
- * NBD_OPT_LIST_META_CONTEXT. */
-typedef struct NBDExportMetaContexts {
-NBDExport *exp;
-size_t count; /* number of negotiated contexts */
-bool base_allocation; /* export base:allocation context (block status) */
-bool allocation_depth; /* export qemu:allocation-depth */
-bool *bitmaps; /*
-* export qemu:dirty-bitmap:,
-* sized by exp->nr_export_bitmaps
-*/
-} NBDExportMetaContexts;
-
 struct NBDClient {
 int refcount;
 void (*close_fn)(NBDClient *client, bool negotiated);
@@ -144,7 +130,8 @@ struct NBDClient {
 uint32_t check_align; /* If non-zero, check for aligned client requests */

 NBDHeaderStyle header_style;
-NBDExportMetaContexts export_meta;
+NBDExport *context_exp; /* export of last OPT_SET_META_CONTEXT */
+NBDMetaContexts contexts; /* Negotiated meta contexts */

 uint32_t opt; /* Current option being negotiated */
 uint32_t optlen; /* remaining length of data in ioc for the option being
@@ -457,8 +444,8 @@ static int nbd_negotiate_handle_list(NBDClient *client, Error **errp)

 static void nbd_check_meta_export(NBDClient *client)
 {
-if (client->exp != client->export_meta.exp) {
-client->export_meta.count = 0;
+if (client->exp != client->context_exp) {
+client->contexts.count = 0;
 }
 }

@@ -852,7 +839,7 @@ static bool nbd_strshift(const char **str, const char *prefix)
  * Handle queries to 'base' namespace. For now, only the base:allocation
  * context is available.  Return true if @query has been handled.
  */
-static bool nbd_meta_base_query(NBDClient *client, NBDExportMetaContexts *meta,
+static bool nbd_meta_base_query(NBDClient *client, NBDMetaContexts *meta,
 const char *query)
 {
 if (!nbd_strshift(&query, "base:")) {
@@ -872,8 +859,8 @@ static bool nbd_meta_base_query(NBDClient *client, NBDExportMetaContexts *meta,
  * and qemu:allocation-depth contexts are available.  Return true if @query
  * has been handled.
  */
-static bool nbd_meta_qemu_query(NBDClient *client, NBDExportMetaContexts *meta,
-const char *query)
+static bool nbd_meta_qemu_query(NBDClient *client, NBDExport *exp,
+NBDMetaContexts *meta, const char *query)
 {
 size_t i;

@@ -884,9 +871,9 @@ static bool nbd_meta_qemu_query(NBDClient *client, NBDExportMetaContexts *meta,

 if (!*query) {
 if (client->opt == NBD_OPT_LIST_META_CONTEXT) {
-meta->allocation_depth = meta->exp->allocation_depth;
-if (meta->exp->nr_export_bitmaps) {
-memset(meta->bitmaps, 1, meta->exp->nr_export_bitmaps);
+meta->allocation_depth = exp->allocation_depth;
+  
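
A small sketch of how the count field of the NBDMetaContexts struct above relates to its individual selections; the types are local stand-ins, not the patch's code:

#include <stdbool.h>
#include <stddef.h>

typedef struct {              /* local mirror of NBDMetaContexts above */
    size_t count;
    bool base_allocation;
    bool allocation_depth;
    size_t nr_bitmaps;
    bool *bitmaps;
} MetaContexts;

/* Recompute count from the individual selections. */
static size_t meta_contexts_count(const MetaContexts *m)
{
    size_t n = m->base_allocation + m->allocation_depth;
    for (size_t i = 0; i < m->nr_bitmaps; i++) {
        n += m->bitmaps[i];
    }
    return n;
}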

[PATCH v3 04/14] nbd: Prepare for 64-bit request effect lengths

2023-05-15 Thread Eric Blake
Widen the length field of NBDRequest to 64-bits, although we can
assert that all current uses are still under 32 bits.  Move the
request magic number to nbd.h, to live alongside the reply magic
number.  Convert 'bool structured_reply' into a tri-state enum that
will eventually track whether the client successfully negotiated
extended headers with the server, allowing the nbd driver to pass
larger requests along where possible; although in this patch the enum
never surpasses structured replies, for no semantic change yet.

Signed-off-by: Eric Blake 
---
 include/block/nbd.h | 36 +
 nbd/nbd-internal.h  |  3 +--
 block/nbd.c | 45 +++--
 nbd/client-connection.c |  4 ++--
 nbd/client.c| 18 ++---
 nbd/server.c| 37 +++--
 nbd/trace-events|  8 
 7 files changed, 93 insertions(+), 58 deletions(-)

diff --git a/include/block/nbd.h b/include/block/nbd.h
index f1d838d24f5..50626ab2744 100644
--- a/include/block/nbd.h
+++ b/include/block/nbd.h
@@ -1,5 +1,5 @@
 /*
- *  Copyright (C) 2016-2022 Red Hat, Inc.
+ *  Copyright Red Hat
  *  Copyright (C) 2005  Anthony Liguori 
  *
  *  Network Block Device
@@ -51,19 +51,26 @@ typedef struct NBDOptionReplyMetaContext {
 /* metadata context name follows */
 } QEMU_PACKED NBDOptionReplyMetaContext;

-/* Transmission phase structs
- *
- * Note: these are _NOT_ the same as the network representation of an NBD
- * request and reply!
+/* Transmission phase structs */
+
+/* Header style in use */
+typedef enum NBDHeaderStyle {
+NBD_HEADER_SIMPLE,  /* default; simple replies only */
+NBD_HEADER_STRUCTURED,  /* NBD_OPT_STRUCTURED_REPLY negotiated */
+NBD_HEADER_EXTENDED,/* NBD_OPT_EXTENDED_HEADERS negotiated */
+} NBDHeaderStyle;
+
+/*
+ * Note: NBDRequest is _NOT_ the same as the network representation of an NBD
+ * request!
  */
-struct NBDRequest {
+typedef struct NBDRequest {
 uint64_t handle;
-uint64_t from;
-uint32_t len;
+uint64_t from;  /* Offset touched by the command */
+uint64_t len;   /* Effect length; 32 bit limit without extended headers */
 uint16_t flags; /* NBD_CMD_FLAG_* */
-uint16_t type; /* NBD_CMD_* */
-};
-typedef struct NBDRequest NBDRequest;
+uint16_t type;  /* NBD_CMD_* */
+} NBDRequest;

 typedef struct NBDSimpleReply {
 uint32_t magic;  /* NBD_SIMPLE_REPLY_MAGIC */
@@ -236,6 +243,9 @@ enum {
  */
 #define NBD_MAX_STRING_SIZE 4096

+/* Transmission request structure */
+#define NBD_REQUEST_MAGIC   0x25609513
+
 /* Two types of reply structures */
 #define NBD_SIMPLE_REPLY_MAGIC  0x67446698
 #define NBD_STRUCTURED_REPLY_MAGIC  0x668e33ef
@@ -293,7 +303,7 @@ struct NBDExportInfo {

 /* In-out fields, set by client before nbd_receive_negotiate() and
  * updated by server results during nbd_receive_negotiate() */
-bool structured_reply;
+NBDHeaderStyle header_style;
 bool base_allocation; /* base:allocation context for NBD_CMD_BLOCK_STATUS 
*/

 /* Set by server results during nbd_receive_negotiate() and
@@ -323,7 +333,7 @@ int nbd_receive_export_list(QIOChannel *ioc, 
QCryptoTLSCreds *tlscreds,
 Error **errp);
 int nbd_init(int fd, QIOChannelSocket *sioc, NBDExportInfo *info,
  Error **errp);
-int nbd_send_request(QIOChannel *ioc, NBDRequest *request);
+int nbd_send_request(QIOChannel *ioc, NBDRequest *request, NBDHeaderStyle hdr);
 int coroutine_fn nbd_receive_reply(BlockDriverState *bs, QIOChannel *ioc,
NBDReply *reply, Error **errp);
 int nbd_client(int fd);
diff --git a/nbd/nbd-internal.h b/nbd/nbd-internal.h
index df42fef7066..133b1d94b50 100644
--- a/nbd/nbd-internal.h
+++ b/nbd/nbd-internal.h
@@ -1,7 +1,7 @@
 /*
  * NBD Internal Declarations
  *
- * Copyright (C) 2016 Red Hat, Inc.
+ * Copyright Red Hat
  *
  * This work is licensed under the terms of the GNU GPL, version 2 or later.
  * See the COPYING file in the top-level directory.
@@ -44,7 +44,6 @@
 #define NBD_OLDSTYLE_NEGOTIATE_SIZE (8 + 8 + 8 + 4 + 124)

 #define NBD_INIT_MAGIC  0x4e42444d41474943LL /* ASCII "NBDMAGIC" */
-#define NBD_REQUEST_MAGIC   0x25609513
 #define NBD_OPTS_MAGIC  0x49484156454F5054LL /* ASCII "IHAVEOPT" */
 #define NBD_CLIENT_MAGIC0x420281861253LL
 #define NBD_REP_MAGIC   0x0003e889045565a9LL
diff --git a/block/nbd.c b/block/nbd.c
index a3f8f8a9d5e..6ad6a4f5ecd 100644
--- a/block/nbd.c
+++ b/block/nbd.c
@@ -1,8 +1,8 @@
 /*
- * QEMU Block driver for  NBD
+ * QEMU Block driver for NBD
  *
  * Copyright (c) 2019 Virtuozzo International GmbH.
- * Copyright (C) 2016 Red Hat, Inc.
+ * Copyright Red Hat
  * Copyright (C) 2008 Bull S.A.S.
  * Author: Laurent Vivier 
  *
@@ -341,7 +341,7 @@ int coroutine_fn nbd_co_do_establish_connection(BlockDriverState *bs,
  */
 

[PATCH v3 06/14] nbd/server: Refactor handling of request payload

2023-05-15 Thread Eric Blake
Upcoming additions to support NBD 64-bit effect lengths allow for the
possibility to distinguish between payload length (capped at 32M) and
effect length (up to 63 bits).  Without that extension, only the
NBD_CMD_WRITE request has a payload; but with the extension, it makes
sense to allow at least NBD_CMD_BLOCK_STATUS to have both a payload
and effect length (where the payload is a limited-size struct that in
turn gives the real effect length as well as a subset of known ids
for which status is requested).  Other future NBD commands may also
have a request payload, so the 64-bit extension introduces a new
NBD_CMD_FLAG_PAYLOAD_LEN that distinguishes between whether the header
length is a payload length or an effect length, rather than
hard-coding the decision based on the command.  Note that we do not
support the payload version of BLOCK_STATUS yet.

For this patch, no semantic change is intended for a compliant client.
For a non-compliant client, it is possible that the error behavior
changes (a different message, a change on whether the connection is
killed or remains alive for the next command, or so forth), but all
errors should still be handled gracefully.

Signed-off-by: Eric Blake 
---
 nbd/server.c | 55 +---
 nbd/trace-events |  1 +
 2 files changed, 39 insertions(+), 17 deletions(-)

diff --git a/nbd/server.c b/nbd/server.c
index cf38a104d9a..5812a773ace 100644
--- a/nbd/server.c
+++ b/nbd/server.c
@@ -2316,6 +2316,8 @@ static int coroutine_fn nbd_co_receive_request(NBDRequestData *req, NBDRequest *
Error **errp)
 {
 NBDClient *client = req->client;
+bool extended_with_payload;
+int payload_len = 0;
 int valid_flags;
 int ret;

@@ -2329,27 +2331,41 @@ static int coroutine_fn nbd_co_receive_request(NBDRequestData *req, NBDRequest *
 trace_nbd_co_receive_request_decode_type(request->handle, request->type,
  nbd_cmd_lookup(request->type));

-if (request->type != NBD_CMD_WRITE) {
-/* No payload, we are ready to read the next request.  */
-req->complete = true;
-}
-
 if (request->type == NBD_CMD_DISC) {
 /* Special case: we're going to disconnect without a reply,
  * whether or not flags, from, or len are bogus */
+req->complete = true;
 return -EIO;
 }

+/* Payload and buffer handling. */
+extended_with_payload = client->header_style >= NBD_HEADER_EXTENDED &&
+(request->flags & NBD_CMD_FLAG_PAYLOAD_LEN);
 if (request->type == NBD_CMD_READ || request->type == NBD_CMD_WRITE ||
-request->type == NBD_CMD_CACHE)
-{
+request->type == NBD_CMD_CACHE || extended_with_payload) {
 if (request->len > NBD_MAX_BUFFER_SIZE) {
 error_setg(errp, "len (%" PRIu64 ") is larger than max len (%u)",
request->len, NBD_MAX_BUFFER_SIZE);
 return -EINVAL;
 }

-if (request->type != NBD_CMD_CACHE) {
+if (request->type == NBD_CMD_WRITE || extended_with_payload) {
+payload_len = request->len;
+if (request->type != NBD_CMD_WRITE) {
+/*
+ * For now, we don't support payloads on other
+ * commands; but we can keep the connection alive.
+ */
+request->len = 0;
+} else if (client->header_style >= NBD_HEADER_EXTENDED &&
+   !extended_with_payload) {
+/* The client is noncompliant. Trace it, but proceed. */
+trace_nbd_co_receive_ext_payload_compliance(request->from,
+request->len);
+}
+}
+
+if (request->type == NBD_CMD_WRITE || request->type == NBD_CMD_READ) {
 req->data = blk_try_blockalign(client->exp->common.blk,
request->len);
 if (req->data == NULL) {
@@ -2359,18 +2375,20 @@ static int coroutine_fn nbd_co_receive_request(NBDRequestData *req, NBDRequest *
 }
 }

-if (request->type == NBD_CMD_WRITE) {
-assert(request->len <= NBD_MAX_BUFFER_SIZE);
-if (nbd_read(client->ioc, req->data, request->len, "CMD_WRITE data",
- errp) < 0)
-{
+if (payload_len) {
+if (req->data) {
+ret = nbd_read(client->ioc, req->data, payload_len,
+   "CMD_WRITE data", errp);
+} else {
+ret = nbd_drop(client->ioc, payload_len, errp);
+}
+if (ret < 0) {
 return -EIO;
 }
-req->complete = true;
-
 trace_nbd_co_receive_request_payload_received(request->handle,
-  request->len);
+  payload_len);
 }
+req->complete = 
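
The payload/effect-length decision made above can be summarized in one hypothetical predicate; both NBD_CMD_* constants here are stand-in values, not taken from this excerpt:

#include <stdint.h>

#define NBD_CMD_WRITE            1         /* stand-in value */
#define NBD_CMD_FLAG_PAYLOAD_LEN (1 << 5)  /* stand-in value */

/* When does the header's length field count payload bytes that must be
 * read (or dropped) off the wire, rather than just an effect length? */
static uint64_t request_payload_len(uint16_t type, uint16_t flags,
                                    int extended_hdrs, uint64_t len)
{
    int payload_flag = extended_hdrs && (flags & NBD_CMD_FLAG_PAYLOAD_LEN);

    if (type == NBD_CMD_WRITE || payload_flag) {
        return len;    /* payload follows the header */
    }
    return 0;          /* effect length only; nothing further to read */
}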

[PATCH v3 11/14] nbd/client: Accept 64-bit block status chunks

2023-05-15 Thread Eric Blake
Because we use NBD_CMD_FLAG_REQ_ONE with NBD_CMD_BLOCK_STATUS, a
client in narrow mode should not be able to provoke a server into
sending a block status result larger than the client's 32-bit request.
But in extended mode, a 64-bit status request must be able to handle a
64-bit status result, once a future patch enables the client
requesting extended mode.  We can also tolerate a non-compliant server
sending the new chunk even when it should not.

In normal execution, we are only requesting "base:allocation" which
never exceeds 32 bits. But during testing with x-dirty-bitmap, we can
force qemu to connect to some other context that might have 64-bit
status bits; however, we ignore those upper bits (other than mapping
qemu:allocation-depth into something that 'qemu-img map --output=json'
can expose), and since it is only testing, we really don't bother with
checking whether more than the two least-significant bits are set.

Signed-off-by: Eric Blake 
---
 block/nbd.c| 39 ---
 block/trace-events |  1 +
 2 files changed, 29 insertions(+), 11 deletions(-)

diff --git a/block/nbd.c b/block/nbd.c
index d6caea44928..150dfe7170c 100644
--- a/block/nbd.c
+++ b/block/nbd.c
@@ -610,13 +610,16 @@ static int nbd_parse_offset_hole_payload(BDRVNBDState *s,
  */
 static int nbd_parse_blockstatus_payload(BDRVNBDState *s,
  NBDStructuredReplyChunk *chunk,
- uint8_t *payload, uint64_t orig_length,
- NBDExtent *extent, Error **errp)
+ uint8_t *payload, bool wide,
+ uint64_t orig_length,
+ NBDExtentExt *extent, Error **errp)
 {
 uint32_t context_id;
+uint32_t count = 0;
+size_t len = wide ? sizeof(*extent) : sizeof(NBDExtent);

 /* The server succeeded, so it must have sent [at least] one extent */
-if (chunk->length < sizeof(context_id) + sizeof(*extent)) {
+if (chunk->length < sizeof(context_id) + wide * sizeof(count) + len) {
 error_setg(errp, "Protocol error: invalid payload for "
  "NBD_REPLY_TYPE_BLOCK_STATUS");
 return -EINVAL;
@@ -631,8 +634,14 @@ static int nbd_parse_blockstatus_payload(BDRVNBDState *s,
 return -EINVAL;
 }

-extent->length = payload_advance32(&payload);
-extent->flags = payload_advance32(&payload);
+if (wide) {
+count = payload_advance32(&payload);
+extent->length = payload_advance64(&payload);
+extent->flags = payload_advance64(&payload);
+} else {
+extent->length = payload_advance32(&payload);
+extent->flags = payload_advance32(&payload);
+}

 if (extent->length == 0) {
 error_setg(errp, "Protocol error: server sent status chunk with "
@@ -672,7 +681,8 @@ static int nbd_parse_blockstatus_payload(BDRVNBDState *s,
  * connection; just ignore trailing extents, and clamp things to
  * the length of our request.
  */
-if (chunk->length > sizeof(context_id) + sizeof(*extent)) {
+if (count != wide ||
+chunk->length > sizeof(context_id) + wide * sizeof(count) + len) {
 trace_nbd_parse_blockstatus_compliance("more than one extent");
 }
 if (extent->length > orig_length) {
@@ -1117,7 +1127,7 @@ static int coroutine_fn nbd_co_receive_cmdread_reply(BDRVNBDState *s, uint64_t h

 static int coroutine_fn nbd_co_receive_blockstatus_reply(BDRVNBDState *s,
  uint64_t handle, uint64_t length,
- NBDExtent *extent,
+ NBDExtentExt *extent,
  int *request_ret, Error **errp)
 {
 NBDReplyChunkIter iter;
@@ -1125,6 +1135,7 @@ static int coroutine_fn nbd_co_receive_blockstatus_reply(BDRVNBDState *s,
 void *payload = NULL;
 Error *local_err = NULL;
 bool received = false;
+bool wide = false;

 assert(!extent->length);
 NBD_FOREACH_REPLY_CHUNK(s, iter, handle, false, NULL, &reply, &payload) {
@@ -1134,7 +1145,13 @@ static int coroutine_fn nbd_co_receive_blockstatus_reply(BDRVNBDState *s,
 assert(nbd_reply_is_structured(&reply));

 switch (chunk->type) {
+case NBD_REPLY_TYPE_BLOCK_STATUS_EXT:
+wide = true;
+/* fallthrough */
 case NBD_REPLY_TYPE_BLOCK_STATUS:
+if (s->info.extended_headers != wide) {
+trace_nbd_extended_headers_compliance("block_status");
+}
 if (received) {
 nbd_channel_error(s, -EINVAL);
 error_setg(_err, "Several BLOCK_STATUS chunks in reply");
@@ -1142,9 +1159,9 @@ static int coroutine_fn nbd_co_receive_blockstatus_reply(BDRVNBDState *s,
 }
 received = true;

-ret = 
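
A stand-alone sketch of the narrow/wide extent decoding the hunk above implements; rd32/rd64 are local stand-ins for payload_advance32/64 and assume a little-endian host for brevity:

#include <stdint.h>
#include <string.h>

typedef struct { uint64_t length; uint64_t flags; } ExtentExt;

static uint32_t rd32(const uint8_t **p)
{
    uint32_t v; memcpy(&v, *p, 4); *p += 4; return __builtin_bswap32(v);
}

static uint64_t rd64(const uint8_t **p)
{
    uint64_t v; memcpy(&v, *p, 8); *p += 8; return __builtin_bswap64(v);
}

/* Decode the first extent after the context id; wide selects the 64-bit
 * NBD_REPLY_TYPE_BLOCK_STATUS_EXT layout with its extra extent count. */
static void parse_first_extent(const uint8_t *p, int wide, ExtentExt *out)
{
    if (wide) {
        (void)rd32(&p);           /* extent count */
        out->length = rd64(&p);
        out->flags  = rd64(&p);
    } else {
        out->length = rd32(&p);
        out->flags  = rd32(&p);
    }
}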

[PATCH v3 02/14] nbd/client: Add safety check on chunk payload length

2023-05-15 Thread Eric Blake
Our existing use of structured replies either reads into a qiov capped
at 32M (NBD_CMD_READ) or caps allocation to 1000 bytes (see
NBD_MAX_MALLOC_PAYLOAD in block/nbd.c).  But the existing length
checks are rather late; if we encounter a buggy (or malicious) server
that sends a super-large payload length, we should drop the connection
right then rather than assuming the layer on top will be careful.
This becomes more important when we permit 64-bit lengths which are
even more likely to have the potential for attempted denial of service
abuse.

Signed-off-by: Eric Blake 
---
 nbd/client.c | 12 
 1 file changed, 12 insertions(+)

diff --git a/nbd/client.c b/nbd/client.c
index ff75722e487..46f476400ab 100644
--- a/nbd/client.c
+++ b/nbd/client.c
@@ -1413,6 +1413,18 @@ static int nbd_receive_structured_reply_chunk(QIOChannel *ioc,
 chunk->handle = be64_to_cpu(chunk->handle);
 chunk->length = be32_to_cpu(chunk->length);

+/*
+ * Because we use BLOCK_STATUS with REQ_ONE, and cap READ requests
+ * at 32M, no valid server should send us payload larger than
+ * this.  Even if we stopped using REQ_ONE, sane servers will cap
+ * the number of extents they return for block status.
+ */
+if (chunk->length > NBD_MAX_BUFFER_SIZE + sizeof(NBDStructuredReadData)) {
+error_setg(errp, "server chunk %" PRIu32 " (%s) payload is too long",
+   chunk->type, nbd_rep_lookup(chunk->type));
+return -EINVAL;
+}
+
 return 0;
 }

-- 
2.40.1
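
In isolation, the new bound amounts to the following check, assuming NBD_MAX_BUFFER_SIZE is the 32M cap the commit message mentions; read_data_hdr stands in for sizeof(NBDStructuredReadData):

#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define NBD_MAX_BUFFER_SIZE (32 * 1024 * 1024)  /* the 32M read cap */

/* read_data_hdr stands in for sizeof(NBDStructuredReadData): the bytes an
 * OFFSET_DATA chunk carries in addition to the actual read payload. */
static bool chunk_payload_ok(uint32_t length, size_t read_data_hdr)
{
    return length <= NBD_MAX_BUFFER_SIZE + read_data_hdr;
}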




[PATCH v3 01/14] nbd/client: Use smarter assert

2023-05-15 Thread Eric Blake
Assigning strlen() to a uint32_t and then asserting that it isn't too
large doesn't catch the case of an input string 4G in length.
Thankfully, the incoming strings can never be that large: if the
export name or query is reflecting a string the client got from the
server, we already guarantee that we dropped the NBD connection if the
server sent more than 32M in a single reply to our NBD_OPT_* request;
if the export name is coming from qemu, nbd_receive_negotiate()
asserted that strlen(info->name) <= NBD_MAX_STRING_SIZE; and
similarly, a query string via x->dirty_bitmap coming from the user was
bounds-checked in either qemu-nbd or by the limitations of QMP.
Still, it doesn't hurt to be more explicit in how we write our
assertions to not have to analyze whether inadvertent wraparound is
possible.

Fixes: 93676c88 ("nbd: Don't send oversize strings", v4.2.0)
Reported-by: Dr. David Alan Gilbert 
Signed-off-by: Eric Blake 

---

Looking through older branches, I came across this one that was never
applied at the time, but which also had a useful review comment from
Vladimir that invalidates the R-b it had back then.

v2 was here: https://lists.gnu.org/archive/html/qemu-devel/2022-10/msg02733.html
since then - update David's email, use strnlen before strlen
---
 nbd/client.c | 7 ---
 1 file changed, 4 insertions(+), 3 deletions(-)

diff --git a/nbd/client.c b/nbd/client.c
index 30d5383cb19..ff75722e487 100644
--- a/nbd/client.c
+++ b/nbd/client.c
@@ -650,19 +650,20 @@ static int nbd_send_meta_query(QIOChannel *ioc, uint32_t opt,
Error **errp)
 {
 int ret;
-uint32_t export_len = strlen(export);
+uint32_t export_len;
 uint32_t queries = !!query;
 uint32_t query_len = 0;
 uint32_t data_len;
 char *data;
 char *p;

+assert(strnlen(export, NBD_MAX_STRING_SIZE + 1) <= NBD_MAX_STRING_SIZE);
+export_len = strlen(export);
 data_len = sizeof(export_len) + export_len + sizeof(queries);
-assert(export_len <= NBD_MAX_STRING_SIZE);
 if (query) {
+assert(strnlen(query, NBD_MAX_STRING_SIZE + 1) <= NBD_MAX_STRING_SIZE);
 query_len = strlen(query);
 data_len += sizeof(query_len) + query_len;
-assert(query_len <= NBD_MAX_STRING_SIZE);
 } else {
 assert(opt == NBD_OPT_LIST_META_CONTEXT);
 }
-- 
2.40.1
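
The wraparound the commit message describes is easy to reproduce in isolation; a 4 GiB string is impractical to allocate here, so this sketch performs the narrowing cast directly (4096 stands in for NBD_MAX_STRING_SIZE; a 64-bit host is assumed):

#include <assert.h>
#include <stdint.h>
#include <stdio.h>

int main(void)
{
    /* Suppose strlen() returned 4 GiB + 5 (possible on a 64-bit host). */
    size_t real_len = ((size_t)1 << 32) + 5;
    uint32_t narrowed = (uint32_t)real_len;        /* wraps around to 5 */

    printf("real=%zu narrowed=%u\n", real_len, narrowed);
    assert(narrowed <= 4096);   /* passes despite the 4 GiB input! */
    /* strnlen(s, 4096 + 1) <= 4096, checked before any narrowing cast,
     * cannot be fooled this way. */
    return 0;
}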




[PATCH v3 09/14] nbd/server: Initial support for extended headers

2023-05-15 Thread Eric Blake
Time to support clients that request extended headers.  Now we can
finally reach the code added across several previous patches.

Even though the NBD spec has been altered to allow us to accept
NBD_CMD_READ larger than the max payload size (provided our response
is a hole or broken up over more than one data chunk), we are not
planning to take advantage of that, and continue to cap NBD_CMD_READ
to 32M regardless of header size.

For NBD_CMD_WRITE_ZEROES and NBD_CMD_TRIM, the block layer already
supports 64-bit operations without any effort on our part.  For
NBD_CMD_BLOCK_STATUS, the client's length is a hint, and the previous
patch took care of implementing the required
NBD_REPLY_TYPE_BLOCK_STATUS_EXT.

Signed-off-by: Eric Blake 
---
 nbd/nbd-internal.h |   5 +-
 nbd/server.c   | 130 +++--
 2 files changed, 106 insertions(+), 29 deletions(-)

diff --git a/nbd/nbd-internal.h b/nbd/nbd-internal.h
index 133b1d94b50..dfa02f77ee4 100644
--- a/nbd/nbd-internal.h
+++ b/nbd/nbd-internal.h
@@ -34,8 +34,11 @@
  * https://github.com/yoe/nbd/blob/master/doc/proto.md
  */

-/* Size of all NBD_OPT_*, without payload */
+/* Size of all compact NBD_CMD_*, without payload */
 #define NBD_REQUEST_SIZE(4 + 2 + 2 + 8 + 8 + 4)
+/* Size of all extended NBD_CMD_*, without payload */
+#define NBD_EXTENDED_REQUEST_SIZE   (4 + 2 + 2 + 8 + 8 + 8)
+
 /* Size of all NBD_REP_* sent in answer to most NBD_OPT_*, without payload */
 #define NBD_REPLY_SIZE  (4 + 4 + 8)
 /* Size of reply to NBD_OPT_EXPORT_NAME */
diff --git a/nbd/server.c b/nbd/server.c
index b4c15ae1a14..6475a76c1f0 100644
--- a/nbd/server.c
+++ b/nbd/server.c
@@ -482,6 +482,10 @@ static int nbd_negotiate_handle_export_name(NBDClient *client, bool no_zeroes,
 [10 .. 133]   reserved (0) [unless no_zeroes]
  */
 trace_nbd_negotiate_handle_export_name();
+if (client->header_style >= NBD_HEADER_EXTENDED) {
+error_setg(errp, "Extended headers already negotiated");
+return -EINVAL;
+}
 if (client->optlen > NBD_MAX_STRING_SIZE) {
 error_setg(errp, "Bad length received");
 return -EINVAL;
@@ -1262,7 +1266,11 @@ static int nbd_negotiate_options(NBDClient *client, Error **errp)
 case NBD_OPT_STRUCTURED_REPLY:
 if (length) {
 ret = nbd_reject_length(client, false, errp);
-} else if (client->header_style >= NBD_HEADER_STRUCTURED) {
+} else if (client->header_style >= NBD_HEADER_EXTENDED) {
+ret = nbd_negotiate_send_rep_err(
+client, NBD_REP_ERR_EXT_HEADER_REQD, errp,
+"extended headers already negotiated");
+} else if (client->header_style == NBD_HEADER_STRUCTURED) {
 ret = nbd_negotiate_send_rep_err(
 client, NBD_REP_ERR_INVALID, errp,
 "structured reply already negotiated");
@@ -1278,6 +1286,19 @@ static int nbd_negotiate_options(NBDClient *client, Error **errp)
  errp);
 break;

+case NBD_OPT_EXTENDED_HEADERS:
+if (length) {
+ret = nbd_reject_length(client, false, errp);
+} else if (client->header_style >= NBD_HEADER_EXTENDED) {
+ret = nbd_negotiate_send_rep_err(
+client, NBD_REP_ERR_INVALID, errp,
+"extended headers already negotiated");
+} else {
+ret = nbd_negotiate_send_rep(client, NBD_REP_ACK, errp);
+client->header_style = NBD_HEADER_EXTENDED;
+}
+break;
+
 default:
 ret = nbd_opt_drop(client, NBD_REP_ERR_UNSUP, errp,
"Unsupported option %" PRIu32 " (%s)",
@@ -1413,11 +1434,13 @@ nbd_read_eof(NBDClient *client, void *buffer, size_t size, Error **errp)
 static int coroutine_fn nbd_receive_request(NBDClient *client, NBDRequest *request,
 Error **errp)
 {
-uint8_t buf[NBD_REQUEST_SIZE];
-uint32_t magic;
+uint8_t buf[NBD_EXTENDED_REQUEST_SIZE];
+uint32_t magic, expect;
 int ret;
+size_t size = client->header_style == NBD_HEADER_EXTENDED ?
+NBD_EXTENDED_REQUEST_SIZE : NBD_REQUEST_SIZE;

-ret = nbd_read_eof(client, buf, sizeof(buf), errp);
+ret = nbd_read_eof(client, buf, size, errp);
 if (ret < 0) {
 return ret;
 }
@@ -1425,13 +1448,21 @@ static int coroutine_fn nbd_receive_request(NBDClient *client, NBDRequest *reque
 return -EIO;
 }

-/* Request
-   [ 0 ..  3]   magic   (NBD_REQUEST_MAGIC)
-   [ 4 ..  5]   flags   (NBD_CMD_FLAG_FUA, ...)
-   [ 6 ..  7]   type(NBD_CMD_READ, ...)
-   [ 8 .. 15]   handle
-   [16 .. 23]   from
-   [24 

[PATCH v3 14/14] nbd/server: Add FLAG_PAYLOAD support to CMD_BLOCK_STATUS

2023-05-15 Thread Eric Blake
Allow a client to request a subset of negotiated meta contexts.  For
example, a client may ask to use a single connection to learn about
both block status and dirty bitmaps, but where the dirty bitmap
queries only need to be performed on a subset of the disk; forcing the
server to compute that information on block status queries in the rest
of the disk is wasted effort (both at the server, and on the amount of
traffic sent over the wire to be parsed and ignored by the client).

Qemu as an NBD client never requests to use more than one meta
context, so it has no need to use block status payloads.  Testing this
instead requires support from libnbd, which CAN access multiple meta
contexts in parallel from a single NBD connection; an interop test
submitted to the libnbd project at the same time as this patch
demonstrates the feature working, as well as testing some corner cases
(for example, when the payload length is longer than the export
length), although other corner cases (like passing the same id
duplicated) requires a protocol fuzzer because libnbd is not wired up
to break the protocol that badly.

This also includes tweaks to 'qemu-nbd --list' to show when a server
is advertising the capability, and to the testsuite to reflect the
addition to that output.

Signed-off-by: Eric Blake 
---
 docs/interop/nbd.txt  |   2 +-
 include/block/nbd.h   |  32 --
 nbd/server.c  | 106 +-
 qemu-nbd.c|   1 +
 nbd/trace-events  |   1 +
 tests/qemu-iotests/223.out|  12 +-
 tests/qemu-iotests/307.out|  10 +-
 .../tests/nbd-qemu-allocation.out |   2 +-
 8 files changed, 136 insertions(+), 30 deletions(-)

diff --git a/docs/interop/nbd.txt b/docs/interop/nbd.txt
index abaf4c28a96..83d85ce8d13 100644
--- a/docs/interop/nbd.txt
+++ b/docs/interop/nbd.txt
@@ -69,4 +69,4 @@ NBD_CMD_BLOCK_STATUS for "qemu:dirty-bitmap:", NBD_CMD_CACHE
 NBD_CMD_FLAG_FAST_ZERO
 * 5.2: NBD_CMD_BLOCK_STATUS for "qemu:allocation-depth"
 * 7.1: NBD_FLAG_CAN_MULTI_CONN for shareable writable exports
-* 8.1: NBD_OPT_EXTENDED_HEADERS
+* 8.1: NBD_OPT_EXTENDED_HEADERS, NBD_FLAG_BLOCK_STATUS_PAYLOAD
diff --git a/include/block/nbd.h b/include/block/nbd.h
index 6696d61bd59..3d8d7150121 100644
--- a/include/block/nbd.h
+++ b/include/block/nbd.h
@@ -175,6 +175,12 @@ typedef struct NBDExtentExt {
 uint64_t flags; /* NBD_STATE_* */
 } QEMU_PACKED NBDExtentExt;

+/* Client payload for limiting NBD_CMD_BLOCK_STATUS reply */
+typedef struct NBDBlockStatusPayload {
+uint64_t effect_length;
+/* uint32_t ids[] follows, array length implied by header */
+} QEMU_PACKED NBDBlockStatusPayload;
+
 /* Transmission (export) flags: sent from server to client during handshake,
but describe what will happen during transmission */
 enum {
@@ -191,20 +197,22 @@ enum {
 NBD_FLAG_SEND_RESIZE_BIT=  9, /* Send resize */
 NBD_FLAG_SEND_CACHE_BIT = 10, /* Send CACHE (prefetch) */
 NBD_FLAG_SEND_FAST_ZERO_BIT = 11, /* FAST_ZERO flag for WRITE_ZEROES */
+NBD_FLAG_BLOCK_STAT_PAYLOAD_BIT = 12, /* PAYLOAD flag for BLOCK_STATUS */
 };

-#define NBD_FLAG_HAS_FLAGS (1 << NBD_FLAG_HAS_FLAGS_BIT)
-#define NBD_FLAG_READ_ONLY (1 << NBD_FLAG_READ_ONLY_BIT)
-#define NBD_FLAG_SEND_FLUSH(1 << NBD_FLAG_SEND_FLUSH_BIT)
-#define NBD_FLAG_SEND_FUA  (1 << NBD_FLAG_SEND_FUA_BIT)
-#define NBD_FLAG_ROTATIONAL(1 << NBD_FLAG_ROTATIONAL_BIT)
-#define NBD_FLAG_SEND_TRIM (1 << NBD_FLAG_SEND_TRIM_BIT)
-#define NBD_FLAG_SEND_WRITE_ZEROES (1 << NBD_FLAG_SEND_WRITE_ZEROES_BIT)
-#define NBD_FLAG_SEND_DF   (1 << NBD_FLAG_SEND_DF_BIT)
-#define NBD_FLAG_CAN_MULTI_CONN(1 << NBD_FLAG_CAN_MULTI_CONN_BIT)
-#define NBD_FLAG_SEND_RESIZE   (1 << NBD_FLAG_SEND_RESIZE_BIT)
-#define NBD_FLAG_SEND_CACHE(1 << NBD_FLAG_SEND_CACHE_BIT)
-#define NBD_FLAG_SEND_FAST_ZERO(1 << NBD_FLAG_SEND_FAST_ZERO_BIT)
+#define NBD_FLAG_HAS_FLAGS  (1 << NBD_FLAG_HAS_FLAGS_BIT)
+#define NBD_FLAG_READ_ONLY  (1 << NBD_FLAG_READ_ONLY_BIT)
+#define NBD_FLAG_SEND_FLUSH (1 << NBD_FLAG_SEND_FLUSH_BIT)
+#define NBD_FLAG_SEND_FUA   (1 << NBD_FLAG_SEND_FUA_BIT)
+#define NBD_FLAG_ROTATIONAL (1 << NBD_FLAG_ROTATIONAL_BIT)
+#define NBD_FLAG_SEND_TRIM  (1 << NBD_FLAG_SEND_TRIM_BIT)
+#define NBD_FLAG_SEND_WRITE_ZEROES  (1 << NBD_FLAG_SEND_WRITE_ZEROES_BIT)
+#define NBD_FLAG_SEND_DF(1 << NBD_FLAG_SEND_DF_BIT)
+#define NBD_FLAG_CAN_MULTI_CONN (1 << NBD_FLAG_CAN_MULTI_CONN_BIT)
+#define NBD_FLAG_SEND_RESIZE(1 << NBD_FLAG_SEND_RESIZE_BIT)
+#define NBD_FLAG_SEND_CACHE (1 << NBD_FLAG_SEND_CACHE_BIT)
+#define NBD_FLAG_SEND_FAST_ZERO (1 << NBD_FLAG_SEND_FAST_ZERO_BIT)
+#define NBD_FLAG_BLOCK_STAT_PAYLOAD (1 << NBD_FLAG_BLOCK_STAT_PAYLOAD_BIT)

 /* New-style 
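
Given the NBDBlockStatusPayload layout above, the client payload for N requested context ids is 8 + 4*N bytes; a trivial helper sketch (hypothetical, not part of the patch):

#include <stddef.h>
#include <stdint.h>

/* Wire size of an NBD_CMD_BLOCK_STATUS payload selecting nr_ids of the
 * negotiated contexts: an 8-byte effect length plus one 32-bit id each. */
static size_t block_status_payload_len(size_t nr_ids)
{
    return sizeof(uint64_t) + nr_ids * sizeof(uint32_t);
}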

[PATCH v3 08/14] nbd/server: Support 64-bit block status

2023-05-15 Thread Eric Blake
The NBD spec states that if the client negotiates extended headers,
the server must avoid NBD_REPLY_TYPE_BLOCK_STATUS and instead use
NBD_REPLY_TYPE_BLOCK_STATUS_EXT which supports 64-bit lengths, even if
the reply does not need more than 32 bits.  As of this patch,
client->header_style is still never NBD_HEADER_EXTENDED, so the code
added here does not take effect until the next patch enables
negotiation.

For now, all metacontexts that we know how to export never populate
more than 32 bits of information, so we don't have to worry about
NBD_REP_ERR_EXT_HEADER_REQD or filtering during handshake, and we
always send all zeroes for the upper 32 bits of status during
NBD_CMD_BLOCK_STATUS.

Note that we previously had some interesting size-juggling on call
chains, such as:

nbd_co_send_block_status(uint32_t length)
-> blockstatus_to_extents(uint32_t bytes)
  -> bdrv_block_status_above(bytes, _t num)
  -> nbd_extent_array_add(uint64_t num)
-> store num in 32-bit length

But we were lucky that it never overflowed: bdrv_block_status_above
never sets num larger than bytes, and we had previously been capping
'bytes' at 32 bits (since the protocol does not allow sending a larger
request without extended headers).  This patch adds some assertions
that ensure we continue to avoid overflowing 32 bits for a narrow
client, while fully utilizing 64-bits all the way through when the
client understands that.

Signed-off-by: Eric Blake 
---
 nbd/server.c | 86 +---
 1 file changed, 62 insertions(+), 24 deletions(-)

diff --git a/nbd/server.c b/nbd/server.c
index ffab51efd26..b4c15ae1a14 100644
--- a/nbd/server.c
+++ b/nbd/server.c
@@ -2073,7 +2073,15 @@ static int coroutine_fn nbd_co_send_sparse_read(NBDClient *client,
 }

 typedef struct NBDExtentArray {
-NBDExtent *extents;
+NBDHeaderStyle style;   /* 32- or 64-bit extent descriptions */
+union {
+NBDStructuredMeta id;   /* style == NBD_HEADER_STRUCTURED */
+NBDStructuredMetaExt meta;  /* style == NBD_HEADER_EXTENDED */
+};
+union {
+NBDExtent *narrow;  /* style == NBD_HEADER_STRUCTURED */
+NBDExtentExt *extents;  /* style == NBD_HEADER_EXTENDED */
+};
 unsigned int nb_alloc;
 unsigned int count;
 uint64_t total_length;
@@ -2081,12 +2089,15 @@ typedef struct NBDExtentArray {
 bool converted_to_be;
 } NBDExtentArray;

-static NBDExtentArray *nbd_extent_array_new(unsigned int nb_alloc)
+static NBDExtentArray *nbd_extent_array_new(unsigned int nb_alloc,
+NBDHeaderStyle style)
 {
 NBDExtentArray *ea = g_new0(NBDExtentArray, 1);

+assert(style >= NBD_HEADER_STRUCTURED);
 ea->nb_alloc = nb_alloc;
-ea->extents = g_new(NBDExtent, nb_alloc);
+ea->extents = g_new(NBDExtentExt, nb_alloc);
+ea->style = style;
 ea->can_add = true;

 return ea;
@@ -2100,17 +2111,37 @@ static void nbd_extent_array_free(NBDExtentArray *ea)
 G_DEFINE_AUTOPTR_CLEANUP_FUNC(NBDExtentArray, nbd_extent_array_free)

 /* Further modifications of the array after conversion are abandoned */
-static void nbd_extent_array_convert_to_be(NBDExtentArray *ea)
+static void nbd_extent_array_convert_to_be(NBDExtentArray *ea,
+   uint32_t context_id,
+   struct iovec *iov)
 {
 int i;

 assert(!ea->converted_to_be);
+assert(iov[0].iov_base == &ea->meta);
+assert(iov[1].iov_base == ea->extents);
 ea->can_add = false;
 ea->converted_to_be = true;

-for (i = 0; i < ea->count; i++) {
-ea->extents[i].flags = cpu_to_be32(ea->extents[i].flags);
-ea->extents[i].length = cpu_to_be32(ea->extents[i].length);
+stl_be_p(&ea->meta.context_id, context_id);
+if (ea->style >= NBD_HEADER_EXTENDED) {
+stl_be_p(&ea->meta.count, ea->count);
+for (i = 0; i < ea->count; i++) {
+ea->extents[i].length = cpu_to_be64(ea->extents[i].length);
+ea->extents[i].flags = cpu_to_be64(ea->extents[i].flags);
+}
+iov[0].iov_len = sizeof(ea->meta);
+iov[1].iov_len = ea->count * sizeof(ea->extents[0]);
+} else {
+/* Conversion reduces memory usage, order of iteration matters */
+for (i = 0; i < ea->count; i++) {
+assert(ea->extents[i].length <= UINT32_MAX);
+assert((uint32_t) ea->extents[i].flags == ea->extents[i].flags);
+ea->narrow[i].length = cpu_to_be32(ea->extents[i].length);
+ea->narrow[i].flags = cpu_to_be32(ea->extents[i].flags);
+}
+iov[0].iov_len = sizeof(ea->id);
+iov[1].iov_len = ea->count * sizeof(ea->narrow[0]);
 }
 }

@@ -2124,19 +2155,23 @@ static void nbd_extent_array_convert_to_be(NBDExtentArray *ea)
  * would result in an incorrect range reported to the client)
  */
 static int nbd_extent_array_add(NBDExtentArray *ea,
-
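
The narrow-client clamp that the commit message and the asserts above describe, extracted into a stand-alone sketch:

#include <assert.h>
#include <stdint.h>

/* Clamp one 64-bit extent into the 32-bit wire fields used for a
 * narrow (structured-reply) client; mirrors the asserts above. */
static void narrow_extent(uint64_t length64, uint64_t flags64,
                          uint32_t *length32, uint32_t *flags32)
{
    assert(length64 <= UINT32_MAX);
    assert((uint32_t)flags64 == flags64);
    *length32 = (uint32_t)length64;
    *flags32  = (uint32_t)flags64;
}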

[PATCH v3 03/14] nbd/server: Prepare for alternate-size headers

2023-05-15 Thread Eric Blake
Upstream NBD now documents[1] an extension that supports 64-bit effect
lengths in requests.  As part of that extension, the size of the reply
headers will change in order to permit a 64-bit length in the reply
for symmetry[2].  Additionally, where the reply header is currently
16 bytes for simple reply, and 20 bytes for structured reply; with the
extension enabled, there will only be one structured reply type, of 32
bytes.  Since we are already wired up to use iovecs, it is easiest to
allow for this change in header size by splitting each structured
reply across two iovecs, one for the header (which will become
variable-length in a future patch according to client negotiation),
and the other for the payload, and removing the header from the
payload struct definitions.  Interestingly, the client side code never
utilized the packed types, so only the server code needs to be
updated.

[1] 
https://github.com/NetworkBlockDevice/nbd/blob/extension-ext-header/doc/proto.md
as of NBD commit e6f3b94a934

[2] Note that on the surface, this is because some future server might
permit a 4G+ NBD_CMD_READ and need to reply with that much data in one
transaction.  But even though the extended reply length is widened to
64 bits, for now the NBD spec is clear that servers will not reply
with more than a maximum payload bounded by the 32-bit
NBD_INFO_BLOCK_SIZE field; allowing a client and server to mutually
agree to transactions larger than 4G would require yet another
extension.

Signed-off-by: Eric Blake 
---
 include/block/nbd.h |  8 +++---
 nbd/server.c| 64 -
 2 files changed, 44 insertions(+), 28 deletions(-)

diff --git a/include/block/nbd.h b/include/block/nbd.h
index a4c98169c39..f1d838d24f5 100644
--- a/include/block/nbd.h
+++ b/include/block/nbd.h
@@ -96,28 +96,28 @@ typedef union NBDReply {

 /* Header of chunk for NBD_REPLY_TYPE_OFFSET_DATA */
 typedef struct NBDStructuredReadData {
-NBDStructuredReplyChunk h; /* h.length >= 9 */
+/* header's .length >= 9 */
 uint64_t offset;
 /* At least one byte of data payload follows, calculated from h.length */
 } QEMU_PACKED NBDStructuredReadData;

 /* Complete chunk for NBD_REPLY_TYPE_OFFSET_HOLE */
 typedef struct NBDStructuredReadHole {
-NBDStructuredReplyChunk h; /* h.length == 12 */
+/* header's length == 12 */
 uint64_t offset;
 uint32_t length;
 } QEMU_PACKED NBDStructuredReadHole;

 /* Header of all NBD_REPLY_TYPE_ERROR* errors */
 typedef struct NBDStructuredError {
-NBDStructuredReplyChunk h; /* h.length >= 6 */
+/* header's length >= 6 */
 uint32_t error;
 uint16_t message_length;
 } QEMU_PACKED NBDStructuredError;

 /* Header of NBD_REPLY_TYPE_BLOCK_STATUS */
 typedef struct NBDStructuredMeta {
-NBDStructuredReplyChunk h; /* h.length >= 12 (at least one extent) */
+/* header's length >= 12 (at least one extent) */
 uint32_t context_id;
 /* extents follows */
 } QEMU_PACKED NBDStructuredMeta;
diff --git a/nbd/server.c b/nbd/server.c
index e239c2890fa..eefe3401560 100644
--- a/nbd/server.c
+++ b/nbd/server.c
@@ -1885,9 +1885,12 @@ static int coroutine_fn nbd_co_send_iov(NBDClient *client, struct iovec *iov,
 return ret;
 }

-static inline void set_be_simple_reply(NBDSimpleReply *reply, uint64_t error,
-   uint64_t handle)
+static inline void set_be_simple_reply(NBDClient *client, struct iovec *iov,
+   uint64_t error, uint64_t handle)
 {
+NBDSimpleReply *reply = iov->iov_base;
+
+iov->iov_len = sizeof(*reply);
 stl_be_p(&reply->magic, NBD_SIMPLE_REPLY_MAGIC);
 stl_be_p(&reply->error, error);
 stq_be_p(&reply->handle, handle);
@@ -1900,23 +1903,27 @@ static int coroutine_fn nbd_co_send_simple_reply(NBDClient *client,
  size_t len,
  Error **errp)
 {
-NBDSimpleReply reply;
+NBDReply hdr;
 int nbd_err = system_errno_to_nbd_errno(error);
 struct iovec iov[] = {
-{.iov_base = &reply, .iov_len = sizeof(reply)},
+{.iov_base = &hdr},
 {.iov_base = data, .iov_len = len}
 };

 trace_nbd_co_send_simple_reply(handle, nbd_err, nbd_err_lookup(nbd_err),
len);
-set_be_simple_reply(&reply, nbd_err, handle);
+set_be_simple_reply(client, &iov[0], nbd_err, handle);

 return nbd_co_send_iov(client, iov, len ? 2 : 1, errp);
 }

-static inline void set_be_chunk(NBDStructuredReplyChunk *chunk, uint16_t flags,
-uint16_t type, uint64_t handle, uint32_t 
length)
+static inline void set_be_chunk(NBDClient *client, struct iovec *iov,
+uint16_t flags, uint16_t type,
+uint64_t handle, uint32_t length)
 {
+NBDStructuredReplyChunk *chunk = iov->iov_base;
+
+iov->iov_len = sizeof(*chunk);
stl_be_p(&chunk->magic, 

Re: [PATCH 10/21] migration: Move rate_limit_max and rate_limit_used to migration_stats

2023-05-15 Thread Juan Quintela
Cédric Le Goater  wrote:
> On 5/15/23 15:09, Juan Quintela wrote:
>> Cédric Le Goater  wrote:
>>> On 5/8/23 15:08, Juan Quintela wrote:
 This way we can make them atomic and use these functions from any
 place.  I also moved all functions that use rate_limit to
 migration-stats.
 Functions got renamed, they are not qemu_file anymore.
 qemu_file_rate_limit -> migration_rate_limit_exceeded
 qemu_file_set_rate_limit -> migration_rate_limit_set
 qemu_file_get_rate_limit -> migration_rate_limit_get
 qemu_file_reset_rate_limit -> migration_rate_limit_reset
 qemu_file_acct_rate_limit -> migration_rate_limit_account.
 Signed-off-by: Juan Quintela 
 ---
 If you have any good suggestion for better names, I am all ears.
>>>
>>> May be :
>>>
>>>   qemu_file_rate_limit -> migration_rate_limit_is_exceeded
>> I try not to put _is_ in function names.  If it needs to be there, I
>> think that I need to rename the function.
>
> It is common practice for functions doing a simple test and returning a bool.
> No big deal anyway.
>  > migration_rate_limit_exceeded()
>> seems clear to me.
>> 
>>>   qemu_file_acct_rate_limit -> migration_rate_limit_inc
>> My problem for this one is that we are not increasing the
>> rate_limit, we
>> are "decreasing" the amount of data we have for this period.  That is
>> why I thought about _account(), but who knows.
>> 
>>> Also, migration_rate_limit() would need some prefix to understand what is
>>> its purpose.
>> What do you mean here?
>
> I am referring to :
>
>   /* Returns true if the rate limiting was broken by an urgent request */
>   bool migration_rate_limit(void)
>   {
>   ...
>   return urgent;
>   }

out of ideas:

migration_rate_wait()
- the good
  * we wait if we have to
- the bad
  we can be interrupted if there is anything urgent
  we only wait if counters says that we have to

migration_rate_check()
* we always check
* we return a value consistent with checking
* but we check if we have to wait, not if there is anything urgent

I am leaving it with the migration_rate_limit() name until someone comes
up with a better one.  It is not worse than what we have in tree.
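
To make the semantics concrete, a rough sketch of the behaviour we are
trying to name (urgent_request_pending() and wait_for_next_period() are
hypothetical stand-ins, not code in the tree):

  /* Returns true if the rate limiting was broken by an urgent request */
  bool migration_rate_limit(void)
  {
      bool urgent = false;

      /* Wait only while this period's budget is spent... */
      while (migration_rate_limit_exceeded() && !urgent) {
          /* ...but an urgent request interrupts the wait. */
          urgent = urgent_request_pending();
          if (!urgent) {
              wait_for_next_period();
          }
      }
      return urgent;
  }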


>
> which existed prior to the name changes and I thought migration_rate_limit()
> would suffer the same fate. May be keep the '_limit' suffix for this one if
> you remove it for the others ?

I am not sure if migration_rate() is better than migration_rate_limit().

Later, Juan.




Re: [PATCH 1/8] block: Call .bdrv_co_create(_opts) unlocked

2023-05-15 Thread Kevin Wolf
Am 12.05.2023 um 18:12 hat Eric Blake geschrieben:
> 
> On Wed, May 10, 2023 at 10:35:54PM +0200, Kevin Wolf wrote:
> > 
> > These are functions that modify the graph, so they must be able to take
> > a writer lock. This is impossible if they already hold the reader lock.
> > If they need a reader lock for some of their operations, they should
> > take it internally.
> > 
> > Many of them go through blk_*(), which will always take the lock itself.
> > Direct calls of bdrv_*() need to take the reader lock. Note that while
> > locking for bdrv_co_*() calls is checked by TSA, this is not the case
> > for the mixed_coroutine_fns bdrv_*(). Holding the lock is still required
> > when they are called from coroutine context like here!
> > 
> > This effectively reverts 4ec8df0183, but adds some internal locking
> > instead.
> > 
> > Signed-off-by: Kevin Wolf 
> > ---
> 
> > +++ b/block/qcow2.c
> 
> > -static int coroutine_fn
> > +static int coroutine_fn GRAPH_UNLOCKED
> >  qcow2_co_create(BlockdevCreateOptions *create_options, Error **errp)
> >  {
> >  BlockdevCreateOptionsQcow2 *qcow2_opts;
> > @@ -3724,8 +3726,10 @@ qcow2_co_create(BlockdevCreateOptions 
> > *create_options, Error **errp)
> >  goto out;
> >  }
> >  
> > +bdrv_graph_co_rdlock();
> >  ret = qcow2_alloc_clusters(blk_bs(blk), 3 * cluster_size);
> >  if (ret < 0) {
> > +bdrv_graph_co_rdunlock();
> >  error_setg_errno(errp, -ret, "Could not allocate clusters for 
> > qcow2 "
> >   "header and refcount table");
> >  goto out;
> > @@ -3743,6 +3747,8 @@ qcow2_co_create(BlockdevCreateOptions 
> > *create_options, Error **errp)
> >  
> >  /* Create a full header (including things like feature table) */
> >  ret = qcow2_update_header(blk_bs(blk));
> > +bdrv_graph_co_rdunlock();
> > +
> 
> If we ever inject any 'goto out' in the elided lines, we're in
> trouble.  Would this be any safer by wrapping the intervening
> statements in a scope-guarded lock?

TSA doesn't understand these guards, which is why they are only
annotated as assertions (I think we talked about this in my previous
series), at the cost of leaving unlocking unchecked. So in cases where
the scope isn't the full function, individual calls are better at the
moment. Once clang implements support for __attribute__((cleanup)), we
can maybe change this.

Of course, TSA solves the very maintenance problem you're concerned
about: With a 'goto out' added, compilation on clang fails because it
sees that there is a code path that doesn't unlock. So at least it makes
the compromise not terrible.

For example, if I comment out the unlock in the error case in the first
hunk, this is what I get:

../block/qcow2.c:3825:5: error: mutex 'graph_lock' is not held on every path 
through here [-Werror,-Wthread-safety-analysis]
blk_co_unref(blk);
^
../block/qcow2.c:3735:5: note: mutex acquired here
bdrv_graph_co_rdlock();
^
1 error generated.
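
(For anyone wanting to reproduce this class of diagnostic outside QEMU,
here is a minimal standalone sketch using clang's generic TSA attributes
rather than our wrapper macros; compile with clang -Wthread-safety:

  typedef struct __attribute__((capability("mutex"))) { int x; } Lock;

  Lock graph_lock;

  void rdlock(void) __attribute__((acquire_shared_capability(graph_lock)));
  void rdunlock(void) __attribute__((release_shared_capability(graph_lock)));

  int work(int fail)
  {
      rdlock();
      if (fail) {
          return -1;   /* lock not released on this path: clang warns */
      }
      rdunlock();
      return 0;
  }

The early return without rdunlock() draws the same "mutex 'graph_lock'
is not held on every path through here" warning as above.)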

Kevin




[PULL v2 11/16] qemu-iotests: test zone append operation

2023-05-15 Thread Stefan Hajnoczi
From: Sam Li 

The patch tests zone append writes by reporting the zone wp after
the call completes. The "zap -p" option prints the sector offset
value after completion, which should be the start sector where the
append write began.
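
For example (a sketch against the null_blk device used by the test
below; output abbreviated):

  $ qemu-io --image-opts -n driver=host_device,filename=/dev/nullb0 \
        -c "zap -p 0 0x1000 0x2000"
  After zap done, the append sector is 0x0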

Signed-off-by: Sam Li 
Reviewed-by: Stefan Hajnoczi 
Message-id: 20230508051510.177850-4-faithilike...@gmail.com
Signed-off-by: Stefan Hajnoczi 
---
 qemu-io-cmds.c | 75 ++
 tests/qemu-iotests/tests/zoned | 16 +++
 tests/qemu-iotests/tests/zoned.out | 16 +++
 3 files changed, 107 insertions(+)

diff --git a/qemu-io-cmds.c b/qemu-io-cmds.c
index f35ea627d7..3f75d2f5a6 100644
--- a/qemu-io-cmds.c
+++ b/qemu-io-cmds.c
@@ -1874,6 +1874,80 @@ static const cmdinfo_t zone_reset_cmd = {
 .oneline = "reset a zone write pointer in zone block device",
 };
 
+static int do_aio_zone_append(BlockBackend *blk, QEMUIOVector *qiov,
+  int64_t *offset, int flags, int *total)
+{
+int async_ret = NOT_DONE;
+
+blk_aio_zone_append(blk, offset, qiov, flags, aio_rw_done, &async_ret);
+while (async_ret == NOT_DONE) {
+main_loop_wait(false);
+}
+
+*total = qiov->size;
+return async_ret < 0 ? async_ret : 1;
+}
+
+static int zone_append_f(BlockBackend *blk, int argc, char **argv)
+{
+int ret;
+bool pflag = false;
+int flags = 0;
+int total = 0;
+int64_t offset;
+char *buf;
+int c, nr_iov;
+int pattern = 0xcd;
+QEMUIOVector qiov;
+
+if (optind > argc - 3) {
+return -EINVAL;
+}
+
+if ((c = getopt(argc, argv, "p")) != -1) {
+pflag = true;
+}
+
+offset = cvtnum(argv[optind]);
+if (offset < 0) {
+print_cvtnum_err(offset, argv[optind]);
+return offset;
+}
+optind++;
+nr_iov = argc - optind;
+buf = create_iovec(blk, &qiov, &argv[optind], nr_iov, pattern,
+   flags & BDRV_REQ_REGISTERED_BUF);
+if (buf == NULL) {
+return -EINVAL;
+}
+ret = do_aio_zone_append(blk, &qiov, &offset, flags, &total);
+if (ret < 0) {
+printf("zone append failed: %s\n", strerror(-ret));
+goto out;
+}
+
+if (pflag) {
+printf("After zap done, the append sector is 0x%" PRIx64 "\n",
+   tosector(offset));
+}
+
+out:
+qemu_io_free(blk, buf, qiov.size,
+ flags & BDRV_REQ_REGISTERED_BUF);
+qemu_iovec_destroy(&qiov);
+return ret;
+}
+
+static const cmdinfo_t zone_append_cmd = {
+.name = "zone_append",
+.altname = "zap",
+.cfunc = zone_append_f,
+.argmin = 3,
+.argmax = 4,
+.args = "offset len [len..]",
+.oneline = "append write a number of bytes at a specified offset",
+};
+
 static int truncate_f(BlockBackend *blk, int argc, char **argv);
 static const cmdinfo_t truncate_cmd = {
 .name   = "truncate",
@@ -2672,6 +2746,7 @@ static void __attribute((constructor)) 
init_qemuio_commands(void)
qemuio_add_command(&zone_close_cmd);
qemuio_add_command(&zone_finish_cmd);
qemuio_add_command(&zone_reset_cmd);
+qemuio_add_command(&zone_append_cmd);
qemuio_add_command(&truncate_cmd);
qemuio_add_command(&length_cmd);
qemuio_add_command(&info_cmd);
diff --git a/tests/qemu-iotests/tests/zoned b/tests/qemu-iotests/tests/zoned
index 56f60616b5..3d23ce9cc1 100755
--- a/tests/qemu-iotests/tests/zoned
+++ b/tests/qemu-iotests/tests/zoned
@@ -82,6 +82,22 @@ echo "(5) resetting the second zone"
 $QEMU_IO $IMG -c "zrs 268435456 268435456"
 echo "After resetting a zone:"
 $QEMU_IO $IMG -c "zrp 268435456 1"
+echo
+echo
+echo "(6) append write" # the physical block size of the device is 4096
+$QEMU_IO $IMG -c "zrp 0 1"
+$QEMU_IO $IMG -c "zap -p 0 0x1000 0x2000"
+echo "After appending the first zone firstly:"
+$QEMU_IO $IMG -c "zrp 0 1"
+$QEMU_IO $IMG -c "zap -p 0 0x1000 0x2000"
+echo "After appending the first zone secondly:"
+$QEMU_IO $IMG -c "zrp 0 1"
+$QEMU_IO $IMG -c "zap -p 268435456 0x1000 0x2000"
+echo "After appending the second zone firstly:"
+$QEMU_IO $IMG -c "zrp 268435456 1"
+$QEMU_IO $IMG -c "zap -p 268435456 0x1000 0x2000"
+echo "After appending the second zone secondly:"
+$QEMU_IO $IMG -c "zrp 268435456 1"
 
 # success, all done
 echo "*** done"
diff --git a/tests/qemu-iotests/tests/zoned.out 
b/tests/qemu-iotests/tests/zoned.out
index b2d061da49..fe53ba4744 100644
--- a/tests/qemu-iotests/tests/zoned.out
+++ b/tests/qemu-iotests/tests/zoned.out
@@ -50,4 +50,20 @@ start: 0x80000, len 0x80000, cap 0x80000, wptr 0x100000, zcond:14, [type: 2]
 (5) resetting the second zone
 After resetting a zone:
 start: 0x80000, len 0x80000, cap 0x80000, wptr 0x80000, zcond:1, [type: 2]
+
+
+(6) append write
+start: 0x0, len 0x80000, cap 0x80000, wptr 0x0, zcond:1, [type: 2]
+After zap done, the append sector is 0x0
+After appending the first zone firstly:
+start: 0x0, len 0x80000, cap 0x80000, wptr 0x18, zcond:2, [type: 2]
+After zap done, the append sector is 0x18
+After appending the first zone secondly:
+start: 0x0, len 

[PULL v2 10/16] block: introduce zone append write for zoned devices

2023-05-15 Thread Stefan Hajnoczi
From: Sam Li 

A zone append command is a write operation that specifies the first
logical block of a zone as the write position. When writing to a zoned
block device using zone append, the byte offset of the call may point at
any position within the zone to which the data is being appended. Upon
completion the device will respond with the position where the data has
been written in the zone.
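
In terms of the new API below, a hedged caller sketch (coroutine
context; blk and qiov are assumed to be set up elsewhere):

  int64_t offset = zone_start;   /* in: a byte position inside the zone */
  int ret = blk_co_zone_append(blk, &offset, &qiov, 0);
  if (ret == 0) {
      /* out: offset now holds where the data was actually written */
  }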

Signed-off-by: Sam Li 
Reviewed-by: Dmitry Fomichev 
Reviewed-by: Stefan Hajnoczi 
Message-id: 20230508051510.177850-3-faithilike...@gmail.com
Signed-off-by: Stefan Hajnoczi 
---
 include/block/block-io.h  |  4 ++
 include/block/block_int-common.h  |  3 ++
 include/block/raw-aio.h   |  4 +-
 include/sysemu/block-backend-io.h |  9 +
 block/block-backend.c | 61 +++
 block/file-posix.c| 58 +
 block/io.c| 27 ++
 block/io_uring.c  |  4 ++
 block/linux-aio.c |  3 ++
 block/raw-format.c|  8 
 10 files changed, 173 insertions(+), 8 deletions(-)

diff --git a/include/block/block-io.h b/include/block/block-io.h
index f099b204bc..a27e471a87 100644
--- a/include/block/block-io.h
+++ b/include/block/block-io.h
@@ -122,6 +122,10 @@ int coroutine_fn GRAPH_RDLOCK 
bdrv_co_zone_report(BlockDriverState *bs,
 int coroutine_fn GRAPH_RDLOCK bdrv_co_zone_mgmt(BlockDriverState *bs,
 BlockZoneOp op,
 int64_t offset, int64_t len);
+int coroutine_fn GRAPH_RDLOCK bdrv_co_zone_append(BlockDriverState *bs,
+  int64_t *offset,
+  QEMUIOVector *qiov,
+  BdrvRequestFlags flags);
 
 bool bdrv_can_write_zeroes_with_unmap(BlockDriverState *bs);
 int bdrv_block_status(BlockDriverState *bs, int64_t offset,
diff --git a/include/block/block_int-common.h b/include/block/block_int-common.h
index 1674b4745d..dbec0e3bb4 100644
--- a/include/block/block_int-common.h
+++ b/include/block/block_int-common.h
@@ -723,6 +723,9 @@ struct BlockDriver {
 BlockZoneDescriptor *zones);
 int coroutine_fn (*bdrv_co_zone_mgmt)(BlockDriverState *bs, BlockZoneOp op,
 int64_t offset, int64_t len);
+int coroutine_fn (*bdrv_co_zone_append)(BlockDriverState *bs,
+int64_t *offset, QEMUIOVector *qiov,
+BdrvRequestFlags flags);
 
 /* removable device specific */
 bool coroutine_fn GRAPH_RDLOCK_PTR (*bdrv_co_is_inserted)(
diff --git a/include/block/raw-aio.h b/include/block/raw-aio.h
index afb9bdf51b..0fe85ade77 100644
--- a/include/block/raw-aio.h
+++ b/include/block/raw-aio.h
@@ -30,6 +30,7 @@
 #define QEMU_AIO_TRUNCATE 0x0080
 #define QEMU_AIO_ZONE_REPORT  0x0100
 #define QEMU_AIO_ZONE_MGMT0x0200
+#define QEMU_AIO_ZONE_APPEND  0x0400
 #define QEMU_AIO_TYPE_MASK \
 (QEMU_AIO_READ | \
  QEMU_AIO_WRITE | \
@@ -40,7 +41,8 @@
  QEMU_AIO_COPY_RANGE | \
  QEMU_AIO_TRUNCATE | \
  QEMU_AIO_ZONE_REPORT | \
- QEMU_AIO_ZONE_MGMT)
+ QEMU_AIO_ZONE_MGMT | \
+ QEMU_AIO_ZONE_APPEND)
 
 /* AIO flags */
 #define QEMU_AIO_MISALIGNED   0x1000
diff --git a/include/sysemu/block-backend-io.h 
b/include/sysemu/block-backend-io.h
index eb1c1ebfec..d62a7ee773 100644
--- a/include/sysemu/block-backend-io.h
+++ b/include/sysemu/block-backend-io.h
@@ -53,6 +53,9 @@ BlockAIOCB *blk_aio_zone_report(BlockBackend *blk, int64_t 
offset,
 BlockAIOCB *blk_aio_zone_mgmt(BlockBackend *blk, BlockZoneOp op,
   int64_t offset, int64_t len,
   BlockCompletionFunc *cb, void *opaque);
+BlockAIOCB *blk_aio_zone_append(BlockBackend *blk, int64_t *offset,
+QEMUIOVector *qiov, BdrvRequestFlags flags,
+BlockCompletionFunc *cb, void *opaque);
 BlockAIOCB *blk_aio_pdiscard(BlockBackend *blk, int64_t offset, int64_t bytes,
  BlockCompletionFunc *cb, void *opaque);
 void blk_aio_cancel_async(BlockAIOCB *acb);
@@ -208,6 +211,12 @@ int coroutine_fn blk_co_zone_mgmt(BlockBackend *blk, 
BlockZoneOp op,
   int64_t offset, int64_t len);
 int co_wrapper_mixed blk_zone_mgmt(BlockBackend *blk, BlockZoneOp op,
int64_t offset, int64_t len);
+int coroutine_fn blk_co_zone_append(BlockBackend *blk, int64_t *offset,
+QEMUIOVector *qiov,
+BdrvRequestFlags flags);
+int co_wrapper_mixed blk_zone_append(BlockBackend *blk, int64_t *offset,
+ QEMUIOVector *qiov,
+ BdrvRequestFlags flags);
 
 int co_wrapper_mixed 

[PULL v2 05/16] block: add zoned BlockDriver check to block layer

2023-05-15 Thread Stefan Hajnoczi
From: Sam Li 

Putting zoned/non-zoned BlockDrivers on top of each other is not
allowed.
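
As a sketch of the kind of graph this guards against, e.g. a format
driver that does not set supports_zoned_children on top of a
host-managed device (hypothetical command line; the hunk below covers
the bdrv_add_child() path):

  $ qemu-system-x86_64 \
      --blockdev host_device,node-name=zbd0,filename=/dev/nullb0,cache.direct=on \
      --blockdev qcow2,node-name=fmt0,file=zbd0

This is expected to fail with an error along the lines of "Cannot add a
zoned child to a ... parent".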

Signed-off-by: Sam Li 
Reviewed-by: Stefan Hajnoczi 
Reviewed-by: Hannes Reinecke 
Reviewed-by: Dmitry Fomichev 
Acked-by: Kevin Wolf 
Signed-off-by: Stefan Hajnoczi 
Message-id: 20230508045533.175575-6-faithilike...@gmail.com
Message-id: 20230324090605.28361-6-faithilike...@gmail.com
[Adjust commit message prefix as suggested by Philippe Mathieu-Daudé
 and clarify that the check is about zoned
BlockDrivers.
--Stefan]
Signed-off-by: Stefan Hajnoczi 
---
 include/block/block_int-common.h |  5 +
 block.c  | 19 +++
 block/file-posix.c   | 12 
 block/raw-format.c   |  1 +
 4 files changed, 37 insertions(+)

diff --git a/include/block/block_int-common.h b/include/block/block_int-common.h
index b2612f06ec..e6975d3933 100644
--- a/include/block/block_int-common.h
+++ b/include/block/block_int-common.h
@@ -137,6 +137,11 @@ struct BlockDriver {
  */
 bool is_format;
 
+/*
+ * Set to true if the BlockDriver supports zoned children.
+ */
+bool supports_zoned_children;
+
 /*
  * Drivers not implementing bdrv_parse_filename nor bdrv_open should have
  * this field set to true, except ones that are defined only by their
diff --git a/block.c b/block.c
index dad9a4fa43..f04a6ad4e8 100644
--- a/block.c
+++ b/block.c
@@ -7982,6 +7982,25 @@ void bdrv_add_child(BlockDriverState *parent_bs, 
BlockDriverState *child_bs,
 return;
 }
 
+/*
+ * Non-zoned block drivers do not follow zoned storage constraints
+ * (i.e. sequential writes to zones). Refuse mixing zoned and non-zoned
+ * drivers in a graph.
+ */
+if (!parent_bs->drv->supports_zoned_children &&
+child_bs->bl.zoned == BLK_Z_HM) {
+/*
+ * The host-aware model allows zoned storage constraints and random
+ * write. Allow mixing host-aware and non-zoned drivers. Using
+ * host-aware device as a regular device.
+ */
+error_setg(errp, "Cannot add a %s child to a %s parent",
+   child_bs->bl.zoned == BLK_Z_HM ? "zoned" : "non-zoned",
+   parent_bs->drv->supports_zoned_children ?
+   "support zoned children" : "not support zoned children");
+return;
+}
+
if (!QLIST_EMPTY(&child_bs->parents)) {
 error_setg(errp, "The node %s already has a parent",
child_bs->node_name);
diff --git a/block/file-posix.c b/block/file-posix.c
index 11eaa780df..9a52ad4c65 100644
--- a/block/file-posix.c
+++ b/block/file-posix.c
@@ -776,6 +776,18 @@ static int raw_open_common(BlockDriverState *bs, QDict 
*options,
 goto fail;
 }
 }
+#ifdef CONFIG_BLKZONED
+/*
+ * The kernel page cache does not reliably work for writes to SWR zones
+ * of zoned block device because it can not guarantee the order of writes.
+ */
+if ((bs->bl.zoned != BLK_Z_NONE) &&
+(!(s->open_flags & O_DIRECT))) {
+error_setg(errp, "The driver supports zoned devices, and it requires "
+ "cache.direct=on, which was not specified.");
+return -EINVAL; /* No host kernel page cache */
+}
+#endif
 
 if (S_ISBLK(st.st_mode)) {
 #ifdef __linux__
diff --git a/block/raw-format.c b/block/raw-format.c
index 6c11b7235f..bbb644cd95 100644
--- a/block/raw-format.c
+++ b/block/raw-format.c
@@ -623,6 +623,7 @@ static void raw_child_perm(BlockDriverState *bs, BdrvChild 
*c,
 BlockDriver bdrv_raw = {
 .format_name  = "raw",
 .instance_size= sizeof(BDRVRawState),
+.supports_zoned_children = true,
.bdrv_probe   = &raw_probe,
.bdrv_reopen_prepare  = &raw_reopen_prepare,
.bdrv_reopen_commit   = &raw_reopen_commit,
-- 
2.40.1




[PULL v2 12/16] block: add some trace events for zone append

2023-05-15 Thread Stefan Hajnoczi
From: Sam Li 

Signed-off-by: Sam Li 
Reviewed-by: Dmitry Fomichev 
Reviewed-by: Stefan Hajnoczi 
Message-id: 20230508051510.177850-5-faithilike...@gmail.com
Signed-off-by: Stefan Hajnoczi 
---
 block/file-posix.c | 3 +++
 block/trace-events | 2 ++
 2 files changed, 5 insertions(+)

diff --git a/block/file-posix.c b/block/file-posix.c
index 179263fec6..0ab158efba 100644
--- a/block/file-posix.c
+++ b/block/file-posix.c
@@ -2513,6 +2513,8 @@ out:
 if (!BDRV_ZT_IS_CONV(*wp)) {
 if (type & QEMU_AIO_ZONE_APPEND) {
 *s->offset = *wp;
+trace_zbd_zone_append_complete(bs, *s->offset
+>> BDRV_SECTOR_BITS);
 }
 /* Advance the wp if needed */
 if (offset + bytes > *wp) {
@@ -3554,6 +3556,7 @@ static int coroutine_fn 
raw_co_zone_append(BlockDriverState *bs,
 len += iov_len;
 }
 
+trace_zbd_zone_append(bs, *offset >> BDRV_SECTOR_BITS);
 return raw_co_prw(bs, *offset, len, qiov, QEMU_AIO_ZONE_APPEND);
 }
 #endif
diff --git a/block/trace-events b/block/trace-events
index 3f4e1d088a..32665158d6 100644
--- a/block/trace-events
+++ b/block/trace-events
@@ -211,6 +211,8 @@ file_hdev_is_sg(int type, int version) "SG device found: 
type=%d, version=%d"
 file_flush_fdatasync_failed(int err) "errno %d"
 zbd_zone_report(void *bs, unsigned int nr_zones, int64_t sector) "bs %p report 
%d zones starting at sector offset 0x%" PRIx64 ""
 zbd_zone_mgmt(void *bs, const char *op_name, int64_t sector, int64_t len) "bs 
%p %s starts at sector offset 0x%" PRIx64 " over a range of 0x%" PRIx64 " 
sectors"
+zbd_zone_append(void *bs, int64_t sector) "bs %p append at sector offset 0x%" 
PRIx64 ""
+zbd_zone_append_complete(void *bs, int64_t sector) "bs %p returns append 
sector 0x%" PRIx64 ""
 
 # ssh.c
 sftp_error(const char *op, const char *ssh_err, int ssh_err_code, int 
sftp_err_code) "%s failed: %s (libssh error code: %d, sftp error code: %d)"
-- 
2.40.1




[PULL v2 09/16] file-posix: add tracking of the zone write pointers

2023-05-15 Thread Stefan Hajnoczi
From: Sam Li 

Since Linux doesn't have a user API to issue zone append operations to
zoned devices from user space, the file-posix driver is modified to add
zone append emulation using regular writes. To do this, the file-posix
driver tracks the wp location of all zones of the device. It uses an
array of uint64_t. The most significant bit of each wp location indicates
whether the zone is a conventional zone (see the sketch after the list below).

A zone's wp can change due to the following operations:
- zone reset: change the wp to the start offset of that zone
- zone finish: change to the end location of that zone
- write to a zone
- zone append
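
A sketch of the encoding described above (this mirrors the
BDRV_ZT_IS_CONV macro added below; variable names are illustrative
only):

  uint64_t wp = zone_start_in_bytes;
  if (zone_is_conventional) {
      wp |= 1ULL << 63;                    /* MSB set: conventional zone */
  }
  bool conv = wp & (1ULL << 63);           /* what BDRV_ZT_IS_CONV(wp) tests */
  uint64_t location = wp & ~(1ULL << 63);  /* the actual write pointer */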

Signed-off-by: Sam Li 
Message-id: 20230508051510.177850-2-faithilike...@gmail.com
[Fix errno propagation from handle_aiocb_zone_mgmt()
--Stefan]
Signed-off-by: Stefan Hajnoczi 
---
 include/block/block-common.h |  14 +++
 include/block/block_int-common.h |   5 +
 block/file-posix.c   | 178 ++-
 3 files changed, 193 insertions(+), 4 deletions(-)

diff --git a/include/block/block-common.h b/include/block/block-common.h
index 1576fcf2ed..93196229ac 100644
--- a/include/block/block-common.h
+++ b/include/block/block-common.h
@@ -118,6 +118,14 @@ typedef struct BlockZoneDescriptor {
 BlockZoneState state;
 } BlockZoneDescriptor;
 
+/*
+ * Track write pointers of a zone in bytes.
+ */
+typedef struct BlockZoneWps {
+CoMutex colock;
+uint64_t wp[];
+} BlockZoneWps;
+
 typedef struct BlockDriverInfo {
 /* in bytes, 0 if irrelevant */
 int cluster_size;
@@ -240,6 +248,12 @@ typedef enum {
 #define BDRV_SECTOR_BITS   9
 #define BDRV_SECTOR_SIZE   (1ULL << BDRV_SECTOR_BITS)
 
+/*
+ * Get the first most significant bit of wp. If it is zero, then
+ * the zone type is SWR.
+ */
+#define BDRV_ZT_IS_CONV(wp)    (wp & (1ULL << 63))
+
 #define BDRV_REQUEST_MAX_SECTORS MIN_CONST(SIZE_MAX >> BDRV_SECTOR_BITS, \
INT_MAX >> BDRV_SECTOR_BITS)
 #define BDRV_REQUEST_MAX_BYTES (BDRV_REQUEST_MAX_SECTORS << BDRV_SECTOR_BITS)
diff --git a/include/block/block_int-common.h b/include/block/block_int-common.h
index e6975d3933..1674b4745d 100644
--- a/include/block/block_int-common.h
+++ b/include/block/block_int-common.h
@@ -891,6 +891,8 @@ typedef struct BlockLimits {
 
 /* maximum number of active zones */
 uint32_t max_active_zones;
+
+uint32_t write_granularity;
 } BlockLimits;
 
 typedef struct BdrvOpBlocker BdrvOpBlocker;
@@ -1252,6 +1254,9 @@ struct BlockDriverState {
 CoMutex bsc_modify_lock;
 /* Always non-NULL, but must only be dereferenced under an RCU read guard 
*/
 BdrvBlockStatusCache *block_status_cache;
+
+/* array of write pointers' location of each zone in the zoned device. */
+BlockZoneWps *wps;
 };
 
 struct BlockBackendRootState {
diff --git a/block/file-posix.c b/block/file-posix.c
index e143de8217..56f57515d4 100644
--- a/block/file-posix.c
+++ b/block/file-posix.c
@@ -1323,9 +1323,93 @@ static int hdev_get_max_segments(int fd, struct stat *st)
 }
 
 #if defined(CONFIG_BLKZONED)
+/*
+ * If the reset_all flag is true, then the wps of zone whose state is
+ * not readonly or offline should be all reset to the start sector.
+ * Else, take the real wp of the device.
+ */
+static int get_zones_wp(BlockDriverState *bs, int fd, int64_t offset,
+unsigned int nrz, bool reset_all)
+{
+struct blk_zone *blkz;
+size_t rep_size;
+uint64_t sector = offset >> BDRV_SECTOR_BITS;
+BlockZoneWps *wps = bs->wps;
+unsigned int j = offset / bs->bl.zone_size;
+unsigned int n = 0, i = 0;
+int ret;
+rep_size = sizeof(struct blk_zone_report) + nrz * sizeof(struct blk_zone);
+g_autofree struct blk_zone_report *rep = NULL;
+
+rep = g_malloc(rep_size);
+blkz = (struct blk_zone *)(rep + 1);
+while (n < nrz) {
+memset(rep, 0, rep_size);
+rep->sector = sector;
+rep->nr_zones = nrz - n;
+
+do {
+ret = ioctl(fd, BLKREPORTZONE, rep);
+} while (ret != 0 && errno == EINTR);
+if (ret != 0) {
+error_report("%d: ioctl BLKREPORTZONE at %" PRId64 " failed %d",
+fd, offset, errno);
+return -errno;
+}
+
+if (!rep->nr_zones) {
+break;
+}
+
+for (i = 0; i < rep->nr_zones; ++i, ++n, ++j) {
+/*
+ * The wp tracking cares only about sequential writes required and
+ * sequential write preferred zones so that the wp can advance to
+ * the right location.
+ * Use the most significant bit of the wp location to indicate the
+ * zone type: 0 for SWR/SWP zones and 1 for conventional zones.
+ */
+if (blkz[i].type == BLK_ZONE_TYPE_CONVENTIONAL) {
+wps->wp[j] |= 1ULL << 63;
+} else {
+switch(blkz[i].cond) {
+case BLK_ZONE_COND_FULL:
+

[PULL v2 08/16] docs/zoned-storage: add zoned device documentation

2023-05-15 Thread Stefan Hajnoczi
From: Sam Li 

Add the documentation about the zoned device support to virtio-blk
emulation.

Signed-off-by: Sam Li 
Reviewed-by: Stefan Hajnoczi 
Reviewed-by: Damien Le Moal 
Reviewed-by: Dmitry Fomichev 
Acked-by: Kevin Wolf 
Signed-off-by: Stefan Hajnoczi 
Message-id: 20230508045533.175575-9-faithilike...@gmail.com
Message-id: 20230324090605.28361-9-faithilike...@gmail.com
[Add index-api.rst to fix "zoned-storage.rst:document isn't included in
any toctree" error and fix pre-formatted code syntax.
--Stefan]
Signed-off-by: Stefan Hajnoczi 
---
 docs/devel/index-api.rst   |  1 +
 docs/devel/zoned-storage.rst   | 43 ++
 docs/system/qemu-block-drivers.rst.inc |  6 
 3 files changed, 50 insertions(+)
 create mode 100644 docs/devel/zoned-storage.rst

diff --git a/docs/devel/index-api.rst b/docs/devel/index-api.rst
index 60c0d7459d..7108821746 100644
--- a/docs/devel/index-api.rst
+++ b/docs/devel/index-api.rst
@@ -12,3 +12,4 @@ generated from in-code annotations to function prototypes.
memory
modules
ui
+   zoned-storage
diff --git a/docs/devel/zoned-storage.rst b/docs/devel/zoned-storage.rst
new file mode 100644
index 00..da78db2783
--- /dev/null
+++ b/docs/devel/zoned-storage.rst
@@ -0,0 +1,43 @@
+=
+zoned-storage
+=
+
+Zoned Block Devices (ZBDs) divide the LBA space into block regions called zones
+that are larger than the LBA size. They can only allow sequential writes, which
+can reduce write amplification in SSDs, and potentially lead to higher
+throughput and increased capacity. More details about ZBDs can be found at:
+
+https://zonedstorage.io/docs/introduction/zoned-storage
+
+1. Block layer APIs for zoned storage
+-
+QEMU block layer supports three zoned storage models:
+- BLK_Z_HM: The host-managed zoned model only allows sequential writes access
+to zones. It supports ZBD-specific I/O commands that can be used by a host to
+manage the zones of a device.
+- BLK_Z_HA: The host-aware zoned model allows random write operations in
+zones, making it backward compatible with regular block devices.
+- BLK_Z_NONE: The non-zoned model has no zones support. It includes both
+regular and drive-managed ZBD devices. ZBD-specific I/O commands are not
+supported.
+
+The block device information resides inside BlockDriverState. QEMU uses
+the BlockLimits struct (BlockDriverState::bl), which is continuously accessed by the
+block layer while processing I/O requests. A BlockBackend has a root pointer to
+a BlockDriverState graph (for example, raw format on top of file-posix). The
+zoned storage information can be propagated from the leaf BlockDriverState all
+the way up to the BlockBackend. If the zoned storage model in file-posix is
+set to BLK_Z_HM, then block drivers will declare support for zoned host device.
+
+The block layer APIs support commands needed for zoned storage devices,
+including report zones, four zone operations, and zone append.
+
+2. Emulating zoned storage controllers
+--
+When the BlockBackend's BlockLimits model reports a zoned storage device, users
+like the virtio-blk emulation or the qemu-io-cmds.c utility can use block layer
+APIs for zoned storage emulation or testing.
+
+For example, to test zone_report on a null_blk device using qemu-io::
+
+  $ path/to/qemu-io --image-opts -n driver=host_device,filename=/dev/nullb0 -c "zrp offset nr_zones"
diff --git a/docs/system/qemu-block-drivers.rst.inc 
b/docs/system/qemu-block-drivers.rst.inc
index dfe5d2293d..105cb9679c 100644
--- a/docs/system/qemu-block-drivers.rst.inc
+++ b/docs/system/qemu-block-drivers.rst.inc
@@ -430,6 +430,12 @@ Hard disks
   you may corrupt your host data (use the ``-snapshot`` command
   line option or modify the device permissions accordingly).
 
+Zoned block devices
+  Zoned block devices can be passed through to the guest if the emulated storage
+  controller supports zoned storage. Use ``--blockdev host_device,
+  node-name=drive0,filename=/dev/nullb0,cache.direct=on`` to pass through
+  ``/dev/nullb0`` as ``drive0``.
+
 Windows
 ^^^
 
-- 
2.40.1




[PULL v2 04/16] block/raw-format: add zone operations to pass through requests

2023-05-15 Thread Stefan Hajnoczi
From: Sam Li 

raw-format driver usually sits on top of file-posix driver. It needs to
pass through requests of zone commands.

Signed-off-by: Sam Li 
Reviewed-by: Stefan Hajnoczi 
Reviewed-by: Damien Le Moal 
Reviewed-by: Hannes Reinecke 
Reviewed-by: Dmitry Fomichev 
Acked-by: Kevin Wolf 
Signed-off-by: Stefan Hajnoczi 
Message-id: 20230508045533.175575-5-faithilike...@gmail.com
Message-id: 20230324090605.28361-5-faithilike...@gmail.com
[Adjust commit message prefix as suggested by Philippe Mathieu-Daudé
.
--Stefan]
Signed-off-by: Stefan Hajnoczi 
---
 block/raw-format.c | 17 +
 1 file changed, 17 insertions(+)

diff --git a/block/raw-format.c b/block/raw-format.c
index fd9e61f58e..6c11b7235f 100644
--- a/block/raw-format.c
+++ b/block/raw-format.c
@@ -317,6 +317,21 @@ raw_co_pdiscard(BlockDriverState *bs, int64_t offset, 
int64_t bytes)
 return bdrv_co_pdiscard(bs->file, offset, bytes);
 }
 
+static int coroutine_fn GRAPH_RDLOCK
+raw_co_zone_report(BlockDriverState *bs, int64_t offset,
+   unsigned int *nr_zones,
+   BlockZoneDescriptor *zones)
+{
+return bdrv_co_zone_report(bs->file->bs, offset, nr_zones, zones);
+}
+
+static int coroutine_fn GRAPH_RDLOCK
+raw_co_zone_mgmt(BlockDriverState *bs, BlockZoneOp op,
+ int64_t offset, int64_t len)
+{
+return bdrv_co_zone_mgmt(bs->file->bs, op, offset, len);
+}
+
 static int64_t coroutine_fn GRAPH_RDLOCK
 raw_co_getlength(BlockDriverState *bs)
 {
@@ -619,6 +634,8 @@ BlockDriver bdrv_raw = {
.bdrv_co_pwritev  = &raw_co_pwritev,
.bdrv_co_pwrite_zeroes = &raw_co_pwrite_zeroes,
.bdrv_co_pdiscard = &raw_co_pdiscard,
+.bdrv_co_zone_report  = &raw_co_zone_report,
+.bdrv_co_zone_mgmt  = &raw_co_zone_mgmt,
.bdrv_co_block_status = &raw_co_block_status,
.bdrv_co_copy_range_from = &raw_co_copy_range_from,
.bdrv_co_copy_range_to  = &raw_co_copy_range_to,
-- 
2.40.1




[PULL v2 14/16] block: add accounting for zone append operation

2023-05-15 Thread Stefan Hajnoczi
From: Sam Li 

Taking account of the new zone append write operation for zoned devices,
BLOCK_ACCT_ZONE_APPEND enum is introduced as other I/O request type (read,
write, flush).

Signed-off-by: Sam Li 
Message-id: 20230508051916.178322-3-faithilike...@gmail.com
Signed-off-by: Stefan Hajnoczi 
---
 qapi/block-core.json   | 68 --
 qapi/block.json|  4 +++
 include/block/accounting.h |  1 +
 block/qapi-sysemu.c| 11 ++
 block/qapi.c   | 18 ++
 hw/block/virtio-blk.c  |  4 +++
 tests/qemu-iotests/227.out | 18 ++
 7 files changed, 113 insertions(+), 11 deletions(-)

diff --git a/qapi/block-core.json b/qapi/block-core.json
index 187e35d473..98d9116dae 100644
--- a/qapi/block-core.json
+++ b/qapi/block-core.json
@@ -854,6 +854,10 @@
 # @min_wr_latency_ns: Minimum latency of write operations in the
 # defined interval, in nanoseconds.
 #
+# @min_zone_append_latency_ns: Minimum latency of zone append operations
+#  in the defined interval, in nanoseconds
+#  (since 8.1)
+#
 # @min_flush_latency_ns: Minimum latency of flush operations in the
 # defined interval, in nanoseconds.
 #
@@ -863,6 +867,10 @@
 # @max_wr_latency_ns: Maximum latency of write operations in the
 # defined interval, in nanoseconds.
 #
+# @max_zone_append_latency_ns: Maximum latency of zone append operations
+#  in the defined interval, in nanoseconds
+#  (since 8.1)
+#
 # @max_flush_latency_ns: Maximum latency of flush operations in the
 # defined interval, in nanoseconds.
 #
@@ -872,6 +880,10 @@
 # @avg_wr_latency_ns: Average latency of write operations in the
 # defined interval, in nanoseconds.
 #
+# @avg_zone_append_latency_ns: Average latency of zone append operations
+#  in the defined interval, in nanoseconds
+#  (since 8.1)
+#
 # @avg_flush_latency_ns: Average latency of flush operations in the
 # defined interval, in nanoseconds.
 #
@@ -881,15 +893,23 @@
 # @avg_wr_queue_depth: Average number of pending write operations in
 # the defined interval.
 #
+# @avg_zone_append_queue_depth: Average number of pending zone append
+#   operations in the defined interval
+#   (since 8.1).
+#
 # Since: 2.5
 ##
 { 'struct': 'BlockDeviceTimedStats',
   'data': { 'interval_length': 'int', 'min_rd_latency_ns': 'int',
 'max_rd_latency_ns': 'int', 'avg_rd_latency_ns': 'int',
 'min_wr_latency_ns': 'int', 'max_wr_latency_ns': 'int',
-'avg_wr_latency_ns': 'int', 'min_flush_latency_ns': 'int',
-'max_flush_latency_ns': 'int', 'avg_flush_latency_ns': 'int',
-'avg_rd_queue_depth': 'number', 'avg_wr_queue_depth': 'number' } }
+'avg_wr_latency_ns': 'int', 'min_zone_append_latency_ns': 'int',
+'max_zone_append_latency_ns': 'int',
+'avg_zone_append_latency_ns': 'int',
+'min_flush_latency_ns': 'int', 'max_flush_latency_ns': 'int',
+'avg_flush_latency_ns': 'int', 'avg_rd_queue_depth': 'number',
+'avg_wr_queue_depth': 'number',
+'avg_zone_append_queue_depth': 'number'  } }
 
 ##
 # @BlockDeviceStats:
@@ -900,6 +920,9 @@
 #
 # @wr_bytes: The number of bytes written by the device.
 #
+# @zone_append_bytes: The number of bytes appended by the zoned devices
+# (since 8.1)
+#
 # @unmap_bytes: The number of bytes unmapped by the device (Since 4.2)
 #
 # @rd_operations: The number of read operations performed by the
@@ -908,6 +931,9 @@
 # @wr_operations: The number of write operations performed by the
 # device.
 #
+# @zone_append_operations: The number of zone append operations performed
+#  by the zoned devices (since 8.1)
+#
 # @flush_operations: The number of cache flush operations performed by
 # the device (since 0.15)
 #
@@ -920,6 +946,9 @@
 # @wr_total_time_ns: Total time spent on writes in nanoseconds (since
 # 0.15).
 #
+# @zone_append_total_time_ns: Total time spent on zone append writes
+# in nanoseconds (since 8.1)
+#
 # @flush_total_time_ns: Total time spent on cache flushes in
 # nanoseconds (since 0.15).
 #
@@ -937,6 +966,9 @@
 # @wr_merged: Number of write requests that have been merged into
 # another request (Since 2.3).
 #
+# @zone_append_merged: Number of zone append requests that have been merged
+#  into another request (since 8.1)
+#
 # @unmap_merged: Number of unmap requests that have been merged into
 # another request (Since 4.2)
 #
@@ -950,6 +982,10 @@
 # @failed_wr_operations: The number of failed write operations
 # performed by the device (Since 2.5)
 #
+# @failed_zone_append_operations: The number of failed zone append write
+#

[PULL v2 07/16] block: add some trace events for new block layer APIs

2023-05-15 Thread Stefan Hajnoczi
From: Sam Li 

Signed-off-by: Sam Li 
Reviewed-by: Stefan Hajnoczi 
Reviewed-by: Dmitry Fomichev 
Acked-by: Kevin Wolf 
Signed-off-by: Stefan Hajnoczi 
Message-id: 20230508045533.175575-8-faithilike...@gmail.com
Message-id: 20230324090605.28361-8-faithilike...@gmail.com
Signed-off-by: Stefan Hajnoczi 
---
 block/file-posix.c | 3 +++
 block/trace-events | 2 ++
 2 files changed, 5 insertions(+)

diff --git a/block/file-posix.c b/block/file-posix.c
index 9a52ad4c65..e143de8217 100644
--- a/block/file-posix.c
+++ b/block/file-posix.c
@@ -3267,6 +3267,7 @@ static int coroutine_fn 
raw_co_zone_report(BlockDriverState *bs, int64_t offset,
 },
 };
 
+trace_zbd_zone_report(bs, *nr_zones, offset >> BDRV_SECTOR_BITS);
return raw_thread_pool_submit(handle_aiocb_zone_report, &acb);
 }
 #endif
@@ -3333,6 +3334,8 @@ static int coroutine_fn raw_co_zone_mgmt(BlockDriverState 
*bs, BlockZoneOp op,
 },
 };
 
+trace_zbd_zone_mgmt(bs, op_name, offset >> BDRV_SECTOR_BITS,
+len >> BDRV_SECTOR_BITS);
ret = raw_thread_pool_submit(handle_aiocb_zone_mgmt, &acb);
 if (ret != 0) {
 error_report("ioctl %s failed %d", op_name, ret);
diff --git a/block/trace-events b/block/trace-events
index 48dbf10c66..3f4e1d088a 100644
--- a/block/trace-events
+++ b/block/trace-events
@@ -209,6 +209,8 @@ file_FindEjectableOpticalMedia(const char *media) "Matching 
using %s"
 file_setup_cdrom(const char *partition) "Using %s as optical disc"
 file_hdev_is_sg(int type, int version) "SG device found: type=%d, version=%d"
 file_flush_fdatasync_failed(int err) "errno %d"
+zbd_zone_report(void *bs, unsigned int nr_zones, int64_t sector) "bs %p report 
%d zones starting at sector offset 0x%" PRIx64 ""
+zbd_zone_mgmt(void *bs, const char *op_name, int64_t sector, int64_t len) "bs 
%p %s starts at sector offset 0x%" PRIx64 " over a range of 0x%" PRIx64 " 
sectors"
 
 # ssh.c
 sftp_error(const char *op, const char *ssh_err, int ssh_err_code, int 
sftp_err_code) "%s failed: %s (libssh error code: %d, sftp error code: %d)"
-- 
2.40.1




[PULL v2 16/16] docs/zoned-storage:add zoned emulation use case

2023-05-15 Thread Stefan Hajnoczi
From: Sam Li 

Add the documentation about the example of using virtio-blk driver
to pass the zoned block devices through to the guest.

Signed-off-by: Sam Li 
Message-id: 20230508051916.178322-5-faithilike...@gmail.com
[Fix pre-formatted code syntax
--Stefan]
Signed-off-by: Stefan Hajnoczi 
---
 docs/devel/zoned-storage.rst | 19 +++
 1 file changed, 19 insertions(+)

diff --git a/docs/devel/zoned-storage.rst b/docs/devel/zoned-storage.rst
index da78db2783..30296d3c85 100644
--- a/docs/devel/zoned-storage.rst
+++ b/docs/devel/zoned-storage.rst
@@ -41,3 +41,22 @@ APIs for zoned storage emulation or testing.
 For example, to test zone_report on a null_blk device using qemu-io is::
 
   $ path/to/qemu-io --image-opts -n driver=host_device,filename=/dev/nullb0 -c 
"zrp offset nr_zones"
+
+To expose the host's zoned block device through virtio-blk, the command line
+can be (includes the -device parameter)::
+
+  -blockdev node-name=drive0,driver=host_device,filename=/dev/nullb0,cache.direct=on \
+  -device virtio-blk-pci,drive=drive0
+
+Or only use the -drive parameter::
+
+  -drive driver=host_device,file=/dev/nullb0,if=virtio,cache.direct=on
+
+Additionally, QEMU has several ways of supporting zoned storage, including:
+(1) Using virtio-scsi: --device scsi-block allows for the passing through of
+SCSI ZBC devices, enabling the attachment of ZBC or ZAC HDDs to QEMU.
+(2) PCI device pass-through: While NVMe ZNS emulation is available for testing
+purposes, it cannot yet pass through a zoned device from the host. To pass the
+NVMe ZNS device on to the guest, use VFIO PCI to pass the entire NVMe PCI
+adapter through to the guest. Likewise, an HDD HBA can be passed through to
+QEMU with all HDDs attached to the HBA.
-- 
2.40.1




[PULL v2 13/16] virtio-blk: add zoned storage emulation for zoned devices

2023-05-15 Thread Stefan Hajnoczi
From: Sam Li 

This patch extends virtio-blk emulation to handle zoned device commands
by calling the new block layer APIs to perform zoned device I/O on
behalf of the guest. It supports Report Zone, four zone operations (open,
close, finish, reset), and Append Zone.

The VIRTIO_BLK_F_ZONED feature bit will only be set if the host does
support zoned block devices. Regular block devices (conventional zones)
will not have it set.

The guest OS can use blktests and fio to test those commands on zoned
devices. Furthermore, using zonefs to test zone append writes is also
supported.
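
As a guest-side sketch (device name and mount point are assumptions):

  # blkzone report /dev/vda          # inspect zones through virtio-blk
  # mount -t zonefs /dev/vda /mnt    # zonefs: O_APPEND writes to the
                                     # files under /mnt/seq/ are expected
                                     # to be issued as zone appends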

Signed-off-by: Sam Li 
Message-id: 20230508051916.178322-2-faithilike...@gmail.com
Signed-off-by: Stefan Hajnoczi 
---
 hw/block/virtio-blk-common.c |   2 +
 hw/block/virtio-blk.c| 389 +++
 hw/virtio/virtio-qmp.c   |   2 +
 3 files changed, 393 insertions(+)

diff --git a/hw/block/virtio-blk-common.c b/hw/block/virtio-blk-common.c
index ac52d7c176..e2f8e2f6da 100644
--- a/hw/block/virtio-blk-common.c
+++ b/hw/block/virtio-blk-common.c
@@ -29,6 +29,8 @@ static const VirtIOFeature feature_sizes[] = {
  .end = endof(struct virtio_blk_config, discard_sector_alignment)},
 {.flags = 1ULL << VIRTIO_BLK_F_WRITE_ZEROES,
  .end = endof(struct virtio_blk_config, write_zeroes_may_unmap)},
+{.flags = 1ULL << VIRTIO_BLK_F_ZONED,
+ .end = endof(struct virtio_blk_config, zoned)},
 {}
 };
 
diff --git a/hw/block/virtio-blk.c b/hw/block/virtio-blk.c
index cefca93b31..cb741dec39 100644
--- a/hw/block/virtio-blk.c
+++ b/hw/block/virtio-blk.c
@@ -17,6 +17,7 @@
 #include "qemu/module.h"
 #include "qemu/error-report.h"
 #include "qemu/main-loop.h"
+#include "block/block_int.h"
 #include "trace.h"
 #include "hw/block/block.h"
 #include "hw/qdev-properties.h"
@@ -601,6 +602,335 @@ err:
 return err_status;
 }
 
+typedef struct ZoneCmdData {
+VirtIOBlockReq *req;
+struct iovec *in_iov;
+unsigned in_num;
+union {
+struct {
+unsigned int nr_zones;
+BlockZoneDescriptor *zones;
+} zone_report_data;
+struct {
+int64_t offset;
+} zone_append_data;
+};
+} ZoneCmdData;
+
+/*
+ * check zoned_request: error checking before issuing requests. If all checks
+ * passed, return true.
+ * append: true if only zone append requests issued.
+ */
+static bool check_zoned_request(VirtIOBlock *s, int64_t offset, int64_t len,
+ bool append, uint8_t *status) {
+BlockDriverState *bs = blk_bs(s->blk);
+int index;
+
+if (!virtio_has_feature(s->host_features, VIRTIO_BLK_F_ZONED)) {
+*status = VIRTIO_BLK_S_UNSUPP;
+return false;
+}
+
+if (offset < 0 || len < 0 || len > (bs->total_sectors << BDRV_SECTOR_BITS)
+|| offset > (bs->total_sectors << BDRV_SECTOR_BITS) - len) {
+*status = VIRTIO_BLK_S_ZONE_INVALID_CMD;
+return false;
+}
+
+if (append) {
+if (bs->bl.write_granularity) {
+if ((offset % bs->bl.write_granularity) != 0) {
+*status = VIRTIO_BLK_S_ZONE_UNALIGNED_WP;
+return false;
+}
+}
+
+index = offset / bs->bl.zone_size;
+if (BDRV_ZT_IS_CONV(bs->wps->wp[index])) {
+*status = VIRTIO_BLK_S_ZONE_INVALID_CMD;
+return false;
+}
+
+if (len / 512 > bs->bl.max_append_sectors) {
+if (bs->bl.max_append_sectors == 0) {
+*status = VIRTIO_BLK_S_UNSUPP;
+} else {
+*status = VIRTIO_BLK_S_ZONE_INVALID_CMD;
+}
+return false;
+}
+}
+return true;
+}
+
+static void virtio_blk_zone_report_complete(void *opaque, int ret)
+{
+ZoneCmdData *data = opaque;
+VirtIOBlockReq *req = data->req;
+VirtIOBlock *s = req->dev;
+VirtIODevice *vdev = VIRTIO_DEVICE(req->dev);
+struct iovec *in_iov = data->in_iov;
+unsigned in_num = data->in_num;
+int64_t zrp_size, n, j = 0;
+int64_t nz = data->zone_report_data.nr_zones;
+int8_t err_status = VIRTIO_BLK_S_OK;
+
+if (ret) {
+err_status = VIRTIO_BLK_S_ZONE_INVALID_CMD;
+goto out;
+}
+
+struct virtio_blk_zone_report zrp_hdr = (struct virtio_blk_zone_report) {
+.nr_zones = cpu_to_le64(nz),
+};
+zrp_size = sizeof(struct virtio_blk_zone_report)
+   + sizeof(struct virtio_blk_zone_descriptor) * nz;
+n = iov_from_buf(in_iov, in_num, 0, &zrp_hdr, sizeof(zrp_hdr));
+if (n != sizeof(zrp_hdr)) {
+virtio_error(vdev, "Driver provided input buffer that is too small!");
+err_status = VIRTIO_BLK_S_ZONE_INVALID_CMD;
+goto out;
+}
+
+for (size_t i = sizeof(zrp_hdr); i < zrp_size;
+i += sizeof(struct virtio_blk_zone_descriptor), ++j) {
+struct virtio_blk_zone_descriptor desc =
+(struct virtio_blk_zone_descriptor) {
+.z_start = 

[PULL v2 01/16] block/block-common: add zoned device structs

2023-05-15 Thread Stefan Hajnoczi
From: Sam Li 

Signed-off-by: Sam Li 
Reviewed-by: Stefan Hajnoczi 
Reviewed-by: Damien Le Moal 
Reviewed-by: Hannes Reinecke 
Reviewed-by: Dmitry Fomichev 
Acked-by: Kevin Wolf 
Signed-off-by: Stefan Hajnoczi 
Message-id: 20230508045533.175575-2-faithilike...@gmail.com
Message-id: 20230324090605.28361-2-faithilike...@gmail.com
[Adjust commit message prefix as suggested by Philippe Mathieu-Daudé
.
--Stefan]
Signed-off-by: Stefan Hajnoczi 
---
 include/block/block-common.h | 43 
 1 file changed, 43 insertions(+)

diff --git a/include/block/block-common.h b/include/block/block-common.h
index b5122ef8ab..1576fcf2ed 100644
--- a/include/block/block-common.h
+++ b/include/block/block-common.h
@@ -75,6 +75,49 @@ typedef struct BlockDriver BlockDriver;
 typedef struct BdrvChild BdrvChild;
 typedef struct BdrvChildClass BdrvChildClass;
 
+typedef enum BlockZoneOp {
+BLK_ZO_OPEN,
+BLK_ZO_CLOSE,
+BLK_ZO_FINISH,
+BLK_ZO_RESET,
+} BlockZoneOp;
+
+typedef enum BlockZoneModel {
+BLK_Z_NONE = 0x0, /* Regular block device */
+BLK_Z_HM = 0x1, /* Host-managed zoned block device */
+BLK_Z_HA = 0x2, /* Host-aware zoned block device */
+} BlockZoneModel;
+
+typedef enum BlockZoneState {
+BLK_ZS_NOT_WP = 0x0,
+BLK_ZS_EMPTY = 0x1,
+BLK_ZS_IOPEN = 0x2,
+BLK_ZS_EOPEN = 0x3,
+BLK_ZS_CLOSED = 0x4,
+BLK_ZS_RDONLY = 0xD,
+BLK_ZS_FULL = 0xE,
+BLK_ZS_OFFLINE = 0xF,
+} BlockZoneState;
+
+typedef enum BlockZoneType {
+BLK_ZT_CONV = 0x1, /* Conventional random writes supported */
+BLK_ZT_SWR = 0x2, /* Sequential writes required */
+BLK_ZT_SWP = 0x3, /* Sequential writes preferred */
+} BlockZoneType;
+
+/*
+ * Zone descriptor data structure.
+ * Provides information on a zone with all position and size values in bytes.
+ */
+typedef struct BlockZoneDescriptor {
+uint64_t start;
+uint64_t length;
+uint64_t cap;
+uint64_t wp;
+BlockZoneType type;
+BlockZoneState state;
+} BlockZoneDescriptor;
+
 typedef struct BlockDriverInfo {
 /* in bytes, 0 if irrelevant */
 int cluster_size;
-- 
2.40.1




[PULL v2 15/16] virtio-blk: add some trace events for zoned emulation

2023-05-15 Thread Stefan Hajnoczi
From: Sam Li 

Signed-off-by: Sam Li 
Reviewed-by: Stefan Hajnoczi 
Message-id: 20230508051916.178322-4-faithilike...@gmail.com
Signed-off-by: Stefan Hajnoczi 
---
 hw/block/virtio-blk.c | 12 
 hw/block/trace-events |  7 +++
 2 files changed, 19 insertions(+)

diff --git a/hw/block/virtio-blk.c b/hw/block/virtio-blk.c
index bf05251a75..8f65ea4659 100644
--- a/hw/block/virtio-blk.c
+++ b/hw/block/virtio-blk.c
@@ -676,6 +676,7 @@ static void virtio_blk_zone_report_complete(void *opaque, 
int ret)
 int64_t nz = data->zone_report_data.nr_zones;
 int8_t err_status = VIRTIO_BLK_S_OK;
 
+trace_virtio_blk_zone_report_complete(vdev, req, nz, ret);
 if (ret) {
 err_status = VIRTIO_BLK_S_ZONE_INVALID_CMD;
 goto out;
@@ -792,6 +793,8 @@ static void virtio_blk_handle_zone_report(VirtIOBlockReq 
*req,
 nr_zones = (req->in_len - sizeof(struct virtio_blk_inhdr) -
 sizeof(struct virtio_blk_zone_report)) /
sizeof(struct virtio_blk_zone_descriptor);
+trace_virtio_blk_handle_zone_report(vdev, req,
+offset >> BDRV_SECTOR_BITS, nr_zones);
 
 zone_size = sizeof(BlockZoneDescriptor) * nr_zones;
 data = g_malloc(sizeof(ZoneCmdData));
@@ -814,7 +817,9 @@ static void virtio_blk_zone_mgmt_complete(void *opaque, int 
ret)
 {
 VirtIOBlockReq *req = opaque;
 VirtIOBlock *s = req->dev;
+VirtIODevice *vdev = VIRTIO_DEVICE(s);
 int8_t err_status = VIRTIO_BLK_S_OK;
+trace_virtio_blk_zone_mgmt_complete(vdev, req, ret);
 
 if (ret) {
 err_status = VIRTIO_BLK_S_ZONE_INVALID_CMD;
@@ -841,6 +846,8 @@ static int virtio_blk_handle_zone_mgmt(VirtIOBlockReq *req, 
BlockZoneOp op)
 /* Entire drive capacity */
 offset = 0;
 len = capacity;
+trace_virtio_blk_handle_zone_reset_all(vdev, req, 0,
+   bs->total_sectors);
 } else {
 if (bs->bl.zone_size > capacity - offset) {
 /* The zoned device allows the last smaller zone. */
@@ -848,6 +855,9 @@ static int virtio_blk_handle_zone_mgmt(VirtIOBlockReq *req, 
BlockZoneOp op)
 } else {
 len = bs->bl.zone_size;
 }
+trace_virtio_blk_handle_zone_mgmt(vdev, req, op,
+  offset >> BDRV_SECTOR_BITS,
+  len >> BDRV_SECTOR_BITS);
 }
 
 if (!check_zoned_request(s, offset, len, false, _status)) {
@@ -888,6 +898,7 @@ static void virtio_blk_zone_append_complete(void *opaque, 
int ret)
 err_status = VIRTIO_BLK_S_ZONE_INVALID_CMD;
 goto out;
 }
+trace_virtio_blk_zone_append_complete(vdev, req, append_sector, ret);
 
 out:
 aio_context_acquire(blk_get_aio_context(s->conf.conf.blk));
@@ -909,6 +920,7 @@ static int virtio_blk_handle_zone_append(VirtIOBlockReq 
*req,
 int64_t offset = virtio_ldq_p(vdev, >out.sector) << BDRV_SECTOR_BITS;
 int64_t len = iov_size(out_iov, out_num);
 
+trace_virtio_blk_handle_zone_append(vdev, req, offset >> BDRV_SECTOR_BITS);
 if (!check_zoned_request(s, offset, len, true, _status)) {
 goto out;
 }
diff --git a/hw/block/trace-events b/hw/block/trace-events
index 2c45a62bd5..34be8b9135 100644
--- a/hw/block/trace-events
+++ b/hw/block/trace-events
@@ -44,9 +44,16 @@ pflash_write_unknown(const char *name, uint8_t cmd) "%s: 
unknown command 0x%02x"
 # virtio-blk.c
 virtio_blk_req_complete(void *vdev, void *req, int status) "vdev %p req %p 
status %d"
 virtio_blk_rw_complete(void *vdev, void *req, int ret) "vdev %p req %p ret %d"
+virtio_blk_zone_report_complete(void *vdev, void *req, unsigned int nr_zones, 
int ret) "vdev %p req %p nr_zones %u ret %d"
+virtio_blk_zone_mgmt_complete(void *vdev, void *req, int ret) "vdev %p req %p 
ret %d"
+virtio_blk_zone_append_complete(void *vdev, void *req, int64_t sector, int 
ret) "vdev %p req %p, append sector 0x%" PRIx64 " ret %d"
 virtio_blk_handle_write(void *vdev, void *req, uint64_t sector, size_t 
nsectors) "vdev %p req %p sector %"PRIu64" nsectors %zu"
 virtio_blk_handle_read(void *vdev, void *req, uint64_t sector, size_t 
nsectors) "vdev %p req %p sector %"PRIu64" nsectors %zu"
 virtio_blk_submit_multireq(void *vdev, void *mrb, int start, int num_reqs, 
uint64_t offset, size_t size, bool is_write) "vdev %p mrb %p start %d num_reqs 
%d offset %"PRIu64" size %zu is_write %d"
+virtio_blk_handle_zone_report(void *vdev, void *req, int64_t sector, unsigned 
int nr_zones) "vdev %p req %p sector 0x%" PRIx64 " nr_zones %u"
+virtio_blk_handle_zone_mgmt(void *vdev, void *req, uint8_t op, int64_t sector, 
int64_t len) "vdev %p req %p op 0x%x sector 0x%" PRIx64 " len 0x%" PRIx64 ""
+virtio_blk_handle_zone_reset_all(void *vdev, void *req, int64_t sector, 
int64_t len) "vdev %p req %p sector 0x%" PRIx64 " cap 0x%" PRIx64 ""
+virtio_blk_handle_zone_append(void *vdev, void *req, int64_t sector) "vdev %p 
req %p, append sector 0x%" 

[PULL v2 06/16] iotests: test new zone operations

2023-05-15 Thread Stefan Hajnoczi
From: Sam Li 

The new block layer APIs of zoned block devices can be tested by:
$ tests/qemu-iotests/check zoned
Run each zone operation on a newly created null_blk device
and see whether it outputs the same zone information.

Signed-off-by: Sam Li 
Reviewed-by: Stefan Hajnoczi 
Acked-by: Kevin Wolf 
Signed-off-by: Stefan Hajnoczi 
Message-id: 20230508045533.175575-7-faithilike...@gmail.com
Message-id: 20230324090605.28361-7-faithilike...@gmail.com
[Adjust commit message prefix as suggested by Philippe Mathieu-Daudé
.
--Stefan]
Signed-off-by: Stefan Hajnoczi 
---
 tests/qemu-iotests/tests/zoned | 89 ++
 tests/qemu-iotests/tests/zoned.out | 53 ++
 2 files changed, 142 insertions(+)
 create mode 100755 tests/qemu-iotests/tests/zoned
 create mode 100644 tests/qemu-iotests/tests/zoned.out

diff --git a/tests/qemu-iotests/tests/zoned b/tests/qemu-iotests/tests/zoned
new file mode 100755
index 00..56f60616b5
--- /dev/null
+++ b/tests/qemu-iotests/tests/zoned
@@ -0,0 +1,89 @@
+#!/usr/bin/env bash
+#
+# Test zone management operations.
+#
+
+seq="$(basename $0)"
+echo "QA output created by $seq"
+status=1 # failure is the default!
+
+_cleanup()
+{
+  _cleanup_test_img
+  sudo -n rmmod null_blk
+}
+trap "_cleanup; exit \$status" 0 1 2 3 15
+
+# get standard environment, filters and checks
+. ../common.rc
+. ../common.filter
+. ../common.qemu
+
+# This test only runs on Linux hosts with raw image files.
+_supported_fmt raw
+_supported_proto file
+_supported_os Linux
+
+sudo -n true || \
+_notrun 'Password-less sudo required'
+
+IMG="--image-opts -n driver=host_device,filename=/dev/nullb0"
+QEMU_IO_OPTIONS=$QEMU_IO_OPTIONS_NO_FMT
+
+echo "Testing a null_blk device:"
+echo "case 1: if the operations work"
+sudo -n modprobe null_blk nr_devices=1 zoned=1
+sudo -n chmod 0666 /dev/nullb0
+
+echo "(1) report the first zone:"
+$QEMU_IO $IMG -c "zrp 0 1"
+echo
+echo "report the first 10 zones"
+$QEMU_IO $IMG -c "zrp 0 10"
+echo
+echo "report the last zone:"
+$QEMU_IO $IMG -c "zrp 0x3e7000 2" # 0x3e7000 / 512 = 0x1f38
+echo
+echo
+echo "(2) opening the first zone"
+$QEMU_IO $IMG -c "zo 0 268435456"  # 268435456 / 512 = 524288
+echo "report after:"
+$QEMU_IO $IMG -c "zrp 0 1"
+echo
+echo "opening the second zone"
+$QEMU_IO $IMG -c "zo 268435456 268435456" #
+echo "report after:"
+$QEMU_IO $IMG -c "zrp 268435456 1"
+echo
+echo "opening the last zone"
+$QEMU_IO $IMG -c "zo 0x3e7000 268435456"
+echo "report after:"
+$QEMU_IO $IMG -c "zrp 0x3e7000 2"
+echo
+echo
+echo "(3) closing the first zone"
+$QEMU_IO $IMG -c "zc 0 268435456"
+echo "report after:"
+$QEMU_IO $IMG -c "zrp 0 1"
+echo
+echo "closing the last zone"
+$QEMU_IO $IMG -c "zc 0x3e7000 268435456"
+echo "report after:"
+$QEMU_IO $IMG -c "zrp 0x3e7000 2"
+echo
+echo
+echo "(4) finishing the second zone"
+$QEMU_IO $IMG -c "zf 268435456 268435456"
+echo "After finishing a zone:"
+$QEMU_IO $IMG -c "zrp 268435456 1"
+echo
+echo
+echo "(5) resetting the second zone"
+$QEMU_IO $IMG -c "zrs 268435456 268435456"
+echo "After resetting a zone:"
+$QEMU_IO $IMG -c "zrp 268435456 1"
+
+# success, all done
+echo "*** done"
+rm -f $seq.full
+status=0
diff --git a/tests/qemu-iotests/tests/zoned.out 
b/tests/qemu-iotests/tests/zoned.out
new file mode 100644
index 00..b2d061da49
--- /dev/null
+++ b/tests/qemu-iotests/tests/zoned.out
@@ -0,0 +1,53 @@
+QA output created by zoned
+Testing a null_blk device:
+case 1: if the operations work
+(1) report the first zone:
+start: 0x0, len 0x80000, cap 0x80000, wptr 0x0, zcond:1, [type: 2]
+
+report the first 10 zones
+start: 0x0, len 0x80000, cap 0x80000, wptr 0x0, zcond:1, [type: 2]
+start: 0x80000, len 0x80000, cap 0x80000, wptr 0x80000, zcond:1, [type: 2]
+start: 0x100000, len 0x80000, cap 0x80000, wptr 0x100000, zcond:1, [type: 2]
+start: 0x180000, len 0x80000, cap 0x80000, wptr 0x180000, zcond:1, [type: 2]
+start: 0x200000, len 0x80000, cap 0x80000, wptr 0x200000, zcond:1, [type: 2]
+start: 0x280000, len 0x80000, cap 0x80000, wptr 0x280000, zcond:1, [type: 2]
+start: 0x300000, len 0x80000, cap 0x80000, wptr 0x300000, zcond:1, [type: 2]
+start: 0x380000, len 0x80000, cap 0x80000, wptr 0x380000, zcond:1, [type: 2]
+start: 0x400000, len 0x80000, cap 0x80000, wptr 0x400000, zcond:1, [type: 2]
+start: 0x480000, len 0x80000, cap 0x80000, wptr 0x480000, zcond:1, [type: 2]
+
+report the last zone:
+start: 0x1f380000, len 0x80000, cap 0x80000, wptr 0x1f380000, zcond:1, [type: 2]
+
+
+(2) opening the first zone
+report after:
+start: 0x0, len 0x80000, cap 0x80000, wptr 0x0, zcond:3, [type: 2]
+
+opening the second zone
+report after:
+start: 0x80000, len 0x80000, cap 0x80000, wptr 0x80000, zcond:3, [type: 2]
+
+opening the last zone
+report after:
+start: 0x1f380000, len 0x80000, cap 0x80000, wptr 0x1f380000, zcond:3, [type: 2]
+
+
+(3) closing the first zone
+report after:
+start: 0x0, len 0x80000, cap 0x80000, wptr 0x0, zcond:1, [type: 2]
+
+closing the 

[PULL v2 02/16] block/file-posix: introduce helper functions for sysfs attributes

2023-05-15 Thread Stefan Hajnoczi
From: Sam Li 

Use get_sysfs_str_val() to get the string value of device
zoned model. Then get_sysfs_zoned_model() can convert it to
BlockZoneModel type of QEMU.

Use get_sysfs_long_val() to get the long value of zoned device
information.
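
A usage sketch for the helpers (hypothetical caller; "chunk_sectors" as
the zone-size attribute is an assumption based on the kernel's sysfs
layout, not something shown in this patch):

  struct stat st;
  BlockZoneModel model;

  if (fstat(fd, &st) == 0 && get_sysfs_zoned_model(&st, &model) == 0 &&
      model != BLK_Z_NONE) {
      long zone_sectors = get_sysfs_long_val(&st, "chunk_sectors");
      /* negative return values are -errno style failures */
  }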

Signed-off-by: Sam Li 
Reviewed-by: Hannes Reinecke 
Reviewed-by: Stefan Hajnoczi 
Reviewed-by: Damien Le Moal 
Reviewed-by: Dmitry Fomichev 
Acked-by: Kevin Wolf 
Signed-off-by: Stefan Hajnoczi 
Message-id: 20230508045533.175575-3-faithilike...@gmail.com
Message-id: 20230324090605.28361-3-faithilike...@gmail.com
[Adjust commit message prefix as suggested by Philippe Mathieu-Daudé
.
--Stefan]
Signed-off-by: Stefan Hajnoczi 
---
 include/block/block_int-common.h |   3 +
 block/file-posix.c   | 135 ++-
 2 files changed, 100 insertions(+), 38 deletions(-)

diff --git a/include/block/block_int-common.h b/include/block/block_int-common.h
index 4909876756..c7ca5a83e9 100644
--- a/include/block/block_int-common.h
+++ b/include/block/block_int-common.h
@@ -862,6 +862,9 @@ typedef struct BlockLimits {
  * an explicit monitor command to load the disk inside the guest).
  */
 bool has_variable_length;
+
+/* device zone model */
+BlockZoneModel zoned;
 } BlockLimits;
 
 typedef struct BdrvOpBlocker BdrvOpBlocker;
diff --git a/block/file-posix.c b/block/file-posix.c
index c7b723368e..97c597a2a0 100644
--- a/block/file-posix.c
+++ b/block/file-posix.c
@@ -1202,15 +1202,89 @@ static int hdev_get_max_hw_transfer(int fd, struct stat *st)
 #endif
 }
 
-static int hdev_get_max_segments(int fd, struct stat *st)
+/*
+ * Get a sysfs attribute value as character string.
+ */
+#ifdef CONFIG_LINUX
+static int get_sysfs_str_val(struct stat *st, const char *attribute,
+ char **val) {
+g_autofree char *sysfspath = NULL;
+int ret;
+size_t len;
+
+if (!S_ISBLK(st->st_mode)) {
+return -ENOTSUP;
+}
+
+sysfspath = g_strdup_printf("/sys/dev/block/%u:%u/queue/%s",
+major(st->st_rdev), minor(st->st_rdev),
+attribute);
+ret = g_file_get_contents(sysfspath, val, &len, NULL);
+if (ret == -1) {
+return -ENOENT;
+}
+
+/* The file is ended with '\n' */
+char *p;
+p = *val;
+if (*(p + len - 1) == '\n') {
+*(p + len - 1) = '\0';
+}
+return ret;
+}
+#endif
+
+static int get_sysfs_zoned_model(struct stat *st, BlockZoneModel *zoned)
 {
+g_autofree char *val = NULL;
+int ret;
+
+ret = get_sysfs_str_val(st, "zoned", );
+if (ret < 0) {
+return ret;
+}
+
+if (strcmp(val, "host-managed") == 0) {
+*zoned = BLK_Z_HM;
+} else if (strcmp(val, "host-aware") == 0) {
+*zoned = BLK_Z_HA;
+} else if (strcmp(val, "none") == 0) {
+*zoned = BLK_Z_NONE;
+} else {
+return -ENOTSUP;
+}
+return 0;
+}
+
+/*
+ * Get a sysfs attribute value as a long integer.
+ */
 #ifdef CONFIG_LINUX
-char buf[32];
+static long get_sysfs_long_val(struct stat *st, const char *attribute)
+{
+g_autofree char *str = NULL;
 const char *end;
-char *sysfspath = NULL;
+long val;
+int ret;
+
+ret = get_sysfs_str_val(st, attribute, &str);
+if (ret < 0) {
+return ret;
+}
+
+/* The file is ended with '\n', pass 'end' to accept that. */
+ret = qemu_strtol(str, &end, 10, &val);
+if (ret == 0 && end && *end == '\0') {
+ret = val;
+}
+return ret;
+}
+#endif
+
+static int hdev_get_max_segments(int fd, struct stat *st)
+{
+#ifdef CONFIG_LINUX
 int ret;
-int sysfd = -1;
-long max_segments;
 
 if (S_ISCHR(st->st_mode)) {
 if (ioctl(fd, SG_GET_SG_TABLESIZE, &ret) == 0) {
@@ -1218,44 +1292,27 @@ static int hdev_get_max_segments(int fd, struct stat *st)
 }
 return -ENOTSUP;
 }
-
-if (!S_ISBLK(st->st_mode)) {
-return -ENOTSUP;
-}
-
-sysfspath = g_strdup_printf("/sys/dev/block/%u:%u/queue/max_segments",
-major(st->st_rdev), minor(st->st_rdev));
-sysfd = open(sysfspath, O_RDONLY);
-if (sysfd == -1) {
-ret = -errno;
-goto out;
-}
-ret = RETRY_ON_EINTR(read(sysfd, buf, sizeof(buf) - 1));
-if (ret < 0) {
-ret = -errno;
-goto out;
-} else if (ret == 0) {
-ret = -EIO;
-goto out;
-}
-buf[ret] = 0;
-/* The file is ended with '\n', pass 'end' to accept that. */
-ret = qemu_strtol(buf, &end, 10, &max_segments);
-if (ret == 0 && end && *end == '\n') {
-ret = max_segments;
-}
-
-out:
-if (sysfd != -1) {
-close(sysfd);
-}
-g_free(sysfspath);
-return ret;
+return get_sysfs_long_val(st, "max_segments");
 #else
 return -ENOTSUP;
 #endif
 }
 
+static void raw_refresh_zoned_limits(BlockDriverState *bs, struct stat *st,
+ Error **errp)
+{
+BlockZoneModel zoned;
+int 

[PULL v2 00/16] Block patches

2023-05-15 Thread Stefan Hajnoczi
The following changes since commit 8844bb8d896595ee1d25d21c770e6e6f29803097:

  Merge tag 'or1k-pull-request-20230513' of https://github.com/stffrdhrn/qemu 
into staging (2023-05-13 11:23:14 +0100)

are available in the Git repository at:

  https://gitlab.com/stefanha/qemu.git tags/block-pull-request

for you to fetch changes up to 01562fee5f3ad4506d57dbcf4b1903b565eceec7:

  docs/zoned-storage:add zoned emulation use case (2023-05-15 08:19:04 -0400)


Pull request

This pull request contains Sam Li's zoned storage support in the QEMU block
layer and virtio-blk emulation.

v2:
- Sam fixed the CI failures. CI passes for me now. [Richard]



Sam Li (16):
  block/block-common: add zoned device structs
  block/file-posix: introduce helper functions for sysfs attributes
  block/block-backend: add block layer APIs resembling Linux
ZonedBlockDevice ioctls
  block/raw-format: add zone operations to pass through requests
  block: add zoned BlockDriver check to block layer
  iotests: test new zone operations
  block: add some trace events for new block layer APIs
  docs/zoned-storage: add zoned device documentation
  file-posix: add tracking of the zone write pointers
  block: introduce zone append write for zoned devices
  qemu-iotests: test zone append operation
  block: add some trace events for zone append
  virtio-blk: add zoned storage emulation for zoned devices
  block: add accounting for zone append operation
  virtio-blk: add some trace events for zoned emulation
  docs/zoned-storage:add zoned emulation use case

 docs/devel/index-api.rst   |   1 +
 docs/devel/zoned-storage.rst   |  62 +++
 qapi/block-core.json   |  68 ++-
 qapi/block.json|   4 +
 meson.build|   5 +
 include/block/accounting.h |   1 +
 include/block/block-common.h   |  57 ++
 include/block/block-io.h   |  13 +
 include/block/block_int-common.h   |  37 ++
 include/block/raw-aio.h|   8 +-
 include/sysemu/block-backend-io.h  |  27 +
 block.c|  19 +
 block/block-backend.c  | 198 +++
 block/file-posix.c | 692 +++--
 block/io.c |  68 +++
 block/io_uring.c   |   4 +
 block/linux-aio.c  |   3 +
 block/qapi-sysemu.c|  11 +
 block/qapi.c   |  18 +
 block/raw-format.c |  26 +
 hw/block/virtio-blk-common.c   |   2 +
 hw/block/virtio-blk.c  | 405 +++
 hw/virtio/virtio-qmp.c |   2 +
 qemu-io-cmds.c | 224 
 block/trace-events |   4 +
 docs/system/qemu-block-drivers.rst.inc |   6 +
 hw/block/trace-events  |   7 +
 tests/qemu-iotests/227.out |  18 +
 tests/qemu-iotests/tests/zoned | 105 
 tests/qemu-iotests/tests/zoned.out |  69 +++
 30 files changed, 2106 insertions(+), 58 deletions(-)
 create mode 100644 docs/devel/zoned-storage.rst
 create mode 100755 tests/qemu-iotests/tests/zoned
 create mode 100644 tests/qemu-iotests/tests/zoned.out

-- 
2.40.1




[PULL v2 03/16] block/block-backend: add block layer APIs resembling Linux ZonedBlockDevice ioctls

2023-05-15 Thread Stefan Hajnoczi
From: Sam Li 

Add a zoned device option to the host_device BlockDriver. It will be
presented only for zoned host block devices. By adding zone management
operations to the host_device BlockDriver, users can use the new block
layer APIs, including Report Zone and the zone management operations
(open, close, finish, reset, reset_all).

Qemu-io uses the new APIs to perform zoned storage commands of the device:
zone_report(zrp), zone_open(zo), zone_close(zc), zone_reset(zrs),
zone_finish(zf).

For example, to test zone_report, use the following command:
$ ./build/qemu-io --image-opts -n driver=host_device,filename=/dev/nullb0
-c "zrp offset nr_zones"

Signed-off-by: Sam Li 
Reviewed-by: Hannes Reinecke 
Reviewed-by: Stefan Hajnoczi 
Reviewed-by: Dmitry Fomichev 
Acked-by: Kevin Wolf 
Signed-off-by: Stefan Hajnoczi 
Message-id: 20230508045533.175575-4-faithilike...@gmail.com
Message-id: 20230324090605.28361-4-faithilike...@gmail.com
[Adjust commit message prefix as suggested by Philippe Mathieu-Daudé
 and remove spurious ret = -errno in
raw_co_zone_mgmt().
--Stefan]
Signed-off-by: Stefan Hajnoczi 
---
 meson.build   |   5 +
 include/block/block-io.h  |   9 +
 include/block/block_int-common.h  |  21 ++
 include/block/raw-aio.h   |   6 +-
 include/sysemu/block-backend-io.h |  18 ++
 block/block-backend.c | 137 +
 block/file-posix.c| 313 +-
 block/io.c|  41 
 qemu-io-cmds.c| 149 ++
 9 files changed, 696 insertions(+), 3 deletions(-)

diff --git a/meson.build b/meson.build
index d3cf48960b..25a4b9f2c1 100644
--- a/meson.build
+++ b/meson.build
@@ -2025,6 +2025,8 @@ if rdma.found()
 endif
 
 # has_header_symbol
+config_host_data.set('CONFIG_BLKZONED',
+ cc.has_header_symbol('linux/blkzoned.h', 'BLKOPENZONE'))
 config_host_data.set('CONFIG_EPOLL_CREATE1',
  cc.has_header_symbol('sys/epoll.h', 'epoll_create1'))
 config_host_data.set('CONFIG_FALLOCATE_PUNCH_HOLE',
@@ -2060,6 +2062,9 @@ config_host_data.set('HAVE_SIGEV_NOTIFY_THREAD_ID',
 config_host_data.set('HAVE_STRUCT_STAT_ST_ATIM',
  cc.has_member('struct stat', 'st_atim',
 prefix: '#include <sys/stat.h>'))
+config_host_data.set('HAVE_BLK_ZONE_REP_CAPACITY',
+ cc.has_member('struct blk_zone', 'capacity',
+   prefix: '#include <linux/blkzoned.h>'))
 
 # has_type
 config_host_data.set('CONFIG_IOVEC',
diff --git a/include/block/block-io.h b/include/block/block-io.h
index 1f612ec5bd..f099b204bc 100644
--- a/include/block/block-io.h
+++ b/include/block/block-io.h
@@ -114,6 +114,15 @@ int coroutine_fn GRAPH_RDLOCK bdrv_co_flush(BlockDriverState *bs);
 int coroutine_fn GRAPH_RDLOCK bdrv_co_pdiscard(BdrvChild *child, int64_t offset,
int64_t bytes);
 
+/* Report zone information of zone block device. */
+int coroutine_fn GRAPH_RDLOCK bdrv_co_zone_report(BlockDriverState *bs,
+  int64_t offset,
+  unsigned int *nr_zones,
+  BlockZoneDescriptor *zones);
+int coroutine_fn GRAPH_RDLOCK bdrv_co_zone_mgmt(BlockDriverState *bs,
+BlockZoneOp op,
+int64_t offset, int64_t len);
+
 bool bdrv_can_write_zeroes_with_unmap(BlockDriverState *bs);
 int bdrv_block_status(BlockDriverState *bs, int64_t offset,
   int64_t bytes, int64_t *pnum, int64_t *map,
diff --git a/include/block/block_int-common.h b/include/block/block_int-common.h
index c7ca5a83e9..b2612f06ec 100644
--- a/include/block/block_int-common.h
+++ b/include/block/block_int-common.h
@@ -713,6 +713,12 @@ struct BlockDriver {
 int coroutine_fn GRAPH_RDLOCK_PTR (*bdrv_co_load_vmstate)(
 BlockDriverState *bs, QEMUIOVector *qiov, int64_t pos);
 
+int coroutine_fn (*bdrv_co_zone_report)(BlockDriverState *bs,
+int64_t offset, unsigned int *nr_zones,
+BlockZoneDescriptor *zones);
+int coroutine_fn (*bdrv_co_zone_mgmt)(BlockDriverState *bs, BlockZoneOp op,
+int64_t offset, int64_t len);
+
 /* removable device specific */
 bool coroutine_fn GRAPH_RDLOCK_PTR (*bdrv_co_is_inserted)(
 BlockDriverState *bs);
@@ -865,6 +871,21 @@ typedef struct BlockLimits {
 
 /* device zone model */
 BlockZoneModel zoned;
+
+/* zone size expressed in bytes */
+uint32_t zone_size;
+
+/* total number of zones */
+uint32_t nr_zones;
+
+/* maximum sectors of a zone append write operation */
+uint32_t max_append_sectors;
+
+/* maximum number of open zones */
+uint32_t max_open_zones;
+
+/* maximum number of active zones */
+uint32_t max_active_zones;
 } BlockLimits;
 

Re: [PULL 09/28] block: bdrv/blk_co_unref() for calls in coroutine context

2023-05-15 Thread Michael Tokarev

15.05.2023 16:07, Kevin Wolf wrote:

Am 11.05.2023 um 17:32 hat Michael Tokarev geschrieben:

10.05.2023 15:20, Kevin Wolf wrote:

These functions must not be called in coroutine context, because they
need write access to the graph.


How important for this and 2 surrounding changes to be for 7.2-stable
(if we'll ever release one)? It smells like real bugs are being fixed
here, is it ever possible to hit those in 7.2?

Provided that whole no_coroutine_fn  infrastructure is missing there,
including the no_co_wrapper parts?  It's not difficult to back-port some
of that stuff to 7.2.


In theory this has always been wrong, but we've only seen actual bugs
manifesting in 8.0 with the other multiqueue-related changes. So I think
it's safe to skip them for 7.2.

The bug fixed by the previous patch (bdrv_activate()) might not even
theoretically be a problem while bdrv_co_activate() didn't exist, though
I haven't investigated this in detail.


Thank you very much for the reply Kevin.
This is basically what I suspected, but wanted a confirmation.
This definitely makes sense.

/mjt
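For context, the pattern the series introduces is roughly the following
sketch (illustrative only, not the exact patch):

    /* bdrv_unref() needs write access to the graph, so it must only be
     * called outside coroutine context ... */
    static void cleanup(BlockDriverState *bs)
    {
        bdrv_unref(bs);        /* non-coroutine context only */
    }

    /* ... while the _co_ variant added by the series is safe to call
     * from a coroutine and defers the real work out of it. */
    static void coroutine_fn cleanup_co(BlockDriverState *bs)
    {
        bdrv_co_unref(bs);
    }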



Re: [PATCH 10/21] migration: Move rate_limit_max and rate_limit_used to migration_stats

2023-05-15 Thread Juan Quintela
Cédric Le Goater  wrote:
> On 5/15/23 15:09, Juan Quintela wrote:
>> Cédric Le Goater  wrote:
>>> On 5/8/23 15:08, Juan Quintela wrote:
 This way we can make them atomic and use this functions from any
 place.  I also moved all functions that use rate_limit to
 migration-stats.
 Functions got renamed, they are not qemu_file anymore.
 qemu_file_rate_limit -> migration_rate_limit_exceeded
 qemu_file_set_rate_limit -> migration_rate_limit_set
 qemu_file_get_rate_limit -> migration_rate_limit_get
 qemu_file_reset_rate_limit -> migration_rate_limit_reset
 qemu_file_acct_rate_limit -> migration_rate_limit_account.
 Signed-off-by: Juan Quintela 
 ---
 If you have any good suggestion for better names, I am all ears.
>>>
>>> May be :
>>>
>>>   qemu_file_rate_limit -> migration_rate_limit_is_exceeded
>> I try not to put _is_ in function names.  If it needs to be there, I
>> think that I need to rename the function.
>
> It is common practice for functions doing a simple test and returning a bool.
> No big deal anyway.
>  > migration_rate_limit_exceeded()
>> seems clear to me.
>> 
>>>   qemu_file_acct_rate_limit -> migration_rate_limit_inc
>> My problem for this one is that we are not increasing the
>> rate_limit, we
>> are "decreasing" the amount of data we have for this period.  That is
>> why I thought about _account(), but who knows.
>> 
>>> Also, migration_rate_limit() would need some prefix to understand what is
>>> its purpose.
>> What do you mean here?
>
> I am referring to :
>
>   /* Returns true if the rate limiting was broken by an urgent request */
>   bool migration_rate_limit(void)
>   {
>   ...
>   return urgent;
>   }
>
> which existed prior to the name changes and I thought migration_rate_limit()
> would suffer the same fate. May be keep the '_limit' suffix for this one if
> you remove it for the others ?

ok, will think about this one.

Later, Juan.




Re: [PATCH 10/21] migration: Move rate_limit_max and rate_limit_used to migration_stats

2023-05-15 Thread Cédric Le Goater

On 5/15/23 15:09, Juan Quintela wrote:

Cédric Le Goater  wrote:

On 5/8/23 15:08, Juan Quintela wrote:

This way we can make them atomic and use these functions from any
place.  I also moved all functions that use rate_limit to
migration-stats.
Functions got renamed, they are not qemu_file anymore.
qemu_file_rate_limit -> migration_rate_limit_exceeded
qemu_file_set_rate_limit -> migration_rate_limit_set
qemu_file_get_rate_limit -> migration_rate_limit_get
qemu_file_reset_rate_limit -> migration_rate_limit_reset
qemu_file_acct_rate_limit -> migration_rate_limit_account.
Signed-off-by: Juan Quintela 
---
If you have any good suggestion for better names, I am all ears.


May be :

  qemu_file_rate_limit -> migration_rate_limit_is_exceeded


I try not to put _is_ in function names.  If it needs to be there, I
think that I need to rename the function.


It is common practice for functions doing a simple test and returning a bool.
No big deal anyway.
 

migration_rate_limit_exceeded()

seems clear to me.


  qemu_file_acct_rate_limit -> migration_rate_limit_inc


My problem for this one is that we are not increasing the rate_limit, we
are "decreasing" the amount of data we have for this period.  That is
why I thought about _account(), but who knows.



Also, migration_rate_limit() would need some prefix to understand what is
its purpose.


What do you mean here?


I am referring to :

  /* Returns true if the rate limiting was broken by an urgent request */
  bool migration_rate_limit(void)
  {
  ...
  return urgent;
  }

which existed prior to the name changes and I thought migration_rate_limit()
would suffer the same fate. May be keep the '_limit' suffix for this one if
you remove it for the others ?

Thanks,

C.



This is the only rate_limit that I can think in migration.


Do we really need "_limit" in the names ?


You have a point here.

If nobody complains/suggest anything else, I will drop the _limit for
the next submission.

Thanks very much.






Re: [PATCH 14/21] migration: We don't need the field rate_limit_used anymore

2023-05-15 Thread Cédric Le Goater

On 5/8/23 15:09, Juan Quintela wrote:

Since the previous commit, we calculate how much data we have sent with
migration_transferred_bytes(), so there is no need to maintain this
counter and remember to always update it.

Signed-off-by: Juan Quintela 




Reviewed-by: Cédric Le Goater 

Thanks,

C.



---
  migration/migration-stats.c |  6 --
  migration/migration-stats.h | 14 --
  migration/multifd.c |  1 -
  migration/qemu-file.c   |  4 
  4 files changed, 25 deletions(-)

diff --git a/migration/migration-stats.c b/migration/migration-stats.c
index eb1a2c1ad4..a42b5d953e 100644
--- a/migration/migration-stats.c
+++ b/migration/migration-stats.c
@@ -59,15 +59,9 @@ void migration_rate_limit_set(uint64_t limit)
  
  void migration_rate_limit_reset(QEMUFile *f)

  {
-stat64_set(&mig_stats.rate_limit_used, 0);
  stat64_set(&mig_stats.rate_limit_start, migration_transferred_bytes(f));
  }
  
-void migration_rate_limit_account(uint64_t len)

-{
-stat64_add(&mig_stats.rate_limit_used, len);
-}
-
  uint64_t migration_transferred_bytes(QEMUFile *f)
  {
  uint64_t multifd = stat64_get(&mig_stats.multifd_bytes);
diff --git a/migration/migration-stats.h b/migration/migration-stats.h
index 4029f1deab..ab4cc15a74 100644
--- a/migration/migration-stats.h
+++ b/migration/migration-stats.h
@@ -77,10 +77,6 @@ typedef struct {
   * Maximum amount of data we can send in a cycle.
   */
  Stat64 rate_limit_max;
-/*
- * Amount of data we have sent in the current cycle.
- */
-Stat64 rate_limit_used;
  /*
   * How long has the setup stage took.
   */
@@ -108,16 +104,6 @@ extern MigrationAtomicStats mig_stats;
  
  void calculate_time_since(Stat64 *val, int64_t since);
  
-/**

- * migration_rate_limit_account: Increase the number of bytes transferred.
- *
- * Report on a number of bytes the have been transferred that need to
- * be applied to the rate limiting calcuations.
- *
- * @len: amount of bytes transferred
- */
-void migration_rate_limit_account(uint64_t len);
-
  /**
   * migration_rate_limit_get: Get the maximum amount that can be transferred.
   *
diff --git a/migration/multifd.c b/migration/multifd.c
index 2efb313be4..9d2ade7abc 100644
--- a/migration/multifd.c
+++ b/migration/multifd.c
@@ -432,7 +432,6 @@ static int multifd_send_pages(QEMUFile *f)
  multifd_send_state->pages = p->pages;
  p->pages = pages;
  transferred = ((uint64_t) pages->num) * p->page_size + p->packet_len;
-migration_rate_limit_account(transferred);
  qemu_mutex_unlock(&p->mutex);
  stat64_add(&mig_stats.transferred, transferred);
  stat64_add(&mig_stats.multifd_bytes, transferred);
diff --git a/migration/qemu-file.c b/migration/qemu-file.c
index 3f993e24af..0086d67d83 100644
--- a/migration/qemu-file.c
+++ b/migration/qemu-file.c
@@ -292,7 +292,6 @@ void qemu_fflush(QEMUFile *f)
  qemu_file_set_error_obj(f, -EIO, local_error);
  } else {
  uint64_t size = iov_size(f->iov, f->iovcnt);
-migration_rate_limit_account(size);
  f->total_transferred += size;
  }
  
@@ -344,9 +343,6 @@ size_t ram_control_save_page(QEMUFile *f, ram_addr_t block_offset,

  if (f->hooks && f->hooks->save_page) {
  int ret = f->hooks->save_page(f, block_offset,
offset, size, bytes_sent);
-if (ret != RAM_SAVE_CONTROL_NOT_SUPP) {
-migration_rate_limit_account(size);
-}
  
  if (ret != RAM_SAVE_CONTROL_DELAYED &&

  ret != RAM_SAVE_CONTROL_NOT_SUPP) {





Re: [PATCH 11/21] migration: Move migration_total_bytes() to migration-stats.c

2023-05-15 Thread Cédric Le Goater

On 5/8/23 15:08, Juan Quintela wrote:

Once there, rename it to migration_transferred_bytes() and pass a
QEMUFile instead of a migration object.

Signed-off-by: Juan Quintela 


Reviewed-by: Cédric Le Goater 

C.



---
  migration/migration-stats.c |  6 ++
  migration/migration-stats.h |  9 +
  migration/migration.c   | 13 +++--
  3 files changed, 18 insertions(+), 10 deletions(-)

diff --git a/migration/migration-stats.c b/migration/migration-stats.c
index e01842cabc..fba66c4577 100644
--- a/migration/migration-stats.c
+++ b/migration/migration-stats.c
@@ -63,3 +63,9 @@ void migration_rate_limit_account(uint64_t len)
  {
  stat64_add(&mig_stats.rate_limit_used, len);
  }
+
+uint64_t migration_transferred_bytes(QEMUFile *f)
+{
+return qemu_file_transferred(f) + stat64_get(&mig_stats.multifd_bytes);
+}
+
diff --git a/migration/migration-stats.h b/migration/migration-stats.h
index 65f11ec7d1..c82fce9608 100644
--- a/migration/migration-stats.h
+++ b/migration/migration-stats.h
@@ -137,4 +137,13 @@ void migration_rate_limit_reset(void);
   */
  void migration_rate_limit_set(uint64_t new_rate);
  
+/**

+ * migration_transferred_bytes: Return number of bytes transferred
+ *
+ * Returtns how many bytes have we transferred since the beginning of
+ * the migration.  It accounts for bytes sent through any migration
+ * channel, multifd, qemu_file, rdma, 
+ */
+uint64_t migration_transferred_bytes(QEMUFile *f);
+
  #endif
diff --git a/migration/migration.c b/migration/migration.c
index 370998600e..e6d262ffe1 100644
--- a/migration/migration.c
+++ b/migration/migration.c
@@ -2617,16 +2617,9 @@ static MigThrError migration_detect_error(MigrationState *s)
  }
  }
  
-/* How many bytes have we transferred since the beginning of the migration */

-static uint64_t migration_total_bytes(MigrationState *s)
-{
-return qemu_file_transferred(s->to_dst_file) +
-stat64_get(&mig_stats.multifd_bytes);
-}
-
  static void migration_calculate_complete(MigrationState *s)
  {
-uint64_t bytes = migration_total_bytes(s);
+uint64_t bytes = migration_transferred_bytes(s->to_dst_file);
  int64_t end_time = qemu_clock_get_ms(QEMU_CLOCK_REALTIME);
  int64_t transfer_time;
  
@@ -2652,7 +2645,7 @@ static void update_iteration_initial_status(MigrationState *s)

   * wrong speed calculation.
   */
  s->iteration_start_time = qemu_clock_get_ms(QEMU_CLOCK_REALTIME);
-s->iteration_initial_bytes = migration_total_bytes(s);
+s->iteration_initial_bytes = migration_transferred_bytes(s->to_dst_file);
  s->iteration_initial_pages = ram_get_total_transferred_pages();
  }
  
@@ -2667,7 +2660,7 @@ static void migration_update_counters(MigrationState *s,

  return;
  }
  
-current_bytes = migration_total_bytes(s);

+current_bytes = migration_transferred_bytes(s->to_dst_file);
  transferred = current_bytes - s->iteration_initial_bytes;
  time_spent = current_time - s->iteration_start_time;
  bandwidth = (double)transferred / time_spent;





Re: [PATCH 10/21] migration: Move rate_limit_max and rate_limit_used to migration_stats

2023-05-15 Thread Juan Quintela
Cédric Le Goater  wrote:
> On 5/8/23 15:08, Juan Quintela wrote:
>> This way we can make them atomic and use these functions from any
>> place.  I also moved all functions that use rate_limit to
>> migration-stats.
>> Functions got renamed, they are not qemu_file anymore.
>> qemu_file_rate_limit -> migration_rate_limit_exceeded
>> qemu_file_set_rate_limit -> migration_rate_limit_set
>> qemu_file_get_rate_limit -> migration_rate_limit_get
>> qemu_file_reset_rate_limit -> migration_rate_limit_reset
>> qemu_file_acct_rate_limit -> migration_rate_limit_account.
>> Signed-off-by: Juan Quintela 
>> ---
>> If you have any good suggestion for better names, I am all ears.
>
> May be :
>
>  qemu_file_rate_limit -> migration_rate_limit_is_exceeded

I try not to put _is_ in function names.  If it needs to be there, I
think that I need to rename the function.

migration_rate_limit_exceeded()

seems clear to me.


>  qemu_file_acct_rate_limit -> migration_rate_limit_inc

My problem for this one is that we are not increasing the rate_limit, we
are "decreasing" the amount of data we have for this period.  That is
why I thought about _account(), but who knows.
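To put that in code (an illustrative sketch, not from the series):

    /* migration_rate_limit_account(len) grows rate_limit_used, so it
     * effectively consumes the remaining budget of this window: */
    remaining = rate_limit_max - rate_limit_used;

"_inc" would describe the counter, while "_account" describes what
happens to the budget.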


> Also, migration_rate_limit() would need some prefix to understand what is
> its purpose.

What do you mean here?
This is the only rate_limit that I can think in migration.

> Do we really need "_limit" in the names ?

You have a point here.

If nobody complains/suggest anything else, I will drop the _limit for
the next submission.

Thanks very much.




Re: [PULL 09/28] block: bdrv/blk_co_unref() for calls in coroutine context

2023-05-15 Thread Kevin Wolf
Am 11.05.2023 um 17:32 hat Michael Tokarev geschrieben:
> 10.05.2023 15:20, Kevin Wolf wrote:
> > These functions must not be called in coroutine context, because they
> > need write access to the graph.
> 
> How important for this and 2 surrounding changes to be for 7.2-stable
> (if we'll ever release one)? It smells like real bugs are being fixed
> here, is it ever possible to hit those in 7.2?
> 
> Provided that whole no_coroutine_fn  infrastructure is missing there,
> including the no_co_wrapper parts?  It's not difficult to back-port some
> of that stuff to 7.2.

In theory this has always been wrong, but we've only seen actual bugs
manifesting in 8.0 with the other multiqueue-related changes. So I think
it's safe to skip them for 7.2.

The bug fixed by the previous patch (bdrv_activate()) might not even
theoretically be a problem while bdrv_co_activate() didn't exist, though
I haven't investigated this in detail.

Kevin




Re: [PATCH 12/21] migration: Add a trace for migration_transferred_bytes

2023-05-15 Thread Cédric Le Goater

On 5/8/23 15:09, Juan Quintela wrote:

Signed-off-by: Juan Quintela 


Reviewed-by: Cédric Le Goater 

Thanks,

C.



---
  migration/migration-stats.c | 8 ++--
  migration/trace-events  | 3 +++
  2 files changed, 9 insertions(+), 2 deletions(-)

diff --git a/migration/migration-stats.c b/migration/migration-stats.c
index fba66c4577..46b2b0d06e 100644
--- a/migration/migration-stats.c
+++ b/migration/migration-stats.c
@@ -14,6 +14,7 @@
  #include "qemu/stats64.h"
  #include "qemu/timer.h"
  #include "qemu-file.h"
+#include "trace.h"
  #include "migration-stats.h"
  
  MigrationAtomicStats mig_stats;

@@ -66,6 +67,9 @@ void migration_rate_limit_account(uint64_t len)
  
  uint64_t migration_transferred_bytes(QEMUFile *f)

  {
-return qemu_file_transferred(f) + stat64_get(&mig_stats.multifd_bytes);
-}
+uint64_t multifd = stat64_get(&mig_stats.multifd_bytes);
+uint64_t qemu_file = qemu_file_transferred(f);
  
+trace_migration_transferred_bytes(qemu_file, multifd);

+return qemu_file + multifd;
+}
diff --git a/migration/trace-events b/migration/trace-events
index 92161eeac5..4b6e802833 100644
--- a/migration/trace-events
+++ b/migration/trace-events
@@ -186,6 +186,9 @@ process_incoming_migration_co_end(int ret, int ps) "ret=%d postcopy-state=%d"
  process_incoming_migration_co_postcopy_end_main(void) ""
  postcopy_preempt_enabled(bool value) "%d"
  
+# migration-stats

+migration_transferred_bytes(uint64_t qemu_file, uint64_t multifd) "qemu_file %" PRIu64 " multifd %" PRIu64
+
  # channel.c
  migration_set_incoming_channel(void *ioc, const char *ioctype) "ioc=%p ioctype=%s"
  migration_set_outgoing_channel(void *ioc, const char *ioctype, const char *hostname, void *err)  "ioc=%p ioctype=%s hostname=%s err=%p"





Re: [PATCH 13/21] migration: Use migration_transferred_bytes() to calculate rate_limit

2023-05-15 Thread Cédric Le Goater

On 5/8/23 15:09, Juan Quintela wrote:

Signed-off-by: Juan Quintela 


Reviewed-by: Cédric Le Goater 

C.


---
  migration/migration-stats.c | 7 +--
  migration/migration-stats.h | 6 +-
  migration/migration.c   | 2 +-
  3 files changed, 11 insertions(+), 4 deletions(-)

diff --git a/migration/migration-stats.c b/migration/migration-stats.c
index 46b2b0d06e..eb1a2c1ad4 100644
--- a/migration/migration-stats.c
+++ b/migration/migration-stats.c
@@ -31,7 +31,9 @@ bool migration_rate_limit_exceeded(QEMUFile *f)
  return true;
  }
  
-uint64_t rate_limit_used = stat64_get(&mig_stats.rate_limit_used);

+uint64_t rate_limit_start = stat64_get(&mig_stats.rate_limit_start);
+uint64_t rate_limit_current = migration_transferred_bytes(f);
+uint64_t rate_limit_used = rate_limit_current - rate_limit_start;
 uint64_t rate_limit_max = stat64_get(&mig_stats.rate_limit_max);
  /*
   *  rate_limit_max == 0 means no rate_limit enfoncement.
@@ -55,9 +57,10 @@ void migration_rate_limit_set(uint64_t limit)
  stat64_set(&mig_stats.rate_limit_max, limit);
  }
  
-void migration_rate_limit_reset(void)

+void migration_rate_limit_reset(QEMUFile *f)
  {
  stat64_set(&mig_stats.rate_limit_used, 0);
+stat64_set(&mig_stats.rate_limit_start, migration_transferred_bytes(f));
  }
  
  void migration_rate_limit_account(uint64_t len)

diff --git a/migration/migration-stats.h b/migration/migration-stats.h
index c82fce9608..4029f1deab 100644
--- a/migration/migration-stats.h
+++ b/migration/migration-stats.h
@@ -69,6 +69,10 @@ typedef struct {
   * Number of bytes sent during precopy stage.
   */
  Stat64 precopy_bytes;
+/*
+ * Amount of transferred data at the start of current cycle.
+ */
+Stat64 rate_limit_start;
  /*
   * Maximum amount of data we can send in a cycle.
   */
@@ -126,7 +130,7 @@ uint64_t migration_rate_limit_get(void);
   *
   * This is called when we know we start a new transfer cycle.
   */
-void migration_rate_limit_reset(void);
+void migration_rate_limit_reset(QEMUFile *f);
  
  /**

   * migration_rate_limit_set: Set the maximum amount that can be transferred.
diff --git a/migration/migration.c b/migration/migration.c
index e6d262ffe1..6922c612e4 100644
--- a/migration/migration.c
+++ b/migration/migration.c
@@ -2684,7 +2684,7 @@ static void migration_update_counters(MigrationState *s,
  stat64_get(&mig_stats.dirty_bytes_last_sync) / bandwidth;
  }
  
-migration_rate_limit_reset();

+migration_rate_limit_reset(s->to_dst_file);
  
  update_iteration_initial_status(s);
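A worked example of the derivation above: if migration_transferred_bytes()
was 300 MiB when the cycle started (saved in rate_limit_start) and is
350 MiB now, then rate_limit_used comes out as 50 MiB, with no separate
counter to keep in sync.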
  





Re: [PATCH 10/21] migration: Move rate_limit_max and rate_limit_used to migration_stats

2023-05-15 Thread Cédric Le Goater

On 5/8/23 15:08, Juan Quintela wrote:

This way we can make them atomic and use these functions from any
place.  I also moved all functions that use rate_limit to
migration-stats.

Functions got renamed, they are not qemu_file anymore.

qemu_file_rate_limit -> migration_rate_limit_exceeded
qemu_file_set_rate_limit -> migration_rate_limit_set
qemu_file_get_rate_limit -> migration_rate_limit_get
qemu_file_reset_rate_limit -> migration_rate_limit_reset
qemu_file_acct_rate_limit -> migration_rate_limit_account.

Signed-off-by: Juan Quintela 

---

If you have any good suggestion for better names, I am all ears.


May be :

 qemu_file_rate_limit -> migration_rate_limit_is_exceeded
 qemu_file_acct_rate_limit -> migration_rate_limit_inc


Also, migration_rate_limit() would need some prefix to understand what is
its purpose.

Do we really need "_limit" in the names ?

Thanks,

C.


---
  hw/ppc/spapr.c  |  5 +--
  hw/s390x/s390-stattrib.c|  2 +-
  include/migration/qemu-file-types.h |  2 +-
  migration/block-dirty-bitmap.c  |  2 +-
  migration/block.c   |  5 +--
  migration/migration-stats.c | 41 ++
  migration/migration-stats.h | 42 +++
  migration/migration.c   | 14 
  migration/multifd.c |  2 +-
  migration/options.c |  7 ++--
  migration/qemu-file.c   | 53 ++---
  migration/qemu-file.h   | 11 --
  migration/ram.c |  2 +-
  migration/savevm.c  |  2 +-
  14 files changed, 108 insertions(+), 82 deletions(-)

diff --git a/hw/ppc/spapr.c b/hw/ppc/spapr.c
index ddc9c7b1a1..dbd2753278 100644
--- a/hw/ppc/spapr.c
+++ b/hw/ppc/spapr.c
@@ -2166,7 +2166,7 @@ static void htab_save_first_pass(QEMUFile *f, SpaprMachineState *spapr,
  break;
  }
  }
-} while ((index < htabslots) && !qemu_file_rate_limit(f));
+} while ((index < htabslots) && !migration_rate_limit_exceeded(f));
  
  if (index >= htabslots) {

  assert(index == htabslots);
@@ -2237,7 +2237,8 @@ static int htab_save_later_pass(QEMUFile *f, SpaprMachineState *spapr,
  assert(index == htabslots);
  index = 0;
  }
-} while ((examined < htabslots) && (!qemu_file_rate_limit(f) || final));
+} while ((examined < htabslots) &&
+ (!migration_rate_limit_exceeded(f) || final));
  
  if (index >= htabslots) {

  assert(index == htabslots);
diff --git a/hw/s390x/s390-stattrib.c b/hw/s390x/s390-stattrib.c
index aed919ad7d..fb0a20f2e1 100644
--- a/hw/s390x/s390-stattrib.c
+++ b/hw/s390x/s390-stattrib.c
@@ -209,7 +209,7 @@ static int cmma_save(QEMUFile *f, void *opaque, int final)
  return -ENOMEM;
  }
  
-while (final ? 1 : qemu_file_rate_limit(f) == 0) {

+while (final ? 1 : migration_rate_limit_exceeded(f) == 0) {
  reallen = sac->get_stattr(sas, &start_gfn, buflen, buf);
  if (reallen < 0) {
  g_free(buf);
diff --git a/include/migration/qemu-file-types.h b/include/migration/qemu-file-types.h
index 1436f9ce92..0354f45198 100644
--- a/include/migration/qemu-file-types.h
+++ b/include/migration/qemu-file-types.h
@@ -165,6 +165,6 @@ size_t coroutine_mixed_fn qemu_get_counted_string(QEMUFile *f, char buf[256]);
  
  void qemu_put_counted_string(QEMUFile *f, const char *name);
  
-int qemu_file_rate_limit(QEMUFile *f);

+bool migration_rate_limit_exceeded(QEMUFile *f);
  
  #endif

diff --git a/migration/block-dirty-bitmap.c b/migration/block-dirty-bitmap.c
index 20f36e6bd8..a815678926 100644
--- a/migration/block-dirty-bitmap.c
+++ b/migration/block-dirty-bitmap.c
@@ -706,7 +706,7 @@ static void bulk_phase(QEMUFile *f, DBMSaveState *s, bool limit)
 QSIMPLEQ_FOREACH(dbms, &s->dbms_list, entry) {
  while (!dbms->bulk_completed) {
  bulk_phase_send_chunk(f, s, dbms);
-if (limit && qemu_file_rate_limit(f)) {
+if (limit && migration_rate_limit_exceeded(f)) {
  return;
  }
  }
diff --git a/migration/block.c b/migration/block.c
index 12617b4152..fc1caa9ca6 100644
--- a/migration/block.c
+++ b/migration/block.c
@@ -23,6 +23,7 @@
  #include "block/dirty-bitmap.h"
  #include "migration/misc.h"
  #include "migration.h"
+#include "migration-stats.h"
  #include "migration/register.h"
  #include "qemu-file.h"
  #include "migration/vmstate.h"
@@ -625,7 +626,7 @@ static int flush_blks(QEMUFile *f)
  
  blk_mig_lock();

 while ((blk = QSIMPLEQ_FIRST(&block_mig_state.blk_list)) != NULL) {
-if (qemu_file_rate_limit(f)) {
+if (migration_rate_limit_exceeded(f)) {
  break;
  }
  if (blk->ret < 0) {
@@ -762,7 +763,7 @@ static int block_save_iterate(QEMUFile *f, void *opaque)
  /* control the rate of transfer */
  blk_mig_lock();
  while 

[PULL 05/11] migration: Teach dirtyrate about qemu_target_page_bits()

2023-05-15 Thread Juan Quintela
Signed-off-by: Juan Quintela 
Reviewed-by: Philippe Mathieu-Daudé 
Reviewed-by: Richard Henderson 
Message-Id: <20230511141208.17779-5-quint...@redhat.com>
---
 migration/dirtyrate.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/migration/dirtyrate.c b/migration/dirtyrate.c
index 9383e91cd6..f32a690a56 100644
--- a/migration/dirtyrate.c
+++ b/migration/dirtyrate.c
@@ -401,7 +401,7 @@ static void get_ramblock_dirty_info(RAMBlock *block,
 sample_pages_per_gigabytes) >> 30;
 /* Right shift TARGET_PAGE_BITS to calc page count */
 info->ramblock_pages = qemu_ram_get_used_length(block) >>
-   TARGET_PAGE_BITS;
+   qemu_target_page_bits();
 info->ramblock_addr = qemu_ram_get_host_addr(block);
 strcpy(info->idstr, qemu_ram_get_idstr(block));
 }
@@ -512,7 +512,7 @@ find_block_matched(RAMBlock *block, int count,
 
 if (infos[i].ramblock_addr != qemu_ram_get_host_addr(block) ||
 infos[i].ramblock_pages !=
-(qemu_ram_get_used_length(block) >> TARGET_PAGE_BITS)) {
+(qemu_ram_get_used_length(block) >> qemu_target_page_bits())) {
 trace_find_page_matched(block->idstr);
 return NULL;
 }
-- 
2.40.1
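As a quick sanity check of the shift above, assuming a 4 KiB target page
(qemu_target_page_bits() == 12): a RAMBlock with a used length of 4 GiB
yields 0x100000000 >> 12 = 1048576 pages, the same result as the old
TARGET_PAGE_BITS arithmetic on such a target.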




[PULL 09/11] qemu-file: make qemu_file_[sg]et_rate_limit() use an uint64_t

2023-05-15 Thread Juan Quintela
It is really size_t.  Everything else uses uint64_t, so move this to
uint64_t as well.  A size can't be negative anyway.

Signed-off-by: Juan Quintela 
Reviewed-by: Cédric Le Goater 
Message-Id: <20230508130909.65420-5-quint...@redhat.com>
---
 migration/qemu-file.h | 4 ++--
 migration/qemu-file.c | 6 +++---
 2 files changed, 5 insertions(+), 5 deletions(-)

diff --git a/migration/qemu-file.h b/migration/qemu-file.h
index 4ee58a87dd..13c7c78c0d 100644
--- a/migration/qemu-file.h
+++ b/migration/qemu-file.h
@@ -139,8 +139,8 @@ void qemu_file_reset_rate_limit(QEMUFile *f);
  * need to be applied to the rate limiting calcuations
  */
 void qemu_file_acct_rate_limit(QEMUFile *f, int64_t len);
-void qemu_file_set_rate_limit(QEMUFile *f, int64_t new_rate);
-int64_t qemu_file_get_rate_limit(QEMUFile *f);
+void qemu_file_set_rate_limit(QEMUFile *f, uint64_t new_rate);
+uint64_t qemu_file_get_rate_limit(QEMUFile *f);
 int qemu_file_get_error_obj(QEMUFile *f, Error **errp);
 int qemu_file_get_error_obj_any(QEMUFile *f1, QEMUFile *f2, Error **errp);
 void qemu_file_set_error_obj(QEMUFile *f, int ret, Error *err);
diff --git a/migration/qemu-file.c b/migration/qemu-file.c
index 60f6345033..94d1069c8e 100644
--- a/migration/qemu-file.c
+++ b/migration/qemu-file.c
@@ -44,7 +44,7 @@ struct QEMUFile {
  * Maximum amount of data in bytes to transfer during one
  * rate limiting time window
  */
-int64_t rate_limit_max;
+uint64_t rate_limit_max;
 /*
  * Total amount of data in bytes queued for transfer
  * during this rate limiting time window
@@ -738,12 +738,12 @@ int qemu_file_rate_limit(QEMUFile *f)
 return 0;
 }
 
-int64_t qemu_file_get_rate_limit(QEMUFile *f)
+uint64_t qemu_file_get_rate_limit(QEMUFile *f)
 {
 return f->rate_limit_max;
 }
 
-void qemu_file_set_rate_limit(QEMUFile *f, int64_t limit)
+void qemu_file_set_rate_limit(QEMUFile *f, uint64_t limit)
 {
 /*
  * 'limit' is per second.  But we check it each 100 miliseconds.
-- 
2.40.1




[PULL 04/11] migration: Teach dirtyrate about qemu_target_page_size()

2023-05-15 Thread Juan Quintela
Signed-off-by: Juan Quintela 
Reviewed-by: Richard Henderson 
Reviewed-by: Philippe Mathieu-Daudé 
Message-Id: <20230511141208.17779-4-quint...@redhat.com>
---
 migration/dirtyrate.c | 8 +---
 1 file changed, 5 insertions(+), 3 deletions(-)

diff --git a/migration/dirtyrate.c b/migration/dirtyrate.c
index ae52c42c4c..9383e91cd6 100644
--- a/migration/dirtyrate.c
+++ b/migration/dirtyrate.c
@@ -313,6 +313,7 @@ static void update_dirtyrate(uint64_t msec)
  */
 static uint32_t compute_page_hash(void *ptr)
 {
+size_t page_size = qemu_target_page_size();
 uint32_t i;
 uint64_t v1, v2, v3, v4;
 uint64_t res;
@@ -322,14 +323,14 @@ static uint32_t compute_page_hash(void *ptr)
 v2 = QEMU_XXHASH_SEED + XXH_PRIME64_2;
 v3 = QEMU_XXHASH_SEED + 0;
 v4 = QEMU_XXHASH_SEED - XXH_PRIME64_1;
-for (i = 0; i < TARGET_PAGE_SIZE / 8; i += 4) {
+for (i = 0; i < page_size / 8; i += 4) {
 v1 = XXH64_round(v1, p[i + 0]);
 v2 = XXH64_round(v2, p[i + 1]);
 v3 = XXH64_round(v3, p[i + 2]);
 v4 = XXH64_round(v4, p[i + 3]);
 }
 res = XXH64_mergerounds(v1, v2, v3, v4);
-res += TARGET_PAGE_SIZE;
+res += page_size;
 res = XXH64_avalanche(res);
 return (uint32_t)(res & UINT32_MAX);
 }
@@ -344,7 +345,8 @@ static uint32_t get_ramblock_vfn_hash(struct RamblockDirtyInfo *info,
 {
 uint32_t hash;
 
-hash = compute_page_hash(info->ramblock_addr + vfn * TARGET_PAGE_SIZE);
+hash = compute_page_hash(info->ramblock_addr +
+ vfn * qemu_target_page_size());
 
 trace_get_ramblock_vfn_hash(info->idstr, vfn, hash);
 return hash;
-- 
2.40.1




[PULL 03/11] Use new created qemu_target_pages_to_MiB()

2023-05-15 Thread Juan Quintela
Signed-off-by: Juan Quintela 
Reviewed-by: Richard Henderson 
Message-Id: <20230511141208.17779-3-quint...@redhat.com>
---
 migration/dirtyrate.c | 11 +--
 softmmu/dirtylimit.c  | 11 +++
 2 files changed, 8 insertions(+), 14 deletions(-)

diff --git a/migration/dirtyrate.c b/migration/dirtyrate.c
index 5bac984fa5..ae52c42c4c 100644
--- a/migration/dirtyrate.c
+++ b/migration/dirtyrate.c
@@ -16,6 +16,7 @@
 #include "qapi/error.h"
 #include "cpu.h"
 #include "exec/ramblock.h"
+#include "exec/target_page.h"
 #include "exec/ram_addr.h"
 #include "qemu/rcu_queue.h"
 #include "qemu/main-loop.h"
@@ -75,13 +76,11 @@ static inline void record_dirtypages(DirtyPageRecord *dirty_pages,
 static int64_t do_calculate_dirtyrate(DirtyPageRecord dirty_pages,
   int64_t calc_time_ms)
 {
-uint64_t memory_size_MB;
 uint64_t increased_dirty_pages =
 dirty_pages.end_pages - dirty_pages.start_pages;
+uint64_t memory_size_MiB = qemu_target_pages_to_MiB(increased_dirty_pages);
 
-memory_size_MB = (increased_dirty_pages * TARGET_PAGE_SIZE) >> 20;
-
-return memory_size_MB * 1000 / calc_time_ms;
+return memory_size_MiB * 1000 / calc_time_ms;
 }
 
 void global_dirty_log_change(unsigned int flag, bool start)
@@ -292,8 +291,8 @@ static void update_dirtyrate_stat(struct RamblockDirtyInfo *info)
 DirtyStat.page_sampling.total_dirty_samples += info->sample_dirty_count;
 DirtyStat.page_sampling.total_sample_count += info->sample_pages_count;
 /* size of total pages in MB */
-DirtyStat.page_sampling.total_block_mem_MB += (info->ramblock_pages *
-   TARGET_PAGE_SIZE) >> 20;
+DirtyStat.page_sampling.total_block_mem_MB +=
+qemu_target_pages_to_MiB(info->ramblock_pages);
 }
 
 static void update_dirtyrate(uint64_t msec)
diff --git a/softmmu/dirtylimit.c b/softmmu/dirtylimit.c
index 71bf6dc7a4..015a9038d1 100644
--- a/softmmu/dirtylimit.c
+++ b/softmmu/dirtylimit.c
@@ -235,20 +235,15 @@ bool dirtylimit_vcpu_index_valid(int cpu_index)
 static uint64_t dirtylimit_dirty_ring_full_time(uint64_t dirtyrate)
 {
 static uint64_t max_dirtyrate;
-unsigned target_page_bits = qemu_target_page_bits();
-uint64_t dirty_ring_size_MB;
+uint64_t dirty_ring_size_MiB;
 
-/* So far, the largest (non-huge) page size is 64k, i.e. 16 bits. */
-assert(target_page_bits < 20);
-
-/* Convert ring size (pages) to MiB (2**20). */
-dirty_ring_size_MB = kvm_dirty_ring_size() >> (20 - target_page_bits);
+dirty_ring_size_MiB = qemu_target_pages_to_MiB(kvm_dirty_ring_size());
 
 if (max_dirtyrate < dirtyrate) {
 max_dirtyrate = dirtyrate;
 }
 
-return dirty_ring_size_MB * 100 / max_dirtyrate;
+return dirty_ring_size_MiB * 100 / max_dirtyrate;
 }
 
 static inline bool dirtylimit_done(uint64_t quota,
-- 
2.40.1




[PULL 08/11] migration: We set the rate_limit by a second

2023-05-15 Thread Juan Quintela
That the implementation does the check every 100 milliseconds is an
implementation detail that shouldn't be visible on the interface.
Notice that all callers of qemu_file_set_rate_limit() either did the
division themselves or passed 0, so this change is a NOP.
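A quick sanity check, assuming XFER_LIMIT_RATIO is 10 (one check every
100 ms, i.e. 1000/100): setting max_bandwidth to 100 MiB/s now stores
100 MiB / 10 = 10 MiB as the per-window budget inside
qemu_file_set_rate_limit(), exactly the value callers previously computed
by hand.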

Signed-off-by: Juan Quintela 
Reviewed-by: Cédric Le Goater 
Message-Id: <20230508130909.65420-4-quint...@redhat.com>
---
 migration/migration.c | 7 +++
 migration/options.c   | 4 ++--
 migration/qemu-file.c | 6 +-
 3 files changed, 10 insertions(+), 7 deletions(-)

diff --git a/migration/migration.c b/migration/migration.c
index 5636119e8e..73ac63746b 100644
--- a/migration/migration.c
+++ b/migration/migration.c
@@ -2140,7 +2140,7 @@ static int postcopy_start(MigrationState *ms)
  * will notice we're in POSTCOPY_ACTIVE and not actually
  * wrap their state up here
  */
-qemu_file_set_rate_limit(ms->to_dst_file, bandwidth / XFER_LIMIT_RATIO);
+qemu_file_set_rate_limit(ms->to_dst_file, bandwidth);
 if (migrate_postcopy_ram()) {
 /* Ping just for debugging, helps line traces up */
 qemu_savevm_send_ping(ms->to_dst_file, 2);
@@ -3231,11 +3231,10 @@ void migrate_fd_connect(MigrationState *s, Error *error_in)
 
 if (resume) {
 /* This is a resumed migration */
-rate_limit = migrate_max_postcopy_bandwidth() /
-XFER_LIMIT_RATIO;
+rate_limit = migrate_max_postcopy_bandwidth();
 } else {
 /* This is a fresh new migration */
-rate_limit = migrate_max_bandwidth() / XFER_LIMIT_RATIO;
+rate_limit = migrate_max_bandwidth();
 
 /* Notify before starting migration thread */
 notifier_list_notify(_state_notifiers, s);
diff --git a/migration/options.c b/migration/options.c
index 7ed88b7b32..c2a278ee2d 100644
--- a/migration/options.c
+++ b/migration/options.c
@@ -1243,7 +1243,7 @@ static void migrate_params_apply(MigrateSetParameters *params, Error **errp)
 s->parameters.max_bandwidth = params->max_bandwidth;
 if (s->to_dst_file && !migration_in_postcopy()) {
 qemu_file_set_rate_limit(s->to_dst_file,
-s->parameters.max_bandwidth / XFER_LIMIT_RATIO);
+s->parameters.max_bandwidth);
 }
 }
 
@@ -1273,7 +1273,7 @@ static void migrate_params_apply(MigrateSetParameters *params, Error **errp)
 s->parameters.max_postcopy_bandwidth = params->max_postcopy_bandwidth;
 if (s->to_dst_file && migration_in_postcopy()) {
 qemu_file_set_rate_limit(s->to_dst_file,
-s->parameters.max_postcopy_bandwidth / XFER_LIMIT_RATIO);
+s->parameters.max_postcopy_bandwidth);
 }
 }
 if (params->has_max_cpu_throttle) {
diff --git a/migration/qemu-file.c b/migration/qemu-file.c
index 61fb580342..60f6345033 100644
--- a/migration/qemu-file.c
+++ b/migration/qemu-file.c
@@ -29,6 +29,7 @@
 #include "migration.h"
 #include "qemu-file.h"
 #include "trace.h"
+#include "options.h"
 #include "qapi/error.h"
 
 #define IO_BUF_SIZE 32768
@@ -744,7 +745,10 @@ int64_t qemu_file_get_rate_limit(QEMUFile *f)
 
 void qemu_file_set_rate_limit(QEMUFile *f, int64_t limit)
 {
-f->rate_limit_max = limit;
+/*
+ * 'limit' is per second.  But we check it each 100 miliseconds.
+ */
+f->rate_limit_max = limit / XFER_LIMIT_RATIO;
 }
 
 void qemu_file_reset_rate_limit(QEMUFile *f)
-- 
2.40.1




[PULL 07/11] migration: A rate limit value of 0 is valid

2023-05-15 Thread Juan Quintela
And it is the cleanest way to express "no rate limit".

Signed-off-by: Juan Quintela 
Reviewed-by: Cédric Le Goater 
Message-Id: <20230508130909.65420-2-quint...@redhat.com>
---
 migration/migration.c | 7 +--
 1 file changed, 1 insertion(+), 6 deletions(-)

diff --git a/migration/migration.c b/migration/migration.c
index 439e8651df..5636119e8e 100644
--- a/migration/migration.c
+++ b/migration/migration.c
@@ -2140,12 +2140,7 @@ static int postcopy_start(MigrationState *ms)
  * will notice we're in POSTCOPY_ACTIVE and not actually
  * wrap their state up here
  */
-/* 0 max-postcopy-bandwidth means unlimited */
-if (!bandwidth) {
-qemu_file_set_rate_limit(ms->to_dst_file, INT64_MAX);
-} else {
-qemu_file_set_rate_limit(ms->to_dst_file, bandwidth / XFER_LIMIT_RATIO);
-}
+qemu_file_set_rate_limit(ms->to_dst_file, bandwidth / XFER_LIMIT_RATIO);
 if (migrate_postcopy_ram()) {
 /* Ping just for debugging, helps line traces up */
 qemu_savevm_send_ping(ms->to_dst_file, 2);
-- 
2.40.1




[PULL 11/11] qemu-file: Remove total from qemu_file_total_transferred_*()

2023-05-15 Thread Juan Quintela
The function name is already quite long.

Signed-off-by: Juan Quintela 
Reviewed-by: Cédric Le Goater 
Message-Id: <20230508130909.65420-7-quint...@redhat.com>
---
 migration/qemu-file.h | 10 +-
 migration/block.c |  4 ++--
 migration/migration.c |  2 +-
 migration/qemu-file.c |  4 ++--
 migration/savevm.c|  6 +++---
 migration/vmstate.c   |  5 ++---
 6 files changed, 15 insertions(+), 16 deletions(-)

diff --git a/migration/qemu-file.h b/migration/qemu-file.h
index 6905825f23..bcc39081f2 100644
--- a/migration/qemu-file.h
+++ b/migration/qemu-file.h
@@ -68,7 +68,7 @@ void qemu_file_set_hooks(QEMUFile *f, const QEMUFileHooks *hooks);
 int qemu_fclose(QEMUFile *f);
 
 /*
- * qemu_file_total_transferred:
+ * qemu_file_transferred:
  *
  * Report the total number of bytes transferred with
  * this file.
@@ -83,19 +83,19 @@ int qemu_fclose(QEMUFile *f);
  *
  * Returns: the total bytes transferred
  */
-uint64_t qemu_file_total_transferred(QEMUFile *f);
+uint64_t qemu_file_transferred(QEMUFile *f);
 
 /*
- * qemu_file_total_transferred_fast:
+ * qemu_file_transferred_fast:
  *
- * As qemu_file_total_transferred except for writable
+ * As qemu_file_transferred except for writable
  * files, where no flush is performed and the reported
  * amount will include the size of any queued buffers,
  * on top of the amount actually transferred.
  *
  * Returns: the total bytes transferred and queued
  */
-uint64_t qemu_file_total_transferred_fast(QEMUFile *f);
+uint64_t qemu_file_transferred_fast(QEMUFile *f);
 
 /*
  * put_buffer without copying the buffer.
diff --git a/migration/block.c b/migration/block.c
index a37678ce95..12617b4152 100644
--- a/migration/block.c
+++ b/migration/block.c
@@ -747,7 +747,7 @@ static int block_save_setup(QEMUFile *f, void *opaque)
 static int block_save_iterate(QEMUFile *f, void *opaque)
 {
 int ret;
-uint64_t last_bytes = qemu_file_total_transferred(f);
+uint64_t last_bytes = qemu_file_transferred(f);
 
 trace_migration_block_save("iterate", block_mig_state.submitted,
block_mig_state.transferred);
@@ -799,7 +799,7 @@ static int block_save_iterate(QEMUFile *f, void *opaque)
 }
 
 qemu_put_be64(f, BLK_MIG_FLAG_EOS);
-uint64_t delta_bytes = qemu_file_total_transferred(f) - last_bytes;
+uint64_t delta_bytes = qemu_file_transferred(f) - last_bytes;
 return (delta_bytes > 0);
 }
 
diff --git a/migration/migration.c b/migration/migration.c
index 73ac63746b..00d8ba8da0 100644
--- a/migration/migration.c
+++ b/migration/migration.c
@@ -2648,7 +2648,7 @@ static MigThrError migration_detect_error(MigrationState *s)
 /* How many bytes have we transferred since the beginning of the migration */
 static uint64_t migration_total_bytes(MigrationState *s)
 {
-return qemu_file_total_transferred(s->to_dst_file) +
+return qemu_file_transferred(s->to_dst_file) +
 stat64_get(&mig_stats.multifd_bytes);
 }
 
diff --git a/migration/qemu-file.c b/migration/qemu-file.c
index 1b39d51dd4..597054759d 100644
--- a/migration/qemu-file.c
+++ b/migration/qemu-file.c
@@ -709,7 +709,7 @@ int coroutine_mixed_fn qemu_get_byte(QEMUFile *f)
 return result;
 }
 
-uint64_t qemu_file_total_transferred_fast(QEMUFile *f)
+uint64_t qemu_file_transferred_fast(QEMUFile *f)
 {
 uint64_t ret = f->total_transferred;
 int i;
@@ -721,7 +721,7 @@ uint64_t qemu_file_total_transferred_fast(QEMUFile *f)
 return ret;
 }
 
-uint64_t qemu_file_total_transferred(QEMUFile *f)
+uint64_t qemu_file_transferred(QEMUFile *f)
 {
 qemu_fflush(f);
 return f->total_transferred;
diff --git a/migration/savevm.c b/migration/savevm.c
index 032044b1d5..e33788343a 100644
--- a/migration/savevm.c
+++ b/migration/savevm.c
@@ -927,9 +927,9 @@ static int vmstate_load(QEMUFile *f, SaveStateEntry *se)
 static void vmstate_save_old_style(QEMUFile *f, SaveStateEntry *se,
JSONWriter *vmdesc)
 {
-uint64_t old_offset = qemu_file_total_transferred_fast(f);
+uint64_t old_offset = qemu_file_transferred_fast(f);
 se->ops->save_state(f, se->opaque);
-uint64_t size = qemu_file_total_transferred_fast(f) - old_offset;
+uint64_t size = qemu_file_transferred_fast(f) - old_offset;
 
 if (vmdesc) {
 json_writer_int64(vmdesc, "size", size);
@@ -2956,7 +2956,7 @@ bool save_snapshot(const char *name, bool overwrite, const char *vmstate,
 goto the_end;
 }
 ret = qemu_savevm_state(f, errp);
-vm_state_size = qemu_file_total_transferred(f);
+vm_state_size = qemu_file_transferred(f);
 ret2 = qemu_fclose(f);
 if (ret < 0) {
 goto the_end;
diff --git a/migration/vmstate.c b/migration/vmstate.c
index 351f56104e..af01d54b6f 100644
--- a/migration/vmstate.c
+++ b/migration/vmstate.c
@@ -361,7 +361,7 @@ int vmstate_save_state_v(QEMUFile *f, const VMStateDescription *vmsd,
 void *curr_elem = first_elem + size * i;
 
 vmsd_desc_field_start(vmsd, 

[PULL 06/11] migration: Make dirtyrate.c target independent

2023-05-15 Thread Juan Quintela
After the previous two patches, there is nothing else that is target
specific.

Signed-off-by: Juan Quintela 
Reviewed-by: Richard Henderson 
Reviewed-by: Philippe Mathieu-Daudé 
Message-Id: <20230511141208.17779-6-quint...@redhat.com>
---
 migration/dirtyrate.c | 2 --
 migration/meson.build | 4 ++--
 2 files changed, 2 insertions(+), 4 deletions(-)

diff --git a/migration/dirtyrate.c b/migration/dirtyrate.c
index f32a690a56..c06f12c39d 100644
--- a/migration/dirtyrate.c
+++ b/migration/dirtyrate.c
@@ -14,10 +14,8 @@
 #include "qemu/error-report.h"
 #include <zlib.h>
 #include "qapi/error.h"
-#include "cpu.h"
 #include "exec/ramblock.h"
 #include "exec/target_page.h"
-#include "exec/ram_addr.h"
 #include "qemu/rcu_queue.h"
 #include "qemu/main-loop.h"
 #include "qapi/qapi-commands-migration.h"
diff --git a/migration/meson.build b/migration/meson.build
index eb41b77db9..dc8b1daef5 100644
--- a/migration/meson.build
+++ b/migration/meson.build
@@ -13,6 +13,7 @@ softmmu_ss.add(files(
   'block-dirty-bitmap.c',
   'channel.c',
   'channel-block.c',
+  'dirtyrate.c',
   'exec.c',
   'fd.c',
   'global_state.c',
@@ -42,6 +43,5 @@ endif
 softmmu_ss.add(when: zstd, if_true: files('multifd-zstd.c'))
 
 specific_ss.add(when: 'CONFIG_SOFTMMU',
-if_true: files('dirtyrate.c',
-   'ram.c',
+if_true: files('ram.c',
'target.c'))
-- 
2.40.1




[PULL 10/11] qemu-file: Make rate_limit_used an uint64_t

2023-05-15 Thread Juan Quintela
Change all the functions that use it.  It was already passed as
uint64_t.

Signed-off-by: Juan Quintela 
Reviewed-by: Daniel P. Berrangé 
Reviewed-by: Cédric Le Goater 
Message-Id: <20230508130909.65420-6-quint...@redhat.com>
---
 migration/qemu-file.h | 2 +-
 migration/qemu-file.c | 4 ++--
 2 files changed, 3 insertions(+), 3 deletions(-)

diff --git a/migration/qemu-file.h b/migration/qemu-file.h
index 13c7c78c0d..6905825f23 100644
--- a/migration/qemu-file.h
+++ b/migration/qemu-file.h
@@ -138,7 +138,7 @@ void qemu_file_reset_rate_limit(QEMUFile *f);
  * out of band from the main file object I/O methods, and
  * need to be applied to the rate limiting calcuations
  */
-void qemu_file_acct_rate_limit(QEMUFile *f, int64_t len);
+void qemu_file_acct_rate_limit(QEMUFile *f, uint64_t len);
 void qemu_file_set_rate_limit(QEMUFile *f, uint64_t new_rate);
 uint64_t qemu_file_get_rate_limit(QEMUFile *f);
 int qemu_file_get_error_obj(QEMUFile *f, Error **errp);
diff --git a/migration/qemu-file.c b/migration/qemu-file.c
index 94d1069c8e..1b39d51dd4 100644
--- a/migration/qemu-file.c
+++ b/migration/qemu-file.c
@@ -49,7 +49,7 @@ struct QEMUFile {
  * Total amount of data in bytes queued for transfer
  * during this rate limiting time window
  */
-int64_t rate_limit_used;
+uint64_t rate_limit_used;
 
 /* The sum of bytes transferred on the wire */
 uint64_t total_transferred;
@@ -756,7 +756,7 @@ void qemu_file_reset_rate_limit(QEMUFile *f)
 f->rate_limit_used = 0;
 }
 
-void qemu_file_acct_rate_limit(QEMUFile *f, int64_t len)
+void qemu_file_acct_rate_limit(QEMUFile *f, uint64_t len)
 {
 f->rate_limit_used += len;
 }
-- 
2.40.1




[PULL 01/11] migration/calc-dirty-rate: replaced CRC32 with xxHash

2023-05-15 Thread Juan Quintela
From: Andrei Gudkov 

This significantly reduces the overhead of dirty page
rate calculation in sampling mode.
Tested using 32GiB VM on E5-2690 CPU.

With CRC32:
total_pages=8388608 sampled_pages=16384 millis=71

With xxHash:
total_pages=8388608 sampled_pages=16384 millis=14
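For context: compute_page_hash() below is the standard XXH64 pipeline
(four 64-bit lanes, round/merge/avalanche) applied to one page and
truncated to 32 bits. With 4 KiB pages the loop consumes
TARGET_PAGE_SIZE / 8 = 512 64-bit words, four per iteration, i.e. 128
iterations per page, instead of CRC32's byte-wise table lookups.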

Signed-off-by: Andrei Gudkov 
Message-Id: 

Reviewed-by: Juan Quintela 
Signed-off-by: Juan Quintela 
---
 migration/dirtyrate.c  | 45 +-
 migration/trace-events |  4 ++--
 2 files changed, 38 insertions(+), 11 deletions(-)

diff --git a/migration/dirtyrate.c b/migration/dirtyrate.c
index 388337a332..5bac984fa5 100644
--- a/migration/dirtyrate.c
+++ b/migration/dirtyrate.c
@@ -29,6 +29,7 @@
 #include "sysemu/kvm.h"
 #include "sysemu/runstate.h"
 #include "exec/memory.h"
+#include "qemu/xxhash.h"
 
 /*
  * total_dirty_pages is procted by BQL and is used
@@ -308,6 +309,33 @@ static void update_dirtyrate(uint64_t msec)
 DirtyStat.dirty_rate = dirtyrate;
 }
 
+/*
+ * Compute hash of a single page of size TARGET_PAGE_SIZE.
+ */
+static uint32_t compute_page_hash(void *ptr)
+{
+uint32_t i;
+uint64_t v1, v2, v3, v4;
+uint64_t res;
+const uint64_t *p = ptr;
+
+v1 = QEMU_XXHASH_SEED + XXH_PRIME64_1 + XXH_PRIME64_2;
+v2 = QEMU_XXHASH_SEED + XXH_PRIME64_2;
+v3 = QEMU_XXHASH_SEED + 0;
+v4 = QEMU_XXHASH_SEED - XXH_PRIME64_1;
+for (i = 0; i < TARGET_PAGE_SIZE / 8; i += 4) {
+v1 = XXH64_round(v1, p[i + 0]);
+v2 = XXH64_round(v2, p[i + 1]);
+v3 = XXH64_round(v3, p[i + 2]);
+v4 = XXH64_round(v4, p[i + 3]);
+}
+res = XXH64_mergerounds(v1, v2, v3, v4);
+res += TARGET_PAGE_SIZE;
+res = XXH64_avalanche(res);
+return (uint32_t)(res & UINT32_MAX);
+}
+
+
 /*
  * get hash result for the sampled memory with length of TARGET_PAGE_SIZE
  * in ramblock, which starts from ramblock base address.
@@ -315,13 +343,12 @@ static void update_dirtyrate(uint64_t msec)
 static uint32_t get_ramblock_vfn_hash(struct RamblockDirtyInfo *info,
   uint64_t vfn)
 {
-uint32_t crc;
+uint32_t hash;
 
-crc = crc32(0, (info->ramblock_addr +
-vfn * TARGET_PAGE_SIZE), TARGET_PAGE_SIZE);
+hash = compute_page_hash(info->ramblock_addr + vfn * TARGET_PAGE_SIZE);
 
-trace_get_ramblock_vfn_hash(info->idstr, vfn, crc);
-return crc;
+trace_get_ramblock_vfn_hash(info->idstr, vfn, hash);
+return hash;
 }
 
 static bool save_ramblock_hash(struct RamblockDirtyInfo *info)
@@ -454,13 +481,13 @@ out:
 
 static void calc_page_dirty_rate(struct RamblockDirtyInfo *info)
 {
-uint32_t crc;
+uint32_t hash;
 int i;
 
 for (i = 0; i < info->sample_pages_count; i++) {
-crc = get_ramblock_vfn_hash(info, info->sample_page_vfn[i]);
-if (crc != info->hash_result[i]) {
-trace_calc_page_dirty_rate(info->idstr, crc, info->hash_result[i]);
+hash = get_ramblock_vfn_hash(info, info->sample_page_vfn[i]);
+if (hash != info->hash_result[i]) {
+trace_calc_page_dirty_rate(info->idstr, hash, info->hash_result[i]);
 info->sample_dirty_count++;
 }
 }
diff --git a/migration/trace-events b/migration/trace-events
index 92161eeac5..f39818c329 100644
--- a/migration/trace-events
+++ b/migration/trace-events
@@ -342,8 +342,8 @@ dirty_bitmap_load_success(void) ""
 # dirtyrate.c
 dirtyrate_set_state(const char *new_state) "new state %s"
 query_dirty_rate_info(const char *new_state) "current state %s"
-get_ramblock_vfn_hash(const char *idstr, uint64_t vfn, uint32_t crc) "ramblock name: %s, vfn: %"PRIu64 ", crc: %" PRIu32
-calc_page_dirty_rate(const char *idstr, uint32_t new_crc, uint32_t old_crc) "ramblock name: %s, new crc: %" PRIu32 ", old crc: %" PRIu32
+get_ramblock_vfn_hash(const char *idstr, uint64_t vfn, uint32_t hash) "ramblock name: %s, vfn: %"PRIu64 ", hash: %" PRIu32
+calc_page_dirty_rate(const char *idstr, uint32_t new_hash, uint32_t old_hash) "ramblock name: %s, new hash: %" PRIu32 ", old hash: %" PRIu32
 skip_sample_ramblock(const char *idstr, uint64_t ramblock_size) "ramblock name: %s, ramblock size: %" PRIu64
 find_page_matched(const char *idstr) "ramblock %s addr or size changed"
 dirtyrate_calculate(int64_t dirtyrate) "dirty rate: %" PRIi64 " MB/s"
-- 
2.40.1
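
For readers outside QEMU, the sampling scheme this patch speeds up reduces
to a small standalone program: hash a random sample of page frames, re-hash
the same frames later, and scale the mismatch count up to the whole region.
The sketch below uses FNV-1a as a cheap stand-in for xxHash; every name and
size in it is illustrative, not QEMU code.

#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define PAGE_SIZE 4096

/* Stand-in for compute_page_hash(): FNV-1a over one page. */
static uint32_t page_hash(const uint8_t *page)
{
    uint32_t h = 0x811c9dc5u;
    for (int i = 0; i < PAGE_SIZE; i++) {
        h = (h ^ page[i]) * 0x01000193u;
    }
    return h;
}

int main(void)
{
    enum { PAGES = 1024, SAMPLES = 64 };
    uint8_t *mem = calloc(PAGES, PAGE_SIZE);
    int vfn[SAMPLES];
    uint32_t before[SAMPLES];
    int dirty = 0;

    /* Sample random virtual frame numbers and record their hashes. */
    for (int i = 0; i < SAMPLES; i++) {
        vfn[i] = rand() % PAGES;
        before[i] = page_hash(mem + (size_t)vfn[i] * PAGE_SIZE);
    }

    /* Simulate the guest dirtying a quarter of the region. */
    memset(mem, 0xaa, 256 * (size_t)PAGE_SIZE);

    /* Re-hash the same samples; a changed hash means a dirty page. */
    for (int i = 0; i < SAMPLES; i++) {
        if (page_hash(mem + (size_t)vfn[i] * PAGE_SIZE) != before[i]) {
            dirty++;
        }
    }
    /* Scale the sampled ratio up to the whole region. */
    printf("estimated dirty pages: %d of %d\n",
           dirty * PAGES / SAMPLES, PAGES);
    free(mem);
    return 0;
}

Swapping in a faster per-page hash changes nothing in this flow, which is
why the patch above only touches compute_page_hash() and the trace points.
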




[PULL 02/11] softmmu: Create qemu_target_pages_to_MiB()

2023-05-15 Thread Juan Quintela
Function that converts a number of target pages into its size in MiB.

Suggested-by: Richard Henderson 
Reviewed-by: Richard Henderson 
Signed-off-by: Juan Quintela 
Message-Id: <20230511141208.17779-2-quint...@redhat.com>
---
 include/exec/target_page.h |  1 +
 softmmu/physmem.c  | 11 +++
 2 files changed, 12 insertions(+)

diff --git a/include/exec/target_page.h b/include/exec/target_page.h
index 96726c36a4..bbf37aea17 100644
--- a/include/exec/target_page.h
+++ b/include/exec/target_page.h
@@ -18,4 +18,5 @@ size_t qemu_target_page_size(void);
 int qemu_target_page_bits(void);
 int qemu_target_page_bits_min(void);
 
+size_t qemu_target_pages_to_MiB(size_t pages);
 #endif
diff --git a/softmmu/physmem.c b/softmmu/physmem.c
index 0e0182d9f2..efaed36773 100644
--- a/softmmu/physmem.c
+++ b/softmmu/physmem.c
@@ -3357,6 +3357,17 @@ int qemu_target_page_bits_min(void)
 return TARGET_PAGE_BITS_MIN;
 }
 
+/* Convert target pages to MiB (2**20). */
+size_t qemu_target_pages_to_MiB(size_t pages)
+{
+int page_bits = TARGET_PAGE_BITS;
+
+/* So far, the largest (non-huge) page size is 64k, i.e. 16 bits. */
+g_assert(page_bits < 20);
+
+return pages >> (20 - page_bits);
+}
+
 bool cpu_physical_memory_is_io(hwaddr phys_addr)
 {
  MemoryRegion *mr;
-- 
2.40.1
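
A quick standalone check of the arithmetic: the function body mirrors the
patch, while the 4 KiB page size and the harness are illustrative
assumptions, not QEMU code.

#include <stddef.h>
#include <stdio.h>

/* Same shift as qemu_target_pages_to_MiB(), with the page size
 * passed in for illustration: MiB = 2**20 bytes. */
static size_t pages_to_mib(size_t pages, int page_bits)
{
    return pages >> (20 - page_bits);
}

int main(void)
{
    /* 8388608 pages of 4 KiB (12 bits) == 32768 MiB == 32 GiB,
     * matching the 32 GiB VM in the dirty-rate numbers above. */
    printf("%zu MiB\n", pages_to_mib(8388608, 12));
    return 0;
}
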




[PULL 00/11] Migration 20230515 patches

2023-05-15 Thread Juan Quintela
The following changes since commit 8844bb8d896595ee1d25d21c770e6e6f29803097:

  Merge tag 'or1k-pull-request-20230513' of https://github.com/stffrdhrn/qemu 
into staging (2023-05-13 11:23:14 +0100)

are available in the Git repository at:

  https://gitlab.com/juan.quintela/qemu.git tags/migration-20230515-pull-request

for you to fetch changes up to 6da835d42a2163b43578ae745bc613b06dd5d23c:

  qemu-file: Remove total from qemu_file_total_transferred_*() (2023-05-15 
13:46:14 +0200)


Migration Pull request 20230515

Hi

On this PULL:
- use xxHash for calculate dirty_rate (andrei)
- Create qemu_target_pages_to_MiB() and use them (quintela)
- make dirtyrate target independent (quintela)
- Merge 5 patches from atomic counters series (quintela)

Please apply.



Andrei Gudkov (1):
  migration/calc-dirty-rate: replaced CRC32 with xxHash

Juan Quintela (10):
  softmmu: Create qemu_target_pages_to_MiB()
  Use new created qemu_target_pages_to_MiB()
  migration: Teach dirtyrate about qemu_target_page_size()
  migration: Teach dirtyrate about qemu_target_page_bits()
  migration: Make dirtyrate.c target independent
  migration: A rate limit value of 0 is valid
  migration: We set the rate_limit by a second
  qemu-file: make qemu_file_[sg]et_rate_limit() use an uint64_t
  qemu-file: Make rate_limit_used an uint64_t
  qemu-file: Remove total from qemu_file_total_transferred_*()

 include/exec/target_page.h |  1 +
 migration/qemu-file.h  | 16 +-
 migration/block.c  |  4 +--
 migration/dirtyrate.c  | 64 +++---
 migration/migration.c  | 14 +++--
 migration/options.c|  4 +--
 migration/qemu-file.c  | 20 +++-
 migration/savevm.c |  6 ++--
 migration/vmstate.c|  5 ++-
 softmmu/dirtylimit.c   | 11 ++-
 softmmu/physmem.c  | 11 +++
 migration/meson.build  |  4 +--
 migration/trace-events |  4 +--
 13 files changed, 97 insertions(+), 67 deletions(-)

-- 
2.40.1




Re: [PATCH 09/21] qemu-file: Account for rate_limit usage on qemu_fflush()

2023-05-15 Thread Cédric Le Goater

On 5/8/23 15:08, Juan Quintela wrote:

That is the moment we know we have transferred something.

Signed-off-by: Juan Quintela 


Reviewed-by: Cédric Le Goater 

Thanks,

C.


---
  migration/qemu-file.c | 7 +++
  1 file changed, 3 insertions(+), 4 deletions(-)

diff --git a/migration/qemu-file.c b/migration/qemu-file.c
index 6ebc2bd3ec..8de1ecd082 100644
--- a/migration/qemu-file.c
+++ b/migration/qemu-file.c
@@ -301,7 +301,9 @@ void qemu_fflush(QEMUFile *f)
 &local_error) < 0) {
  qemu_file_set_error_obj(f, -EIO, local_error);
  } else {
-f->total_transferred += iov_size(f->iov, f->iovcnt);
+uint64_t size = iov_size(f->iov, f->iovcnt);
+qemu_file_acct_rate_limit(f, size);
+f->total_transferred += size;
  }
  
  qemu_iovec_release_ram(f);

@@ -518,7 +520,6 @@ void qemu_put_buffer_async(QEMUFile *f, const uint8_t *buf, size_t size,
  return;
  }
  
-f->rate_limit_used += size;

  add_to_iovec(f, buf, size, may_free);
  }
  
@@ -536,7 +537,6 @@ void qemu_put_buffer(QEMUFile *f, const uint8_t *buf, size_t size)

  l = size;
  }
  memcpy(f->buf + f->buf_index, buf, l);
-f->rate_limit_used += l;
  add_buf_to_iovec(f, l);
  if (qemu_file_get_error(f)) {
  break;
@@ -553,7 +553,6 @@ void qemu_put_byte(QEMUFile *f, int v)
  }
  
  f->buf[f->buf_index] = v;

-f->rate_limit_used++;
  add_buf_to_iovec(f, 1);
  }
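
The shape of the change, reduced to a standalone toy (the struct and names
are illustrative, not the QEMU API): queueing bytes no longer touches the
counter; the flush charges whatever was actually handed to the wire.

#include <stdint.h>
#include <stdio.h>
#include <string.h>

struct writer {
    uint8_t buf[256];
    size_t len;                 /* bytes queued, not yet flushed */
    uint64_t rate_limit_used;   /* bytes charged against the limit */
};

static void put_bytes(struct writer *w, const void *p, size_t n)
{
    memcpy(w->buf + w->len, p, n);  /* queue only, no accounting */
    w->len += n;
}

static void wflush(struct writer *w)
{
    /* pretend the write out succeeded, as in the else branch above */
    w->rate_limit_used += w->len;   /* charge what was transferred */
    w->len = 0;
}

int main(void)
{
    struct writer w = {0};
    put_bytes(&w, "abc", 3);
    put_bytes(&w, "defgh", 5);
    printf("before flush: %llu\n", (unsigned long long)w.rate_limit_used);
    wflush(&w);
    printf("after flush:  %llu\n", (unsigned long long)w.rate_limit_used);
    return 0;
}

One visible consequence: bytes sitting in the iovec are not counted until
the next flush, so the accounting reflects actual transfers rather than
queued data.
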
  





Re: [PATCH 08/21] migration: Move setup_time to mig_stats

2023-05-15 Thread Juan Quintela
Cédric Le Goater  wrote:
> On 5/8/23 15:08, Juan Quintela wrote:
>> It is a time that needs to be cleared each time we cancel migration.
>> While there, create calculate_time_since() to calculate how much time
>> has passed since a point in the past.
>> Signed-off-by: Juan Quintela 
>> ---
>>   migration/migration-stats.c |  7 +++
>>   migration/migration-stats.h | 14 ++
>>   migration/migration.c   |  9 -
>>   migration/migration.h   |  1 -
>>   4 files changed, 25 insertions(+), 6 deletions(-)
>> diff --git a/migration/migration-stats.c
>> b/migration/migration-stats.c
>> index 2f2cea965c..5278c6c821 100644
>> --- a/migration/migration-stats.c
>> +++ b/migration/migration-stats.c
>> @@ -12,6 +12,13 @@
>> #include "qemu/osdep.h"
>>   #include "qemu/stats64.h"
>> +#include "qemu/timer.h"
>>   #include "migration-stats.h"
>> MigrationAtomicStats mig_stats;
>> +
>> +void calculate_time_since(Stat64 *val, int64_t since)
>> +{
>> +int64_t now = qemu_clock_get_ms(QEMU_CLOCK_HOST);
>> +stat64_set(val, now - since);
>> +}
>> diff --git a/migration/migration-stats.h b/migration/migration-stats.h
>> index cf8a4f0410..73c73d75b9 100644
>> --- a/migration/migration-stats.h
>> +++ b/migration/migration-stats.h
>> @@ -69,6 +69,10 @@ typedef struct {
>>* Number of bytes sent during precopy stage.
>>*/
>>   Stat64 precopy_bytes;
>> +/*
>> + * How long has the setup stage took.
>> + */
>> +Stat64 setup_time;
>>   /*
>>* Total number of bytes transferred.
>>*/
>> @@ -81,4 +85,14 @@ typedef struct {
>> extern MigrationAtomicStats mig_stats;
>>   +/**
>> + * calculate_time_since: Calculate how much time has passed
>> + *
>> + * @val: stat64 where to store the time
>> + * @since: reference time since we want to calculate
>> + *
>> + * Returns: Nothing.  The time is stored in val.
>> + */
>> +
>> +void calculate_time_since(Stat64 *val, int64_t since);
>
> Since this routine is in the "migration" namespace, I would rename it to
>
>   void migration_time_since(Stat64 *val, int64_t since);
>
> of even
>
>   void migration_time_since(MigrationAtomicStats *stat, int64_t since);
>
> Do you need it elsewhere than in migration.c ?

Not yet.

I can change to this and later change if needed.

Thanks, Juan.
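
For reference, the helper under discussion reduces to this standalone
sketch, with a plain int64_t standing in for Stat64 and POSIX
clock_gettime() standing in for qemu_clock_get_ms(); whichever of the
proposed names wins, only the function name changes.

#include <stdint.h>
#include <stdio.h>
#include <time.h>

static int64_t clock_ms(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return ts.tv_sec * 1000LL + ts.tv_nsec / 1000000LL;
}

/* Standalone analogue of calculate_time_since()/migration_time_since():
 * store "elapsed since <since>" into the counter. */
static void time_since(int64_t *val, int64_t since)
{
    *val = clock_ms() - since;
}

int main(void)
{
    int64_t setup_time = 0;
    int64_t start = clock_ms();
    /* ... setup work would run here ... */
    time_since(&setup_time, start);
    printf("setup took %lld ms\n", (long long)setup_time);
    return 0;
}
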




Re: [PATCH 03/21] migration: We set the rate_limit by a second

2023-05-15 Thread Juan Quintela
Cédric Le Goater  wrote:
> On 5/8/23 15:08, Juan Quintela wrote:
>> That the implementation does the check every 100 milliseconds is an
>> implementation detail that shouldn't be seen on the interfaz.
>
> Yes. But "interface" is better here.

Many thanks.

>
>> Notice that all callers of qemu_file_set_rate_limit() used the
>> division or pass 0, so this change is a NOP.
>> Signed-off-by: Juan Quintela 
>
>
> Reviewed-by: Cédric Le Goater 

Thanks again.




Re: [PATCH 02/21] migration: Don't use INT64_MAX for unlimited rate

2023-05-15 Thread Juan Quintela
Cédric Le Goater  wrote:
> On 5/9/23 13:51, Juan Quintela wrote:
>> Harsh Prateek Bora  wrote:
>>> On 5/8/23 18:38, Juan Quintela wrote:
 Use 0 instead.
 Signed-off-by: Juan Quintela 
 ---
migration/migration.c | 4 ++--
migration/qemu-file.c | 3 +++
2 files changed, 5 insertions(+), 2 deletions(-)
 diff --git a/migration/migration.c b/migration/migration.c
 index 1192f1ebf1..3979a98949 100644
 --- a/migration/migration.c
 +++ b/migration/migration.c
 @@ -2296,7 +2296,7 @@ static void migration_completion(MigrationState *s)
}
if (ret >= 0) {
s->block_inactive = !migrate_colo();
 -qemu_file_set_rate_limit(s->to_dst_file, INT64_MAX);
 +qemu_file_set_rate_limit(s->to_dst_file, 0);
>>>
>>> #define RATE_LIMIT_MAX 0
>>>
>>> How about having a macro and use that which conveys the meaning in all
>>> call instances wherever it is getting passed ?
>> I almost prefer the macro.
>>qemu_file_set_rate_limit(s->to_dst_file, RATE_LIMIT_MAX);
>> seems quite explanatory?
>
> yep. and I would drop the comment qemu_file_rate_limit().

I dropped it once by mistake.
And the reviewer didn't notice either.

So 




Re: [PATCH 08/21] migration: Move setup_time to mig_stats

2023-05-15 Thread Cédric Le Goater

On 5/8/23 15:08, Juan Quintela wrote:

It is a time that needs to be cleared each time we cancel migration.
While there, create calculate_time_since() to calculate how much time
has passed since a point in the past.

Signed-off-by: Juan Quintela 
---
  migration/migration-stats.c |  7 +++
  migration/migration-stats.h | 14 ++
  migration/migration.c   |  9 -
  migration/migration.h   |  1 -
  4 files changed, 25 insertions(+), 6 deletions(-)

diff --git a/migration/migration-stats.c b/migration/migration-stats.c
index 2f2cea965c..5278c6c821 100644
--- a/migration/migration-stats.c
+++ b/migration/migration-stats.c
@@ -12,6 +12,13 @@
  
  #include "qemu/osdep.h"

  #include "qemu/stats64.h"
+#include "qemu/timer.h"
  #include "migration-stats.h"
  
  MigrationAtomicStats mig_stats;

+
+void calculate_time_since(Stat64 *val, int64_t since)
+{
+int64_t now = qemu_clock_get_ms(QEMU_CLOCK_HOST);
+stat64_set(val, now - since);
+}
diff --git a/migration/migration-stats.h b/migration/migration-stats.h
index cf8a4f0410..73c73d75b9 100644
--- a/migration/migration-stats.h
+++ b/migration/migration-stats.h
@@ -69,6 +69,10 @@ typedef struct {
   * Number of bytes sent during precopy stage.
   */
  Stat64 precopy_bytes;
+/*
+ * How long has the setup stage took.
+ */
+Stat64 setup_time;
  /*
   * Total number of bytes transferred.
   */
@@ -81,4 +85,14 @@ typedef struct {
  
  extern MigrationAtomicStats mig_stats;
  
+/**

+ * calculate_time_since: Calculate how much time has passed
+ *
+ * @val: stat64 where to store the time
+ * @since: reference time since we want to calculate
+ *
+ * Returns: Nothing.  The time is stored in val.
+ */
+
+void calculate_time_since(Stat64 *val, int64_t since);


Since this routine is in the "migration" namespace, I would rename it to

  void migration_time_since(Stat64 *val, int64_t since);

of even

  void migration_time_since(MigrationAtomicStats *stat, int64_t since);

Do you need it elsewhere than in migration.c ?

Thanks,

C.


  #endif
diff --git a/migration/migration.c b/migration/migration.c
index b1cfb56523..72286de969 100644
--- a/migration/migration.c
+++ b/migration/migration.c
@@ -884,7 +884,7 @@ static void populate_time_info(MigrationInfo *info, MigrationState *s)
  {
  info->has_status = true;
  info->has_setup_time = true;
-info->setup_time = s->setup_time;
+info->setup_time = stat64_get(&mig_stats.setup_time);
  
  if (s->state == MIGRATION_STATUS_COMPLETED) {

  info->has_total_time = true;
@@ -1387,7 +1387,6 @@ void migrate_init(MigrationState *s)
  s->pages_per_second = 0.0;
  s->downtime = 0;
  s->expected_downtime = 0;
-s->setup_time = 0;
  s->start_postcopy = false;
  s->postcopy_after_devices = false;
  s->migration_thread_running = false;
@@ -2640,7 +2639,7 @@ static void migration_calculate_complete(MigrationState *s)
  s->downtime = end_time - s->downtime_start;
  }
  
-transfer_time = s->total_time - s->setup_time;

+transfer_time = s->total_time - stat64_get(&mig_stats.setup_time);
  if (transfer_time) {
  s->mbps = ((double) bytes * 8.0) / transfer_time / 1000;
  }
@@ -2965,7 +2964,7 @@ static void *migration_thread(void *opaque)
  qemu_savevm_wait_unplug(s, MIGRATION_STATUS_SETUP,
 MIGRATION_STATUS_ACTIVE);
  
-s->setup_time = qemu_clock_get_ms(QEMU_CLOCK_HOST) - setup_start;

+calculate_time_since(&mig_stats.setup_time, setup_start);
  
  trace_migration_thread_setup_complete();
  
@@ -3077,7 +3076,7 @@ static void *bg_migration_thread(void *opaque)

  qemu_savevm_wait_unplug(s, MIGRATION_STATUS_SETUP,
 MIGRATION_STATUS_ACTIVE);
  
-s->setup_time = qemu_clock_get_ms(QEMU_CLOCK_HOST) - setup_start;

+calculate_time_since(&mig_stats.setup_time, setup_start);
  
  trace_migration_thread_setup_complete();

  s->downtime_start = qemu_clock_get_ms(QEMU_CLOCK_REALTIME);
diff --git a/migration/migration.h b/migration/migration.h
index 3a918514e7..7f554455ac 100644
--- a/migration/migration.h
+++ b/migration/migration.h
@@ -311,7 +311,6 @@ struct MigrationState {
  int64_t downtime;
  int64_t expected_downtime;
  bool capabilities[MIGRATION_CAPABILITY__MAX];
-int64_t setup_time;
  /*
   * Whether guest was running when we enter the completion stage.
   * If migration is interrupted by any reason, we need to continue





Re: [PATCH v3 1/1] block/blkio: use qemu_open() to support fd passing for virtio-blk

2023-05-15 Thread Stefano Garzarella

On Thu, May 11, 2023 at 11:03:22AM -0500, Jonathon Jongsma wrote:

On 5/11/23 4:15 AM, Stefano Garzarella wrote:

The virtio-blk-vhost-vdpa driver in libblkio 1.3.0 supports the new
'fd' property. Let's expose this to the user, so the management layer
can pass the file descriptor of an already opened vhost-vdpa character
device. This is useful especially when the device can only be accessed
with certain privileges.

If the libblkio virtio-blk driver supports fd passing, let's always
use qemu_open() to open the `path`, so we can handle fd passing
from the management layer through the "/dev/fdset/N" special path.

Signed-off-by: Stefano Garzarella 
---

Notes:
v3:
- use qemu_open() on `path` to simplify libvirt code [Jonathon]



Thanks

The one drawback now is that it doesn't seem possible for libvirt to 
introspect whether qemu supports passing an fd to the driver.


Yep, this was because the libblkio library did not support this new way.

When I was writing my initial patch (before I realized that it was 
missing fd-passing), I just checked for the existence of the 
virtio-blk-vhost-vdpa device. But we actually need to know both that 
the device exists and that it supports fd passing.


Yep, this was one of the advantages of using the new `fd` parameter.
Can't libvirt handle the later failure?

As far as I can tell, versions 7.2.0 and 8.0.0 include this device but 
won't accept fds.


Right.

How do you suggest to proceed?

Thanks,
Stefano




Re: [PATCH 06/21] qemu-file: Remove total from qemu_file_total_transferred_*()

2023-05-15 Thread Cédric Le Goater

On 5/8/23 15:08, Juan Quintela wrote:

The function name is already quite long.

Signed-off-by: Juan Quintela 




Reviewed-by: Cédric Le Goater 

C.


---
  migration/block.c |  4 ++--
  migration/migration.c |  2 +-
  migration/qemu-file.c |  4 ++--
  migration/qemu-file.h | 10 +-
  migration/savevm.c|  6 +++---
  migration/vmstate.c   |  5 ++---
  6 files changed, 15 insertions(+), 16 deletions(-)

diff --git a/migration/block.c b/migration/block.c
index a37678ce95..12617b4152 100644
--- a/migration/block.c
+++ b/migration/block.c
@@ -747,7 +747,7 @@ static int block_save_setup(QEMUFile *f, void *opaque)
  static int block_save_iterate(QEMUFile *f, void *opaque)
  {
  int ret;
-uint64_t last_bytes = qemu_file_total_transferred(f);
+uint64_t last_bytes = qemu_file_transferred(f);
  
  trace_migration_block_save("iterate", block_mig_state.submitted,

 block_mig_state.transferred);
@@ -799,7 +799,7 @@ static int block_save_iterate(QEMUFile *f, void *opaque)
  }
  
  qemu_put_be64(f, BLK_MIG_FLAG_EOS);

-uint64_t delta_bytes = qemu_file_total_transferred(f) - last_bytes;
+uint64_t delta_bytes = qemu_file_transferred(f) - last_bytes;
  return (delta_bytes > 0);
  }
  
diff --git a/migration/migration.c b/migration/migration.c

index e17a6538b4..b1cfb56523 100644
--- a/migration/migration.c
+++ b/migration/migration.c
@@ -2621,7 +2621,7 @@ static MigThrError migration_detect_error(MigrationState *s)
  /* How many bytes have we transferred since the beginning of the migration */
  static uint64_t migration_total_bytes(MigrationState *s)
  {
-return qemu_file_total_transferred(s->to_dst_file) +
+return qemu_file_transferred(s->to_dst_file) +
   stat64_get(&mig_stats.multifd_bytes);
  }
  
diff --git a/migration/qemu-file.c b/migration/qemu-file.c

index f3cb0cd94f..6ebc2bd3ec 100644
--- a/migration/qemu-file.c
+++ b/migration/qemu-file.c
@@ -709,7 +709,7 @@ int coroutine_mixed_fn qemu_get_byte(QEMUFile *f)
  return result;
  }
  
-uint64_t qemu_file_total_transferred_fast(QEMUFile *f)

+uint64_t qemu_file_transferred_fast(QEMUFile *f)
  {
  uint64_t ret = f->total_transferred;
  int i;
@@ -721,7 +721,7 @@ uint64_t qemu_file_total_transferred_fast(QEMUFile *f)
  return ret;
  }
  
-uint64_t qemu_file_total_transferred(QEMUFile *f)

+uint64_t qemu_file_transferred(QEMUFile *f)
  {
  qemu_fflush(f);
  return f->total_transferred;
diff --git a/migration/qemu-file.h b/migration/qemu-file.h
index d758e7f10b..ab164a58d0 100644
--- a/migration/qemu-file.h
+++ b/migration/qemu-file.h
@@ -68,7 +68,7 @@ void qemu_file_set_hooks(QEMUFile *f, const QEMUFileHooks *hooks);
  int qemu_fclose(QEMUFile *f);
  
  /*

- * qemu_file_total_transferred:
+ * qemu_file_transferred:
   *
   * Report the total number of bytes transferred with
   * this file.
@@ -83,19 +83,19 @@ int qemu_fclose(QEMUFile *f);
   *
   * Returns: the total bytes transferred
   */
-uint64_t qemu_file_total_transferred(QEMUFile *f);
+uint64_t qemu_file_transferred(QEMUFile *f);
  
  /*

- * qemu_file_total_transferred_fast:
+ * qemu_file_transferred_fast:
   *
- * As qemu_file_total_transferred except for writable
+ * As qemu_file_transferred except for writable
   * files, where no flush is performed and the reported
   * amount will include the size of any queued buffers,
   * on top of the amount actually transferred.
   *
   * Returns: the total bytes transferred and queued
   */
-uint64_t qemu_file_total_transferred_fast(QEMUFile *f);
+uint64_t qemu_file_transferred_fast(QEMUFile *f);
  
  /*

   * put_buffer without copying the buffer.
diff --git a/migration/savevm.c b/migration/savevm.c
index 032044b1d5..e33788343a 100644
--- a/migration/savevm.c
+++ b/migration/savevm.c
@@ -927,9 +927,9 @@ static int vmstate_load(QEMUFile *f, SaveStateEntry *se)
  static void vmstate_save_old_style(QEMUFile *f, SaveStateEntry *se,
 JSONWriter *vmdesc)
  {
-uint64_t old_offset = qemu_file_total_transferred_fast(f);
+uint64_t old_offset = qemu_file_transferred_fast(f);
  se->ops->save_state(f, se->opaque);
-uint64_t size = qemu_file_total_transferred_fast(f) - old_offset;
+uint64_t size = qemu_file_transferred_fast(f) - old_offset;
  
  if (vmdesc) {

  json_writer_int64(vmdesc, "size", size);
@@ -2956,7 +2956,7 @@ bool save_snapshot(const char *name, bool overwrite, const char *vmstate,
  goto the_end;
  }
  ret = qemu_savevm_state(f, errp);
-vm_state_size = qemu_file_total_transferred(f);
+vm_state_size = qemu_file_transferred(f);
  ret2 = qemu_fclose(f);
  if (ret < 0) {
  goto the_end;
diff --git a/migration/vmstate.c b/migration/vmstate.c
index 351f56104e..af01d54b6f 100644
--- a/migration/vmstate.c
+++ b/migration/vmstate.c
@@ -361,7 +361,7 @@ int vmstate_save_state_v(QEMUFile *f, const 
VMStateDescription *vmsd,
  void *curr_elem 
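
The behavioral difference the header comment above describes reduces to a
standalone toy (fields and numbers are illustrative, not QEMU code): the
fast variant reports wire bytes plus still-queued bytes without flushing;
the plain variant flushes first.

#include <stdint.h>
#include <stdio.h>

struct file {
    uint64_t total_transferred; /* bytes already written out */
    uint64_t queued;            /* bytes still sitting in the iovec */
};

static uint64_t transferred_fast(struct file *f)
{
    return f->total_transferred + f->queued;  /* no flush */
}

static uint64_t transferred(struct file *f)
{
    f->total_transferred += f->queued;        /* flush first */
    f->queued = 0;
    return f->total_transferred;
}

int main(void)
{
    struct file f = { .total_transferred = 1000, .queued = 24 };
    printf("fast:    %llu\n", (unsigned long long)transferred_fast(&f));
    printf("flushed: %llu\n", (unsigned long long)transferred(&f));
    return 0;
}
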

Re: [PATCH 10/21] migration: Move rate_limit_max and rate_limit_used to migration_stats

2023-05-15 Thread Harsh Prateek Bora




On 5/9/23 16:40, Juan Quintela wrote:

Harsh Prateek Bora  wrote:

On 5/8/23 18:38, Juan Quintela wrote:

This way we can make them atomic and use this functions from any


s/this/these



Fixed.


Sure, providing ack from ppc/spapr perspective.

Reviewed-by: Harsh Prateek Bora 


Thanks.






Re: [PATCH 05/21] qemu-file: Make rate_limit_used an uint64_t

2023-05-15 Thread Cédric Le Goater

On 5/8/23 15:08, Juan Quintela wrote:

Change all the functions that use it.  It was already passed as
uint64_t.

Signed-off-by: Juan Quintela 
Reviewed-by: Daniel P. Berrangé 
Message-Id: <20230504113841.23130-5-quint...@redhat.com>



Reviewed-by: Cédric Le Goater 

C.


---
  migration/qemu-file.c | 4 ++--
  migration/qemu-file.h | 2 +-
  2 files changed, 3 insertions(+), 3 deletions(-)

diff --git a/migration/qemu-file.c b/migration/qemu-file.c
index 346b683929..f3cb0cd94f 100644
--- a/migration/qemu-file.c
+++ b/migration/qemu-file.c
@@ -49,7 +49,7 @@ struct QEMUFile {
   * Total amount of data in bytes queued for transfer
   * during this rate limiting time window
   */
-int64_t rate_limit_used;
+uint64_t rate_limit_used;
  
  /* The sum of bytes transferred on the wire */

  uint64_t total_transferred;
@@ -759,7 +759,7 @@ void qemu_file_reset_rate_limit(QEMUFile *f)
  f->rate_limit_used = 0;
  }
  
-void qemu_file_acct_rate_limit(QEMUFile *f, int64_t len)

+void qemu_file_acct_rate_limit(QEMUFile *f, uint64_t len)
  {
  f->rate_limit_used += len;
  }
diff --git a/migration/qemu-file.h b/migration/qemu-file.h
index 04ca48cbef..d758e7f10b 100644
--- a/migration/qemu-file.h
+++ b/migration/qemu-file.h
@@ -137,7 +137,7 @@ void qemu_file_reset_rate_limit(QEMUFile *f);
   * out of band from the main file object I/O methods, and
   * need to be applied to the rate limiting calcuations
   */
-void qemu_file_acct_rate_limit(QEMUFile *f, int64_t len);
+void qemu_file_acct_rate_limit(QEMUFile *f, uint64_t len);
  void qemu_file_set_rate_limit(QEMUFile *f, uint64_t new_rate);
  uint64_t qemu_file_get_rate_limit(QEMUFile *f);
  int qemu_file_get_error_obj(QEMUFile *f, Error **errp);





Re: [PATCH 04/21] qemu-file: make qemu_file_[sg]et_rate_limit() use an uint64_t

2023-05-15 Thread Cédric Le Goater

On 5/8/23 15:08, Juan Quintela wrote:

It is really size_t.  Everything else uses uint64_t, so move this to
uint64_t as well.  A size can't be negative anyways.

Signed-off-by: Juan Quintela 
Message-Id: <20230504113841.23130-4-quint...@redhat.com>




Reviewed-by: Cédric Le Goater 

C.


---

Don't drop the check if rate_limit_max is zero.
---
  migration/qemu-file.c | 6 +++---
  migration/qemu-file.h | 4 ++--
  2 files changed, 5 insertions(+), 5 deletions(-)

diff --git a/migration/qemu-file.c b/migration/qemu-file.c
index 12cf7fb04e..346b683929 100644
--- a/migration/qemu-file.c
+++ b/migration/qemu-file.c
@@ -44,7 +44,7 @@ struct QEMUFile {
   * Maximum amount of data in bytes to transfer during one
   * rate limiting time window
   */
-int64_t rate_limit_max;
+uint64_t rate_limit_max;
  /*
   * Total amount of data in bytes queued for transfer
   * during this rate limiting time window
@@ -741,12 +741,12 @@ int qemu_file_rate_limit(QEMUFile *f)
  return 0;
  }
  
-int64_t qemu_file_get_rate_limit(QEMUFile *f)

+uint64_t qemu_file_get_rate_limit(QEMUFile *f)
  {
  return f->rate_limit_max;
  }
  
-void qemu_file_set_rate_limit(QEMUFile *f, int64_t limit)

+void qemu_file_set_rate_limit(QEMUFile *f, uint64_t limit)
  {
  /*
   * 'limit' is per second.  But we check it each 100 miliseconds.
diff --git a/migration/qemu-file.h b/migration/qemu-file.h
index 4f26bf6961..04ca48cbef 100644
--- a/migration/qemu-file.h
+++ b/migration/qemu-file.h
@@ -138,8 +138,8 @@ void qemu_file_reset_rate_limit(QEMUFile *f);
   * need to be applied to the rate limiting calcuations
   */
  void qemu_file_acct_rate_limit(QEMUFile *f, int64_t len);
-void qemu_file_set_rate_limit(QEMUFile *f, int64_t new_rate);
-int64_t qemu_file_get_rate_limit(QEMUFile *f);
+void qemu_file_set_rate_limit(QEMUFile *f, uint64_t new_rate);
+uint64_t qemu_file_get_rate_limit(QEMUFile *f);
  int qemu_file_get_error_obj(QEMUFile *f, Error **errp);
  int qemu_file_get_error_obj_any(QEMUFile *f1, QEMUFile *f2, Error **errp);
  void qemu_file_set_error_obj(QEMUFile *f, int ret, Error *err);





Re: [PATCH 03/21] migration: We set the rate_limit by a second

2023-05-15 Thread Cédric Le Goater

On 5/8/23 15:08, Juan Quintela wrote:

That the implementation does the check every 100 milliseconds is an
implementation detail that shouldn't be seen on the interfaz.


Yes. But "interface" is better here.


Notice that all callers of qemu_file_set_rate_limit() used the
division or pass 0, so this change is a NOP.

Signed-off-by: Juan Quintela 



Reviewed-by: Cédric Le Goater 

C.



---
  migration/migration.c | 7 +++
  migration/options.c   | 4 ++--
  migration/qemu-file.c | 6 +-
  3 files changed, 10 insertions(+), 7 deletions(-)

diff --git a/migration/migration.c b/migration/migration.c
index 3979a98949..e17a6538b4 100644
--- a/migration/migration.c
+++ b/migration/migration.c
@@ -2117,7 +2117,7 @@ static int postcopy_start(MigrationState *ms)
   * will notice we're in POSTCOPY_ACTIVE and not actually
   * wrap their state up here
   */
-qemu_file_set_rate_limit(ms->to_dst_file, bandwidth / XFER_LIMIT_RATIO);
+qemu_file_set_rate_limit(ms->to_dst_file, bandwidth);
  if (migrate_postcopy_ram()) {
  /* Ping just for debugging, helps line traces up */
  qemu_savevm_send_ping(ms->to_dst_file, 2);
@@ -3207,11 +3207,10 @@ void migrate_fd_connect(MigrationState *s, Error 
*error_in)
  
  if (resume) {

  /* This is a resumed migration */
-rate_limit = migrate_max_postcopy_bandwidth() /
-XFER_LIMIT_RATIO;
+rate_limit = migrate_max_postcopy_bandwidth();
  } else {
  /* This is a fresh new migration */
-rate_limit = migrate_max_bandwidth() / XFER_LIMIT_RATIO;
+rate_limit = migrate_max_bandwidth();
  
  /* Notify before starting migration thread */

   notifier_list_notify(&migration_state_notifiers, s);
diff --git a/migration/options.c b/migration/options.c
index 2e759cc306..d04b5fbc3a 100644
--- a/migration/options.c
+++ b/migration/options.c
@@ -1243,7 +1243,7 @@ static void migrate_params_apply(MigrateSetParameters *params, Error **errp)
  s->parameters.max_bandwidth = params->max_bandwidth;
  if (s->to_dst_file && !migration_in_postcopy()) {
  qemu_file_set_rate_limit(s->to_dst_file,
-s->parameters.max_bandwidth / XFER_LIMIT_RATIO);
+s->parameters.max_bandwidth);
  }
  }
  
@@ -1275,7 +1275,7 @@ static void migrate_params_apply(MigrateSetParameters *params, Error **errp)

  s->parameters.max_postcopy_bandwidth = params->max_postcopy_bandwidth;
  if (s->to_dst_file && migration_in_postcopy()) {
  qemu_file_set_rate_limit(s->to_dst_file,
-s->parameters.max_postcopy_bandwidth / XFER_LIMIT_RATIO);
+s->parameters.max_postcopy_bandwidth);
  }
  }
  if (params->has_max_cpu_throttle) {
diff --git a/migration/qemu-file.c b/migration/qemu-file.c
index 745361d238..12cf7fb04e 100644
--- a/migration/qemu-file.c
+++ b/migration/qemu-file.c
@@ -29,6 +29,7 @@
  #include "migration.h"
  #include "qemu-file.h"
  #include "trace.h"
+#include "options.h"
  #include "qapi/error.h"
  
  #define IO_BUF_SIZE 32768

@@ -747,7 +748,10 @@ int64_t qemu_file_get_rate_limit(QEMUFile *f)
  
  void qemu_file_set_rate_limit(QEMUFile *f, int64_t limit)

  {
-f->rate_limit_max = limit;
+/*
+ * 'limit' is per second.  But we check it each 100 miliseconds.
+ */
+f->rate_limit_max = limit / XFER_LIMIT_RATIO;
  }
  
  void qemu_file_reset_rate_limit(QEMUFile *f)
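
A worked example of the new call convention, assuming XFER_LIMIT_RATIO is
10 (one second split into the 100 ms windows the comment mentions); the
numbers and harness are illustrative:

#include <stdint.h>
#include <stdio.h>

#define XFER_LIMIT_RATIO 10   /* 1 s / 100 ms check window (assumed) */

int main(void)
{
    /* Callers now pass bytes per second; the setter derives the
     * per-window budget internally, as in the patch. */
    uint64_t limit_per_second = 128 * 1024 * 1024;  /* 128 MiB/s */
    uint64_t per_window = limit_per_second / XFER_LIMIT_RATIO;

    printf("budget per 100 ms window: %llu bytes\n",
           (unsigned long long)per_window);
    return 0;
}
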





Re: [PATCH 02/21] migration: Don't use INT64_MAX for unlimited rate

2023-05-15 Thread Cédric Le Goater

On 5/9/23 13:51, Juan Quintela wrote:

Harsh Prateek Bora  wrote:

On 5/8/23 18:38, Juan Quintela wrote:

Use 0 instead.
Signed-off-by: Juan Quintela 
---
   migration/migration.c | 4 ++--
   migration/qemu-file.c | 3 +++
   2 files changed, 5 insertions(+), 2 deletions(-)
diff --git a/migration/migration.c b/migration/migration.c
index 1192f1ebf1..3979a98949 100644
--- a/migration/migration.c
+++ b/migration/migration.c
@@ -2296,7 +2296,7 @@ static void migration_completion(MigrationState *s)
   }
   if (ret >= 0) {
   s->block_inactive = !migrate_colo();
-qemu_file_set_rate_limit(s->to_dst_file, INT64_MAX);
+qemu_file_set_rate_limit(s->to_dst_file, 0);


#define RATE_LIMIT_MAX 0

How about having a macro and use that which conveys the meaning in all
call instances wherever it is getting passed ?


I almost prefer the macro.

   qemu_file_set_rate_limit(s->to_dst_file, RATE_LIMIT_MAX);

seems quite explanatory?


yep. and I would drop the comment qemu_file_rate_limit().

Thanks,

C.




Thanks, Juan.




   ret = qemu_savevm_state_complete_precopy(s->to_dst_file, false,
s->block_inactive);
   }
@@ -3044,7 +3044,7 @@ static void *bg_migration_thread(void *opaque)
   rcu_register_thread();
   object_ref(OBJECT(s));
   -qemu_file_set_rate_limit(s->to_dst_file, INT64_MAX);
+qemu_file_set_rate_limit(s->to_dst_file, 0);
 setup_start = qemu_clock_get_ms(QEMU_CLOCK_HOST);
   /*
diff --git a/migration/qemu-file.c b/migration/qemu-file.c
index f4cfd05c67..745361d238 100644
--- a/migration/qemu-file.c
+++ b/migration/qemu-file.c
@@ -731,6 +731,9 @@ int qemu_file_rate_limit(QEMUFile *f)
   if (qemu_file_get_error(f)) {
   return 1;
   }
+/*
+ *  rate_limit_max == 0 means no rate_limit enfoncement.
+ */
   if (f->rate_limit_max > 0 && f->rate_limit_used > f->rate_limit_max) {
   return 1;
   }
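
The convention being settled on here reduces to a standalone toy
(RATE_LIMIT_MAX as in Harsh's suggestion; illustrative, not the committed
code): zero disables rate limiting, anything else is a per-window byte
budget.

#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

#define RATE_LIMIT_MAX 0   /* "no limit" sentinel */

static bool over_limit(uint64_t used, uint64_t max)
{
    if (max == RATE_LIMIT_MAX) {
        return false;      /* unlimited: never throttle */
    }
    return used > max;     /* throttle once the window budget is spent */
}

int main(void)
{
    printf("%d\n", over_limit(123456, RATE_LIMIT_MAX)); /* 0: unlimited */
    printf("%d\n", over_limit(2048, 1024));             /* 1: throttled */
    return 0;
}
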







Re: [PATCH 01/21] migration: A rate limit value of 0 is valid

2023-05-15 Thread Cédric Le Goater

On 5/8/23 15:08, Juan Quintela wrote:

And it is the best way to not have rate_limit.

Signed-off-by: Juan Quintela 


Reviewed-by: Cédric Le Goater 

C.


---
  migration/migration.c | 7 +--
  1 file changed, 1 insertion(+), 6 deletions(-)

diff --git a/migration/migration.c b/migration/migration.c
index 232e387109..1192f1ebf1 100644
--- a/migration/migration.c
+++ b/migration/migration.c
@@ -2117,12 +2117,7 @@ static int postcopy_start(MigrationState *ms)
   * will notice we're in POSTCOPY_ACTIVE and not actually
   * wrap their state up here
   */
-/* 0 max-postcopy-bandwidth means unlimited */
-if (!bandwidth) {
-qemu_file_set_rate_limit(ms->to_dst_file, INT64_MAX);
-} else {
-qemu_file_set_rate_limit(ms->to_dst_file, bandwidth / XFER_LIMIT_RATIO);
-}
+qemu_file_set_rate_limit(ms->to_dst_file, bandwidth / XFER_LIMIT_RATIO);
  if (migrate_postcopy_ram()) {
  /* Ping just for debugging, helps line traces up */
  qemu_savevm_send_ping(ms->to_dst_file, 2);