Re: [PATCH v2 2/5] Support format or cache specific out file

2022-12-13 Thread Hanna Reitz

On 13.12.22 16:56, Nir Soffer wrote:

On Mon, Dec 12, 2022 at 12:38 PM Hanna Reitz  wrote:

On 28.11.22 15:15, Nir Soffer wrote:

Extend the test finder to find tests with a format-specific (*.out.qcow2)
or cache-specific (*.out.nocache) out file. Previously this worked only
for the numbered tests.
---
   tests/qemu-iotests/findtests.py | 10 ++++++++--
   1 file changed, 8 insertions(+), 2 deletions(-)

This patch lacks an S-o-b, too.


diff --git a/tests/qemu-iotests/findtests.py b/tests/qemu-iotests/findtests.py
index dd77b453b8..f4344ce78c 100644
--- a/tests/qemu-iotests/findtests.py
+++ b/tests/qemu-iotests/findtests.py
@@ -38,31 +38,37 @@ def chdir(path: Optional[str] = None) -> Iterator[None]:
   os.chdir(saved_dir)


 class TestFinder:
     def __init__(self, test_dir: Optional[str] = None) -> None:
         self.groups = defaultdict(set)

         with chdir(test_dir):
             self.all_tests = glob.glob('[0-9][0-9][0-9]')
             self.all_tests += [f for f in glob.iglob('tests/*')
-                               if not f.endswith('.out') and
-                               os.path.isfile(f + '.out')]
+                               if self.is_test(f)]

So previously a file was only considered a test file if there was a
corresponding reference output file (`f + '.out'`), so files without
such a reference output aren’t considered test files...


         for t in self.all_tests:
             with open(t, encoding="utf-8") as f:
                 for line in f:
                     if line.startswith('# group: '):
                         for g in line.split()[2:]:
                             self.groups[g].add(t)
                         break

+    def is_test(self, fname: str) -> bool:
+        """
+        The tests directory contains tests (no extension) and out files
+        (*.out, *.out.{format}, *.out.{option}).
+        """
+        return re.search(r'.+\.out(\.\w+)?$', fname) is None

...but this new function doesn’t check that.  I think we should check it
(just whether there’s any variant of `/{fname}\.out(\.\w+)?/` to go with
`fname`) so that behavior isn’t changed.
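A minimal sketch of the suggested check (hypothetical illustration, not the actual patch; the module-level regex and the free-function form are made up here):

```python
import glob
import re

# Matches reference output files: <name>.out, <name>.out.qcow2, ...
OUT_RE = re.compile(r'.+\.out(\.\w+)?$')

def is_test(fname: str) -> bool:
    """A test file must not itself be an output file, and at least one
    matching reference output variant must exist next to it."""
    if OUT_RE.match(fname):
        return False  # fname is a reference output, not a test
    # Look for any fname.out / fname.out.<suffix> variant
    candidates = glob.glob(glob.escape(fname) + '.out*')
    return any(OUT_RE.match(c) for c in candidates)
```

This keeps the old behavior (files without any reference output are not considered tests) while still accepting format- and option-specific out files.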

This means that you cannot add a test without a *.out* file, which may
be useful when you don't use the out file for validation, but we can
add this later if needed.


I don’t think tests work without a reference output, do they?  At least 
a couple of years ago, the ./check script would refuse to run tests 
without a corresponding .out file.


Hanna




Re: [PATCH v2 5/5] qemu-img: Speed up checksum

2022-12-12 Thread Hanna Reitz

On 28.11.22 15:15, Nir Soffer wrote:

Add a coroutine-based loop inspired by the `qemu-img convert` design.

Changes compared to `qemu-img convert`:

- State for the entire image is kept in ImgChecksumState

- State for single worker coroutine is kept in ImgChecksumworker.

- "Writes" are always in-order, ensured using a queue.

- Calling block status once per image extent, when the current extent is
   consumed by the workers.

- Using a 1 MiB buffer size - testing shows that this gives the best read
   performance both with buffered and direct I/O.

- Number of coroutines is not configurable. Testing does not show
   improvement when using more than 8 coroutines.

- Progress includes the entire image, not only the allocated state.
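The in-order result handling described above can be sketched roughly like this (an illustrative Python model, not the actual C implementation; all names here are made up):

```python
import asyncio
import hashlib

async def ordered_checksum(blocks, max_inflight=8):
    """Workers may finish in any order, but digests are consumed strictly
    in submission order via a bounded queue of futures."""
    loop = asyncio.get_running_loop()
    queue = asyncio.Queue(max_inflight)  # also bounds in-flight work

    async def worker(fut, block):
        # A real worker would read the block from the image here
        fut.set_result(hashlib.sha256(block).digest())

    async def producer():
        for block in blocks:
            fut = loop.create_future()
            await queue.put(fut)                 # enqueued in order
            asyncio.ensure_future(worker(fut, block))
        await queue.put(None)                    # end-of-stream marker

    outer = hashlib.sha256()
    prod = asyncio.ensure_future(producer())
    while True:
        fut = await queue.get()
        if fut is None:
            break
        outer.update(await fut)                  # in-order "write"
    await prod
    return outer.hexdigest()
```

The queue of futures is the key trick: work completes concurrently, but the consumer only ever sees results in the order they were submitted.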

Compared to the simple read loop, this version is up to 4.67 times
faster when computing a checksum for an image full of zeroes. For real
images it is 1.59 times faster with direct I/O, and with buffered I/O
there is no difference.

Test results on Dell PowerEdge R640 in a CentOS Stream 9 container:

| image| size | i/o   | before | after  | change |
|--|--|---||||
| zero [1] |   6g | buffered  | 1.600s ±0.014s | 0.342s ±0.016s |  x4.67 |
| zero |   6g | direct| 4.684s ±0.093s | 2.211s ±0.009s |  x2.12 |
| real [2] |   6g | buffered  | 1.841s ±0.075s | 1.806s ±0.036s |  x1.02 |
| real |   6g | direct| 3.094s ±0.079s | 1.947s ±0.017s |  x1.59 |
| nbd  [3] |   6g | buffered  | 2.455s ±0.183s | 1.808s ±0.016s |  x1.36 |
| nbd  |   6g | direct| 3.540s ±0.020s | 1.749s ±0.018s |  x2.02 |

[1] raw image full of zeroes
[2] raw fedora 35 image with additional random data, 50% full
[3] image [2] exported by qemu-nbd via unix socket

Signed-off-by: Nir Soffer 
---
  qemu-img.c | 350 ++---
  1 file changed, 277 insertions(+), 73 deletions(-)


Reviewed-by: Hanna Reitz 




Re: [PATCH v2 3/5] qemu-img: Add checksum command

2022-12-12 Thread Hanna Reitz

On 28.11.22 15:15, Nir Soffer wrote:

The checksum command computes a checksum for disk image content using the
blkhash library[1]. The blkhash library is not packaged yet, but it is
available via copr[2].

Example run:

 $ ./qemu-img checksum -p fedora-35.qcow2
 6e5c00c995056319d52395f8d91c7f84725ae3da69ffcba4de4c7d22cff713a5  fedora-35.qcow2

The block checksum is constructed by splitting the image into fixed-size
blocks and computing a digest of every block. The image checksum is the
digest of all the block digests.

The checksum internally uses the "sha256" algorithm, but the result
cannot be compared with checksums created by other tools such as `sha256sum`.

The blkhash library supports sparse images, zero detection, and
optimizes zero block hashing (they are practically free). The library
uses multiple threads to speed up the computation.
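Conceptually, the construction described above can be illustrated like this (a simplified sketch only, not the actual blkhash algorithm, which additionally handles sparseness, zero detection, and threading):

```python
import hashlib

def naive_block_checksum(data: bytes, block_size: int = 64 * 1024) -> str:
    """Hash every fixed-size block, then hash the concatenated digests."""
    outer = hashlib.sha256()
    for off in range(0, len(data), block_size):
        block_digest = hashlib.sha256(data[off:off + block_size]).digest()
        outer.update(block_digest)
    return outer.hexdigest()
```

Because the outer digest is computed over block digests rather than the raw bytes, the result intentionally differs from a plain `sha256sum` of the same data.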

Compared to `sha256sum`, `qemu-img checksum` is 3.5-4800[3] times
faster, depending on the amount of data in the image:

 $ ./qemu-img info /scratch/50p.raw
 file format: raw
 virtual size: 6 GiB (6442450944 bytes)
 disk size: 2.91 GiB

 $ hyperfine -w2 -r5 -p "sleep 1" "./qemu-img checksum /scratch/50p.raw" \
  "sha256sum /scratch/50p.raw"
 Benchmark 1: ./qemu-img checksum /scratch/50p.raw
   Time (mean ± σ):      1.849 s ±  0.037 s    [User: 7.764 s, System: 0.962 s]
   Range (min … max):    1.813 s …  1.908 s    5 runs

 Benchmark 2: sha256sum /scratch/50p.raw
   Time (mean ± σ):     14.585 s ±  0.072 s    [User: 13.537 s, System: 1.003 s]
   Range (min … max):   14.501 s … 14.697 s    5 runs

 Summary
   './qemu-img checksum /scratch/50p.raw' ran
 7.89 ± 0.16 times faster than 'sha256sum /scratch/50p.raw'

The new command is available only when `blkhash` is available during
build. To test the new command please install the `blkhash-devel`
package:

 $ dnf copr enable nsoffer/blkhash
 $ sudo dnf install blkhash-devel

[1] https://gitlab.com/nirs/blkhash
[2] https://copr.fedorainfracloud.org/coprs/nsoffer/blkhash/
[3] Computing checksum for 8T empty image: qemu-img checksum: 3.7s,
 sha256sum (estimate): 17,749s

Signed-off-by: Nir Soffer 
---
  docs/tools/qemu-img.rst |  24 ++
  meson.build |  10 ++-
  meson_options.txt   |   2 +
  qemu-img-cmds.hx|   8 ++
  qemu-img.c  | 183 
  5 files changed, 226 insertions(+), 1 deletion(-)


[...]


diff --git a/qemu-img.c b/qemu-img.c
index c03d6b4b31..4b4ca7add3 100644
--- a/qemu-img.c
+++ b/qemu-img.c


[...]


@@ -1613,20 +1617,199 @@ out:
  qemu_vfree(buf1);
  qemu_vfree(buf2);
  blk_unref(blk2);
  out2:
  blk_unref(blk1);
  out3:
  qemu_progress_end();
  return ret;
  }
  
+#ifdef CONFIG_BLKHASH

+/*
+ * Compute image checksum.
+ */
+static int img_checksum(int argc, char **argv)
+{
+const char *digest_name = "sha256";
+const size_t block_size = 64 * KiB;
+
+_Static_assert(QEMU_IS_ALIGNED(IO_BUF_SIZE, block_size),
+   "IO_BUF_SIZE should be alligned to block_size");


(s/alligned/aligned/)

A suggestion: We have a `QEMU_BUILD_BUG_MSG()` macro in 
include/qemu/compiler.h.  Nowadays it just unconditionally resolves to 
_Static_assert; I think before C11 was adopted it used a custom 
implementation.  Still, it is what seems to be used throughout the 
actual qemu code (disregarding roms/ and pc-bios/), so I think it would 
be more fitting to use.


But that’s just a suggestion.  It always resolves to _Static_assert 
anyway, so using _Static_assert seems by no means wrong.


So with the spelling fixed:

Reviewed-by: Hanna Reitz 




Re: [PATCH v2 4/5] iotests: Test qemu-img checksum

2022-12-12 Thread Hanna Reitz

On 28.11.22 15:15, Nir Soffer wrote:

Add simple tests computing a checksum for images with all kinds of
extents in raw and qcow2 formats.

The test can be extended later for other formats, format options (e.g.
compressed qcow2), protocols (e.g. nbd), and images with a backing
chain, but I'm not sure this is really needed.

To help debugging in case of failures, the output includes a json map of
the test image.

Signed-off-by: Nir Soffer 
---
  tests/qemu-iotests/tests/qemu-img-checksum| 63 +++
  .../tests/qemu-img-checksum.out.qcow2 | 11 
  .../tests/qemu-img-checksum.out.raw   | 10 +++
  3 files changed, 84 insertions(+)
  create mode 100755 tests/qemu-iotests/tests/qemu-img-checksum
  create mode 100644 tests/qemu-iotests/tests/qemu-img-checksum.out.qcow2
  create mode 100644 tests/qemu-iotests/tests/qemu-img-checksum.out.raw


Thanks!

Reviewed-by: Hanna Reitz 




Re: [PATCH v2 2/5] Support format or cache specific out file

2022-12-12 Thread Hanna Reitz

On 28.11.22 15:15, Nir Soffer wrote:

Extend the test finder to find tests with a format-specific (*.out.qcow2)
or cache-specific (*.out.nocache) out file. Previously this worked only
for the numbered tests.
---
  tests/qemu-iotests/findtests.py | 10 ++++++++--
  1 file changed, 8 insertions(+), 2 deletions(-)


This patch lacks an S-o-b, too.


diff --git a/tests/qemu-iotests/findtests.py b/tests/qemu-iotests/findtests.py
index dd77b453b8..f4344ce78c 100644
--- a/tests/qemu-iotests/findtests.py
+++ b/tests/qemu-iotests/findtests.py
@@ -38,31 +38,37 @@ def chdir(path: Optional[str] = None) -> Iterator[None]:
  os.chdir(saved_dir)
  
  
 class TestFinder:
     def __init__(self, test_dir: Optional[str] = None) -> None:
         self.groups = defaultdict(set)

         with chdir(test_dir):
             self.all_tests = glob.glob('[0-9][0-9][0-9]')
             self.all_tests += [f for f in glob.iglob('tests/*')
-                               if not f.endswith('.out') and
-                               os.path.isfile(f + '.out')]
+                               if self.is_test(f)]


So previously a file was only considered a test file if there was a 
corresponding reference output file (`f + '.out'`), so files without 
such a reference output aren’t considered test files...



         for t in self.all_tests:
             with open(t, encoding="utf-8") as f:
                 for line in f:
                     if line.startswith('# group: '):
                         for g in line.split()[2:]:
                             self.groups[g].add(t)
                         break
  
+    def is_test(self, fname: str) -> bool:
+        """
+        The tests directory contains tests (no extension) and out files
+        (*.out, *.out.{format}, *.out.{option}).
+        """
+        return re.search(r'.+\.out(\.\w+)?$', fname) is None


...but this new function doesn’t check that.  I think we should check it 
(just whether there’s any variant of `/{fname}\.out(\.\w+)?/` to go with 
`fname`) so that behavior isn’t changed.


Hanna




Re: [PATCH v2 1/5] qemu-img.c: Move IO_BUF_SIZE to the top of the file

2022-12-12 Thread Hanna Reitz

On 28.11.22 15:15, Nir Soffer wrote:

This macro is used by various commands (compare, convert, rebase) but it
is defined somewhere in the middle of the file. I'm going to use it in
the new checksum command so let's clean up a bit before that.
---
  qemu-img.c | 3 +--
  1 file changed, 1 insertion(+), 2 deletions(-)


Looks good, but is missing your S-o-b – with it added:

Reviewed-by: Hanna Reitz 




Re: [PATCH v2 00/12] qemu-img info: Show protocol-level information

2022-12-08 Thread Hanna Reitz

On 20.06.22 18:26, Hanna Reitz wrote:

Hi,

This series is a v2 to:

https://lists.nongnu.org/archive/html/qemu-block/2022-05/msg00042.html


Ping, it looks like this still applies (to the master branch and Kevin’s 
block-next branch at least).


Hanna


So the final state is that despite the QAPI changes, hopefully only the
qemu-img info output changes, and it now prints every image node’s
subgraph.  This results in an output like the following (for a qcow2
image):

image: test.qcow2
file format: qcow2
virtual size: 64 MiB (67108864 bytes)
disk size: 196 KiB
cluster_size: 65536
Format specific information:
 compat: 1.1
 compression type: zlib
 lazy refcounts: false
 refcount bits: 16
 corrupt: false
 extended l2: false
Child node '/file':
 filename: test.qcow2
 protocol type: file
 file length: 192 KiB (197120 bytes)
 disk size: 196 KiB
 Format specific information:
 extent size hint: 1048576





Re: [PATCH v2 00/15] block: Simplify drain

2022-11-24 Thread Hanna Reitz

On 18.11.22 18:40, Kevin Wolf wrote:

I'm aware that exactly nobody has been looking forward to a series with
this title, but it has to be. The way drain works means that we need to
poll in bdrv_replace_child_noperm() and that makes things rather messy
with Emanuele's multiqueue work because you must not poll while you hold
the graph lock.

The other reason why it has to be is that drain is way too complex and
there are too many different cases. Some simplification like this will
hopefully make it considerably more maintainable. The diffstat probably
tells something, too.

There are roughly speaking three parts in this series:

1. Make BlockDriver.bdrv_drained_begin/end() non-coroutine_fn again,
which allows us to not poll on bdrv_drained_end() any more.

2. Remove subtree drains. They are a considerable complication in the
whole drain machinery (in particular, they require polling in the
BdrvChildClass.attach/detach() callbacks that are called during
bdrv_replace_child_noperm()) and none of their users actually has a
good reason to use them.

3. Finally get rid of polling in bdrv_replace_child_noperm() by
requiring that the child is already drained by the caller and calling
callbacks only once and not again for every nested drain section.

If necessary, a prefix of this series can be merged that covers only the
first or the first two parts and it would still make sense.

v2:
- Rebased on master
- Patch 3: Removed left over _co parts in function names
- Patch 4: Updated function comments to reflect that we're not polling
   any more
- Patch 6 (new): Fix inconsistent AioContext locking for reopen code
- Patch 9 (was 8): Added comment to clarify when polling is allowed
   and the graph may change again
- Patch 11 (was 10):
   * Reworded some comments and the commit message.
   * Dropped a now unnecessary assertion that was dropped only in a later
 patch in v1 of the series.
   * Changed 'int parent_quiesce_counter' into 'bool quiesced_parent'
- Patch 12 (was 11): Don't remove ignore_bds_parents from
   bdrv_drain_poll(), it is actually still a valid optimisation there
   that makes polling O(n) instead of O(n²)
- Patch 13 (new): Instead of only removing assert(!qemu_in_coroutine())
   like in v1 of the series, drop out of coroutine context in
   bdrv_do_drained_begin_quiesce() just to be sure that we'll never get
   coroutine surprises in drain code.
- Patch 14 (was 12): More and reworded comments to make things hopefully
   a bit clearer


Thanks!

Reviewed-by: Hanna Reitz 




Re: [PULL 00/11] Block layer patches

2022-11-15 Thread Hanna Reitz

On 15.11.22 11:14, Kevin Wolf wrote:

On 15.11.2022 at 00:58, John Snow wrote:

On Mon, Nov 14, 2022 at 5:56 AM Kevin Wolf  wrote:

On 11.11.2022 at 20:20, Stefan Hajnoczi wrote:

Hanna Reitz (9):
   block/mirror: Do not wait for active writes
   block/mirror: Drop mirror_wait_for_any_operation()
   block/mirror: Fix NULL s->job in active writes
   iotests/151: Test that active mirror progresses
   iotests/151: Test active requests on mirror start
   block: Make bdrv_child_get_parent_aio_context I/O
   block-backend: Update ctx immediately after root
   block: Start/end drain on correct AioContext
   tests/stream-under-throttle: New test

Hi Hanna,
This test is broken, probably due to the minimum Python version:
https://gitlab.com/qemu-project/qemu/-/jobs/3311521303

This is exactly the problem I saw with running linters in a gating CI,
but not during 'make check'. And of course, we're hitting it during the
-rc phase now. :-(

I mean. I'd love to have it run in make check too. The alternative was
never seeing this *anywhere* ...

What is the problem with running it in 'make check'? The additional
dependencies? If so, can we run it automatically if the dependencies
happen to be fulfilled and just skip it otherwise?

If I have to run 'make -C python check-pipenv' manually, I can guarantee
you that I'll forget it more often than I'll run it.


...but I'm sorry it's taken me so long to figure out how to get this
stuff to work in "make check" and also from manual iotests runs
without adding any kind of setup that you have to manage. It's just
fiddly, sorry :(


But yes, it seems that asyncio.TimeoutError should be used instead of
asyncio.exceptions.TimeoutError, and Python 3.6 has only the former.
I'll fix this up and send a v2 if it fixes check-python-pipenv.

Hopefully this goes away when we drop 3.6. I want to, but I recall
there was some question about some platforms that don't support 3.7+
"by default" and how annoying that was or wasn't. We're almost a year
out from 3.6 being EOL, so maybe after this release it's worth a crack
to see how painful it is to move on.

If I understand the documentation right, asyncio.TimeoutError is what
you should be using either way. That it happens to be a re-export from
the internal module asyncio.exceptions seems to be more of an
implementation detail, not the official interface.


Oh, so I understood 
https://docs.python.org/3/library/asyncio-exceptions.html wrong.  I took 
that to mean that as of 3.11, `asyncio.TimeoutError` is a deprecated 
alias for `asyncio.exceptions.TimeoutError`, but it’s actually become an 
alias for the now-built-in `TimeoutError` exception.  I think.
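In practice, catching `asyncio.TimeoutError` works the same way on all the versions discussed here (a hypothetical minimal example, not taken from the test):

```python
import asyncio

async def slow_op() -> str:
    await asyncio.sleep(10)
    return "done"

async def main() -> str:
    try:
        # asyncio.TimeoutError is the documented name on 3.6-3.10;
        # since 3.11 it is an alias for the built-in TimeoutError.
        return await asyncio.wait_for(slow_op(), timeout=0.01)
    except asyncio.TimeoutError:
        return "timed out"

print(asyncio.run(main()))  # prints "timed out"
```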


Hanna




Re: [PATCH 05/13] block: Inline bdrv_drain_invoke()

2022-11-14 Thread Hanna Reitz

On 08.11.22 13:37, Kevin Wolf wrote:

bdrv_drain_invoke() has now two entirely separate cases that share no
code any more and are selected depending on a bool parameter. Each case
has only one caller. Just inline the function.

Signed-off-by: Kevin Wolf 
---
  block/io.c | 23 ++-
  1 file changed, 6 insertions(+), 17 deletions(-)


Reviewed-by: Hanna Reitz 




Re: [PATCH 03/13] block: Revert .bdrv_drained_begin/end to non-coroutine_fn

2022-11-14 Thread Hanna Reitz

On 08.11.22 13:37, Kevin Wolf wrote:

Polling during bdrv_drained_end() can be problematic (and in the future,
we may get cases for bdrv_drained_begin() where polling is forbidden,
and we don't care about already in-flight requests, but just want to
prevent new requests from arriving).

The .bdrv_drained_begin/end callbacks running in a coroutine is the only
reason why we have to do this polling, so make them non-coroutine
callbacks again. None of the callers actually yield any more.

This means that bdrv_drained_end() effectively doesn't poll any more,
even if AIO_WAIT_WHILE() loops are still there (their condition is false
from the beginning). This is generally not a problem, but in
test-bdrv-drain, some additional explicit aio_poll() calls need to be
added because the test case wants to verify the final state after BHs
have executed.

Signed-off-by: Kevin Wolf 
---
  include/block/block_int-common.h | 10 ---
  block.c  |  4 +--
  block/io.c   | 49 +---
  block/qed.c  |  4 +--
  block/throttle.c |  6 ++--
  tests/unit/test-bdrv-drain.c | 18 ++--
  6 files changed, 30 insertions(+), 61 deletions(-)


As the others have already suggested, I’d too drop the _co_ in qed and 
throttle, and the coroutine_fn in throttle.  With that done:


Reviewed-by: Hanna Reitz 




Re: [PATCH 07/13] block: Don't use subtree drains in bdrv_drop_intermediate()

2022-11-14 Thread Hanna Reitz

On 08.11.22 13:37, Kevin Wolf wrote:

Instead of using a subtree drain from the top node (which also drains
child nodes of base that we're not even interested in), use a normal
drain for base, which automatically drains all of the parents, too.

Signed-off-by: Kevin Wolf 
---
  block.c | 4 ++--
  1 file changed, 2 insertions(+), 2 deletions(-)


Reviewed-by: Hanna Reitz 




Re: [PATCH 04/13] block: Remove drained_end_counter

2022-11-14 Thread Hanna Reitz

On 08.11.22 13:37, Kevin Wolf wrote:

drained_end_counter is unused now, nobody changes its value any more. It
can be removed.

In cases where we had two almost identical functions that only differed
in whether the caller passes drained_end_counter, or whether they would
poll for a local drained_end_counter to reach 0, these become a single
function.

Signed-off-by: Kevin Wolf 
---
  include/block/block-io.h | 15 -
  include/block/block_int-common.h |  6 +-
  block.c  |  5 +-
  block/block-backend.c|  4 +-
  block/io.c   | 97 
  blockjob.c   |  2 +-
  6 files changed, 30 insertions(+), 99 deletions(-)


The comments on bdrv_drained_end() and bdrv_parent_drained_end_single() 
in include/block/block-io.h still say that they poll some AioContext 
“which may result in a graph change”.  That’s no longer the case, 
though, so those paragraphs should be dropped, I think.


Apart from that, looks good.

Hanna




Re: [PATCH 12/13] block: Don't poll in bdrv_replace_child_noperm()

2022-11-14 Thread Hanna Reitz

On 08.11.22 13:37, Kevin Wolf wrote:

In order to make sure that bdrv_replace_child_noperm() doesn't have to
poll any more, get rid of the bdrv_parent_drained_begin_single() call.

This is possible now because we can require that the child is already
drained when the function is called (it better be, having in-flight
requests while modifying the graph isn't going to end well!) and we
don't call the parent drain callbacks more than once.

The additional drain calls needed in callers cause the test case to run
its code in the drain handler too early (bdrv_attach_child() drains
now), so modify it to only enable the code after the test setup has
completed.

Signed-off-by: Kevin Wolf 
---
  include/block/block-io.h |  8 
  block.c  | 72 +---
  block/io.c   |  2 +-
  tests/unit/test-bdrv-drain.c | 10 +
  4 files changed, 70 insertions(+), 22 deletions(-)


I find this change complicated.  I understand it’s the point of the 
series, but I find it difficult to grasp.  But I guess there can be no 
drain series without such a patch.


As usual, I was very skeptical of the code at first, and over time 
slowly realized that I’m mostly confused by the comments, and the code 
seems fine.  Ah, well.


[...]


diff --git a/block.c b/block.c
index 5f5f79cd16..12039e9b8a 100644
--- a/block.c
+++ b/block.c


[...]


@@ -2414,12 +2428,20 @@ static TransactionActionDrv bdrv_replace_child_drv = {
   *
   * Note: real unref of old_bs is done only on commit.
   *
+ * Both child and new_bs (if non-NULL) must be drained. new_bs must be kept
+ * drained until the transaction is completed (this automatically implies that
+ * child remains drained, too).


I find “child” extremely ambiguous.  The problem is that there generally 
is no way to drain a BdrvChild object, is there?  You can only drain the 
BDS in it, which then drains the parent through the BdrvChild object.  
Historically, I don’t think there was ever a place where we cared about 
the BdrvChild object between the two to be drained, was there?  I mean, 
now there apparently is, in bdrv_child_attach_common(), but that’s a 
different story.


So the problem is that “draining a BdrvChild object” generally appears 
in the context of bdrv_parent_drained_*() functions, i.e. actually 
functions draining the parent.  Which makes it a bit confusing to refer 
to a BdrvChild object just as “child”.


I know that “child” here refers to the variable (or does it not?), but 
that is why I really prefer marking variables that are just plain 
English words, e.g. as @child or `child`, so it’s clear they are a name 
and not a noun.


In any case, because the concept is generally to drain the `child->bs` 
instead of the BdrvChild object directly, I understand the comment to 
mean: “Both the old child (`child->bs`) and `new_bs` (if non-NULL) must 
be drained.  `new_bs` must be kept drained until the transaction is 
completed.  This implies that the parent too will be kept drained until 
the transaction is completed by the BdrvChild object `child`.”


Or am I misunderstanding something, and the distinction between `child` 
and `child->bs` and the parent node is important here? (Would be good to 
know. :))



+ *
   * The function doesn't update permissions, caller is responsible for this.
   */
  static void bdrv_replace_child_tran(BdrvChild *child, BlockDriverState 
*new_bs,
  Transaction *tran)
  {
  BdrvReplaceChildState *s = g_new(BdrvReplaceChildState, 1);
+
+assert(child->parent_quiesce_counter);
+assert(!new_bs || new_bs->quiesce_counter);
+
  *s = (BdrvReplaceChildState) {
  .child = child,
  .old_bs = child->bs,
@@ -2818,6 +2840,12 @@ static void bdrv_replace_child_noperm(BdrvChild *child,


This function now has its callers fulfill kind of a complicated 
contract.  I would prefer that to be written out in a doc comment, 
especially because it sounds like the assertions can’t cover everything 
(i.e. callers that remove a child are required to have stopped issuing 
requests to that child, but they are free to do that in any way they 
want, so no assertion will check for it here).



  int new_bs_quiesce_counter;
  
  assert(!child->frozen);

+/*
+ * When removing the child, it's the caller's responsibility to make sure
+ * that no requests are in flight any more. Usually the parent is drained,
+ * but not through child->parent_quiesce_counter.
+ */


When I see a comment above an assertion, I immediately assume it is 
going to describe what the assertion checks.  Unless I’m 
misunderstanding something (again?), this comment actually describes 
what the assertion *does not* check.  I find that confusing, especially 
because the comment leads with “it’s the caller’s responsibility”, which 
to me implies “and that’s why we check it here in this assertion”, 
because assertions are there to verify that contracts are 

Re: [PATCH 01/13] qed: Don't yield in bdrv_qed_co_drain_begin()

2022-11-14 Thread Hanna Reitz

On 08.11.22 13:37, Kevin Wolf wrote:

We want to change .bdrv_co_drained_begin() back to be a non-coroutine
callback, so in preparation, avoid yielding in its implementation.

Because we increase bs->in_flight and bdrv_drained_begin() polls, the
behaviour is unchanged.

Signed-off-by: Kevin Wolf 
---
  block/qed.c | 20 +---
  1 file changed, 17 insertions(+), 3 deletions(-)


Reviewed-by: Hanna Reitz 




[PATCH v2] tests/stream-under-throttle: New test

2022-11-14 Thread Hanna Reitz
Test streaming a base image into the top image underneath two throttle
nodes.  This was reported to make qemu 7.1 hang
(https://gitlab.com/qemu-project/qemu/-/issues/1215), so this serves as
a regression test.

Signed-off-by: Hanna Reitz 
---
Based-on: <20221107151321.211175-1-hre...@redhat.com>

v1: https://lists.nongnu.org/archive/html/qemu-block/2022-11/msg00368.html

v2:
- Replace `asyncio.exceptions.TimeoutError` by `asyncio.TimeoutError`:
  Stefan reported that the CI does not recognize the former:
  https://lists.nongnu.org/archive/html/qemu-block/2022-11/msg00424.html

  As far as I understand, the latter was basically moved to become the
  former in Python 3.11, and an alias remains, so both are basically
  equivalent.  I only have 3.10, though, where the documentation says
  that both are different, even though using either seems to work fine
  (i.e. both catch the timeout there).  Not sure about previous
  versions, but the CI seems pretty certain about not knowing
  `asyncio.exceptions.TimeoutError`, so use `asyncio.TimeoutError`
  instead.  (Even though that is deprecated in 3.11, but this is not the
  first place in the tree to use it, so it should not be too bad.)
---
 .../qemu-iotests/tests/stream-under-throttle  | 121 ++
 .../tests/stream-under-throttle.out   |   5 +
 2 files changed, 126 insertions(+)
 create mode 100755 tests/qemu-iotests/tests/stream-under-throttle
 create mode 100644 tests/qemu-iotests/tests/stream-under-throttle.out

diff --git a/tests/qemu-iotests/tests/stream-under-throttle 
b/tests/qemu-iotests/tests/stream-under-throttle
new file mode 100755
index 00..8d2d9e1684
--- /dev/null
+++ b/tests/qemu-iotests/tests/stream-under-throttle
@@ -0,0 +1,121 @@
+#!/usr/bin/env python3
+# group: rw
+#
+# Test streaming with throttle nodes on top
+#
+# Copyright (C) 2022 Red Hat, Inc.
+#
+# This program is free software; you can redistribute it and/or modify
+# it under the terms of the GNU General Public License as published by
+# the Free Software Foundation; either version 2 of the License, or
+# (at your option) any later version.
+#
+# This program is distributed in the hope that it will be useful,
+# but WITHOUT ANY WARRANTY; without even the implied warranty of
+# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+# GNU General Public License for more details.
+#
+# You should have received a copy of the GNU General Public License
+# along with this program.  If not, see <http://www.gnu.org/licenses/>.
+#
+
+import asyncio
+import os
+from typing import List
+import iotests
+from iotests import qemu_img_create, qemu_io
+
+
+image_size = 256 * 1024 * 1024
+base_img = os.path.join(iotests.test_dir, 'base.img')
+top_img = os.path.join(iotests.test_dir, 'top.img')
+
+
+class TcgVM(iotests.VM):
+'''
+Variant of iotests.VM that uses -accel tcg.  Simply using
+iotests.VM.add_args('-accel', 'tcg') is not sufficient, because that will
+put -accel qtest before -accel tcg, and -accel arguments are prioritized in
+the order they appear.
+'''
+@property
+def _base_args(self) -> List[str]:
+# Put -accel tcg first so it takes precedence
+return ['-accel', 'tcg'] + super()._base_args
+
+
+class TestStreamWithThrottle(iotests.QMPTestCase):
+def setUp(self) -> None:
+'''
+Create a simple backing chain between two images, write something to
+the base image.  Attach them to the VM underneath two throttle nodes,
+one of which has actually no limits set, but the other does.  Then put
+a virtio-blk device on top.
+This test configuration has been taken from
+https://gitlab.com/qemu-project/qemu/-/issues/1215
+'''
+qemu_img_create('-f', iotests.imgfmt, base_img, str(image_size))
+qemu_img_create('-f', iotests.imgfmt, '-b', base_img, '-F',
+iotests.imgfmt, top_img, str(image_size))
+
+# Write something to stream
+qemu_io(base_img, '-c', f'write 0 {image_size}')
+
+blockdev = {
+'driver': 'throttle',
+'node-name': 'throttled-node',
+'throttle-group': 'thrgr-limited',
+'file': {
+'driver': 'throttle',
+'throttle-group': 'thrgr-unlimited',
+'file': {
+'driver': iotests.imgfmt,
+'node-name': 'unthrottled-node',
+'file': {
+'driver': 'file',
+'filename': top_img
+}
+}
+}
+}
+
+# Issue 1215 is not reproducible in qtest mode, which is why we need to
+# create an -accel tcg VM
+self.vm = TcgVM()
+self.vm.add_object('iothread,id=iothr0')
+self.vm.add_object('throttle-group,id=thrgr-unlimited')
+self.vm.add_object('throttle-group,id=thrgr-limited,'
+

Re: [PATCH 08/13] stream: Replace subtree drain with a single node drain

2022-11-14 Thread Hanna Reitz

On 08.11.22 13:37, Kevin Wolf wrote:

The subtree drain was introduced in commit b1e1af394d9 as a way to avoid
graph changes between finding the base node and changing the block graph
as necessary on completion of the image streaming job.

The block graph could change between these two points because
bdrv_set_backing_hd() first drains the parent node, which involved
polling and can do anything.

Subtree draining was an imperfect way to make this less likely (because
with it, fewer callbacks are called during this window). Everyone agreed
that it's not really the right solution, and it was only committed as a
stopgap solution.

This replaces the subtree drain with a solution that simply drains the
parent node before we try to find the base node, and then call a version
of bdrv_set_backing_hd() that doesn't drain, but just asserts that the
parent node is already drained.

This way, any graph changes caused by draining happen before we start
looking at the graph and things stay consistent between finding the base
node and changing the graph.

Signed-off-by: Kevin Wolf 
---
  include/block/block-global-state.h |  3 +++
  block.c| 17 ++---
  block/stream.c | 20 ++--
  3 files changed, 27 insertions(+), 13 deletions(-)


Reviewed-by: Hanna Reitz 




Re: [PULL 00/11] Block layer patches

2022-11-14 Thread Hanna Reitz

On 11.11.22 20:20, Stefan Hajnoczi wrote:

On Fri, 11 Nov 2022 at 10:29, Kevin Wolf  wrote:

The following changes since commit 2ccad61746ca7de5dd3e25146062264387e43bd4:

   Merge tag 'pull-tcg-20221109' of https://gitlab.com/rth7680/qemu into 
staging (2022-11-09 13:26:45 -0500)

are available in the Git repository at:

   https://repo.or.cz/qemu/kevin.git tags/for-upstream

for you to fetch changes up to b04af371af685c12970ea93027dc6d8bf86265aa:

   tests/stream-under-throttle: New test (2022-11-11 13:02:43 +0100)


Block layer patches

- Fix deadlock in graph modification with iothreads
- mirror: Fix non-converging cases for active mirror
- qapi: Fix BlockdevOptionsNvmeIoUring @path description
- blkio: Set BlockDriver::has_variable_length to false


Alberto Faria (2):
   qapi/block-core: Fix BlockdevOptionsNvmeIoUring @path description
   block/blkio: Set BlockDriver::has_variable_length to false

Hanna Reitz (9):
   block/mirror: Do not wait for active writes
   block/mirror: Drop mirror_wait_for_any_operation()
   block/mirror: Fix NULL s->job in active writes
   iotests/151: Test that active mirror progresses
   iotests/151: Test active requests on mirror start
   block: Make bdrv_child_get_parent_aio_context I/O
   block-backend: Update ctx immediately after root
   block: Start/end drain on correct AioContext
   tests/stream-under-throttle: New test

Hi Hanna,
This test is broken, probably due to the minimum Python version:
https://gitlab.com/qemu-project/qemu/-/jobs/3311521303


:(

I just took the exception name (asyncio.exceptions.TimeoutError) from 
the test output when a timeout occurred, seems indeed like that’s too 
new.  I’m not entirely sure when that was introduced, and what it’s 
relationship to asyncio.TimeoutError is – in 3.11, the latter is an 
alias for the former, but I have 3.10 myself, where the documentation 
says both are distinct.  Anyway, using either works fine here, and the 
existing scripts in python/qemu use asyncio.TimeoutError, so I’ve sent a 
v2 of the patch to do the same.
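For illustration, a minimal plain-Python sketch (nothing QEMU-specific; the helper name `wait_briefly` is made up) of why catching `asyncio.TimeoutError` works across versions:

```python
import asyncio

async def wait_briefly():
    # asyncio.wait_for() raises asyncio.TimeoutError on expiry.  Since
    # Python 3.8 that name is the same class as
    # asyncio.exceptions.TimeoutError, and since 3.11 it is additionally
    # an alias for the builtin TimeoutError, so catching
    # asyncio.TimeoutError works on both old and new versions.
    try:
        await asyncio.wait_for(asyncio.sleep(10), timeout=0.01)
    except asyncio.TimeoutError:
        return 'timed out'

print(asyncio.run(wait_briefly()))  # -> timed out
```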


(For the record, this test isn’t vital for anything, so just dropping it 
from the pull request seems perfectly fine to me.)


Hanna




Re: [PATCH 09/13] block: Remove subtree drains

2022-11-14 Thread Hanna Reitz

On 08.11.22 13:37, Kevin Wolf wrote:

Subtree drains are not used any more. Remove them.

After this, BdrvChildClass.attach/detach() don't poll any more.

Signed-off-by: Kevin Wolf 
---
  include/block/block-io.h |  18 +--
  include/block/block_int-common.h |   1 -
  include/block/block_int-io.h |  12 --
  block.c  |  20 +--
  block/io.c   | 121 +++---
  tests/unit/test-bdrv-drain.c | 261 ++-
  6 files changed, 44 insertions(+), 389 deletions(-)


Reviewed-by: Hanna Reitz 




Re: [PATCH 10/13] block: Call drain callbacks only once

2022-11-14 Thread Hanna Reitz

On 08.11.22 13:37, Kevin Wolf wrote:

We only need to call both the BlockDriver's callback and the parent
callbacks when going from undrained to drained or vice versa. A second
drain section doesn't make a difference for the driver or the parent,
they weren't supposed to send new requests before and after the second
drain.

One thing that gets in the way is the 'ignore_bds_parents' parameter in
bdrv_do_drained_begin_quiesce() and bdrv_do_drained_end(): If it is true
for the first drain, bs->quiesce_counter will be non-zero, but the
parent callbacks still haven't been called, so a second drain where it
is false would still have to call them.

Instead of keeping track of this, let's just get rid of the parameter.
It was introduced in commit 6cd5c9d7b2d as an optimisation so that
during bdrv_drain_all(), we wouldn't recursively drain all parents up to
the root for each node, resulting in quadratic complexity. As it happens,
calling the callbacks only once solves the same problem, so as of this
patch, we'll still have O(n) complexity and ignore_bds_parents is not
needed any more.
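To illustrate the idea (a hedged Python sketch, not QEMU’s actual types or callbacks), notifying only on the outermost begin/end of a quiesce counter looks like this:

```python
# Sketch: drain callbacks fire only on the 0 -> 1 and 1 -> 0 transitions
# of a quiesce counter, so nested drain sections do not re-notify the
# driver or the parents.
class Node:
    def __init__(self):
        self.quiesce_counter = 0
        self.events = []

    def drained_begin(self):
        self.quiesce_counter += 1
        if self.quiesce_counter == 1:  # only the first drain notifies
            self.events.append('begin')

    def drained_end(self):
        assert self.quiesce_counter > 0
        self.quiesce_counter -= 1
        if self.quiesce_counter == 0:  # only the last end notifies
            self.events.append('end')

n = Node()
n.drained_begin(); n.drained_begin()  # nested drain section
n.drained_end(); n.drained_end()
print(n.events)  # -> ['begin', 'end']
```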

This patch only ignores the 'ignore_bds_parents' parameter. It will be
removed in a separate patch.

Signed-off-by: Kevin Wolf 
---
  block.c  | 13 ++---
  block/io.c   | 24 +---
  tests/unit/test-bdrv-drain.c | 16 ++--
  3 files changed, 29 insertions(+), 24 deletions(-)


I too would like parent_quiesce_counter to become `bool 
parent_quiesced`, but:


Reviewed-by: Hanna Reitz 




Re: [PATCH 13/13] block: Remove poll parameter from bdrv_parent_drained_begin_single()

2022-11-14 Thread Hanna Reitz

On 08.11.22 13:37, Kevin Wolf wrote:

All callers of bdrv_parent_drained_begin_single() pass poll=false now,
so we don't need the parameter any more.

Signed-off-by: Kevin Wolf 
---
  include/block/block-io.h | 5 ++---
  block.c  | 4 ++--
  block/io.c   | 7 ++-
  3 files changed, 6 insertions(+), 10 deletions(-)


Well, “drained_begin” does not mean “drain”, so...

Reviewed-by: Hanna Reitz 




Re: [PATCH 11/13] block: Remove ignore_bds_parents parameter from drain functions

2022-11-14 Thread Hanna Reitz

On 08.11.22 13:37, Kevin Wolf wrote:

ignore_bds_parents is now ignored, so we can just remove it.

Signed-off-by: Kevin Wolf 
---
  include/block/block-io.h | 10 ++
  block.c  |  4 +--
  block/io.c   | 78 +++-
  3 files changed, 32 insertions(+), 60 deletions(-)


Reviewed-by: Hanna Reitz 




Re: [PATCH 02/13] test-bdrv-drain: Don't yield in .bdrv_co_drained_begin/end()

2022-11-14 Thread Hanna Reitz

On 08.11.22 13:37, Kevin Wolf wrote:

We want to change .bdrv_co_drained_begin/end() back to be non-coroutine
callbacks, so in preparation, avoid yielding in their implementation.

This does almost the same as the existing logic in bdrv_drain_invoke(),
by creating and entering coroutines internally. However, since the test
case is by far the heaviest user of coroutine code in drain callbacks,
it is preferable to have the complexity in the test case rather than the
drain core, which is already complicated enough without this.

The behaviour for bdrv_drain_begin() is unchanged because we increase
bs->in_flight and this is still polled. However, bdrv_drain_end()
doesn't wait for the spawned coroutine to complete any more. This is
fine, we don't rely on bdrv_drain_end() restarting all operations
immediately before the next aio_poll().

Signed-off-by: Kevin Wolf 
---
  tests/unit/test-bdrv-drain.c | 64 ++--
  1 file changed, 46 insertions(+), 18 deletions(-)


Reviewed-by: Hanna Reitz 




Re: [PATCH v2 0/3] block: Start/end drain on correct AioContext

2022-11-10 Thread Hanna Reitz

On 10.11.22 15:01, Kevin Wolf wrote:

On 07.11.2022 at 16:13, Hanna Reitz wrote:

Hi,

v1 cover letter:
https://lists.nongnu.org/archive/html/qemu-block/2022-09/msg00389.html

bdrv_replace_child_noperm() drains the child via
bdrv_parent_drained_{begin,end}_single().  When it removes a child, the
bdrv_parent_drained_end_single() at its end will be called on an empty
child, making the BDRV_POLL_WHILE() in it poll the main AioContext
(because c->bs is NULL).

That’s wrong, though, because it’s supposed to operate on the parent.
bdrv_parent_drained_end_single_no_poll() will have scheduled any BHs in
the parents’ AioContext, which may be anything, not necessarily the main
context.  Therefore, we must poll the parent’s context.

Patch 3 does this for both bdrv_parent_drained_{begin,end}_single().
Patch 1 ensures that we can legally call
bdrv_child_get_parent_aio_context() from those I/O context functions,
and patch 2 fixes blk_do_set_aio_context() to not cause an assertion
failure if beginning a drain can end up in blk_get_aio_context()
before blk->ctx has been updated.

Thanks, applied to the block branch.


Thanks!

I tested your drain series, and it does indeed fix the bug, too. (Sorry 
for the delay, I thought it’d take less time to write an iotest...)



I would still be interested in a test case as a follow-up.


Got it working now and sent as “tests/stream-under-throttle: New test”.

Hanna




[PATCH] tests/stream-under-throttle: New test

2022-11-10 Thread Hanna Reitz
Test streaming a base image into the top image underneath two throttle
nodes.  This was reported to make qemu 7.1 hang
(https://gitlab.com/qemu-project/qemu/-/issues/1215), so this serves as
a regression test.

Signed-off-by: Hanna Reitz 
---
This is a regression test for
https://lists.nongnu.org/archive/html/qemu-block/2022-11/msg00180.html
("block: Start/end drain on correct AioContext"), and so depends on that
series.

Based-on: <20221107151321.211175-1-hre...@redhat.com>

Note that Kevin's drain series also fixes this bug:
https://lists.nongnu.org/archive/html/qemu-block/2022-11/msg00204.html
---
 .../qemu-iotests/tests/stream-under-throttle  | 121 ++
 .../tests/stream-under-throttle.out   |   5 +
 2 files changed, 126 insertions(+)
 create mode 100755 tests/qemu-iotests/tests/stream-under-throttle
 create mode 100644 tests/qemu-iotests/tests/stream-under-throttle.out

diff --git a/tests/qemu-iotests/tests/stream-under-throttle 
b/tests/qemu-iotests/tests/stream-under-throttle
new file mode 100755
index 00..37788b147d
--- /dev/null
+++ b/tests/qemu-iotests/tests/stream-under-throttle
@@ -0,0 +1,121 @@
+#!/usr/bin/env python3
+# group: rw
+#
+# Test streaming with throttle nodes on top
+#
+# Copyright (C) 2022 Red Hat, Inc.
+#
+# This program is free software; you can redistribute it and/or modify
+# it under the terms of the GNU General Public License as published by
+# the Free Software Foundation; either version 2 of the License, or
+# (at your option) any later version.
+#
+# This program is distributed in the hope that it will be useful,
+# but WITHOUT ANY WARRANTY; without even the implied warranty of
+# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+# GNU General Public License for more details.
+#
+# You should have received a copy of the GNU General Public License
+# along with this program.  If not, see <http://www.gnu.org/licenses/>.
+#
+
+import asyncio
+import os
+from typing import List
+import iotests
+from iotests import qemu_img_create, qemu_io
+
+
+image_size = 256 * 1024 * 1024
+base_img = os.path.join(iotests.test_dir, 'base.img')
+top_img = os.path.join(iotests.test_dir, 'top.img')
+
+
+class TcgVM(iotests.VM):
+'''
+Variant of iotests.VM that uses -accel tcg.  Simply using
+iotests.VM.add_args('-accel', 'tcg') is not sufficient, because that will
+put -accel qtest before -accel tcg, and -accel arguments are prioritized in
+the order they appear.
+'''
+@property
+def _base_args(self) -> List[str]:
+# Put -accel tcg first so it takes precedence
+return ['-accel', 'tcg'] + super()._base_args
+
+
+class TestStreamWithThrottle(iotests.QMPTestCase):
+def setUp(self) -> None:
+'''
+Create a simple backing chain between two images, write something to
+the base image.  Attach them to the VM underneath two throttle nodes,
+one of which has actually no limits set, but the other does.  Then put
+a virtio-blk device on top.
+This test configuration has been taken from
+https://gitlab.com/qemu-project/qemu/-/issues/1215
+'''
+qemu_img_create('-f', iotests.imgfmt, base_img, str(image_size))
+qemu_img_create('-f', iotests.imgfmt, '-b', base_img, '-F',
+iotests.imgfmt, top_img, str(image_size))
+
+# Write something to stream
+qemu_io(base_img, '-c', f'write 0 {image_size}')
+
+blockdev = {
+'driver': 'throttle',
+'node-name': 'throttled-node',
+'throttle-group': 'thrgr-limited',
+'file': {
+'driver': 'throttle',
+'throttle-group': 'thrgr-unlimited',
+'file': {
+'driver': iotests.imgfmt,
+'node-name': 'unthrottled-node',
+'file': {
+'driver': 'file',
+'filename': top_img
+}
+}
+}
+}
+
+# Issue 1215 is not reproducible in qtest mode, which is why we need to
+# create an -accel tcg VM
+self.vm = TcgVM()
+self.vm.add_object('iothread,id=iothr0')
+self.vm.add_object('throttle-group,id=thrgr-unlimited')
+self.vm.add_object('throttle-group,id=thrgr-limited,'
+   'x-iops-total=1,x-bps-total=104857600')
+self.vm.add_blockdev(self.vm.qmp_to_opts(blockdev))
+self.vm.add_device('virtio-blk,iothread=iothr0,drive=throttled-node')
+self.vm.launch()
+
+def tearDown(self) -> None:
+self.vm.shutdown()
+os.remove(top_img)
+os.remove(base_img)
+
+def test_stream(self) -> None:
+'''
+Do a simple stream beneath the two throttle nodes.  Should complete
+with no problems.
+'''
+result = self.vm.qmp('block-stream',
+ 

Re: [PATCH for-7.2 3/5] block/mirror: Fix NULL s->job in active writes

2022-11-10 Thread Hanna Reitz

On 10.11.22 13:10, Kevin Wolf wrote:

On 09.11.2022 at 17:54, Hanna Reitz wrote:

There is a small gap in mirror_start_job() before putting the mirror
filter node into the block graph (bdrv_append() call) and the actual job
being created.  Before the job is created, MirrorBDSOpaque.job is NULL.

It is possible that requests come in when bdrv_drained_end() is called,
and those requests would see MirrorBDSOpaque.job == NULL.  Have our
filter node handle that case gracefully.

Signed-off-by: Hanna Reitz 

This can only happen because bdrv_drained_end() polls, right? So after
changing that it won't be necessary any more, but this series is for 7.2
and the drain one isn't, so this is the right thing to do for now.


I was wondering the same, but I haven’t tested it yet.

Hanna




[PATCH for-7.2 4/5] iotests/151: Test that active mirror progresses

2022-11-09 Thread Hanna Reitz
Before this series, a mirror job in write-blocking mode would pause
issuing background requests while active requests are in flight.  Thus,
if the source is constantly in use by active requests, no actual
progress can be made.

This series should have fixed that, making the mirror job issue
background requests even while active requests are in flight.

Have a new test case in 151 verify this.

Signed-off-by: Hanna Reitz 
---
 tests/qemu-iotests/151 | 180 -
 tests/qemu-iotests/151.out |   4 +-
 2 files changed, 181 insertions(+), 3 deletions(-)

diff --git a/tests/qemu-iotests/151 b/tests/qemu-iotests/151
index 93d14193d0..0a052e5050 100755
--- a/tests/qemu-iotests/151
+++ b/tests/qemu-iotests/151
@@ -19,7 +19,10 @@
 # along with this program.  If not, see <http://www.gnu.org/licenses/>.
 #
 
+import math
 import os
+import subprocess
+from typing import List
 import iotests
 from iotests import qemu_img
 
@@ -50,7 +53,7 @@ class TestActiveMirror(iotests.QMPTestCase):
 self.vm = iotests.VM()
 self.vm.add_drive_raw(self.vm.qmp_to_opts(blk_source))
 self.vm.add_blockdev(self.vm.qmp_to_opts(blk_target))
-self.vm.add_device('virtio-blk,drive=source')
+self.vm.add_device('virtio-blk,id=vblk,drive=source')
 self.vm.launch()
 
 def tearDown(self):
@@ -192,6 +195,181 @@ class TestActiveMirror(iotests.QMPTestCase):
 self.potential_writes_in_flight = False
 
 
+class TestThrottledWithNbdExport(iotests.QMPTestCase):
+image_len = 128 * 1024 * 1024  # MB
+iops = 16
+background_processes: List['subprocess.Popen[str]'] = []
+
+def setUp(self):
+qemu_img('create', '-f', iotests.imgfmt, source_img, '128M')
+qemu_img('create', '-f', iotests.imgfmt, target_img, '128M')
+
+self.vm = iotests.VM()
+self.vm.launch()
+
+result = self.vm.qmp('object-add', **{
+'qom-type': 'throttle-group',
+'id': 'thrgr',
+'limits': {
+'iops-total': self.iops,
+'iops-total-max': self.iops
+}
+})
+self.assert_qmp(result, 'return', {})
+
+result = self.vm.qmp('blockdev-add', **{
+'node-name': 'source-node',
+'driver': 'throttle',
+'throttle-group': 'thrgr',
+'file': {
+'driver': iotests.imgfmt,
+'file': {
+'driver': 'file',
+'filename': source_img
+}
+}
+})
+self.assert_qmp(result, 'return', {})
+
+result = self.vm.qmp('blockdev-add', **{
+'node-name': 'target-node',
+'driver': iotests.imgfmt,
+'file': {
+'driver': 'file',
+'filename': target_img
+}
+})
+self.assert_qmp(result, 'return', {})
+
+self.nbd_sock = iotests.file_path('nbd.sock',
+  base_dir=iotests.sock_dir)
+self.nbd_url = f'nbd+unix:///source-node?socket={self.nbd_sock}'
+
+result = self.vm.qmp('nbd-server-start', addr={
+'type': 'unix',
+'data': {
+'path': self.nbd_sock
+}
+})
+self.assert_qmp(result, 'return', {})
+
+result = self.vm.qmp('block-export-add', id='exp0', type='nbd',
+ node_name='source-node', writable=True)
+self.assert_qmp(result, 'return', {})
+
+def tearDown(self):
+# Wait for background requests to settle
+try:
+while True:
+p = self.background_processes.pop()
+while True:
+try:
+p.wait(timeout=0.0)
+break
+except subprocess.TimeoutExpired:
+self.vm.qtest(f'clock_step {1 * 1000 * 1000 * 1000}')
+except IndexError:
+pass
+
+# Cancel ongoing block jobs
+for job in self.vm.qmp('query-jobs')['return']:
+self.vm.qmp('block-job-cancel', device=job['id'], force=True)
+
+while True:
+self.vm.qtest(f'clock_step {1 * 1000 * 1000 * 1000}')
+if len(self.vm.qmp('query-jobs')['return']) == 0:
+break
+
+self.vm.shutdown()
+os.remove(source_img)
+os.remove(target_img)
+
+def testUnderLoad(self):
+'''
+Throttle the source node, then issue a whole bunch of external requests
+while the mirror job (in write-blocking mode) is running.  We want to
+see background requests being issued even while the source is under
+full load by active writes, so that progress can be made towards READY.
+'''
+
+# Fill the first half of the source image; do not fill the second half,
+# that is where we will have active requests occur.  This e

[PATCH for-7.2 5/5] iotests/151: Test active requests on mirror start

2022-11-09 Thread Hanna Reitz
Have write requests happen to the source node right when we start a
mirror job.  The mirror filter node may encounter MirrorBDSOpaque.job
being NULL, but this should not cause a segfault.

Signed-off-by: Hanna Reitz 
---
 tests/qemu-iotests/151 | 53 +++---
 tests/qemu-iotests/151.out |  4 +--
 2 files changed, 52 insertions(+), 5 deletions(-)

diff --git a/tests/qemu-iotests/151 b/tests/qemu-iotests/151
index 0a052e5050..b4d1bc2553 100755
--- a/tests/qemu-iotests/151
+++ b/tests/qemu-iotests/151
@@ -22,7 +22,8 @@
 import math
 import os
 import subprocess
-from typing import List
+import time
+from typing import List, Optional
 import iotests
 from iotests import qemu_img
 
@@ -195,12 +196,15 @@ class TestActiveMirror(iotests.QMPTestCase):
 self.potential_writes_in_flight = False
 
 
-class TestThrottledWithNbdExport(iotests.QMPTestCase):
+class TestThrottledWithNbdExportBase(iotests.QMPTestCase):
 image_len = 128 * 1024 * 1024  # MB
-iops = 16
+iops: Optional[int] = None
 background_processes: List['subprocess.Popen[str]'] = []
 
 def setUp(self):
+# Must be set by subclasses
+self.assertIsNotNone(self.iops)
+
 qemu_img('create', '-f', iotests.imgfmt, source_img, '128M')
 qemu_img('create', '-f', iotests.imgfmt, target_img, '128M')
 
@@ -284,6 +288,10 @@ class TestThrottledWithNbdExport(iotests.QMPTestCase):
 os.remove(source_img)
 os.remove(target_img)
 
+
+class TestLowThrottledWithNbdExport(TestThrottledWithNbdExportBase):
+iops = 16
+
 def testUnderLoad(self):
 '''
 Throttle the source node, then issue a whole bunch of external requests
@@ -370,6 +378,45 @@ class TestThrottledWithNbdExport(iotests.QMPTestCase):
 self.assertGreater(start_remaining - end_remaining, 0)
 
 
+class TestHighThrottledWithNbdExport(TestThrottledWithNbdExportBase):
+iops = 1024
+
+def testActiveOnCreation(self):
+'''
+Issue requests on the mirror source node right as the mirror is
+instated.  It's possible that requests occur before the actual job is
+created, but after the node has been put into the graph.  Write
+requests across the node must in that case be forwarded to the source
+node without attempting to mirror them (there is no job object yet, so
+attempting to access it would cause a segfault).
+We do this with a lightly throttled node (i.e. quite high IOPS limit).
+Using throttling seems to increase reproducibility, but if the limit is
+too low, all requests allowed per second will be submitted before
+mirror_start_job() gets to the problematic point.
+'''
+
+# Let qemu-img bench create write requests (enough for two seconds on
+# the virtual clock)
+bench_args = ['bench', '-w', '-d', '1024', '-f', 'nbd',
+  '-c', str(self.iops * 2), self.nbd_url]
+p = iotests.qemu_tool_popen(iotests.qemu_img_args + bench_args)
+self.background_processes += [p]
+
+# Give qemu-img bench time to start up and issue requests
+time.sleep(1.0)
+# Flush the request queue, so new requests can come in right as we
+# start blockdev-mirror
+self.vm.qtest(f'clock_step {1 * 1000 * 1000 * 1000}')
+
+result = self.vm.qmp('blockdev-mirror',
+ job_id='mirror',
+ device='source-node',
+ target='target-node',
+ sync='full',
+ copy_mode='write-blocking')
+self.assert_qmp(result, 'return', {})
+
+
 if __name__ == '__main__':
 iotests.main(supported_fmts=['qcow2', 'raw'],
  supported_protocols=['file'])
diff --git a/tests/qemu-iotests/151.out b/tests/qemu-iotests/151.out
index 914e3737bd..3f8a935a08 100644
--- a/tests/qemu-iotests/151.out
+++ b/tests/qemu-iotests/151.out
@@ -1,5 +1,5 @@
-.
+..
 --
-Ran 5 tests
+Ran 6 tests
 
 OK
-- 
2.36.1




[PATCH for-7.2 3/5] block/mirror: Fix NULL s->job in active writes

2022-11-09 Thread Hanna Reitz
There is a small gap in mirror_start_job() before putting the mirror
filter node into the block graph (bdrv_append() call) and the actual job
being created.  Before the job is created, MirrorBDSOpaque.job is NULL.

It is possible that requests come in when bdrv_drained_end() is called,
and those requests would see MirrorBDSOpaque.job == NULL.  Have our
filter node handle that case gracefully.

Signed-off-by: Hanna Reitz 
---
 block/mirror.c | 20 
 1 file changed, 12 insertions(+), 8 deletions(-)

diff --git a/block/mirror.c b/block/mirror.c
index 5b6f42392c..251adc5ae0 100644
--- a/block/mirror.c
+++ b/block/mirror.c
@@ -1438,11 +1438,13 @@ static int coroutine_fn 
bdrv_mirror_top_do_write(BlockDriverState *bs,
 MirrorOp *op = NULL;
 MirrorBDSOpaque *s = bs->opaque;
 int ret = 0;
-bool copy_to_target;
+bool copy_to_target = false;
 
-copy_to_target = s->job->ret >= 0 &&
- !job_is_cancelled(&s->job->common.job) &&
- s->job->copy_mode == MIRROR_COPY_MODE_WRITE_BLOCKING;
+if (s->job) {
+copy_to_target = s->job->ret >= 0 &&
+ !job_is_cancelled(&s->job->common.job) &&
+ s->job->copy_mode == MIRROR_COPY_MODE_WRITE_BLOCKING;
+}
 
 if (copy_to_target) {
 op = active_write_prepare(s->job, offset, bytes);
@@ -1487,11 +1489,13 @@ static int coroutine_fn 
bdrv_mirror_top_pwritev(BlockDriverState *bs,
 QEMUIOVector bounce_qiov;
 void *bounce_buf;
 int ret = 0;
-bool copy_to_target;
+bool copy_to_target = false;
 
-copy_to_target = s->job->ret >= 0 &&
- !job_is_cancelled(&s->job->common.job) &&
- s->job->copy_mode == MIRROR_COPY_MODE_WRITE_BLOCKING;
+if (s->job) {
+copy_to_target = s->job->ret >= 0 &&
+ !job_is_cancelled(&s->job->common.job) &&
+ s->job->copy_mode == MIRROR_COPY_MODE_WRITE_BLOCKING;
+}
 
 if (copy_to_target) {
 /* The guest might concurrently modify the data to write; but
-- 
2.36.1




[PATCH for-7.2 1/5] block/mirror: Do not wait for active writes

2022-11-09 Thread Hanna Reitz
Waiting for all active writes to settle before daring to create a
background copying operation means that we will never do background
operations while the guest does anything (in write-blocking mode), and
therefore cannot converge.  Yes, we also will not diverge, but actually
converging would be even nicer.

It is unclear why we did decide to wait for all active writes to settle
before creating a background operation, but it just does not seem
necessary.  Active writes will put themselves into the in_flight bitmap
and thus properly block actually conflicting background requests.
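As an illustration of that interaction (a loose Python sketch, not QEMU’s bitmap implementation; `active_write` and `background_copy_candidate` are made-up names), a background copy simply shrinks itself to avoid in-flight active-write ranges:

```python
# Sketch: active writes mark their byte range as in flight; a background
# copy limits its area so it does not overlap any in-flight byte.
in_flight = set()

def active_write(offset, length):
    in_flight.update(range(offset, offset + length))

def background_copy_candidate(offset, length):
    # advance until the first in-flight byte (or the end of the range)
    end = offset
    while end < offset + length and end not in in_flight:
        end += 1
    return (offset, end - offset)

active_write(4, 4)                       # guest write covers bytes 4..7
print(background_copy_candidate(0, 10))  # -> (0, 4)
print(background_copy_candidate(8, 2))   # -> (8, 2)
```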

It is important for active requests to wait on overlapping background
requests, which we do in active_write_prepare().  However, so far it was
not documented why it is important.  Add such documentation now, and
also to the other call of mirror_wait_on_conflicts(), so that it becomes
more clear why and when requests need to actively wait for other
requests to settle.

Another thing to note is that of course we need to ensure that there are
no active requests when the job completes, but that is done by virtue of
the BDS being drained anyway, so there cannot be any active requests at
that point.

With this change, we will need to explicitly keep track of how many
bytes are in flight in active requests so that
job_progress_set_remaining() in mirror_run() can set the correct number
of remaining bytes.

Buglink: https://bugzilla.redhat.com/show_bug.cgi?id=2123297
Signed-off-by: Hanna Reitz 
---
 block/mirror.c | 37 ++---
 1 file changed, 30 insertions(+), 7 deletions(-)

diff --git a/block/mirror.c b/block/mirror.c
index 1a75a47cc3..e5467b0053 100644
--- a/block/mirror.c
+++ b/block/mirror.c
@@ -82,6 +82,7 @@ typedef struct MirrorBlockJob {
 int max_iov;
 bool initial_zeroing_ongoing;
 int in_active_write_counter;
+int64_t active_write_bytes_in_flight;
 bool prepared;
 bool in_drain;
 } MirrorBlockJob;
@@ -494,6 +495,13 @@ static uint64_t coroutine_fn 
mirror_iteration(MirrorBlockJob *s)
 }
 bdrv_dirty_bitmap_unlock(s->dirty_bitmap);
 
+/*
+ * Wait for concurrent requests to @offset.  The next loop will limit the
+ * copied area based on in_flight_bitmap so we only copy an area that does
+ * not overlap with concurrent in-flight requests.  Still, we would like to
+ * copy something, so wait until there are at least no more requests to the
+ * very beginning of the area.
+ */
 mirror_wait_on_conflicts(NULL, s, offset, 1);
 
 job_pause_point(&s->common.job);
@@ -988,12 +996,6 @@ static int coroutine_fn mirror_run(Job *job, Error **errp)
 int64_t cnt, delta;
 bool should_complete;
 
-/* Do not start passive operations while there are active
- * writes in progress */
-while (s->in_active_write_counter) {
-mirror_wait_for_any_operation(s, true);
-}
-
 if (s->ret < 0) {
 ret = s->ret;
 goto immediate_exit;
@@ -1010,7 +1012,9 @@ static int coroutine_fn mirror_run(Job *job, Error **errp)
 /* cnt is the number of dirty bytes remaining and s->bytes_in_flight is
  * the number of bytes currently being processed; together those are
  * the current remaining operation length */
-job_progress_set_remaining(&s->common.job, s->bytes_in_flight + cnt);
+job_progress_set_remaining(&s->common.job,
+   s->bytes_in_flight + cnt +
+   s->active_write_bytes_in_flight);
 
 /* Note that even when no rate limit is applied we need to yield
  * periodically with no pending I/O so that bdrv_drain_all() returns.
@@ -1071,6 +1075,10 @@ static int coroutine_fn mirror_run(Job *job, Error 
**errp)
 
 s->in_drain = true;
 bdrv_drained_begin(bs);
+
+/* Must be zero because we are drained */
+assert(s->in_active_write_counter == 0);
+
 cnt = bdrv_get_dirty_count(s->dirty_bitmap);
 if (cnt > 0 || mirror_flush(s) < 0) {
 bdrv_drained_end(bs);
@@ -1306,6 +1314,7 @@ do_sync_target_write(MirrorBlockJob *job, MirrorMethod 
method,
 }
 
 job_progress_increase_remaining(&job->common.job, bytes);
+job->active_write_bytes_in_flight += bytes;
 
 switch (method) {
 case MIRROR_METHOD_COPY:
@@ -1327,6 +1336,7 @@ do_sync_target_write(MirrorBlockJob *job, MirrorMethod 
method,
 abort();
 }
 
+job->active_write_bytes_in_flight -= bytes;
 if (ret >= 0) {
 job_progress_update(&job->common.job, bytes);
 } else {
@@ -1375,6 +1385,19 @@ static MirrorOp *coroutine_fn 
active_write_prepare(MirrorBlockJob *s,
 
 s->in_active_write_counter++;
 
+/*
+ * Wait for concurrent requests affecting the area.  If there are already
+ * running requests that are copying off now-to-be stale data

[PATCH for-7.2 2/5] block/mirror: Drop mirror_wait_for_any_operation()

2022-11-09 Thread Hanna Reitz
mirror_wait_for_free_in_flight_slot() is the only remaining user of
mirror_wait_for_any_operation(), so inline the latter into the former.

Signed-off-by: Hanna Reitz 
---
 block/mirror.c | 21 -
 1 file changed, 8 insertions(+), 13 deletions(-)

diff --git a/block/mirror.c b/block/mirror.c
index e5467b0053..5b6f42392c 100644
--- a/block/mirror.c
+++ b/block/mirror.c
@@ -305,19 +305,21 @@ static int mirror_cow_align(MirrorBlockJob *s, int64_t 
*offset,
 }
 
 static inline void coroutine_fn
-mirror_wait_for_any_operation(MirrorBlockJob *s, bool active)
+mirror_wait_for_free_in_flight_slot(MirrorBlockJob *s)
 {
 MirrorOp *op;
 
 QTAILQ_FOREACH(op, &s->ops_in_flight, next) {
-/* Do not wait on pseudo ops, because it may in turn wait on
+/*
+ * Do not wait on pseudo ops, because it may in turn wait on
  * some other operation to start, which may in fact be the
  * caller of this function.  Since there is only one pseudo op
  * at any given time, we will always find some real operation
- * to wait on. */
-if (!op->is_pseudo_op && op->is_in_flight &&
-op->is_active_write == active)
-{
+ * to wait on.
+ * Also, do not wait on active operations, because they do not
+ * use up in-flight slots.
+ */
+if (!op->is_pseudo_op && op->is_in_flight && !op->is_active_write) {
 qemu_co_queue_wait(&op->waiting_requests, NULL);
 return;
 }
@@ -325,13 +327,6 @@ mirror_wait_for_any_operation(MirrorBlockJob *s, bool 
active)
 abort();
 }
 
-static inline void coroutine_fn
-mirror_wait_for_free_in_flight_slot(MirrorBlockJob *s)
-{
-/* Only non-active operations use up in-flight slots */
-mirror_wait_for_any_operation(s, false);
-}
-
 /* Perform a mirror copy operation.
  *
  * *op->bytes_handled is set to the number of bytes copied after and
-- 
2.36.1




[PATCH for-7.2 0/5] block/mirror: Do not wait for active writes

2022-11-09 Thread Hanna Reitz
Hi,

For some reason(TM), the mirror job, when in write-blocking mode, has
decided not to issue background requests while active writes are in
flight.  (Or rather, it was probably me who decided that.)

The problem is that only background requests can really reliably help
you make progress.  When all the mirror job does is to mirror guest
writes to the target, but not copy anything else that remains to be
copied in the disk, it will not converge.

It is unclear why it is that way, so patch 1 simply drops that
dependency, and attempts to better explain the remaining
wait-dependencies we have between requests (i.e. why active requests
must wait on all overlapping requests for the whole range, but
background requests only wait on any conflicts that concern the
beginning of the range they want to copy).

Patch 2 is clean-up, patch 3 fixes another bug I found while trying to
come up with a working test case.

Patch 4 is that test case (I hope it works on your end, too), and patch
5 is a test case for the fix in patch 3.


Hanna Reitz (5):
  block/mirror: Do not wait for active writes
  block/mirror: Drop mirror_wait_for_any_operation()
  block/mirror: Fix NULL s->job in active writes
  iotests/151: Test that active mirror progresses
  iotests/151: Test active requests on mirror start

 block/mirror.c |  78 -
 tests/qemu-iotests/151 | 227 -
 tests/qemu-iotests/151.out |   4 +-
 3 files changed, 278 insertions(+), 31 deletions(-)

-- 
2.36.1




[PATCH v2 3/3] block: Start/end drain on correct AioContext

2022-11-07 Thread Hanna Reitz
bdrv_parent_drained_{begin,end}_single() are supposed to operate on the
parent, not on the child, so they should not attempt to get the context
to poll from the child but the parent instead.  BDRV_POLL_WHILE(c->bs)
does get the context from the child, so we should replace it with
AIO_WAIT_WHILE() on the parent's context instead.

This problem becomes apparent when bdrv_replace_child_noperm() invokes
bdrv_parent_drained_end_single() after removing a child from a subgraph
that is in an I/O thread.  By the time bdrv_parent_drained_end_single()
is called, child->bs is NULL, and so BDRV_POLL_WHILE(c->bs, ...) will
poll the main loop instead of the I/O thread; but anything that
bdrv_parent_drained_end_single_no_poll() may have scheduled is going to
want to run in the I/O thread, but because we poll the main loop, the
I/O thread is never unpaused, and nothing is run, resulting in a
deadlock.
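A loose analogy in plain Python (queues standing in for AioContexts; none of these names are QEMU’s): work scheduled on one context is never observed by polling a different one:

```python
# Sketch: a BH-like callback is scheduled on the I/O thread's context.
# Polling the main context never runs it (the "deadlock"); polling the
# parent's (I/O thread's) context does.
import queue

main_ctx, iothread_ctx = queue.Queue(), queue.Queue()
done = []

# completion callback scheduled in the I/O thread's context
iothread_ctx.put(lambda: done.append('drained_end'))

def poll(ctx):
    # run all callbacks pending on one context
    while not ctx.empty():
        ctx.get()()

poll(main_ctx)      # wrong context: nothing runs
print(done)         # -> []
poll(iothread_ctx)  # the parent's context: the callback runs
print(done)         # -> ['drained_end']
```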

Closes: https://gitlab.com/qemu-project/qemu/-/issues/1215
Reviewed-by: Kevin Wolf 
Signed-off-by: Hanna Reitz 
---
 block/io.c | 6 --
 1 file changed, 4 insertions(+), 2 deletions(-)

diff --git a/block/io.c b/block/io.c
index 34b30e304e..b9424024f9 100644
--- a/block/io.c
+++ b/block/io.c
@@ -71,9 +71,10 @@ static void bdrv_parent_drained_end_single_no_poll(BdrvChild *c,
 void bdrv_parent_drained_end_single(BdrvChild *c)
 {
 int drained_end_counter = 0;
+AioContext *ctx = bdrv_child_get_parent_aio_context(c);
 IO_OR_GS_CODE();
 bdrv_parent_drained_end_single_no_poll(c, &drained_end_counter);
-BDRV_POLL_WHILE(c->bs, qatomic_read(&drained_end_counter) > 0);
+AIO_WAIT_WHILE(ctx, qatomic_read(&drained_end_counter) > 0);
 }
 
 static void bdrv_parent_drained_end(BlockDriverState *bs, BdrvChild *ignore,
@@ -116,13 +117,14 @@ static bool bdrv_parent_drained_poll(BlockDriverState *bs, BdrvChild *ignore,
 
 void bdrv_parent_drained_begin_single(BdrvChild *c, bool poll)
 {
+AioContext *ctx = bdrv_child_get_parent_aio_context(c);
 IO_OR_GS_CODE();
 c->parent_quiesce_counter++;
 if (c->klass->drained_begin) {
 c->klass->drained_begin(c);
 }
 if (poll) {
-BDRV_POLL_WHILE(c->bs, bdrv_parent_drained_poll_single(c));
+AIO_WAIT_WHILE(ctx, bdrv_parent_drained_poll_single(c));
 }
 }
 
-- 
2.36.1




[PATCH v2 0/3] block: Start/end drain on correct AioContext

2022-11-07 Thread Hanna Reitz
Hi,

v1 cover letter:
https://lists.nongnu.org/archive/html/qemu-block/2022-09/msg00389.html

bdrv_replace_child_noperm() drains the child via
bdrv_parent_drained_{begin,end}_single().  When it removes a child, the
bdrv_parent_drained_end_single() at its end will be called on an empty
child, making the BDRV_POLL_WHILE() in it poll the main AioContext
(because c->bs is NULL).

That’s wrong, though, because it’s supposed to operate on the parent.
bdrv_parent_drained_end_single_no_poll() will have scheduled any BHs in
the parents’ AioContext, which may be anything, not necessarily the main
context.  Therefore, we must poll the parent’s context.

Patch 3 does this for both bdrv_parent_drained_{begin,end}_single().
Patch 1 ensures that we can legally call
bdrv_child_get_parent_aio_context() from those I/O context functions,
and patch 2 fixes blk_do_set_aio_context() to not cause an assertion
failure when beginning a drain ends up in blk_get_aio_context()
before blk->ctx has been updated.


v2:
- Patch 1:
  - Move bdrv_child_get_parent_aio_context() from block-global-state.h
to block-io.h
  - Move BdrvChildClass.get_parent_aio_context into the I/O code section
  - Mark blk_root_get_parent_aio_context() as I/O code
  - Mark child_job_get_parent_aio_context(), and lock the job mutex,
which we now need to do (as of 3ed4f708fe1)
- Patch 2:
  - Added comment similar to Kevin’s suggestion
(Still decided to take Kevin’s R-b, even though I didn’t use the
literal suggestion, but made it a bit more verbose)


Hanna Reitz (3):
  block: Make bdrv_child_get_parent_aio_context I/O
  block-backend: Update ctx immediately after root
  block: Start/end drain on correct AioContext

 include/block/block-global-state.h | 1 -
 include/block/block-io.h   | 2 ++
 include/block/block_int-common.h   | 4 ++--
 block.c| 2 +-
 block/block-backend.c  | 9 -
 block/io.c | 6 --
 blockjob.c | 3 ++-
 7 files changed, 19 insertions(+), 8 deletions(-)

-- 
2.36.1




[PATCH v2 2/3] block-backend: Update ctx immediately after root

2022-11-07 Thread Hanna Reitz
blk_get_aio_context() asserts that blk->ctx is always equal to the root
BDS's context (if there is a root BDS).  Therefore,
blk_do_set_aio_context() must update blk->ctx immediately after the root
BDS's context has changed.

Without this patch, the next patch would break iotest 238, because
bdrv_drained_begin() (called by blk_do_set_aio_context()) may then
invoke bdrv_child_get_parent_aio_context() on the root child, i.e.
blk_get_aio_context().  However, by this point, blk->ctx would not have
been updated and thus differ from the root node's context.  This patch
fixes that.

Reviewed-by: Kevin Wolf 
Signed-off-by: Hanna Reitz 
---
 block/block-backend.c | 8 +++-
 1 file changed, 7 insertions(+), 1 deletion(-)

diff --git a/block/block-backend.c b/block/block-backend.c
index ed2f4b67a2..b48c91f4e1 100644
--- a/block/block-backend.c
+++ b/block/block-backend.c
@@ -2158,6 +2158,11 @@ static int blk_do_set_aio_context(BlockBackend *blk, AioContext *new_context,
 return ret;
 }
 }
+/*
+ * Make blk->ctx consistent with the root node before we invoke any
+ * other operations like drain that might inquire blk->ctx
+ */
+blk->ctx = new_context;
 if (tgm->throttle_state) {
 bdrv_drained_begin(bs);
 throttle_group_detach_aio_context(tgm);
@@ -2166,9 +2171,10 @@ static int blk_do_set_aio_context(BlockBackend *blk, AioContext *new_context,
 }
 
 bdrv_unref(bs);
+} else {
+blk->ctx = new_context;
 }
 
-blk->ctx = new_context;
 return 0;
 }
 
-- 
2.36.1




[PATCH v2 1/3] block: Make bdrv_child_get_parent_aio_context I/O

2022-11-07 Thread Hanna Reitz
We want to use bdrv_child_get_parent_aio_context() from
bdrv_parent_drained_{begin,end}_single(), both of which are "I/O or GS"
functions.

Prior to 3ed4f708fe1, all the implementations were I/O code anyway.
3ed4f708fe1 has put block jobs' AioContext field under the job mutex, so
to make child_job_get_parent_aio_context() work in an I/O context, we
need to take that lock there.

Furthermore, blk_root_get_parent_aio_context() is not marked as
anything, but is safe to run in an I/O context, so mark it that way now.
(blk_get_aio_context() is an I/O code function.)

With that done, all implementations explicitly are I/O code, so we can
mark bdrv_child_get_parent_aio_context() as I/O code, too, so callers
know it is safe to run from both GS and I/O contexts.

Signed-off-by: Hanna Reitz 
---
 include/block/block-global-state.h | 1 -
 include/block/block-io.h   | 2 ++
 include/block/block_int-common.h   | 4 ++--
 block.c| 2 +-
 block/block-backend.c  | 1 +
 blockjob.c | 3 ++-
 6 files changed, 8 insertions(+), 5 deletions(-)

diff --git a/include/block/block-global-state.h b/include/block/block-global-state.h
index bb42ed9559..c7bd4a2088 100644
--- a/include/block/block-global-state.h
+++ b/include/block/block-global-state.h
@@ -220,7 +220,6 @@ void coroutine_fn bdrv_co_lock(BlockDriverState *bs);
  */
 void coroutine_fn bdrv_co_unlock(BlockDriverState *bs);
 
-AioContext *bdrv_child_get_parent_aio_context(BdrvChild *c);
 bool bdrv_child_change_aio_context(BdrvChild *c, AioContext *ctx,
GHashTable *visited, Transaction *tran,
Error **errp);
diff --git a/include/block/block-io.h b/include/block/block-io.h
index 770ddeb7c8..b099d7db45 100644
--- a/include/block/block-io.h
+++ b/include/block/block-io.h
@@ -171,6 +171,8 @@ void bdrv_debug_event(BlockDriverState *bs, BlkdebugEvent event);
  */
 AioContext *bdrv_get_aio_context(BlockDriverState *bs);
 
+AioContext *bdrv_child_get_parent_aio_context(BdrvChild *c);
+
 /**
  * Move the current coroutine to the AioContext of @bs and return the old
  * AioContext of the coroutine. Increase bs->in_flight so that draining @bs
diff --git a/include/block/block_int-common.h b/include/block/block_int-common.h
index 5a2cc077a0..31ae91e56e 100644
--- a/include/block/block_int-common.h
+++ b/include/block/block_int-common.h
@@ -911,8 +911,6 @@ struct BdrvChildClass {
GHashTable *visited, Transaction *tran,
Error **errp);
 
-AioContext *(*get_parent_aio_context)(BdrvChild *child);
-
 /*
  * I/O API functions. These functions are thread-safe.
  *
@@ -929,6 +927,8 @@ struct BdrvChildClass {
  */
 const char *(*get_name)(BdrvChild *child);
 
+AioContext *(*get_parent_aio_context)(BdrvChild *child);
+
 /*
  * If this pair of functions is implemented, the parent doesn't issue new
  * requests after returning from .drained_begin() until .drained_end() is
diff --git a/block.c b/block.c
index 3bd594eb2a..2d6128d80a 100644
--- a/block.c
+++ b/block.c
@@ -1533,7 +1533,7 @@ const BdrvChildClass child_of_bds = {
 
 AioContext *bdrv_child_get_parent_aio_context(BdrvChild *c)
 {
-GLOBAL_STATE_CODE();
+IO_CODE();
 return c->klass->get_parent_aio_context(c);
 }
 
diff --git a/block/block-backend.c b/block/block-backend.c
index c0c7d56c8d..ed2f4b67a2 100644
--- a/block/block-backend.c
+++ b/block/block-backend.c
@@ -311,6 +311,7 @@ static void blk_root_detach(BdrvChild *child)
 static AioContext *blk_root_get_parent_aio_context(BdrvChild *c)
 {
 BlockBackend *blk = c->opaque;
+IO_CODE();
 
 return blk_get_aio_context(blk);
 }
diff --git a/blockjob.c b/blockjob.c
index 2d86014fa5..f51d4e18f3 100644
--- a/blockjob.c
+++ b/blockjob.c
@@ -173,7 +173,8 @@ static bool child_job_change_aio_ctx(BdrvChild *c, AioContext *ctx,
 static AioContext *child_job_get_parent_aio_context(BdrvChild *c)
 {
 BlockJob *job = c->opaque;
-GLOBAL_STATE_CODE();
+IO_CODE();
+JOB_LOCK_GUARD();
 
 return job->job.aio_context;
 }
-- 
2.36.1




Re: [PATCH v2] vl: defuse PID file path resolve error

2022-11-07 Thread Hanna Reitz

On 31.10.22 10:47, Fiona Ebner wrote:

Commit 85c4bf8aa6 ("vl: Unlink absolute PID file path") introduced a
critical error when the PID file path cannot be resolved. Before this
commit, it was possible to invoke QEMU when the PID file was a file
created with mkstemp that was already unlinked at the time of the
invocation. There might be other similar scenarios.

It should not be a critical error when the PID file unlink notifier
can't be registered, because the path can't be resolved. If the file
is already gone from QEMU's perspective, silently ignore the error.
Otherwise, only print a warning.

Fixes: 85c4bf8aa6 ("vl: Unlink absolute PID file path")
Reported-by: Dominik Csapak 
Suggested-by: Thomas Lamprecht 
Signed-off-by: Fiona Ebner 
---

v1 -> v2:
 * Ignore error if errno == ENOENT.

  softmmu/vl.c | 9 +
  1 file changed, 5 insertions(+), 4 deletions(-)


Reviewed-by: Hanna Reitz 




Re: [PATCH 2/3] iotests: Test qemu-img checksum

2022-11-07 Thread Hanna Reitz

On 30.10.22 18:38, Nir Soffer wrote:

On Wed, Oct 26, 2022 at 4:31 PM Hanna Reitz  wrote:

On 01.09.22 16:32, Nir Soffer wrote:
> Add simple tests creating an image with all kinds of extents,
different
> formats, different backing chain, different protocol, and different
> image options. Since all images have the same guest visible
content they
> must have the same checksum.
>
> To help debugging in case of failures, the output includes a
json map of
> every test image.
>
> Signed-off-by: Nir Soffer 
> ---
>   tests/qemu-iotests/tests/qemu-img-checksum    | 149
++
>   .../qemu-iotests/tests/qemu-img-checksum.out  |  74 +
>   2 files changed, 223 insertions(+)
>   create mode 100755 tests/qemu-iotests/tests/qemu-img-checksum
>   create mode 100644 tests/qemu-iotests/tests/qemu-img-checksum.out
>
> diff --git a/tests/qemu-iotests/tests/qemu-img-checksum
b/tests/qemu-iotests/tests/qemu-img-checksum
> new file mode 100755
> index 00..3a85ba33f2
> --- /dev/null
> +++ b/tests/qemu-iotests/tests/qemu-img-checksum
> @@ -0,0 +1,149 @@
> +#!/usr/bin/env python3
> +# group: rw auto quick
> +#
> +# Test cases for qemu-img checksum.
> +#
> +# Copyright (C) 2022 Red Hat, Inc.
> +#
> +# This program is free software; you can redistribute it and/or
modify
> +# it under the terms of the GNU General Public License as
published by
> +# the Free Software Foundation; either version 2 of the License, or
> +# (at your option) any later version.
> +#
> +# This program is distributed in the hope that it will be useful,
> +# but WITHOUT ANY WARRANTY; without even the implied warranty of
> +# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
> +# GNU General Public License for more details.
> +#
> +# You should have received a copy of the GNU General Public License
> +# along with this program.  If not, see
<http://www.gnu.org/licenses/>.
> +
> +import re
> +
> +import iotests
> +
> +from iotests import (
> +    filter_testfiles,
> +    qemu_img,
> +    qemu_img_log,
> +    qemu_io,
> +    qemu_nbd_popen,
> +)
> +
> +
> +def checksum_available():
> +    out = qemu_img("--help").stdout
> +    return re.search(r"\bchecksum .+ filename\b", out) is not None
> +
> +
> +if not checksum_available():
> +    iotests.notrun("checksum command not available")
> +
> +iotests.script_initialize(
> +    supported_fmts=["raw", "qcow2"],
> +    supported_cache_modes=["none", "writeback"],

It doesn’t work with writeback, though, because it uses -T none below.


Good point


Which by the way is a heavy cost, because I usually run tests in
tmpfs,
where this won’t work.  Is there any way of not doing the -T none
below?


Testing using tmpfs is problematic since you cannot test -T none. In
oVirt we always use /var/tmp, which usually uses something that
supports direct I/O.


Do we have a way to specify cache mode in the tests, so we can use -T none
only when the option is set?


`./check` has a `-c` option (e.g. `./check -c none`), which lands in 
`iotests.cachemode`.  That isn’t automatically passed to qemu-img calls, 
but you can do it manually (i.e. `qemu_img_log("checksum", "-T", 
iotests.cachemode, disk_top)` instead of `"-T", "none"`).




> +    supported_protocols=["file", "nbd"],
> +    required_fmts=["raw", "qcow2"],
> +)
> +
> +print()
> +print("=== Test images ===")
> +print()
> +
> +disk_raw = iotests.file_path('raw')
> +qemu_img("create", "-f", "raw", disk_raw, "10m")
> +qemu_io("-f", "raw",
> +        "-c", "write -P 0x1 0 2m",      # data
> +        "-c", "write -P 0x0 2m 2m",     # data with zeroes
> +        "-c", "write -z 4m 2m",         # zero allocated
> +        "-c", "write -z -u 6m 2m",      # zero hole
> +                                        # unallocated
> +        disk_raw)
> +print(filter_testfiles(disk_raw))
> +qemu_img_log("map", "--output", "json", disk_raw)
> +
> +disk_qcow2 = iotests.file_path('qcow2')
> +qemu_img("create", "-f", "qcow2", disk_qcow2, &

Re: [PATCH 3/3] qemu-img: Speed up checksum

2022-11-07 Thread Hanna Reitz

On 30.10.22 18:38, Nir Soffer wrote:

On Wed, Oct 26, 2022 at 4:54 PM Hanna Reitz  wrote:

On 01.09.22 16:32, Nir Soffer wrote:
> Add coroutine based loop inspired by `qemu-img convert` design.
>
> Changes compared to `qemu-img convert`:
>
> - State for the entire image is kept in ImgChecksumState
>
> - State for single worker coroutine is kept in ImgChecksumworker.
>
> - "Writes" are always in-order, ensured using a queue.
>
> - Calling block status once per image extent, when the current
extent is
>    consumed by the workers.
>
> - Using 1m buffer size - testings shows that this gives best read
>    performance both with buffered and direct I/O.

Why does patch 1 then choose to use 2 MB?


The first patch uses sync I/O, and in this case 2 MB is a little faster.


Interesting.


> - Number of coroutines is not configurable. Testing does not show
>    improvement when using more than 8 coroutines.
>
> - Progress include entire image, not only the allocated state.
>
> Comparing to the simple read loop shows that this version is up
to 4.67
> times faster when computing a checksum for an image full of
zeroes. For
> real images it is 1.59 times faster with direct I/O, and with
buffered
> I/O there is no difference.
>
> Test results on Dell PowerEdge R640 in a CentOS Stream 9 container:
>
> | image    | size | i/o       | before         | after         |
change |
>
|--|--|---||||
> | zero [1] |   6g | buffered  | 1.600s ±0.014s | 0.342s ±0.016s
|  x4.67 |
> | zero     |   6g | direct    | 4.684s ±0.093s | 2.211s ±0.009s
|  x2.12 |
> | real [2] |   6g | buffered  | 1.841s ±0.075s | 1.806s ±0.036s
|  x1.02 |
> | real     |   6g | direct    | 3.094s ±0.079s | 1.947s ±0.017s
|  x1.59 |
> | nbd  [3] |   6g | buffered  | 2.455s ±0.183s | 1.808s ±0.016s
|  x1.36 |
> | nbd      |   6g | direct    | 3.540s ±0.020s | 1.749s ±0.018s
|  x2.02 |
>
> [1] raw image full of zeroes
> [2] raw fedora 35 image with additional random data, 50% full
> [3] image [2] exported by qemu-nbd via unix socket
>
> Signed-off-by: Nir Soffer 
> ---
>   qemu-img.c | 343
+
>   1 file changed, 270 insertions(+), 73 deletions(-)

Looks good!

Just a couple of style comments below.

> diff --git a/qemu-img.c b/qemu-img.c
> index 7edcfe4bc8..bfa8e2862f 100644
> --- a/qemu-img.c
> +++ b/qemu-img.c
> @@ -1613,48 +1613,288 @@ out:
>       qemu_vfree(buf2);
>       blk_unref(blk2);
>   out2:
>       blk_unref(blk1);
>   out3:
>       qemu_progress_end();
>       return ret;
>   }
>
>   #ifdef CONFIG_BLKHASH
> +
> +#define CHECKSUM_COROUTINES 8
> +#define CHECKSUM_BUF_SIZE (1 * MiB)
> +#define CHECKSUM_ZERO_SIZE MIN(16 * GiB, SIZE_MAX)
> +
> +typedef struct ImgChecksumState ImgChecksumState;
> +
> +typedef struct ImgChecksumWorker {
> +    QTAILQ_ENTRY(ImgChecksumWorker) entry;
> +    ImgChecksumState *state;
> +    Coroutine *co;
> +    uint8_t *buf;
> +
> +    /* The current chunk. */
> +    int64_t offset;
> +    int64_t length;
> +    bool zero;
> +
> +    /* Always true for zero extent, false for data extent. Set
to true
> +     * when reading the chunk completes. */

Qemu codestyle requires /* and */ to be on separate lines for
multi-line
comments (see checkpatch.pl <http://checkpatch.pl>).


I'll change that. Do we have a good way to run checkpatch.pl 
<http://checkpatch.pl/> when using

git-publish?

Maybe a way to run checkpatch.pl <http://checkpatch.pl> on all patches 
generated by git publish

automatically?


I don’t know myself, I don’t use git-publish.  Stefan should know. ;)



> +    bool ready;
> +} ImgChecksumWorker;
> +
> +struct ImgChecksumState {
> +    const char *filename;
> +    BlockBackend *blk;
> +    BlockDriverState *bs;
> +    int64_t total_size;
> +
> +    /* Current extent, modified in checksum_co_next. */
> +    int64_t offset;
> +    int64_t length;
> +    bool zero;
> +
> +    int running_coroutines;
> +    CoMutex lock;
> +    ImgChecksumWorker workers[CHECKSUM_COROUTINES];
> +
> +    /* Ensure in-order updates. Update are scheduled at the
tail of the
> +     * queue and processed from the head of

Re: [PATCH 1/3] qemu-img: Add checksum command

2022-11-07 Thread Hanna Reitz

On 30.10.22 18:37, Nir Soffer wrote:

On Wed, Oct 26, 2022 at 4:00 PM Hanna Reitz  wrote:

On 01.09.22 16:32, Nir Soffer wrote:
> The checksum command compute a checksum for disk image content
using the
> blkhash library[1]. The blkhash library is not packaged yet, but
it is
> available via copr[2].
>
> Example run:
>
>      $ ./qemu-img checksum -p fedora-35.qcow2
> 6e5c00c995056319d52395f8d91c7f84725ae3da69ffcba4de4c7d22cff713a5
fedora-35.qcow2
>
> The block checksum is constructed by splitting the image to
fixed sized
> blocks and computing a digest of every block. The image checksum
is the
> digest of the all block digests.
>
> The checksum uses internally the "sha256" algorithm but it cannot be
> compared with checksums created by other tools such as `sha256sum`.
>
> The blkhash library supports sparse images, zero detection, and
> optimizes zero block hashing (they are practically free). The
library
> uses multiple threads to speed up the computation.
>
> Comparing to `sha256sum`, `qemu-img checksum` is 3.5-4800[3] times
> faster, depending on the amount of data in the image:
>
>      $ ./qemu-img info /scratch/50p.raw
>      file format: raw
>      virtual size: 6 GiB (6442450944 bytes)
>      disk size: 2.91 GiB
>
>      $ hyperfine -w2 -r5 -p "sleep 1" "./qemu-img checksum
/scratch/50p.raw" \
>                                       "sha256sum /scratch/50p.raw"
>      Benchmark 1: ./qemu-img checksum /scratch/50p.raw
>        Time (mean ± σ):      1.849 s ±  0.037 s [User: 7.764 s,
System: 0.962 s]
>        Range (min … max):    1.813 s …  1.908 s    5 runs
>
>      Benchmark 2: sha256sum /scratch/50p.raw
>        Time (mean ± σ):     14.585 s ±  0.072 s [User: 13.537 s,
System: 1.003 s]
>        Range (min … max):   14.501 s … 14.697 s    5 runs
>
>      Summary
>        './qemu-img checksum /scratch/50p.raw' ran
>          7.89 ± 0.16 times faster than 'sha256sum /scratch/50p.raw'
>
> The new command is available only when `blkhash` is available during
> build. To test the new command please install the `blkhash-devel`
> package:
>
>      $ dnf copr enable nsoffer/blkhash
>      $ sudo dnf install blkhash-devel
>
> [1] https://gitlab.com/nirs/blkhash
> [2] https://copr.fedorainfracloud.org/coprs/nsoffer/blkhash/
> [3] Computing checksum for 8T empty image: qemu-img checksum: 3.7s,
>      sha256sum (estimate): 17,749s
>
> Signed-off-by: Nir Soffer 
> ---
>   docs/tools/qemu-img.rst |  22 +
>   meson.build             |  10 ++-
>   meson_options.txt       |   2 +
>   qemu-img-cmds.hx        |   8 ++
>   qemu-img.c              | 191

>   5 files changed, 232 insertions(+), 1 deletion(-)
>
> diff --git a/docs/tools/qemu-img.rst b/docs/tools/qemu-img.rst
> index 85a6e05b35..8be9c45cbf 100644
> --- a/docs/tools/qemu-img.rst
> +++ b/docs/tools/qemu-img.rst
> @@ -347,20 +347,42 @@ Command description:
>       Check completed, image is corrupted
>     3
>       Check completed, image has leaked clusters, but is not
corrupted
>     63
>       Checks are not supported by the image format
>
>     If ``-r`` is specified, exit codes representing the image
state refer to the
>     state after (the attempt at) repairing it. That is, a
successful ``-r all``
>     will yield the exit code 0, independently of the image state
before.
>
> +.. option:: checksum [--object OBJECTDEF] [--image-opts] [-f
FMT] [-T SRC_CACHE] [-p] FILENAME
> +
> +  Print a checksum for image *FILENAME* guest visible content.

Why not say which kind of checksum it is?


Do you mean the algorithm used? This may be confusing, for example we 
write


   Print a sha256 checksum ...

User will expect to get the same result from "sha256sum disk.img". How 
about


   Print a blkhash checksum ...

And add a link to the blkhash project?


I did mean sha256, but if it isn’t pure sha256, then a link to any 
description how it is computed would be good, I think.




>           Images with
> +  different format or settings wil have the same checksum.

s/wil/will/


Fixing


> +
> +  The format is probed unless you specify it by ``-f``.
> +
> +  The checksum is computed for guest visible content. Allocated
areas full of
> +  zeroes, zero clusters, and unallocated areas are 

Re: [PATCH] vl: change PID file path resolve error to warning

2022-10-27 Thread Hanna Reitz

On 27.10.22 12:14, Fiona Ebner wrote:

Commit 85c4bf8aa6 ("vl: Unlink absolute PID file path") made it a
critical error when the PID file path cannot be resolved. Before this
commit, it was possible to invoke QEMU when the PID file was a file
created with mkstemp that was already unlinked at the time of the
invocation. There might be other similar scenarios.

It should not be a critical error when the PID file unlink notifier
can't be registered, because the path can't be resolved. Turn it into
a warning instead.

Fixes: 85c4bf8aa6 ("vl: Unlink absolute PID file path")
Reported-by: Dominik Csapak 
Suggested-by: Thomas Lamprecht 
Signed-off-by: Fiona Ebner 
---


Reviewed-by: Hanna Reitz 




Re: [PATCH] block/block-backend: blk_set_enable_write_cache is IO_CODE

2022-10-27 Thread Hanna Reitz

On 27.10.22 09:27, Emanuele Giuseppe Esposito wrote:

blk_set_enable_write_cache() is defined as GLOBAL_STATE_CODE
but can be invoked from iothreads when handling scsi requests.
This triggers an assertion failure:

  0x7fd6c3515ce1 in raise () from /lib/x86_64-linux-gnu/libc.so.6
  0x7fd6c34ff537 in abort () from /lib/x86_64-linux-gnu/libc.so.6
  0x7fd6c34ff40f in ?? () from /lib/x86_64-linux-gnu/libc.so.6
  0x7fd6c350e662 in __assert_fail () from /lib/x86_64-linux-gnu/libc.so.6
  0x56149e2cea03 in blk_set_enable_write_cache (wce=true, 
blk=0x5614a01c27f0)
at ../src/block/block-backend.c:1949
  0x56149e2d0a67 in blk_set_enable_write_cache (blk=0x5614a01c27f0,
wce=) at ../src/block/block-backend.c:1951
  0x56149dfe9c59 in scsi_disk_apply_mode_select (p=0x7fd6b400c00e "\004",
page=, s=) at ../src/hw/scsi/scsi-disk.c:1520
  mode_select_pages (change=true, len=18, p=0x7fd6b400c00e "\004", 
r=0x7fd6b4001ff0)
at ../src/hw/scsi/scsi-disk.c:1570
  scsi_disk_emulate_mode_select (inbuf=, r=0x7fd6b4001ff0) at
../src/hw/scsi/scsi-disk.c:1640
  scsi_disk_emulate_write_data (req=0x7fd6b4001ff0) at 
../src/hw/scsi/scsi-disk.c:1934
  0x56149e18ff16 in virtio_scsi_handle_cmd_req_submit (req=,
req=, s=0x5614a12f16b0) at ../src/hw/scsi/virtio-scsi.c:719
  virtio_scsi_handle_cmd_vq (vq=0x7fd6bab92140, s=0x5614a12f16b0) at
../src/hw/scsi/virtio-scsi.c:761
  virtio_scsi_handle_cmd (vq=, vdev=) at
../src/hw/scsi/virtio-scsi.c:775
  virtio_scsi_handle_cmd (vdev=0x5614a12f16b0, vq=0x7fd6bab92140) at
../src/hw/scsi/virtio-scsi.c:765
  0x56149e1a8aa6 in virtio_queue_notify_vq (vq=0x7fd6bab92140) at
../src/hw/virtio/virtio.c:2365
  0x56149e3ccea5 in aio_dispatch_handler (ctx=ctx@entry=0x5614a01babe0,
node=) at ../src/util/aio-posix.c:369
  0x56149e3cd868 in aio_dispatch_ready_handlers (ready_list=0x7fd6c09b2680,
ctx=0x5614a01babe0) at ../src/util/aio-posix.c:399
  aio_poll (ctx=0x5614a01babe0, blocking=blocking@entry=true) at
../src/util/aio-posix.c:713
  0x56149e2a7796 in iothread_run (opaque=opaque@entry=0x56149ffde500) at
../src/iothread.c:67
  0x56149e3d0859 in qemu_thread_start (args=0x7fd6c09b26f0) at
../src/util/qemu-thread-posix.c:504
  0x7fd6c36b9ea7 in start_thread () from 
/lib/x86_64-linux-gnu/libpthread.so.0
  0x7fd6c35d9aef in clone () from /lib/x86_64-linux-gnu/libc.so.6

Changing GLOBAL_STATE_CODE in IO_CODE is allowed, since GSC callers are
allowed to call IO_CODE.

Resolves: #1272

Signed-off-by: Emanuele Giuseppe Esposito 
---
  block/block-backend.c | 2 +-
  1 file changed, 1 insertion(+), 1 deletion(-)


Reviewed-by: Hanna Reitz 




Re: [PATCH 3/3] qemu-img: Speed up checksum

2022-10-26 Thread Hanna Reitz

On 01.09.22 16:32, Nir Soffer wrote:

Add coroutine based loop inspired by `qemu-img convert` design.

Changes compared to `qemu-img convert`:

- State for the entire image is kept in ImgChecksumState

- State for single worker coroutine is kept in ImgChecksumworker.

- "Writes" are always in-order, ensured using a queue.

- Calling block status once per image extent, when the current extent is
   consumed by the workers.

- Using 1m buffer size - testings shows that this gives best read
   performance both with buffered and direct I/O.


Why does patch 1 then choose to use 2 MB?


- Number of coroutines is not configurable. Testing does not show
   improvement when using more than 8 coroutines.

- Progress include entire image, not only the allocated state.
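
[Editorial sketch: the worker design listed above — concurrent reads, with
"writes" (hash updates) applied strictly in order — can be illustrated with
Python threads standing in for QEMU coroutines. Constants and helper names
here are illustrative and not taken from qemu-img.]

```python
import hashlib
import os
from concurrent.futures import ThreadPoolExecutor

CHUNK_SIZE = 1024 * 1024  # 1m buffer, as in the description above
NUM_WORKERS = 8

def checksum(path: str) -> str:
    size = os.path.getsize(path)
    h = hashlib.sha256()

    def read_chunk(offset: int) -> bytes:
        with open(path, "rb") as f:
            f.seek(offset)
            return f.read(CHUNK_SIZE)

    with ThreadPoolExecutor(max_workers=NUM_WORKERS) as ex:
        # Executor.map() yields results in submission order, playing the
        # role of qemu-img's update queue: reads may complete out of
        # order, but the hash only ever sees chunks in order.
        for chunk in ex.map(read_chunk, range(0, size, CHUNK_SIZE)):
            h.update(chunk)
    return h.hexdigest()
```

This sketch hashes the raw stream with plain sha256 rather than blkhash, so
it only demonstrates the ordering scheme, not the block-digest construction.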

Comparing to the simple read loop shows that this version is up to 4.67
times faster when computing a checksum for an image full of zeroes. For
real images it is 1.59 times faster with direct I/O, and with buffered
I/O there is no difference.

Test results on Dell PowerEdge R640 in a CentOS Stream 9 container:

| image| size | i/o   | before | after  | change |
|--|--|---||||
| zero [1] |   6g | buffered  | 1.600s ±0.014s | 0.342s ±0.016s |  x4.67 |
| zero |   6g | direct| 4.684s ±0.093s | 2.211s ±0.009s |  x2.12 |
| real [2] |   6g | buffered  | 1.841s ±0.075s | 1.806s ±0.036s |  x1.02 |
| real |   6g | direct| 3.094s ±0.079s | 1.947s ±0.017s |  x1.59 |
| nbd  [3] |   6g | buffered  | 2.455s ±0.183s | 1.808s ±0.016s |  x1.36 |
| nbd  |   6g | direct| 3.540s ±0.020s | 1.749s ±0.018s |  x2.02 |

[1] raw image full of zeroes
[2] raw fedora 35 image with additional random data, 50% full
[3] image [2] exported by qemu-nbd via unix socket

Signed-off-by: Nir Soffer 
---
  qemu-img.c | 343 +
  1 file changed, 270 insertions(+), 73 deletions(-)


Looks good!

Just a couple of style comments below.


diff --git a/qemu-img.c b/qemu-img.c
index 7edcfe4bc8..bfa8e2862f 100644
--- a/qemu-img.c
+++ b/qemu-img.c
@@ -1613,48 +1613,288 @@ out:
  qemu_vfree(buf2);
  blk_unref(blk2);
  out2:
  blk_unref(blk1);
  out3:
  qemu_progress_end();
  return ret;
  }
  
  #ifdef CONFIG_BLKHASH

+
+#define CHECKSUM_COROUTINES 8
+#define CHECKSUM_BUF_SIZE (1 * MiB)
+#define CHECKSUM_ZERO_SIZE MIN(16 * GiB, SIZE_MAX)
+
+typedef struct ImgChecksumState ImgChecksumState;
+
+typedef struct ImgChecksumWorker {
+QTAILQ_ENTRY(ImgChecksumWorker) entry;
+ImgChecksumState *state;
+Coroutine *co;
+uint8_t *buf;
+
+/* The current chunk. */
+int64_t offset;
+int64_t length;
+bool zero;
+
+/* Always true for zero extent, false for data extent. Set to true
+ * when reading the chunk completes. */


Qemu codestyle requires /* and */ to be on separate lines for multi-line 
comments (see checkpatch.pl).



+bool ready;
+} ImgChecksumWorker;
+
+struct ImgChecksumState {
+const char *filename;
+BlockBackend *blk;
+BlockDriverState *bs;
+int64_t total_size;
+
+/* Current extent, modified in checksum_co_next. */
+int64_t offset;
+int64_t length;
+bool zero;
+
+int running_coroutines;
+CoMutex lock;
+ImgChecksumWorker workers[CHECKSUM_COROUTINES];
+
+/* Ensure in-order updates. Update are scheduled at the tail of the
+ * queue and processed from the head of the queue when a worker is
+ * ready. */


Qemu codestyle requires /* and */ to be on separate lines for multi-line 
comments.



+QTAILQ_HEAD(, ImgChecksumWorker) update_queue;
+
+struct blkhash *hash;
+int ret;
+};


[...]


+static coroutine_fn bool checksum_co_next(ImgChecksumWorker *w)
+{
+ImgChecksumState *s = w->state;
+
+qemu_co_mutex_lock(>lock);
+
+if (s->offset == s->total_size || s->ret != -EINPROGRESS) {
+qemu_co_mutex_unlock(>lock);
+return false;
+}
+
+if (s->length == 0 && checksum_block_status(s)) {


I’d prefer `checksum_block_status(s) < 0` so that it is clear that 
negative values indicate errors.  Otherwise, based on this code alone, 
it looks to me more like `checksum_block_status()` returned a boolean, 
where `false` would generally indicate an error, which is confusing.


Same in other places below.


+qemu_co_mutex_unlock(>lock);
+return false;
+}


[...]


+/* Enter the next worker coroutine if the worker is ready. */
+static void coroutine_fn checksum_co_enter_next(ImgChecksumWorker *w)
+{
+ImgChecksumState *s = w->state;
+ImgChecksumWorker *next;
+
+if (!QTAILQ_EMPTY(>update_queue)) {
+next = QTAILQ_FIRST(>update_queue);
+if (next->ready)
+qemu_coroutine_enter(next->co);


Qemu codestyle requires braces here.

Hanna




Re: [PATCH 2/3] iotests: Test qemu-img checksum

2022-10-26 Thread Hanna Reitz

On 01.09.22 16:32, Nir Soffer wrote:

Add simple tests creating an image with all kinds of extents, different
formats, different backing chain, different protocol, and different
image options. Since all images have the same guest visible content they
must have the same checksum.

To help debugging in case of failures, the output includes a json map of
every test image.

Signed-off-by: Nir Soffer 
---
  tests/qemu-iotests/tests/qemu-img-checksum| 149 ++
  .../qemu-iotests/tests/qemu-img-checksum.out  |  74 +
  2 files changed, 223 insertions(+)
  create mode 100755 tests/qemu-iotests/tests/qemu-img-checksum
  create mode 100644 tests/qemu-iotests/tests/qemu-img-checksum.out

diff --git a/tests/qemu-iotests/tests/qemu-img-checksum 
b/tests/qemu-iotests/tests/qemu-img-checksum
new file mode 100755
index 00..3a85ba33f2
--- /dev/null
+++ b/tests/qemu-iotests/tests/qemu-img-checksum
@@ -0,0 +1,149 @@
+#!/usr/bin/env python3
+# group: rw auto quick
+#
+# Test cases for qemu-img checksum.
+#
+# Copyright (C) 2022 Red Hat, Inc.
+#
+# This program is free software; you can redistribute it and/or modify
+# it under the terms of the GNU General Public License as published by
+# the Free Software Foundation; either version 2 of the License, or
+# (at your option) any later version.
+#
+# This program is distributed in the hope that it will be useful,
+# but WITHOUT ANY WARRANTY; without even the implied warranty of
+# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+# GNU General Public License for more details.
+#
+# You should have received a copy of the GNU General Public License
+# along with this program.  If not, see .
+
+import re
+
+import iotests
+
+from iotests import (
+filter_testfiles,
+qemu_img,
+qemu_img_log,
+qemu_io,
+qemu_nbd_popen,
+)
+
+
+def checksum_available():
+out = qemu_img("--help").stdout
+return re.search(r"\bchecksum .+ filename\b", out) is not None
+
+
+if not checksum_available():
+iotests.notrun("checksum command not available")
+
+iotests.script_initialize(
+supported_fmts=["raw", "qcow2"],
+supported_cache_modes=["none", "writeback"],


It doesn’t work with writeback, though, because it uses -T none below.

Which by the way is a heavy cost, because I usually run tests in tmpfs, 
where this won’t work.  Is there any way of not doing the -T none below?



+supported_protocols=["file", "nbd"],
+required_fmts=["raw", "qcow2"],
+)
+
+print()
+print("=== Test images ===")
+print()
+
+disk_raw = iotests.file_path('raw')
+qemu_img("create", "-f", "raw", disk_raw, "10m")
+qemu_io("-f", "raw",
+"-c", "write -P 0x1 0 2m",  # data
+"-c", "write -P 0x0 2m 2m", # data with zeroes
+"-c", "write -z 4m 2m", # zero allocated
+"-c", "write -z -u 6m 2m",  # zero hole
+# unallocated
+disk_raw)
+print(filter_testfiles(disk_raw))
+qemu_img_log("map", "--output", "json", disk_raw)
+
+disk_qcow2 = iotests.file_path('qcow2')
+qemu_img("create", "-f", "qcow2", disk_qcow2, "10m")
+qemu_io("-f", "qcow2",
+"-c", "write -P 0x1 0 2m",  # data
+"-c", "write -P 0x0 2m 2m", # data with zeroes
+"-c", "write -z 4m 2m", # zero allocated
+"-c", "write -z -u 6m 2m",  # zero hole
+# unallocated
+disk_qcow2)
+print(filter_testfiles(disk_qcow2))
+qemu_img_log("map", "--output", "json", disk_qcow2)


This isn’t how iotests work, generally.  When run with -qcow2 -file, it 
should only test qcow2 on file, not raw on file, not raw on nbd.  Perhaps 
that way, this test could even support formats other than qcow2 and raw.


Hanna




Re: [PATCH 1/3] qemu-img: Add checksum command

2022-10-26 Thread Hanna Reitz

On 01.09.22 16:32, Nir Soffer wrote:

The checksum command computes a checksum for disk image content using the
blkhash library[1]. The blkhash library is not packaged yet, but it is
available via copr[2].

Example run:

 $ ./qemu-img checksum -p fedora-35.qcow2
 6e5c00c995056319d52395f8d91c7f84725ae3da69ffcba4de4c7d22cff713a5  fedora-35.qcow2

The block checksum is constructed by splitting the image into fixed-size
blocks and computing a digest of every block. The image checksum is the
digest of all the block digests.

The checksum internally uses the "sha256" algorithm, but it cannot be
compared with checksums created by other tools such as `sha256sum`.

The blkhash library supports sparse images, zero detection, and
optimizes zero block hashing (they are practically free). The library
uses multiple threads to speed up the computation.
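The construction described above can be sketched in a few lines of Python. This is an illustrative model only, not blkhash's actual algorithm: the 64 KiB block size and the exact digest layout are assumptions, and the result is not compatible with blkhash or `qemu-img checksum` output.

```python
import hashlib

BLOCK_SIZE = 64 * 1024  # illustrative only; blkhash uses its own parameters

# Zero blocks always hash to the same value, so their digest can be
# computed once and reused -- this is why hashing zero areas is
# practically free.
ZERO_DIGEST = hashlib.sha256(bytes(BLOCK_SIZE)).digest()

def block_checksum(blocks):
    """Digest of per-block digests over an iterable of BLOCK_SIZE blocks."""
    outer = hashlib.sha256()
    for block in blocks:
        if not any(block):  # zero block: hole or explicitly written zeroes
            outer.update(ZERO_DIGEST)
        else:
            outer.update(hashlib.sha256(block).digest())
    return outer.hexdigest()
```

Since holes and explicitly written zeroes both take the precomputed zero-digest path, differently allocated images with identical guest-visible content yield identical checksums.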

Compared to `sha256sum`, `qemu-img checksum` is 3.5-4800[3] times
faster, depending on the amount of data in the image:

 $ ./qemu-img info /scratch/50p.raw
 file format: raw
 virtual size: 6 GiB (6442450944 bytes)
 disk size: 2.91 GiB

 $ hyperfine -w2 -r5 -p "sleep 1" "./qemu-img checksum /scratch/50p.raw" \
  "sha256sum /scratch/50p.raw"
 Benchmark 1: ./qemu-img checksum /scratch/50p.raw
   Time (mean ± σ):      1.849 s ±  0.037 s    [User: 7.764 s, System: 0.962 s]
   Range (min … max):    1.813 s …  1.908 s    5 runs

 Benchmark 2: sha256sum /scratch/50p.raw
   Time (mean ± σ):     14.585 s ±  0.072 s    [User: 13.537 s, System: 1.003 s]
   Range (min … max):   14.501 s … 14.697 s    5 runs

 Summary
   './qemu-img checksum /scratch/50p.raw' ran
     7.89 ± 0.16 times faster than 'sha256sum /scratch/50p.raw'

The new command is available only when `blkhash` is available during
build. To test the new command please install the `blkhash-devel`
package:

 $ dnf copr enable nsoffer/blkhash
 $ sudo dnf install blkhash-devel

[1] https://gitlab.com/nirs/blkhash
[2] https://copr.fedorainfracloud.org/coprs/nsoffer/blkhash/
[3] Computing checksum for 8T empty image: qemu-img checksum: 3.7s,
 sha256sum (estimate): 17,749s

Signed-off-by: Nir Soffer 
---
  docs/tools/qemu-img.rst |  22 +
  meson.build |  10 ++-
  meson_options.txt   |   2 +
  qemu-img-cmds.hx|   8 ++
  qemu-img.c  | 191 
  5 files changed, 232 insertions(+), 1 deletion(-)

diff --git a/docs/tools/qemu-img.rst b/docs/tools/qemu-img.rst
index 85a6e05b35..8be9c45cbf 100644
--- a/docs/tools/qemu-img.rst
+++ b/docs/tools/qemu-img.rst
@@ -347,20 +347,42 @@ Command description:
  Check completed, image is corrupted
3
  Check completed, image has leaked clusters, but is not corrupted
63
  Checks are not supported by the image format
  
If ``-r`` is specified, exit codes representing the image state refer to the

state after (the attempt at) repairing it. That is, a successful ``-r all``
will yield the exit code 0, independently of the image state before.
  
+.. option:: checksum [--object OBJECTDEF] [--image-opts] [-f FMT] [-T SRC_CACHE] [-p] FILENAME

+
+  Print a checksum for image *FILENAME* guest visible content.


Why not say which kind of checksum it is?


 Images with
+  different format or settings wil have the same checksum.


s/wil/will/


+
+  The format is probed unless you specify it by ``-f``.
+
+  The checksum is computed for guest visible content. Allocated areas full of
+  zeroes, zero clusters, and unallocated areas are read as zeros so they will
+  have the same checksum. Images with single or multiple files or backing files
+  will have the same checksums if the guest will see the same content when
+  reading the image.
+
+  Image metadata that is not visible to the guest such as dirty bitmaps does
+  not affect the checksum.
+
+  Computing a checksum requires a read-only image. You cannot compute a
+  checksum of an active image used by a guest,


Makes me ask: Why not?  Other subcommands have the -U flag for this.


 but you can compute a checksum
+  of a guest during pull mode incremental backup using NBD URL.
+
+  The checksum is not compatible with other tools such as *sha256sum*.


Why not?  I can see it differs even for raw images, but why?  I would 
have very much assumed that this gives me exactly what sha256sum in the 
guest on the guest device would yield.



+
  .. option:: commit [--object OBJECTDEF] [--image-opts] [-q] [-f FMT] [-t 
CACHE] [-b BASE] [-r RATE_LIMIT] [-d] [-p] FILENAME
  
Commit the changes recorded in *FILENAME* in its base image or backing file.

If the backing file is smaller than the snapshot, then the backing file 
will be
resized to be the same size as the snapshot.  If the snapshot is smaller 
than
the backing file, 

[PATCH 1/3] block: bdrv_child_get_parent_aio_context is not GS

2022-09-23 Thread Hanna Reitz
All implementations of bdrv_child_get_parent_aio_context() are IO_CODE
(or do not mark anything in the case of block jobs), so this too can be
IO_CODE.  By the definition of "I/O API functions" in block-io.h, this
is a strict relaxation, as I/O code can be run from both GS and I/O code
arbitrarily.

Signed-off-by: Hanna Reitz 
---
 block.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/block.c b/block.c
index bc85f46eed..7f2a9d4df0 100644
--- a/block.c
+++ b/block.c
@@ -1499,7 +1499,7 @@ const BdrvChildClass child_of_bds = {
 
 AioContext *bdrv_child_get_parent_aio_context(BdrvChild *c)
 {
-GLOBAL_STATE_CODE();
+IO_CODE();
 return c->klass->get_parent_aio_context(c);
 }
 
-- 
2.36.1




[PATCH 0/3] block: Start/end drain on correct AioContext

2022-09-23 Thread Hanna Reitz
Hi,

bdrv_replace_child_noperm() drains the child via
bdrv_parent_drained_{begin,end}_single().  When it removes a child, the
bdrv_parent_drained_end_single() at its end will be called on an empty
child, making the BDRV_POLL_WHILE() in it poll the main AioContext
(because c->bs is NULL).

That’s wrong, though, because it’s supposed to operate on the parent.
bdrv_parent_drained_end_single_no_poll() will have scheduled any BHs in
the parent’s AioContext, which may be anything, not necessarily the main
context.  Therefore, we must poll the parent’s context.

Patch 3 does this for both bdrv_parent_drained_{begin,end}_single().
Patch 1 ensures that we can legally call
bdrv_child_get_parent_aio_context() from those functions (currently
marked as GLOBAL_STATE_CODE(), which I don’t think it is), and patch 2
fixes blk_do_set_aio_context() to not cause an assertion failure if it
beginning a drain can end up in blk_get_aio_context() before blk->ctx
has been updated.


Hanna Reitz (3):
  block: bdrv_child_get_parent_aio_context is not GS
  block-backend: Update ctx immediately after root
  block: Start/end drain on correct AioContext

 block.c   | 2 +-
 block/block-backend.c | 4 +++-
 block/io.c| 6 --
 3 files changed, 8 insertions(+), 4 deletions(-)

-- 
2.36.1




[PATCH 2/3] block-backend: Update ctx immediately after root

2022-09-23 Thread Hanna Reitz
blk_get_aio_context() asserts that blk->ctx is always equal to the root
BDS's context (if there is a root BDS).  Therefore,
blk_do_set_aio_context() must update blk->ctx immediately after the root
BDS's context has changed.

Without this patch, the next patch would break iotest 238, because
bdrv_drained_begin() (called by blk_do_set_aio_context()) may then
invoke bdrv_child_get_parent_aio_context() on the root child, i.e.
blk_get_aio_context().  However, by this point, blk->ctx would not have
been updated and thus differ from the root node's context.  This patch
fixes that.

Signed-off-by: Hanna Reitz 
---
 block/block-backend.c | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/block/block-backend.c b/block/block-backend.c
index d4a5df2ac2..abdb5ff5af 100644
--- a/block/block-backend.c
+++ b/block/block-backend.c
@@ -2156,6 +2156,7 @@ static int blk_do_set_aio_context(BlockBackend *blk, 
AioContext *new_context,
 return ret;
 }
 }
+blk->ctx = new_context;
 if (tgm->throttle_state) {
 bdrv_drained_begin(bs);
 throttle_group_detach_aio_context(tgm);
@@ -2164,9 +2165,10 @@ static int blk_do_set_aio_context(BlockBackend *blk, 
AioContext *new_context,
 }
 
 bdrv_unref(bs);
+} else {
+blk->ctx = new_context;
 }
 
-blk->ctx = new_context;
 return 0;
 }
 
-- 
2.36.1




[PATCH 3/3] block: Start/end drain on correct AioContext

2022-09-23 Thread Hanna Reitz
bdrv_parent_drained_{begin,end}_single() are supposed to operate on the
parent, not on the child, so they should not attempt to get the context
to poll from the child but the parent instead.  BDRV_POLL_WHILE(c->bs)
does get the context from the child, so we should replace it with
AIO_WAIT_WHILE() on the parent's context instead.

This problem becomes apparent when bdrv_replace_child_noperm() invokes
bdrv_parent_drained_end_single() after removing a child from a subgraph
that is in an I/O thread.  By the time bdrv_parent_drained_end_single()
is called, child->bs is NULL, and so BDRV_POLL_WHILE(c->bs, ...) will
poll the main loop instead of the I/O thread; but anything that
bdrv_parent_drained_end_single_no_poll() may have scheduled is going to
want to run in the I/O thread, but because we poll the main loop, the
I/O thread is never unpaused, and nothing is run, resulting in a
deadlock.

Closes: https://gitlab.com/qemu-project/qemu/-/issues/1215
Signed-off-by: Hanna Reitz 
---
 block/io.c | 6 --
 1 file changed, 4 insertions(+), 2 deletions(-)

diff --git a/block/io.c b/block/io.c
index 0a8cbefe86..7df1129b3b 100644
--- a/block/io.c
+++ b/block/io.c
@@ -71,9 +71,10 @@ static void bdrv_parent_drained_end_single_no_poll(BdrvChild 
*c,
 void bdrv_parent_drained_end_single(BdrvChild *c)
 {
 int drained_end_counter = 0;
+AioContext *ctx = bdrv_child_get_parent_aio_context(c);
 IO_OR_GS_CODE();
 bdrv_parent_drained_end_single_no_poll(c, &drained_end_counter);
-BDRV_POLL_WHILE(c->bs, qatomic_read(&drained_end_counter) > 0);
+AIO_WAIT_WHILE(ctx, qatomic_read(&drained_end_counter) > 0);
 }
 
 static void bdrv_parent_drained_end(BlockDriverState *bs, BdrvChild *ignore,
@@ -116,13 +117,14 @@ static bool bdrv_parent_drained_poll(BlockDriverState 
*bs, BdrvChild *ignore,
 
 void bdrv_parent_drained_begin_single(BdrvChild *c, bool poll)
 {
+AioContext *ctx = bdrv_child_get_parent_aio_context(c);
 IO_OR_GS_CODE();
 c->parent_quiesce_counter++;
 if (c->klass->drained_begin) {
 c->klass->drained_begin(c);
 }
 if (poll) {
-BDRV_POLL_WHILE(c->bs, bdrv_parent_drained_poll_single(c));
+AIO_WAIT_WHILE(ctx, bdrv_parent_drained_poll_single(c));
 }
 }
 
-- 
2.36.1




[PATCH 2/3] block/qed: Keep auto_backing_file if possible

2022-08-03 Thread Hanna Reitz
Just like qcow2, qed invokes its open function in its
.bdrv_co_invalidate_cache() implementation.  Therefore, just like done
for qcow2 in HEAD^, update auto_backing_file only if the backing file
string in the image header differs from the one we have read before.

Signed-off-by: Hanna Reitz 
---
 block/qed.c | 15 +++
 1 file changed, 11 insertions(+), 4 deletions(-)

diff --git a/block/qed.c b/block/qed.c
index 40943e679b..324ca0e95a 100644
--- a/block/qed.c
+++ b/block/qed.c
@@ -445,6 +445,8 @@ static int coroutine_fn bdrv_qed_do_open(BlockDriverState 
*bs, QDict *options,
 }
 
 if ((s->header.features & QED_F_BACKING_FILE)) {
+g_autofree char *backing_file_str = NULL;
+
 if ((uint64_t)s->header.backing_filename_offset +
 s->header.backing_filename_size >
 s->header.cluster_size * s->header.header_size) {
@@ -452,16 +454,21 @@ static int coroutine_fn bdrv_qed_do_open(BlockDriverState 
*bs, QDict *options,
 return -EINVAL;
 }
 
+backing_file_str = g_malloc(sizeof(bs->backing_file));
 ret = qed_read_string(bs->file, s->header.backing_filename_offset,
   s->header.backing_filename_size,
-  bs->auto_backing_file,
-  sizeof(bs->auto_backing_file));
+  backing_file_str, sizeof(bs->backing_file));
 if (ret < 0) {
 error_setg(errp, "Failed to read backing filename");
 return ret;
 }
-pstrcpy(bs->backing_file, sizeof(bs->backing_file),
-bs->auto_backing_file);
+
+if (!g_str_equal(backing_file_str, bs->backing_file)) {
+pstrcpy(bs->backing_file, sizeof(bs->backing_file),
+backing_file_str);
+pstrcpy(bs->auto_backing_file, sizeof(bs->auto_backing_file),
+backing_file_str);
+}
 
 if (s->header.features & QED_F_BACKING_FORMAT_NO_PROBE) {
 pstrcpy(bs->backing_format, sizeof(bs->backing_format), "raw");
-- 
2.36.1




[PATCH 3/3] iotests/backing-file-invalidation: Add new test

2022-08-03 Thread Hanna Reitz
Add a new test to see what happens when you migrate a VM with a backing
chain that has json:{} backing file strings, which, when opened, will be
resolved to plain filenames.

Signed-off-by: Hanna Reitz 
---
 .../tests/backing-file-invalidation   | 152 ++
 .../tests/backing-file-invalidation.out   |   5 +
 2 files changed, 157 insertions(+)
 create mode 100755 tests/qemu-iotests/tests/backing-file-invalidation
 create mode 100644 tests/qemu-iotests/tests/backing-file-invalidation.out

diff --git a/tests/qemu-iotests/tests/backing-file-invalidation 
b/tests/qemu-iotests/tests/backing-file-invalidation
new file mode 100755
index 00..4eccc80153
--- /dev/null
+++ b/tests/qemu-iotests/tests/backing-file-invalidation
@@ -0,0 +1,152 @@
+#!/usr/bin/env python3
+# group: rw migration
+#
+# Migrate a VM with a BDS with backing nodes, which runs
+# bdrv_invalidate_cache(), which for qcow2 and qed triggers reading the
+# backing file string from the image header.  Check whether this
+# interferes with bdrv_backing_overridden().
+#
+# Copyright (C) 2022 Red Hat, Inc.
+#
+# This program is free software; you can redistribute it and/or modify
+# it under the terms of the GNU General Public License as published by
+# the Free Software Foundation; either version 2 of the License, or
+# (at your option) any later version.
+#
+# This program is distributed in the hope that it will be useful,
+# but WITHOUT ANY WARRANTY; without even the implied warranty of
+# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+# GNU General Public License for more details.
+#
+# You should have received a copy of the GNU General Public License
+# along with this program.  If not, see <http://www.gnu.org/licenses/>.
+#
+
+import json
+import os
+from typing import Optional
+
+import iotests
+from iotests import qemu_img_create, qemu_img_info
+
+
+image_size = 1 * 1024 * 1024
+imgs = [os.path.join(iotests.test_dir, f'{i}.img') for i in range(0, 4)]
+
+mig_sock = os.path.join(iotests.sock_dir, 'mig.sock')
+
+
+class TestPostMigrateFilename(iotests.QMPTestCase):
+vm_s: Optional[iotests.VM] = None
+vm_d: Optional[iotests.VM] = None
+
+def setUp(self) -> None:
+# Create backing chain of three images, where the backing file strings
+# are json:{} filenames
+qemu_img_create('-f', iotests.imgfmt, imgs[0], str(image_size))
+for i in range(1, 3):
+backing = {
+'driver': iotests.imgfmt,
+'file': {
+'driver': 'file',
+'filename': imgs[i - 1]
+}
+}
+qemu_img_create('-f', iotests.imgfmt, '-F', iotests.imgfmt,
+'-b', 'json:' + json.dumps(backing),
+imgs[i], str(image_size))
+
+def tearDown(self) -> None:
+if self.vm_s is not None:
+self.vm_s.shutdown()
+if self.vm_d is not None:
+self.vm_d.shutdown()
+
+for img in imgs:
+try:
+os.remove(img)
+except OSError:
+pass
+try:
+os.remove(mig_sock)
+except OSError:
+pass
+
+def test_migration(self) -> None:
+"""
+Migrate a VM with the backing chain created in setUp() attached.  At
+the end of the migration process, the destination will run
+bdrv_invalidate_cache(), which for some image formats (qcow2 and qed)
+means the backing file string is re-read from the image header.  If
+this overwrites bs->auto_backing_file, doing so may cause
+bdrv_backing_overridden() to become true: The image header reports a
+json:{} filename, but when opening it, bdrv_refresh_filename() will
+simplify it to a plain simple filename; and when bs->auto_backing_file
+and bs->backing->bs->filename differ, bdrv_backing_overridden() becomes
+true.
+If bdrv_backing_overridden() is true, the BDS will be forced to get a
+json:{} filename, which in general is not the end of the world, but not
+great.  Check whether that happens, i.e. whether migration changes the
+node's filename.
+"""
+
+blockdev = {
+'node-name': 'node0',
+'driver': iotests.imgfmt,
+'file': {
+'driver': 'file',
+'filename': imgs[2]
+}
+}
+
+self.vm_s = iotests.VM(path_suffix='a') \
+   .add_blockdev(json.dumps(blockdev))
+self.vm_d = iotests.VM(path_suffix='b') \
+   .add_blockdev(json.dumps(blockdev)) \
+   .add_incoming(f'unix:{mig_sock}')
+
+assert self.vm_s is not None
+assert self.vm_d is not None
+
+self.vm_s.launch()
+self.vm_d.launch()
+
+ 

[PATCH 1/3] block/qcow2: Keep auto_backing_file if possible

2022-08-03 Thread Hanna Reitz
qcow2_do_open() is used by qcow2_co_invalidate_cache(), i.e. may be run
on an image that has been opened before.  When reading the backing file
string from the image header, compare it against the existing
bs->backing_file, and update bs->auto_backing_file only if they differ.

auto_backing_file should ideally contain the filename the backing BDS
will actually have after opening, i.e. a post-bdrv_refresh_filename()
version of what is in the image header.  So for example, if the image
header reports the following backing file string:

json:{"driver": "qcow2", "file": {
"driver": "file", "filename": "/tmp/backing.qcow2"
}}

Then auto_backing_file should contain simply "/tmp/backing.qcow2".

Because bdrv_refresh_filename() only works on existing BDSs, though, the
way how we get this auto_backing_file value is to have the format driver
set it to whatever is in the image header, and when the backing BDS is
opened based on that, we update it with the filename the backing BDS
actually got.

However, qcow2's qcow2_co_invalidate_cache() implementation breaks this
because it just resets auto_backing_file to whatever is in the image
file without opening a BDS based on it, so we never get
auto_backing_file back to the "refreshed" version, and in the example
above, it would stay "json:{...}".

Then, bs->backing->bs->filename will differ from bs->auto_backing_file,
making bdrv_backing_overridden(bs) return true, which will lead
bdrv_refresh_filename(bs) to generate a json:{} filename for bs, even
though that may not have been necessary.  This is reported in the issue
linked below.

Therefore, skip updating auto_backing_file if nothing has changed in the
image header since we last read it.
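The intended behaviour can be modelled with a toy sketch in which a plain dict stands in for the BDS; the field names mirror the BDS members, but nothing here is QEMU code.

```python
def reread_backing_from_header(bs, header_str):
    # Skip the update when the header string is unchanged since the last
    # read, so the refreshed auto_backing_file survives invalidate-cache.
    if header_str != bs["backing_file"]:
        bs["backing_file"] = header_str
        bs["auto_backing_file"] = header_str

bs = {
    "backing_file": 'json:{"driver": "qcow2", ...}',  # as read from the header
    "auto_backing_file": "/tmp/backing.qcow2",        # refreshed plain filename
}

# invalidate-cache re-reads the same header string: nothing is clobbered
reread_backing_from_header(bs, 'json:{"driver": "qcow2", ...}')
assert bs["auto_backing_file"] == "/tmp/backing.qcow2"

# a genuinely changed header still takes effect
reread_backing_from_header(bs, "/other/backing.qcow2")
assert bs["auto_backing_file"] == "/other/backing.qcow2"
```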

Fixes: https://gitlab.com/qemu-project/qemu/-/issues/1117
Signed-off-by: Hanna Reitz 
---
 block/qcow2.c | 21 -
 1 file changed, 16 insertions(+), 5 deletions(-)

diff --git a/block/qcow2.c b/block/qcow2.c
index c6c6692fb7..a43fee2067 100644
--- a/block/qcow2.c
+++ b/block/qcow2.c
@@ -1696,16 +1696,27 @@ static int coroutine_fn qcow2_do_open(BlockDriverState 
*bs, QDict *options,
 ret = -EINVAL;
 goto fail;
 }
+
+s->image_backing_file = g_malloc(len + 1);
 ret = bdrv_pread(bs->file, header.backing_file_offset, len,
- bs->auto_backing_file, 0);
+ s->image_backing_file, 0);
 if (ret < 0) {
 error_setg_errno(errp, -ret, "Could not read backing file name");
 goto fail;
 }
-bs->auto_backing_file[len] = '\0';
-pstrcpy(bs->backing_file, sizeof(bs->backing_file),
-bs->auto_backing_file);
-s->image_backing_file = g_strdup(bs->auto_backing_file);
+s->image_backing_file[len] = '\0';
+
+/*
+ * Update only when something has changed.  This function is called by
+ * qcow2_co_invalidate_cache(), and we do not want to reset
+ * auto_backing_file unless necessary.
+ */
+if (!g_str_equal(s->image_backing_file, bs->backing_file)) {
+pstrcpy(bs->backing_file, sizeof(bs->backing_file),
+s->image_backing_file);
+pstrcpy(bs->auto_backing_file, sizeof(bs->auto_backing_file),
+s->image_backing_file);
+}
 }
 
 /*
-- 
2.36.1




[PATCH 0/3] block: Keep auto_backing_file post-migration

2022-08-03 Thread Hanna Reitz
Hi,

https://gitlab.com/qemu-project/qemu/-/issues/1117 reports the following
issue:

Say you have a VM with a backing chain of images where the image
metadata contains json:{} backing file strings, which however will be
resolved to simple plain filenames when opened[1].

So when these images are opened, bs->auto_backing_file is first read
directly from the image header, and will thus contain a json:{}
filename.  The backing image is opened based off of this filename, and
bdrv_refresh_filename() simplifies the filename as shown[1].  We then
update bs->auto_backing_file from bs->backing->bs->filename, so both are
equal.

It is quite important that both are equal, because
bdrv_backing_overridden() checks whether the backing file has been
changed from the default by comparing bs->auto_backing_file to
bs->backing->bs->filename.

Because we did set bs->auto_backing_file from bs->backing->bs->filename,
both are equal, the backing file is not considered overridden, and
bdrv_refresh_filename(bs) will not consider it necessary to generate a
json:{} filename for the overlay.

Then the VM is migrated.

The destination side invokes bdrv_invalidate_cache(), which by qcow2 and
qed is implemented by closing the image and opening it.  This re-reads
the backing file string from disk, resetting bs->auto_backing_file.
Now, it will contain the json:{} filename again and thus differ from
bs->backing->bs->filename.

Consequentially, a subsequent bdrv_refresh_filename(bs) will find that
the overlay’s backing file has been overridden and generate a json:{}
filename, which isn’t great.

This series fixes that by having qcow2’s and qed’s image-open operations
not overwrite bs->auto_backing_file unless something has changed since
the last time we read the backing filename from the metadata.


Now, generating a json:{} filename can be a nuisance but shouldn’t be a
real problem.  The actual problem reported in 1117 comes later, namely
when creating a snapshot overlay post-migration.  This overlay image
will have a json:{} backing filename in its image metadata, which
contains a 'backing' key[2].

'qemu-img info' uses the BDRV_O_NO_BACKING flag to open images, which
conflicts with those backing options: With that flag, nobody processes
those options, and that’s an error.  Therefore, you can’t run 'qemu-img
info --backing-chain' on that overlay image.

That part of the issue is not fixed in this series, however.  I’ll send
a separate RFC series for it, because I’m honestly not quite certain how
it should be fixed.


[1] Example:
json:{"driver": "qcow2",
  "file": {"driver": "file", "filename": "img.qcow2"}}
Will generally be “resolved” by bdrv_refresh_filename() to
"img.qcow2"

[2] That it contains a 'backing' key is only natural, because the reason
why bdrv_refresh_filename() decided to generate a json:{} filename
for the image is because it considered the backing file overridden.
Hence it must put the actual backing file options into a 'backing'
object in the json:{} filename.


Hanna Reitz (3):
  block/qcow2: Keep auto_backing_file if possible
  block/qed: Keep auto_backing_file if possible
  iotests/backing-file-invalidation: Add new test

 block/qcow2.c |  21 ++-
 block/qed.c   |  15 +-
 .../tests/backing-file-invalidation   | 152 ++
 .../tests/backing-file-invalidation.out   |   5 +
 4 files changed, 184 insertions(+), 9 deletions(-)
 create mode 100755 tests/qemu-iotests/tests/backing-file-invalidation
 create mode 100644 tests/qemu-iotests/tests/backing-file-invalidation.out

-- 
2.36.1




Re: [RFC PATCH 8/8] block: remove all unused ->can_set_aio_ctx and ->set_aio_ctx callbacks

2022-07-15 Thread Hanna Reitz

On 12.07.22 23:19, Emanuele Giuseppe Esposito wrote:

Together with all _can_set_ and _set_ APIs, as they are not needed
anymore.

Signed-off-by: Emanuele Giuseppe Esposito 
---
  block.c| 196 -
  block/block-backend.c  |  33 -
  blockjob.c |  35 --
  include/block/block-global-state.h |   9 --
  include/block/block_int-common.h   |   4 -
  5 files changed, 277 deletions(-)


Looks good!  I’d just like a follow-up commit that also drops 
bdrv_try_set_aio_context(), so it’s all gone (I think that’s the final 
remnant?).


Hanna




Re: [RFC PATCH 7/8] block: use the new _change_ API instead of _can_set_ and _set_

2022-07-15 Thread Hanna Reitz

On 12.07.22 23:19, Emanuele Giuseppe Esposito wrote:

Replace all direct usage of ->can_set_aio_ctx and ->set_aio_ctx,
and call bdrv_child_try_change_aio_context() in
bdrv_try_set_aio_context(), the main function called through
the whole block layer.

 From this point onwards, ->can_set_aio_ctx and ->set_aio_ctx
won't be used anymore.

Signed-off-by: Emanuele Giuseppe Esposito 
---
  block.c   | 30 --
  block/block-backend.c |  8 ++--
  2 files changed, 22 insertions(+), 16 deletions(-)

diff --git a/block.c b/block.c
index a7ba590dfa..101188a2d4 100644
--- a/block.c
+++ b/block.c
@@ -2966,17 +2966,18 @@ static void bdrv_attach_child_common_abort(void *opaque)
  }
  
  if (bdrv_child_get_parent_aio_context(child) != s->old_parent_ctx) {

+Transaction *tran;
  GSList *ignore;
+bool ret;
  
-/* No need to ignore `child`, because it has been detached already */

-ignore = NULL;
-child->klass->can_set_aio_ctx(child, s->old_parent_ctx, &ignore,
-  &error_abort);
-g_slist_free(ignore);
+tran = tran_new();
  
+/* No need to ignore `child`, because it has been detached already */

  ignore = NULL;
-child->klass->set_aio_ctx(child, s->old_parent_ctx, &ignore);
+ret = child->klass->change_aio_ctx(child, s->old_parent_ctx, &ignore,
+   tran, &error_abort);
  g_slist_free(ignore);
+tran_finalize(tran, ret ? ret : -1);


As far as I understand, the transaction is supposed to always succeed; 
that’s why we pass `_abort`, I thought.


If so, `ret` should always be true.  More importantly, though, I think 
the `ret ? ret : -1` is wrong because it’ll always evaluate to either 1 
or -1, but never to 0, which would indicate success.  I think it should 
be `ret == true ? 0 : -1`, or even better `assert(ret == true); 
tran_finalize(tran, 0);`.



  }
  
  bdrv_unref(bs);

@@ -3037,17 +3038,18 @@ static int bdrv_attach_child_common(BlockDriverState 
*child_bs,
  Error *local_err = NULL;
int ret = bdrv_try_set_aio_context(child_bs, parent_ctx, &local_err);
  
-if (ret < 0 && child_class->can_set_aio_ctx) {

+if (ret < 0 && child_class->change_aio_ctx) {
+Transaction *tran = tran_new();
  GSList *ignore = g_slist_prepend(NULL, new_child);
-if (child_class->can_set_aio_ctx(new_child, child_ctx, &ignore,
- NULL))
-{
+bool ret_child;
+
+ret_child = child_class->change_aio_ctx(new_child, child_ctx,
+&ignore, tran, NULL);
+if (ret_child) {


To be honest, due to the mix of return value styles we have, perhaps a 
`ret_child == true` would help to signal that this is a success path.



  error_free(local_err);
  ret = 0;
-g_slist_free(ignore);
-ignore = g_slist_prepend(NULL, new_child);
-child_class->set_aio_ctx(new_child, child_ctx, &ignore);
  }
+tran_finalize(tran, ret_child ? ret_child : -1);


This too should probably be `ret_child == true ? 0 : -1`.


  g_slist_free(ignore);
  }
  
@@ -7708,7 +7710,7 @@ int bdrv_try_set_aio_context(BlockDriverState *bs, AioContext *ctx,

   Error **errp)
  {
  GLOBAL_STATE_CODE();
-return bdrv_child_try_set_aio_context(bs, ctx, NULL, errp);
+return bdrv_child_try_change_aio_context(bs, ctx, NULL, errp);


Why not remove this function and adjust all callers?

Hanna


  }
  
  void bdrv_add_aio_context_notifier(BlockDriverState *bs,





Re: [RFC PATCH 6/8] block-backend: implement .change_aio_ctx in child_root

2022-07-15 Thread Hanna Reitz

On 12.07.22 23:19, Emanuele Giuseppe Esposito wrote:

blk_root_change_aio_ctx() is very similar to blk_root_can_set_aio_ctx(),
but implements a new transaction so that if all checks pass, the new
transaction's .commit will take care of changing the BlockBackend
AioContext. blk_root_set_aio_ctx_commit() is the same as
blk_root_set_aio_ctx().

Note: bdrv_child_try_change_aio_context() is not called by
anyone at this point.

Signed-off-by: Emanuele Giuseppe Esposito 
---
  block/block-backend.c | 54 +++
  1 file changed, 54 insertions(+)

diff --git a/block/block-backend.c b/block/block-backend.c
index f425b00793..674eaaa2bf 100644
--- a/block/block-backend.c
+++ b/block/block-backend.c


[...]


@@ -2208,6 +2212,56 @@ int blk_set_aio_context(BlockBackend *blk, AioContext 
*new_context,


[...]


+static bool blk_root_change_aio_ctx(BdrvChild *child, AioContext *ctx,
+GSList **visited, Transaction *tran,
+Error **errp)
+{
+BlockBackend *blk = child->opaque;
+BdrvStateBlkRootContext *s;
+
+if (blk->allow_aio_context_change) {
+goto finish;
+}
+
+/*
+ * Only manually created BlockBackends that are not attached to anything
+ * can change their AioContext without updating their user.
+ */
+if (!blk->name || blk->dev) {
+/* TODO Add BB name/QOM path */
+error_setg(errp, "Cannot change iothread of active block backend");
+return false;
+}


Is the goto really necessary?  Or, rather, do you prefer this to 
something like


if (!blk->allow_aio_context_change) {
    /*
 * Manually created BlockBackends (those with a name) that are not
 * attached to anything can change their AioContext without updating
 * their user; return an error for others.
 */
    if (!blk->name || blk->dev) {
    ...
    }
}

If you prefer the goto, I’d at least rename the label to 
“change_context” or “allowed” or something.


Hanna


+
+finish:
+s = g_new(BdrvStateBlkRootContext, 1);
+*s = (BdrvStateBlkRootContext) {
+.new_ctx = ctx,
+.blk = blk,
+};
+
+tran_add_tail(tran, &set_blk_root_context, s);


(Again, not a huge fan of this.)


+return true;
+}
+
  static bool blk_root_can_set_aio_ctx(BdrvChild *child, AioContext *ctx,
   GSList **ignore, Error **errp)
  {





Re: [RFC PATCH 5/8] block: implement .change_aio_ctx in child_of_bds

2022-07-15 Thread Hanna Reitz

On 12.07.22 23:19, Emanuele Giuseppe Esposito wrote:

bdrv_child_cb_change_aio_ctx() is identical to
bdrv_child_cb_can_set_aio_ctx(), as we only need
to recursively go on the parent bs.

Note: bdrv_child_try_change_aio_context() is not called by
anyone at this point.

Signed-off-by: Emanuele Giuseppe Esposito 
---
  block.c | 9 +
  1 file changed, 9 insertions(+)


Reviewed-by: Hanna Reitz 




Re: [RFC PATCH 4/8] blockjob: implement .change_aio_ctx in child_job

2022-07-15 Thread Hanna Reitz

On 12.07.22 23:19, Emanuele Giuseppe Esposito wrote:

child_job_change_aio_ctx() is very similar to
child_job_can_set_aio_ctx(), but it implements a new transaction
so that if all checks pass, the new transaction's .commit()
will take care of changing the BlockJob AioContext.
child_job_set_aio_ctx_commit() is similar to child_job_set_aio_ctx(),
but it doesn't need to invoke the recursion, as this is already
taken care by child_job_change_aio_ctx().

Note: bdrv_child_try_change_aio_context() is not called by
anyone at this point.

Signed-off-by: Emanuele Giuseppe Esposito 
---
  blockjob.c | 45 +
  1 file changed, 45 insertions(+)


Looks good, disregarding the fact that I’d like it very much if we could 
find some other primitive than tran_add_tail() to get these handlers to 
run on a drained graph.


But that’s taste (and something to talk about in patch 3), so I’ll just 
give a


Reviewed-by: Hanna Reitz 




Re: [RFC PATCH 3/8] RFC: block: use transactions as a replacement of ->{can_}set_aio_context()

2022-07-14 Thread Hanna Reitz

On 12.07.22 23:19, Emanuele Giuseppe Esposito wrote:

-
RFC because I am not sure about the AioContext locks.
- Do we need to take the new AioContext lock? what does it protect?
- Taking the old AioContext lock is required now, because of
   bdrv_drained_begin calling AIO_WAIT_WHILE that unlocks the
   aiocontext. If we replace it with AIO_WAIT_WHILE_UNLOCKED,
   could we get rid of taking the old AioContext every time?
   Drain would be enough to protect the graph modification.


It’s been a while, but as far as I remember (which may be wrong), the 
reason for how the locks are supposed to be taken was mostly that we 
need some defined state so that we know how to invoke 
bdrv_drained_begin() and bdrv_drained_end() (i.e. call the first one 
as-is, and switch the locks around for the latter one).


The idea of using _UNLOCKED sounds interesting, almost too obvious. I 
can’t see why that wouldn’t work, actually.



--

Simplify the way the aiocontext can be changed in a BDS graph.
There are currently two problems in bdrv_try_set_aio_context:
- There is a confusion of AioContext locks taken and released, because
   we assume that old aiocontext is always taken and new one is
   taken inside.


Yep, and that assumption is just broken in some cases, which is the main 
pain point I’m feeling with it right now.


For example, look at bdrv_attach_child_common(): Here, we attach a child 
to a parent, so we need to get them into a common AioContext. So first 
we try to put the child into the parent’s context, and if that fails, 
we’ll try the other way, putting the parent into the child’s context.


The problem is clear: The bdrv_try_set_aio_context() call requires us to 
hold the child’s current context but not the parent’s, and the 
child_class->set_aio_ctx() call requires the exact opposite.  But we 
never switch the context we have acquired, so this can’t both work.  
Better yet, nowhere is it defined what context a caller to 
bdrv_attach_child_common() will hold.


In practice, what happens here most of the time is that something will 
be moved from the main context to some other context, and since we’re in 
the main context already, that’ll just work.  But you can construct 
cases where something is attempted to be moved from an I/O thread into a 
different thread and then you’ll get a crash.


I’d be happy if we could do away with the requirement of having to hold 
any lock for changing a node’s AioContext.



- It doesn't look very safe to call bdrv_drained_begin while some
   nodes have already switched to the new aiocontext and others haven't.
   This could be especially dangerous because bdrv_drained_begin polls, so
   something else could be executed while graph is in an inconsistent
   state.

Additional minor nitpick: can_set and set_ callbacks both traverse the
graph, both using the ignored list of visited nodes in a different way.

Therefore, get rid of all of this and introduce a new callback,
change_aio_context, that uses transactions to efficiently, cleanly
and most importantly safely change the aiocontext of a graph.

This new callback is a "merge" of the two previous ones:
- Just like can_set_aio_context, recursively traverses the graph.
   Marks all nodes that are visited using a GList, and checks if
   they *could* change the aio_context.
- For each node that passes the above check, add a new transaction
   that implements a callback that effectively changes the aiocontext.
- If a node is a BDS, add two transactions: one taking care of draining
   the node at the beginning of the list (so that will be executed first)
   and one at the end taking care of changing the AioContext.
- Once done, the recursive function returns whether *all* nodes can change
   the AioContext. If so, commit the above transactions. Otherwise don't
   do anything.
- The transaction list contains first all "drain" transactions, so
   we are sure we are draining all nodes in the same context, and then
   all the others switch the AioContext. In this way we make sure that
   bdrv_drained_begin() is always called under the old AioContext, and
   bdrv_drained_end() under the new one.
- Because of the above, we don't need to release and re-acquire the
   old AioContext every time, as everything is done once (and not
   per-node drain and aiocontext change).

Note that the "change" API is not yet invoked anywhere.


So the idea is that we introduce a completely new transaction-based API 
to change BDSs’ AioContext, and then drop the old one, right?



Signed-off-by: Emanuele Giuseppe Esposito 
---
  block.c| 197 +
  include/block/block-global-state.h |   9 ++
  include/block/block_int-common.h   |   3 +
  3 files changed, 209 insertions(+)

diff --git a/block.c b/block.c
index 267a39c0de..bda4e1bcef 100644
--- a/block.c
+++ b/block.c
@@ -7437,6 +7437,51 @@ static bool bdrv_parent_can_set_aio_context(BdrvChild 
*c, AioContext *ctx,
  return true;
  }
  

Re: [RFC PATCH 2/8] transactions: add tran_add_back

2022-07-14 Thread Hanna Reitz

On 12.07.22 23:19, Emanuele Giuseppe Esposito wrote:

First change the transactions from a QLIST to QSIMPLEQ, then
use it to implement tran_add_tail, which allows adding elements
to the end of list transactions.


The subject still calls it `tran_add_back()` (perhaps from a preliminary 
version?), I think that needs adjustment.



This is useful if we have some "preparation" transiction callbacks


*transaction


that we want to run before the others but still only when invoking
finalize/commit/abort.


I don’t understand this yet (but perhaps it’ll become clearer with the 
following patches); doesn’t the new function do the opposite? I.e., 
basically add some clean-up that’s only used after everything else?



For example (A and B are lists of transaction callbacks):

for (i=0; i < 3; i++) {
tran_add(A[i]);
tran_add_tail(B[i]);
}

tran_commit();

Will process transactions in this order: A2 - A1 - A0 - B0 - B1 - B2

Signed-off-by: Emanuele Giuseppe Esposito 
---
  include/qemu/transactions.h |  9 +
  util/transactions.c | 29 +
  2 files changed, 30 insertions(+), 8 deletions(-)

diff --git a/include/qemu/transactions.h b/include/qemu/transactions.h
index 2f2060acd9..42783720b9 100644
--- a/include/qemu/transactions.h
+++ b/include/qemu/transactions.h
@@ -50,7 +50,16 @@ typedef struct TransactionActionDrv {
  typedef struct Transaction Transaction;
  
  Transaction *tran_new(void);

+/*
+ * Add transaction at the beginning of the transaction list.
+ * @tran will be the first transaction to be processed in 
finalize/commit/abort.


Of course, if you call tran_add() afterwards, this transaction will no 
longer be the first one.  I mean, that’s kind of obvious, but perhaps we 
can still express that here.


Like, perhaps, “finalize/commit/abort process this list in order, so 
after this call, @tran will be the first transaction to be processed”?



+ */
  void tran_add(Transaction *tran, TransactionActionDrv *drv, void *opaque);
+/*
+ * Add transaction at the end of the transaction list.
+ * @tran will be the last transaction to be processed in finalize/commit/abort.


(And then “finalize/commit/abort process this list in order, so after 
this call, @tran will be the last transaction to be processed”)



+ */
+void tran_add_tail(Transaction *tran, TransactionActionDrv *drv, void *opaque);
  void tran_abort(Transaction *tran);
  void tran_commit(Transaction *tran);
  
diff --git a/util/transactions.c b/util/transactions.c

index 2dbdedce95..89e541c4a4 100644
--- a/util/transactions.c
+++ b/util/transactions.c


[...]


@@ -54,20 +54,33 @@ void tran_add(Transaction *tran, TransactionActionDrv *drv, 
void *opaque)
  .opaque = opaque
  };
  
-QSLIST_INSERT_HEAD(&tran->actions, act, entry);

+QSIMPLEQ_INSERT_HEAD(&tran->actions, act, entry);
+}
+
+void tran_add_tail(Transaction *tran, TransactionActionDrv *drv, void *opaque)
+{
+TransactionAction *act;
+
+act = g_new(TransactionAction, 1);
+*act = (TransactionAction) {
+.drv = drv,
+.opaque = opaque
+};
+
+QSIMPLEQ_INSERT_TAIL(&tran->actions, act, entry);
  }


Perhaps this could benefit from a function encompassing the common 
functionality, i.e. a tran_do_add(..., bool tail) with


if (tail) {
    QSIMPLEQ_INSERT_TAIL(...);
} else {
    QSIMPLEQ_INSERT_HEAD(...);
}

(Just a light suggestion.)

Hanna




Re: [RFC PATCH 1/8] block.c: assert bs->aio_context is written under BQL and drains

2022-07-14 Thread Hanna Reitz

On 12.07.22 23:19, Emanuele Giuseppe Esposito wrote:

Also here ->aio_context is read by I/O threads and written
under BQL.

Signed-off-by: Emanuele Giuseppe Esposito 
---
  block.c | 2 ++
  1 file changed, 2 insertions(+)


Reviewed-by: Hanna Reitz 




[PATCH 1/2] block/parallels: Fix buffer-based write call

2022-07-14 Thread Hanna Reitz
Commit a4072543ccdddbd241d5962d9237b8b41fd006bf has changed the I/O here
from working on a local one-element I/O vector to just using the buffer
directly (using the bdrv_co_pread()/bdrv_co_pwrite() helper functions
introduced shortly before).

However, it only changed the bdrv_co_preadv() call to bdrv_co_pread() -
the subsequent bdrv_co_pwritev() call stayed this way, and so still
expects a QEMUIOVector pointer instead of a plain buffer.  We must
change that to be a bdrv_co_pwrite() call.

Fixes: a4072543ccdddbd241d5962d ("block/parallels: use buffer-based io")
Signed-off-by: Hanna Reitz 
---
 block/parallels.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/block/parallels.c b/block/parallels.c
index 8b235b9505..a229c06f25 100644
--- a/block/parallels.c
+++ b/block/parallels.c
@@ -241,8 +241,8 @@ static int64_t allocate_clusters(BlockDriverState *bs, 
int64_t sector_num,
 return ret;
 }
 
-ret = bdrv_co_pwritev(bs->file, s->data_end * BDRV_SECTOR_SIZE,
-  nb_cow_bytes, buf, 0);
+ret = bdrv_co_pwrite(bs->file, s->data_end * BDRV_SECTOR_SIZE,
+ nb_cow_bytes, buf, 0);
 qemu_vfree(buf);
 if (ret < 0) {
 return ret;
-- 
2.35.3




[PATCH 2/2] iotests/131: Add parallels regression test

2022-07-14 Thread Hanna Reitz
Test an allocating write to a parallels image that has a backing node.
Before HEAD^, doing so used to give me a failed assertion (when the
backing node contains only `42` bytes; the result varies with the value
chosen: for `0` bytes, for example, all I get is EIO).

Signed-off-by: Hanna Reitz 
---
 tests/qemu-iotests/131 | 35 ++-
 tests/qemu-iotests/131.out | 13 +
 2 files changed, 47 insertions(+), 1 deletion(-)

diff --git a/tests/qemu-iotests/131 b/tests/qemu-iotests/131
index d7456cae5b..a847692b4c 100755
--- a/tests/qemu-iotests/131
+++ b/tests/qemu-iotests/131
@@ -43,7 +43,7 @@ _supported_os Linux
 
 inuse_offset=$((0x2c))
 
-size=64M
+size=$((64 * 1024 * 1024))
 CLUSTER_SIZE=64k
 IMGFMT=parallels
 _make_test_img $size
@@ -70,6 +70,39 @@ _check_test_img
 _check_test_img -r all
 { $QEMU_IO -c "read -P 0x11 64k 64k" "$TEST_IMG"; } 2>&1 | _filter_qemu_io | 
_filter_testdir
 
+echo "== allocate with backing =="
+# Verify that allocating clusters works fine even when there is a backing 
image.
+# Regression test for a bug where we would pass a buffer read from the backing
+# node as a QEMUIOVector object, which could cause anything from I/O errors 
over
+# assertion failures to invalid reads from memory.
+
+# Clear image
+_make_test_img $size
+# Create base image
+TEST_IMG="$TEST_IMG.base" _make_test_img $size
+
+# Write some data to the base image (which would trigger an assertion failure 
if
+# interpreted as a QEMUIOVector)
+$QEMU_IO -c 'write -P 42 0 64k' "$TEST_IMG.base" | _filter_qemu_io
+
+# Parallels does not seem to support storing a backing filename in the image
+# itself, so we need to build our backing chain on the command line
+imgopts="driver=$IMGFMT,file.driver=$IMGPROTO,file.filename=$TEST_IMG"
+imgopts+=",backing.driver=$IMGFMT"
+imgopts+=",backing.file.driver=$IMGPROTO,backing.file.filename=$TEST_IMG.base"
+
+# Cause allocation in the top image
+QEMU_IO_OPTIONS=$QEMU_IO_OPTIONS_NO_FMT \
+$QEMU_IO --image-opts "$imgopts" -c 'write -P 1 0 64' | _filter_qemu_io
+
+# Verify
+QEMU_IO_OPTIONS=$QEMU_IO_OPTIONS_NO_FMT \
+$QEMU_IO --image-opts "$imgopts" \
+-c 'read -P 1 0 64' \
+-c "read -P 42 64 $((64 * 1024 - 64))" \
+-c "read -P 0 64k $((size - 64 * 1024))" \
+| _filter_qemu_io
+
 # success, all done
 echo "*** done"
 rm -f $seq.full
diff --git a/tests/qemu-iotests/131.out b/tests/qemu-iotests/131.out
index 70da03dee5..de5ef7a8f5 100644
--- a/tests/qemu-iotests/131.out
+++ b/tests/qemu-iotests/131.out
@@ -37,4 +37,17 @@ Double checking the fixed image now...
 No errors were found on the image.
 read 65536/65536 bytes at offset 65536
 64 KiB, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
+== allocate with backing ==
+Formatting 'TEST_DIR/t.IMGFMT', fmt=IMGFMT size=67108864
+Formatting 'TEST_DIR/t.IMGFMT.base', fmt=IMGFMT size=67108864
+wrote 65536/65536 bytes at offset 0
+64 KiB, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
+wrote 64/64 bytes at offset 0
+64 bytes, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
+read 64/64 bytes at offset 0
+64 bytes, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
+read 65472/65472 bytes at offset 64
+63.938 KiB, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
+read 67043328/67043328 bytes at offset 65536
+63.938 MiB, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
 *** done
-- 
2.35.3




[PATCH 0/2] block/parallels: Fix buffer-based write call

2022-07-14 Thread Hanna Reitz
Hi,

While reviewing Stefan’s libblkio driver series, I’ve noticed that
block/parallels.c contains a call to bdrv_co_pwritev() that doesn’t pass
a QEMUIOVector object but a plain buffer instead.  That seems wrong and
also pretty dangerous, so change it to a bdrv_co_pwrite() call (as I
assume it should be), and add a regression test demonstrating the
problem.


Hanna Reitz (2):
  block/parallels: Fix buffer-based write call
  iotests/131: Add parallels regression test

 block/parallels.c  |  4 ++--
 tests/qemu-iotests/131 | 35 ++-
 tests/qemu-iotests/131.out | 13 +
 3 files changed, 49 insertions(+), 3 deletions(-)

-- 
2.35.3




Re: [RFC v3 8/8] virtio-blk: use BDRV_REQ_REGISTERED_BUF optimization hint

2022-07-14 Thread Hanna Reitz

On 08.07.22 06:17, Stefan Hajnoczi wrote:

Register guest RAM using BlockRAMRegistrar and set the
BDRV_REQ_REGISTERED_BUF flag so block drivers can optimize memory
accesses in I/O requests.

This is for vdpa-blk, vhost-user-blk, and other I/O interfaces that rely
on DMA mapping/unmapping.

Signed-off-by: Stefan Hajnoczi 
---
  include/hw/virtio/virtio-blk.h |  2 ++
  hw/block/virtio-blk.c  | 13 +
  2 files changed, 11 insertions(+), 4 deletions(-)


Seems fair, but as said on patch 5, I’m quite wary of “register guest 
RAM”.  How can we guarantee that it won’t be too fragmented to be 
registerable with either nvme.c or blkio.c?


Hanna




Re: [RFC v3 7/8] blkio: implement BDRV_REQ_REGISTERED_BUF optimization

2022-07-14 Thread Hanna Reitz

On 08.07.22 06:17, Stefan Hajnoczi wrote:

Avoid bounce buffers when QEMUIOVector elements are within previously
registered bdrv_register_buf() buffers.

The idea is that emulated storage controllers will register guest RAM
using bdrv_register_buf() and set the BDRV_REQ_REGISTERED_BUF on I/O
requests. Therefore no blkio_map_mem_region() calls are necessary in the
performance-critical I/O code path.

This optimization doesn't apply if the I/O buffer is internally
allocated by QEMU (e.g. qcow2 metadata). There we still take the slow
path because BDRV_REQ_REGISTERED_BUF is not set.


Which keeps the question relevant of how slow the slow path is, i.e. 
whether it wouldn’t make sense to keep some of the mem regions allocated 
there in a cache instead of allocating/freeing them on every I/O request.



Signed-off-by: Stefan Hajnoczi 
---
  block/blkio.c | 104 --
  1 file changed, 101 insertions(+), 3 deletions(-)

diff --git a/block/blkio.c b/block/blkio.c
index 7fbdbd7fae..37d593a20c 100644
--- a/block/blkio.c
+++ b/block/blkio.c


[...]


@@ -198,6 +203,8 @@ static BlockAIOCB *blkio_aio_preadv(BlockDriverState *bs, 
int64_t offset,
  BlockCompletionFunc *cb, void *opaque)
  {
  BDRVBlkioState *s = bs->opaque;
+bool needs_mem_regions =
+s->needs_mem_regions && !(flags & BDRV_REQ_REGISTERED_BUF);


Is that condition sufficient?  bdrv_register_buf() has no way of 
returning an error, so it’s possible that buffers are silently not 
registered.  (And there are conditions in blkio_register_buf() where the 
buffer will not be registered, e.g. because it isn’t aligned.)


The caller knows nothing of this and will still pass 
BDRV_REQ_REGISTERED_BUF, and then we’ll assume the region is mapped but 
it won’t be.



  struct iovec *iov = qiov->iov;
  int iovcnt = qiov->niov;
  BlkioAIOCB *acb;


[...]


@@ -324,6 +333,80 @@ static void blkio_io_unplug(BlockDriverState *bs)
  }
  }
  
+static void blkio_register_buf(BlockDriverState *bs, void *host, size_t size)

+{
+BDRVBlkioState *s = bs->opaque;
+int ret;
+struct blkio_mem_region region = (struct blkio_mem_region){
+.addr = host,
+.len = size,
+.fd = -1,
+};
+
+if (((uintptr_t)host | size) % s->mem_region_alignment) {
+error_report_once("%s: skipping unaligned buf %p with size %zu",
+  __func__, host, size);
+return; /* skip unaligned */
+}


How big is mem-region-alignment generally?  Is it like 4k or is it going 
to be a real issue?


(Also, we could probably register a truncated region.  I know, that’ll 
break the BDRV_REQ_REGISTERED_BUF idea because the caller won’t know 
we’ve truncated it, but that’s no different than just not registering 
the buffer at all.)



+
+/* Attempt to find the fd for a MemoryRegion */
+if (s->needs_mem_region_fd) {
+int fd = -1;
+ram_addr_t offset;
+MemoryRegion *mr;
+
+/*
+ * bdrv_register_buf() is called with the BQL held so mr lives at least
+ * until this function returns.
+ */
+mr = memory_region_from_host(host, &offset);
+if (mr) {
+fd = memory_region_get_fd(mr);
+}


I don’t think it’s specified that buffers registered with 
bdrv_register_buf() must be within a single memory region, is it? So can 
we somehow verify that the memory region covers the whole buffer?



+if (fd == -1) {
+error_report_once("%s: skipping fd-less buf %p with size %zu",
+  __func__, host, size);
+return; /* skip if there is no fd */
+}
+
+region.fd = fd;
+region.fd_offset = offset;
+}
+
+WITH_QEMU_LOCK_GUARD(&s->lock) {
+ret = blkio_map_mem_region(s->blkio, &region);
+}
+
+if (ret < 0) {
+error_report_once("Failed to add blkio mem region %p with size %zu: 
%s",
+  host, size, blkio_get_error_msg());
+}
+}
+
+static void blkio_unregister_buf(BlockDriverState *bs, void *host, size_t size)
+{
+BDRVBlkioState *s = bs->opaque;
+int ret;
+struct blkio_mem_region region = (struct blkio_mem_region){
+.addr = host,
+.len = size,
+.fd = -1,
+};
+
+if (((uintptr_t)host | size) % s->mem_region_alignment) {
+return; /* skip unaligned */
+}
+
+WITH_QEMU_LOCK_GUARD(&s->lock) {
+ret = blkio_unmap_mem_region(s->blkio, &region);
+}


The documentation of libblkio says that “memory regions must be 
unmapped/freed with exactly the same `region` field values that they 
were mapped/allocated with.”  We don’t set .fd here, though.


It’s also unclear whether it’s allowed to unmap a region that wasn’t 
mapped, but I’ll trust libblkio to detect that.



+
+if (ret < 0) {
+error_report_once("Failed to delete blkio mem region %p with size %zu: 
%s",
+  host, size, blkio_get_error_msg());

Re: [RFC v3 6/8] stubs: add memory_region_from_host() and memory_region_get_fd()

2022-07-14 Thread Hanna Reitz

On 08.07.22 06:17, Stefan Hajnoczi wrote:

The blkio block driver will need to look up the file descriptor for a
given pointer. This is possible in softmmu builds where the memory API
is available for querying guest RAM.

Add stubs so tools like qemu-img that link the block layer still build
successfully. In this case there is no guest RAM but that is fine.
Bounce buffers and their file descriptors will be allocated with
libblkio's blkio_alloc_mem_region() so we won't rely on QEMU's
memory_region_get_fd() in that case.

Signed-off-by: Stefan Hajnoczi 
---
  stubs/memory.c| 13 +
  stubs/meson.build |  1 +
  2 files changed, 14 insertions(+)
  create mode 100644 stubs/memory.c


Reviewed-by: Hanna Reitz 




Re: [RFC v3 5/8] block: add BlockRAMRegistrar

2022-07-14 Thread Hanna Reitz

On 08.07.22 06:17, Stefan Hajnoczi wrote:

Emulated devices and other BlockBackend users wishing to take advantage
of blk_register_buf() all have the same repetitive job: register
RAMBlocks with the BlockBackend using RAMBlockNotifier.

Add a BlockRAMRegistrar API to do this. A later commit will use this
from hw/block/virtio-blk.c.

Signed-off-by: Stefan Hajnoczi 
---
  MAINTAINERS  |  1 +
  include/sysemu/block-ram-registrar.h | 30 +
  block/block-ram-registrar.c  | 39 
  block/meson.build|  1 +
  4 files changed, 71 insertions(+)
  create mode 100644 include/sysemu/block-ram-registrar.h
  create mode 100644 block/block-ram-registrar.c


What memory is handled in ram_list?  Is it everything?  If so, won’t 
devices have trouble registering all those buffers, especially if they 
happen to be fragmented in physical memory? (nvme_register_buf() seems 
to say it can run out of slots quite easily.)



diff --git a/MAINTAINERS b/MAINTAINERS
index 50f340d9ee..d16189449f 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -2490,6 +2490,7 @@ F: block*
  F: block/
  F: hw/block/
  F: include/block/
+F: include/sysemu/block-*.h
  F: qemu-img*
  F: docs/tools/qemu-img.rst
  F: qemu-io*


Sneaky. ;)


diff --git a/include/sysemu/block-ram-registrar.h 
b/include/sysemu/block-ram-registrar.h
new file mode 100644
index 00..09d63f64b2
--- /dev/null
+++ b/include/sysemu/block-ram-registrar.h
@@ -0,0 +1,30 @@
+/*
+ * BlockBackend RAM Registrar
+ *
+ * SPDX-License-Identifier: GPL-2.0-or-later
+ */
+
+#ifndef BLOCK_RAM_REGISTRAR_H
+#define BLOCK_RAM_REGISTRAR_H
+
+#include "exec/ramlist.h"
+
+/**
+ * struct BlockRAMRegistrar:
+ *
+ * Keeps RAMBlock memory registered with a BlockBackend using
+ * blk_register_buf() including hotplugged memory.
+ *
+ * Emulated devices or other BlockBackend users initialize a BlockRAMRegistrar
+ * with blk_ram_registrar_init() before submitting I/O requests with the
+ * BLK_REQ_REGISTERED_BUF flag set.


s/BLK/BDRV/, right?


+ */
+typedef struct {
+BlockBackend *blk;
+RAMBlockNotifier notifier;
+} BlockRAMRegistrar;
+
+void blk_ram_registrar_init(BlockRAMRegistrar *r, BlockBackend *blk);
+void blk_ram_registrar_destroy(BlockRAMRegistrar *r);
+
+#endif /* BLOCK_RAM_REGISTRAR_H */





Re: [RFC v3 4/8] block: add BDRV_REQ_REGISTERED_BUF request flag

2022-07-14 Thread Hanna Reitz

On 08.07.22 06:17, Stefan Hajnoczi wrote:

Block drivers may optimize I/O requests accessing buffers previously
registered with bdrv_register_buf(). Checking whether all elements of a
request's QEMUIOVector are within previously registered buffers is
expensive, so we need a hint from the user to avoid costly checks.

Add a BDRV_REQ_REGISTERED_BUF request flag to indicate that all
QEMUIOVector elements in an I/O request are known to be within
previously registered buffers.

bdrv_aligned_preadv() is strict in validating supported read flags and
its assertions fail when it sees BDRV_REQ_REGISTERED_BUF. There is no
harm in passing BDRV_REQ_REGISTERED_BUF to block drivers that do not
support it, so update the assertions to ignore BDRV_REQ_REGISTERED_BUF.

Care must be taken to clear the flag when the block layer or filter
drivers replace QEMUIOVector elements with bounce buffers since these
have not been registered with bdrv_register_buf(). A lot of the changes
in this commit deal with clearing the flag in those cases.

Ensuring that the flag is cleared properly is somewhat invasive to
implement across the block layer and it's hard to spot when future code
changes accidentally break it. Another option might be to add a flag to
QEMUIOVector itself and clear it in qemu_iovec_*() functions that modify
elements. That is more robust but somewhat of a layering violation, so I
haven't attempted that.


Yeah...  I will say that most read code already looks quite reasonable 
in that it’ll pass @flags to lower layers basically only if it’s an 
unmodified request, so it seems like in the past most people have 
already adhered to “don’t pass on any flags if you’re reading to a local 
bounce buffer”.



Signed-off-by: Stefan Hajnoczi
---
  include/block/block-common.h |  9 +
  block/blkverify.c|  4 ++--
  block/crypto.c   |  2 ++
  block/io.c   | 30 +++---
  block/mirror.c   |  2 ++
  block/raw-format.c   |  2 ++
  6 files changed, 40 insertions(+), 9 deletions(-)


Some things not covered here that look a bit wrong:

While bdrv_driver_preadv() asserts that the flags don’t contain anything 
the driver couldn’t handle (and this new flag is made exempt from that 
assertion here in this patch), bdrv_driver_pwritev() just hides those 
flags from drivers silently. I think just like we exempt the new flag 
from the assertion in bdrv_driver_preadv(), we should have 
bdrv_driver_pwritev() always pass it to drivers.


The following driver read/write functions assert that @flags is 0, which 
is probably no longer ideal:

- bdrv_qed_co_writev()
- block_crypto_co_preadv()
- nbd_client_co_preadv()
- parallels_co_writev()
- qcow_co_preadv()
- qcow_co_pwritev()
- qemu_gluster_co_writev()
- raw_co_pwritev() (block/file-posix.c)
- replication_co_writev()
- ssh_co_writev()
- vhdx_co_writev()

snapshot_access_co_preadv_part() returns an error when any flags are 
set, but should probably ignore BDRV_REQ_REGISTERED_BUF for this check.



While looking around, I spotted a couple of places that look like they 
could pass the flag on but currently don’t (just FYI, not asking for 
anything here):


bdrv_co_do_copy_on_readv() never passes the flags through to its calls, 
but I think it could pass this flag on in the one bdrv_driver_preadv() 
call where it doesn’t use a bounce buffer (“Read directly into the 
destination”).


qcow2’s qcow2_co_preadv_task() and qcow2_co_pwritev_task() (besides the 
encryption part) also look like they should pass this flag on, but, 
well, the functions themselves currently don’t get the flag, so they can’t.


qcow1’s qcow_co_preadv() and qcow_co_pwritev() are so-so, sometimes 
using a bounce buffer, and sometimes not, but those function could use 
optimization in general if anyone cared.


vpc_co_preadv()’s and vpc_co_pwritev()’s first 
bdrv_co_preadv()/bdrv_co_pwritev() invocations look straightforward, but 
as with qcow1, not sure if anyone cares.


I’m too lazy to thoroughly check what’s going on with 
qed_aio_write_main().  Passing 0 is safe, and it doesn’t get the 
original request flags, so I guess doing anything about this would be 
difficult.


quorum’s read_fifo_child() probably could pass acb->flags. Probably.  
Perhaps not.  Difficult to say it is.


block/replication.c also looks like a candidate for passing flags, but 
personally, I’d like to refrain from touching it.  (Well, besides the 
fact that replication_co_writev() asserts that @flags is 0.)



(And finally, I found that block/parallels.c invokes bdrv_co_pwritev() 
with a buffer instead of an I/O vector, which looks really wrong, but 
has nothing to do with this patch.)


[...]


diff --git a/block/io.c b/block/io.c
index e7f4117fe7..83b8259227 100644
--- a/block/io.c
+++ b/block/io.c


[...]


@@ -1902,6 +1910,11 @@ static int coroutine_fn 
bdrv_co_do_pwrite_zeroes(BlockDriverState *bs,
  return -ENOTSUP;
  }
  
+/* By definition there is 

Re: [RFC v3 3/8] block: pass size to bdrv_unregister_buf()

2022-07-13 Thread Hanna Reitz

On 08.07.22 06:17, Stefan Hajnoczi wrote:

The only implementor of bdrv_register_buf() is block/nvme.c, where the
size is not needed when unregistering a buffer. This is because
util/vfio-helpers.c can look up mappings by address.

Future block drivers that implement bdrv_register_buf() may not be able
to do their job given only the buffer address. Add a size argument to
bdrv_unregister_buf().

Also document the assumptions about
bdrv_register_buf()/bdrv_unregister_buf() calls. The same 
values that were given to bdrv_register_buf() must be given to
bdrv_unregister_buf().

gcc 11.2.1 emits a spurious warning that img_bench()'s buf_size local
variable might be uninitialized, so it's necessary to silence the
compiler.

Signed-off-by: Stefan Hajnoczi 
---
  include/block/block-global-state.h  | 5 -
  include/block/block_int-common.h| 2 +-
  include/sysemu/block-backend-global-state.h | 2 +-
  block/block-backend.c   | 4 ++--
  block/io.c  | 6 +++---
  block/nvme.c| 2 +-
  qemu-img.c  | 4 ++--
  7 files changed, 14 insertions(+), 11 deletions(-)


Reviewed-by: Hanna Reitz 




Re: [RFC v3 1/8] blkio: add io_uring block driver using libblkio

2022-07-13 Thread Hanna Reitz

On 08.07.22 06:17, Stefan Hajnoczi wrote:

libblkio (https://gitlab.com/libblkio/libblkio/) is a library for
high-performance disk I/O. It currently supports io_uring and
virtio-blk-vhost-vdpa with additional drivers under development.

One of the reasons for developing libblkio is that other applications
besides QEMU can use it. This will be particularly useful for
vhost-user-blk which applications may wish to use for connecting to
qemu-storage-daemon.

libblkio also gives us an opportunity to develop in Rust behind a C API
that is easy to consume from QEMU.

This commit adds io_uring and virtio-blk-vhost-vdpa BlockDrivers to QEMU
using libblkio. It will be easy to add other libblkio drivers since they
will share the majority of code.

For now I/O buffers are copied through bounce buffers if the libblkio
driver requires it. Later commits add an optimization for
pre-registering guest RAM to avoid bounce buffers.

The syntax is:

   --blockdev 
io_uring,node-name=drive0,filename=test.img,readonly=on|off,cache.direct=on|off

and:

   --blockdev 
virtio-blk-vhost-vdpa,node-name=drive0,path=/dev/vdpa...,readonly=on|off

Signed-off-by: Stefan Hajnoczi 
---
  MAINTAINERS   |   6 +
  meson_options.txt |   2 +
  qapi/block-core.json  |  37 +-
  meson.build   |   9 +
  block/blkio.c | 659 ++
  tests/qtest/modules-test.c|   3 +
  block/meson.build |   1 +
  scripts/meson-buildoptions.sh |   3 +
  8 files changed, 718 insertions(+), 2 deletions(-)
  create mode 100644 block/blkio.c


[...]


diff --git a/block/blkio.c b/block/blkio.c
new file mode 100644
index 00..7fbdbd7fae
--- /dev/null
+++ b/block/blkio.c
@@ -0,0 +1,659 @@


Not sure whether it’s necessary, but I would have expected a copyright 
header here.



+#include "qemu/osdep.h"
+#include 
+#include "block/block_int.h"
+#include "qapi/error.h"
+#include "qapi/qmp/qdict.h"
+#include "qemu/module.h"
+
+typedef struct BlkAIOCB {
+BlockAIOCB common;
+struct blkio_mem_region mem_region;
+QEMUIOVector qiov;
+struct iovec bounce_iov;
+} BlkioAIOCB;
+
+typedef struct {
+/* Protects ->blkio and request submission on ->blkioq */
+QemuMutex lock;
+
+struct blkio *blkio;
+struct blkioq *blkioq; /* this could be multi-queue in the future */
+int completion_fd;
+
+/* Polling fetches the next completion into this field */
+struct blkio_completion poll_completion;
+
+/* The value of the "mem-region-alignment" property */
+size_t mem_region_alignment;
+
+/* Can we skip adding/deleting blkio_mem_regions? */
+bool needs_mem_regions;
+} BDRVBlkioState;
+
+static void blkio_aiocb_complete(BlkioAIOCB *acb, int ret)
+{
+/* Copy bounce buffer back to qiov */
+if (acb->qiov.niov > 0) {
+qemu_iovec_from_buf(&acb->qiov, 0,
+acb->bounce_iov.iov_base,
+acb->bounce_iov.iov_len);
+qemu_iovec_destroy(&acb->qiov);
+}
+
+acb->common.cb(acb->common.opaque, ret);
+
+if (acb->mem_region.len > 0) {
+BDRVBlkioState *s = acb->common.bs->opaque;
+
+WITH_QEMU_LOCK_GUARD(&s->lock) {
+blkio_free_mem_region(s->blkio, &acb->mem_region);
+}
+}
+
+qemu_aio_unref(&acb->common);
+}
+
+/*
+ * Only the thread that calls aio_poll() invokes fd and poll handlers.
+ * Therefore locks are not necessary except when accessing s->blkio.
+ *
+ * No locking is performed around blkioq_get_completions() although other
+ * threads may submit I/O requests on s->blkioq. We're assuming there is no
+ * interference between blkioq_get_completions() and other s->blkioq APIs.
+ */
+
+static void blkio_completion_fd_read(void *opaque)
+{
+BlockDriverState *bs = opaque;
+BDRVBlkioState *s = bs->opaque;
+struct blkio_completion completion;
+uint64_t val;
+ssize_t ret __attribute__((unused));


I’d prefer a `(void)ret;` over this attribute, not least because that 
line would give a nice opportunity to explain in a short comment why we 
ignore this return value that the compiler tells us not to ignore, but 
if you don’t, then this’ll be fine.



+
+/* Polling may have already fetched a completion */
+if (s->poll_completion.user_data != NULL) {
+completion = s->poll_completion;
+
+/* Clear it in case blkio_aiocb_complete() has a nested event loop */
+s->poll_completion.user_data = NULL;
+
+blkio_aiocb_complete(completion.user_data, completion.ret);
+}
+
+/* Reset completion fd status */
+ret = read(s->completion_fd, &val, sizeof(val));
+
+/*
+ * Reading one completion at a time makes nested event loop re-entrancy
+ * simple. Change this loop to get multiple completions in one go if it
+ * becomes a performance bottleneck.
+ */
+while (blkioq_do_io(s->blkioq, &completion, 0, 1, NULL) == 1) {
+blkio_aiocb_complete(completion.user_data, completion.ret);
+}
+}
+
+static bool 

[PULL 35/35] vl: Unlink absolute PID file path

2022-07-12 Thread Hanna Reitz
After writing the PID file, we register an exit notifier to unlink it
when the process terminates.  However, if the process has changed its
working directory in the meantime (e.g. in os_setup_post() when
daemonizing), this will not work when the PID file path was relative.
Therefore, pass the absolute path (created with realpath()) to the
unlink() call in the exit notifier.

(realpath() needs a path pointing to an existing file, so we cannot use
it before qemu_write_pidfile().)

Reproducer:
$ cd /tmp
$ qemu-system-x86_64 --daemonize --pidfile qemu.pid
$ file qemu.pid
qemu.pid: ASCII text
$ kill $(cat qemu.pid)
$ file qemu.pid
qemu.pid: ASCII text

(qemu.pid should be gone after the process has terminated.)

Signed-off-by: Hanna Reitz 
Message-Id: <20220609122701.17172-4-hre...@redhat.com>
Reviewed-by: Daniel P. Berrangé 
---
 softmmu/vl.c | 30 ++
 1 file changed, 26 insertions(+), 4 deletions(-)

diff --git a/softmmu/vl.c b/softmmu/vl.c
index 36f46fcdad..aabd82e09a 100644
--- a/softmmu/vl.c
+++ b/softmmu/vl.c
@@ -1522,11 +1522,18 @@ machine_parse_property_opt(QemuOptsList *opts_list, const char *propname,
 }
 
 static const char *pid_file;
-static Notifier qemu_unlink_pidfile_notifier;
+struct UnlinkPidfileNotifier {
+Notifier notifier;
+char *pid_file_realpath;
+};
+static struct UnlinkPidfileNotifier qemu_unlink_pidfile_notifier;
 
 static void qemu_unlink_pidfile(Notifier *n, void *data)
 {
-unlink(pid_file);
+struct UnlinkPidfileNotifier *upn;
+
+upn = DO_UPCAST(struct UnlinkPidfileNotifier, notifier, n);
+unlink(upn->pid_file_realpath);
 }
 
 static const QEMUOption *lookup_opt(int argc, char **argv,
@@ -2430,13 +2437,28 @@ static void qemu_maybe_daemonize(const char *pid_file)
 rcu_disable_atfork();
 
 if (pid_file) {
+char *pid_file_realpath = NULL;
+
 if (!qemu_write_pidfile(pid_file, &err)) {
 error_reportf_err(err, "cannot create PID file: ");
 exit(1);
 }
 
-qemu_unlink_pidfile_notifier.notify = qemu_unlink_pidfile;
-qemu_add_exit_notifier(&qemu_unlink_pidfile_notifier);
+pid_file_realpath = g_malloc0(PATH_MAX);
+if (!realpath(pid_file, pid_file_realpath)) {
+error_report("cannot resolve PID file path: %s: %s",
+ pid_file, strerror(errno));
+unlink(pid_file);
+exit(1);
+}
+
+qemu_unlink_pidfile_notifier = (struct UnlinkPidfileNotifier) {
+.notifier = {
+.notify = qemu_unlink_pidfile,
+},
+.pid_file_realpath = pid_file_realpath,
+};
+qemu_add_exit_notifier(&qemu_unlink_pidfile_notifier.notifier);
 }
 }
 
-- 
2.35.3




[PULL 31/35] qsd: Do not use error_report() before monitor_init

2022-07-12 Thread Hanna Reitz
error_report() only works once monitor_init_globals_core() has been
called, which is not the case when parsing the --daemonize option.  Use
fprintf(stderr, ...) instead.

Fixes: 2525edd85fec53e23fda98974a15e3b3c8957596 ("qsd: Add --daemonize")
Signed-off-by: Hanna Reitz 
Message-Id: <20220609122852.21140-1-hre...@redhat.com>
Reviewed-by: Philippe Mathieu-Daudé 
---
 storage-daemon/qemu-storage-daemon.c | 6 +-
 1 file changed, 5 insertions(+), 1 deletion(-)

diff --git a/storage-daemon/qemu-storage-daemon.c b/storage-daemon/qemu-storage-daemon.c
index b8e910f220..448c318e23 100644
--- a/storage-daemon/qemu-storage-daemon.c
+++ b/storage-daemon/qemu-storage-daemon.c
@@ -296,7 +296,11 @@ static void process_options(int argc, char *argv[], bool pre_init_pass)
 }
 case OPTION_DAEMONIZE:
 if (os_set_daemonize(true) < 0) {
-error_report("--daemonize not supported in this build");
+/*
+ * --daemonize is parsed before monitor_init_globals_core(), so
+ * error_report() does not work yet
+ */
+fprintf(stderr, "--daemonize not supported in this build\n");
 exit(EXIT_FAILURE);
 }
 break;
-- 
2.35.3




[PULL 25/35] block: Implement blk_pdiscard() using generated_co_wrapper

2022-07-12 Thread Hanna Reitz
From: Alberto Faria 

Signed-off-by: Alberto Faria 
Reviewed-by: Paolo Bonzini 
Reviewed-by: Hanna Reitz 
Message-Id: <20220705161527.1054072-14-afa...@redhat.com>
Signed-off-by: Hanna Reitz 
---
 block/coroutines.h|  3 ---
 include/sysemu/block-backend-io.h |  3 ++-
 block/block-backend.c | 12 
 3 files changed, 2 insertions(+), 16 deletions(-)

diff --git a/block/coroutines.h b/block/coroutines.h
index 94fd283f62..2693ecabfb 100644
--- a/block/coroutines.h
+++ b/block/coroutines.h
@@ -110,9 +110,6 @@ nbd_do_establish_connection(BlockDriverState *bs, bool blocking, Error **errp);
 int generated_co_wrapper
 blk_do_ioctl(BlockBackend *blk, unsigned long int req, void *buf);
 
-int generated_co_wrapper
-blk_do_pdiscard(BlockBackend *blk, int64_t offset, int64_t bytes);
-
 int generated_co_wrapper blk_do_flush(BlockBackend *blk);
 
 #endif /* BLOCK_COROUTINES_H */
diff --git a/include/sysemu/block-backend-io.h b/include/sysemu/block-backend-io.h
index 346ca47fed..5a12a1f74f 100644
--- a/include/sysemu/block-backend-io.h
+++ b/include/sysemu/block-backend-io.h
@@ -159,6 +159,8 @@ static inline int coroutine_fn blk_co_pwrite(BlockBackend *blk, int64_t offset,
 return blk_co_pwritev(blk, offset, bytes, &qiov, flags);
 }
 
+int generated_co_wrapper blk_pdiscard(BlockBackend *blk, int64_t offset,
+  int64_t bytes);
 int coroutine_fn blk_co_pdiscard(BlockBackend *blk, int64_t offset,
  int64_t bytes);
 
@@ -172,7 +174,6 @@ int generated_co_wrapper blk_pwrite_compressed(BlockBackend *blk,
const void *buf);
 int coroutine_fn blk_co_pwrite_compressed(BlockBackend *blk, int64_t offset,
   int64_t bytes, const void *buf);
-int blk_pdiscard(BlockBackend *blk, int64_t offset, int64_t bytes);
 int generated_co_wrapper blk_pwrite_zeroes(BlockBackend *blk, int64_t offset,
int64_t bytes,
BdrvRequestFlags flags);
diff --git a/block/block-backend.c b/block/block-backend.c
index 85661d1823..71585c0c0a 100644
--- a/block/block-backend.c
+++ b/block/block-backend.c
@@ -1711,18 +1711,6 @@ int coroutine_fn blk_co_pdiscard(BlockBackend *blk, int64_t offset,
 return ret;
 }
 
-int blk_pdiscard(BlockBackend *blk, int64_t offset, int64_t bytes)
-{
-int ret;
-IO_OR_GS_CODE();
-
-blk_inc_in_flight(blk);
-ret = blk_do_pdiscard(blk, offset, bytes);
-blk_dec_in_flight(blk);
-
-return ret;
-}
-
 /* To be called between exactly one pair of blk_inc/dec_in_flight() */
 int coroutine_fn blk_co_do_flush(BlockBackend *blk)
 {
-- 
2.35.3




[PULL 18/35] block: Implement blk_{pread, pwrite}() using generated_co_wrapper

2022-07-12 Thread Hanna Reitz
From: Alberto Faria 

We need to add include/sysemu/block-backend-io.h to the inputs of the
block-gen.c target defined in block/meson.build.

Signed-off-by: Alberto Faria 
Reviewed-by: Hanna Reitz 
Message-Id: <20220705161527.1054072-7-afa...@redhat.com>
Signed-off-by: Hanna Reitz 
---
 block/coroutines.h|  4 
 include/sysemu/block-backend-io.h | 10 ++
 block/block-backend.c | 23 ---
 block/meson.build |  1 +
 4 files changed, 7 insertions(+), 31 deletions(-)

diff --git a/block/coroutines.h b/block/coroutines.h
index 3f41238b33..443ef2f2e6 100644
--- a/block/coroutines.h
+++ b/block/coroutines.h
@@ -112,10 +112,6 @@ bdrv_common_block_status_above(BlockDriverState *bs,
 int generated_co_wrapper
 nbd_do_establish_connection(BlockDriverState *bs, bool blocking, Error **errp);
 
-int generated_co_wrapper
-blk_do_preadv(BlockBackend *blk, int64_t offset, int64_t bytes,
-  QEMUIOVector *qiov, BdrvRequestFlags flags);
-
 int generated_co_wrapper
 blk_do_pwritev_part(BlockBackend *blk, int64_t offset, int64_t bytes,
 QEMUIOVector *qiov, size_t qiov_offset,
diff --git a/include/sysemu/block-backend-io.h b/include/sysemu/block-backend-io.h
index b048476a47..750089fc9b 100644
--- a/include/sysemu/block-backend-io.h
+++ b/include/sysemu/block-backend-io.h
@@ -101,10 +101,12 @@ int coroutine_fn blk_co_copy_range(BlockBackend *blk_in, int64_t off_in,
  * the "I/O or GS" API.
  */
 
-int blk_pread(BlockBackend *blk, int64_t offset, int64_t bytes, void *buf,
-  BdrvRequestFlags flags);
-int blk_pwrite(BlockBackend *blk, int64_t offset, int64_t bytes,
-   const void *buf, BdrvRequestFlags flags);
+int generated_co_wrapper blk_pread(BlockBackend *blk, int64_t offset,
+   int64_t bytes, void *buf,
+   BdrvRequestFlags flags);
+int generated_co_wrapper blk_pwrite(BlockBackend *blk, int64_t offset,
+int64_t bytes, const void *buf,
+BdrvRequestFlags flags);
 int coroutine_fn blk_co_preadv(BlockBackend *blk, int64_t offset,
int64_t bytes, QEMUIOVector *qiov,
BdrvRequestFlags flags);
diff --git a/block/block-backend.c b/block/block-backend.c
index 3203b25695..4d6c18bc39 100644
--- a/block/block-backend.c
+++ b/block/block-backend.c
@@ -1563,29 +1563,6 @@ BlockAIOCB *blk_aio_pwrite_zeroes(BlockBackend *blk, int64_t offset,
 flags | BDRV_REQ_ZERO_WRITE, cb, opaque);
 }
 
-int blk_pread(BlockBackend *blk, int64_t offset, int64_t bytes, void *buf,
-  BdrvRequestFlags flags)
-{
-int ret;
-QEMUIOVector qiov = QEMU_IOVEC_INIT_BUF(qiov, buf, bytes);
-IO_OR_GS_CODE();
-
-blk_inc_in_flight(blk);
-ret = blk_do_preadv(blk, offset, bytes, &qiov, flags);
-blk_dec_in_flight(blk);
-
-return ret;
-}
-
-int blk_pwrite(BlockBackend *blk, int64_t offset, int64_t bytes,
-   const void *buf, BdrvRequestFlags flags)
-{
-QEMUIOVector qiov = QEMU_IOVEC_INIT_BUF(qiov, buf, bytes);
-IO_OR_GS_CODE();
-
-return blk_pwritev_part(blk, offset, bytes, &qiov, 0, flags);
-}
-
 int64_t blk_getlength(BlockBackend *blk)
 {
 IO_CODE();
diff --git a/block/meson.build b/block/meson.build
index 0b2a60c99b..60bc305597 100644
--- a/block/meson.build
+++ b/block/meson.build
@@ -136,6 +136,7 @@ block_gen_c = custom_target('block-gen.c',
 input: files(
   '../include/block/block-io.h',
   '../include/block/block-global-state.h',
+  '../include/sysemu/block-backend-io.h',
   'coroutines.h'
   ),
 command: [wrapper_py, '@OUTPUT@', '@INPUT@'])
-- 
2.35.3




[PULL 32/35] iotests/297: Have mypy ignore unused ignores

2022-07-12 Thread Hanna Reitz
e7874a50ff3f5b20fb46f36958ad ("python: update for mypy 0.950") has added
`warn_unused_ignores = False` to python/setup.cfg, to be able to keep
compatibility with both pre- and post-0.950 mypy versions.

The iotests' mypy.ini needs the same, or 297 will fail (on both pre- and
post-0.950 mypy, as far as I can tell; just for different `ignore`
lines).

Signed-off-by: Hanna Reitz 
Message-Id: <20220621092536.19837-1-hre...@redhat.com>
Acked-by: John Snow 
---
 tests/qemu-iotests/mypy.ini | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/tests/qemu-iotests/mypy.ini b/tests/qemu-iotests/mypy.ini
index 4c0339f558..d66ffc2e3c 100644
--- a/tests/qemu-iotests/mypy.ini
+++ b/tests/qemu-iotests/mypy.ini
@@ -9,4 +9,4 @@ no_implicit_optional = True
 scripts_are_modules = True
 warn_redundant_casts = True
 warn_unused_configs = True
-warn_unused_ignores = True
+warn_unused_ignores = False
-- 
2.35.3




[PULL 22/35] block: Change blk_pwrite_compressed() param order

2022-07-12 Thread Hanna Reitz
From: Alberto Faria 

Swap 'buf' and 'bytes' around for consistency with other I/O functions.

Signed-off-by: Alberto Faria 
Reviewed-by: Paolo Bonzini 
Reviewed-by: Hanna Reitz 
Message-Id: <20220705161527.1054072-11-afa...@redhat.com>
Signed-off-by: Hanna Reitz 
---
 include/sysemu/block-backend-io.h | 4 ++--
 block/block-backend.c | 4 ++--
 qemu-img.c| 2 +-
 qemu-io-cmds.c| 2 +-
 4 files changed, 6 insertions(+), 6 deletions(-)

diff --git a/include/sysemu/block-backend-io.h b/include/sysemu/block-backend-io.h
index 93a15071c4..695b793a72 100644
--- a/include/sysemu/block-backend-io.h
+++ b/include/sysemu/block-backend-io.h
@@ -167,8 +167,8 @@ int blk_flush(BlockBackend *blk);
 
 int blk_ioctl(BlockBackend *blk, unsigned long int req, void *buf);
 
-int blk_pwrite_compressed(BlockBackend *blk, int64_t offset, const void *buf,
-  int64_t bytes);
+int blk_pwrite_compressed(BlockBackend *blk, int64_t offset, int64_t bytes,
+  const void *buf);
 int blk_pdiscard(BlockBackend *blk, int64_t offset, int64_t bytes);
 int blk_pwrite_zeroes(BlockBackend *blk, int64_t offset,
   int64_t bytes, BdrvRequestFlags flags);
diff --git a/block/block-backend.c b/block/block-backend.c
index c75942b773..60fb7eb3f2 100644
--- a/block/block-backend.c
+++ b/block/block-backend.c
@@ -2314,8 +2314,8 @@ int coroutine_fn blk_co_pwrite_zeroes(BlockBackend *blk, int64_t offset,
   flags | BDRV_REQ_ZERO_WRITE);
 }
 
-int blk_pwrite_compressed(BlockBackend *blk, int64_t offset, const void *buf,
-  int64_t bytes)
+int blk_pwrite_compressed(BlockBackend *blk, int64_t offset, int64_t bytes,
+  const void *buf)
 {
 QEMUIOVector qiov = QEMU_IOVEC_INIT_BUF(qiov, buf, bytes);
 IO_OR_GS_CODE();
diff --git a/qemu-img.c b/qemu-img.c
index df607a2a2e..7d4b33b3da 100644
--- a/qemu-img.c
+++ b/qemu-img.c
@@ -2114,7 +2114,7 @@ static int convert_do_copy(ImgConvertState *s)
 
 if (s->compressed && !s->ret) {
 /* signal EOF to align */
-ret = blk_pwrite_compressed(s->target, 0, NULL, 0);
+ret = blk_pwrite_compressed(s->target, 0, 0, NULL);
 if (ret < 0) {
 return ret;
 }
diff --git a/qemu-io-cmds.c b/qemu-io-cmds.c
index c8cbaed0cd..952dc940f1 100644
--- a/qemu-io-cmds.c
+++ b/qemu-io-cmds.c
@@ -631,7 +631,7 @@ static int do_write_compressed(BlockBackend *blk, char *buf, int64_t offset,
 return -ERANGE;
 }
 
-ret = blk_pwrite_compressed(blk, offset, buf, bytes);
+ret = blk_pwrite_compressed(blk, offset, bytes, buf);
 if (ret < 0) {
 return ret;
 }
-- 
2.35.3




[PULL 30/35] block: Remove remaining unused symbols in coroutines.h

2022-07-12 Thread Hanna Reitz
From: Alberto Faria 

Some can be made static, others are unused generated_co_wrappers.

Signed-off-by: Alberto Faria 
Reviewed-by: Paolo Bonzini 
Reviewed-by: Hanna Reitz 
Message-Id: <20220705161527.1054072-19-afa...@redhat.com>
Signed-off-by: Hanna Reitz 
---
 block/coroutines.h| 19 ---
 block/block-backend.c |  6 +++---
 2 files changed, 3 insertions(+), 22 deletions(-)

diff --git a/block/coroutines.h b/block/coroutines.h
index d66551a277..3a2bad564f 100644
--- a/block/coroutines.h
+++ b/block/coroutines.h
@@ -63,17 +63,6 @@ nbd_co_do_establish_connection(BlockDriverState *bs, bool blocking,
Error **errp);
 
 
-int coroutine_fn
-blk_co_do_pwritev_part(BlockBackend *blk, int64_t offset, int64_t bytes,
-   QEMUIOVector *qiov, size_t qiov_offset,
-   BdrvRequestFlags flags);
-
-int coroutine_fn
-blk_co_do_pdiscard(BlockBackend *blk, int64_t offset, int64_t bytes);
-
-int coroutine_fn blk_co_do_flush(BlockBackend *blk);
-
-
 /*
  * "I/O or GS" API functions. These functions can run without
  * the BQL, but only in one specific iothread/main loop.
@@ -82,14 +71,6 @@ int coroutine_fn blk_co_do_flush(BlockBackend *blk);
  * the "I/O or GS" API.
  */
 
-int generated_co_wrapper
-bdrv_preadv(BdrvChild *child, int64_t offset, int64_t bytes,
-QEMUIOVector *qiov, BdrvRequestFlags flags);
-
-int generated_co_wrapper
-bdrv_pwritev(BdrvChild *child, int64_t offset, int64_t bytes,
- QEMUIOVector *qiov, BdrvRequestFlags flags);
-
 int generated_co_wrapper
 bdrv_common_block_status_above(BlockDriverState *bs,
BlockDriverState *base,
diff --git a/block/block-backend.c b/block/block-backend.c
index f3d6525923..d4a5df2ac2 100644
--- a/block/block-backend.c
+++ b/block/block-backend.c
@@ -1354,7 +1354,7 @@ int coroutine_fn blk_co_preadv_part(BlockBackend *blk, int64_t offset,
 }
 
 /* To be called between exactly one pair of blk_inc/dec_in_flight() */
-int coroutine_fn
+static int coroutine_fn
 blk_co_do_pwritev_part(BlockBackend *blk, int64_t offset, int64_t bytes,
QEMUIOVector *qiov, size_t qiov_offset,
BdrvRequestFlags flags)
@@ -1687,7 +1687,7 @@ BlockAIOCB *blk_aio_ioctl(BlockBackend *blk, unsigned long int req, void *buf,
 }
 
 /* To be called between exactly one pair of blk_inc/dec_in_flight() */
-int coroutine_fn
+static int coroutine_fn
 blk_co_do_pdiscard(BlockBackend *blk, int64_t offset, int64_t bytes)
 {
 int ret;
@@ -1735,7 +1735,7 @@ int coroutine_fn blk_co_pdiscard(BlockBackend *blk, int64_t offset,
 }
 
 /* To be called between exactly one pair of blk_inc/dec_in_flight() */
-int coroutine_fn blk_co_do_flush(BlockBackend *blk)
+static int coroutine_fn blk_co_do_flush(BlockBackend *blk)
 {
 blk_wait_while_drained(blk);
 IO_CODE();
-- 
2.35.3




[PULL 34/35] vl: Conditionally register PID file unlink notifier

2022-07-12 Thread Hanna Reitz
Currently, the exit notifier for unlinking the PID file is registered
unconditionally.  Limit it to only when we actually do create a PID
file.

Signed-off-by: Hanna Reitz 
Message-Id: <20220609122701.17172-3-hre...@redhat.com>
Reviewed-by: Daniel P. Berrangé 
---
 softmmu/vl.c | 18 +-
 1 file changed, 9 insertions(+), 9 deletions(-)

diff --git a/softmmu/vl.c b/softmmu/vl.c
index 3f264d4b09..36f46fcdad 100644
--- a/softmmu/vl.c
+++ b/softmmu/vl.c
@@ -1526,9 +1526,7 @@ static Notifier qemu_unlink_pidfile_notifier;
 
 static void qemu_unlink_pidfile(Notifier *n, void *data)
 {
-if (pid_file) {
-unlink(pid_file);
-}
+unlink(pid_file);
 }
 
 static const QEMUOption *lookup_opt(int argc, char **argv,
@@ -2431,13 +2429,15 @@ static void qemu_maybe_daemonize(const char *pid_file)
 os_daemonize();
 rcu_disable_atfork();
 
-if (pid_file && !qemu_write_pidfile(pid_file, &err)) {
-error_reportf_err(err, "cannot create PID file: ");
-exit(1);
-}
+if (pid_file) {
+if (!qemu_write_pidfile(pid_file, &err)) {
+error_reportf_err(err, "cannot create PID file: ");
+exit(1);
+}
 
-qemu_unlink_pidfile_notifier.notify = qemu_unlink_pidfile;
-qemu_add_exit_notifier(&qemu_unlink_pidfile_notifier);
+qemu_unlink_pidfile_notifier.notify = qemu_unlink_pidfile;
+qemu_add_exit_notifier(&qemu_unlink_pidfile_notifier);
+}
 }
 
 static void qemu_init_displays(void)
-- 
2.35.3




[PULL 33/35] qsd: Unlink absolute PID file path

2022-07-12 Thread Hanna Reitz
After writing the PID file, we register an atexit() handler to unlink it
when the process terminates.  However, if the process has changed its
working directory in the meantime (e.g. in os_setup_post() when
daemonizing), this will not work when the PID file path was relative.
Therefore, pass the absolute path (created with realpath()) to the
unlink() call in the atexit() handler.

(realpath() needs a path pointing to an existing file, so we cannot use
it before qemu_write_pidfile().)

Reproducer:
$ cd /tmp
$ qemu-storage-daemon --daemonize --pidfile qsd.pid
$ file qsd.pid
qsd.pid: ASCII text
$ kill $(cat qsd.pid)
$ file qsd.pid
qsd.pid: ASCII text

(qsd.pid should be gone after the process has terminated.)

Buglink: https://bugzilla.redhat.com/show_bug.cgi?id=2092322
Signed-off-by: Hanna Reitz 
Message-Id: <20220609122701.17172-2-hre...@redhat.com>
Reviewed-by: Daniel P. Berrangé 
---
 storage-daemon/qemu-storage-daemon.c | 11 ++-
 1 file changed, 10 insertions(+), 1 deletion(-)

diff --git a/storage-daemon/qemu-storage-daemon.c b/storage-daemon/qemu-storage-daemon.c
index 448c318e23..7718f6dcda 100644
--- a/storage-daemon/qemu-storage-daemon.c
+++ b/storage-daemon/qemu-storage-daemon.c
@@ -61,6 +61,7 @@
 #include "trace/control.h"
 
 static const char *pid_file;
+static char *pid_file_realpath;
 static volatile bool exit_requested = false;
 
 void qemu_system_killed(int signal, pid_t pid)
@@ -363,7 +364,7 @@ static void process_options(int argc, char *argv[], bool pre_init_pass)
 
 static void pid_file_cleanup(void)
 {
-unlink(pid_file);
+unlink(pid_file_realpath);
 }
 
 static void pid_file_init(void)
@@ -379,6 +380,14 @@ static void pid_file_init(void)
 exit(EXIT_FAILURE);
 }
 
+pid_file_realpath = g_malloc(PATH_MAX);
+if (!realpath(pid_file, pid_file_realpath)) {
+error_report("cannot resolve PID file path: %s: %s",
+ pid_file, strerror(errno));
+unlink(pid_file);
+exit(EXIT_FAILURE);
+}
+
 atexit(pid_file_cleanup);
 }
 
-- 
2.35.3




[PULL 19/35] block: Add blk_{preadv,pwritev}()

2022-07-12 Thread Hanna Reitz
From: Alberto Faria 

Implement them using generated_co_wrapper.

Signed-off-by: Alberto Faria 
Reviewed-by: Paolo Bonzini 
Reviewed-by: Hanna Reitz 
Message-Id: <20220705161527.1054072-8-afa...@redhat.com>
Signed-off-by: Hanna Reitz 
---
 include/sysemu/block-backend-io.h |  6 +
 tests/unit/test-block-iothread.c  | 42 ++-
 2 files changed, 47 insertions(+), 1 deletion(-)

diff --git a/include/sysemu/block-backend-io.h b/include/sysemu/block-backend-io.h
index 750089fc9b..cbf4422ea1 100644
--- a/include/sysemu/block-backend-io.h
+++ b/include/sysemu/block-backend-io.h
@@ -107,6 +107,9 @@ int generated_co_wrapper blk_pread(BlockBackend *blk, int64_t offset,
 int generated_co_wrapper blk_pwrite(BlockBackend *blk, int64_t offset,
 int64_t bytes, const void *buf,
 BdrvRequestFlags flags);
+int generated_co_wrapper blk_preadv(BlockBackend *blk, int64_t offset,
+int64_t bytes, QEMUIOVector *qiov,
+BdrvRequestFlags flags);
 int coroutine_fn blk_co_preadv(BlockBackend *blk, int64_t offset,
int64_t bytes, QEMUIOVector *qiov,
BdrvRequestFlags flags);
@@ -114,6 +117,9 @@ int coroutine_fn blk_co_pwritev_part(BlockBackend *blk, int64_t offset,
  int64_t bytes,
  QEMUIOVector *qiov, size_t qiov_offset,
  BdrvRequestFlags flags);
+int generated_co_wrapper blk_pwritev(BlockBackend *blk, int64_t offset,
+ int64_t bytes, QEMUIOVector *qiov,
+ BdrvRequestFlags flags);
 int coroutine_fn blk_co_pwritev(BlockBackend *blk, int64_t offset,
 int64_t bytes, QEMUIOVector *qiov,
 BdrvRequestFlags flags);
diff --git a/tests/unit/test-block-iothread.c b/tests/unit/test-block-iothread.c
index 0ced5333ff..b9c5da3a87 100644
--- a/tests/unit/test-block-iothread.c
+++ b/tests/unit/test-block-iothread.c
@@ -138,6 +138,36 @@ static void test_sync_op_blk_pwrite(BlockBackend *blk)
 g_assert_cmpint(ret, ==, -EIO);
 }
 
+static void test_sync_op_blk_preadv(BlockBackend *blk)
+{
+uint8_t buf[512];
+QEMUIOVector qiov = QEMU_IOVEC_INIT_BUF(qiov, buf, sizeof(buf));
+int ret;
+
+/* Success */
+ret = blk_preadv(blk, 0, sizeof(buf), &qiov, 0);
+g_assert_cmpint(ret, ==, 0);
+
+/* Early error: Negative offset */
+ret = blk_preadv(blk, -2, sizeof(buf), &qiov, 0);
+g_assert_cmpint(ret, ==, -EIO);
+}
+
+static void test_sync_op_blk_pwritev(BlockBackend *blk)
+{
+uint8_t buf[512] = { 0 };
+QEMUIOVector qiov = QEMU_IOVEC_INIT_BUF(qiov, buf, sizeof(buf));
+int ret;
+
+/* Success */
+ret = blk_pwritev(blk, 0, sizeof(buf), &qiov, 0);
+g_assert_cmpint(ret, ==, 0);
+
+/* Early error: Negative offset */
+ret = blk_pwritev(blk, -2, sizeof(buf), &qiov, 0);
+g_assert_cmpint(ret, ==, -EIO);
+}
+
 static void test_sync_op_load_vmstate(BdrvChild *c)
 {
 uint8_t buf[512];
@@ -301,6 +331,14 @@ const SyncOpTest sync_op_tests[] = {
 .name   = "/sync-op/pwrite",
 .fn = test_sync_op_pwrite,
 .blkfn  = test_sync_op_blk_pwrite,
+}, {
+.name   = "/sync-op/preadv",
+.fn = NULL,
+.blkfn  = test_sync_op_blk_preadv,
+}, {
+.name   = "/sync-op/pwritev",
+.fn = NULL,
+.blkfn  = test_sync_op_blk_pwritev,
 }, {
 .name   = "/sync-op/load_vmstate",
 .fn = test_sync_op_load_vmstate,
@@ -349,7 +387,9 @@ static void test_sync_op(const void *opaque)
 
 blk_set_aio_context(blk, ctx, _abort);
 aio_context_acquire(ctx);
-t->fn(c);
+if (t->fn) {
+t->fn(c);
+}
 if (t->blkfn) {
 t->blkfn(blk);
 }
-- 
2.35.3




[PULL 28/35] block: Add blk_co_truncate()

2022-07-12 Thread Hanna Reitz
From: Alberto Faria 

Also convert blk_truncate() into a generated_co_wrapper.

Signed-off-by: Alberto Faria 
Reviewed-by: Paolo Bonzini 
Reviewed-by: Hanna Reitz 
Message-Id: <20220705161527.1054072-17-afa...@redhat.com>
Signed-off-by: Hanna Reitz 
---
 include/sysemu/block-backend-io.h |  8 ++--
 block/block-backend.c |  7 ---
 tests/unit/test-block-iothread.c  | 14 ++
 3 files changed, 24 insertions(+), 5 deletions(-)

diff --git a/include/sysemu/block-backend-io.h b/include/sysemu/block-backend-io.h
index 21f3a1b4f2..a79b7d9d9c 100644
--- a/include/sysemu/block-backend-io.h
+++ b/include/sysemu/block-backend-io.h
@@ -182,7 +182,11 @@ int generated_co_wrapper blk_pwrite_zeroes(BlockBackend *blk, int64_t offset,
BdrvRequestFlags flags);
 int coroutine_fn blk_co_pwrite_zeroes(BlockBackend *blk, int64_t offset,
   int64_t bytes, BdrvRequestFlags flags);
-int blk_truncate(BlockBackend *blk, int64_t offset, bool exact,
- PreallocMode prealloc, BdrvRequestFlags flags, Error **errp);
+int generated_co_wrapper blk_truncate(BlockBackend *blk, int64_t offset,
+  bool exact, PreallocMode prealloc,
+  BdrvRequestFlags flags, Error **errp);
+int coroutine_fn blk_co_truncate(BlockBackend *blk, int64_t offset, bool exact,
+ PreallocMode prealloc, BdrvRequestFlags flags,
+ Error **errp);
 
 #endif /* BLOCK_BACKEND_IO_H */
diff --git a/block/block-backend.c b/block/block-backend.c
index a31b9f1205..d0450ffd22 100644
--- a/block/block-backend.c
+++ b/block/block-backend.c
@@ -2293,8 +2293,9 @@ int coroutine_fn blk_co_pwrite_compressed(BlockBackend *blk, int64_t offset,
BDRV_REQ_WRITE_COMPRESSED);
 }
 
-int blk_truncate(BlockBackend *blk, int64_t offset, bool exact,
- PreallocMode prealloc, BdrvRequestFlags flags, Error **errp)
+int coroutine_fn blk_co_truncate(BlockBackend *blk, int64_t offset, bool exact,
+ PreallocMode prealloc, BdrvRequestFlags flags,
+ Error **errp)
 {
 IO_OR_GS_CODE();
 if (!blk_is_available(blk)) {
@@ -2302,7 +2303,7 @@ int blk_truncate(BlockBackend *blk, int64_t offset, bool exact,
 return -ENOMEDIUM;
 }
 
-return bdrv_truncate(blk->root, offset, exact, prealloc, flags, errp);
+return bdrv_co_truncate(blk->root, offset, exact, prealloc, flags, errp);
 }
 
 int blk_save_vmstate(BlockBackend *blk, const uint8_t *buf,
diff --git a/tests/unit/test-block-iothread.c b/tests/unit/test-block-iothread.c
index bb9230a4b4..8b55eccc89 100644
--- a/tests/unit/test-block-iothread.c
+++ b/tests/unit/test-block-iothread.c
@@ -298,6 +298,19 @@ static void test_sync_op_truncate(BdrvChild *c)
 c->bs->open_flags |= BDRV_O_RDWR;
 }
 
+static void test_sync_op_blk_truncate(BlockBackend *blk)
+{
+int ret;
+
+/* Normal success path */
+ret = blk_truncate(blk, 65536, false, PREALLOC_MODE_OFF, 0, NULL);
+g_assert_cmpint(ret, ==, 0);
+
+/* Early error: Negative offset */
+ret = blk_truncate(blk, -2, false, PREALLOC_MODE_OFF, 0, NULL);
+g_assert_cmpint(ret, ==, -EINVAL);
+}
+
 static void test_sync_op_block_status(BdrvChild *c)
 {
 int ret;
@@ -425,6 +438,7 @@ const SyncOpTest sync_op_tests[] = {
 }, {
 .name   = "/sync-op/truncate",
 .fn = test_sync_op_truncate,
+.blkfn  = test_sync_op_blk_truncate,
 }, {
 .name   = "/sync-op/block_status",
 .fn = test_sync_op_block_status,
-- 
2.35.3




[PULL 24/35] block: Implement blk_pwrite_zeroes() using generated_co_wrapper

2022-07-12 Thread Hanna Reitz
From: Alberto Faria 

Signed-off-by: Alberto Faria 
Reviewed-by: Paolo Bonzini 
Reviewed-by: Hanna Reitz 
Message-Id: <20220705161527.1054072-13-afa...@redhat.com>
Signed-off-by: Hanna Reitz 
---
 include/sysemu/block-backend-io.h |  5 +++--
 block/block-backend.c |  8 
 tests/unit/test-block-iothread.c  | 17 +
 3 files changed, 20 insertions(+), 10 deletions(-)

diff --git a/include/sysemu/block-backend-io.h b/include/sysemu/block-backend-io.h
index 8500a63102..346ca47fed 100644
--- a/include/sysemu/block-backend-io.h
+++ b/include/sysemu/block-backend-io.h
@@ -173,8 +173,9 @@ int generated_co_wrapper blk_pwrite_compressed(BlockBackend *blk,
 int coroutine_fn blk_co_pwrite_compressed(BlockBackend *blk, int64_t offset,
   int64_t bytes, const void *buf);
 int blk_pdiscard(BlockBackend *blk, int64_t offset, int64_t bytes);
-int blk_pwrite_zeroes(BlockBackend *blk, int64_t offset,
-  int64_t bytes, BdrvRequestFlags flags);
+int generated_co_wrapper blk_pwrite_zeroes(BlockBackend *blk, int64_t offset,
+   int64_t bytes,
+   BdrvRequestFlags flags);
 int coroutine_fn blk_co_pwrite_zeroes(BlockBackend *blk, int64_t offset,
   int64_t bytes, BdrvRequestFlags flags);
 int blk_truncate(BlockBackend *blk, int64_t offset, bool exact,
diff --git a/block/block-backend.c b/block/block-backend.c
index f07a76aa5a..85661d1823 100644
--- a/block/block-backend.c
+++ b/block/block-backend.c
@@ -1411,14 +1411,6 @@ typedef struct BlkRwCo {
 BdrvRequestFlags flags;
 } BlkRwCo;
 
-int blk_pwrite_zeroes(BlockBackend *blk, int64_t offset,
-  int64_t bytes, BdrvRequestFlags flags)
-{
-IO_OR_GS_CODE();
-return blk_pwritev_part(blk, offset, bytes, NULL, 0,
-flags | BDRV_REQ_ZERO_WRITE);
-}
-
 int blk_make_zero(BlockBackend *blk, BdrvRequestFlags flags)
 {
 GLOBAL_STATE_CODE();
diff --git a/tests/unit/test-block-iothread.c b/tests/unit/test-block-iothread.c
index 3a46886784..bb9230a4b4 100644
--- a/tests/unit/test-block-iothread.c
+++ b/tests/unit/test-block-iothread.c
@@ -212,6 +212,19 @@ static void test_sync_op_blk_pwrite_compressed(BlockBackend *blk)
 g_assert_cmpint(ret, ==, -EIO);
 }
 
+static void test_sync_op_blk_pwrite_zeroes(BlockBackend *blk)
+{
+int ret;
+
+/* Success */
+ret = blk_pwrite_zeroes(blk, 0, 512, 0);
+g_assert_cmpint(ret, ==, 0);
+
+/* Early error: Negative offset */
+ret = blk_pwrite_zeroes(blk, -2, 512, 0);
+g_assert_cmpint(ret, ==, -EIO);
+}
+
 static void test_sync_op_load_vmstate(BdrvChild *c)
 {
 uint8_t buf[512];
@@ -395,6 +408,10 @@ const SyncOpTest sync_op_tests[] = {
 .name   = "/sync-op/pwrite_compressed",
 .fn = NULL,
 .blkfn  = test_sync_op_blk_pwrite_compressed,
+}, {
+.name   = "/sync-op/pwrite_zeroes",
+.fn = NULL,
+.blkfn  = test_sync_op_blk_pwrite_zeroes,
 }, {
 .name   = "/sync-op/load_vmstate",
 .fn = test_sync_op_load_vmstate,
-- 
2.35.3




[PULL 15/35] block: Change blk_{pread,pwrite}() param order

2022-07-12 Thread Hanna Reitz
From: Alberto Faria 

Swap 'buf' and 'bytes' around for consistency with
blk_co_{pread,pwrite}(), and in preparation to implement these functions
using generated_co_wrapper.

Callers were updated using this Coccinelle script:

@@ expression blk, offset, buf, bytes, flags; @@
- blk_pread(blk, offset, buf, bytes, flags)
+ blk_pread(blk, offset, bytes, buf, flags)

@@ expression blk, offset, buf, bytes, flags; @@
- blk_pwrite(blk, offset, buf, bytes, flags)
+ blk_pwrite(blk, offset, bytes, buf, flags)

It had no effect on hw/block/nand.c, presumably due to the #if, so that
file was updated manually.

Overly-long lines were then fixed by hand.

Signed-off-by: Alberto Faria 
Reviewed-by: Eric Blake 
Reviewed-by: Hanna Reitz 
Message-Id: <20220705161527.1054072-4-afa...@redhat.com>
Signed-off-by: Hanna Reitz 
---
 include/sysemu/block-backend-io.h |  4 +--
 block.c   |  2 +-
 block/block-backend.c |  4 +--
 block/commit.c|  4 +--
 block/crypto.c|  2 +-
 block/export/fuse.c   |  4 +--
 block/parallels.c |  2 +-
 block/qcow.c  |  8 +++---
 block/qcow2.c |  4 +--
 block/qed.c   |  8 +++---
 block/vdi.c   |  4 +--
 block/vhdx.c  | 20 ++---
 block/vmdk.c  | 10 +++
 block/vpc.c   | 12 
 hw/arm/allwinner-h3.c |  2 +-
 hw/arm/aspeed.c   |  2 +-
 hw/block/block.c  |  2 +-
 hw/block/fdc.c| 20 ++---
 hw/block/hd-geometry.c|  2 +-
 hw/block/m25p80.c |  2 +-
 hw/block/nand.c   | 47 ---
 hw/block/onenand.c| 32 ++---
 hw/block/pflash_cfi01.c   |  4 +--
 hw/block/pflash_cfi02.c   |  4 +--
 hw/ide/atapi.c|  4 +--
 hw/misc/mac_via.c |  4 +--
 hw/misc/sifive_u_otp.c| 14 -
 hw/nvram/eeprom_at24c.c   |  4 +--
 hw/nvram/spapr_nvram.c|  6 ++--
 hw/nvram/xlnx-bbram.c |  4 +--
 hw/nvram/xlnx-efuse.c |  4 +--
 hw/ppc/pnv_pnor.c |  6 ++--
 hw/sd/sd.c|  4 +--
 migration/block.c |  6 ++--
 nbd/server.c  |  8 +++---
 qemu-img.c| 18 ++--
 qemu-io-cmds.c|  4 +--
 tests/unit/test-block-iothread.c  |  8 +++---
 38 files changed, 150 insertions(+), 149 deletions(-)

diff --git a/include/sysemu/block-backend-io.h 
b/include/sysemu/block-backend-io.h
index 5c5c108e0d..e2e7ab9e6c 100644
--- a/include/sysemu/block-backend-io.h
+++ b/include/sysemu/block-backend-io.h
@@ -101,9 +101,9 @@ int coroutine_fn blk_co_copy_range(BlockBackend *blk_in, 
int64_t off_in,
  * the "I/O or GS" API.
  */
 
-int blk_pread(BlockBackend *blk, int64_t offset, void *buf, int bytes,
+int blk_pread(BlockBackend *blk, int64_t offset, int bytes, void *buf,
   BdrvRequestFlags flags);
-int blk_pwrite(BlockBackend *blk, int64_t offset, const void *buf, int bytes,
+int blk_pwrite(BlockBackend *blk, int64_t offset, int bytes, const void *buf,
BdrvRequestFlags flags);
 int coroutine_fn blk_co_preadv(BlockBackend *blk, int64_t offset,
int64_t bytes, QEMUIOVector *qiov,
diff --git a/block.c b/block.c
index ed701b4889..bc85f46eed 100644
--- a/block.c
+++ b/block.c
@@ -1037,7 +1037,7 @@ static int find_image_format(BlockBackend *file, const 
char *filename,
 return ret;
 }
 
-ret = blk_pread(file, 0, buf, sizeof(buf), 0);
+ret = blk_pread(file, 0, sizeof(buf), buf, 0);
 if (ret < 0) {
 error_setg_errno(errp, -ret, "Could not read image for determining its 
"
  "format");
diff --git a/block/block-backend.c b/block/block-backend.c
index bdde90d235..5fb3890bab 100644
--- a/block/block-backend.c
+++ b/block/block-backend.c
@@ -1563,7 +1563,7 @@ BlockAIOCB *blk_aio_pwrite_zeroes(BlockBackend *blk, 
int64_t offset,
 flags | BDRV_REQ_ZERO_WRITE, cb, opaque);
 }
 
-int blk_pread(BlockBackend *blk, int64_t offset, void *buf, int bytes,
+int blk_pread(BlockBackend *blk, int64_t offset, int bytes, void *buf,
   BdrvRequestFlags flags)
 {
 int ret;
@@ -1577,7 +1577,7 @@ int blk_pread(BlockBackend *blk, int64_t offset, void 
*buf, int bytes,
 return ret;
 }
 
-int blk_pwrite(BlockBackend *blk, int64_t offset, const void *buf, int bytes,
+int blk_pwrite(BlockBackend *blk, int64_t offset, int bytes, const void *buf,
BdrvRequestFlags flags)
 {
 QEMUIOVector qiov = QEMU_IOVEC_INIT_BUF(qiov, buf, bytes);
diff --git a/block/commit.c b/block/commit.c
index e5b3ad08da..38571510cb 100644
--

[PULL 11/35] tests/qemu-iotests: hotfix for 307, 223 output

2022-07-12 Thread Hanna Reitz
From: John Snow 

Fixes: 58a6fdcc
Signed-off-by: John Snow 
Tested-by: Daniel P. Berrangé 
Reviewed-by: Daniel P. Berrangé 
Message-Id: <20220616142659.3184115-2-js...@redhat.com>
Reviewed-by: Hanna Reitz 
Signed-off-by: Hanna Reitz 
---
 tests/qemu-iotests/223.out | 4 ++--
 tests/qemu-iotests/307.out | 4 ++--
 2 files changed, 4 insertions(+), 4 deletions(-)

diff --git a/tests/qemu-iotests/223.out b/tests/qemu-iotests/223.out
index 0647941531..26fb347c5d 100644
--- a/tests/qemu-iotests/223.out
+++ b/tests/qemu-iotests/223.out
@@ -93,7 +93,7 @@ exports available: 3
  export: 'n2'
   description: some text
   size:  4194304
-  flags: 0xced ( flush fua trim zeroes df cache fast-zero )
+  flags: 0xded ( flush fua trim zeroes df multi cache fast-zero )
   min block: 1
   opt block: 4096
   max block: 33554432
@@ -212,7 +212,7 @@ exports available: 3
  export: 'n2'
   description: some text
   size:  4194304
-  flags: 0xced ( flush fua trim zeroes df cache fast-zero )
+  flags: 0xded ( flush fua trim zeroes df multi cache fast-zero )
   min block: 1
   opt block: 4096
   max block: 33554432
diff --git a/tests/qemu-iotests/307.out b/tests/qemu-iotests/307.out
index ec8d2be0e0..390f05d1b7 100644
--- a/tests/qemu-iotests/307.out
+++ b/tests/qemu-iotests/307.out
@@ -83,7 +83,7 @@ exports available: 2
  export: 'export1'
   description: This is the writable second export
   size:  67108864
-  flags: 0xced ( flush fua trim zeroes df cache fast-zero )
+  flags: 0xded ( flush fua trim zeroes df multi cache fast-zero )
   min block: XXX
   opt block: XXX
   max block: XXX
@@ -109,7 +109,7 @@ exports available: 1
  export: 'export1'
   description: This is the writable second export
   size:  67108864
-  flags: 0xced ( flush fua trim zeroes df cache fast-zero )
+  flags: 0xded ( flush fua trim zeroes df multi cache fast-zero )
   min block: XXX
   opt block: XXX
   max block: XXX
-- 
2.35.3




[PULL 29/35] block: Reorganize some declarations in block-backend-io.h

2022-07-12 Thread Hanna Reitz
From: Alberto Faria 

Keep generated_co_wrapper and coroutine_fn pairs together. This should
make it clear that each I/O function has these two versions.

Also move blk_co_{pread,pwrite}()'s implementations out of the header
file for consistency.

Signed-off-by: Alberto Faria 
Reviewed-by: Paolo Bonzini 
Message-Id: <20220705161527.1054072-18-afa...@redhat.com>
Reviewed-by: Hanna Reitz 
Signed-off-by: Hanna Reitz 
---
 include/sysemu/block-backend-io.h | 77 +--
 block/block-backend.c | 22 +
 2 files changed, 54 insertions(+), 45 deletions(-)

diff --git a/include/sysemu/block-backend-io.h 
b/include/sysemu/block-backend-io.h
index a79b7d9d9c..50f5aa2e07 100644
--- a/include/sysemu/block-backend-io.h
+++ b/include/sysemu/block-backend-io.h
@@ -104,9 +104,16 @@ int coroutine_fn blk_co_copy_range(BlockBackend *blk_in, 
int64_t off_in,
 int generated_co_wrapper blk_pread(BlockBackend *blk, int64_t offset,
int64_t bytes, void *buf,
BdrvRequestFlags flags);
-int generated_co_wrapper blk_pwrite(BlockBackend *blk, int64_t offset,
-int64_t bytes, const void *buf,
+int coroutine_fn blk_co_pread(BlockBackend *blk, int64_t offset, int64_t bytes,
+  void *buf, BdrvRequestFlags flags);
+
+int generated_co_wrapper blk_preadv(BlockBackend *blk, int64_t offset,
+int64_t bytes, QEMUIOVector *qiov,
 BdrvRequestFlags flags);
+int coroutine_fn blk_co_preadv(BlockBackend *blk, int64_t offset,
+   int64_t bytes, QEMUIOVector *qiov,
+   BdrvRequestFlags flags);
+
 int generated_co_wrapper blk_preadv_part(BlockBackend *blk, int64_t offset,
  int64_t bytes, QEMUIOVector *qiov,
  size_t qiov_offset,
@@ -114,12 +121,20 @@ int generated_co_wrapper blk_preadv_part(BlockBackend 
*blk, int64_t offset,
 int coroutine_fn blk_co_preadv_part(BlockBackend *blk, int64_t offset,
 int64_t bytes, QEMUIOVector *qiov,
 size_t qiov_offset, BdrvRequestFlags 
flags);
-int generated_co_wrapper blk_preadv(BlockBackend *blk, int64_t offset,
-int64_t bytes, QEMUIOVector *qiov,
+
+int generated_co_wrapper blk_pwrite(BlockBackend *blk, int64_t offset,
+int64_t bytes, const void *buf,
 BdrvRequestFlags flags);
-int coroutine_fn blk_co_preadv(BlockBackend *blk, int64_t offset,
-   int64_t bytes, QEMUIOVector *qiov,
-   BdrvRequestFlags flags);
+int coroutine_fn blk_co_pwrite(BlockBackend *blk, int64_t offset, int64_t 
bytes,
+   const void *buf, BdrvRequestFlags flags);
+
+int generated_co_wrapper blk_pwritev(BlockBackend *blk, int64_t offset,
+ int64_t bytes, QEMUIOVector *qiov,
+ BdrvRequestFlags flags);
+int coroutine_fn blk_co_pwritev(BlockBackend *blk, int64_t offset,
+int64_t bytes, QEMUIOVector *qiov,
+BdrvRequestFlags flags);
+
 int generated_co_wrapper blk_pwritev_part(BlockBackend *blk, int64_t offset,
   int64_t bytes, QEMUIOVector *qiov,
   size_t qiov_offset,
@@ -128,36 +143,18 @@ int coroutine_fn blk_co_pwritev_part(BlockBackend *blk, 
int64_t offset,
  int64_t bytes,
  QEMUIOVector *qiov, size_t qiov_offset,
  BdrvRequestFlags flags);
-int generated_co_wrapper blk_pwritev(BlockBackend *blk, int64_t offset,
- int64_t bytes, QEMUIOVector *qiov,
- BdrvRequestFlags flags);
-int coroutine_fn blk_co_pwritev(BlockBackend *blk, int64_t offset,
-int64_t bytes, QEMUIOVector *qiov,
-BdrvRequestFlags flags);
-
-static inline int coroutine_fn blk_co_pread(BlockBackend *blk, int64_t offset,
-int64_t bytes, void *buf,
-BdrvRequestFlags flags)
-{
-QEMUIOVector qiov = QEMU_IOVEC_INIT_BUF(qiov, buf, bytes);
-IO_OR_GS_CODE();
-
-assert(bytes <= SIZE_MAX);
-
-return blk_co_preadv(blk, offset, bytes, &qiov, flags);
-}
-
-static inline int coroutine_fn blk_co_pwrite(BlockBackend *blk, int64_t offset,
- int64_t bytes, const void *buf,
- BdrvRequestFlags flags)
-{
-QEMUIOVe

[PULL 21/35] block: Export blk_pwritev_part() in block-backend-io.h

2022-07-12 Thread Hanna Reitz
From: Alberto Faria 

Also convert it into a generated_co_wrapper.

Signed-off-by: Alberto Faria 
Reviewed-by: Paolo Bonzini 
Reviewed-by: Hanna Reitz 
Message-Id: <20220705161527.1054072-10-afa...@redhat.com>
Signed-off-by: Hanna Reitz 
---
 block/coroutines.h|  5 -
 include/sysemu/block-backend-io.h |  4 
 block/block-backend.c | 14 --
 tests/unit/test-block-iothread.c  | 19 +++
 4 files changed, 23 insertions(+), 19 deletions(-)

diff --git a/block/coroutines.h b/block/coroutines.h
index 85423f8db6..94fd283f62 100644
--- a/block/coroutines.h
+++ b/block/coroutines.h
@@ -107,11 +107,6 @@ bdrv_common_block_status_above(BlockDriverState *bs,
 int generated_co_wrapper
 nbd_do_establish_connection(BlockDriverState *bs, bool blocking, Error **errp);
 
-int generated_co_wrapper
-blk_do_pwritev_part(BlockBackend *blk, int64_t offset, int64_t bytes,
-QEMUIOVector *qiov, size_t qiov_offset,
-BdrvRequestFlags flags);
-
 int generated_co_wrapper
 blk_do_ioctl(BlockBackend *blk, unsigned long int req, void *buf);
 
diff --git a/include/sysemu/block-backend-io.h 
b/include/sysemu/block-backend-io.h
index 877274a999..93a15071c4 100644
--- a/include/sysemu/block-backend-io.h
+++ b/include/sysemu/block-backend-io.h
@@ -120,6 +120,10 @@ int generated_co_wrapper blk_preadv(BlockBackend *blk, 
int64_t offset,
 int coroutine_fn blk_co_preadv(BlockBackend *blk, int64_t offset,
int64_t bytes, QEMUIOVector *qiov,
BdrvRequestFlags flags);
+int generated_co_wrapper blk_pwritev_part(BlockBackend *blk, int64_t offset,
+  int64_t bytes, QEMUIOVector *qiov,
+  size_t qiov_offset,
+  BdrvRequestFlags flags);
 int coroutine_fn blk_co_pwritev_part(BlockBackend *blk, int64_t offset,
  int64_t bytes,
  QEMUIOVector *qiov, size_t qiov_offset,
diff --git a/block/block-backend.c b/block/block-backend.c
index 49db3f63fb..c75942b773 100644
--- a/block/block-backend.c
+++ b/block/block-backend.c
@@ -1403,20 +1403,6 @@ int coroutine_fn blk_co_pwritev(BlockBackend *blk, 
int64_t offset,
 return blk_co_pwritev_part(blk, offset, bytes, qiov, 0, flags);
 }
 
-static int coroutine_fn blk_pwritev_part(BlockBackend *blk, int64_t offset,
- int64_t bytes,
- QEMUIOVector *qiov, size_t 
qiov_offset,
- BdrvRequestFlags flags)
-{
-int ret;
-
-blk_inc_in_flight(blk);
-ret = blk_do_pwritev_part(blk, offset, bytes, qiov, qiov_offset, flags);
-blk_dec_in_flight(blk);
-
-return ret;
-}
-
 typedef struct BlkRwCo {
 BlockBackend *blk;
 int64_t offset;
diff --git a/tests/unit/test-block-iothread.c b/tests/unit/test-block-iothread.c
index 2fa1248445..274e9e3653 100644
--- a/tests/unit/test-block-iothread.c
+++ b/tests/unit/test-block-iothread.c
@@ -183,6 +183,21 @@ static void test_sync_op_blk_preadv_part(BlockBackend *blk)
 g_assert_cmpint(ret, ==, -EIO);
 }
 
+static void test_sync_op_blk_pwritev_part(BlockBackend *blk)
+{
+uint8_t buf[512] = { 0 };
+QEMUIOVector qiov = QEMU_IOVEC_INIT_BUF(qiov, buf, sizeof(buf));
+int ret;
+
+/* Success */
+ret = blk_pwritev_part(blk, 0, sizeof(buf), &qiov, 0, 0);
+g_assert_cmpint(ret, ==, 0);
+
+/* Early error: Negative offset */
+ret = blk_pwritev_part(blk, -2, sizeof(buf), &qiov, 0, 0);
+g_assert_cmpint(ret, ==, -EIO);
+}
+
 static void test_sync_op_load_vmstate(BdrvChild *c)
 {
 uint8_t buf[512];
@@ -358,6 +373,10 @@ const SyncOpTest sync_op_tests[] = {
 .name   = "/sync-op/preadv_part",
 .fn = NULL,
 .blkfn  = test_sync_op_blk_preadv_part,
+}, {
+.name   = "/sync-op/pwritev_part",
+.fn = NULL,
+.blkfn  = test_sync_op_blk_pwritev_part,
 }, {
 .name   = "/sync-op/load_vmstate",
 .fn = test_sync_op_load_vmstate,
-- 
2.35.3




[PULL 27/35] block: Add blk_co_ioctl()

2022-07-12 Thread Hanna Reitz
From: Alberto Faria 

Also convert blk_ioctl() into a generated_co_wrapper.

Signed-off-by: Alberto Faria 
Reviewed-by: Paolo Bonzini 
Reviewed-by: Hanna Reitz 
Message-Id: <20220705161527.1054072-16-afa...@redhat.com>
Signed-off-by: Hanna Reitz 
---
 block/coroutines.h| 6 --
 include/sysemu/block-backend-io.h | 5 -
 block/block-backend.c | 7 ---
 3 files changed, 8 insertions(+), 10 deletions(-)

diff --git a/block/coroutines.h b/block/coroutines.h
index 7e94b9fa83..d66551a277 100644
--- a/block/coroutines.h
+++ b/block/coroutines.h
@@ -68,9 +68,6 @@ blk_co_do_pwritev_part(BlockBackend *blk, int64_t offset, 
int64_t bytes,
QEMUIOVector *qiov, size_t qiov_offset,
BdrvRequestFlags flags);
 
-int coroutine_fn
-blk_co_do_ioctl(BlockBackend *blk, unsigned long int req, void *buf);
-
 int coroutine_fn
 blk_co_do_pdiscard(BlockBackend *blk, int64_t offset, int64_t bytes);
 
@@ -107,7 +104,4 @@ bdrv_common_block_status_above(BlockDriverState *bs,
 int generated_co_wrapper
 nbd_do_establish_connection(BlockDriverState *bs, bool blocking, Error **errp);
 
-int generated_co_wrapper
-blk_do_ioctl(BlockBackend *blk, unsigned long int req, void *buf);
-
 #endif /* BLOCK_COROUTINES_H */
diff --git a/include/sysemu/block-backend-io.h 
b/include/sysemu/block-backend-io.h
index 6954c3bb8c..21f3a1b4f2 100644
--- a/include/sysemu/block-backend-io.h
+++ b/include/sysemu/block-backend-io.h
@@ -167,7 +167,10 @@ int coroutine_fn blk_co_pdiscard(BlockBackend *blk, 
int64_t offset,
 int generated_co_wrapper blk_flush(BlockBackend *blk);
 int coroutine_fn blk_co_flush(BlockBackend *blk);
 
-int blk_ioctl(BlockBackend *blk, unsigned long int req, void *buf);
+int generated_co_wrapper blk_ioctl(BlockBackend *blk, unsigned long int req,
+   void *buf);
+int coroutine_fn blk_co_ioctl(BlockBackend *blk, unsigned long int req,
+  void *buf);
 
 int generated_co_wrapper blk_pwrite_compressed(BlockBackend *blk,
int64_t offset, int64_t bytes,
diff --git a/block/block-backend.c b/block/block-backend.c
index 783a77cf6d..a31b9f1205 100644
--- a/block/block-backend.c
+++ b/block/block-backend.c
@@ -1620,7 +1620,7 @@ void blk_aio_cancel_async(BlockAIOCB *acb)
 }
 
 /* To be called between exactly one pair of blk_inc/dec_in_flight() */
-int coroutine_fn
+static int coroutine_fn
 blk_co_do_ioctl(BlockBackend *blk, unsigned long int req, void *buf)
 {
 IO_CODE();
@@ -1634,13 +1634,14 @@ blk_co_do_ioctl(BlockBackend *blk, unsigned long int 
req, void *buf)
 return bdrv_co_ioctl(blk_bs(blk), req, buf);
 }
 
-int blk_ioctl(BlockBackend *blk, unsigned long int req, void *buf)
+int coroutine_fn blk_co_ioctl(BlockBackend *blk, unsigned long int req,
+  void *buf)
 {
 int ret;
 IO_OR_GS_CODE();
 
 blk_inc_in_flight(blk);
-ret = blk_do_ioctl(blk, req, buf);
+ret = blk_co_do_ioctl(blk, req, buf);
 blk_dec_in_flight(blk);
 
 return ret;
-- 
2.35.3




[PULL 07/35] block: Implement bdrv_{pread, pwrite, pwrite_zeroes}() using generated_co_wrapper

2022-07-12 Thread Hanna Reitz
From: Alberto Faria 

bdrv_{pread,pwrite}() now return -EIO instead of -EINVAL when 'bytes' is
negative, making them consistent with bdrv_{preadv,pwritev}() and
bdrv_co_{pread,pwrite,preadv,pwritev}().

bdrv_pwrite_zeroes() now also calls trace_bdrv_co_pwrite_zeroes() and
clears the BDRV_REQ_MAY_UNMAP flag when appropriate, which it didn't
previously.

Signed-off-by: Alberto Faria 
Message-Id: <20220609152744.3891847-8-afa...@redhat.com>
Reviewed-by: Eric Blake 
Reviewed-by: Hanna Reitz 
Signed-off-by: Hanna Reitz 
---
 include/block/block-io.h | 15 +--
 block/io.c   | 41 
 2 files changed, 9 insertions(+), 47 deletions(-)

diff --git a/include/block/block-io.h b/include/block/block-io.h
index ecce56c096..260f669d89 100644
--- a/include/block/block-io.h
+++ b/include/block/block-io.h
@@ -39,13 +39,16 @@
  * to catch when they are accidentally called by the wrong API.
  */
 
-int bdrv_pwrite_zeroes(BdrvChild *child, int64_t offset,
-   int64_t bytes, BdrvRequestFlags flags);
+int generated_co_wrapper bdrv_pwrite_zeroes(BdrvChild *child, int64_t offset,
+int64_t bytes,
+BdrvRequestFlags flags);
 int bdrv_make_zero(BdrvChild *child, BdrvRequestFlags flags);
-int bdrv_pread(BdrvChild *child, int64_t offset, int64_t bytes, void *buf,
-   BdrvRequestFlags flags);
-int bdrv_pwrite(BdrvChild *child, int64_t offset, int64_t bytes,
-const void *buf, BdrvRequestFlags flags);
+int generated_co_wrapper bdrv_pread(BdrvChild *child, int64_t offset,
+int64_t bytes, void *buf,
+BdrvRequestFlags flags);
+int generated_co_wrapper bdrv_pwrite(BdrvChild *child, int64_t offset,
+ int64_t bytes, const void *buf,
+ BdrvRequestFlags flags);
 int bdrv_pwrite_sync(BdrvChild *child, int64_t offset, int64_t bytes,
  const void *buf, BdrvRequestFlags flags);
 /*
diff --git a/block/io.c b/block/io.c
index 86b5e8b471..9ff440e4e4 100644
--- a/block/io.c
+++ b/block/io.c
@@ -1046,14 +1046,6 @@ static int bdrv_check_request32(int64_t offset, int64_t 
bytes,
 return 0;
 }
 
-int bdrv_pwrite_zeroes(BdrvChild *child, int64_t offset,
-   int64_t bytes, BdrvRequestFlags flags)
-{
-IO_CODE();
-return bdrv_pwritev(child, offset, bytes, NULL,
-BDRV_REQ_ZERO_WRITE | flags);
-}
-
 /*
  * Completely zero out a block device with the help of bdrv_pwrite_zeroes.
  * The operation is sped up by checking the block status and only writing
@@ -1096,39 +1088,6 @@ int bdrv_make_zero(BdrvChild *child, BdrvRequestFlags 
flags)
 }
 }
 
-/* See bdrv_pwrite() for the return codes */
-int bdrv_pread(BdrvChild *child, int64_t offset, int64_t bytes, void *buf,
-   BdrvRequestFlags flags)
-{
-QEMUIOVector qiov = QEMU_IOVEC_INIT_BUF(qiov, buf, bytes);
-IO_CODE();
-
-if (bytes < 0) {
-return -EINVAL;
-}
-
-return bdrv_preadv(child, offset, bytes, &qiov, flags);
-}
-
-/* Return no. of bytes on success or < 0 on error. Important errors are:
-  -EIO generic I/O error (may happen for all errors)
-  -ENOMEDIUM   No media inserted.
-  -EINVAL  Invalid offset or number of bytes
-  -EACCES  Trying to write a read-only device
-*/
-int bdrv_pwrite(BdrvChild *child, int64_t offset, int64_t bytes,
-const void *buf, BdrvRequestFlags flags)
-{
-QEMUIOVector qiov = QEMU_IOVEC_INIT_BUF(qiov, buf, bytes);
-IO_CODE();
-
-if (bytes < 0) {
-return -EINVAL;
-}
-
-return bdrv_pwritev(child, offset, bytes, &qiov, flags);
-}
-
 /*
  * Writes to the file and ensures that no writes are reordered across this
  * request (acts as a barrier)
-- 
2.35.3




[PULL 23/35] block: Add blk_co_pwrite_compressed()

2022-07-12 Thread Hanna Reitz
From: Alberto Faria 

Also convert blk_pwrite_compressed() into a generated_co_wrapper.

Signed-off-by: Alberto Faria 
Reviewed-by: Paolo Bonzini 
Reviewed-by: Hanna Reitz 
Message-Id: <20220705161527.1054072-12-afa...@redhat.com>
Signed-off-by: Hanna Reitz 
---
 include/sysemu/block-backend-io.h |  7 +--
 block/block-backend.c |  8 
 tests/unit/test-block-iothread.c  | 18 ++
 3 files changed, 27 insertions(+), 6 deletions(-)

diff --git a/include/sysemu/block-backend-io.h 
b/include/sysemu/block-backend-io.h
index 695b793a72..8500a63102 100644
--- a/include/sysemu/block-backend-io.h
+++ b/include/sysemu/block-backend-io.h
@@ -167,8 +167,11 @@ int blk_flush(BlockBackend *blk);
 
 int blk_ioctl(BlockBackend *blk, unsigned long int req, void *buf);
 
-int blk_pwrite_compressed(BlockBackend *blk, int64_t offset, int64_t bytes,
-  const void *buf);
+int generated_co_wrapper blk_pwrite_compressed(BlockBackend *blk,
+   int64_t offset, int64_t bytes,
+   const void *buf);
+int coroutine_fn blk_co_pwrite_compressed(BlockBackend *blk, int64_t offset,
+  int64_t bytes, const void *buf);
 int blk_pdiscard(BlockBackend *blk, int64_t offset, int64_t bytes);
 int blk_pwrite_zeroes(BlockBackend *blk, int64_t offset,
   int64_t bytes, BdrvRequestFlags flags);
diff --git a/block/block-backend.c b/block/block-backend.c
index 60fb7eb3f2..f07a76aa5a 100644
--- a/block/block-backend.c
+++ b/block/block-backend.c
@@ -2314,13 +2314,13 @@ int coroutine_fn blk_co_pwrite_zeroes(BlockBackend 
*blk, int64_t offset,
   flags | BDRV_REQ_ZERO_WRITE);
 }
 
-int blk_pwrite_compressed(BlockBackend *blk, int64_t offset, int64_t bytes,
-  const void *buf)
+int coroutine_fn blk_co_pwrite_compressed(BlockBackend *blk, int64_t offset,
+  int64_t bytes, const void *buf)
 {
 QEMUIOVector qiov = QEMU_IOVEC_INIT_BUF(qiov, buf, bytes);
 IO_OR_GS_CODE();
-return blk_pwritev_part(blk, offset, bytes, &qiov, 0,
-BDRV_REQ_WRITE_COMPRESSED);
+return blk_co_pwritev_part(blk, offset, bytes, &qiov, 0,
+   BDRV_REQ_WRITE_COMPRESSED);
 }
 
 int blk_truncate(BlockBackend *blk, int64_t offset, bool exact,
diff --git a/tests/unit/test-block-iothread.c b/tests/unit/test-block-iothread.c
index 274e9e3653..3a46886784 100644
--- a/tests/unit/test-block-iothread.c
+++ b/tests/unit/test-block-iothread.c
@@ -198,6 +198,20 @@ static void test_sync_op_blk_pwritev_part(BlockBackend 
*blk)
 g_assert_cmpint(ret, ==, -EIO);
 }
 
+static void test_sync_op_blk_pwrite_compressed(BlockBackend *blk)
+{
+uint8_t buf[512] = { 0 };
+int ret;
+
+/* Late error: Not supported */
+ret = blk_pwrite_compressed(blk, 0, sizeof(buf), buf);
+g_assert_cmpint(ret, ==, -ENOTSUP);
+
+/* Early error: Negative offset */
+ret = blk_pwrite_compressed(blk, -2, sizeof(buf), buf);
+g_assert_cmpint(ret, ==, -EIO);
+}
+
 static void test_sync_op_load_vmstate(BdrvChild *c)
 {
 uint8_t buf[512];
@@ -377,6 +391,10 @@ const SyncOpTest sync_op_tests[] = {
 .name   = "/sync-op/pwritev_part",
 .fn = NULL,
 .blkfn  = test_sync_op_blk_pwritev_part,
+}, {
+.name   = "/sync-op/pwrite_compressed",
+.fn = NULL,
+.blkfn  = test_sync_op_blk_pwrite_compressed,
 }, {
 .name   = "/sync-op/load_vmstate",
 .fn = test_sync_op_load_vmstate,
-- 
2.35.3




[PULL 17/35] block: Make blk_co_pwrite() take a const buffer

2022-07-12 Thread Hanna Reitz
From: Alberto Faria 

It does not mutate the buffer.

Signed-off-by: Alberto Faria 
Reviewed-by: Paolo Bonzini 
Reviewed-by: Hanna Reitz 
Message-Id: <20220705161527.1054072-6-afa...@redhat.com>
Signed-off-by: Hanna Reitz 
---
 include/sysemu/block-backend-io.h | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/include/sysemu/block-backend-io.h 
b/include/sysemu/block-backend-io.h
index 8bf1296105..b048476a47 100644
--- a/include/sysemu/block-backend-io.h
+++ b/include/sysemu/block-backend-io.h
@@ -129,7 +129,7 @@ static inline int coroutine_fn blk_co_pread(BlockBackend 
*blk, int64_t offset,
 }
 
 static inline int coroutine_fn blk_co_pwrite(BlockBackend *blk, int64_t offset,
- int64_t bytes, void *buf,
+ int64_t bytes, const void *buf,
  BdrvRequestFlags flags)
 {
 QEMUIOVector qiov = QEMU_IOVEC_INIT_BUF(qiov, buf, bytes);
-- 
2.35.3




[PULL 20/35] block: Add blk_[co_]preadv_part()

2022-07-12 Thread Hanna Reitz
From: Alberto Faria 

Implement blk_preadv_part() using generated_co_wrapper.

Signed-off-by: Alberto Faria 
Reviewed-by: Paolo Bonzini 
Reviewed-by: Hanna Reitz 
Message-Id: <20220705161527.1054072-9-afa...@redhat.com>
Signed-off-by: Hanna Reitz 
---
 block/coroutines.h|  5 -
 include/sysemu/block-backend-io.h |  7 +++
 block/block-backend.c | 30 +++---
 tests/unit/test-block-iothread.c  | 19 +++
 4 files changed, 49 insertions(+), 12 deletions(-)

diff --git a/block/coroutines.h b/block/coroutines.h
index 443ef2f2e6..85423f8db6 100644
--- a/block/coroutines.h
+++ b/block/coroutines.h
@@ -63,11 +63,6 @@ nbd_co_do_establish_connection(BlockDriverState *bs, bool 
blocking,
Error **errp);
 
 
-int coroutine_fn
-blk_co_do_preadv(BlockBackend *blk, int64_t offset, int64_t bytes,
- QEMUIOVector *qiov, BdrvRequestFlags flags);
-
-
 int coroutine_fn
 blk_co_do_pwritev_part(BlockBackend *blk, int64_t offset, int64_t bytes,
QEMUIOVector *qiov, size_t qiov_offset,
diff --git a/include/sysemu/block-backend-io.h 
b/include/sysemu/block-backend-io.h
index cbf4422ea1..877274a999 100644
--- a/include/sysemu/block-backend-io.h
+++ b/include/sysemu/block-backend-io.h
@@ -107,6 +107,13 @@ int generated_co_wrapper blk_pread(BlockBackend *blk, 
int64_t offset,
 int generated_co_wrapper blk_pwrite(BlockBackend *blk, int64_t offset,
 int64_t bytes, const void *buf,
 BdrvRequestFlags flags);
+int generated_co_wrapper blk_preadv_part(BlockBackend *blk, int64_t offset,
+ int64_t bytes, QEMUIOVector *qiov,
+ size_t qiov_offset,
+ BdrvRequestFlags flags);
+int coroutine_fn blk_co_preadv_part(BlockBackend *blk, int64_t offset,
+int64_t bytes, QEMUIOVector *qiov,
+size_t qiov_offset, BdrvRequestFlags 
flags);
 int generated_co_wrapper blk_preadv(BlockBackend *blk, int64_t offset,
 int64_t bytes, QEMUIOVector *qiov,
 BdrvRequestFlags flags);
diff --git a/block/block-backend.c b/block/block-backend.c
index 4d6c18bc39..49db3f63fb 100644
--- a/block/block-backend.c
+++ b/block/block-backend.c
@@ -1280,9 +1280,10 @@ static void coroutine_fn 
blk_wait_while_drained(BlockBackend *blk)
 }
 
 /* To be called between exactly one pair of blk_inc/dec_in_flight() */
-int coroutine_fn
-blk_co_do_preadv(BlockBackend *blk, int64_t offset, int64_t bytes,
- QEMUIOVector *qiov, BdrvRequestFlags flags)
+static int coroutine_fn
+blk_co_do_preadv_part(BlockBackend *blk, int64_t offset, int64_t bytes,
+  QEMUIOVector *qiov, size_t qiov_offset,
+  BdrvRequestFlags flags)
 {
 int ret;
 BlockDriverState *bs;
@@ -1307,7 +1308,8 @@ blk_co_do_preadv(BlockBackend *blk, int64_t offset, 
int64_t bytes,
 bytes, false);
 }
 
-ret = bdrv_co_preadv(blk->root, offset, bytes, qiov, flags);
+ret = bdrv_co_preadv_part(blk->root, offset, bytes, qiov, qiov_offset,
+  flags);
 bdrv_dec_in_flight(bs);
 return ret;
 }
@@ -1320,7 +1322,21 @@ int coroutine_fn blk_co_preadv(BlockBackend *blk, 
int64_t offset,
 IO_OR_GS_CODE();
 
 blk_inc_in_flight(blk);
-ret = blk_co_do_preadv(blk, offset, bytes, qiov, flags);
+ret = blk_co_do_preadv_part(blk, offset, bytes, qiov, 0, flags);
+blk_dec_in_flight(blk);
+
+return ret;
+}
+
+int coroutine_fn blk_co_preadv_part(BlockBackend *blk, int64_t offset,
+int64_t bytes, QEMUIOVector *qiov,
+size_t qiov_offset, BdrvRequestFlags flags)
+{
+int ret;
+IO_OR_GS_CODE();
+
+blk_inc_in_flight(blk);
+ret = blk_co_do_preadv_part(blk, offset, bytes, qiov, qiov_offset, flags);
 blk_dec_in_flight(blk);
 
 return ret;
@@ -1537,8 +1553,8 @@ static void blk_aio_read_entry(void *opaque)
 QEMUIOVector *qiov = rwco->iobuf;
 
 assert(qiov->size == acb->bytes);
-rwco->ret = blk_co_do_preadv(rwco->blk, rwco->offset, acb->bytes,
- qiov, rwco->flags);
+rwco->ret = blk_co_do_preadv_part(rwco->blk, rwco->offset, acb->bytes, 
qiov,
+  0, rwco->flags);
 blk_aio_complete(acb);
 }
 
diff --git a/tests/unit/test-block-iothread.c b/tests/unit/test-block-iothread.c
index b9c5da3a87..2fa1248445 100644
--- a/tests/unit/test-block-iothread.c
+++ b/tests/unit/test-block-iothread.c
@@ -168,6 +168,21 @@ static void test_sync_op_blk_pwritev(BlockBackend *blk)
 g_assert_cmpint(ret, ==, -EIO);
 }
 
+static void test_sync_op_blk

[PULL 26/35] block: Implement blk_flush() using generated_co_wrapper

2022-07-12 Thread Hanna Reitz
From: Alberto Faria 

Signed-off-by: Alberto Faria 
Reviewed-by: Paolo Bonzini 
Reviewed-by: Hanna Reitz 
Message-Id: <20220705161527.1054072-15-afa...@redhat.com>
Signed-off-by: Hanna Reitz 
---
 block/coroutines.h|  2 --
 include/sysemu/block-backend-io.h |  2 +-
 block/block-backend.c | 11 ---
 3 files changed, 1 insertion(+), 14 deletions(-)

diff --git a/block/coroutines.h b/block/coroutines.h
index 2693ecabfb..7e94b9fa83 100644
--- a/block/coroutines.h
+++ b/block/coroutines.h
@@ -110,6 +110,4 @@ nbd_do_establish_connection(BlockDriverState *bs, bool 
blocking, Error **errp);
 int generated_co_wrapper
 blk_do_ioctl(BlockBackend *blk, unsigned long int req, void *buf);
 
-int generated_co_wrapper blk_do_flush(BlockBackend *blk);
-
 #endif /* BLOCK_COROUTINES_H */
diff --git a/include/sysemu/block-backend-io.h 
b/include/sysemu/block-backend-io.h
index 5a12a1f74f..6954c3bb8c 100644
--- a/include/sysemu/block-backend-io.h
+++ b/include/sysemu/block-backend-io.h
@@ -164,8 +164,8 @@ int generated_co_wrapper blk_pdiscard(BlockBackend *blk, 
int64_t offset,
 int coroutine_fn blk_co_pdiscard(BlockBackend *blk, int64_t offset,
  int64_t bytes);
 
+int generated_co_wrapper blk_flush(BlockBackend *blk);
 int coroutine_fn blk_co_flush(BlockBackend *blk);
-int blk_flush(BlockBackend *blk);
 
 int blk_ioctl(BlockBackend *blk, unsigned long int req, void *buf);
 
diff --git a/block/block-backend.c b/block/block-backend.c
index 71585c0c0a..783a77cf6d 100644
--- a/block/block-backend.c
+++ b/block/block-backend.c
@@ -1752,17 +1752,6 @@ int coroutine_fn blk_co_flush(BlockBackend *blk)
 return ret;
 }
 
-int blk_flush(BlockBackend *blk)
-{
-int ret;
-
-blk_inc_in_flight(blk);
-ret = blk_do_flush(blk);
-blk_dec_in_flight(blk);
-
-return ret;
-}
-
 void blk_drain(BlockBackend *blk)
 {
 BlockDriverState *bs = blk_bs(blk);
-- 
2.35.3




[PULL 03/35] block: Make bdrv_{pread,pwrite}() return 0 on success

2022-07-12 Thread Hanna Reitz
From: Alberto Faria 

They currently return the value of their 'bytes' parameter on success.

Make them return 0 instead, for consistency with other I/O functions and
in preparation to implement them using generated_co_wrapper. This also
makes it clear that short reads/writes are not possible.

The few callers that rely on the previous behavior are adjusted
accordingly by hand.

Signed-off-by: Alberto Faria 
Reviewed-by: Paolo Bonzini 
Reviewed-by: Stefan Hajnoczi 
Message-Id: <20220609152744.3891847-4-afa...@redhat.com>
Reviewed-by: Hanna Reitz 
Signed-off-by: Hanna Reitz 
---
 block/cloop.c|  2 +-
 block/crypto.c   |  4 ++--
 block/dmg.c  | 10 +-
 block/io.c   | 10 ++
 block/qcow.c |  2 +-
 block/qcow2.c|  4 ++--
 block/qed.c  |  7 +--
 block/vdi.c  |  2 +-
 block/vmdk.c |  5 ++---
 tests/unit/test-block-iothread.c |  4 ++--
 10 files changed, 19 insertions(+), 31 deletions(-)

diff --git a/block/cloop.c b/block/cloop.c
index 9a2334495e..40b146e714 100644
--- a/block/cloop.c
+++ b/block/cloop.c
@@ -222,7 +222,7 @@ static inline int cloop_read_block(BlockDriverState *bs, 
int block_num)
 
 ret = bdrv_pread(bs->file, s->offsets[block_num], bytes,
  s->compressed_block, 0);
-if (ret != bytes) {
+if (ret < 0) {
 return -1;
 }
 
diff --git a/block/crypto.c b/block/crypto.c
index deec7fae2f..e7f5c4e31a 100644
--- a/block/crypto.c
+++ b/block/crypto.c
@@ -70,7 +70,7 @@ static ssize_t block_crypto_read_func(QCryptoBlock *block,
 error_setg_errno(errp, -ret, "Could not read encryption header");
 return ret;
 }
-return ret;
+return buflen;
 }
 
 static ssize_t block_crypto_write_func(QCryptoBlock *block,
@@ -88,7 +88,7 @@ static ssize_t block_crypto_write_func(QCryptoBlock *block,
 error_setg_errno(errp, -ret, "Could not write encryption header");
 return ret;
 }
-return ret;
+return buflen;
 }
 
 
diff --git a/block/dmg.c b/block/dmg.c
index 5a460c3eb1..98db18d82a 100644
--- a/block/dmg.c
+++ b/block/dmg.c
@@ -390,7 +390,7 @@ static int dmg_read_plist_xml(BlockDriverState *bs, 
DmgHeaderState *ds,
 buffer = g_malloc(info_length + 1);
 buffer[info_length] = '\0';
 ret = bdrv_pread(bs->file, info_begin, info_length, buffer, 0);
-if (ret != info_length) {
+if (ret < 0) {
 ret = -EINVAL;
 goto fail;
 }
@@ -611,7 +611,7 @@ static inline int dmg_read_chunk(BlockDriverState *bs, 
uint64_t sector_num)
  * inflated. */
 ret = bdrv_pread(bs->file, s->offsets[chunk], s->lengths[chunk],
  s->compressed_chunk, 0);
-if (ret != s->lengths[chunk]) {
+if (ret < 0) {
 return -1;
 }
 
@@ -637,7 +637,7 @@ static inline int dmg_read_chunk(BlockDriverState *bs, 
uint64_t sector_num)
  * inflated. */
 ret = bdrv_pread(bs->file, s->offsets[chunk], s->lengths[chunk],
  s->compressed_chunk, 0);
-if (ret != s->lengths[chunk]) {
+if (ret < 0) {
 return -1;
 }
 
@@ -658,7 +658,7 @@ static inline int dmg_read_chunk(BlockDriverState *bs, 
uint64_t sector_num)
  * inflated. */
 ret = bdrv_pread(bs->file, s->offsets[chunk], s->lengths[chunk],
  s->compressed_chunk, 0);
-if (ret != s->lengths[chunk]) {
+if (ret < 0) {
 return -1;
 }
 
@@ -674,7 +674,7 @@ static inline int dmg_read_chunk(BlockDriverState *bs, 
uint64_t sector_num)
 case UDRW: /* copy */
 ret = bdrv_pread(bs->file, s->offsets[chunk], s->lengths[chunk],
  s->uncompressed_chunk, 0);
-if (ret != s->lengths[chunk]) {
+if (ret < 0) {
 return -1;
 }
 break;
diff --git a/block/io.c b/block/io.c
index d4a39b5569..86b5e8b471 100644
--- a/block/io.c
+++ b/block/io.c
@@ -1100,7 +1100,6 @@ int bdrv_make_zero(BdrvChild *child, BdrvRequestFlags 
flags)
 int bdrv_pread(BdrvChild *child, int64_t offset, int64_t bytes, void *buf,
BdrvRequestFlags flags)
 {
-int ret;
 QEMUIOVector qiov = QEMU_IOVEC_INIT_BUF(qiov, buf, bytes);
 IO_CODE();
 
@@ -1108,9 +1107,7 @@ int bdrv_pread(BdrvChild *child, int64_t offset, int64_t 
bytes, void *buf,
 return -EINVAL;
 }
 
-ret = bdrv_preadv(child, offset, bytes, &qiov, flags);
-
-return ret < 0 ? ret : bytes;
+return bdrv_preadv(child, offset, bytes, &qiov, flags);
 }
 
 /* Return no. of bytes on success or < 0 on error. Im

[PULL 16/35] block: Make 'bytes' param of blk_{pread, pwrite}() an int64_t

2022-07-12 Thread Hanna Reitz
From: Alberto Faria 

For consistency with other I/O functions, and in preparation to
implement them using generated_co_wrapper.

Signed-off-by: Alberto Faria 
Reviewed-by: Paolo Bonzini 
Reviewed-by: Hanna Reitz 
Message-Id: <20220705161527.1054072-5-afa...@redhat.com>
Signed-off-by: Hanna Reitz 
---
 include/sysemu/block-backend-io.h | 6 +++---
 block/block-backend.c | 6 +++---
 2 files changed, 6 insertions(+), 6 deletions(-)

diff --git a/include/sysemu/block-backend-io.h 
b/include/sysemu/block-backend-io.h
index e2e7ab9e6c..8bf1296105 100644
--- a/include/sysemu/block-backend-io.h
+++ b/include/sysemu/block-backend-io.h
@@ -101,10 +101,10 @@ int coroutine_fn blk_co_copy_range(BlockBackend *blk_in, 
int64_t off_in,
  * the "I/O or GS" API.
  */
 
-int blk_pread(BlockBackend *blk, int64_t offset, int bytes, void *buf,
+int blk_pread(BlockBackend *blk, int64_t offset, int64_t bytes, void *buf,
   BdrvRequestFlags flags);
-int blk_pwrite(BlockBackend *blk, int64_t offset, int bytes, const void *buf,
-   BdrvRequestFlags flags);
+int blk_pwrite(BlockBackend *blk, int64_t offset, int64_t bytes,
+   const void *buf, BdrvRequestFlags flags);
 int coroutine_fn blk_co_preadv(BlockBackend *blk, int64_t offset,
int64_t bytes, QEMUIOVector *qiov,
BdrvRequestFlags flags);
diff --git a/block/block-backend.c b/block/block-backend.c
index 5fb3890bab..3203b25695 100644
--- a/block/block-backend.c
+++ b/block/block-backend.c
@@ -1563,7 +1563,7 @@ BlockAIOCB *blk_aio_pwrite_zeroes(BlockBackend *blk, 
int64_t offset,
 flags | BDRV_REQ_ZERO_WRITE, cb, opaque);
 }
 
-int blk_pread(BlockBackend *blk, int64_t offset, int bytes, void *buf,
+int blk_pread(BlockBackend *blk, int64_t offset, int64_t bytes, void *buf,
   BdrvRequestFlags flags)
 {
 int ret;
@@ -1577,8 +1577,8 @@ int blk_pread(BlockBackend *blk, int64_t offset, int 
bytes, void *buf,
 return ret;
 }
 
-int blk_pwrite(BlockBackend *blk, int64_t offset, int bytes, const void *buf,
-   BdrvRequestFlags flags)
+int blk_pwrite(BlockBackend *blk, int64_t offset, int64_t bytes,
+   const void *buf, BdrvRequestFlags flags)
 {
 QEMUIOVector qiov = QEMU_IOVEC_INIT_BUF(qiov, buf, bytes);
 IO_OR_GS_CODE();
-- 
2.35.3




[PULL 14/35] block: Add a 'flags' param to blk_pread()

2022-07-12 Thread Hanna Reitz
From: Alberto Faria 

For consistency with other I/O functions, and in preparation to
implement it using generated_co_wrapper.

Callers were updated using this Coccinelle script:

@@ expression blk, offset, buf, bytes; @@
- blk_pread(blk, offset, buf, bytes)
+ blk_pread(blk, offset, buf, bytes, 0)

It had no effect on hw/block/nand.c, presumably due to the #if, so that
file was updated manually.

Overly-long lines were then fixed by hand.
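
For readers unfamiliar with Coccinelle: a semantic patch like the one
above is typically saved to a `.cocci` file and applied tree-wide with
`spatch`. A sketch, assuming Coccinelle is installed (the file name
`blk_pread_flags.cocci` is invented here; the commit does not say where
the script lived):

```shell
# Recreate the semantic patch in a file (hypothetical name).
cat > blk_pread_flags.cocci <<'EOF'
@@ expression blk, offset, buf, bytes; @@
- blk_pread(blk, offset, buf, bytes)
+ blk_pread(blk, offset, buf, bytes, 0)
EOF

# Apply it tree-wide, editing files in place (only if spatch is available).
if command -v spatch >/dev/null 2>&1; then
    spatch --sp-file blk_pread_flags.cocci --in-place --dir .
fi
```

As the message notes, this mechanical rewrite cannot reach code hidden
behind inactive `#if` blocks (hence the manual fix in hw/block/nand.c).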

Signed-off-by: Alberto Faria 
Reviewed-by: Paolo Bonzini 
Reviewed-by: Greg Kurz 
Reviewed-by: Hanna Reitz 
Message-Id: <20220705161527.1054072-3-afa...@redhat.com>
Signed-off-by: Hanna Reitz 
---
 include/sysemu/block-backend-io.h |  3 ++-
 block.c   |  2 +-
 block/block-backend.c |  5 +++--
 block/commit.c|  2 +-
 block/export/fuse.c   |  2 +-
 hw/arm/allwinner-h3.c |  2 +-
 hw/arm/aspeed.c   |  2 +-
 hw/block/block.c  |  2 +-
 hw/block/fdc.c|  6 +++---
 hw/block/hd-geometry.c|  2 +-
 hw/block/m25p80.c |  2 +-
 hw/block/nand.c   | 12 ++--
 hw/block/onenand.c| 12 ++--
 hw/ide/atapi.c|  4 ++--
 hw/misc/mac_via.c |  2 +-
 hw/misc/sifive_u_otp.c|  4 ++--
 hw/nvram/eeprom_at24c.c   |  2 +-
 hw/nvram/spapr_nvram.c|  2 +-
 hw/nvram/xlnx-bbram.c |  2 +-
 hw/nvram/xlnx-efuse.c |  2 +-
 hw/ppc/pnv_pnor.c |  2 +-
 hw/sd/sd.c|  2 +-
 migration/block.c |  4 ++--
 nbd/server.c  |  4 ++--
 qemu-img.c| 12 ++--
 qemu-io-cmds.c|  2 +-
 tests/unit/test-block-iothread.c  |  4 ++--
 27 files changed, 52 insertions(+), 50 deletions(-)

diff --git a/include/sysemu/block-backend-io.h 
b/include/sysemu/block-backend-io.h
index ccef514023..5c5c108e0d 100644
--- a/include/sysemu/block-backend-io.h
+++ b/include/sysemu/block-backend-io.h
@@ -101,7 +101,8 @@ int coroutine_fn blk_co_copy_range(BlockBackend *blk_in, 
int64_t off_in,
  * the "I/O or GS" API.
  */
 
-int blk_pread(BlockBackend *blk, int64_t offset, void *buf, int bytes);
+int blk_pread(BlockBackend *blk, int64_t offset, void *buf, int bytes,
+  BdrvRequestFlags flags);
 int blk_pwrite(BlockBackend *blk, int64_t offset, const void *buf, int bytes,
BdrvRequestFlags flags);
 int coroutine_fn blk_co_preadv(BlockBackend *blk, int64_t offset,
diff --git a/block.c b/block.c
index 0fd830e2e2..ed701b4889 100644
--- a/block.c
+++ b/block.c
@@ -1037,7 +1037,7 @@ static int find_image_format(BlockBackend *file, const 
char *filename,
 return ret;
 }
 
-ret = blk_pread(file, 0, buf, sizeof(buf));
+ret = blk_pread(file, 0, buf, sizeof(buf), 0);
 if (ret < 0) {
 error_setg_errno(errp, -ret, "Could not read image for determining its 
"
  "format");
diff --git a/block/block-backend.c b/block/block-backend.c
index 4c5e130615..bdde90d235 100644
--- a/block/block-backend.c
+++ b/block/block-backend.c
@@ -1563,14 +1563,15 @@ BlockAIOCB *blk_aio_pwrite_zeroes(BlockBackend *blk, 
int64_t offset,
 flags | BDRV_REQ_ZERO_WRITE, cb, opaque);
 }
 
-int blk_pread(BlockBackend *blk, int64_t offset, void *buf, int bytes)
+int blk_pread(BlockBackend *blk, int64_t offset, void *buf, int bytes,
+  BdrvRequestFlags flags)
 {
 int ret;
 QEMUIOVector qiov = QEMU_IOVEC_INIT_BUF(qiov, buf, bytes);
 IO_OR_GS_CODE();
 
 blk_inc_in_flight(blk);
-ret = blk_do_preadv(blk, offset, bytes, &qiov, 0);
+ret = blk_do_preadv(blk, offset, bytes, &qiov, flags);
 blk_dec_in_flight(blk);
 
 return ret;
diff --git a/block/commit.c b/block/commit.c
index 851d1c557a..e5b3ad08da 100644
--- a/block/commit.c
+++ b/block/commit.c
@@ -527,7 +527,7 @@ int bdrv_commit(BlockDriverState *bs)
 goto ro_cleanup;
 }
 if (ret) {
-ret = blk_pread(src, offset, buf, n);
+ret = blk_pread(src, offset, buf, n, 0);
 if (ret < 0) {
 goto ro_cleanup;
 }
diff --git a/block/export/fuse.c b/block/export/fuse.c
index e80b24a867..dcf8f225f3 100644
--- a/block/export/fuse.c
+++ b/block/export/fuse.c
@@ -554,7 +554,7 @@ static void fuse_read(fuse_req_t req, fuse_ino_t inode,
 return;
 }
 
-ret = blk_pread(exp->common.blk, offset, buf, size);
+ret = blk_pread(exp->common.blk, offset, buf, size, 0);
 if (ret >= 0) {
 fuse_reply_buf(req, buf, size);
 } else {
diff --git a/hw/arm/allwinner-h3.c b/hw/arm/allwinner-h3.c
index 318ed4348c..788083b6fa 100644
--- a/hw/arm/allwinner-h3.c
+++ b/hw/arm/allwinner-h3.c
@@ -174,7 +174,7 @@ void allwinner_h3_bootrom_setup(AwH3State *s, B

[PULL 12/35] tests/qemu-iotests: skip 108 when FUSE is not loaded

2022-07-12 Thread Hanna Reitz
From: John Snow 

In certain container environments we may not have FUSE at all, so skip
the test in this circumstance too.

Signed-off-by: John Snow 
Message-Id: <20220616142659.3184115-3-js...@redhat.com>
Reviewed-by: Thomas Huth 
Reviewed-by: Hanna Reitz 
Signed-off-by: Hanna Reitz 
---
 tests/qemu-iotests/108 | 5 +
 1 file changed, 5 insertions(+)

diff --git a/tests/qemu-iotests/108 b/tests/qemu-iotests/108
index 9e923d6a59..54e935acf2 100755
--- a/tests/qemu-iotests/108
+++ b/tests/qemu-iotests/108
@@ -60,6 +60,11 @@ if sudo -n losetup &>/dev/null; then
 else
 loopdev=false
 
+# Check for usable FUSE in the host environment:
+if test ! -c "/dev/fuse"; then
+_notrun 'No passwordless sudo nor usable /dev/fuse'
+fi
+
 # QSD --export fuse will either yield "Parameter 'id' is missing"
 # or "Invalid parameter 'fuse'", depending on whether there is
 # FUSE support or not.
-- 
2.35.3




[PULL 02/35] block: Change bdrv_{pread, pwrite, pwrite_sync}() param order

2022-07-12 Thread Hanna Reitz
From: Alberto Faria 

Swap 'buf' and 'bytes' around for consistency with
bdrv_co_{pread,pwrite}(), and in preparation to implement these
functions using generated_co_wrapper.

Callers were updated using this Coccinelle script:

@@ expression child, offset, buf, bytes, flags; @@
- bdrv_pread(child, offset, buf, bytes, flags)
+ bdrv_pread(child, offset, bytes, buf, flags)

@@ expression child, offset, buf, bytes, flags; @@
- bdrv_pwrite(child, offset, buf, bytes, flags)
+ bdrv_pwrite(child, offset, bytes, buf, flags)

@@ expression child, offset, buf, bytes, flags; @@
- bdrv_pwrite_sync(child, offset, buf, bytes, flags)
+ bdrv_pwrite_sync(child, offset, bytes, buf, flags)

Resulting overly-long lines were then fixed by hand.

Signed-off-by: Alberto Faria 
Reviewed-by: Paolo Bonzini 
Reviewed-by: Stefan Hajnoczi 
Reviewed-by: Vladimir Sementsov-Ogievskiy 
Message-Id: <20220609152744.3891847-3-afa...@redhat.com>
Reviewed-by: Hanna Reitz 
Signed-off-by: Hanna Reitz 
---
 include/block/block-io.h | 10 +++---
 block/blklogwrites.c |  6 ++--
 block/bochs.c| 10 +++---
 block/cloop.c| 10 +++---
 block/crypto.c   |  4 +--
 block/dmg.c  | 26 +++
 block/io.c   | 12 +++
 block/parallels-ext.c|  6 ++--
 block/parallels.c| 10 +++---
 block/qcow.c | 34 +--
 block/qcow2-bitmap.c | 14 
 block/qcow2-cache.c  |  8 ++---
 block/qcow2-cluster.c| 22 ++---
 block/qcow2-refcount.c   | 56 +---
 block/qcow2-snapshot.c   | 48 +--
 block/qcow2.c| 47 ++-
 block/qed.c  |  8 ++---
 block/vdi.c  | 14 
 block/vhdx-log.c | 18 +-
 block/vhdx.c | 28 
 block/vmdk.c | 50 ++--
 block/vpc.c  | 22 ++---
 block/vvfat.c| 10 +++---
 tests/unit/test-block-iothread.c |  8 ++---
 24 files changed, 242 insertions(+), 239 deletions(-)

diff --git a/include/block/block-io.h b/include/block/block-io.h
index 1a02d235f8..ecce56c096 100644
--- a/include/block/block-io.h
+++ b/include/block/block-io.h
@@ -42,12 +42,12 @@
 int bdrv_pwrite_zeroes(BdrvChild *child, int64_t offset,
int64_t bytes, BdrvRequestFlags flags);
 int bdrv_make_zero(BdrvChild *child, BdrvRequestFlags flags);
-int bdrv_pread(BdrvChild *child, int64_t offset, void *buf, int64_t bytes,
+int bdrv_pread(BdrvChild *child, int64_t offset, int64_t bytes, void *buf,
BdrvRequestFlags flags);
-int bdrv_pwrite(BdrvChild *child, int64_t offset, const void *buf,
-int64_t bytes, BdrvRequestFlags flags);
-int bdrv_pwrite_sync(BdrvChild *child, int64_t offset,
- const void *buf, int64_t bytes, BdrvRequestFlags flags);
+int bdrv_pwrite(BdrvChild *child, int64_t offset, int64_t bytes,
+const void *buf, BdrvRequestFlags flags);
+int bdrv_pwrite_sync(BdrvChild *child, int64_t offset, int64_t bytes,
+ const void *buf, BdrvRequestFlags flags);
 /*
  * Efficiently zero a region of the disk image.  Note that this is a regular
  * I/O request like read or write and should have a reasonable size.  This
diff --git a/block/blklogwrites.c b/block/blklogwrites.c
index c5c021e6f8..e3c6c4039c 100644
--- a/block/blklogwrites.c
+++ b/block/blklogwrites.c
@@ -107,8 +107,8 @@ static uint64_t 
blk_log_writes_find_cur_log_sector(BdrvChild *log,
 struct log_write_entry cur_entry;
 
 while (cur_idx < nr_entries) {
-int read_ret = bdrv_pread(log, cur_sector << sector_bits, &cur_entry,
-  sizeof(cur_entry), 0);
+int read_ret = bdrv_pread(log, cur_sector << sector_bits,
+  sizeof(cur_entry), &cur_entry, 0);
 if (read_ret < 0) {
 error_setg_errno(errp, -read_ret,
  "Failed to read log entry %"PRIu64, cur_idx);
@@ -190,7 +190,7 @@ static int blk_log_writes_open(BlockDriverState *bs, QDict 
*options, int flags,
 log_sb.nr_entries = cpu_to_le64(0);
 log_sb.sectorsize = cpu_to_le32(BDRV_SECTOR_SIZE);
 } else {
-ret = bdrv_pread(s->log_file, 0, &log_sb, sizeof(log_sb), 0);
+ret = bdrv_pread(s->log_file, 0, sizeof(log_sb), &log_sb, 0);
 if (ret < 0) {
 error_setg_errno(errp, -ret, "Could not read log superblock");
 goto fail_log;
diff --git a/block/bochs.c b/block/bochs.c
index 46d0f6a693..b76f34fe03 100644
--- a/block/bochs.c
+++ b/block/bochs.c
@@ -116,7 +116,7 @@ static int 

[PULL 10/35] block/qcow2: Use bdrv_pwrite_sync() in qcow2_mark_dirty()

2022-07-12 Thread Hanna Reitz
From: Alberto Faria 

Use bdrv_pwrite_sync() instead of calling bdrv_pwrite() and bdrv_flush()
separately.

Signed-off-by: Alberto Faria 
Reviewed-by: Eric Blake 
Reviewed-by: Stefan Hajnoczi 
Message-Id: <20220609152744.3891847-11-afa...@redhat.com>
Reviewed-by: Hanna Reitz 
Signed-off-by: Hanna Reitz 
---
 block/qcow2.c | 9 +++--
 1 file changed, 3 insertions(+), 6 deletions(-)

diff --git a/block/qcow2.c b/block/qcow2.c
index f2fb54c51f..90a2dd406b 100644
--- a/block/qcow2.c
+++ b/block/qcow2.c
@@ -516,12 +516,9 @@ int qcow2_mark_dirty(BlockDriverState *bs)
 }
 
 val = cpu_to_be64(s->incompatible_features | QCOW2_INCOMPAT_DIRTY);
-ret = bdrv_pwrite(bs->file, offsetof(QCowHeader, incompatible_features),
-  sizeof(val), &val, 0);
-if (ret < 0) {
-return ret;
-}
-ret = bdrv_flush(bs->file->bs);
+ret = bdrv_pwrite_sync(bs->file,
+   offsetof(QCowHeader, incompatible_features),
+   sizeof(val), &val, 0);
 if (ret < 0) {
 return ret;
 }
-- 
2.35.3




[PULL 09/35] block: Use bdrv_co_pwrite_sync() when caller is coroutine_fn

2022-07-12 Thread Hanna Reitz
From: Alberto Faria 

Convert uses of bdrv_pwrite_sync() into bdrv_co_pwrite_sync() when the
callers are already coroutine_fn.

Signed-off-by: Alberto Faria 
Reviewed-by: Vladimir Sementsov-Ogievskiy 
Reviewed-by: Stefan Hajnoczi 
Message-Id: <20220609152744.3891847-10-afa...@redhat.com>
Reviewed-by: Hanna Reitz 
Signed-off-by: Hanna Reitz 
---
 block/parallels.c  | 2 +-
 block/qcow2-snapshot.c | 6 +++---
 block/qcow2.c  | 4 ++--
 3 files changed, 6 insertions(+), 6 deletions(-)

diff --git a/block/parallels.c b/block/parallels.c
index f22444efff..8b23b9580d 100644
--- a/block/parallels.c
+++ b/block/parallels.c
@@ -481,7 +481,7 @@ static int coroutine_fn parallels_co_check(BlockDriverState 
*bs,
 
 ret = 0;
 if (flush_bat) {
-ret = bdrv_pwrite_sync(bs->file, 0, s->header_size, s->header, 0);
+ret = bdrv_co_pwrite_sync(bs->file, 0, s->header_size, s->header, 0);
 if (ret < 0) {
 res->check_errors++;
 goto out;
diff --git a/block/qcow2-snapshot.c b/block/qcow2-snapshot.c
index 60e0461632..d1d46facbf 100644
--- a/block/qcow2-snapshot.c
+++ b/block/qcow2-snapshot.c
@@ -512,9 +512,9 @@ int coroutine_fn 
qcow2_check_read_snapshot_table(BlockDriverState *bs,
 assert(fix & BDRV_FIX_ERRORS);
 
 snapshot_table_pointer.nb_snapshots = cpu_to_be32(s->nb_snapshots);
-ret = bdrv_pwrite_sync(bs->file, offsetof(QCowHeader, nb_snapshots),
-   sizeof(snapshot_table_pointer.nb_snapshots),
-   &snapshot_table_pointer.nb_snapshots, 0);
+ret = bdrv_co_pwrite_sync(bs->file, offsetof(QCowHeader, nb_snapshots),
+  sizeof(snapshot_table_pointer.nb_snapshots),
+  &snapshot_table_pointer.nb_snapshots, 0);
 if (ret < 0) {
 result->check_errors++;
 fprintf(stderr, "ERROR failed to update the snapshot count in the "
diff --git a/block/qcow2.c b/block/qcow2.c
index c43238a006..f2fb54c51f 100644
--- a/block/qcow2.c
+++ b/block/qcow2.c
@@ -4551,8 +4551,8 @@ static int coroutine_fn 
qcow2_co_truncate(BlockDriverState *bs, int64_t offset,
 
 /* write updated header.size */
 offset = cpu_to_be64(offset);
-ret = bdrv_pwrite_sync(bs->file, offsetof(QCowHeader, size),
-   sizeof(offset), &offset, 0);
+ret = bdrv_co_pwrite_sync(bs->file, offsetof(QCowHeader, size),
+  sizeof(offset), &offset, 0);
 if (ret < 0) {
 error_setg_errno(errp, -ret, "Failed to update the image size");
 goto fail;
-- 
2.35.3



