Re: Suspicious QOM types without instance/class size

2020-08-20 Thread David Gibson
On Thu, Aug 20, 2020 at 05:55:29PM -0400, Eduardo Habkost wrote:
> While trying to convert TypeInfo declarations to the new
> OBJECT_DECLARE* macros, I've stumbled on a few suspicious cases
> where instance_size or class_size is not set, despite having type
> checker macros that use a specific type.
> 
> The ones with "WARNING" are abstract types (maybe not serious if
> subclasses set the appropriate sizes).  The ones with "ERROR"
> don't seem to be abstract types.


Comment on the ones within my area:
> 
> WARNING: hw/input/adb.c:310:1: class_size should be set to 
> sizeof(ADBDeviceClass)?

Yeah, that looks like a bug (though we'll get away with it because
it's abstract).
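
(For reference, the fix amounts to filling in the missing size field in the
type's TypeInfo.  A minimal sketch of the shape of such a fix, reusing the
identifiers from the warning above -- this is not the actual patch, and the
adb.c-specific names around it are assumed:)

/* Sketch only: an abstract QOM type that declares class_size so every
 * subclass gets an ADBDeviceClass-sized class struct allocated. */
static const TypeInfo adb_device_type_info = {
    .name          = TYPE_ADB_DEVICE,            /* assumed to exist in adb.c */
    .parent        = TYPE_DEVICE,
    .instance_size = sizeof(ADBDevice),
    .abstract      = true,
    .class_size    = sizeof(ADBDeviceClass),     /* the field the checker flags */
    .class_init    = adb_device_class_init,      /* assumed to exist in adb.c */
};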

> WARNING: hw/ppc/pnv_lpc.c:771:1: instance_size should be set to 
> sizeof(PnvLpcController)?

Ditto.

Should I make fixes for these, or will you?

> ERROR: hw/ppc/spapr_drc.c:771:1: instance_size should be set to 
> sizeof(SpaprDrc)?

I'm confused by this one.  I'm not exactly sure which definition is
tripping the error, and AFAICT they should all be correctly inheriting
instance_size from either TYPE_SPAPR_DR_CONNECTOR or
TYPE_SPAPR_DRC_PHYSICAL.  If anything, it looks like
TYPE_SPAPR_DRC_PHB could drop its explicit override of instance_size.

-- 
David Gibson| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au  | minimalist, thank you.  NOT _the_ _other_
| _way_ _around_!
http://www.ozlabs.org/~dgibson




Re: [PATCH 0/1] qcow2: Skip copy-on-write when allocating a zero cluster

2020-08-20 Thread Dave Chinner
On Thu, Aug 20, 2020 at 10:03:10PM +0200, Alberto Garcia wrote:
> Cc: linux-xfs
> 
> On Wed 19 Aug 2020 07:53:00 PM CEST, Brian Foster wrote:
> > In any event, if you're seeing unclear or unexpected performance
> > deltas between certain XFS configurations or other fs', I think the
> > best thing to do is post a more complete description of the workload,
> > filesystem/storage setup, and test results to the linux-xfs mailing
> > list (feel free to cc me as well). As it is, aside from the questions
> > above, it's not really clear to me what the storage stack looks like
> > for this test, if/how qcow2 is involved, what the various
> > 'preallocation=' modes actually mean, etc.
> 
> (see [1] for a bit of context)
> 
> I repeated the tests with a larger (125GB) filesystem. Things are a bit
> faster but not radically different, here are the new numbers:
> 
> |--+---+---|
> | preallocation mode   |   xfs |  ext4 |
> |--+---+---|
> | off  |  8139 | 11688 |
> | off (w/o ZERO_RANGE) |  2965 |  2780 |
> | metadata |  7768 |  9132 |
> | falloc   |  7742 | 13108 |
> | full | 41389 | 16351 |
> |--+---+---|
> 
> The numbers are I/O operations per second as reported by fio, running
> inside a VM.
> 
> The VM is running Debian 9.7 with Linux 4.9.130 and the fio version is
> 2.16-1. I'm using QEMU 5.1.0.
> 
> fio is sending random 4KB write requests to a 25GB virtual drive, this
> is the full command line:
> 
> fio --filename=/dev/vdb --direct=1 --randrepeat=1 --eta=always
> --ioengine=libaio --iodepth=32 --numjobs=1 --name=test --size=25G
> --io_limit=25G --ramp_time=5 --rw=randwrite --bs=4k --runtime=60
>   
> The virtual drive (/dev/vdb) is a freshly created qcow2 file stored on
> the host (on an xfs or ext4 filesystem as the table above shows), and
> it is attached to QEMU using a virtio-blk-pci device:
> 
>-drive if=virtio,file=image.qcow2,cache=none,l2-cache-size=200M

You're not using AIO on this image file, so it can't do
concurrent IO? What happens when you add "aio=native" to this?

> cache=none means that the image is opened with O_DIRECT and
> l2-cache-size is large enough so QEMU is able to cache all the
> relevant qcow2 metadata in memory.

What happens when you just use a sparse file (i.e. a raw image) with
aio=native instead of using qcow2? XFS, ext4, btrfs, etc all support
sparse files so using qcow2 to provide sparse image file support is
largely an unnecessary layer of indirection and overhead...
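
(A raw sparse image on any of those filesystems is just a truncated empty
file; blocks are allocated lazily as the guest writes.  A minimal sketch,
with the file name and size made up for illustration:)

#define _FILE_OFFSET_BITS 64
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int main(void)
{
    /* 25G logical size, zero data blocks actually allocated on disk */
    int fd = open("vm.img", O_CREAT | O_WRONLY, 0644);

    if (fd < 0 || ftruncate(fd, 25LL * 1024 * 1024 * 1024) < 0) {
        perror("vm.img");
        return 1;
    }
    close(fd);
    return 0;
}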

And with XFS, you don't need qcow2 for snapshots either because you
can use reflink copies to take an atomic copy-on-write snapshot of
the raw image file... (assuming you made the xfs filesystem with
reflink support (which is the TOT default now)).

I've been using raw sparse files on XFS for all my VMs for over a
decade now, and using reflink to create COW copies of golden
image files when deploying new VMs for a couple of years now...
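
(The reflink copy mentioned above is a single clone ioctl on the image file,
roughly what `cp --reflink=always golden.img vm.img` does; the sketch below
uses made-up paths and trims error handling:)

#include <fcntl.h>
#include <stdio.h>
#include <sys/ioctl.h>
#include <linux/fs.h>      /* FICLONE */

int main(void)
{
    int src = open("golden.img", O_RDONLY);
    int dst = open("vm.img", O_CREAT | O_WRONLY | O_TRUNC, 0644);

    /* Share all of golden.img's extents with vm.img, copy-on-write. */
    if (src < 0 || dst < 0 || ioctl(dst, FICLONE, src) < 0) {
        perror("reflink copy");
        return 1;
    }
    return 0;
}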

> The host is running Linux 4.19.132 and has an SSD drive.
> 
> About the preallocation modes: a qcow2 file is divided into clusters
> of the same size (64KB in this case). That is the minimum unit of
> allocation, so when writing 4KB to an unallocated cluster QEMU needs
> to fill the other 60KB with zeroes. So here's what happens with the
> different modes:

Which is something that sparse files on filesystems do not need to
do. If, on XFS, you really want 64kB allocation clusters, use an
extent size hint of 64kB. Though for image files, I highly recommend
using 1MB or larger extent size hints.
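
(The extent size hint is a per-file attribute set via the FS_IOC_FSSETXATTR
ioctl, roughly what `xfs_io -c "extsize 1m" image.raw` does; it should be set
while the file is still empty.  Sketch, error handling trimmed:)

#include <fcntl.h>
#include <stdio.h>
#include <sys/ioctl.h>
#include <linux/fs.h>      /* struct fsxattr, FS_IOC_*XATTR, FS_XFLAG_EXTSIZE */

int main(int argc, char **argv)
{
    struct fsxattr fsx;
    int fd;

    if (argc < 2 || (fd = open(argv[1], O_RDWR)) < 0 ||
        ioctl(fd, FS_IOC_FSGETXATTR, &fsx) < 0) {
        perror("open/FS_IOC_FSGETXATTR");
        return 1;
    }
    fsx.fsx_xflags |= FS_XFLAG_EXTSIZE;   /* enable the per-file hint */
    fsx.fsx_extsize = 1024 * 1024;        /* hint is in bytes: 1 MiB */
    if (ioctl(fd, FS_IOC_FSSETXATTR, &fsx) < 0) {
        perror("FS_IOC_FSSETXATTR");
        return 1;
    }
    return 0;
}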


> 1) off: for every write request QEMU initializes the cluster (64KB)
> with fallocate(ZERO_RANGE) and then writes the 4KB of data.
> 
> 2) off w/o ZERO_RANGE: QEMU writes the 4KB of data and fills the rest
> of the cluster with zeroes.
> 
> 3) metadata: all clusters were allocated when the image was created
> but they are sparse, QEMU only writes the 4KB of data.
> 
> 4) falloc: all clusters were allocated with fallocate() when the image
> was created, QEMU only writes 4KB of data.
> 
> 5) full: all clusters were allocated by writing zeroes to all of them
> when the image was created, QEMU only writes 4KB of data.
> 
> As I said in a previous message I'm not familiar with xfs, but the
> parts that I don't understand are
> 
>- Why is (4) slower than (1)?

Because fallocate() is a full IO serialisation barrier at the
filesystem level. If you do:

fallocate(whole file)



.

The IO can run concurrently and does not serialise against anything in
the filesystem except unwritten extent conversions at IO completion
(see answer to next question!)

However, if you just use (4) you get:

falloc(64k)
  
  
<4k io>
  
falloc(64k)
  
  
  <4k IO completes, converts 4k to written>
  
<4k io>
falloc(64k)
  
  
  <4k IO completes, converts 4k to written>
  
<4k io>
  

until all the 

Re: [PATCH v2 0/3] hw/sd: Add Cadence SDHCI emulation

2020-08-20 Thread Bin Meng
Hi Philippe,

On Fri, Aug 21, 2020 at 2:04 AM Philippe Mathieu-Daudé  wrote:
>
> Hi Sai Pavan, you said you were interested to test the first 2
> patches. FYI I plan to queue them and send the pull request tomorrow
> or Saturday the latest.

Have you had a chance to review v2 of the 3rd patch?

"hw/sd: Add Cadence SDHCI emulation"

Regards,
Bin



Re: [PATCH v4 4/4] iotests: Test node/bitmap aliases during migration

2020-08-20 Thread Eric Blake

On 8/20/20 10:49 AM, Vladimir Sementsov-Ogievskiy wrote:

# MYPYPATH=../../python/ mypy 300
300:154: error: Item "None" of "Optional[Match[Any]]" has no attribute 
"group"

Found 1 error in 1 file (checked 1 source file)

- the only complaint. Suggested fix:

diff --git a/tests/qemu-iotests/300 b/tests/qemu-iotests/300
index c6d86b1dbc..0241903743 100755
--- a/tests/qemu-iotests/300
+++ b/tests/qemu-iotests/300
@@ -148,11 +148,11 @@ class TestDirtyBitmapMigration(iotests.QMPTestCase):
             result = vm.qmp('human-monitor-command',
                             command_line='info migrate_parameters')
 
-            hmp_mapping = re.search(r'^block-bitmap-mapping:\r?(\n  .*)*\n',
-                                    result['return'], flags=re.MULTILINE)
+            m = re.search(r'^block-bitmap-mapping:\r?(\n  .*)*\n',
+                          result['return'], flags=re.MULTILINE)
+            hmp_mapping = m.group(0).replace('\r', '') if m else None
 
-            self.assertEqual(hmp_mapping.group(0).replace('\r', ''),
-                             self.to_hmp_mapping(mapping))
+            self.assertEqual(hmp_mapping, self.to_hmp_mapping(mapping))
         else:
             self.assert_qmp(result, 'error/desc', error)


Thanks; I'll fold that into v5.



===

# flake8 300
300:24:1: F401 'time' imported but unused
300:34:1: E302 expected 2 blank lines, found 1
300:42:67: E502 the backslash is redundant between brackets
300:47:67: E502 the backslash is redundant between brackets
300:67:53: E502 the backslash is redundant between brackets
300:122:9: E125 continuation line with same indent as next logical line
300:134:9: E125 continuation line with same indent as next logical line
300:185:52: E502 the backslash is redundant between brackets
300:186:72: E502 the backslash is redundant between brackets
300:285:77: E502 the backslash is redundant between brackets
300:305:77: E502 the backslash is redundant between brackets
300:306:78: E502 the backslash is redundant between brackets
300:330:77: E502 the backslash is redundant between brackets
300:350:77: E502 the backslash is redundant between brackets
300:385:57: E502 the backslash is redundant between brackets
300:386:59: E502 the backslash is redundant between brackets
300:387:67: E502 the backslash is redundant between brackets
300:412:78: E502 the backslash is redundant between brackets
300:425:78: E502 the backslash is redundant between brackets
300:435:78: E502 the backslash is redundant between brackets
300:436:76: E502 the backslash is redundant between brackets
300:451:66: E502 the backslash is redundant between brackets
300:473:78: E502 the backslash is redundant between brackets
300:474:79: E502 the backslash is redundant between brackets
300:488:78: E502 the backslash is redundant between brackets
300:489:77: E502 the backslash is redundant between brackets

- I'm posting this just because the ALE plugin in vim highlights all these
things for me. Up to you; I'm not asking you to fix them.


Seems like an easy enough touchup to make.



+    def test_alias_on_both_migration(self) -> None:
+        src_map = self.mapping(self.src_node_name, 'node-alias',
+                               self.src_bmap_name, 'bmap-alias')
+
+        dst_map = self.mapping(self.dst_node_name, 'node-alias',
+                               self.dst_bmap_name, 'bmap-alias')
+
+        self.set_mapping(self.vm_a, src_map)
+        self.set_mapping(self.vm_b, dst_map)
+        self.migrate()
+        self.verify_dest_error(None)


Hmm, probably verify_dest_error() should be called from migrate(), as
you (almost) always call it after migrate().


This one I did not touch; it can be a followup patch if desired.




+
+


[..]


+    def test_unused_mapping_on_dst(self) -> None:
+        # Let the source not send any bitmaps
+        self.set_mapping(self.vm_a, [])
+
+        # Establish some mapping on the destination
+        self.set_mapping(self.vm_b, [])


It seems you wanted to specify a non-empty mapping for vm_b, yes?
With any non-empty mapping here, just to better correspond to the comments:
Reviewed-by: Vladimir Sementsov-Ogievskiy 

(or keep this case with both mappings empty, and add one similar with 
empty mapping only on src)


Adding another case will now have to be a separate patch.  At any rate, 
I've taken the liberty of including your R-b in my staging tree for the 
pending pull request, let me know if I should drop it.


--
Eric Blake, Principal Software Engineer
Red Hat, Inc.   +1-919-301-3226
Virtualization:  qemu.org | libvirt.org




Re: [PATCH v5 3/3] iotests: Test node/bitmap aliases during migration

2020-08-20 Thread Eric Blake

On 8/20/20 10:07 AM, Max Reitz wrote:

Signed-off-by: Max Reitz 
---
  tests/qemu-iotests/300 | 595 +
  tests/qemu-iotests/300.out |   5 +
  tests/qemu-iotests/group   |   1 +
  3 files changed, 601 insertions(+)
  create mode 100755 tests/qemu-iotests/300
  create mode 100644 tests/qemu-iotests/300.out



The changes in patch 1 helped; the test completes for me in 11 s. 
Minimal differences to v4, and the switch to discard is nice.


Tested-by: Eric Blake 
Reviewed-by: Eric Blake 

I've gone ahead and queued this on my bitmaps tree with testing on top 
of Paolo's meson conversion; I'm prepared to send a pull request once 
that lands.


--
Eric Blake, Principal Software Engineer
Red Hat, Inc.   +1-919-301-3226
Virtualization:  qemu.org | libvirt.org




Re: [PATCH v5 1/3] migration: Add block-bitmap-mapping parameter

2020-08-20 Thread Eric Blake

On 8/20/20 10:07 AM, Max Reitz wrote:

This migration parameter allows mapping block node names and bitmap
names to aliases for the purpose of block dirty bitmap migration.

This way, management tools can use different node and bitmap names on
the source and destination and pass the mapping of how bitmaps are to be
transferred to qemu (on the source, the destination, or even both with
arbitrary aliases in the migration stream).

While touching this code, fix a bug where bitmap names longer than 255
bytes would fail an assertion in qemu_put_counted_string().

Suggested-by: Vladimir Sementsov-Ogievskiy 
Signed-off-by: Max Reitz 
---


Changes from v4 look sane.

Reviewed-by: Eric Blake 

--
Eric Blake, Principal Software Engineer
Red Hat, Inc.   +1-919-301-3226
Virtualization:  qemu.org | libvirt.org




[PATCH v2 4/5] qemu-iotests: Merge FilePaths and FilePath

2020-08-20 Thread Nir Soffer
FilePath now creates one temporary file:

with FilePath("a") as a:

Or more:

with FilePath("a", "b", "c") as (a, b, c):

This is also the behavior of the file_path() helper, used by some of the
tests. Now we have only 2 helpers for creating temporary files instead
of 3.

Signed-off-by: Nir Soffer 
---
 tests/qemu-iotests/194|  2 +-
 tests/qemu-iotests/208|  2 +-
 tests/qemu-iotests/222|  2 +-
 tests/qemu-iotests/257|  4 ++--
 tests/qemu-iotests/iotests.py | 23 +++
 5 files changed, 16 insertions(+), 17 deletions(-)

diff --git a/tests/qemu-iotests/194 b/tests/qemu-iotests/194
index 08389f474e..7a4863ab18 100755
--- a/tests/qemu-iotests/194
+++ b/tests/qemu-iotests/194
@@ -26,7 +26,7 @@ iotests.script_initialize(supported_fmts=['qcow2', 'qed', 
'raw'],
 
 with iotests.FilePath('source.img') as source_img_path, \
  iotests.FilePath('dest.img') as dest_img_path, \
- iotests.FilePaths('migration.sock', 'nbd.sock', 
base_dir=iotests.sock_dir) \
+ iotests.FilePath('migration.sock', 'nbd.sock', base_dir=iotests.sock_dir) 
\
 as (migration_sock_path, nbd_sock_path), \
  iotests.VM('source') as source_vm, \
  iotests.VM('dest') as dest_vm:
diff --git a/tests/qemu-iotests/208 b/tests/qemu-iotests/208
index 6cb642f821..54aa4be273 100755
--- a/tests/qemu-iotests/208
+++ b/tests/qemu-iotests/208
@@ -26,7 +26,7 @@ iotests.script_initialize(supported_fmts=['generic'])
 
 with iotests.FilePath('disk.img') as disk_img_path, \
  iotests.FilePath('disk-snapshot.img') as disk_snapshot_img_path, \
- iotests.FilePath('nbd.sock', iotests.sock_dir) as nbd_sock_path, \
+ iotests.FilePath('nbd.sock', base_dir=iotests.sock_dir) as nbd_sock_path, 
\
  iotests.VM() as vm:
 
 img_size = '10M'
diff --git a/tests/qemu-iotests/222 b/tests/qemu-iotests/222
index 6602f6b4ba..14d67c875b 100755
--- a/tests/qemu-iotests/222
+++ b/tests/qemu-iotests/222
@@ -49,7 +49,7 @@ remainder = [("0xd5", "0x108000",  "32k"), # Right-end of 
partial-left [1]
 
 with iotests.FilePath('base.img') as base_img_path, \
  iotests.FilePath('fleece.img') as fleece_img_path, \
- iotests.FilePath('nbd.sock', iotests.sock_dir) as nbd_sock_path, \
+ iotests.FilePath('nbd.sock', base_dir=iotests.sock_dir) as nbd_sock_path, 
\
  iotests.VM() as vm:
 
 log('--- Setting up images ---')
diff --git a/tests/qemu-iotests/257 b/tests/qemu-iotests/257
index a9aa65bbe3..c80e06ae28 100755
--- a/tests/qemu-iotests/257
+++ b/tests/qemu-iotests/257
@@ -275,7 +275,7 @@ def test_bitmap_sync(bsync_mode, msync_mode='bitmap', 
failure=None):
 an incomplete backup. Testing limitations prevent
 testing competing writes.
 """
-with iotests.FilePaths(
+with iotests.FilePath(
 'img', 'bsync1', 'bsync2', 'fbackup0', 'fbackup1', 'fbackup2') as \
 (img_path, bsync1, bsync2, fbackup0, fbackup1, fbackup2), \
  iotests.VM() as vm:
@@ -440,7 +440,7 @@ def test_backup_api():
 """
 Test malformed and prohibited invocations of the backup API.
 """
-with iotests.FilePaths('img', 'bsync1') as (img_path, backup_path), \
+with iotests.FilePath('img', 'bsync1') as (img_path, backup_path), \
  iotests.VM() as vm:
 
 log("\n=== API failure tests ===\n")
diff --git a/tests/qemu-iotests/iotests.py b/tests/qemu-iotests/iotests.py
index cdcdc2ad8b..1b5cdd493e 100644
--- a/tests/qemu-iotests/iotests.py
+++ b/tests/qemu-iotests/iotests.py
@@ -448,18 +448,23 @@ class Timeout:
 def file_pattern(name):
 return "{0}-{1}".format(os.getpid(), name)
 
-class FilePaths:
+class FilePath:
 """
 Context manager generating multiple file names. The generated files are
 removed when exiting the context.
 
 Example usage:
 
-with FilePaths('test.img', 'test.sock') as (img_path, sock_path):
+with FilePath('test.img', 'test.sock') as (img_path, sock_path):
 # Use img_path and sock_path here...
 
 # img_path and sock_path are automatically removed here.
 
+For convenience, calling with one argument yields a single file instead of
+a tuple with one item:
+
+with FilePath("a") as a:
+
 """
 def __init__(self, *names, base_dir=test_dir):
 self.paths = []
@@ -467,7 +472,10 @@ class FilePaths:
 self.paths.append(os.path.join(base_dir, file_pattern(name)))
 
 def __enter__(self):
-return self.paths
+if len(self.paths) == 1:
+return self.paths[0]
+else:
+return self.paths
 
 def __exit__(self, exc_type, exc_val, exc_tb):
 for path in self.paths:
@@ -477,15 +485,6 @@ class FilePaths:
 pass
 return False
 
-class FilePath(FilePaths):
-"""
-FilePath is a specialization of FilePaths that takes a single filename.
-"""
-def __init__(self, name, base_dir=test_dir):
-super(FilePath, 

[PATCH v2 5/5] qemu-iotests: Simplify FilePath __init__

2020-08-20 Thread Nir Soffer
Use a list comprehension instead of an append loop.

Signed-off-by: Nir Soffer 
---
 tests/qemu-iotests/iotests.py | 5 ++---
 1 file changed, 2 insertions(+), 3 deletions(-)

diff --git a/tests/qemu-iotests/iotests.py b/tests/qemu-iotests/iotests.py
index 1b5cdd493e..7ebd0bcc92 100644
--- a/tests/qemu-iotests/iotests.py
+++ b/tests/qemu-iotests/iotests.py
@@ -467,9 +467,8 @@ class FilePath:
 
 """
 def __init__(self, *names, base_dir=test_dir):
-self.paths = []
-for name in names:
-self.paths.append(os.path.join(base_dir, file_pattern(name)))
+self.paths = [os.path.join(base_dir, file_pattern(name))
+  for name in names]
 
 def __enter__(self):
 if len(self.paths) == 1:
-- 
2.26.2




[PATCH v2 2/5] qemu-iotests: Fix FilePaths docstring

2020-08-20 Thread Nir Soffer
When this class was extracted from FilePath, the docstring was not
updated for generating multiple files, and the example usage was
referencing an unrelated file.

Fixes: de263986b5dc
Signed-off-by: Nir Soffer 
---
 tests/qemu-iotests/iotests.py | 14 --
 1 file changed, 8 insertions(+), 6 deletions(-)

diff --git a/tests/qemu-iotests/iotests.py b/tests/qemu-iotests/iotests.py
index 16a04df8a3..f34a1d7ef1 100644
--- a/tests/qemu-iotests/iotests.py
+++ b/tests/qemu-iotests/iotests.py
@@ -450,14 +450,16 @@ def file_pattern(name):
 
 class FilePaths:
 """
-FilePaths is an auto-generated filename that cleans itself up.
+Context manager generating multiple file names. The generated files are
+removed when exiting the context.
 
-Use this context manager to generate filenames and ensure that the file
-gets deleted::
+Example usage:
+
+with FilePaths(['test.img', 'test.sock']) as (img_path, sock_path):
+# Use img_path and sock_path here...
+
+# img_path and sock_path are automatically removed here.
 
-with FilePaths(['test.img']) as img_path:
-qemu_img('create', img_path, '1G')
-# migration_sock_path is automatically deleted
 """
 def __init__(self, names, base_dir=test_dir):
 self.paths = []
-- 
2.26.2




[PATCH v2 0/5] iotest.FilePath fixes and cleanups

2020-08-20 Thread Nir Soffer
Fix some issues introduced when iotests.FilePaths was added, and merge it back
into FilePath, keeping the option to create multiple file names.

Changes since v1:
- Remove unwanted submodule change [Eric]
- Add Fixes: tag

v1 was here:
https://lists.nongnu.org/archive/html/qemu-block/2020-08/msg00773.html

Nir Soffer (5):
  qemu-iotests: Fix FilePaths cleanup
  qemu-iotests: Fix FilePaths docstring
  qemu-iotests: Support varargs syntax in FilePaths
  qemu-iotests: Merge FilePaths and FilePath
  qemu-iotests: Simplify FilePath __init__

 tests/qemu-iotests/194|  4 +--
 tests/qemu-iotests/208|  2 +-
 tests/qemu-iotests/222|  2 +-
 tests/qemu-iotests/257| 10 +++
 tests/qemu-iotests/iotests.py | 50 +--
 5 files changed, 33 insertions(+), 35 deletions(-)

-- 
2.26.2




[PATCH v2 3/5] qemu-iotests: Support varargs syntax in FilePaths

2020-08-20 Thread Nir Soffer
Accept a variable number of names instead of a sequence:

with FilePaths("a", "b", "c") as (a, b, c):

The disadvantage is that base_dir must be used as kwarg:

with FilePaths("a", "b", base_dir=soc_dir) as (sock1, sock2):

But this is clearer, and passing an optional argument as a positional
argument is a bad idea anyway.

Signed-off-by: Nir Soffer 
---
 tests/qemu-iotests/194|  4 ++--
 tests/qemu-iotests/257| 10 --
 tests/qemu-iotests/iotests.py |  6 +++---
 3 files changed, 9 insertions(+), 11 deletions(-)

diff --git a/tests/qemu-iotests/194 b/tests/qemu-iotests/194
index da7c4265ec..08389f474e 100755
--- a/tests/qemu-iotests/194
+++ b/tests/qemu-iotests/194
@@ -26,8 +26,8 @@ iotests.script_initialize(supported_fmts=['qcow2', 'qed', 
'raw'],
 
 with iotests.FilePath('source.img') as source_img_path, \
  iotests.FilePath('dest.img') as dest_img_path, \
- iotests.FilePaths(['migration.sock', 'nbd.sock'], iotests.sock_dir) as \
- [migration_sock_path, nbd_sock_path], \
+ iotests.FilePaths('migration.sock', 'nbd.sock', 
base_dir=iotests.sock_dir) \
+as (migration_sock_path, nbd_sock_path), \
  iotests.VM('source') as source_vm, \
  iotests.VM('dest') as dest_vm:
 
diff --git a/tests/qemu-iotests/257 b/tests/qemu-iotests/257
index e1e6077219..a9aa65bbe3 100755
--- a/tests/qemu-iotests/257
+++ b/tests/qemu-iotests/257
@@ -275,10 +275,9 @@ def test_bitmap_sync(bsync_mode, msync_mode='bitmap', 
failure=None):
 an incomplete backup. Testing limitations prevent
 testing competing writes.
 """
-with iotests.FilePaths(['img', 'bsync1', 'bsync2',
-'fbackup0', 'fbackup1', 'fbackup2']) as \
-(img_path, bsync1, bsync2,
- fbackup0, fbackup1, fbackup2), \
+with iotests.FilePaths(
+'img', 'bsync1', 'bsync2', 'fbackup0', 'fbackup1', 'fbackup2') as \
+(img_path, bsync1, bsync2, fbackup0, fbackup1, fbackup2), \
  iotests.VM() as vm:
 
 mode = "Mode {:s}; Bitmap Sync {:s}".format(msync_mode, bsync_mode)
@@ -441,8 +440,7 @@ def test_backup_api():
 """
 Test malformed and prohibited invocations of the backup API.
 """
-with iotests.FilePaths(['img', 'bsync1']) as \
- (img_path, backup_path), \
+with iotests.FilePaths('img', 'bsync1') as (img_path, backup_path), \
  iotests.VM() as vm:
 
 log("\n=== API failure tests ===\n")
diff --git a/tests/qemu-iotests/iotests.py b/tests/qemu-iotests/iotests.py
index f34a1d7ef1..cdcdc2ad8b 100644
--- a/tests/qemu-iotests/iotests.py
+++ b/tests/qemu-iotests/iotests.py
@@ -455,13 +455,13 @@ class FilePaths:
 
 Example usage:
 
-with FilePaths(['test.img', 'test.sock']) as (img_path, sock_path):
+with FilePaths('test.img', 'test.sock') as (img_path, sock_path):
 # Use img_path and sock_path here...
 
 # img_path and sock_path are automatically removed here.
 
 """
-def __init__(self, names, base_dir=test_dir):
+def __init__(self, *names, base_dir=test_dir):
 self.paths = []
 for name in names:
 self.paths.append(os.path.join(base_dir, file_pattern(name)))
@@ -482,7 +482,7 @@ class FilePath(FilePaths):
 FilePath is a specialization of FilePaths that takes a single filename.
 """
 def __init__(self, name, base_dir=test_dir):
-super(FilePath, self).__init__([name], base_dir)
+super(FilePath, self).__init__(name, base_dir=base_dir)
 
 def __enter__(self):
 return self.paths[0]
-- 
2.26.2




[PATCH v2 1/5] qemu-iotests: Fix FilePaths cleanup

2020-08-20 Thread Nir Soffer
If os.remove() fails to remove one of the paths, for example if the file
was removed by the test, the cleanup loop would exit silently, without
removing the rest of the files.

Fixes: de263986b5dc
Signed-off-by: Nir Soffer 
---
 tests/qemu-iotests/iotests.py | 8 
 1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/tests/qemu-iotests/iotests.py b/tests/qemu-iotests/iotests.py
index 717b5b652c..16a04df8a3 100644
--- a/tests/qemu-iotests/iotests.py
+++ b/tests/qemu-iotests/iotests.py
@@ -468,11 +468,11 @@ class FilePaths:
 return self.paths
 
 def __exit__(self, exc_type, exc_val, exc_tb):
-try:
-for path in self.paths:
+for path in self.paths:
+try:
 os.remove(path)
-except OSError:
-pass
+except OSError:
+pass
 return False
 
 class FilePath(FilePaths):
-- 
2.26.2




Suspicious QOM types without instance/class size

2020-08-20 Thread Eduardo Habkost
While trying to convert TypeInfo declarations to the new
OBJECT_DECLARE* macros, I've stumbled on a few suspicious cases
where instance_size or class_size is not set, despite having type
checker macros that use a specific type.

The ones with "WARNING" are abstract types (maybe not serious if
subclasses set the appropriate sizes).  The ones with "ERROR"
don't seem to be abstract types.

WARNING: hw/arm/armsse.c:1159:1: class_size should be set to 
sizeof(ARMSSEClass)?
WARNING: hw/audio/hda-codec.c:900:1: instance_size should be set to 
sizeof(HDAAudioState)?
ERROR: hw/core/register.c:328:1: instance_size should be set to 
sizeof(RegisterInfo)?
WARNING: hw/input/adb.c:310:1: class_size should be set to 
sizeof(ADBDeviceClass)?
WARNING: hw/isa/isa-superio.c:181:1: instance_size should be set to 
sizeof(ISASuperIODevice)?
WARNING: hw/ppc/pnv_lpc.c:771:1: instance_size should be set to 
sizeof(PnvLpcController)?
ERROR: hw/ppc/spapr_drc.c:771:1: instance_size should be set to 
sizeof(SpaprDrc)?
WARNING: hw/rtc/m48t59-isa.c:156:1: class_size should be set to 
sizeof(M48txxISADeviceClass)?
WARNING: hw/rtc/m48t59.c:691:1: class_size should be set to 
sizeof(M48txxSysBusDeviceClass)?
ERROR: hw/s390x/virtio-ccw.c:1237:1: class_size should be set to 
sizeof(VirtioCcwBusClass)?
WARNING: hw/ssi/ssi.c:88:1: instance_size should be set to sizeof(SSISlave)?
ERROR: hw/virtio/virtio-pci.c:2101:1: class_size should be set to 
sizeof(VirtioPCIBusClass)?
WARNING: scsi/pr-manager.c:76:1: instance_size should be set to 
sizeof(PRManager)?
ERROR: target/i386/hvf/hvf.c:908:1: instance_size should be set to 
sizeof(HVFState)?

-- 
Eduardo




Re: [PATCH] qemu-iotests: Fix FilePaths cleanup

2020-08-20 Thread Nir Soffer
On Fri, Aug 21, 2020 at 12:40 AM Nir Soffer  wrote:
>
> On Fri, Aug 21, 2020 at 12:33 AM Eric Blake  wrote:
> >
> > On 8/20/20 4:29 PM, Eric Blake wrote:
> > > On 8/20/20 4:19 PM, Nir Soffer wrote:
> > >> If os.remove() fails to remove one of the paths, for example if the file
> > >> was removed by the test, the cleanup loop would exit silently, without
> > >> removing the rest of the files.
> > >>
> > >> Signed-off-by: Nir Soffer 
> > >> ---
> > >>   dtc   | 2 +-
> > >>   tests/qemu-iotests/iotests.py | 8 
> > >>   2 files changed, 5 insertions(+), 5 deletions(-)
> >
> > >
> > > Reviewed-by: Eric Blake 
> >
> > By the way, what test did you hit this on? If possible, I'd like to add
> > a Fixes: tag pointing to a commit that includes the problem.

I'll send a v2 with a Fixes tag, and a few other related fixes.

>
> I did not hit this issue, found it while reviewing another patch,
> while trying to
> understand what FilePath is doing.
>
> The error was introduced in:
>
> commit de263986b5dc7571d12a95305ffc7ddd2f349431
> Author: John Snow 
> Date:   Mon Jul 29 16:35:54 2019 -0400
>
> iotests: teach FilePath to produce multiple paths
>
> Use "FilePaths" instead of "FilePath" to request multiple files be
> cleaned up after we leave that object's scope.
>
> This is not crucial; but it saves a little typing.
>
> Signed-off-by: John Snow 
> Reviewed-by: Max Reitz 
> Message-id: 20190709232550.10724-16-js...@redhat.com
> Signed-off-by: John Snow 




Re: [PATCH] qemu-iotests: Fix FilePaths cleanup

2020-08-20 Thread Nir Soffer
On Fri, Aug 21, 2020 at 12:33 AM Eric Blake  wrote:
>
> On 8/20/20 4:29 PM, Eric Blake wrote:
> > On 8/20/20 4:19 PM, Nir Soffer wrote:
> >> If os.remove() fails to remove one of the paths, for example if the file
> >> was removed by the test, the cleanup loop would exit silently, without
> >> removing the rest of the files.
> >>
> >> Signed-off-by: Nir Soffer 
> >> ---
> >>   dtc   | 2 +-
> >>   tests/qemu-iotests/iotests.py | 8 
> >>   2 files changed, 5 insertions(+), 5 deletions(-)
>
> >
> > Reviewed-by: Eric Blake 
>
> By the way, what test did you hit this on? If possible, I'd like to add
> a Fixes: tag pointing to a commit that includes the problem.

I did not hit this issue, found it while reviewing another patch,
while trying to
understand what FilePath is doing.

The error was introduced in:

commit de263986b5dc7571d12a95305ffc7ddd2f349431
Author: John Snow 
Date:   Mon Jul 29 16:35:54 2019 -0400

iotests: teach FilePath to produce multiple paths

Use "FilePaths" instead of "FilePath" to request multiple files be
cleaned up after we leave that object's scope.

This is not crucial; but it saves a little typing.

Signed-off-by: John Snow 
Reviewed-by: Max Reitz 
Message-id: 20190709232550.10724-16-js...@redhat.com
Signed-off-by: John Snow 




Re: [PATCH] qemu-iotests: Fix FilePaths cleanup

2020-08-20 Thread Eric Blake

On 8/20/20 4:29 PM, Eric Blake wrote:

On 8/20/20 4:19 PM, Nir Soffer wrote:

If os.remove() fails to remove one of the paths, for example if the file
was removed by the test, the cleanup loop would exit silently, without
removing the rest of the files.

Signed-off-by: Nir Soffer 
---
  dtc   | 2 +-
  tests/qemu-iotests/iotests.py | 8 
  2 files changed, 5 insertions(+), 5 deletions(-)




Reviewed-by: Eric Blake 


By the way, what test did you hit this on? If possible, I'd like to add 
a Fixes: tag pointing to a commit that includes the problem.


--
Eric Blake, Principal Software Engineer
Red Hat, Inc.   +1-919-301-3226
Virtualization:  qemu.org | libvirt.org




Re: [PATCH] qemu-iotests: Fix FilePaths cleanup

2020-08-20 Thread Eric Blake

On 8/20/20 4:19 PM, Nir Soffer wrote:

If os.remove() fails to remove one of the paths, for example if the file
was removed by the test, the cleanup loop would exit silently, without
removing the rest of the files.

Signed-off-by: Nir Soffer 
---
  dtc   | 2 +-
  tests/qemu-iotests/iotests.py | 8 
  2 files changed, 5 insertions(+), 5 deletions(-)

diff --git a/dtc b/dtc
index 85e5d83984..88f18909db 16
--- a/dtc
+++ b/dtc
@@ -1 +1 @@
-Subproject commit 85e5d839847af54efab170f2b1331b2a6421e647
+Subproject commit 88f18909db731a627456f26d779445f84e449536


Modulo the unintended submodule bump,


diff --git a/tests/qemu-iotests/iotests.py b/tests/qemu-iotests/iotests.py
index 717b5b652c..16a04df8a3 100644
--- a/tests/qemu-iotests/iotests.py
+++ b/tests/qemu-iotests/iotests.py
@@ -468,11 +468,11 @@ class FilePaths:
  return self.paths
  
  def __exit__(self, exc_type, exc_val, exc_tb):

-try:
-for path in self.paths:
+for path in self.paths:
+try:
  os.remove(path)
-except OSError:
-pass
+except OSError:
+pass
  return False


Reviewed-by: Eric Blake 

--
Eric Blake, Principal Software Engineer
Red Hat, Inc.   +1-919-301-3226
Virtualization:  qemu.org | libvirt.org




Re: [PATCH] qemu-iotests: Fix FilePaths cleanup

2020-08-20 Thread Nir Soffer
On Fri, Aug 21, 2020 at 12:19 AM Nir Soffer  wrote:
>
> If os.remove() fails to remove one of the paths, for example if the file
> was removed by the test, the cleanup loop would exit silently, without
> removing the rest of the files.
>
> Signed-off-by: Nir Soffer 
> ---
>  dtc   | 2 +-
>  tests/qemu-iotests/iotests.py | 8 
>  2 files changed, 5 insertions(+), 5 deletions(-)
>
> diff --git a/dtc b/dtc
> index 85e5d83984..88f18909db 16
> --- a/dtc
> +++ b/dtc
> @@ -1 +1 @@
> -Subproject commit 85e5d839847af54efab170f2b1331b2a6421e647
> +Subproject commit 88f18909db731a627456f26d779445f84e449536

This sneaked into the patch somehow, I did not change this.

> diff --git a/tests/qemu-iotests/iotests.py b/tests/qemu-iotests/iotests.py
> index 717b5b652c..16a04df8a3 100644
> --- a/tests/qemu-iotests/iotests.py
> +++ b/tests/qemu-iotests/iotests.py
> @@ -468,11 +468,11 @@ class FilePaths:
>  return self.paths
>
>  def __exit__(self, exc_type, exc_val, exc_tb):
> -try:
> -for path in self.paths:
> +for path in self.paths:
> +try:
>  os.remove(path)
> -except OSError:
> -pass
> +except OSError:
> +pass
>  return False
>
>  class FilePath(FilePaths):
> --
> 2.26.2
>




[PATCH] qemu-iotests: Fix FilePaths cleanup

2020-08-20 Thread Nir Soffer
If os.remove() fails to remove one of the paths, for example if the file
was removed by the test, the cleanup loop would exit silently, without
removing the rest of the files.

Signed-off-by: Nir Soffer 
---
 dtc   | 2 +-
 tests/qemu-iotests/iotests.py | 8 
 2 files changed, 5 insertions(+), 5 deletions(-)

diff --git a/dtc b/dtc
index 85e5d83984..88f18909db 16
--- a/dtc
+++ b/dtc
@@ -1 +1 @@
-Subproject commit 85e5d839847af54efab170f2b1331b2a6421e647
+Subproject commit 88f18909db731a627456f26d779445f84e449536
diff --git a/tests/qemu-iotests/iotests.py b/tests/qemu-iotests/iotests.py
index 717b5b652c..16a04df8a3 100644
--- a/tests/qemu-iotests/iotests.py
+++ b/tests/qemu-iotests/iotests.py
@@ -468,11 +468,11 @@ class FilePaths:
 return self.paths
 
 def __exit__(self, exc_type, exc_val, exc_tb):
-try:
-for path in self.paths:
+for path in self.paths:
+try:
 os.remove(path)
-except OSError:
-pass
+except OSError:
+pass
 return False
 
 class FilePath(FilePaths):
-- 
2.26.2




Re: [PATCH v4 08/10] iotests.py: add verify_o_direct helper

2020-08-20 Thread Vladimir Sementsov-Ogievskiy

21.08.2020 00:00, Nir Soffer wrote:

On Thu, Aug 20, 2020 at 9:49 PM Vladimir Sementsov-Ogievskiy
 wrote:


Add python notrun-helper similar to _check_o_direct for bash tests.
To be used in the following commit.

Signed-off-by: Vladimir Sementsov-Ogievskiy 
Reviewed-by: Stefan Hajnoczi 
---
  tests/qemu-iotests/iotests.py | 6 ++
  1 file changed, 6 insertions(+)

diff --git a/tests/qemu-iotests/iotests.py b/tests/qemu-iotests/iotests.py
index 717b5b652c..369e9918b4 100644
--- a/tests/qemu-iotests/iotests.py
+++ b/tests/qemu-iotests/iotests.py
@@ -1083,6 +1083,12 @@ def _verify_aio_mode(supported_aio_modes: Sequence[str] = 
()) -> None:
  if supported_aio_modes and (aiomode not in supported_aio_modes):
  notrun('not suitable for this aio mode: %s' % aiomode)

+def verify_o_direct() -> None:
+with FilePath('test_o_direct') as f:
+qemu_img_create('-f', 'raw', f, '1M')
+if 'O_DIRECT' in qemu_io('-f', 'raw', '-t', 'none', '-c', 'quit', f):
+notrun(f'file system at {test_dir} does not support O_DIRECT')


Why not:

 with FilePath('test_o_direct') as f:
 try:
 fd = os.open(f, os.O_DIRECT | os.O_CREAT | os.O_RDWR)
 except OSError as e:
 if e.errno != errno.EINVAL:
 raise
 notrun(...)
 else:
os.close(fd)

More verbose, but the intent is more clear, and we do not depend on the output
of qemu-io which is not a public API. For example if someone improves qemu-io
to fail with:

 Direct I/O is not supported

It would break the tests using this helper.



Agree, that's better. Will use it, thanks!




+
  def supports_quorum():
  return 'quorum' in qemu_img_pipe('--help')

--
2.21.3







--
Best regards,
Vladimir



Re: [PATCH v4 08/10] iotests.py: add verify_o_direct helper

2020-08-20 Thread Nir Soffer
On Thu, Aug 20, 2020 at 9:49 PM Vladimir Sementsov-Ogievskiy
 wrote:
>
> Add python notrun-helper similar to _check_o_direct for bash tests.
> To be used in the following commit.
>
> Signed-off-by: Vladimir Sementsov-Ogievskiy 
> Reviewed-by: Stefan Hajnoczi 
> ---
>  tests/qemu-iotests/iotests.py | 6 ++
>  1 file changed, 6 insertions(+)
>
> diff --git a/tests/qemu-iotests/iotests.py b/tests/qemu-iotests/iotests.py
> index 717b5b652c..369e9918b4 100644
> --- a/tests/qemu-iotests/iotests.py
> +++ b/tests/qemu-iotests/iotests.py
> @@ -1083,6 +1083,12 @@ def _verify_aio_mode(supported_aio_modes: 
> Sequence[str] = ()) -> None:
>  if supported_aio_modes and (aiomode not in supported_aio_modes):
>  notrun('not suitable for this aio mode: %s' % aiomode)
>
> +def verify_o_direct() -> None:
> +with FilePath('test_o_direct') as f:
> +qemu_img_create('-f', 'raw', f, '1M')
> +if 'O_DIRECT' in qemu_io('-f', 'raw', '-t', 'none', '-c', 'quit', f):
> +notrun(f'file system at {test_dir} does not support O_DIRECT')

Why not:

with FilePath('test_o_direct') as f:
try:
fd = os.open(f, os.O_DIRECT | os.O_CREAT | os.O_RDWR)
except OSError as e:
if e.errno != errno.EINVAL:
raise
notrun(...)
else:
   os.close(fd)

More verbose, but the intent is more clear, and we do not depend on the output
of qemu-io which is not a public API. For example if someone improves qemu-io
to fail with:

Direct I/O is not supported

It would break the tests using this helper.

Nir

> +
>  def supports_quorum():
>  return 'quorum' in qemu_img_pipe('--help')
>
> --
> 2.21.3
>
>




Re: [PATCH 0/1] qcow2: Skip copy-on-write when allocating a zero cluster

2020-08-20 Thread Alberto Garcia
Cc: linux-xfs

On Wed 19 Aug 2020 07:53:00 PM CEST, Brian Foster wrote:
> In any event, if you're seeing unclear or unexpected performance
> deltas between certain XFS configurations or other fs', I think the
> best thing to do is post a more complete description of the workload,
> filesystem/storage setup, and test results to the linux-xfs mailing
> list (feel free to cc me as well). As it is, aside from the questions
> above, it's not really clear to me what the storage stack looks like
> for this test, if/how qcow2 is involved, what the various
> 'preallocation=' modes actually mean, etc.

(see [1] for a bit of context)

I repeated the tests with a larger (125GB) filesystem. Things are a bit
faster but not radically different, here are the new numbers:

|--+---+---|
| preallocation mode   |   xfs |  ext4 |
|--+---+---|
| off  |  8139 | 11688 |
| off (w/o ZERO_RANGE) |  2965 |  2780 |
| metadata |  7768 |  9132 |
| falloc   |  7742 | 13108 |
| full | 41389 | 16351 |
|--+---+---|

The numbers are I/O operations per second as reported by fio, running
inside a VM.

The VM is running Debian 9.7 with Linux 4.9.130 and the fio version is
2.16-1. I'm using QEMU 5.1.0.

fio is sending random 4KB write requests to a 25GB virtual drive, this
is the full command line:

fio --filename=/dev/vdb --direct=1 --randrepeat=1 --eta=always
--ioengine=libaio --iodepth=32 --numjobs=1 --name=test --size=25G
--io_limit=25G --ramp_time=5 --rw=randwrite --bs=4k --runtime=60
  
The virtual drive (/dev/vdb) is a freshly created qcow2 file stored on
the host (on an xfs or ext4 filesystem as the table above shows), and
it is attached to QEMU using a virtio-blk-pci device:

   -drive if=virtio,file=image.qcow2,cache=none,l2-cache-size=200M

cache=none means that the image is opened with O_DIRECT and
l2-cache-size is large enough so QEMU is able to cache all the
relevant qcow2 metadata in memory.

The host is running Linux 4.19.132 and has an SSD drive.

About the preallocation modes: a qcow2 file is divided into clusters
of the same size (64KB in this case). That is the minimum unit of
allocation, so when writing 4KB to an unallocated cluster QEMU needs
to fill the other 60KB with zeroes. So here's what happens with the
different modes:

1) off: for every write request QEMU initializes the cluster (64KB)
with fallocate(ZERO_RANGE) and then writes the 4KB of data.

2) off w/o ZERO_RANGE: QEMU writes the 4KB of data and fills the rest
of the cluster with zeroes.

3) metadata: all clusters were allocated when the image was created
but they are sparse, QEMU only writes the 4KB of data.

4) falloc: all clusters were allocated with fallocate() when the image
was created, QEMU only writes 4KB of data.

5) full: all clusters were allocated by writing zeroes to all of them
when the image was created, QEMU only writes 4KB of data.
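
(To make the difference between (1) and (2) concrete, here is a rough sketch
of what one 4KB guest write into a brand-new 64KB cluster costs in each of
those two modes; the file name, offsets and data layout are simplified for
illustration and this is not QEMU code:)

#define _GNU_SOURCE
#include <fcntl.h>
#include <string.h>
#include <unistd.h>
#include <linux/falloc.h>

#define CLUSTER (64 * 1024)

/* (1) "off": zero the whole cluster with one fallocate(ZERO_RANGE) call,
 * then write only the 4KB of guest data. */
static void write_mode_off(int fd, off_t cluster, const char *data)
{
    fallocate(fd, FALLOC_FL_ZERO_RANGE, cluster, CLUSTER);
    pwrite(fd, data, 4096, cluster);
}

/* (2) "off w/o ZERO_RANGE": pad the cluster with explicit zeroes around the
 * 4KB of guest data and write the full 64KB. */
static void write_mode_off_no_zero_range(int fd, off_t cluster, const char *data)
{
    char buf[CLUSTER] = {0};

    memcpy(buf, data, 4096);
    pwrite(fd, buf, CLUSTER, cluster);
}

int main(void)
{
    char data[4096] = {'x'};
    int fd = open("cluster-demo.img", O_CREAT | O_RDWR, 0644);

    write_mode_off(fd, 0, data);
    write_mode_off_no_zero_range(fd, CLUSTER, data);
    close(fd);
    return 0;
}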

As I said in a previous message I'm not familiar with xfs, but the
parts that I don't understand are

   - Why is (4) slower than (1)?
   - Why is (5) so much faster than everything else?

I hope I didn't forget anything, tell me if you have questions.

Berto

[1] https://lists.gnu.org/archive/html/qemu-block/2020-08/msg00481.html



Re: [PATCH v4 00/10] preallocate filter

2020-08-20 Thread Vladimir Sementsov-Ogievskiy

20.08.2020 22:16, no-re...@patchew.org wrote:

Patchew URL: 
https://patchew.org/QEMU/20200820183950.13109-1-vsement...@virtuozzo.com/



Hi,

This series failed the docker-quick@centos7 build test. Please find the testing 
commands and
their output below. If you have Docker installed, you can probably reproduce it
locally.

=== TEST SCRIPT BEGIN ===
#!/bin/bash
make docker-image-centos7 V=1 NETWORK=1
time make docker-test-quick@centos7 SHOW_ENV=1 J=14 NETWORK=1
=== TEST SCRIPT END ===

   TESTcheck-unit: tests/test-char
Unexpected error in object_property_try_add() at 
/tmp/qemu-test/src/qom/object.c:1181:
attempt to add duplicate property 'serial-id' to object (type 'container')
ERROR test-char - too few tests run (expected 38, got 9)
make: *** [check-unit] Error 1
make: *** Waiting for unfinished jobs
   TESTiotest-qcow2: 018
   TESTiotest-qcow2: 019
---
 raise CalledProcessError(retcode, cmd)
subprocess.CalledProcessError: Command '['sudo', '-n', 'docker', 'run', 
'--label', 'com.qemu.instance.uuid=a80f0516394e4d83b19a19e329899c17', '-u', 
'1003', '--security-opt', 'seccomp=unconfined', '--rm', '-e', 'TARGET_LIST=', 
'-e', 'EXTRA_CONFIGURE_OPTS=', '-e', 'V=', '-e', 'J=14', '-e', 'DEBUG=', '-e', 
'SHOW_ENV=1', '-e', 'CCACHE_DIR=/var/tmp/ccache', '-v', 
'/home/patchew2/.cache/qemu-docker-ccache:/var/tmp/ccache:z', '-v', 
'/var/tmp/patchew-tester-tmp-n_a2v5bj/src/docker-src.2020-08-20-15.00.31.12987:/var/tmp/qemu:z,ro',
 'qemu/centos7', '/var/tmp/qemu/run', 'test-quick']' returned non-zero exit 
status 2.
filter=--filter=label=com.qemu.instance.uuid=a80f0516394e4d83b19a19e329899c17
make[1]: *** [docker-run] Error 1
make[1]: Leaving directory `/var/tmp/patchew-tester-tmp-n_a2v5bj/src'
make: *** [docker-run-test-quick@centos7] Error 2

real16m13.105s
user0m8.886s


The full log is available at
http://patchew.org/logs/20200820183950.13109-1-vsement...@virtuozzo.com/testing.docker-quick@centos7/?type=message.
---
Email generated automatically by Patchew [https://patchew.org/].
Please send your feedback to patchew-de...@redhat.com




Seems to be something unrelated.

--
Best regards,
Vladimir



Re: [PATCH v4 00/10] preallocate filter

2020-08-20 Thread no-reply
Patchew URL: 
https://patchew.org/QEMU/20200820183950.13109-1-vsement...@virtuozzo.com/



Hi,

This series failed the docker-quick@centos7 build test. Please find the testing 
commands and
their output below. If you have Docker installed, you can probably reproduce it
locally.

=== TEST SCRIPT BEGIN ===
#!/bin/bash
make docker-image-centos7 V=1 NETWORK=1
time make docker-test-quick@centos7 SHOW_ENV=1 J=14 NETWORK=1
=== TEST SCRIPT END ===

  TESTcheck-unit: tests/test-char
Unexpected error in object_property_try_add() at 
/tmp/qemu-test/src/qom/object.c:1181:
attempt to add duplicate property 'serial-id' to object (type 'container')
ERROR test-char - too few tests run (expected 38, got 9)
make: *** [check-unit] Error 1
make: *** Waiting for unfinished jobs
  TESTiotest-qcow2: 018
  TESTiotest-qcow2: 019
---
raise CalledProcessError(retcode, cmd)
subprocess.CalledProcessError: Command '['sudo', '-n', 'docker', 'run', 
'--label', 'com.qemu.instance.uuid=a80f0516394e4d83b19a19e329899c17', '-u', 
'1003', '--security-opt', 'seccomp=unconfined', '--rm', '-e', 'TARGET_LIST=', 
'-e', 'EXTRA_CONFIGURE_OPTS=', '-e', 'V=', '-e', 'J=14', '-e', 'DEBUG=', '-e', 
'SHOW_ENV=1', '-e', 'CCACHE_DIR=/var/tmp/ccache', '-v', 
'/home/patchew2/.cache/qemu-docker-ccache:/var/tmp/ccache:z', '-v', 
'/var/tmp/patchew-tester-tmp-n_a2v5bj/src/docker-src.2020-08-20-15.00.31.12987:/var/tmp/qemu:z,ro',
 'qemu/centos7', '/var/tmp/qemu/run', 'test-quick']' returned non-zero exit 
status 2.
filter=--filter=label=com.qemu.instance.uuid=a80f0516394e4d83b19a19e329899c17
make[1]: *** [docker-run] Error 1
make[1]: Leaving directory `/var/tmp/patchew-tester-tmp-n_a2v5bj/src'
make: *** [docker-run-test-quick@centos7] Error 2

real16m13.105s
user0m8.886s


The full log is available at
http://patchew.org/logs/20200820183950.13109-1-vsement...@virtuozzo.com/testing.docker-quick@centos7/?type=message.
---
Email generated automatically by Patchew [https://patchew.org/].
Please send your feedback to patchew-de...@redhat.com

[PATCH v4 09/10] iotests.py: add filter_img_check

2020-08-20 Thread Vladimir Sementsov-Ogievskiy
Add an analog of the bash _filter_qemu_img_check to the python framework.

Signed-off-by: Vladimir Sementsov-Ogievskiy 
Reviewed-by: Stefan Hajnoczi 
---
 tests/qemu-iotests/iotests.py | 4 
 1 file changed, 4 insertions(+)

diff --git a/tests/qemu-iotests/iotests.py b/tests/qemu-iotests/iotests.py
index 369e9918b4..ef3da4ee61 100644
--- a/tests/qemu-iotests/iotests.py
+++ b/tests/qemu-iotests/iotests.py
@@ -402,6 +402,10 @@ def filter_img_info(output, filename):
 lines.append(line)
 return '\n'.join(lines)
 
+def filter_img_check(msg):
+msg = re.sub(r'.*allocated.*fragmented.*compressed clusters', '', msg)
+return re.sub(r'Image end offset: [0-9]+', '', msg).strip()
+
 def filter_imgfmt(msg):
 return msg.replace(imgfmt, 'IMGFMT')
 
-- 
2.21.3




[PATCH v4 08/10] iotests.py: add verify_o_direct helper

2020-08-20 Thread Vladimir Sementsov-Ogievskiy
Add a python notrun-helper similar to _check_o_direct for bash tests.
To be used in the following commit.

Signed-off-by: Vladimir Sementsov-Ogievskiy 
Reviewed-by: Stefan Hajnoczi 
---
 tests/qemu-iotests/iotests.py | 6 ++
 1 file changed, 6 insertions(+)

diff --git a/tests/qemu-iotests/iotests.py b/tests/qemu-iotests/iotests.py
index 717b5b652c..369e9918b4 100644
--- a/tests/qemu-iotests/iotests.py
+++ b/tests/qemu-iotests/iotests.py
@@ -1083,6 +1083,12 @@ def _verify_aio_mode(supported_aio_modes: Sequence[str] 
= ()) -> None:
 if supported_aio_modes and (aiomode not in supported_aio_modes):
 notrun('not suitable for this aio mode: %s' % aiomode)
 
+def verify_o_direct() -> None:
+with FilePath('test_o_direct') as f:
+qemu_img_create('-f', 'raw', f, '1M')
+if 'O_DIRECT' in qemu_io('-f', 'raw', '-t', 'none', '-c', 'quit', f):
+notrun(f'file system at {test_dir} does not support O_DIRECT')
+
 def supports_quorum():
 return 'quorum' in qemu_img_pipe('--help')
 
-- 
2.21.3




[PATCH v4 07/10] block: introduce preallocate filter

2020-08-20 Thread Vladimir Sementsov-Ogievskiy
It's intended to be inserted between format and protocol nodes to
preallocate additional space (expanding the protocol file) on writes
crossing EOF. It improves performance for filesystems with slow
allocation.
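
(Not part of the patch, just to illustrate the idea in plain POSIX terms: when
a write would cross the current end of the protocol file, the filter first
grows the file by a larger, aligned amount, so the filesystem allocates in big
chunks instead of once per small write.  A rough sketch under those
assumptions; the names and defaults below mirror the prealloc-align and
prealloc-size options, but the real driver issues write-zeroes requests
through the block layer, not these syscalls:)

#define _GNU_SOURCE
#include <fcntl.h>
#include <sys/stat.h>
#include <unistd.h>
#include <linux/falloc.h>

#define PREALLOC_ALIGN (1 * 1024 * 1024)      /* cf. prealloc-align, 1M   */
#define PREALLOC_SIZE  (128 * 1024 * 1024)    /* cf. prealloc-size, 128M  */

ssize_t prealloc_pwrite(int fd, const void *buf, size_t bytes, off_t offset)
{
    struct stat st;

    fstat(fd, &st);
    if (offset + (off_t)bytes > st.st_size) {
        /* Grow the file well past this write, aligned up, in one go. */
        off_t want = offset + bytes + PREALLOC_SIZE;
        off_t new_size = (want + PREALLOC_ALIGN - 1)
                         / PREALLOC_ALIGN * PREALLOC_ALIGN;

        fallocate(fd, FALLOC_FL_ZERO_RANGE, st.st_size,
                  new_size - st.st_size);
    }
    return pwrite(fd, buf, bytes, offset);
}

(As the test in patch 10 shows, the real filter also shrinks the file back
when the image is closed; that part is not sketched here.)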

Signed-off-by: Vladimir Sementsov-Ogievskiy 
---
 docs/system/qemu-block-drivers.rst.inc |  26 +++
 qapi/block-core.json   |  20 +-
 block/preallocate.c| 291 +
 block/Makefile.objs|   1 +
 4 files changed, 337 insertions(+), 1 deletion(-)
 create mode 100644 block/preallocate.c

diff --git a/docs/system/qemu-block-drivers.rst.inc 
b/docs/system/qemu-block-drivers.rst.inc
index b052a6d14e..5e8a35c571 100644
--- a/docs/system/qemu-block-drivers.rst.inc
+++ b/docs/system/qemu-block-drivers.rst.inc
@@ -952,3 +952,29 @@ on host and see if there are locks held by the QEMU 
process on the image file.
 More than one byte could be locked by the QEMU instance, each byte of which
 reflects a particular permission that is acquired or protected by the running
 block driver.
+
+Filter drivers
+~~~~~~~~~~~~~~
+
+QEMU supports several filter drivers, which don't store any data, but do some
+additional tasks, hooking io requests.
+
+.. program:: filter-drivers
+.. option:: preallocate
+
+  Preallocate filter driver is intended to be inserted between format
+  and protocol nodes and does preallocation of some additional space
+  (expanding the protocol file) on write. It may be used for
+  file-systems with slow allocation.
+
+  Supported options:
+
+  .. program:: preallocate
+  .. option:: prealloc-align
+
+On preallocation, align file length to this number, default 1M.
+
+  .. program:: preallocate
+  .. option:: prealloc-size
+
+How much to preallocate, default 128M.
diff --git a/qapi/block-core.json b/qapi/block-core.json
index 197bdc1c36..b40448063b 100644
--- a/qapi/block-core.json
+++ b/qapi/block-core.json
@@ -2805,7 +2805,7 @@
 'cloop', 'compress', 'copy-on-read', 'dmg', 'file', 'ftp', 'ftps',
 'gluster', 'host_cdrom', 'host_device', 'http', 'https', 'iscsi',
 'luks', 'nbd', 'nfs', 'null-aio', 'null-co', 'nvme', 'parallels',
-'qcow', 'qcow2', 'qed', 'quorum', 'raw', 'rbd',
+'preallocate', 'qcow', 'qcow2', 'qed', 'quorum', 'raw', 'rbd',
 { 'name': 'replication', 'if': 'defined(CONFIG_REPLICATION)' },
 'sheepdog',
 'ssh', 'throttle', 'vdi', 'vhdx', 'vmdk', 'vpc', 'vvfat' ] }
@@ -3074,6 +3074,23 @@
   'data': { 'aes': 'QCryptoBlockOptionsQCow',
 'luks': 'QCryptoBlockOptionsLUKS'} }
 
+##
+# @BlockdevOptionsPreallocate:
+#
+# Filter driver intended to be inserted between format and protocol node
+# and do preallocation in protocol node on write.
+#
+# @prealloc-align: on preallocation, align file length to this number,
+# default 1048576 (1M)
+#
+# @prealloc-size: how much to preallocate, default 134217728 (128M)
+#
+# Since: 5.2
+##
+{ 'struct': 'BlockdevOptionsPreallocate',
+  'base': 'BlockdevOptionsGenericFormat',
+  'data': { '*prealloc-align': 'int', '*prealloc-size': 'int' } }
+
 ##
 # @BlockdevOptionsQcow2:
 #
@@ -3979,6 +3996,7 @@
   'null-co':'BlockdevOptionsNull',
   'nvme':   'BlockdevOptionsNVMe',
   'parallels':  'BlockdevOptionsGenericFormat',
+  'preallocate':'BlockdevOptionsPreallocate',
   'qcow2':  'BlockdevOptionsQcow2',
   'qcow':   'BlockdevOptionsQcow',
   'qed':'BlockdevOptionsGenericCOWFormat',
diff --git a/block/preallocate.c b/block/preallocate.c
new file mode 100644
index 00..bdf54dbd2f
--- /dev/null
+++ b/block/preallocate.c
@@ -0,0 +1,291 @@
+/*
+ * preallocate filter driver
+ *
+ * The driver performs preallocate operation: it is injected above
+ * some node, and before each write over EOF it does additional preallocating
+ * write-zeroes request.
+ *
+ * Copyright (c) 2020 Virtuozzo International GmbH.
+ *
+ * Author:
+ *  Sementsov-Ogievskiy Vladimir 
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program. If not, see <http://www.gnu.org/licenses/>.
+ */
+
+#include "qemu/osdep.h"
+
+#include "qapi/error.h"
+#include "qemu/module.h"
+#include "qemu/option.h"
+#include "qemu/units.h"
+#include "block/block_int.h"
+
+
+typedef struct BDRVPreallocateState {
+int64_t prealloc_size;
+int64_t prealloc_align;
+
+/*
+ * Filter is started as not-active, so 

[PATCH v4 06/10] block: introduce BDRV_REQ_NO_WAIT flag

2020-08-20 Thread Vladimir Sementsov-Ogievskiy
Add a flag to make a serialising request not wait: if there are conflicting
requests, just return an error immediately. It will be used in the upcoming
preallocate filter.

Signed-off-by: Vladimir Sementsov-Ogievskiy 
---
 include/block/block.h |  9 -
 block/io.c| 11 ++-
 2 files changed, 18 insertions(+), 2 deletions(-)

diff --git a/include/block/block.h b/include/block/block.h
index b8f4e86e8d..877fda06a4 100644
--- a/include/block/block.h
+++ b/include/block/block.h
@@ -67,8 +67,15 @@ typedef enum {
  * written to qiov parameter which may be NULL.
  */
 BDRV_REQ_PREFETCH  = 0x200,
+
+/*
+ * If we need to wait for other requests, just fail immediately. Used
+ * only together with BDRV_REQ_SERIALISING.
+ */
+BDRV_REQ_NO_WAIT = 0x400,
+
 /* Mask of valid flags */
-BDRV_REQ_MASK   = 0x3ff,
+BDRV_REQ_MASK   = 0x7ff,
 } BdrvRequestFlags;
 
 typedef struct BlockSizes {
diff --git a/block/io.c b/block/io.c
index dd28befb08..c93b1e98a3 100644
--- a/block/io.c
+++ b/block/io.c
@@ -1912,9 +1912,18 @@ bdrv_co_write_req_prepare(BdrvChild *child, int64_t offset, uint64_t bytes,
     assert(!(bs->open_flags & BDRV_O_INACTIVE));
     assert((bs->open_flags & BDRV_O_NO_IO) == 0);
     assert(!(flags & ~BDRV_REQ_MASK));
+    assert(!((flags & BDRV_REQ_NO_WAIT) && !(flags & BDRV_REQ_SERIALISING)));
 
     if (flags & BDRV_REQ_SERIALISING) {
-        bdrv_make_request_serialising(req, bdrv_get_cluster_size(bs));
+        QEMU_LOCK_GUARD(&bs->reqs_lock);
+
+        tracked_request_set_serialising(req, bdrv_get_cluster_size(bs));
+
+        if ((flags & BDRV_REQ_NO_WAIT) && bdrv_find_conflicting_request(req)) {
+            return -EBUSY;
+        }
+
+        bdrv_wait_serialising_requests_locked(req);
     } else {
         bdrv_wait_serialising_requests(req);
     }
-- 
2.21.3




[PATCH v4 10/10] iotests: add 298 to test new preallocate filter driver

2020-08-20 Thread Vladimir Sementsov-Ogievskiy
Signed-off-by: Vladimir Sementsov-Ogievskiy 
Reviewed-by: Stefan Hajnoczi 
---
 tests/qemu-iotests/298 | 50 ++
 tests/qemu-iotests/298.out |  6 +
 tests/qemu-iotests/group   |  1 +
 3 files changed, 57 insertions(+)
 create mode 100644 tests/qemu-iotests/298
 create mode 100644 tests/qemu-iotests/298.out

diff --git a/tests/qemu-iotests/298 b/tests/qemu-iotests/298
new file mode 100644
index 00..4f2087352a
--- /dev/null
+++ b/tests/qemu-iotests/298
@@ -0,0 +1,50 @@
+#!/usr/bin/env python3
+#
+# Test for preallocate filter
+#
+# Copyright (c) 2020 Virtuozzo International GmbH.
+#
+# This program is free software; you can redistribute it and/or modify
+# it under the terms of the GNU General Public License as published by
+# the Free Software Foundation; either version 2 of the License, or
+# (at your option) any later version.
+#
+# This program is distributed in the hope that it will be useful,
+# but WITHOUT ANY WARRANTY; without even the implied warranty of
+# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+# GNU General Public License for more details.
+#
+# You should have received a copy of the GNU General Public License
+# along with this program.  If not, see <http://www.gnu.org/licenses/>.
+#
+
+import os
+import iotests
+from iotests import log
+
+iotests.script_initialize(supported_fmts=['qcow2'])
+iotests.verify_o_direct()
+
+size = 10 * 1024 * 1024
+disk = iotests.file_path('disk')
+
+iotests.qemu_img_create('-f', iotests.imgfmt, disk, str(size))
+
+opts = f'driver={iotests.imgfmt},' \
+f'file.driver=preallocate,file.file.filename={disk}'
+p = iotests.QemuIoInteractive('--image-opts', '-t', 'none', opts)
+
+log(p.cmd('write 0 1M'), filters=[iotests.filter_qemu_io])
+p.cmd('flush')
+
+if os.path.getsize(disk) > 100 * 1024 * 1024:
+log('file in progress is big, preallocation works')
+
+p.close()
+
+if os.path.getsize(disk) < 10 * 1024 * 1024:
+log('file is small after close')
+
+# Check that there are no leaks.
+log(iotests.qemu_img_pipe('check', '-f', 'qcow2', disk),
+filters=[iotests.filter_img_check])
diff --git a/tests/qemu-iotests/298.out b/tests/qemu-iotests/298.out
new file mode 100644
index 00..baf8f8425c
--- /dev/null
+++ b/tests/qemu-iotests/298.out
@@ -0,0 +1,6 @@
+wrote 1048576/1048576 bytes at offset 0
+1 MiB, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
+
+file in progress is big, preallocation works
+file is small after close
+No errors were found on the image.
diff --git a/tests/qemu-iotests/group b/tests/qemu-iotests/group
index 7f76066640..cdcde2fe48 100644
--- a/tests/qemu-iotests/group
+++ b/tests/qemu-iotests/group
@@ -306,6 +306,7 @@
 295 rw
 296 rw
 297 meta
+298 auto quick
 299 auto quick
 301 backing quick
 302 quick
-- 
2.21.3




[PATCH v4 04/10] block/io: bdrv_wait_serialising_requests_locked: drop extra bs arg

2020-08-20 Thread Vladimir Sementsov-Ogievskiy
bs is linked in req, so there is no need to pass it separately. Most of
the tracked-requests API doesn't have a bs argument. Actually, after this
patch only tracked_request_begin has it, but that is on purpose.

While here, also add a comment about what "_locked" means.

Signed-off-by: Vladimir Sementsov-Ogievskiy 
Reviewed-by: Stefan Hajnoczi 
---
 block/io.c | 10 +-
 1 file changed, 5 insertions(+), 5 deletions(-)

diff --git a/block/io.c b/block/io.c
index 5b96715058..36bbe4b9b1 100644
--- a/block/io.c
+++ b/block/io.c
@@ -761,16 +761,16 @@ bdrv_find_conflicting_request(BdrvTrackedRequest *self)
 return NULL;
 }
 
+/* Called with self->bs->reqs_lock held */
 static bool coroutine_fn
-bdrv_wait_serialising_requests_locked(BlockDriverState *bs,
-  BdrvTrackedRequest *self)
+bdrv_wait_serialising_requests_locked(BdrvTrackedRequest *self)
 {
 BdrvTrackedRequest *req;
 bool waited = false;
 
 while ((req = bdrv_find_conflicting_request(self))) {
 self->waiting_for = req;
-qemu_co_queue_wait(&req->wait_queue, &bs->reqs_lock);
+qemu_co_queue_wait(&req->wait_queue, &self->bs->reqs_lock);
 self->waiting_for = NULL;
 waited = true;
 }
@@ -794,7 +794,7 @@ bool bdrv_mark_request_serialising(BdrvTrackedRequest *req, 
uint64_t align)
 
 req->overlap_offset = MIN(req->overlap_offset, overlap_offset);
 req->overlap_bytes = MAX(req->overlap_bytes, overlap_bytes);
-waited = bdrv_wait_serialising_requests_locked(bs, req);
+waited = bdrv_wait_serialising_requests_locked(req);
 qemu_co_mutex_unlock(&bs->reqs_lock);
 return waited;
 }
@@ -876,7 +876,7 @@ static bool coroutine_fn 
bdrv_wait_serialising_requests(BdrvTrackedRequest *self
 }
 
 qemu_co_mutex_lock(&bs->reqs_lock);
-waited = bdrv_wait_serialising_requests_locked(bs, self);
+waited = bdrv_wait_serialising_requests_locked(self);
 qemu_co_mutex_unlock(&bs->reqs_lock);
 
 return waited;
-- 
2.21.3




[PATCH v4 01/10] block: simplify comment to BDRV_REQ_SERIALISING

2020-08-20 Thread Vladimir Sementsov-Ogievskiy
1. BDRV_REQ_NO_SERIALISING doesn't exist anymore, so don't mention it.

2. We are going to add one more user of BDRV_REQ_SERIALISING, so the
   comment about backup becomes a bit confusing here. The use case in
   backup is documented in block/backup.c, so let's just drop the
   duplication here.

3. The fact that BDRV_REQ_SERIALISING is only for write requests is
   omitted. Add a note.

Signed-off-by: Vladimir Sementsov-Ogievskiy 
Reviewed-by: Stefan Hajnoczi 
---
 include/block/block.h | 11 +--
 1 file changed, 1 insertion(+), 10 deletions(-)

diff --git a/include/block/block.h b/include/block/block.h
index 6e36154061..b8f4e86e8d 100644
--- a/include/block/block.h
+++ b/include/block/block.h
@@ -53,16 +53,7 @@ typedef enum {
  * content. */
 BDRV_REQ_WRITE_UNCHANGED= 0x40,
 
-/*
- * BDRV_REQ_SERIALISING forces request serialisation for writes.
- * It is used to ensure that writes to the backing file of a backup process
- * target cannot race with a read of the backup target that defers to the
- * backing file.
- *
- * Note, that BDRV_REQ_SERIALISING is _not_ opposite in meaning to
- * BDRV_REQ_NO_SERIALISING. A more descriptive name for the latter might be
- * _DO_NOT_WAIT_FOR_SERIALISING, except that is too long.
- */
+/* Forces request serialisation. Use only with write requests. */
 BDRV_REQ_SERIALISING= 0x80,
 
 /* Execute the request only if the operation can be offloaded or otherwise
-- 
2.21.3




[PATCH v4 02/10] block/io.c: drop assertion on double waiting for request serialisation

2020-08-20 Thread Vladimir Sementsov-Ogievskiy
The comment states that on a misaligned request we should already have
been waiting. But for bdrv_padding_rmw_read, we called
bdrv_mark_request_serialising with align = request_alignment, and now
we serialise with align = cluster_size. So we may have to wait again,
with the larger alignment.

Note that the only user of BDRV_REQ_SERIALISING is backup, which issues
cluster-aligned requests, so it seems the assertion should not fire for
now. But it's wrong anyway.

Signed-off-by: Vladimir Sementsov-Ogievskiy 
Reviewed-by: Paolo Bonzini 
---
 block/io.c | 11 +--
 1 file changed, 1 insertion(+), 10 deletions(-)

diff --git a/block/io.c b/block/io.c
index ad3a51ed53..b18680a842 100644
--- a/block/io.c
+++ b/block/io.c
@@ -1881,7 +1881,6 @@ bdrv_co_write_req_prepare(BdrvChild *child, int64_t 
offset, uint64_t bytes,
   BdrvTrackedRequest *req, int flags)
 {
 BlockDriverState *bs = child->bs;
-bool waited;
 int64_t end_sector = DIV_ROUND_UP(offset + bytes, BDRV_SECTOR_SIZE);
 
 if (bs->read_only) {
@@ -1893,15 +1892,7 @@ bdrv_co_write_req_prepare(BdrvChild *child, int64_t 
offset, uint64_t bytes,
 assert(!(flags & ~BDRV_REQ_MASK));
 
 if (flags & BDRV_REQ_SERIALISING) {
-waited = bdrv_mark_request_serialising(req, bdrv_get_cluster_size(bs));
-/*
- * For a misaligned request we should have already waited earlier,
- * because we come after bdrv_padding_rmw_read which must be called
- * with the request already marked as serialising.
- */
-assert(!waited ||
-   (req->offset == req->overlap_offset &&
-req->bytes == req->overlap_bytes));
+bdrv_mark_request_serialising(req, bdrv_get_cluster_size(bs));
 } else {
 bdrv_wait_serialising_requests(req);
 }
-- 
2.21.3




[PATCH v4 05/10] block: bdrv_mark_request_serialising: split non-waiting function

2020-08-20 Thread Vladimir Sementsov-Ogievskiy
We'll need a separate function which will only "mark" a request
serialising with the specified alignment, but not wait for conflicting
requests. So, it will be like the old bdrv_mark_request_serialising(),
before bdrv_wait_serialising_requests_locked() was merged into it.

To reduce possible confusion, let's do the following:

The public function that does both marking and waiting will be called
bdrv_make_request_serialising(), and the private function which will
only "mark" will be called tracked_request_set_serialising().

Signed-off-by: Vladimir Sementsov-Ogievskiy 
---
 include/block/block_int.h |  3 ++-
 block/file-posix.c|  2 +-
 block/io.c| 35 +++
 3 files changed, 26 insertions(+), 14 deletions(-)

diff --git a/include/block/block_int.h b/include/block/block_int.h
index 38dec0275b..4d56a1b141 100644
--- a/include/block/block_int.h
+++ b/include/block/block_int.h
@@ -1039,7 +1039,8 @@ extern unsigned int bdrv_drain_all_count;
 void bdrv_apply_subtree_drain(BdrvChild *child, BlockDriverState *new_parent);
 void bdrv_unapply_subtree_drain(BdrvChild *child, BlockDriverState 
*old_parent);
 
-bool coroutine_fn bdrv_mark_request_serialising(BdrvTrackedRequest *req, 
uint64_t align);
+bool coroutine_fn bdrv_make_request_serialising(BdrvTrackedRequest *req,
+uint64_t align);
 BdrvTrackedRequest *coroutine_fn bdrv_co_get_self_request(BlockDriverState 
*bs);
 
 int get_tmp_filename(char *filename, int size);
diff --git a/block/file-posix.c b/block/file-posix.c
index 9a00d4190a..560d1c0a94 100644
--- a/block/file-posix.c
+++ b/block/file-posix.c
@@ -2957,7 +2957,7 @@ raw_do_pwrite_zeroes(BlockDriverState *bs, int64_t 
offset, int bytes,
 req->bytes = end - req->offset;
 req->overlap_bytes = req->bytes;
 
-bdrv_mark_request_serialising(req, bs->bl.request_alignment);
+bdrv_make_request_serialising(req, bs->bl.request_alignment);
 }
 #endif
 
diff --git a/block/io.c b/block/io.c
index 36bbe4b9b1..dd28befb08 100644
--- a/block/io.c
+++ b/block/io.c
@@ -778,15 +778,14 @@ bdrv_wait_serialising_requests_locked(BdrvTrackedRequest 
*self)
 return waited;
 }
 
-bool bdrv_mark_request_serialising(BdrvTrackedRequest *req, uint64_t align)
+/* Called with req->bs->reqs_lock held */
+static void tracked_request_set_serialising(BdrvTrackedRequest *req,
+uint64_t align)
 {
-BlockDriverState *bs = req->bs;
 int64_t overlap_offset = req->offset & ~(align - 1);
 uint64_t overlap_bytes = ROUND_UP(req->offset + req->bytes, align)
- overlap_offset;
-bool waited;
 
-qemu_co_mutex_lock(&bs->reqs_lock);
 if (!req->serialising) {
 atomic_inc(&req->bs->serialising_in_flight);
 req->serialising = true;
@@ -794,9 +793,6 @@ bool bdrv_mark_request_serialising(BdrvTrackedRequest *req, 
uint64_t align)
 
 req->overlap_offset = MIN(req->overlap_offset, overlap_offset);
 req->overlap_bytes = MAX(req->overlap_bytes, overlap_bytes);
-waited = bdrv_wait_serialising_requests_locked(req);
-qemu_co_mutex_unlock(&bs->reqs_lock);
-return waited;
 }
 
 /**
@@ -882,6 +878,21 @@ static bool coroutine_fn 
bdrv_wait_serialising_requests(BdrvTrackedRequest *self
 return waited;
 }
 
+bool coroutine_fn bdrv_make_request_serialising(BdrvTrackedRequest *req,
+uint64_t align)
+{
+bool waited;
+
+qemu_co_mutex_lock(&req->bs->reqs_lock);
+
+tracked_request_set_serialising(req, align);
+waited = bdrv_wait_serialising_requests_locked(req);
+
+qemu_co_mutex_unlock(&req->bs->reqs_lock);
+
+return waited;
+}
+
 static int bdrv_check_byte_request(BlockDriverState *bs, int64_t offset,
size_t size)
 {
@@ -1492,7 +1503,7 @@ static int coroutine_fn bdrv_aligned_preadv(BdrvChild 
*child,
  * with each other for the same cluster.  For example, in copy-on-read
  * it ensures that the CoR read and write operations are atomic and
  * guest writes cannot interleave between them. */
-bdrv_mark_request_serialising(req, bdrv_get_cluster_size(bs));
+bdrv_make_request_serialising(req, bdrv_get_cluster_size(bs));
 } else {
 bdrv_wait_serialising_requests(req);
 }
@@ -1903,7 +1914,7 @@ bdrv_co_write_req_prepare(BdrvChild *child, int64_t 
offset, uint64_t bytes,
 assert(!(flags & ~BDRV_REQ_MASK));
 
 if (flags & BDRV_REQ_SERIALISING) {
-bdrv_mark_request_serialising(req, bdrv_get_cluster_size(bs));
+bdrv_make_request_serialising(req, bdrv_get_cluster_size(bs));
 } else {
 bdrv_wait_serialising_requests(req);
 }
@@ -2069,7 +2080,7 @@ static int coroutine_fn bdrv_co_do_zero_pwritev(BdrvChild 
*child,
 
 padding = bdrv_init_padding(bs, offset, bytes, &pad);
 if (padding) {
-bdrv_mark_request_serialising(req, align);
+bdrv_make_request_serialising(req, align);

[PATCH v4 03/10] block/io: split out bdrv_find_conflicting_request

2020-08-20 Thread Vladimir Sementsov-Ogievskiy
To be reused in a separate function.

Signed-off-by: Vladimir Sementsov-Ogievskiy 
Reviewed-by: Stefan Hajnoczi 
---
 block/io.c | 71 +++---
 1 file changed, 41 insertions(+), 30 deletions(-)

diff --git a/block/io.c b/block/io.c
index b18680a842..5b96715058 100644
--- a/block/io.c
+++ b/block/io.c
@@ -727,43 +727,54 @@ static bool tracked_request_overlaps(BdrvTrackedRequest 
*req,
 return true;
 }
 
+/* Called with self->bs->reqs_lock held */
+static BdrvTrackedRequest *
+bdrv_find_conflicting_request(BdrvTrackedRequest *self)
+{
+BdrvTrackedRequest *req;
+
+QLIST_FOREACH(req, &self->bs->tracked_requests, list) {
+if (req == self || (!req->serialising && !self->serialising)) {
+continue;
+}
+if (tracked_request_overlaps(req, self->overlap_offset,
+ self->overlap_bytes))
+{
+/*
+ * Hitting this means there was a reentrant request, for
+ * example, a block driver issuing nested requests.  This must
+ * never happen since it means deadlock.
+ */
+assert(qemu_coroutine_self() != req->co);
+
+/*
+ * If the request is already (indirectly) waiting for us, or
+ * will wait for us as soon as it wakes up, then just go on
+ * (instead of producing a deadlock in the former case).
+ */
+if (!req->waiting_for) {
+return req;
+}
+}
+}
+
+return NULL;
+}
+
 static bool coroutine_fn
 bdrv_wait_serialising_requests_locked(BlockDriverState *bs,
   BdrvTrackedRequest *self)
 {
 BdrvTrackedRequest *req;
-bool retry;
 bool waited = false;
 
-do {
-retry = false;
-QLIST_FOREACH(req, &bs->tracked_requests, list) {
-if (req == self || (!req->serialising && !self->serialising)) {
-continue;
-}
-if (tracked_request_overlaps(req, self->overlap_offset,
- self->overlap_bytes))
-{
-/* Hitting this means there was a reentrant request, for
- * example, a block driver issuing nested requests.  This must
- * never happen since it means deadlock.
- */
-assert(qemu_coroutine_self() != req->co);
-
-/* If the request is already (indirectly) waiting for us, or
- * will wait for us as soon as it wakes up, then just go on
- * (instead of producing a deadlock in the former case). */
-if (!req->waiting_for) {
-self->waiting_for = req;
-qemu_co_queue_wait(&req->wait_queue, &bs->reqs_lock);
-self->waiting_for = NULL;
-retry = true;
-waited = true;
-break;
-}
-}
-}
-} while (retry);
+while ((req = bdrv_find_conflicting_request(self))) {
+self->waiting_for = req;
+qemu_co_queue_wait(&req->wait_queue, &bs->reqs_lock);
+self->waiting_for = NULL;
+waited = true;
+}
+
 return waited;
 }
 
-- 
2.21.3




[PATCH v4 00/10] preallocate filter

2020-08-20 Thread Vladimir Sementsov-Ogievskiy
Hi all!

Here is a filter which does preallocation on write.

In Virtuozzo we have to deal with a custom distributed storage
solution, where allocation is a relatively expensive operation. We have
to work around that in QEMU, so here is a new filter.
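
For reference, the filter is inserted between the format and protocol
layers. A minimal example of how it can be enabled with --image-opts
(this is roughly what the new iotest 298 in this series does; the file
name is just a placeholder):

  qemu-io --image-opts -t none \
      'driver=qcow2,file.driver=preallocate,file.file.filename=test.qcow2'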

v4:
01-04: add r-bs
05: add coroutine_fn tag
06: use QEMU_LOCK_GUARD and fix reqs_lock leak
07: grammar
08-10: add r-b

I decided not to resend patches 11-12, since I don't know
how they can be applied universally. I think I'll keep them
downstream only for now, and probably drop them in the future if we
can insert the filter in libvirt.

Vladimir Sementsov-Ogievskiy (10):
  block: simplify comment to BDRV_REQ_SERIALISING
  block/io.c: drop assertion on double waiting for request serialisation
  block/io: split out bdrv_find_conflicting_request
  block/io: bdrv_wait_serialising_requests_locked: drop extra bs arg
  block: bdrv_mark_request_serialising: split non-waiting function
  block: introduce BDRV_REQ_NO_WAIT flag
  block: introduce preallocate filter
  iotests.py: add verify_o_direct helper
  iotests.py: add filter_img_check
  iotests: add 298 to test new preallocate filter driver

 docs/system/qemu-block-drivers.rst.inc |  26 +++
 qapi/block-core.json   |  20 +-
 include/block/block.h  |  20 +-
 include/block/block_int.h  |   3 +-
 block/file-posix.c |   2 +-
 block/io.c | 130 ++-
 block/preallocate.c| 291 +
 block/Makefile.objs|   1 +
 tests/qemu-iotests/298 |  50 +
 tests/qemu-iotests/298.out |   6 +
 tests/qemu-iotests/group   |   1 +
 tests/qemu-iotests/iotests.py  |  10 +
 12 files changed, 492 insertions(+), 68 deletions(-)
 create mode 100644 block/preallocate.c
 create mode 100644 tests/qemu-iotests/298
 create mode 100644 tests/qemu-iotests/298.out

-- 
2.21.3




Re: [PATCH v2 0/3] hw/sd: Add Cadence SDHCI emulation

2020-08-20 Thread Philippe Mathieu-Daudé
Hi Sai Pavan, you said you were interested in testing the first 2
patches. FYI I plan to queue them and send the pull request tomorrow
or on Saturday at the latest.

On 8/17/20 12:03 PM, Bin Meng wrote:
> This series is spun off from the following series as it is hw/sd
> centric, so that it can be picked up separately by Philippe.
> 
> http://patchwork.ozlabs.org/project/qemu-devel/list/?series=195648
> 
> This series fixed 2 SD card issues, and added a new model for
> Cadence SDHCI controller.
> 
> Patch "[09/18] hw/sd: sdhci: Make sdhci_poweron_reset() internal visible"
> in this series per the review comments.
> 
> Changes in v2:
> - remove the pointless zero initialization
> - fix SDSC size check in sd_set_csd() too
> - use 's' for the model state
> - call device_cold_reset() in cadence_sdhci_reset()
> - add .impl in cadence_sdhci_ops
> - move Cadence specific register defines to cadence_sdhci.c
> - use 'sdhci' instead of 'slot' to represent SDHCIState
> - use sysbus_mmio_get_region() to access SDHCI model's memory region
> - initialize TYPE_SYSBUS_SDHCI in the instance_init() so that users
>   of Cadence SDHCI do not have to do that themselves
> - propagate irq and 'sd-bus' from generic-sdhci
> 
> Bin Meng (3):
>   hw/sd: sd: Fix incorrect populated function switch status data
> structure
>   hw/sd: sd: Correct the maximum size of a Standard Capacity SD Memory
> Card
>   hw/sd: Add Cadence SDHCI emulation
> 
>  hw/sd/Kconfig |   4 +
>  hw/sd/Makefile.objs   |   1 +
>  hw/sd/cadence_sdhci.c | 200 
> ++
>  hw/sd/sd.c|   9 +-
>  include/hw/sd/cadence_sdhci.h |  46 ++
>  5 files changed, 257 insertions(+), 3 deletions(-)
>  create mode 100644 hw/sd/cadence_sdhci.c
>  create mode 100644 include/hw/sd/cadence_sdhci.h
> 



[PATCH v5 15/15] block/nvme: Use an array of EventNotifier

2020-08-20 Thread Philippe Mathieu-Daudé
In preparation for using multiple IRQs (thus multiple eventfds),
make BDRVNVMeState::irq_notifier an array (for now of a single
element, the admin queue notifier).

Reviewed-by: Stefan Hajnoczi 
Signed-off-by: Philippe Mathieu-Daudé 
---
 block/nvme.c | 31 +--
 1 file changed, 21 insertions(+), 10 deletions(-)

diff --git a/block/nvme.c b/block/nvme.c
index a61e86a83eb..fe8a40b7ede 100644
--- a/block/nvme.c
+++ b/block/nvme.c
@@ -106,6 +106,12 @@ QEMU_BUILD_BUG_ON(offsetof(NVMeRegs, doorbells) != 0x1000);
 #define INDEX_ADMIN 0
 #define INDEX_IO(n) (1 + n)
 
+/* This driver shares a single MSIX IRQ for the admin and I/O queues */
+enum {
+MSIX_SHARED_IRQ_IDX = 0,
+MSIX_IRQ_COUNT = 1
+};
+
 struct BDRVNVMeState {
 AioContext *aio_context;
 QEMUVFIOState *vfio;
@@ -120,7 +126,7 @@ struct BDRVNVMeState {
 /* How many uint32_t elements does each doorbell entry take. */
 size_t doorbell_scale;
 bool write_cache_supported;
-EventNotifier irq_notifier;
+EventNotifier irq_notifier[MSIX_IRQ_COUNT];
 
 uint64_t nsze; /* Namespace size reported by identify command */
 int nsid;  /* The namespace id to read/write data. */
@@ -631,7 +637,8 @@ static bool nvme_poll_queues(BDRVNVMeState *s)
 
 static void nvme_handle_event(EventNotifier *n)
 {
-BDRVNVMeState *s = container_of(n, BDRVNVMeState, irq_notifier);
+BDRVNVMeState *s = container_of(n, BDRVNVMeState,
+irq_notifier[MSIX_SHARED_IRQ_IDX]);
 
 trace_nvme_handle_event(s);
 event_notifier_test_and_clear(n);
@@ -683,7 +690,8 @@ out_error:
 static bool nvme_poll_cb(void *opaque)
 {
 EventNotifier *e = opaque;
-BDRVNVMeState *s = container_of(e, BDRVNVMeState, irq_notifier);
+BDRVNVMeState *s = container_of(e, BDRVNVMeState,
+irq_notifier[MSIX_SHARED_IRQ_IDX]);
 
 trace_nvme_poll_cb(s);
 return nvme_poll_queues(s);
@@ -705,7 +713,7 @@ static int nvme_init(BlockDriverState *bs, const char 
*device, int namespace,
 s->device = g_strdup(device);
 s->nsid = namespace;
 s->aio_context = bdrv_get_aio_context(bs);
-ret = event_notifier_init(&s->irq_notifier, 0);
+ret = event_notifier_init(&s->irq_notifier[MSIX_SHARED_IRQ_IDX], 0);
 if (ret) {
 error_setg(errp, "Failed to init event notifier");
 return ret;
@@ -784,12 +792,13 @@ static int nvme_init(BlockDriverState *bs, const char 
*device, int namespace,
 }
 }
 
-ret = qemu_vfio_pci_init_irq(s->vfio, &s->irq_notifier,
+ret = qemu_vfio_pci_init_irq(s->vfio, s->irq_notifier,
  VFIO_PCI_MSIX_IRQ_INDEX, errp);
 if (ret) {
 goto out;
 }
-aio_set_event_notifier(bdrv_get_aio_context(bs), &s->irq_notifier,
+aio_set_event_notifier(bdrv_get_aio_context(bs),
+   &s->irq_notifier[MSIX_SHARED_IRQ_IDX],
false, nvme_handle_event, nvme_poll_cb);
 
 nvme_identify(bs, namespace, _err);
@@ -872,9 +881,10 @@ static void nvme_close(BlockDriverState *bs)
 nvme_free_queue_pair(s->queues[i]);
 }
 g_free(s->queues);
-aio_set_event_notifier(bdrv_get_aio_context(bs), &s->irq_notifier,
+aio_set_event_notifier(bdrv_get_aio_context(bs),
+   &s->irq_notifier[MSIX_SHARED_IRQ_IDX],
 false, NULL, NULL);
-event_notifier_cleanup(&s->irq_notifier);
+event_notifier_cleanup(&s->irq_notifier[MSIX_SHARED_IRQ_IDX]);
 qemu_vfio_pci_unmap_bar(s->vfio, 0, (void *)s->regs, 0, NVME_BAR_SIZE);
 qemu_vfio_close(s->vfio);
 
@@ -1381,7 +1391,8 @@ static void nvme_detach_aio_context(BlockDriverState *bs)
 q->completion_bh = NULL;
 }
 
-aio_set_event_notifier(bdrv_get_aio_context(bs), &s->irq_notifier,
+aio_set_event_notifier(bdrv_get_aio_context(bs),
+   &s->irq_notifier[MSIX_SHARED_IRQ_IDX],
false, NULL, NULL);
 }
 
@@ -1391,7 +1402,7 @@ static void nvme_attach_aio_context(BlockDriverState *bs,
 BDRVNVMeState *s = bs->opaque;
 
 s->aio_context = new_context;
-aio_set_event_notifier(new_context, &s->irq_notifier,
+aio_set_event_notifier(new_context, &s->irq_notifier[MSIX_SHARED_IRQ_IDX],
false, nvme_handle_event, nvme_poll_cb);
 
 for (int i = 0; i < s->nr_queues; i++) {
-- 
2.26.2




[PATCH v5 11/15] block/nvme: Simplify nvme_init_queue() arguments

2020-08-20 Thread Philippe Mathieu-Daudé
nvme_init_queue() doesn't require BlockDriverState anymore.
Replace it by BDRVNVMeState to simplify.

Reviewed-by: Stefan Hajnoczi 
Signed-off-by: Philippe Mathieu-Daudé 
---
 block/nvme.c | 7 +++
 1 file changed, 3 insertions(+), 4 deletions(-)

diff --git a/block/nvme.c b/block/nvme.c
index f180078e781..5b69fc75a60 100644
--- a/block/nvme.c
+++ b/block/nvme.c
@@ -165,10 +165,9 @@ static QemuOptsList runtime_opts = {
 },
 };
 
-static void nvme_init_queue(BlockDriverState *bs, NVMeQueue *q,
+static void nvme_init_queue(BDRVNVMeState *s, NVMeQueue *q,
 int nentries, int entry_bytes, Error **errp)
 {
-BDRVNVMeState *s = bs->opaque;
 size_t bytes;
 int r;
 
@@ -251,14 +250,14 @@ static NVMeQueuePair 
*nvme_create_queue_pair(BlockDriverState *bs,
 req->prp_list_iova = prp_list_iova + i * s->page_size;
 }
 
-nvme_init_queue(bs, &q->sq, size, NVME_SQ_ENTRY_BYTES, &local_err);
+nvme_init_queue(s, &q->sq, size, NVME_SQ_ENTRY_BYTES, &local_err);
 if (local_err) {
 error_propagate(errp, local_err);
 goto fail;
 }
 q->sq.doorbell = &s->regs->doorbells[idx * 2 * s->doorbell_scale];
 
-nvme_init_queue(bs, &q->cq, size, NVME_CQ_ENTRY_BYTES, &local_err);
+nvme_init_queue(s, &q->cq, size, NVME_CQ_ENTRY_BYTES, &local_err);
 if (local_err) {
 error_propagate(errp, local_err);
 goto fail;
-- 
2.26.2




[PATCH v5 13/15] block/nvme: Simplify nvme_create_queue_pair() arguments

2020-08-20 Thread Philippe Mathieu-Daudé
nvme_create_queue_pair() doesn't require BlockDriverState anymore.
Replace it by BDRVNVMeState and AioContext to simplify.

Reviewed-by: Stefan Hajnoczi 
Signed-off-by: Philippe Mathieu-Daudé 
---
 block/nvme.c | 13 +++--
 1 file changed, 7 insertions(+), 6 deletions(-)

diff --git a/block/nvme.c b/block/nvme.c
index 456fe61f5ea..1f67e888c84 100644
--- a/block/nvme.c
+++ b/block/nvme.c
@@ -208,12 +208,12 @@ static void nvme_free_req_queue_cb(void *opaque)
 qemu_mutex_unlock(&q->lock);
 }
 
-static NVMeQueuePair *nvme_create_queue_pair(BlockDriverState *bs,
+static NVMeQueuePair *nvme_create_queue_pair(BDRVNVMeState *s,
+ AioContext *aio_context,
  int idx, int size,
  Error **errp)
 {
 int i, r;
-BDRVNVMeState *s = bs->opaque;
 Error *local_err = NULL;
 NVMeQueuePair *q;
 uint64_t prp_list_iova;
@@ -232,8 +232,7 @@ static NVMeQueuePair 
*nvme_create_queue_pair(BlockDriverState *bs,
 q->s = s;
 q->index = idx;
 qemu_co_queue_init(&q->free_req_queue);
-q->completion_bh = aio_bh_new(bdrv_get_aio_context(bs),
-  nvme_process_completion_bh, q);
+q->completion_bh = aio_bh_new(aio_context, nvme_process_completion_bh, q);
 r = qemu_vfio_dma_map(s->vfio, q->prp_list_pages,
   s->page_size * NVME_NUM_REQS,
  false, &prp_list_iova);
@@ -637,7 +636,8 @@ static bool nvme_add_io_queue(BlockDriverState *bs, Error 
**errp)
 NvmeCmd cmd;
 int queue_size = NVME_QUEUE_SIZE;
 
-q = nvme_create_queue_pair(bs, n, queue_size, errp);
+q = nvme_create_queue_pair(s, bdrv_get_aio_context(bs),
+   n, queue_size, errp);
 if (!q) {
 return false;
 }
@@ -683,6 +683,7 @@ static int nvme_init(BlockDriverState *bs, const char 
*device, int namespace,
  Error **errp)
 {
 BDRVNVMeState *s = bs->opaque;
+AioContext *aio_context = bdrv_get_aio_context(bs);
 int ret;
 uint64_t cap;
 uint64_t timeout_ms;
@@ -743,7 +744,7 @@ static int nvme_init(BlockDriverState *bs, const char 
*device, int namespace,
 
 /* Set up admin queue. */
 s->queues = g_new(NVMeQueuePair *, 1);
-s->queues[INDEX_ADMIN] = nvme_create_queue_pair(bs, 0,
+s->queues[INDEX_ADMIN] = nvme_create_queue_pair(s, aio_context, 0,
   NVME_QUEUE_SIZE,
   errp);
 if (!s->queues[INDEX_ADMIN]) {
-- 
2.26.2




[PATCH v5 07/15] block/nvme: Rename local variable

2020-08-20 Thread Philippe Mathieu-Daudé
We are going to modify the code in the next commit. Renaming
the 'resp' variable to 'id' first makes the next commit easier
to review. No logical changes.

Reviewed-by: Stefan Hajnoczi 
Signed-off-by: Philippe Mathieu-Daudé 
---
 block/nvme.c | 19 +--
 1 file changed, 9 insertions(+), 10 deletions(-)

diff --git a/block/nvme.c b/block/nvme.c
index 3101f1ad55d..99822d9fd36 100644
--- a/block/nvme.c
+++ b/block/nvme.c
@@ -510,8 +510,8 @@ static void nvme_identify(BlockDriverState *bs, int 
namespace, Error **errp)
 BDRVNVMeState *s = bs->opaque;
 NvmeIdCtrl *idctrl;
 NvmeIdNs *idns;
+uint8_t *id;
 NvmeLBAF *lbaf;
-uint8_t *resp;
 uint16_t oncs;
 int r;
 uint64_t iova;
@@ -520,14 +520,14 @@ static void nvme_identify(BlockDriverState *bs, int 
namespace, Error **errp)
 .cdw10 = cpu_to_le32(0x1),
 };
 
-resp = qemu_try_blockalign0(bs, sizeof(NvmeIdCtrl));
-if (!resp) {
+id = qemu_try_blockalign0(bs, sizeof(NvmeIdCtrl));
+if (!id) {
 error_setg(errp, "Cannot allocate buffer for identify response");
 goto out;
 }
-idctrl = (NvmeIdCtrl *)resp;
-idns = (NvmeIdNs *)resp;
-r = qemu_vfio_dma_map(s->vfio, resp, sizeof(NvmeIdCtrl), true, &iova);
+idctrl = (NvmeIdCtrl *)id;
+idns = (NvmeIdNs *)id;
+r = qemu_vfio_dma_map(s->vfio, id, sizeof(NvmeIdCtrl), true, &iova);
 if (r) {
 error_setg(errp, "Cannot map buffer for DMA");
 goto out;
@@ -554,8 +554,7 @@ static void nvme_identify(BlockDriverState *bs, int 
namespace, Error **errp)
 s->supports_write_zeroes = !!(oncs & NVME_ONCS_WRITE_ZEROS);
 s->supports_discard = !!(oncs & NVME_ONCS_DSM);
 
-memset(resp, 0, 4096);
-
+memset(id, 0, 4096);
 cmd.cdw10 = 0;
 cmd.nsid = cpu_to_le32(namespace);
 if (nvme_cmd_sync(bs, s->queues[INDEX_ADMIN], &cmd)) {
@@ -587,8 +586,8 @@ static void nvme_identify(BlockDriverState *bs, int 
namespace, Error **errp)
 
 s->blkshift = lbaf->ds;
 out:
-qemu_vfio_dma_unmap(s->vfio, resp);
-qemu_vfree(resp);
+qemu_vfio_dma_unmap(s->vfio, id);
+qemu_vfree(id);
 }
 
 static bool nvme_poll_queues(BDRVNVMeState *s)
-- 
2.26.2




[PATCH v5 06/15] block/nvme: Use common error path in nvme_add_io_queue()

2020-08-20 Thread Philippe Mathieu-Daudé
Rearrange nvme_add_io_queue() by using a common error path.
This will prove useful in a few commits, where we add IRQ
notification to the IO queues.

Reviewed-by: Stefan Hajnoczi 
Signed-off-by: Philippe Mathieu-Daudé 
---
 block/nvme.c | 9 +
 1 file changed, 5 insertions(+), 4 deletions(-)

diff --git a/block/nvme.c b/block/nvme.c
index 53448b7d230..3101f1ad55d 100644
--- a/block/nvme.c
+++ b/block/nvme.c
@@ -649,8 +649,7 @@ static bool nvme_add_io_queue(BlockDriverState *bs, Error 
**errp)
 };
 if (nvme_cmd_sync(bs, s->queues[INDEX_ADMIN], &cmd)) {
 error_setg(errp, "Failed to create CQ io queue [%d]", n);
-nvme_free_queue_pair(q);
-return false;
+goto out_error;
 }
 cmd = (NvmeCmd) {
 .opcode = NVME_ADM_CMD_CREATE_SQ,
@@ -660,13 +659,15 @@ static bool nvme_add_io_queue(BlockDriverState *bs, Error 
**errp)
 };
 if (nvme_cmd_sync(bs, s->queues[INDEX_ADMIN], &cmd)) {
 error_setg(errp, "Failed to create SQ io queue [%d]", n);
-nvme_free_queue_pair(q);
-return false;
+goto out_error;
 }
 s->queues = g_renew(NVMeQueuePair *, s->queues, n + 1);
 s->queues[n] = q;
 s->nr_queues++;
 return true;
+out_error:
+nvme_free_queue_pair(q);
+return false;
 }
 
 static bool nvme_poll_cb(void *opaque)
-- 
2.26.2




[PATCH v5 09/15] block/nvme: Replace qemu_try_blockalign0 by qemu_try_blockalign/memset

2020-08-20 Thread Philippe Mathieu-Daudé
In the next commit we'll get rid of qemu_try_blockalign().
To ease review, first replace qemu_try_blockalign0() by explicit
calls to qemu_try_blockalign() and memset().

Reviewed-by: Stefan Hajnoczi 
Signed-off-by: Philippe Mathieu-Daudé 
---
 block/nvme.c | 16 +---
 1 file changed, 9 insertions(+), 7 deletions(-)

diff --git a/block/nvme.c b/block/nvme.c
index 2bd1935f951..ac6bb52043d 100644
--- a/block/nvme.c
+++ b/block/nvme.c
@@ -174,12 +174,12 @@ static void nvme_init_queue(BlockDriverState *bs, 
NVMeQueue *q,
 
 bytes = ROUND_UP(nentries * entry_bytes, s->page_size);
 q->head = q->tail = 0;
-q->queue = qemu_try_blockalign0(bs, bytes);
-
+q->queue = qemu_try_blockalign(bs, bytes);
 if (!q->queue) {
 error_setg(errp, "Cannot allocate queue");
 return;
 }
+memset(q->queue, 0, bytes);
 r = qemu_vfio_dma_map(s->vfio, q->queue, bytes, false, &q->iova);
 if (r) {
 error_setg(errp, "Cannot map queue");
@@ -223,11 +223,12 @@ static NVMeQueuePair 
*nvme_create_queue_pair(BlockDriverState *bs,
 if (!q) {
 return NULL;
 }
-q->prp_list_pages = qemu_try_blockalign0(bs,
+q->prp_list_pages = qemu_try_blockalign(bs,
   s->page_size * NVME_QUEUE_SIZE);
 if (!q->prp_list_pages) {
 goto fail;
 }
+memset(q->prp_list_pages, 0, s->page_size * NVME_QUEUE_SIZE);
 qemu_mutex_init(&q->lock);
 q->s = s;
 q->index = idx;
@@ -521,7 +522,7 @@ static void nvme_identify(BlockDriverState *bs, int 
namespace, Error **errp)
 .cdw10 = cpu_to_le32(0x1),
 };
 
-id = qemu_try_blockalign0(bs, sizeof(*id));
+id = qemu_try_blockalign(bs, sizeof(*id));
 if (!id) {
 error_setg(errp, "Cannot allocate buffer for identify response");
 goto out;
@@ -531,8 +532,9 @@ static void nvme_identify(BlockDriverState *bs, int 
namespace, Error **errp)
 error_setg(errp, "Cannot map buffer for DMA");
 goto out;
 }
-cmd.prp1 = cpu_to_le64(iova);
 
+memset(id, 0, sizeof(*id));
+cmd.prp1 = cpu_to_le64(iova);
 if (nvme_cmd_sync(bs, s->queues[INDEX_ADMIN], &cmd)) {
 error_setg(errp, "Failed to identify controller");
 goto out;
@@ -1283,11 +1285,11 @@ static int coroutine_fn 
nvme_co_pdiscard(BlockDriverState *bs,
 
 assert(s->nr_queues > 1);
 
-buf = qemu_try_blockalign0(bs, s->page_size);
+buf = qemu_try_blockalign(bs, s->page_size);
 if (!buf) {
 return -ENOMEM;
 }
-
+memset(buf, 0, s->page_size);
 buf->nlb = cpu_to_le32(bytes >> s->blkshift);
 buf->slba = cpu_to_le64(offset >> s->blkshift);
 buf->cattr = 0;
-- 
2.26.2




[PATCH v5 10/15] block/nvme: Replace qemu_try_blockalign(bs) by qemu_try_memalign(pg_sz)

2020-08-20 Thread Philippe Mathieu-Daudé
qemu_try_blockalign() is a generic API that calls back into the
block driver to return its page alignment. As we call it from
within the very same driver, we already know the page alignment
stored in our state. Remove the indirection and use the value from
BDRVNVMeState.
This change is required to later remove the BlockDriverState
argument, to make nvme_init_queue() per hardware, and not per
block driver.

Reviewed-by: Stefan Hajnoczi 
Signed-off-by: Philippe Mathieu-Daudé 
---
 block/nvme.c | 10 +-
 1 file changed, 5 insertions(+), 5 deletions(-)

diff --git a/block/nvme.c b/block/nvme.c
index ac6bb52043d..f180078e781 100644
--- a/block/nvme.c
+++ b/block/nvme.c
@@ -174,7 +174,7 @@ static void nvme_init_queue(BlockDriverState *bs, NVMeQueue 
*q,
 
 bytes = ROUND_UP(nentries * entry_bytes, s->page_size);
 q->head = q->tail = 0;
-q->queue = qemu_try_blockalign(bs, bytes);
+q->queue = qemu_try_memalign(s->page_size, bytes);
 if (!q->queue) {
 error_setg(errp, "Cannot allocate queue");
 return;
@@ -223,7 +223,7 @@ static NVMeQueuePair 
*nvme_create_queue_pair(BlockDriverState *bs,
 if (!q) {
 return NULL;
 }
-q->prp_list_pages = qemu_try_blockalign(bs,
+q->prp_list_pages = qemu_try_memalign(s->page_size,
   s->page_size * NVME_QUEUE_SIZE);
 if (!q->prp_list_pages) {
 goto fail;
@@ -522,7 +522,7 @@ static void nvme_identify(BlockDriverState *bs, int 
namespace, Error **errp)
 .cdw10 = cpu_to_le32(0x1),
 };
 
-id = qemu_try_blockalign(bs, sizeof(*id));
+id = qemu_try_memalign(s->page_size, sizeof(*id));
 if (!id) {
 error_setg(errp, "Cannot allocate buffer for identify response");
 goto out;
@@ -1141,7 +1141,7 @@ static int nvme_co_prw(BlockDriverState *bs, uint64_t 
offset, uint64_t bytes,
 return nvme_co_prw_aligned(bs, offset, bytes, qiov, is_write, flags);
 }
 trace_nvme_prw_buffered(s, offset, bytes, qiov->niov, is_write);
-buf = qemu_try_blockalign(bs, bytes);
+buf = qemu_try_memalign(s->page_size, bytes);
 
 if (!buf) {
 return -ENOMEM;
@@ -1285,7 +1285,7 @@ static int coroutine_fn nvme_co_pdiscard(BlockDriverState 
*bs,
 
 assert(s->nr_queues > 1);
 
-buf = qemu_try_blockalign(bs, s->page_size);
+buf = qemu_try_memalign(s->page_size, s->page_size);
 if (!buf) {
 return -ENOMEM;
 }
-- 
2.26.2




[PATCH v5 05/15] block/nvme: Improve error message when IO queue creation failed

2020-08-20 Thread Philippe Mathieu-Daudé
Do not use the same error message for different failures.
Display a different error whether it is the CQ or the SQ.

Reviewed-by: Stefan Hajnoczi 
Signed-off-by: Philippe Mathieu-Daudé 
---
 block/nvme.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/block/nvme.c b/block/nvme.c
index 003809fbd83..53448b7d230 100644
--- a/block/nvme.c
+++ b/block/nvme.c
@@ -648,7 +648,7 @@ static bool nvme_add_io_queue(BlockDriverState *bs, Error 
**errp)
 .cdw11 = cpu_to_le32(0x3),
 };
 if (nvme_cmd_sync(bs, s->queues[INDEX_ADMIN], &cmd)) {
-error_setg(errp, "Failed to create io queue [%d]", n);
+error_setg(errp, "Failed to create CQ io queue [%d]", n);
 nvme_free_queue_pair(q);
 return false;
 }
@@ -659,7 +659,7 @@ static bool nvme_add_io_queue(BlockDriverState *bs, Error 
**errp)
 .cdw11 = cpu_to_le32(0x1 | (n << 16)),
 };
 if (nvme_cmd_sync(bs, s->queues[INDEX_ADMIN], &cmd)) {
-error_setg(errp, "Failed to create io queue [%d]", n);
+error_setg(errp, "Failed to create SQ io queue [%d]", n);
 nvme_free_queue_pair(q);
 return false;
 }
-- 
2.26.2




[PATCH v5 14/15] block/nvme: Extract nvme_poll_queue()

2020-08-20 Thread Philippe Mathieu-Daudé
As we want to do per-queue polling, extract the nvme_poll_queue()
method which operates on a single queue.

Reviewed-by: Stefan Hajnoczi 
Signed-off-by: Philippe Mathieu-Daudé 
---
 block/nvme.c | 44 +++-
 1 file changed, 27 insertions(+), 17 deletions(-)

diff --git a/block/nvme.c b/block/nvme.c
index 1f67e888c84..a61e86a83eb 100644
--- a/block/nvme.c
+++ b/block/nvme.c
@@ -590,31 +590,41 @@ out:
 qemu_vfree(id);
 }
 
+static bool nvme_poll_queue(NVMeQueuePair *q)
+{
+bool progress = false;
+
+const size_t cqe_offset = q->cq.head * NVME_CQ_ENTRY_BYTES;
+NvmeCqe *cqe = (NvmeCqe *)&q->cq.queue[cqe_offset];
+
+/*
+ * Do an early check for completions. q->lock isn't needed because
+ * nvme_process_completion() only runs in the event loop thread and
+ * cannot race with itself.
+ */
+if ((le16_to_cpu(cqe->status) & 0x1) == q->cq_phase) {
+return false;
+}
+
+qemu_mutex_lock(&q->lock);
+while (nvme_process_completion(q)) {
+/* Keep polling */
+progress = true;
+}
+qemu_mutex_unlock(&q->lock);
+
+return progress;
+}
+
 static bool nvme_poll_queues(BDRVNVMeState *s)
 {
 bool progress = false;
 int i;
 
 for (i = 0; i < s->nr_queues; i++) {
-NVMeQueuePair *q = s->queues[i];
-const size_t cqe_offset = q->cq.head * NVME_CQ_ENTRY_BYTES;
-NvmeCqe *cqe = (NvmeCqe *)&q->cq.queue[cqe_offset];
-
-/*
- * Do an early check for completions. q->lock isn't needed because
- * nvme_process_completion() only runs in the event loop thread and
- * cannot race with itself.
- */
-if ((le16_to_cpu(cqe->status) & 0x1) == q->cq_phase) {
-continue;
-}
-
-qemu_mutex_lock(&q->lock);
-while (nvme_process_completion(q)) {
-/* Keep polling */
+if (nvme_poll_queue(s->queues[i])) {
 progress = true;
 }
-qemu_mutex_unlock(&q->lock);
 }
 return progress;
 }
-- 
2.26.2




[PATCH v5 12/15] block/nvme: Replace BDRV_POLL_WHILE by AIO_WAIT_WHILE

2020-08-20 Thread Philippe Mathieu-Daudé
BDRV_POLL_WHILE() is defined as:

  #define BDRV_POLL_WHILE(bs, cond) ({  \
  BlockDriverState *bs_ = (bs); \
  AIO_WAIT_WHILE(bdrv_get_aio_context(bs_), \
 cond); })

As we will remove the BlockDriverState use in the next commit,
start by using the exploded version of BDRV_POLL_WHILE().

Reviewed-by: Stefan Hajnoczi 
Signed-off-by: Philippe Mathieu-Daudé 
---
 block/nvme.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/block/nvme.c b/block/nvme.c
index 5b69fc75a60..456fe61f5ea 100644
--- a/block/nvme.c
+++ b/block/nvme.c
@@ -493,6 +493,7 @@ static void nvme_cmd_sync_cb(void *opaque, int ret)
 static int nvme_cmd_sync(BlockDriverState *bs, NVMeQueuePair *q,
  NvmeCmd *cmd)
 {
+AioContext *aio_context = bdrv_get_aio_context(bs);
 NVMeRequest *req;
 int ret = -EINPROGRESS;
 req = nvme_get_free_req(q);
@@ -501,7 +502,7 @@ static int nvme_cmd_sync(BlockDriverState *bs, 
NVMeQueuePair *q,
 }
 nvme_submit_command(q, req, cmd, nvme_cmd_sync_cb, &ret);
 
-BDRV_POLL_WHILE(bs, ret == -EINPROGRESS);
+AIO_WAIT_WHILE(aio_context, ret == -EINPROGRESS);
 return ret;
 }
 
-- 
2.26.2




[PATCH v5 04/15] block/nvme: Define INDEX macros to ease code review

2020-08-20 Thread Philippe Mathieu-Daudé
Use definitions instead of '0' or '1' indexes. Also, this will
be useful when using multiple queues later.

Reviewed-by: Stefan Hajnoczi 
Signed-off-by: Philippe Mathieu-Daudé 
---
 block/nvme.c | 33 +++--
 1 file changed, 19 insertions(+), 14 deletions(-)

diff --git a/block/nvme.c b/block/nvme.c
index e1893b4e792..003809fbd83 100644
--- a/block/nvme.c
+++ b/block/nvme.c
@@ -103,6 +103,9 @@ typedef volatile struct {
 
 QEMU_BUILD_BUG_ON(offsetof(NVMeRegs, doorbells) != 0x1000);
 
+#define INDEX_ADMIN 0
+#define INDEX_IO(n) (1 + n)
+
 struct BDRVNVMeState {
 AioContext *aio_context;
 QEMUVFIOState *vfio;
@@ -531,7 +534,7 @@ static void nvme_identify(BlockDriverState *bs, int 
namespace, Error **errp)
 }
 cmd.prp1 = cpu_to_le64(iova);
 
-if (nvme_cmd_sync(bs, s->queues[0], &cmd)) {
+if (nvme_cmd_sync(bs, s->queues[INDEX_ADMIN], &cmd)) {
 error_setg(errp, "Failed to identify controller");
 goto out;
 }
@@ -555,7 +558,7 @@ static void nvme_identify(BlockDriverState *bs, int 
namespace, Error **errp)
 
 cmd.cdw10 = 0;
 cmd.nsid = cpu_to_le32(namespace);
-if (nvme_cmd_sync(bs, s->queues[0], &cmd)) {
+if (nvme_cmd_sync(bs, s->queues[INDEX_ADMIN], &cmd)) {
 error_setg(errp, "Failed to identify namespace");
 goto out;
 }
@@ -644,7 +647,7 @@ static bool nvme_add_io_queue(BlockDriverState *bs, Error 
**errp)
 .cdw10 = cpu_to_le32(((queue_size - 1) << 16) | (n & 0xFFFF)),
 .cdw11 = cpu_to_le32(0x3),
 };
-if (nvme_cmd_sync(bs, s->queues[0], &cmd)) {
+if (nvme_cmd_sync(bs, s->queues[INDEX_ADMIN], &cmd)) {
 error_setg(errp, "Failed to create io queue [%d]", n);
 nvme_free_queue_pair(q);
 return false;
@@ -655,7 +658,7 @@ static bool nvme_add_io_queue(BlockDriverState *bs, Error 
**errp)
 .cdw10 = cpu_to_le32(((queue_size - 1) << 16) | (n & 0xFFFF)),
 .cdw11 = cpu_to_le32(0x1 | (n << 16)),
 };
-if (nvme_cmd_sync(bs, s->queues[0], &cmd)) {
+if (nvme_cmd_sync(bs, s->queues[INDEX_ADMIN], &cmd)) {
 error_setg(errp, "Failed to create io queue [%d]", n);
 nvme_free_queue_pair(q);
 return false;
@@ -739,16 +742,18 @@ static int nvme_init(BlockDriverState *bs, const char 
*device, int namespace,
 
 /* Set up admin queue. */
 s->queues = g_new(NVMeQueuePair *, 1);
-s->queues[0] = nvme_create_queue_pair(bs, 0, NVME_QUEUE_SIZE, errp);
-if (!s->queues[0]) {
+s->queues[INDEX_ADMIN] = nvme_create_queue_pair(bs, 0,
+  NVME_QUEUE_SIZE,
+  errp);
+if (!s->queues[INDEX_ADMIN]) {
 ret = -EINVAL;
 goto out;
 }
 s->nr_queues = 1;
 QEMU_BUILD_BUG_ON(NVME_QUEUE_SIZE & 0xF000);
 s->regs->aqa = cpu_to_le32((NVME_QUEUE_SIZE << 16) | NVME_QUEUE_SIZE);
-s->regs->asq = cpu_to_le64(s->queues[0]->sq.iova);
-s->regs->acq = cpu_to_le64(s->queues[0]->cq.iova);
+s->regs->asq = cpu_to_le64(s->queues[INDEX_ADMIN]->sq.iova);
+s->regs->acq = cpu_to_le64(s->queues[INDEX_ADMIN]->cq.iova);
 
 /* After setting up all control registers we can enable device now. */
 s->regs->cc = cpu_to_le32((ctz32(NVME_CQ_ENTRY_BYTES) << 20) |
@@ -839,7 +844,7 @@ static int nvme_enable_disable_write_cache(BlockDriverState 
*bs, bool enable,
 .cdw11 = cpu_to_le32(enable ? 0x01 : 0x00),
 };
 
-ret = nvme_cmd_sync(bs, s->queues[0], &cmd);
+ret = nvme_cmd_sync(bs, s->queues[INDEX_ADMIN], &cmd);
 if (ret) {
 error_setg(errp, "Failed to configure NVMe write cache");
 }
@@ -1056,7 +1061,7 @@ static coroutine_fn int 
nvme_co_prw_aligned(BlockDriverState *bs,
 {
 int r;
 BDRVNVMeState *s = bs->opaque;
-NVMeQueuePair *ioq = s->queues[1];
+NVMeQueuePair *ioq = s->queues[INDEX_IO(0)];
 NVMeRequest *req;
 
 uint32_t cdw12 = (((bytes >> s->blkshift) - 1) & 0xFFFF) |
@@ -1171,7 +1176,7 @@ static coroutine_fn int nvme_co_pwritev(BlockDriverState 
*bs,
 static coroutine_fn int nvme_co_flush(BlockDriverState *bs)
 {
 BDRVNVMeState *s = bs->opaque;
-NVMeQueuePair *ioq = s->queues[1];
+NVMeQueuePair *ioq = s->queues[INDEX_IO(0)];
 NVMeRequest *req;
 NvmeCmd cmd = {
 .opcode = NVME_CMD_FLUSH,
@@ -1202,7 +1207,7 @@ static coroutine_fn int 
nvme_co_pwrite_zeroes(BlockDriverState *bs,
   BdrvRequestFlags flags)
 {
 BDRVNVMeState *s = bs->opaque;
-NVMeQueuePair *ioq = s->queues[1];
+NVMeQueuePair *ioq = s->queues[INDEX_IO(0)];
 NVMeRequest *req;
 
 uint32_t cdw12 = ((bytes >> s->blkshift) - 1) & 0xFFFF;
@@ -1255,7 +1260,7 @@ static int coroutine_fn nvme_co_pdiscard(BlockDriverState 
*bs,
  int bytes)
 {
 BDRVNVMeState *s = bs->opaque;
-NVMeQueuePair *ioq = s->queues[1];
+NVMeQueuePair *ioq = s->queues[INDEX_IO(0)];
 NVMeRequest *req;
 

[PATCH v5 08/15] block/nvme: Use union of NvmeIdCtrl / NvmeIdNs structures

2020-08-20 Thread Philippe Mathieu-Daudé
We allocate a unique chunk of memory, then use it for two
different structures. By using a union, we make it clear that
the data is overlapping (and we can remove the casts).

Suggested-by: Stefan Hajnoczi 
Reviewed-by: Stefan Hajnoczi 
Signed-off-by: Philippe Mathieu-Daudé 
---
 block/nvme.c | 31 +++
 1 file changed, 15 insertions(+), 16 deletions(-)

diff --git a/block/nvme.c b/block/nvme.c
index 99822d9fd36..2bd1935f951 100644
--- a/block/nvme.c
+++ b/block/nvme.c
@@ -508,9 +508,10 @@ static int nvme_cmd_sync(BlockDriverState *bs, 
NVMeQueuePair *q,
 static void nvme_identify(BlockDriverState *bs, int namespace, Error **errp)
 {
 BDRVNVMeState *s = bs->opaque;
-NvmeIdCtrl *idctrl;
-NvmeIdNs *idns;
-uint8_t *id;
+union {
+NvmeIdCtrl ctrl;
+NvmeIdNs ns;
+} *id;
 NvmeLBAF *lbaf;
 uint16_t oncs;
 int r;
@@ -520,14 +521,12 @@ static void nvme_identify(BlockDriverState *bs, int 
namespace, Error **errp)
 .cdw10 = cpu_to_le32(0x1),
 };
 
-id = qemu_try_blockalign0(bs, sizeof(NvmeIdCtrl));
+id = qemu_try_blockalign0(bs, sizeof(*id));
 if (!id) {
 error_setg(errp, "Cannot allocate buffer for identify response");
 goto out;
 }
-idctrl = (NvmeIdCtrl *)id;
-idns = (NvmeIdNs *)id;
-r = qemu_vfio_dma_map(s->vfio, id, sizeof(NvmeIdCtrl), true, &iova);
+r = qemu_vfio_dma_map(s->vfio, id, sizeof(*id), true, &iova);
 if (r) {
 error_setg(errp, "Cannot map buffer for DMA");
 goto out;
@@ -539,22 +538,22 @@ static void nvme_identify(BlockDriverState *bs, int 
namespace, Error **errp)
 goto out;
 }
 
-if (le32_to_cpu(idctrl->nn) < namespace) {
+if (le32_to_cpu(id->ctrl.nn) < namespace) {
 error_setg(errp, "Invalid namespace");
 goto out;
 }
-s->write_cache_supported = le32_to_cpu(idctrl->vwc) & 0x1;
-s->max_transfer = (idctrl->mdts ? 1 << idctrl->mdts : 0) * s->page_size;
+s->write_cache_supported = le32_to_cpu(id->ctrl.vwc) & 0x1;
+s->max_transfer = (id->ctrl.mdts ? 1 << id->ctrl.mdts : 0) * s->page_size;
 /* For now the page list buffer per command is one page, to hold at most
  * s->page_size / sizeof(uint64_t) entries. */
 s->max_transfer = MIN_NON_ZERO(s->max_transfer,
   s->page_size / sizeof(uint64_t) * s->page_size);
 
-oncs = le16_to_cpu(idctrl->oncs);
+oncs = le16_to_cpu(id->ctrl.oncs);
 s->supports_write_zeroes = !!(oncs & NVME_ONCS_WRITE_ZEROS);
 s->supports_discard = !!(oncs & NVME_ONCS_DSM);
 
-memset(id, 0, 4096);
+memset(id, 0, sizeof(*id));
 cmd.cdw10 = 0;
 cmd.nsid = cpu_to_le32(namespace);
 if (nvme_cmd_sync(bs, s->queues[INDEX_ADMIN], )) {
@@ -562,11 +561,11 @@ static void nvme_identify(BlockDriverState *bs, int 
namespace, Error **errp)
 goto out;
 }
 
-s->nsze = le64_to_cpu(idns->nsze);
-lbaf = &idns->lbaf[NVME_ID_NS_FLBAS_INDEX(idns->flbas)];
+s->nsze = le64_to_cpu(id->ns.nsze);
+lbaf = &id->ns.lbaf[NVME_ID_NS_FLBAS_INDEX(id->ns.flbas)];
 
-if (NVME_ID_NS_DLFEAT_WRITE_ZEROES(idns->dlfeat) &&
-NVME_ID_NS_DLFEAT_READ_BEHAVIOR(idns->dlfeat) ==
+if (NVME_ID_NS_DLFEAT_WRITE_ZEROES(id->ns.dlfeat) &&
+NVME_ID_NS_DLFEAT_READ_BEHAVIOR(id->ns.dlfeat) ==
 NVME_ID_NS_DLFEAT_READ_BEHAVIOR_ZEROES) {
 bs->supported_write_flags |= BDRV_REQ_MAY_UNMAP;
 }
-- 
2.26.2




[PATCH v5 02/15] block/nvme: Avoid further processing if trace event not enabled

2020-08-20 Thread Philippe Mathieu-Daudé
Avoid further processing if TRACE_NVME_SUBMIT_COMMAND_RAW is
not enabled. This is an untested attempt at performance optimization.

Reviewed-by: Stefan Hajnoczi 
Signed-off-by: Philippe Mathieu-Daudé 
---
 block/nvme.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/block/nvme.c b/block/nvme.c
index 2f5e3c2adfa..8c30a5fee28 100644
--- a/block/nvme.c
+++ b/block/nvme.c
@@ -441,6 +441,9 @@ static void nvme_trace_command(const NvmeCmd *cmd)
 {
 int i;
 
+if (!trace_event_get_state_backends(TRACE_NVME_SUBMIT_COMMAND_RAW)) {
+return;
+}
 for (i = 0; i < 8; ++i) {
 uint8_t *cmdp = (uint8_t *)cmd + i * 8;
 trace_nvme_submit_command_raw(cmdp[0], cmdp[1], cmdp[2], cmdp[3],
-- 
2.26.2




[PATCH v5 01/15] block/nvme: Replace magic value by SCALE_MS definition

2020-08-20 Thread Philippe Mathieu-Daudé
Use self-explicit SCALE_MS definition instead of magic value.

Reviewed-by: Stefan Hajnoczi 
Signed-off-by: Philippe Mathieu-Daudé 
---
 block/nvme.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/block/nvme.c b/block/nvme.c
index 374e2689157..2f5e3c2adfa 100644
--- a/block/nvme.c
+++ b/block/nvme.c
@@ -715,7 +715,7 @@ static int nvme_init(BlockDriverState *bs, const char 
*device, int namespace,
 /* Reset device to get a clean state. */
 s->regs->cc = cpu_to_le32(le32_to_cpu(s->regs->cc) & 0xFE);
 /* Wait for CSTS.RDY = 0. */
-deadline = qemu_clock_get_ns(QEMU_CLOCK_REALTIME) + timeout_ms * 1000000ULL;
+deadline = qemu_clock_get_ns(QEMU_CLOCK_REALTIME) + timeout_ms * SCALE_MS;
 while (le32_to_cpu(s->regs->csts) & 0x1) {
 if (qemu_clock_get_ns(QEMU_CLOCK_REALTIME) > deadline) {
 error_setg(errp, "Timeout while waiting for device to reset (%"
-- 
2.26.2




[PATCH v5 03/15] block/nvme: Let nvme_create_queue_pair() fail gracefully

2020-08-20 Thread Philippe Mathieu-Daudé
As nvme_create_queue_pair() is allowed to fail, replace the
alloc() calls by try_alloc() to avoid aborting QEMU.

Reviewed-by: Stefan Hajnoczi 
Signed-off-by: Philippe Mathieu-Daudé 
---
 block/nvme.c | 12 ++--
 1 file changed, 10 insertions(+), 2 deletions(-)

diff --git a/block/nvme.c b/block/nvme.c
index 8c30a5fee28..e1893b4e792 100644
--- a/block/nvme.c
+++ b/block/nvme.c
@@ -213,14 +213,22 @@ static NVMeQueuePair 
*nvme_create_queue_pair(BlockDriverState *bs,
 int i, r;
 BDRVNVMeState *s = bs->opaque;
 Error *local_err = NULL;
-NVMeQueuePair *q = g_new0(NVMeQueuePair, 1);
+NVMeQueuePair *q;
 uint64_t prp_list_iova;
 
+q = g_try_new0(NVMeQueuePair, 1);
+if (!q) {
+return NULL;
+}
+q->prp_list_pages = qemu_try_blockalign0(bs,
+  s->page_size * NVME_QUEUE_SIZE);
+if (!q->prp_list_pages) {
+goto fail;
+}
 qemu_mutex_init(&q->lock);
 q->s = s;
 q->index = idx;
 qemu_co_queue_init(&q->free_req_queue);
-q->prp_list_pages = qemu_blockalign0(bs, s->page_size * NVME_NUM_REQS);
 q->completion_bh = aio_bh_new(bdrv_get_aio_context(bs),
   nvme_process_completion_bh, q);
 r = qemu_vfio_dma_map(s->vfio, q->prp_list_pages,
-- 
2.26.2




[PATCH v5 00/15] block/nvme: Various cleanups required to use multiple queues

2020-08-20 Thread Philippe Mathieu-Daudé
Hi Kevin,

This series is mostly code rearrangement (cleanups) to be
able to split the hardware code from the block driver code,
to be able to use multiple queues on the same hardware, or
multiple block drivers on the same hardware.

All this series is reviewed.

Since v4:
- added 'block/nvme: Use an array of EventNotifier' patch

Since v3:
- renamed QUEUE_INDEX_{ADMIN/IO} -> INDEX{ADMIN/IO}
- added stefanha tags

Since v2:
- addressed stefanha review comments
- added 4 trivial patches (to simplify the last one)
- register IRQ notifier for each queuepair (admin and io)

Since v1:
- rebased
- use SCALE_MS definition
- added Stefan's R-b
- addressed Stefan's review comments
  - use union { NvmeIdCtrl / NvmeIdNs }
  - move irq_notifier to NVMeQueuePair
  - removed patches depending on "a traceable hardware state
object instead of BDRVNVMeState".

Phil.

Philippe Mathieu-Daudé (15):
  block/nvme: Replace magic value by SCALE_MS definition
  block/nvme: Avoid further processing if trace event not enabled
  block/nvme: Let nvme_create_queue_pair() fail gracefully
  block/nvme: Define INDEX macros to ease code review
  block/nvme: Improve error message when IO queue creation failed
  block/nvme: Use common error path in nvme_add_io_queue()
  block/nvme: Rename local variable
  block/nvme: Use union of NvmeIdCtrl / NvmeIdNs structures
  block/nvme: Replace qemu_try_blockalign0 by qemu_try_blockalign/memset
  block/nvme: Replace qemu_try_blockalign(bs) by
qemu_try_memalign(pg_sz)
  block/nvme: Simplify nvme_init_queue() arguments
  block/nvme: Replace BDRV_POLL_WHILE by AIO_WAIT_WHILE
  block/nvme: Simplify nvme_create_queue_pair() arguments
  block/nvme: Extract nvme_poll_queue()
  block/nvme: Use an array of EventNotifier

 block/nvme.c | 211 ++-
 1 file changed, 125 insertions(+), 86 deletions(-)

-- 
2.26.2




Re: [PATCH v5 3/3] iotests: Test node/bitmap aliases during migration

2020-08-20 Thread Vladimir Sementsov-Ogievskiy

20.08.2020 18:07, Max Reitz wrote:

Signed-off-by: Max Reitz


Sorry, I missed v5 and answered on v4. Still, the only change is
s/write/discard/ (a good change, which I'm for), so please refer to my answer
on v4 for other comments.

--
Best regards,
Vladimir



Re: [PATCH v5 1/3] migration: Add block-bitmap-mapping parameter

2020-08-20 Thread Vladimir Sementsov-Ogievskiy

20.08.2020 18:07, Max Reitz wrote:

This migration parameter allows mapping block node names and bitmap
names to aliases for the purpose of block dirty bitmap migration.

This way, management tools can use different node and bitmap names on
the source and destination and pass the mapping of how bitmaps are to be
transferred to qemu (on the source, the destination, or even both with
arbitrary aliases in the migration stream).

While touching this code, fix a bug where bitmap names longer than 255
bytes would fail an assertion in qemu_put_counted_string().

Suggested-by: Vladimir Sementsov-Ogievskiy
Signed-off-by: Max Reitz



Reviewed-by: Vladimir Sementsov-Ogievskiy 

--
Best regards,
Vladimir



Re: [PATCH v4 4/4] iotests: Test node/bitmap aliases during migration

2020-08-20 Thread Vladimir Sementsov-Ogievskiy

# MYPYPATH=../../python/ mypy 300
300:154: error: Item "None" of "Optional[Match[Any]]" has no attribute "group"
Found 1 error in 1 file (checked 1 source file)

- the only complaint. I suggest a fix:

diff --git a/tests/qemu-iotests/300 b/tests/qemu-iotests/300
index c6d86b1dbc..0241903743 100755
--- a/tests/qemu-iotests/300
+++ b/tests/qemu-iotests/300
@@ -148,11 +148,11 @@ class TestDirtyBitmapMigration(iotests.QMPTestCase):
 result = vm.qmp('human-monitor-command',
 command_line='info migrate_parameters')
 
-hmp_mapping = re.search(r'^block-bitmap-mapping:\r?(\n  .*)*\n',

-result['return'], flags=re.MULTILINE)
+m = re.search(r'^block-bitmap-mapping:\r?(\n  .*)*\n',
+  result['return'], flags=re.MULTILINE)
+hmp_mapping = m.group(0).replace('\r', '') if m else None
 
-self.assertEqual(hmp_mapping.group(0).replace('\r', ''),

- self.to_hmp_mapping(mapping))
+self.assertEqual(hmp_mapping, self.to_hmp_mapping(mapping))
 else:
 self.assert_qmp(result, 'error/desc', error)
 
===


# flake8 300
300:24:1: F401 'time' imported but unused
300:34:1: E302 expected 2 blank lines, found 1
300:42:67: E502 the backslash is redundant between brackets
300:47:67: E502 the backslash is redundant between brackets
300:67:53: E502 the backslash is redundant between brackets
300:122:9: E125 continuation line with same indent as next logical line
300:134:9: E125 continuation line with same indent as next logical line
300:185:52: E502 the backslash is redundant between brackets
300:186:72: E502 the backslash is redundant between brackets
300:285:77: E502 the backslash is redundant between brackets
300:305:77: E502 the backslash is redundant between brackets
300:306:78: E502 the backslash is redundant between brackets
300:330:77: E502 the backslash is redundant between brackets
300:350:77: E502 the backslash is redundant between brackets
300:385:57: E502 the backslash is redundant between brackets
300:386:59: E502 the backslash is redundant between brackets
300:387:67: E502 the backslash is redundant between brackets
300:412:78: E502 the backslash is redundant between brackets
300:425:78: E502 the backslash is redundant between brackets
300:435:78: E502 the backslash is redundant between brackets
300:436:76: E502 the backslash is redundant between brackets
300:451:66: E502 the backslash is redundant between brackets
300:473:78: E502 the backslash is redundant between brackets
300:474:79: E502 the backslash is redundant between brackets
300:488:78: E502 the backslash is redundant between brackets
300:489:77: E502 the backslash is redundant between brackets

- I post it just because the ALE plugin in vim highlights all these things for
me. Up to you; I don't ask you to fix it.

18.08.2020 16:32, Max Reitz wrote:

Signed-off-by: Max Reitz 
---
  tests/qemu-iotests/300 | 595 +
  tests/qemu-iotests/300.out |   5 +
  tests/qemu-iotests/group   |   1 +
  3 files changed, 601 insertions(+)
  create mode 100755 tests/qemu-iotests/300
  create mode 100644 tests/qemu-iotests/300.out



[..]


+def test_alias_on_both_migration(self) -> None:
+src_map = self.mapping(self.src_node_name, 'node-alias',
+   self.src_bmap_name, 'bmap-alias')
+
+dst_map = self.mapping(self.dst_node_name, 'node-alias',
+   self.dst_bmap_name, 'bmap-alias')
+
+self.set_mapping(self.vm_a, src_map)
+self.set_mapping(self.vm_b, dst_map)
+self.migrate()
+self.verify_dest_error(None)


Hmm, probably verify_dest_error() should be called from migrate(), as you
(almost) always call it right after migrate()


+
+


[..]


+def test_unused_mapping_on_dst(self) -> None:
+# Let the source not send any bitmaps
+self.set_mapping(self.vm_a, [])
+
+# Establish some mapping on the destination
+self.set_mapping(self.vm_b, [])


It seems you wanted to specify a non-empty mapping for vm_b, yes?
With any non-empty mapping here, just to better correspond to the comments:
Reviewed-by: Vladimir Sementsov-Ogievskiy 

(or keep this case with both mappings empty, and add a similar one with an
empty mapping only on src)
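
Something like this (untested sketch, reusing the mapping() helper from this
test; the alias names are only examples) would match the comment:

    # Establish some mapping on the destination
    dst_map = self.mapping(self.dst_node_name, 'node-alias-b',
                           self.dst_bmap_name, 'bmap-alias-b')
    self.set_mapping(self.vm_b, dst_map)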



--
Best regards,
Vladimir



Re: [RFC PATCH 07/22] block/export: Remove magic from block-export-add

2020-08-20 Thread Peter Krempa
On Thu, Aug 20, 2020 at 09:41:14 -0500, Eric Blake wrote:
> On 8/20/20 6:05 AM, Kevin Wolf wrote:
> 
> > As long as we can keep the compatibility code local to qmp_nbd_*(), I
> > don't think it's too bad. In particular because it's already written.
> > 
> > Instead of adjusting libvirt to changes in the nbd-* commands, I'd
> > rather have it change over to block-export-*. I would like to see the
> > nbd-server-add/remove commands deprecated soon after we have the
> > replacements.
> 
> Makes sense to me. So deprecate nbd-server-add in 5.2, and require
> block-export in 6.1.
> 
> 
> > > > +/*
> > > > + * nbd-server-add doesn't complain when a read-only device should 
> > > > be
> > > > + * exported as writable, but simply downgrades it. This is an 
> > > > error with
> > > > + * block-export-add.
> > > 
> > > I'd be happy with either marking this deprecated now (and fixing it in two
> > > releases), or declaring it a bug in nbd-server-add now (and fixing it
> > > outright).
> > 
> > How about deprecating nbd-server-add completely?
> 
> Works for me. Keeping the warts backwards-compatible in nbd-server-add is
> more palatable if we know we are going to drop it wholesale down the road.
> 
> > > > +/*
> > > > + * nbd-server-add removes the export when the named BlockBackend 
> > > > used for
> > > > + * @device goes away.
> > > > + */
> > > > +on_eject_blk = blk_by_name(arg->device);
> > > > +if (on_eject_blk) {
> > > > +nbd_export_set_on_eject_blk(export, on_eject_blk);
> > > > +}
> > > 
> > > Wait - is the magic export removal tied only to exporting a drive name, 
> > > and
> > > not a node name?  So as long as libvirt is using only node names whwen
> > > adding exports, a drive being unplugged won't interfere?
> > 
> > Yes, seems so. It's the existing behaviour, I'm only moving the code
> > around.
> > 
> > > Overall, the change makes sense to me, although I'd love to see if we 
> > > could
> > > go further on the writable vs. read-only issue.
> > 
> > If nbd-server-add will be going away relatively soon, it's probably not
> > worth the trouble. But if you have reasons to keep it, maybe we should
> > consider it.
> 
> No, I'm fine with the idea of getting rid of nbd-server-add, at which point
> changing it before removal is pointless.

I agree that this is a better approach. While it's technically possible
to retrofit the old commands, since QMP schema introspection allows granular
detection of what's happening in this regard, using a new command is IMO
better. Specifically for apps which might not use QMP introspection to
the extent libvirt does.




[PATCH v5 2/3] iotests.py: Let wait_migration() return on failure

2020-08-20 Thread Max Reitz
Let wait_migration() return on failure (with the return value indicating
whether the migration was completed or has failed), so we can use it for
migrations that are expected to fail, too.

Signed-off-by: Max Reitz 
Reviewed-by: Eric Blake 
Reviewed-by: Vladimir Sementsov-Ogievskiy 
---
 tests/qemu-iotests/iotests.py | 18 --
 1 file changed, 12 insertions(+), 6 deletions(-)

diff --git a/tests/qemu-iotests/iotests.py b/tests/qemu-iotests/iotests.py
index 717b5b652c..e197c73ca5 100644
--- a/tests/qemu-iotests/iotests.py
+++ b/tests/qemu-iotests/iotests.py
@@ -729,16 +729,22 @@ class VM(qtest.QEMUQtestMachine):
 }
 ]))
 
-def wait_migration(self, expect_runstate):
+def wait_migration(self, expect_runstate: Optional[str]) -> bool:
 while True:
 event = self.event_wait('MIGRATION')
 log(event, filters=[filter_qmp_event])
-if event['data']['status'] == 'completed':
+if event['data']['status'] in ('completed', 'failed'):
 break
-# The event may occur in finish-migrate, so wait for the expected
-# post-migration runstate
-while self.qmp('query-status')['return']['status'] != expect_runstate:
-pass
+
+if event['data']['status'] == 'completed':
+# The event may occur in finish-migrate, so wait for the expected
+# post-migration runstate
+runstate = None
+while runstate != expect_runstate:
+runstate = self.qmp('query-status')['return']['status']
+return True
+else:
+return False
 
 def node_info(self, node_name):
 nodes = self.qmp('query-named-block-nodes')
-- 
2.26.2




[PATCH v5 3/3] iotests: Test node/bitmap aliases during migration

2020-08-20 Thread Max Reitz
Signed-off-by: Max Reitz 
---
 tests/qemu-iotests/300 | 595 +
 tests/qemu-iotests/300.out |   5 +
 tests/qemu-iotests/group   |   1 +
 3 files changed, 601 insertions(+)
 create mode 100755 tests/qemu-iotests/300
 create mode 100644 tests/qemu-iotests/300.out

diff --git a/tests/qemu-iotests/300 b/tests/qemu-iotests/300
new file mode 100755
index 00..face67bf7f
--- /dev/null
+++ b/tests/qemu-iotests/300
@@ -0,0 +1,595 @@
+#!/usr/bin/env python3
+#
+# Copyright (C) 2020 Red Hat, Inc.
+#
+# Tests for dirty bitmaps migration with node aliases
+#
+# This program is free software; you can redistribute it and/or modify
+# it under the terms of the GNU General Public License as published by
+# the Free Software Foundation; either version 2 of the License, or
+# (at your option) any later version.
+#
+# This program is distributed in the hope that it will be useful,
+# but WITHOUT ANY WARRANTY; without even the implied warranty of
+# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+# GNU General Public License for more details.
+#
+# You should have received a copy of the GNU General Public License
+# along with this program.  If not, see .
+#
+
+import os
+import random
+import re
+import time
+from typing import Dict, List, Optional, Union
+import iotests
+import qemu
+
+BlockBitmapMapping = List[Dict[str, Union[str, List[Dict[str, str]
+
+assert iotests.sock_dir is not None
+mig_sock = os.path.join(iotests.sock_dir, 'mig_sock')
+
+class TestDirtyBitmapMigration(iotests.QMPTestCase):
+src_node_name: str = ''
+dst_node_name: str = ''
+src_bmap_name: str = ''
+dst_bmap_name: str = ''
+
+def setUp(self) -> None:
+self.vm_a = iotests.VM(path_suffix='-a')
+self.vm_a.add_blockdev(f'node-name={self.src_node_name},' \
+   'driver=null-co')
+self.vm_a.launch()
+
+self.vm_b = iotests.VM(path_suffix='-b')
+self.vm_b.add_blockdev(f'node-name={self.dst_node_name},' \
+   'driver=null-co')
+self.vm_b.add_incoming(f'unix:{mig_sock}')
+self.vm_b.launch()
+
+result = self.vm_a.qmp('block-dirty-bitmap-add',
+   node=self.src_node_name,
+   name=self.src_bmap_name)
+self.assert_qmp(result, 'return', {})
+
+# Dirty some random megabytes
+for _ in range(9):
+mb_ofs = random.randrange(1024)
+self.vm_a.hmp_qemu_io(self.src_node_name, f'discard {mb_ofs}M 1M')
+
+result = self.vm_a.qmp('x-debug-block-dirty-bitmap-sha256',
+   node=self.src_node_name,
+   name=self.src_bmap_name)
+self.bitmap_hash_reference = result['return']['sha256']
+
+caps = [{'capability': name, 'state': True} \
+for name in ('dirty-bitmaps', 'events')]
+
+for vm in (self.vm_a, self.vm_b):
+result = vm.qmp('migrate-set-capabilities', capabilities=caps)
+self.assert_qmp(result, 'return', {})
+
+def tearDown(self) -> None:
+self.vm_a.shutdown()
+self.vm_b.shutdown()
+try:
+os.remove(mig_sock)
+except OSError:
+pass
+
+def check_bitmap(self, bitmap_name_valid: bool) -> None:
+result = self.vm_b.qmp('x-debug-block-dirty-bitmap-sha256',
+   node=self.dst_node_name,
+   name=self.dst_bmap_name)
+if bitmap_name_valid:
+self.assert_qmp(result, 'return/sha256',
+self.bitmap_hash_reference)
+else:
+self.assert_qmp(result, 'error/desc',
+f"Dirty bitmap '{self.dst_bmap_name}' not found")
+
+def migrate(self, bitmap_name_valid: bool = True,
+migration_success: bool = True) -> None:
+result = self.vm_a.qmp('migrate', uri=f'unix:{mig_sock}')
+self.assert_qmp(result, 'return', {})
+
+with iotests.Timeout(5, 'Timeout waiting for migration to complete'):
+self.assertEqual(self.vm_a.wait_migration('postmigrate'),
+ migration_success)
+self.assertEqual(self.vm_b.wait_migration('running'),
+ migration_success)
+
+if migration_success:
+self.check_bitmap(bitmap_name_valid)
+
+def verify_dest_error(self, msg: Optional[str]) -> None:
+"""
+Check whether the given error message is present in vm_b's log.
+(vm_b is shut down to do so.)
+If @msg is None, check that there has not been any error.
+"""
+self.vm_b.shutdown()
+if msg is None:
+self.assertNotIn('qemu-system-', self.vm_b.get_log())
+else:
+self.assertIn(msg, self.vm_b.get_log())
+
+@staticmethod
+def 

[PATCH v5 0/3] migration: Add block-bitmap-mapping parameter

2020-08-20 Thread Max Reitz
RFC v1: https://lists.nongnu.org/archive/html/qemu-block/2020-05/msg00912.html
RFC v2: https://lists.nongnu.org/archive/html/qemu-block/2020-05/msg00915.html
v1: https://lists.nongnu.org/archive/html/qemu-devel/2020-06/msg09792.html
v2: https://lists.nongnu.org/archive/html/qemu-block/2020-07/msg01179.html
v3: https://lists.nongnu.org/archive/html/qemu-block/2020-07/msg01385.html
v4: https://lists.nongnu.org/archive/html/qemu-block/2020-08/msg00566.html

Branch: https://github.com/XanClic/qemu.git migration-bitmap-mapping-v5
Branch: https://git.xanclic.moe/XanClic/qemu.git migration-bitmap-mapping-v5


Hi,

This new migration parameter allows mapping block node names and bitmap
names to aliases for the purpose of block dirty bitmap migration.

This way, management tools can use different node names on the source
and destination and pass the mapping of how bitmaps are to be
transferred to qemu (on the source, the destination, or even both with
arbitrary aliases in the migration stream).
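
For illustration, this is roughly how the source side could be configured
from a test or management tool (iotests-style Python; the node, bitmap and
alias names below are made up):

def set_bitmap_mapping(vm) -> None:
    # Map the local node 'node0' and its bitmap 'bmap0' to neutral
    # aliases; the destination sets a mirror-image mapping from the same
    # aliases to its own local names before 'migrate' is issued.
    result = vm.qmp('migrate-set-parameters',
                    block_bitmap_mapping=[{
                        'node-name': 'node0',
                        'alias': 'common-node-alias',
                        'bitmaps': [{
                            'name': 'bmap0',
                            'alias': 'common-bmap-alias',
                        }],
                    }])
    assert result == {'return': {}}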


v5:
- Patch 1:
  - QAPI documentation: On the destination, unmapped aliases are not
just reported and that’s it – encountering one will actually lead to
the whole rest of the bitmap migration data being discarded.

  - Use sizeof_field instead of hand-coding it

  - %s/255/UINT8_MAX/

  - s/Unable to read bitmap name string/
  Unable to read bitmap alias string/

  - Cancellation on unknown incoming bitmap aliases was handled
improperly; it should be handled properly now

- Old patch 2: Dropped

- Patch 3 (was 4): Use discards instead of writes to dirty the bitmap


git-backport-diff against v4:

Key:
[----] : patches are identical
[####] : number of functional differences between upstream/downstream patch
[down] : patch is downstream-only
The flags [FC] indicate (F)unctional and (C)ontextual differences, respectively

001/3:[0040] [FC] 'migration: Add block-bitmap-mapping parameter'
002/3:[----] [--] 'iotests.py: Let wait_migration() return on failure'
003/3:[0002] [FC] 'iotests: Test node/bitmap aliases during migration'


Max Reitz (3):
  migration: Add block-bitmap-mapping parameter
  iotests.py: Let wait_migration() return on failure
  iotests: Test node/bitmap aliases during migration

 qapi/migration.json| 104 +-
 migration/migration.h  |   3 +
 migration/block-dirty-bitmap.c | 412 ---
 migration/migration.c  |  30 ++
 monitor/hmp-cmds.c |  30 ++
 tests/qemu-iotests/300 | 595 +
 tests/qemu-iotests/300.out |   5 +
 tests/qemu-iotests/group   |   1 +
 tests/qemu-iotests/iotests.py  |  18 +-
 9 files changed, 1135 insertions(+), 63 deletions(-)
 create mode 100755 tests/qemu-iotests/300
 create mode 100644 tests/qemu-iotests/300.out

-- 
2.26.2




[PATCH v5 1/3] migration: Add block-bitmap-mapping parameter

2020-08-20 Thread Max Reitz
This migration parameter allows mapping block node names and bitmap
names to aliases for the purpose of block dirty bitmap migration.

This way, management tools can use different node and bitmap names on
the source and destination and pass the mapping of how bitmaps are to be
transferred to qemu (on the source, the destination, or even both with
arbitrary aliases in the migration stream).

While touching this code, fix a bug where bitmap names longer than 255
bytes would fail an assertion in qemu_put_counted_string().

Suggested-by: Vladimir Sementsov-Ogievskiy 
Signed-off-by: Max Reitz 
---
 qapi/migration.json| 104 -
 migration/migration.h  |   3 +
 migration/block-dirty-bitmap.c | 412 -
 migration/migration.c  |  30 +++
 monitor/hmp-cmds.c |  30 +++
 5 files changed, 522 insertions(+), 57 deletions(-)

diff --git a/qapi/migration.json b/qapi/migration.json
index ea53b23dca..5f6b06172c 100644
--- a/qapi/migration.json
+++ b/qapi/migration.json
@@ -508,6 +508,44 @@
   'data': [ 'none', 'zlib',
 { 'name': 'zstd', 'if': 'defined(CONFIG_ZSTD)' } ] }
 
+##
+# @BitmapMigrationBitmapAlias:
+#
+# @name: The name of the bitmap.
+#
+# @alias: An alias name for migration (for example the bitmap name on
+# the opposite site).
+#
+# Since: 5.2
+##
+{ 'struct': 'BitmapMigrationBitmapAlias',
+  'data': {
+  'name': 'str',
+  'alias': 'str'
+  } }
+
+##
+# @BitmapMigrationNodeAlias:
+#
+# Maps a block node name and the bitmaps it has to aliases for dirty
+# bitmap migration.
+#
+# @node-name: A block node name.
+#
+# @alias: An alias block node name for migration (for example the
+# node name on the opposite site).
+#
+# @bitmaps: Mappings for the bitmaps on this node.
+#
+# Since: 5.2
+##
+{ 'struct': 'BitmapMigrationNodeAlias',
+  'data': {
+  'node-name': 'str',
+  'alias': 'str',
+  'bitmaps': [ 'BitmapMigrationBitmapAlias' ]
+  } }
+
 ##
 # @MigrationParameter:
 #
@@ -642,6 +680,25 @@
 #  will consume more CPU.
 #  Defaults to 1. (Since 5.0)
 #
+# @block-bitmap-mapping: Maps block nodes and bitmaps on them to
+#  aliases for the purpose of dirty bitmap migration.  Such
+#  aliases may for example be the corresponding names on the
+#  opposite site.
+#  The mapping must be one-to-one, but not necessarily
+#  complete: On the source, unmapped bitmaps and all bitmaps
+#  on unmapped nodes will be ignored.  On the destination,
+#  encountering an unmapped alias in the incoming migration
+#  stream will result in a report, and all further bitmap
+#  migration data will then be discarded.
+#  Note that the destination does not know about bitmaps it
+#  does not receive, so there is no limitation or requirement
+#  regarding the number of bitmaps received, or how they are
+#  named, or on which nodes they are placed.
+#  By default (when this parameter has never been set), bitmap
+#  names are mapped to themselves.  Nodes are mapped to their
+#  block device name if there is one, and to their node name
+#  otherwise. (Since 5.2)
+#
 # Since: 2.4
 ##
 { 'enum': 'MigrationParameter',
@@ -656,7 +713,8 @@
'multifd-channels',
'xbzrle-cache-size', 'max-postcopy-bandwidth',
'max-cpu-throttle', 'multifd-compression',
-   'multifd-zlib-level' ,'multifd-zstd-level' ] }
+   'multifd-zlib-level' ,'multifd-zstd-level',
+   'block-bitmap-mapping' ] }
 
 ##
 # @MigrateSetParameters:
@@ -782,6 +840,25 @@
 #  will consume more CPU.
 #  Defaults to 1. (Since 5.0)
 #
+# @block-bitmap-mapping: Maps block nodes and bitmaps on them to
+#  aliases for the purpose of dirty bitmap migration.  Such
+#  aliases may for example be the corresponding names on the
+#  opposite site.
+#  The mapping must be one-to-one, but not necessarily
+#  complete: On the source, unmapped bitmaps and all bitmaps
+#  on unmapped nodes will be ignored.  On the destination,
+#  encountering an unmapped alias in the incoming migration
+#  stream will result in a report, and all further bitmap
+#  migration data will then be discarded.
+#  Note that the destination does not know about bitmaps it
+#  does not receive, so there is no limitation or requirement
+#  regarding the number of bitmaps received, or how they are
+#  named, or on which nodes they are placed.
+#  By default (when this parameter has never been set), bitmap
+#  names are mapped to themselves.  Nodes are mapped to their
+#  block device name if there is one, and to their node name
+#  otherwise. (Since 5.2)
+#
 # Since: 2.4
 ##
 # TODO either fuse back into MigrationParameters, or make
@@ -812,7 +889,8 @@
 

Re: [RFC PATCH v4 1/5] block/nvme: Use an array of EventNotifier

2020-08-20 Thread Stefan Hajnoczi
On Wed, Aug 19, 2020 at 06:03:14PM +0200, Philippe Mathieu-Daudé wrote:
> In preparation of using multiple IRQ (thus multiple eventfds)
> make BDRVNVMeState::irq_notifier an array (for now of a single
> element, the admin queue notifier).
> 
> Signed-off-by: Philippe Mathieu-Daudé 
> ---
>  block/nvme.c | 31 +--
>  1 file changed, 21 insertions(+), 10 deletions(-)

Reviewed-by: Stefan Hajnoczi 




Re: [PATCH v4 2/4] iotests.py: Add wait_for_runstate()

2020-08-20 Thread Max Reitz
On 20.08.20 16:34, Vladimir Sementsov-Ogievskiy wrote:
> 20.08.2020 17:23, Dr. David Alan Gilbert wrote:
>> * Eric Blake (ebl...@redhat.com) wrote:
>>> On 8/18/20 8:32 AM, Max Reitz wrote:
 Signed-off-by: Max Reitz 
 ---
    tests/qemu-iotests/iotests.py | 4 
    1 file changed, 4 insertions(+)
>>>
>>> Reviewed-by: Eric Blake 
>>>

 diff --git a/tests/qemu-iotests/iotests.py
 b/tests/qemu-iotests/iotests.py
 index 717b5b652c..ee93cf22db 100644
 --- a/tests/qemu-iotests/iotests.py
 +++ b/tests/qemu-iotests/iotests.py
 @@ -833,6 +833,10 @@ class VM(qtest.QEMUQtestMachine):
   'Found node %s under %s (but expected %s)' % \
   (node['name'], path, expected_node)
 +    def wait_for_runstate(self, runstate: str) -> None:
 +    while self.qmp('query-status')['return']['status'] !=
 runstate:
 +    time.sleep(0.2)
>>>
>>> This looks like it could inf-loop if we have a bug where the status
>>> never
>>> changes as expected; but I guess CI bots have timeouts at higher
>>> layers that
>>> would detect if such a bug sneaks in.
>>
>> Although it might be useful to make sure when such a timeout lands, you
>> know which state you thought you were waiting for.
>>
>> Dave
>>
> 
> Timeout class is defined in iotests.py, so we can simply insert
> something like
> 
>  ... , timeout=60) -> None:
>   with Timeout(timeout, f"Timeout waiting for '{runstate}' runstate"):
>  ...

Actually, I’m not even using this function here in v4 anymore, so this
patch might as well be dropped.

Max





Re: [RFC PATCH 07/22] block/export: Remove magic from block-export-add

2020-08-20 Thread Eric Blake

On 8/20/20 6:05 AM, Kevin Wolf wrote:


As long as we can keep the compatibility code local to qmp_nbd_*(), I
don't think it's too bad. In particular because it's already written.

Instead of adjusting libvirt to changes in the nbd-* commands, I'd
rather have it change over to block-export-*. I would like to see the
nbd-server-add/remove commands deprecated soon after we have the
replacements.


Makes sense to me. So deprecate nbd-server-add in 5.2, and require 
block-export in 6.1.




+/*
+ * nbd-server-add doesn't complain when a read-only device should be
+ * exported as writable, but simply downgrades it. This is an error with
+ * block-export-add.


I'd be happy with either marking this deprecated now (and fixing it in two
releases), or declaring it a bug in nbd-server-add now (and fixing it
outright).


How about deprecating nbd-server-add completely?


Works for me. Keeping the warts backwards-compatible in nbd-server-add 
is more palatable if we know we are going to drop it wholesale down the 
road.



+/*
+ * nbd-server-add removes the export when the named BlockBackend used for
+ * @device goes away.
+ */
+on_eject_blk = blk_by_name(arg->device);
+if (on_eject_blk) {
+nbd_export_set_on_eject_blk(export, on_eject_blk);
+}


Wait - is the magic export removal tied only to exporting a drive name, and
not a node name?  So as long as libvirt is using only node names when
adding exports, a drive being unplugged won't interfere?


Yes, seems so. It's the existing behaviour, I'm only moving the code
around.
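
(In iotests-style terms the difference is roughly the following; the drive
and node names are made up:)

# Export tied to a BlockBackend (drive) name: blk_by_name() finds it, so
# the export is automatically removed when that drive goes away.
vm.qmp('nbd-server-add', device='drive0', writable=True)

# Export referencing a node name: blk_by_name() finds nothing, no eject
# hook is installed, and unplugging a drive does not remove the export.
vm.qmp('nbd-server-add', device='node0', writable=True)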


Overall, the change makes sense to me, although I'd love to see if we could
go further on the writable vs. read-only issue.


If nbd-server-add will be going away relatively soon, it's probably not
worth the trouble. But if you have reasons to keep it, maybe we should
consider it.


No, I'm fine with the idea of getting rid of nbd-server-add, at which 
point changing it before removal is pointless.



--
Eric Blake, Principal Software Engineer
Red Hat, Inc.   +1-919-301-3226
Virtualization:  qemu.org | libvirt.org




Re: [PATCH v4 2/4] iotests.py: Add wait_for_runstate()

2020-08-20 Thread Vladimir Sementsov-Ogievskiy

20.08.2020 17:23, Dr. David Alan Gilbert wrote:

* Eric Blake (ebl...@redhat.com) wrote:

On 8/18/20 8:32 AM, Max Reitz wrote:

Signed-off-by: Max Reitz 
---
   tests/qemu-iotests/iotests.py | 4 
   1 file changed, 4 insertions(+)


Reviewed-by: Eric Blake 



diff --git a/tests/qemu-iotests/iotests.py b/tests/qemu-iotests/iotests.py
index 717b5b652c..ee93cf22db 100644
--- a/tests/qemu-iotests/iotests.py
+++ b/tests/qemu-iotests/iotests.py
@@ -833,6 +833,10 @@ class VM(qtest.QEMUQtestMachine):
  'Found node %s under %s (but expected %s)' % \
  (node['name'], path, expected_node)
+def wait_for_runstate(self, runstate: str) -> None:
+while self.qmp('query-status')['return']['status'] != runstate:
+time.sleep(0.2)


This looks like it could inf-loop if we have a bug where the status never
changes as expected; but I guess CI bots have timeouts at higher layers that
would detect if such a bug sneaks in.


Although it might be useful to make sure when such a timeout lands, you
know which state you thought you were waiting for.

Dave



Timeout class is defined in iotests.py, so we can simply insert something like

 ... , timeout=60) -> None:
  with Timeout(timeout, f"Timeout waiting for '{runstate}' runstate"):
 ...
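
i.e. a full (untested) variant, relying on the Timeout class and the
module-level time import that iotests.py already has:

def wait_for_runstate(self, runstate: str, timeout: int = 60) -> None:
    # Fail with a descriptive message instead of looping forever if the
    # expected runstate never shows up.
    with Timeout(timeout, f"Timeout waiting for '{runstate}' runstate"):
        while self.qmp('query-status')['return']['status'] != runstate:
            time.sleep(0.2)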


--
Best regards,
Vladimir



Re: [PATCH v4 2/4] iotests.py: Add wait_for_runstate()

2020-08-20 Thread Dr. David Alan Gilbert
* Eric Blake (ebl...@redhat.com) wrote:
> On 8/18/20 8:32 AM, Max Reitz wrote:
> > Signed-off-by: Max Reitz 
> > ---
> >   tests/qemu-iotests/iotests.py | 4 
> >   1 file changed, 4 insertions(+)
> 
> Reviewed-by: Eric Blake 
> 
> > 
> > diff --git a/tests/qemu-iotests/iotests.py b/tests/qemu-iotests/iotests.py
> > index 717b5b652c..ee93cf22db 100644
> > --- a/tests/qemu-iotests/iotests.py
> > +++ b/tests/qemu-iotests/iotests.py
> > @@ -833,6 +833,10 @@ class VM(qtest.QEMUQtestMachine):
> >  'Found node %s under %s (but expected %s)' % \
> >  (node['name'], path, expected_node)
> > +def wait_for_runstate(self, runstate: str) -> None:
> > +while self.qmp('query-status')['return']['status'] != runstate:
> > +time.sleep(0.2)
> 
> This looks like it could inf-loop if we have a bug where the status never
> changes as expected; but I guess CI bots have timeouts at higher layers that
> would detect if such a bug sneaks in.

Although it might be useful to make sure when such a timeout lands, you
know which state you thought you were waiting for.

Dave

> > +
> >   index_re = re.compile(r'([^\[]+)\[([^\]]+)\]')
> >   class QMPTestCase(unittest.TestCase):
> > 
> 
> -- 
> Eric Blake, Principal Software Engineer
> Red Hat, Inc.   +1-919-301-3226
> Virtualization:  qemu.org | libvirt.org
-- 
Dr. David Alan Gilbert / dgilb...@redhat.com / Manchester, UK




Re: [RFC PATCH 13/22] block/export: Move refcount from NBDExport to BlockExport

2020-08-20 Thread Kevin Wolf
Am 19.08.2020 um 22:58 hat Eric Blake geschrieben:
> On 8/13/20 11:29 AM, Kevin Wolf wrote:
> > Having a refcount makes sense for all types of block exports. It is also
> > a prerequisite for keeping a list of all exports at the BlockExport
> > level.
> > 
> > Signed-off-by: Kevin Wolf 
> > ---
> 
> > +++ b/include/block/export.h
> > @@ -21,14 +21,24 @@ typedef struct BlockExport BlockExport;
> >   typedef struct BlockExportDriver {
> >   BlockExportType type;
> >   BlockExport *(*create)(BlockExportOptions *, Error **);
> > +void (*delete)(BlockExport *);
> >   } BlockExportDriver;
> >   struct BlockExport {
> >   const BlockExportDriver *drv;
> > +
> > +/*
> > + * Reference count for this block export. This includes strong 
> > references
> > + * both from the owner (qemu-nbd or the monitor) and clients connected 
> > to
> > + * the export.
> 
> I guess 'the monitor' includes qemu-storage-daemon.

Yes, qemu-storage-daemon has a QMP monitor, so I would count it there.
Even the --export command line option only calls the QMP command
internally.

> > + */
> > +int refcount;
> >   };
> >   extern const BlockExportDriver blk_exp_nbd;
> >   BlockExport *blk_exp_add(BlockExportOptions *export, Error **errp);
> > +void blk_exp_ref(BlockExport *exp);
> > +void blk_exp_unref(BlockExport *exp);
> 
> Yay, I think this naming is more consistent with the rest of qemu...
> 
> >   #endif
> > diff --git a/include/block/nbd.h b/include/block/nbd.h
> > index 23030db3f1..af8509ab70 100644
> > --- a/include/block/nbd.h
> > +++ b/include/block/nbd.h
> > @@ -336,8 +336,6 @@ NBDExport *nbd_export_new(BlockDriverState *bs,
> >   void nbd_export_set_on_eject_blk(BlockExport *exp, BlockBackend *blk);
> >   void nbd_export_close(NBDExport *exp);
> >   void nbd_export_remove(NBDExport *exp, NbdServerRemoveMode mode, Error 
> > **errp);
> > -void nbd_export_get(NBDExport *exp);
> > -void nbd_export_put(NBDExport *exp);
> 
> ...as opposed to this which is common in kernel but less so in this project.
> No hard feelings from me :)
> 
> > +++ b/blockdev-nbd.c
> > @@ -232,7 +232,7 @@ BlockExport *nbd_export_create(BlockExportOptions 
> > *exp_args, Error **errp)
> >   /* The list of named exports has a strong reference to this export 
> > now and
> >* our only way of accessing it is through nbd_export_find(), so we 
> > can drop
> >* the strong reference that is @exp. */
> > -nbd_export_put(exp);
> > +blk_exp_unref((BlockExport*) exp);
> 
> Even a helper function that converts NBDBlockExport* to BlockExport* rather
> than a cast might be nicer, but then again, I see from Max's review that
> this may be a temporary state of things.
> (The QAPI contains such type-safe container casts, such as
> qapi_DriveBackup_base(), if that helps...)

Yes, this goes away before the end of the series.

> > @@ -1537,7 +1536,8 @@ NBDExport *nbd_export_new(BlockDriverState *bs,
> >   exp = g_new0(NBDExport, 1);
> >   exp->common = (BlockExport) {
> > -.drv = &blk_exp_nbd,
> > +.drv= &blk_exp_nbd,
> > +.refcount   = 1,
> 
> I'm not sure whether trying to align the '=' is good, because the moment you
> add a longer field name, every other line has to be touched.  I'm fine with
> just one space on both side of =, even if it is more ragged to read.  But
> you're the author, so you get to pick.

I generally prefer aligned '=' because the code is read much more often
than it is written or modified, so being friendly for readers is
important.

> > @@ -1626,8 +1625,9 @@ NBDExport *nbd_export_new(BlockDriverState *bs,
> >   exp->ctx = ctx;
> >   blk_add_aio_context_notifier(blk, blk_aio_attached, blk_aio_detach, 
> > exp);
> > +blk_exp_ref(&exp->common);
> >   QTAILQ_INSERT_TAIL(&exports, exp, next);
> > -nbd_export_get(exp);
> > +
> 
> Is there any consequence to this changed ordering in grabbing the reference
> vs. updating the list?

No intended consequences, but if Max is right that the code (before and
after this series) lacks some locking, it might make a theoretical
difference. If it does, the new code is safer than the old one. If it
doesn't, it's just more consistent with the order I'm used to seeing in
other places: First take the reference, then use it.

Kevin




Re: [PATCH v13 00/11] iotests: Dump QCOW2 dirty bitmaps metadata

2020-08-20 Thread Andrey Shinkevich

On 20.08.2020 03:49, Eric Blake wrote:

On 8/14/20 6:56 AM, Andrey Shinkevich wrote:

Dear Eric!

Vladimir has completed reviewing this series. I have not received any
other responses to it so far.


So, is it good for a pull request now? Would you please consider taking
this series as you did with Vladimir's related one?


I've spent some time playing with this; I have now queued it on my 
bitmaps tree, and will be posting a pull request as soon as Paolo's 
meson changes settle.  I also made the tweaks suggested on 9/11.




Sounds good, thank you so much, Eric. I appreciate it.

Andrey




Re: [PATCH v4 4/4] iotests: Test node/bitmap aliases during migration

2020-08-20 Thread Vladimir Sementsov-Ogievskiy

20.08.2020 16:17, Max Reitz wrote:

On 20.08.20 03:58, Eric Blake wrote:

On 8/18/20 8:32 AM, Max Reitz wrote:

Signed-off-by: Max Reitz 
---
   tests/qemu-iotests/300 | 595 +
   tests/qemu-iotests/300.out |   5 +


Rather sparse output (I hate debugging those sorts of outputs when the
test is failing).


Hm.  I don’t know, the stack trace usually gives a good idea and
./check -d gives QMP context.

The advantage of a sparse output is that we don’t need to adjust the
reference output every time some optional field is added somewhere.


   tests/qemu-iotests/group   |   1 +
   3 files changed, 601 insertions(+)
   create mode 100755 tests/qemu-iotests/300
   create mode 100644 tests/qemu-iotests/300.out




+    # Dirty some random megabytes
+    for _ in range(9):
+    mb_ofs = random.randrange(1024)
+    self.vm_a.hmp_qemu_io(self.src_node_name, f'write
{mb_ofs}M 1M')


It turns out that the discard operation likewise dirties the bitmap, but
slightly faster (see edb90bbd).  We could optimize it on top, but I'm
not going to require micro-optimizing to get it in.  The test takes
about 12 seconds to run for me, but you didn't mark it as such in
'group', so that's good; but it turns up a problem:

300  fail   [20:55:54] [20:56:06]    output
mismatch (see 300.out.bad)
--- /home/eblake/qemu-tmp2/tests/qemu-iotests/300.out    2020-08-19
20:53:11.087791988 -0500
+++ /home/eblake/qemu-tmp2/tests/qemu-iotests/300.out.bad    2020-08-19
20:56:06.092428756 -0500
@@ -1,5 +1,41 @@
-.
+WARNING:qemu.machine:qemu received signal 11; command:
"/home/eblake/qemu-tmp2/tests/qemu-iotests/../../x86_64-softmmu/qemu-system-x86_64
-display none -vga none -chardev
socket,id=mon,path=/tmp/tmp.qT831UThme/qemu-b-798452-monitor.sock -mon
chardev=mon,mode=control -qtest
unix:path=/tmp/tmp.qT831UThme/qemu-b-798452-qtest.sock -accel qtest
-nodefaults -display none -accel qtest -blockdev
node-name=node0,driver=null-co -incoming unix:/tmp/tmp.qT831UThme/mig_sock"
+.FE...
+==
+ERROR: test_migratee_bitmap_is_not_mapped_on_dst
(__main__.TestBlockBitmapMappingErrors)
+--
+Traceback (most recent call last):
+  File
"/home/eblake/qemu-tmp2/tests/qemu-iotests/../../python/qemu/machine.py", line
435, in _do_shutdown
+    self._soft_shutdown(timeout, has_quit)
+  File
"/home/eblake/qemu-tmp2/tests/qemu-iotests/../../python/qemu/machine.py", line
415, in _soft_shutdown
+    self._qmp.cmd('quit')
+  File
"/home/eblake/qemu-tmp2/tests/qemu-iotests/../../python/qemu/qmp.py",
line 266, in cmd
+    return self.cmd_obj(qmp_cmd)
+  File
"/home/eblake/qemu-tmp2/tests/qemu-iotests/../../python/qemu/qmp.py",
line 246, in cmd_obj
+    self.__sock.sendall(json.dumps(qmp_cmd).encode('utf-8'))
+BrokenPipeError: [Errno 32] Broken pipe
+
+The above exception was the direct cause of the following exception:
+
+Traceback (most recent call last):
+  File "300", line 76, in tearDown
+    self.vm_b.shutdown()
+  File
"/home/eblake/qemu-tmp2/tests/qemu-iotests/../../python/qemu/machine.py", line
465, in shutdown
+    self._do_shutdown(timeout, has_quit)
+  File
"/home/eblake/qemu-tmp2/tests/qemu-iotests/../../python/qemu/machine.py", line
438, in _do_shutdown
+    raise AbnormalShutdown("Could not perform graceful shutdown") \
+qemu.machine.AbnormalShutdown: Could not perform graceful shutdown
+
+==
+FAIL: test_migratee_bitmap_is_not_mapped_on_dst
(__main__.TestBlockBitmapMappingErrors)
+--
+Traceback (most recent call last):
+  File "300", line 384, in test_migratee_bitmap_is_not_mapped_on_dst
+    self.migrate(False)
+  File "300", line 99, in migrate
+    self.assertEqual(self.vm_a.wait_migration('postmigrate'),
+AssertionError: False != True
+
  --
  Ran 37 tests

-OK
+FAILED (failures=1, errors=1)

I'm not sure why I'm seeing that, but it looks like you've got a bad
deref somewhere in the alias code.


Ah, crap.

This should fix it:

diff --git a/migration/block-dirty-bitmap.c b/migration/block-dirty-bitmap.c
index 89cb16b12c..d407dfefea 100644
--- a/migration/block-dirty-bitmap.c
+++ b/migration/block-dirty-bitmap.c
@@ -1091,7 +1091,9 @@ static int dirty_bitmap_load_header(QEMUFile *f,
DBMLoadState *s,
  } else {
  bitmap_name = s->bitmap_alias;
  }
+}

+if (!s->cancelled) {
  g_strlcpy(s->bitmap_name, bitmap_name, sizeof(s->bitmap_name));
  s->bitmap = bdrv_find_dirty_bitmap(s->bs, s->bitmap_name);




That's the correct thing to do


I had this originally, and then I decided to drop that hunk just 

Re: [PATCH v7 35/47] commit: Deal with filters

2020-08-20 Thread Kevin Wolf
Am 20.08.2020 um 13:27 hat Max Reitz geschrieben:
> On 19.08.20 19:58, Kevin Wolf wrote:
> > Am 25.06.2020 um 17:22 hat Max Reitz geschrieben:
> >> This includes some permission limiting (for example, we only need to
> >> take the RESIZE permission if the base is smaller than the top).
> >>
> >> Signed-off-by: Max Reitz 
> >> ---
> >>  block/block-backend.c  |  9 +++-
> >>  block/commit.c | 96 +-
> >>  block/monitor/block-hmp-cmds.c |  2 +-
> >>  blockdev.c |  4 +-
> >>  4 files changed, 81 insertions(+), 30 deletions(-)
> >>
> >> diff --git a/block/block-backend.c b/block/block-backend.c
> >> index 6936b25c83..7f2c7dbccc 100644
> >> --- a/block/block-backend.c
> >> +++ b/block/block-backend.c
> >> @@ -2271,8 +2271,13 @@ int blk_commit_all(void)
> >>  AioContext *aio_context = blk_get_aio_context(blk);
> >>  
> >>  aio_context_acquire(aio_context);
> >> -if (blk_is_inserted(blk) && blk->root->bs->backing) {
> >> -int ret = bdrv_commit(blk->root->bs);
> > 
> > The old code didn't try to commit nodes that don't have a backing file.
> > 
> >> +if (blk_is_inserted(blk)) {
> >> +BlockDriverState *non_filter;
> >> +int ret;
> >> +
> >> +/* Legacy function, so skip implicit filters */
> >> +non_filter = bdrv_skip_implicit_filters(blk->root->bs);
> >> +ret = bdrv_commit(non_filter);
> > 
> > The new one tries unconditionally. For nodes without a backing file,
> > bdrv_commit() will return -ENOTSUP, so the whole function fails.
> 
> :(
> 
> Hm.  Should I fix it by checking for
> bdrv_cow_bs(bdrv_skip_implicit_filters())?  Or bdrv_backing_chain_next()
> and change the bdrv_skip_implicit_filters() to a bdrv_skip_filters()?  I
> feel like that would make even more sense.

I agree that bdrv_skip_filters() makes more sense. If I have a qcow2
image and an explicit throttle filter on top, there is no reason to skip
this image.

bdrv_backing_chain_next() or bdrv_cow_bs() should be the same in a
boolean context, so I'd vote for bdrv_cow_bs() because it has less work
to do to get the same result.

> > (First real bug at patch 35. I almost thought I wouldn't find any!)
> 
> :)
> 
> >>  if (ret < 0) {
> >>  aio_context_release(aio_context);
> >>  return ret;
> >> diff --git a/block/commit.c b/block/commit.c
> >> index 7732d02dfe..4122b6736d 100644
> >> --- a/block/commit.c
> >> +++ b/block/commit.c
> >> @@ -37,6 +37,7 @@ typedef struct CommitBlockJob {
> >>  BlockBackend *top;
> >>  BlockBackend *base;
> >>  BlockDriverState *base_bs;
> >> +BlockDriverState *base_overlay;
> >>  BlockdevOnError on_error;
> >>  bool base_read_only;
> >>  bool chain_frozen;
> > 
> > Hm, again this mysterious base_overlay. I know that stream introduced it
> > to avoid freezing the link to base, but commit doesn't seem to do that.
> > 
> > Is it to avoid using the block status of filter drivers between
> > base_overlay and base?
> 
> Yes.
> 
> > If so, I guess that goes back to the question I
> > raised earlier in this series: What is the block status supposed to tell
> > for filter nodes?
> 
> Honestly, I would really like to get away without having to answer that
> question in this series.  Intuitively, I feel like falling through to
> the next data-bearing layer is not something most callers want.  But I’d
> rather investigate that question separately from this series (even
> though that likely means we’ll never do it), and just treat it as it is
> in this series.

Well, I'm asking the question because not having the answer makes us
jump through hoops in this series to accommodate a behaviour it probably
shouldn't even have. (Because I agree that filters should probably keep
DATA clear, i.e. they are never the layer that defines the content.)

Additional node references (i.e. references that are not edges in the
graph) always make the design more complicated and require us to
consider more things like what happens on graph changes. So it's a
question of maintainability.

> > But anyway, in contrast to mirror, commit actually freezes the chain
> > between commit_top_bs and base, so it should be safe at least.
> > 
> >> @@ -89,7 +90,7 @@ static void commit_abort(Job *job)
> >>   * XXX Can (or should) we somehow keep 'consistent read' blocked even
> >>   * after the failed/cancelled commit job is gone? If we already wrote
> >>   * something to base, the intermediate images aren't valid any more. 
> >> */
> >> -bdrv_replace_node(s->commit_top_bs, backing_bs(s->commit_top_bs),
> >> +bdrv_replace_node(s->commit_top_bs, s->commit_top_bs->backing->bs,
> >>   &error_abort);
> >>  
> >>  bdrv_unref(s->commit_top_bs);
> >> @@ -153,7 +154,7 @@ static int coroutine_fn commit_run(Job *job, Error 
> >> **errp)
> >>  break;
> >>  }
> >>  /* Copy if 

Re: [PATCH v4 3/4] iotests.py: Let wait_migration() return on failure

2020-08-20 Thread Vladimir Sementsov-Ogievskiy

18.08.2020 16:32, Max Reitz wrote:

Let wait_migration() return on failure (with the return value indicating
whether the migration was completed or has failed), so we can use it for
migrations that are expected to fail, too.

Signed-off-by: Max Reitz 
---
  tests/qemu-iotests/iotests.py | 18 --
  1 file changed, 12 insertions(+), 6 deletions(-)

diff --git a/tests/qemu-iotests/iotests.py b/tests/qemu-iotests/iotests.py
index ee93cf22db..f39fd580a6 100644
--- a/tests/qemu-iotests/iotests.py
+++ b/tests/qemu-iotests/iotests.py
@@ -729,16 +729,22 @@ class VM(qtest.QEMUQtestMachine):
  }
  ]))
  
-def wait_migration(self, expect_runstate):

+def wait_migration(self, expect_runstate: Optional[str]) -> bool:
  while True:
  event = self.event_wait('MIGRATION')
  log(event, filters=[filter_qmp_event])
-if event['data']['status'] == 'completed':
+if event['data']['status'] in ('completed', 'failed'):
  break
-# The event may occur in finish-migrate, so wait for the expected
-# post-migration runstate
-while self.qmp('query-status')['return']['status'] != expect_runstate:
-pass
+
+if event['data']['status'] == 'completed':
+# The event may occur in finish-migrate, so wait for the expected
+# post-migration runstate
+runstate = None
+while runstate != expect_runstate:
+runstate = self.qmp('query-status')['return']['status']


It would be good to use the helper from the previous patch (roughly as in
the sketch after the quoted hunk). With it or not:
Reviewed-by: Vladimir Sementsov-Ogievskiy 


+return True
+else:
+return False
  
  def node_info(self, node_name):

  nodes = self.qmp('query-named-block-nodes')
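
What I mean is roughly this (untested) shape for the 'completed' branch,
using the wait_for_runstate() helper instead of the open-coded poll:

if event['data']['status'] == 'completed':
    # The event may occur in finish-migrate, so wait for the expected
    # post-migration runstate via the helper from the previous patch
    self.wait_for_runstate(expect_runstate)
    return True
else:
    return False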




--
Best regards,
Vladimir



Re: [PATCH v4 2/4] iotests.py: Add wait_for_runstate()

2020-08-20 Thread Vladimir Sementsov-Ogievskiy

18.08.2020 16:32, Max Reitz wrote:

Signed-off-by: Max Reitz


Reviewed-by: Vladimir Sementsov-Ogievskiy 

--
Best regards,
Vladimir



Re: [PATCH v4 1/4] migration: Add block-bitmap-mapping parameter

2020-08-20 Thread Max Reitz
On 20.08.20 14:58, Vladimir Sementsov-Ogievskiy wrote:
> 18.08.2020 16:32, Max Reitz wrote:
>> This migration parameter allows mapping block node names and bitmap
>> names to aliases for the purpose of block dirty bitmap migration.
>>
>> This way, management tools can use different node and bitmap names on
>> the source and destination and pass the mapping of how bitmaps are to be
>> transferred to qemu (on the source, the destination, or even both with
>> arbitrary aliases in the migration stream).
>>
>> While touching this code, fix a bug where bitmap names longer than 255
>> bytes would fail an assertion in qemu_put_counted_string().
>>
>> Suggested-by: Vladimir Sementsov-Ogievskiy 
>> Signed-off-by: Max Reitz 
>> ---
>>   qapi/migration.json    | 101 +++-
>>   migration/migration.h  |   3 +
>>   migration/block-dirty-bitmap.c | 409 -
>>   migration/migration.c  |  30 +++
>>   monitor/hmp-cmds.c |  30 +++
>>   5 files changed, 517 insertions(+), 56 deletions(-)
>>
>> diff --git a/qapi/migration.json b/qapi/migration.json
>> index ea53b23dca..0c4ae102b1 100644
>> --- a/qapi/migration.json
>> +++ b/qapi/migration.json
> 
> [..]
> 
>>   #
>> +# @block-bitmap-mapping: Maps block nodes and bitmaps on them to
>> +#  aliases for the purpose of dirty bitmap migration.  Such
>> +#  aliases may for example be the corresponding names on the
>> +#  opposite site.
>> +#  The mapping must be one-to-one, but not necessarily
>> +#  complete: On the source, unmapped bitmaps and all bitmaps
>> +#  on unmapped nodes will be ignored.  On the destination,
>> +#  all unmapped aliases in the incoming migration stream will
>> +#  be reported, but they will not result in failure.
> Actually, on an unknown alias we cancel the incoming bitmap migration,
> which means that the destination VM continues to run and other
> (non-bitmap) migration states continue to migrate, but all further chunks
> of bitmap migration will be ignored. (I'm not sure it's worth mentioning.)

Ah, yeah.

[...]

>> @@ -303,21 +497,39 @@ static int add_bitmaps_to_list(DBMSaveState *s,
>> BlockDriverState *bs,
>>   return 0;
>>   }
>>   +    bitmap_name = bdrv_dirty_bitmap_name(bitmap);
>> +
>>   if (!bs_name || strcmp(bs_name, "") == 0) {
>>   error_report("Bitmap '%s' in unnamed node can't be migrated",
>> - bdrv_dirty_bitmap_name(bitmap));
>> + bitmap_name);
>>   return -1;
>>   }
>>   -    if (bs_name[0] == '#') {
>> +    if (alias_map) {
>> +    const AliasMapInnerNode *amin =
>> g_hash_table_lookup(alias_map, bs_name);
>> +
>> +    if (!amin) {
>> +    /* Skip bitmaps on nodes with no alias */
>> +    return 0;
>> +    }
>> +
>> +    node_alias = amin->string;
>> +    bitmap_aliases = amin->subtree;
>> +    } else {
>> +    node_alias = bs_name;
>> +    bitmap_aliases = NULL;
>> +    }
>> +
>> +    if (node_alias[0] == '#') {
>>   error_report("Bitmap '%s' in a node with auto-generated "
>>    "name '%s' can't be migrated",
>> - bdrv_dirty_bitmap_name(bitmap), bs_name);
>> + bitmap_name, node_alias);
>>   return -1;
>>   }
> 
> This check is related only to the pre-alias_map behavior, so it's probably
> better to keep it inside the else{} branch above. Still, aliases are
> already checked to be well-formed, so this check will always be false for
> aliases anyway and will not hurt.

Hm, it’s a trade off.  It does look a bit weird, because how can aliases
be auto-generated?  But OTOH it makes it clearer that we’ll never allow
non-wellformed aliases through.

[...]

>>   if (s->flags & DIRTY_BITMAP_MIG_FLAG_BITMAP_NAME) {
>> -    if (!qemu_get_counted_string(f, s->bitmap_name)) {
>> +    const char *bitmap_name;
>> +
>> +    if (!qemu_get_counted_string(f, s->bitmap_alias)) {
>>   error_report("Unable to read bitmap name string");
> 
> Probably s/name/alias/ like for node error message.

Why not.

[...]

>> --- a/monitor/hmp-cmds.c
>> +++ b/monitor/hmp-cmds.c
>> @@ -469,6 +469,32 @@ void hmp_info_migrate_parameters(Monitor *mon,
>> const QDict *qdict)
>>   monitor_printf(mon, "%s: '%s'\n",
>>   MigrationParameter_str(MIGRATION_PARAMETER_TLS_AUTHZ),
>>   params->tls_authz);
>> +
>> +    if (params->has_block_bitmap_mapping) {
>> +    const BitmapMigrationNodeAliasList *bmnal;
>> +
>> +    monitor_printf(mon, "%s:\n",
>> +   MigrationParameter_str(
>> +  
>> MIGRATION_PARAMETER_BLOCK_BITMAP_MAPPING));
>> +
>> +    for (bmnal = params->block_bitmap_mapping;
>> + bmnal;
>> + bmnal = bmnal->next)
>> +    {
>> +    const BitmapMigrationNodeAlias *bmna = bmnal->value;
>> 

Re: [PATCH v4 4/4] iotests: Test node/bitmap aliases during migration

2020-08-20 Thread Max Reitz
On 20.08.20 03:58, Eric Blake wrote:
> On 8/18/20 8:32 AM, Max Reitz wrote:
>> Signed-off-by: Max Reitz 
>> ---
>>   tests/qemu-iotests/300 | 595 +
>>   tests/qemu-iotests/300.out |   5 +
> 
> Rather sparse output (I hate debugging those sorts of outputs when the
> test is failing).

Hm.  I don’t know, the stack trace usually gives a good idea and
./check -d gives QMP context.

The advantage of a sparse output is that we don’t need to adjust the
reference output every time some optional field is added somewhere.

>>   tests/qemu-iotests/group   |   1 +
>>   3 files changed, 601 insertions(+)
>>   create mode 100755 tests/qemu-iotests/300
>>   create mode 100644 tests/qemu-iotests/300.out
>>
> 
>> +    # Dirty some random megabytes
>> +    for _ in range(9):
>> +    mb_ofs = random.randrange(1024)
>> +    self.vm_a.hmp_qemu_io(self.src_node_name, f'write
>> {mb_ofs}M 1M')
> 
> It turns out that the discard operation likewise dirties the bitmap, but
> slightly faster (see edb90bbd).  We could optimize it on top, but I'm
> not going to require micro-optimizing to get it in.  The test takes
> about 12 seconds to run for me, but you didn't mark it as such in
> 'group', so that's good; but it turns up a problem:
> 
> 300  fail   [20:55:54] [20:56:06]    output
> mismatch (see 300.out.bad)
> --- /home/eblake/qemu-tmp2/tests/qemu-iotests/300.out    2020-08-19
> 20:53:11.087791988 -0500
> +++ /home/eblake/qemu-tmp2/tests/qemu-iotests/300.out.bad    2020-08-19
> 20:56:06.092428756 -0500
> @@ -1,5 +1,41 @@
> -.
> +WARNING:qemu.machine:qemu received signal 11; command:
> "/home/eblake/qemu-tmp2/tests/qemu-iotests/../../x86_64-softmmu/qemu-system-x86_64
> -display none -vga none -chardev
> socket,id=mon,path=/tmp/tmp.qT831UThme/qemu-b-798452-monitor.sock -mon
> chardev=mon,mode=control -qtest
> unix:path=/tmp/tmp.qT831UThme/qemu-b-798452-qtest.sock -accel qtest
> -nodefaults -display none -accel qtest -blockdev
> node-name=node0,driver=null-co -incoming unix:/tmp/tmp.qT831UThme/mig_sock"
> +.FE...
> +==
> +ERROR: test_migratee_bitmap_is_not_mapped_on_dst
> (__main__.TestBlockBitmapMappingErrors)
> +--
> +Traceback (most recent call last):
> +  File
> "/home/eblake/qemu-tmp2/tests/qemu-iotests/../../python/qemu/machine.py", line
> 435, in _do_shutdown
> +    self._soft_shutdown(timeout, has_quit)
> +  File
> "/home/eblake/qemu-tmp2/tests/qemu-iotests/../../python/qemu/machine.py", line
> 415, in _soft_shutdown
> +    self._qmp.cmd('quit')
> +  File
> "/home/eblake/qemu-tmp2/tests/qemu-iotests/../../python/qemu/qmp.py",
> line 266, in cmd
> +    return self.cmd_obj(qmp_cmd)
> +  File
> "/home/eblake/qemu-tmp2/tests/qemu-iotests/../../python/qemu/qmp.py",
> line 246, in cmd_obj
> +    self.__sock.sendall(json.dumps(qmp_cmd).encode('utf-8'))
> +BrokenPipeError: [Errno 32] Broken pipe
> +
> +The above exception was the direct cause of the following exception:
> +
> +Traceback (most recent call last):
> +  File "300", line 76, in tearDown
> +    self.vm_b.shutdown()
> +  File
> "/home/eblake/qemu-tmp2/tests/qemu-iotests/../../python/qemu/machine.py", line
> 465, in shutdown
> +    self._do_shutdown(timeout, has_quit)
> +  File
> "/home/eblake/qemu-tmp2/tests/qemu-iotests/../../python/qemu/machine.py", line
> 438, in _do_shutdown
> +    raise AbnormalShutdown("Could not perform graceful shutdown") \
> +qemu.machine.AbnormalShutdown: Could not perform graceful shutdown
> +
> +==
> +FAIL: test_migratee_bitmap_is_not_mapped_on_dst
> (__main__.TestBlockBitmapMappingErrors)
> +--
> +Traceback (most recent call last):
> +  File "300", line 384, in test_migratee_bitmap_is_not_mapped_on_dst
> +    self.migrate(False)
> +  File "300", line 99, in migrate
> +    self.assertEqual(self.vm_a.wait_migration('postmigrate'),
> +AssertionError: False != True
> +
>  --
>  Ran 37 tests
> 
> -OK
> +FAILED (failures=1, errors=1)
> 
> I'm not sure why I'm seeing that, but it looks like you've got a bad
> deref somewhere in the alias code.

Ah, crap.

This should fix it:

diff --git a/migration/block-dirty-bitmap.c b/migration/block-dirty-bitmap.c
index 89cb16b12c..d407dfefea 100644
--- a/migration/block-dirty-bitmap.c
+++ b/migration/block-dirty-bitmap.c
@@ -1091,7 +1091,9 @@ static int dirty_bitmap_load_header(QEMUFile *f,
DBMLoadState *s,
 } else {
 bitmap_name = s->bitmap_alias;
 }
+}

+if (!s->cancelled) {
 g_strlcpy(s->bitmap_name, bitmap_name, sizeof(s->bitmap_name));
 

Re: [PATCH v4 1/4] migration: Add block-bitmap-mapping parameter

2020-08-20 Thread Vladimir Sementsov-Ogievskiy

18.08.2020 16:32, Max Reitz wrote:

This migration parameter allows mapping block node names and bitmap
names to aliases for the purpose of block dirty bitmap migration.

This way, management tools can use different node and bitmap names on
the source and destination and pass the mapping of how bitmaps are to be
transferred to qemu (on the source, the destination, or even both with
arbitrary aliases in the migration stream).

While touching this code, fix a bug where bitmap names longer than 255
bytes would fail an assertion in qemu_put_counted_string().

Suggested-by: Vladimir Sementsov-Ogievskiy 
Signed-off-by: Max Reitz 
---
  qapi/migration.json| 101 +++-
  migration/migration.h  |   3 +
  migration/block-dirty-bitmap.c | 409 -
  migration/migration.c  |  30 +++
  monitor/hmp-cmds.c |  30 +++
  5 files changed, 517 insertions(+), 56 deletions(-)

diff --git a/qapi/migration.json b/qapi/migration.json
index ea53b23dca..0c4ae102b1 100644
--- a/qapi/migration.json
+++ b/qapi/migration.json


[..]


  #
+# @block-bitmap-mapping: Maps block nodes and bitmaps on them to
+#  aliases for the purpose of dirty bitmap migration.  Such
+#  aliases may for example be the corresponding names on the
+#  opposite site.
+#  The mapping must be one-to-one, but not necessarily
+#  complete: On the source, unmapped bitmaps and all bitmaps
+#  on unmapped nodes will be ignored.  On the destination,
+#  all unmapped aliases in the incoming migration stream will
+#  be reported, but they will not result in failure.

Actually, on an unknown alias we cancel the incoming bitmap migration, which means
that the destination VM continues to run and other (non-bitmap) migration states
continue to migrate, but all further chunks of bitmap migration will be ignored.
(I'm not sure it's worth mentioning.)


+#  Note that the destination does not know about bitmaps it
+#  does not receive, so there is no limitation or requirement
+#  regarding the number of bitmaps received, or how they are
+#  named, or on which nodes they are placed.
+#  By default (when this parameter has never been set), bitmap
+#  names are mapped to themselves.  Nodes are mapped to their
+#  block device name if there is one, and to their node name
+#  otherwise. (Since 5.2)
+#
  # Since: 2.4
  ##
  { 'enum': 'MigrationParameter',
@@ -656,7 +712,8 @@
 'multifd-channels',
 'xbzrle-cache-size', 'max-postcopy-bandwidth',
 'max-cpu-throttle', 'multifd-compression',
-   'multifd-zlib-level' ,'multifd-zstd-level' ] }
+   'multifd-zlib-level' ,'multifd-zstd-level',
+   'block-bitmap-mapping' ] }
  
  ##


[..]


@@ -303,21 +497,39 @@ static int add_bitmaps_to_list(DBMSaveState *s, 
BlockDriverState *bs,
  return 0;
  }
  
+bitmap_name = bdrv_dirty_bitmap_name(bitmap);

+
  if (!bs_name || strcmp(bs_name, "") == 0) {
  error_report("Bitmap '%s' in unnamed node can't be migrated",
- bdrv_dirty_bitmap_name(bitmap));
+ bitmap_name);
  return -1;
  }
  
-if (bs_name[0] == '#') {

+if (alias_map) {
+const AliasMapInnerNode *amin = g_hash_table_lookup(alias_map, 
bs_name);
+
+if (!amin) {
+/* Skip bitmaps on nodes with no alias */
+return 0;
+}
+
+node_alias = amin->string;
+bitmap_aliases = amin->subtree;
+} else {
+node_alias = bs_name;
+bitmap_aliases = NULL;
+}
+
+if (node_alias[0] == '#') {
  error_report("Bitmap '%s' in a node with auto-generated "
   "name '%s' can't be migrated",
- bdrv_dirty_bitmap_name(bitmap), bs_name);
+ bitmap_name, node_alias);
  return -1;
  }


This check is related only to the pre-alias_map behavior, so it's probably better
to keep it inside the else{} branch above. Still, aliases are already checked to be
well-formed, so this check will always be false for aliases anyway and will not
hurt.

  
  FOR_EACH_DIRTY_BITMAP(bs, bitmap) {

-if (!bdrv_dirty_bitmap_name(bitmap)) {
+bitmap_name = bdrv_dirty_bitmap_name(bitmap);
+if (!bitmap_name) {
  continue;
  }
  


[..]


@@ -770,8 +1014,10 @@ static int dirty_bitmap_load_bits(QEMUFile *f, 
DBMLoadState *s)
  return 0;
  }
  
-static int dirty_bitmap_load_header(QEMUFile *f, DBMLoadState *s)

+static int dirty_bitmap_load_header(QEMUFile *f, DBMLoadState *s,
+GHashTable *alias_map)
  {
+GHashTable *bitmap_alias_map = NULL;
  Error *local_err = NULL;
  bool nothing;
  s->flags = qemu_get_bitmap_flags(f);
@@ -780,28 +1026,73 @@ static int dirty_bitmap_load_header(QEMUFile *f, 
DBMLoadState *s)
  nothing 

Re: [PATCH v4 1/4] migration: Add block-bitmap-mapping parameter

2020-08-20 Thread Max Reitz
On 20.08.20 03:17, Eric Blake wrote:
> On 8/18/20 8:32 AM, Max Reitz wrote:
>> This migration parameter allows mapping block node names and bitmap
>> names to aliases for the purpose of block dirty bitmap migration.
>>
>> This way, management tools can use different node and bitmap names on
>> the source and destination and pass the mapping of how bitmaps are to be
>> transferred to qemu (on the source, the destination, or even both with
>> arbitrary aliases in the migration stream).
>>
>> While touching this code, fix a bug where bitmap names longer than 255
>> bytes would fail an assertion in qemu_put_counted_string().
>>
>> Suggested-by: Vladimir Sementsov-Ogievskiy 
>> Signed-off-by: Max Reitz 
>> ---
> 
>> +##
>> +# @BitmapMigrationNodeAlias:
>> +#
>> +# Maps a block node name and the bitmaps it has to aliases for dirty
>> +# bitmap migration.
>> +#
>> +# @node-name: A block node name.
>> +#
>> +# @alias: An alias block node name for migration (for example the
>> +# node name on the opposite site).
>> +#
>> +# @bitmaps: Mappings for the bitmaps on this node.
>> +#
>> +# Since: 5.2
>> +##
>> +{ 'struct': 'BitmapMigrationNodeAlias',
>> +  'data': {
>> +  'node-name': 'str',
>> +  'alias': 'str',
>> +  'bitmaps': [ 'BitmapMigrationBitmapAlias' ]
>> +  } }
> 
> Possible change: should 'alias' be optional (if absent, it defaults to
> 'node-name')?  But that can be done on top, if we like it.

I don’t know whether anyone would make use of that, though, so I’d defer
it until someone wants it.

>> +static GHashTable *construct_alias_map(const
>> BitmapMigrationNodeAliasList *bbm,
>> +   bool name_to_alias,
>> +   Error **errp)
>> +{
>> +    GHashTable *alias_map;
>> +    size_t max_node_name_len =
>> +    sizeof(((BlockDriverState *)NULL)->node_name) - 1;
> 
> Looks a bit nicer as = sizeof_field(BlockDriverState, node_name) - 1.

Oh, I didn’t know we had that.

>> +
>> +    alias_map = g_hash_table_new_full(g_str_hash, g_str_equal,
>> +  g_free,
>> free_alias_map_inner_node);
>> +
>> +    for (; bbm; bbm = bbm->next) {
>> +    const BitmapMigrationNodeAlias *bmna = bbm->value;
>> +    const BitmapMigrationBitmapAliasList *bmbal;
>> +    AliasMapInnerNode *amin;
>> +    GHashTable *bitmaps_map;
>> +    const char *node_map_from, *node_map_to;
>> +
>> +    if (!id_wellformed(bmna->alias)) {
>> +    error_setg(errp, "The node alias '%s' is not well-formed",
>> +   bmna->alias);
>> +    goto fail;
>> +    }
>> +
>> +    if (strlen(bmna->alias) > 255) {
> 
> Magic number.  UINT8_MAX seems better (since the limit really is due to
> our migration format limiting to one byte).

Well, yes, but qemu_put_counted_string() uses that magic number, too.
*shrug*

>> +    g_hash_table_insert(alias_map, g_strdup(node_map_from), amin);
>> +
>> +    for (bmbal = bmna->bitmaps; bmbal; bmbal = bmbal->next) {
>> +    const BitmapMigrationBitmapAlias *bmba = bmbal->value;
>> +    const char *bmap_map_from, *bmap_map_to;
>> +
>> +    if (strlen(bmba->alias) > 255) {
> 
> and again
> 
>> +    error_setg(errp,
>> +   "The bitmap alias '%s' is longer than 255
>> bytes",
>> +   bmba->alias);
>> +    goto fail;
>> +    }
> 
> Thanks for adding in the length checking since last revision!
> 
> 
>> @@ -326,12 +538,29 @@ static int add_bitmaps_to_list(DBMSaveState *s,
>> BlockDriverState *bs,
>>   return -1;
>>   }
>>   +    if (bitmap_aliases) {
>> +    bitmap_alias = g_hash_table_lookup(bitmap_aliases,
>> bitmap_name);
>> +    if (!bitmap_alias) {
>> +    /* Skip bitmaps with no alias */
>> +    continue;
>> +    }
>> +    } else {
>> +    if (strlen(bitmap_name) > 255) {
>> +    error_report("Cannot migrate bitmap '%s' on node '%s': "
>> + "Name is longer than 255 bytes",
>> + bitmap_name, bs_name);
>> +    return -1;
> 
> Another one.
> 
> 
> Reviewed-by: Eric Blake 
> 
> I'm happy to make those touchups, and put this on my bitmaps queue for a
> pull request as soon as Paolo's meson stuff stabilizes.

Sounds good to me, thanks!

Max





Re: [PATCH v7 14/47] stream: Deal with filters

2020-08-20 Thread Max Reitz
On 20.08.20 12:49, Vladimir Sementsov-Ogievskiy wrote:
> 20.08.2020 12:22, Max Reitz wrote:
>> On 20.08.20 10:31, Max Reitz wrote:
>>
>> [...]
>>
>>> So all in all, I believe the biggest surprise about what’s written into
>>> the top layer isn’t that it may be a json:{} filename, but the filename
>>> of a node that maybe doesn’t even exist anymore?  (Oh, no, please don’t
>>> tell me you can delete it and get an invalid pointer read...)
>>
>> (I tried triggering that, but, oh, it’s strdup’ed() in stream_start().
>> I’m a bit daft.)
>>
> 
> 
> If it's broken anyway, probably we can just revert c624b015bf and start
> to freeze base again?

Well, it’s only broken if you care about the backing filename string
that’s written to @top.  So it isn’t broken altogether.

Though, well.  If we all agree to just revert it and maybe add a @bottom
parameter instead, then I suppose we could do it.

(Maybe in a follow-up, though.)

Max





Re: [PATCH v7 35/47] commit: Deal with filters

2020-08-20 Thread Max Reitz
On 19.08.20 19:58, Kevin Wolf wrote:
> Am 25.06.2020 um 17:22 hat Max Reitz geschrieben:
>> This includes some permission limiting (for example, we only need to
>> take the RESIZE permission if the base is smaller than the top).
>>
>> Signed-off-by: Max Reitz 
>> ---
>>  block/block-backend.c  |  9 +++-
>>  block/commit.c | 96 +-
>>  block/monitor/block-hmp-cmds.c |  2 +-
>>  blockdev.c |  4 +-
>>  4 files changed, 81 insertions(+), 30 deletions(-)
>>
>> diff --git a/block/block-backend.c b/block/block-backend.c
>> index 6936b25c83..7f2c7dbccc 100644
>> --- a/block/block-backend.c
>> +++ b/block/block-backend.c
>> @@ -2271,8 +2271,13 @@ int blk_commit_all(void)
>>  AioContext *aio_context = blk_get_aio_context(blk);
>>  
>>  aio_context_acquire(aio_context);
>> -if (blk_is_inserted(blk) && blk->root->bs->backing) {
>> -int ret = bdrv_commit(blk->root->bs);
> 
> The old code didn't try to commit nodes that don't have a backing file.
> 
>> +if (blk_is_inserted(blk)) {
>> +BlockDriverState *non_filter;
>> +int ret;
>> +
>> +/* Legacy function, so skip implicit filters */
>> +non_filter = bdrv_skip_implicit_filters(blk->root->bs);
>> +ret = bdrv_commit(non_filter);
> 
> The new one tries unconditionally. For nodes without a backing file,
> bdrv_commit() will return -ENOTSUP, so the whole function fails.

:(

Hm.  Should I fix it by checking for
bdrv_cow_bs(bdrv_skip_implicit_filters())?  Or bdrv_backing_chain_next()
and change the bdrv_skip_implicit_filters() to a bdrv_skip_filters()?  I
feel like that would make even more sense.

> (First real bug at patch 35. I almost thought I wouldn't find any!)

:)

>>  if (ret < 0) {
>>  aio_context_release(aio_context);
>>  return ret;
>> diff --git a/block/commit.c b/block/commit.c
>> index 7732d02dfe..4122b6736d 100644
>> --- a/block/commit.c
>> +++ b/block/commit.c
>> @@ -37,6 +37,7 @@ typedef struct CommitBlockJob {
>>  BlockBackend *top;
>>  BlockBackend *base;
>>  BlockDriverState *base_bs;
>> +BlockDriverState *base_overlay;
>>  BlockdevOnError on_error;
>>  bool base_read_only;
>>  bool chain_frozen;
> 
> Hm, again this mysterious base_overlay. I know that stream introduced it
> to avoid freezing the link to base, but commit doesn't seem to do that.
> 
> Is it to avoid using the block status of filter drivers between
> base_overlay and base?

Yes.

> If so, I guess that goes back to the question I
> raised earlier in this series: What is the block status supposed to tell
> for filter nodes?

Honestly, I would really like to get away without having to answer that
question in this series.  Intuitively, I feel like falling through to
the next data-bearing layer is not something most callers want.  But I’d
rather investigate that question separately from this series (even
though that likely means we’ll never do it), and just treat it as it is
in this series.

> But anyway, in contrast to mirror, commit actually freezes the chain
> between commit_top_bs and base, so it should be safe at least.
> 
>> @@ -89,7 +90,7 @@ static void commit_abort(Job *job)
>>   * XXX Can (or should) we somehow keep 'consistent read' blocked even
>>   * after the failed/cancelled commit job is gone? If we already wrote
>>   * something to base, the intermediate images aren't valid any more. */
>> -bdrv_replace_node(s->commit_top_bs, backing_bs(s->commit_top_bs),
>> +bdrv_replace_node(s->commit_top_bs, s->commit_top_bs->backing->bs,
>>  &error_abort);
>>  
>>  bdrv_unref(s->commit_top_bs);
>> @@ -153,7 +154,7 @@ static int coroutine_fn commit_run(Job *job, Error 
>> **errp)
>>  break;
>>  }
>>  /* Copy if allocated above the base */
>> -ret = bdrv_is_allocated_above(blk_bs(s->top), blk_bs(s->base), 
>> false,
>> +ret = bdrv_is_allocated_above(blk_bs(s->top), s->base_overlay, true,
>>  offset, COMMIT_BUFFER_SIZE, &n);
>>  copy = (ret == 1);
>>  trace_commit_one_iteration(s, offset, n, ret);
>> @@ -253,15 +254,35 @@ void commit_start(const char *job_id, BlockDriverState 
>> *bs,
>>  CommitBlockJob *s;
>>  BlockDriverState *iter;
>>  BlockDriverState *commit_top_bs = NULL;
>> +BlockDriverState *filtered_base;
>>  Error *local_err = NULL;
>> +int64_t base_size, top_size;
>> +uint64_t perms, iter_shared_perms;
>>  int ret;
>>  
>>  assert(top != bs);
>> -if (top == base) {
>> +if (bdrv_skip_filters(top) == bdrv_skip_filters(base)) {
>>  error_setg(errp, "Invalid files for merge: top and base are the 
>> same");
>>  return;
>>  }
>>  
>> +base_size = bdrv_getlength(base);
>> +if (base_size < 0) {
>> +error_setg_errno(errp, -base_size, 

Re: [RFC PATCH 08/22] nbd: Add max-connections to nbd-server-start

2020-08-20 Thread Kevin Wolf
Am 19.08.2020 um 22:00 hat Eric Blake geschrieben:
> On 8/13/20 11:29 AM, Kevin Wolf wrote:
> > This is a QMP equivalent of qemu-nbd's --share option, limiting the
> > maximum number of clients that can attach at the same time.
> > 
> > Signed-off-by: Kevin Wolf 
> > ---
> >   qapi/block-export.json | 10 --
> >   include/block/nbd.h|  3 ++-
> >   block/monitor/block-hmp-cmds.c |  2 +-
> >   blockdev-nbd.c | 33 ++---
> >   qemu-storage-daemon.c  |  4 ++--
> >   5 files changed, 39 insertions(+), 13 deletions(-)
> > 
> > diff --git a/qapi/block-export.json b/qapi/block-export.json
> > index 40369814b4..1fdc55c53a 100644
> > --- a/qapi/block-export.json
> > +++ b/qapi/block-export.json
> > @@ -14,6 +14,8 @@
> >   # is only resolved at time of use, so can be deleted and
> >   # recreated on the fly while the NBD server is active.
> >   # If missing, it will default to denying access (since 4.0).
> > +# @max-connections: The maximum number of connections to allow at the same
> > +#   time, 0 for unlimited. (since 5.2; default: 0)
> 
> Nice way to add feature parity.
> 
> Limiting the number of connections (particularly for a writable client,
> where we cannot guarantee cache consistency between the connections), seems
> like a worthwhile feature to have; I've always found it odd that qemu-nbd
> and QMP nbd-server-add defaulted to different limits (1 vs. unlimited).  For
> reference, nbdkit defaults to unlimited, and I'm happy if
> qemu-storage-daemon does likewise; but changing qemu-nbd's default of 1
> would be backwards incompatible and may cause surprises (there's always
> 'qemu-nbd -e' when needed).  I also wonder if we should change 'qemu-nbd -e
> 0' to mean unlimited rather than an error (right now, qemu-iotests/common.rc
> uses -e 42 for all nbd-based tests for a saner limit than just 1, but it
> smells of being arbitrary compared to unlimited).

I think eventually the actual NBD server implementation in qemu-nbd
should go away and it should just reuse the QMP one. Changing the
default would remove one more difference between them, which can only be
helpful in this process. (Though of course having a different default is
still simple enough for a simple wrapper.)

Kevin




Re: [RFC PATCH 07/22] block/export: Remove magic from block-export-add

2020-08-20 Thread Kevin Wolf
Am 19.08.2020 um 21:50 hat Eric Blake geschrieben:
> cc: Peter Krempa
> 
> On 8/13/20 11:29 AM, Kevin Wolf wrote:
> > nbd-server-add tries to be convenient and adds two questionable
> > features that we don't want to share in block-export-add, even for NBD
> > exports:
> > 
> > 1. When requesting a writable export of a read-only device, the export
> > is silently downgraded to read-only. This should be an error in the
> > context of block-export-add.
> 
> I'd be happy for this to be an error even with nbd-export-add; I don't think
> it would harm any of libvirt's existing usage (either for storage migration,
> or for incremental backups).
> 
> Side note: In the past, I had a proposal to enhance the NBD Protocol to
> allow a client to advertise to the server its intent on being a read-only or
> read-write client.  Not relevant to this patch, but this part of the commit
> message reminds me that I should revisit that topic (Rich and I recently hit
> another case in nbdkit where such an extension would be nice, when it comes
> to using NBD's multi-conn for better performance on a read-only connection,
> but only if the server knows the client intends to be read-only)
> 
> > 
> > 2. When using a BlockBackend name, unplugging the device from the guest
> > will automatically stop the NBD server, too. This may sometimes be
> > what you want, but it could also be very surprising. Let's keep
> > things explicit with block-export-add. If the user wants to stop the
> > export, they should tell us so.
> 
> Here, keeping the nbd command different from the block-export command seems
> tolerable.  On the other hand, I wonder if Peter needs to change anything in
> libvirt's incremental backup code to handle this sudden disappearance of an
> NBD device during a disk hot-unplug (that is, either the presence of an
> ongoing pull-mode backup should block disk unplug, or libvirt needs a way to
> guarantee that an ongoing backup NBD device remains in spite of subsequent
> disk actions on the guest).  Depending on libvirt's needs, we may want to
> revisit the nbd command to have the same policy as block-export-add, plus an
> introspectible feature notation.

As long as we can keep the compatibility code local to qmp_nbd_*(), I
don't think it's too bad. In particular because it's already written.

Instead of adjusting libvirt to changes in the nbd-* commands, I'd
rather have it change over to block-export-*. I would like to see the
nbd-server-add/remove commands deprecated soon after we have the
replacements.

> > 
> > Move these things into the nbd-server-add QMP command handler so that
> > they apply only there.
> > 
> > Signed-off-by: Kevin Wolf 
> > ---
> >   include/block/nbd.h   |  3 ++-
> 
> > +void qmp_block_export_add(BlockExportOptions *export, Error **errp)
> > +{
> > +blk_exp_add(export, errp);
> >   }
> >   void qmp_nbd_server_add(BlockExportOptionsNbd *arg, Error **errp)
> >   {
> > -BlockExportOptions export = {
> > +BlockExport *export;
> > +BlockDriverState *bs;
> > +BlockBackend *on_eject_blk;
> > +
> > +BlockExportOptions export_opts = {
> >   .type = BLOCK_EXPORT_TYPE_NBD,
> >   .u.nbd = *arg,
> >   };
> > -qmp_block_export_add(&export, errp);
> > +
> > +/*
> > + * nbd-server-add doesn't complain when a read-only device should be
> > + * exported as writable, but simply downgrades it. This is an error 
> > with
> > + * block-export-add.
> 
> I'd be happy with either marking this deprecated now (and fixing it in two
> releases), or declaring it a bug in nbd-server-add now (and fixing it
> outright).

How about deprecating nbd-server-add completely?

> > + */
> > +bs = bdrv_lookup_bs(arg->device, arg->device, NULL);
> > +if (bs && bdrv_is_read_only(bs)) {
> > +arg->writable = false;
> > +}
> > +
> > +export = blk_exp_add(&export_opts, errp);
> > +if (!export) {
> > +return;
> > +}
> > +
> > +/*
> > + * nbd-server-add removes the export when the named BlockBackend used 
> > for
> > + * @device goes away.
> > + */
> > +on_eject_blk = blk_by_name(arg->device);
> > +if (on_eject_blk) {
> > +nbd_export_set_on_eject_blk(export, on_eject_blk);
> > +}
> 
> Wait - is the magic export removal tied only to exporting a drive name, and
> not a node name?  So as long as libvirt is using only node names when
> adding exports, a drive being unplugged won't interfere?

Yes, seems so. It's the existing behaviour, I'm only moving the code
around.

> Overall, the change makes sense to me, although I'd love to see if we could
> go further on the writable vs. read-only issue.

If nbd-server-add will be going away relatively soon, it's probably not
worth the trouble. But if you have reasons to keep it, maybe we should
consider it.

Kevin




Re: [PATCH v7 14/47] stream: Deal with filters

2020-08-20 Thread Vladimir Sementsov-Ogievskiy

20.08.2020 12:22, Max Reitz wrote:

On 20.08.20 10:31, Max Reitz wrote:

[...]


So all in all, I believe the biggest surprise about what’s written into
the top layer isn’t that it may be a json:{} filename, but the filename
of a node that maybe doesn’t even exist anymore?  (Oh, no, please don’t
tell me you can delete it and get an invalid pointer read...)


(I tried triggering that, but, oh, it’s strdup’ed() in stream_start().
I’m a bit daft.)




If it's broken anyway, probably we can just revert c624b015bf and start to 
freeze base again?


--
Best regards,
Vladimir



Re: [PATCH v3] block/nbd: use non-blocking connect: fix vm hang on connect()

2020-08-20 Thread Vladimir Sementsov-Ogievskiy

19.08.2020 20:52, Eric Blake wrote:

On 8/12/20 9:52 AM, Vladimir Sementsov-Ogievskiy wrote:

This make nbd connection_co to yield during reconnects, so that
reconnect doesn't hang up the main thread. This is very important in
case of unavailable nbd server host: connect() call may take a long
time, blocking the main thread (and due to reconnect, it will hang
again and again with small gaps of working time during pauses between
connection attempts).




How to reproduce the bug, fixed with this commit:

1. Create an image on node1:
    qemu-img create -f qcow2 xx 100M

2. Start NBD server on node1:
    qemu-nbd xx

3. Start vm with second nbd disk on node2, like this:

   ./x86_64-softmmu/qemu-system-x86_64 -nodefaults -drive \
 file=/work/images/cent7.qcow2 -drive file=nbd+tcp://192.168.100.2 \
 -vnc :0 -qmp stdio -m 2G -enable-kvm -vga std


Where is the configuration to set up retry on the nbd connection?  I wonder if 
you have a non-upstream patch that turns it on by default in your builds; for 
upstream, I would have expected something more along the lines of -blockdev 
driver=nbd,reconnect-delay=20,server.type=inet,server.data.hostname=192.168.100.2,server.data.port=10809
 (typing off the top of my head, rather than actually tested).


No, it's not necessary: reconnect is always enabled. reconnect-delay just says
what to do with guest requests while the connection is down. By default, they
just fail immediately. But even with reconnect-delay=0 the reconnect code works
and tries to reestablish the connection.





4. Access the vm through vnc (or some other way?), and check that NBD
    drive works:

    dd if=/dev/sdb of=/dev/null bs=1M count=10

    - the command should succeed.

5. Now, let's trigger nbd-reconnect loop in Qemu process. For this:

5.1 Kill NBD server on node1

5.2 run "dd if=/dev/sdb of=/dev/null bs=1M count=10" in the guest
 again. The command should fail and a lot of error messages about
 failing disk may appear as well.


Why does the guest access fail when the server goes away?  Shouldn't the 
pending guest requests merely be queued for retry (where the guest has not seen 
a failure yet, but may do so if timeouts are reached), rather than being 
instant errors?


And that's exactly how it should work when reconnect-delay is 0. If you set
reconnect-delay to >0, then for that period of time after a connection failure
is detected, all requests will be queued.





 Now NBD client driver in Qemu tries to reconnect.
 Still, VM works well.

6. Make node1 unavailable on NBD port, so connect() from node2 will
    last for a long time:

    On node1 (Note, that 10809 is just a default NBD port):

    sudo iptables -A INPUT -p tcp --dport 10809 -j DROP

    After some time the guest hangs, and you may check in gdb that Qemu
    hangs in connect() call, issued from the main thread. This is the
    BUG.

7. Don't forget to drop iptables rule from your node1:

    sudo iptables -D INPUT -p tcp --dport 10809 -j DROP






--
Best regards,
Vladimir



Re: [PATCH v7 33/47] mirror: Deal with filters

2020-08-20 Thread Max Reitz
On 19.08.20 18:50, Kevin Wolf wrote:
> Am 25.06.2020 um 17:22 hat Max Reitz geschrieben:
>> This includes some permission limiting (for example, we only need to
>> take the RESIZE permission for active commits where the base is smaller
>> than the top).
>>
>> Use this opportunity to rename qmp_drive_mirror()'s "source" BDS to
>> "target_backing_bs", because that is what it really refers to.
>>
>> Signed-off-by: Max Reitz 
> 
>> @@ -1682,6 +1721,7 @@ static BlockJob *mirror_start_job(
>>  s->zero_target = zero_target;
>>  s->copy_mode = copy_mode;
>>  s->base = base;
>> +s->base_overlay = bdrv_find_overlay(bs, base);
>>  s->granularity = granularity;
>>  s->buf_size = ROUND_UP(buf_size, granularity);
>>  s->unmap = unmap;
> 
> Is this valid without freezing the links between base_overlay and base?

Er...

> Actually, I guess we should freeze everything between bs and base (for
> base != NULL) and it's a preexisting problem that just happens to affect
> this code, too.

Yes, that’s how it looks to me, too.  I don’t think that has anything to
do with this patch.

> Or maybe freezing everything is too much. We only want to make sure that
> no non-filter is inserted between base and base_overlay and that base
> (and now base_overlay) always stay in the backing chain of bs. But what
> options apart from freezing do we have to achieve this?

I don’t know of any, and I don’t know whether anyone would actually care
if we were to just freeze everything.
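
For reference, a rough sketch of what "freeze everything" could look like in
mirror_start_job(), using the existing bdrv_freeze_backing_chain() helper
(untested, and the matching unfreeze on completion/abort is omitted):

if (base) {
    if (bdrv_freeze_backing_chain(bs, base, errp) < 0) {
        /* ...or whatever the local error path is */
        return NULL;
    }
}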

> Why is using base_overlay even better than using base? Assuming there is
> a good reason, maybe the commit message could spell it out.

The problem is that querying the block status for a filter node falls
through to the underlying data-carrying node.  So if there’s a filter on
top of @base, and we query for is_allocated_above above @base, then
we’ll include @base, which we do not want.

Max



signature.asc
Description: OpenPGP digital signature


Re: [PATCH v3] block/nbd: use non-blocking connect: fix vm hang on connect()

2020-08-20 Thread Vladimir Sementsov-Ogievskiy

19.08.2020 17:46, Eric Blake wrote:

On 8/12/20 9:52 AM, Vladimir Sementsov-Ogievskiy wrote:

This make nbd connection_co to yield during reconnects, so that


s/make nbd connection_co to/makes nbd's connection_co/


reconnect doesn't hang up the main thread. This is very important in


s/hang up/block/


case of unavailable nbd server host: connect() call may take a long


s/of/of an/


time, blocking the main thread (and due to reconnect, it will hang
again and again with small gaps of working time during pauses between
connection attempts).

Realization notes:

  - We don't want to implement non-blocking connect() over non-blocking
  socket, because getaddrinfo() doesn't have portable non-blocking
  realization anyway, so let's just use a thread for both getaddrinfo()
  and connect().

  - We can't use qio_channel_socket_connect_async (which behave


behaves


  similarly and start a thread to execute connect() call), as it's rely


starts
relying


  on someone iterating main loop (g_main_loop_run() or something like
  this), which is not always the case.

  - We can't use thread_pool_submit_co API, as thread pool waits for all
  threads to finish (but we don't want to wait for blocking reconnect
  attempt on shutdown.

  So, we just create the thread by hand. Some additional difficulties
  are:

  - We want our connect don't block drained sections and aio context


s/don't block/to avoid blocking/


  switches. To achieve this, we make it possible to "cancel" synchronous
  wait for the connect (which is an coroutine yield actually), still,


s/an/a/


  the thread continues in background, and it successful result may be


s/it successful/if successful, its/


  reused on next reconnect attempt.

  - We don't want to wait for reconnect on shutdown, so there is
  CONNECT_THREAD_RUNNING_DETACHED thread state, which means that block
  layer not more interested in a result, and thread should close new


which means that the block layer is no longer interested


  connected socket on finish and free the state.

How to reproduce the bug, fixed with this commit:

1. Create an image on node1:
    qemu-img create -f qcow2 xx 100M

2. Start NBD server on node1:
    qemu-nbd xx

3. Start vm with second nbd disk on node2, like this:

   ./x86_64-softmmu/qemu-system-x86_64 -nodefaults -drive \
 file=/work/images/cent7.qcow2 -drive file=nbd+tcp://192.168.100.2 \
 -vnc :0 -qmp stdio -m 2G -enable-kvm -vga std

4. Access the vm through vnc (or some other way?), and check that NBD
    drive works:

    dd if=/dev/sdb of=/dev/null bs=1M count=10

    - the command should succeed.

5. Now, let's trigger nbd-reconnect loop in Qemu process. For this:

5.1 Kill NBD server on node1

5.2 run "dd if=/dev/sdb of=/dev/null bs=1M count=10" in the guest
 again. The command should fail and a lot of error messages about
 failing disk may appear as well.

 Now NBD client driver in Qemu tries to reconnect.
 Still, VM works well.

6. Make node1 unavailable on NBD port, so connect() from node2 will
    last for a long time:

    On node1 (Note, that 10809 is just a default NBD port):

    sudo iptables -A INPUT -p tcp --dport 10809 -j DROP

    After some time the guest hangs, and you may check in gdb that Qemu
    hangs in connect() call, issued from the main thread. This is the
    BUG.

7. Don't forget to drop iptables rule from your node1:

    sudo iptables -D INPUT -p tcp --dport 10809 -j DROP

Signed-off-by: Vladimir Sementsov-Ogievskiy 
---

Hi!

This a continuation of "[PATCH v2 for-5.1? 0/5] Fix nbd reconnect dead-locks",
which was mostly merged to 5.1. The only last patch was not merged, and
here is a no-change resend for convenience.


  block/nbd.c | 266 +++-
  1 file changed, 265 insertions(+), 1 deletion(-)


Looks big, but the commit message goes into good detail about what the problem 
is, why the solution takes the approach it does, and a good formula for 
reproduction.



diff --git a/block/nbd.c b/block/nbd.c
index 7bb881fef4..919ec5e573 100644
--- a/block/nbd.c
+++ b/block/nbd.c
@@ -38,6 +38,7 @@
  #include "qapi/qapi-visit-sockets.h"
  #include "qapi/qmp/qstring.h"
+#include "qapi/clone-visitor.h"
  #include "block/qdict.h"
  #include "block/nbd.h"
@@ -62,6 +63,47 @@ typedef enum NBDClientState {
  NBD_CLIENT_QUIT
  } NBDClientState;
+typedef enum NBDConnectThreadState {
+/* No thread, no pending results */
+    CONNECT_THREAD_NONE,


I'd indent the comments by four spaces, to line up with the enumeration values 
they describe.


+
+/* Thread is running, no results for now */
+    CONNECT_THREAD_RUNNING,
+
+/*
+ * Thread is running, but requestor exited. Thread should close the new socket
+ * and free the connect state on exit.
+ */
+    CONNECT_THREAD_RUNNING_DETACHED,
+
+/* Thread finished, results are stored in a state */
+    CONNECT_THREAD_FAIL,
+    CONNECT_THREAD_SUCCESS
+} NBDConnectThreadState;
+
+typedef struct NBDConnectThread 

Re: [PATCH v3 12/12] block/qcow2: automatically insert preallocate filter when on FUSE

2020-08-20 Thread Vladimir Sementsov-Ogievskiy

19.08.2020 18:15, Stefan Hajnoczi wrote:

On Mon, Aug 17, 2020 at 12:15:53PM +0300, Vladimir Sementsov-Ogievskiy wrote:

vstorage has slow allocation, so this patch detect vstorage
(I hope, we don't use other FUSE filesystems) and inserts preallocate
filter between qcow2 node and its file child.

The following test executes more than 10 times faster
(43.2s -> 3.9s for me) with this patch. (/newssd3 is mount point of
vstorage, both cs and mds are on same ssd local ssd disk)

 IMG=/newssd3/z
 FILE_OPTS=file.filename=$IMG
 COUNT=15000
 CHUNK=64K
 CLUSTER=1M
 rm -f $IMG
 ./qemu-img create -f qcow2 -o cluster_size=$CLUSTER $IMG 1G
 ./qemu-img bench -c $COUNT -d 1 -s $CHUNK -w -t none -f qcow2 $IMG


Kevin's input is needed here. I think the philosophy is that nodes are
not automatically inserted. The user should define the graph explicitly.
The management tool would be responsible for configuring a preallocate
filter node.



This patch was originally not intended to be merged upstream, only downstream.
I post it just to get some ideas on whether it could be useful for others as
well. (I'm not sure that all FUSE filesystems need this filter, but vstorage
does.)

Hmm, about automatically inserted nodes: why not? Block jobs insert their
filters automatically. Anyway, for our downstream process the simplest thing is
to insert it automatically in Qemu.



--
Best regards,
Vladimir



Re: [PATCH v7 14/47] stream: Deal with filters

2020-08-20 Thread Max Reitz
On 20.08.20 10:31, Max Reitz wrote:

[...]

> So all in all, I believe the biggest surprise about what’s written into
> the top layer isn’t that it may be a json:{} filename, but the filename
> of a node that maybe doesn’t even exist anymore?  (Oh, no, please don’t
> tell me you can delete it and get an invalid pointer read...)

(I tried triggering that, but, oh, it’s strdup’ed() in stream_start().
I’m a bit daft.)



signature.asc
Description: OpenPGP digital signature


Re: [PULL 0/2] Block patches for 5.1.0-rc4

2020-08-20 Thread Peter Maydell
On Tue, 11 Aug 2020 at 10:35, Max Reitz  wrote:
>
> Hi,
>
> There is a bug in the backup job that breaks backups from images whose
> size is not aligned to the job's cluster size (i.e., qemu crashes
> because of a failed assertion).  If this bug makes it into the release,
> it would be a regression from 5.0.
>
> On one hand, this is probably a rare configuration that should not
> happen in practice.  On the other, it is a regression, and the fix
> (patch 1) is simple.  So I think it would be good to have this in 5.1.
>
>
> The following changes since commit e1d322c40524d2c544d1fcd37b267d106d16d328:
>
>   Update version for v5.1.0-rc3 release (2020-08-05 17:37:17 +0100)
>
> are available in the Git repository at:
>
>   https://github.com/XanClic/qemu.git tags/pull-block-2020-08-11
>
> for you to fetch changes up to 1f3765b652930a3b485f1a67542c2410c3774abe:
>
>   iotests: add test for unaligned granularity bitmap backup (2020-08-11 
> 09:29:31 +0200)
>
> 
> Block patches for 5.1.0-rc4:
> - Fix abort when running a backup job on an image whose size is not
>   aligned to the backup job's cluster size
>


Applied, thanks.

Please update the changelog at https://wiki.qemu.org/ChangeLog/5.2
for any user-visible changes.

-- PMM



Re: [PATCH v7 14/47] stream: Deal with filters

2020-08-20 Thread Max Reitz
On 19.08.20 17:16, Kevin Wolf wrote:
> Am 19.08.2020 um 16:47 hat Max Reitz geschrieben:
>> On 18.08.20 16:28, Kevin Wolf wrote:
>>> Am 25.06.2020 um 17:21 hat Max Reitz geschrieben:
 Because of the (not so recent anymore) changes that make the stream job
 independent of the base node and instead track the node above it, we
 have to split that "bottom" node into two cases: The bottom COW node,
 and the node directly above the base node (which may be an R/W filter
 or the bottom COW node).

 Signed-off-by: Max Reitz 
 ---
  qapi/block-core.json |  4 +++
  block/stream.c   | 63 
  blockdev.c   |  4 ++-
  3 files changed, 53 insertions(+), 18 deletions(-)

 diff --git a/qapi/block-core.json b/qapi/block-core.json
 index b20332e592..df87855429 100644
 --- a/qapi/block-core.json
 +++ b/qapi/block-core.json
 @@ -2486,6 +2486,10 @@
  # On successful completion the image file is updated to drop the backing 
 file
  # and the BLOCK_JOB_COMPLETED event is emitted.
  #
 +# In case @device is a filter node, block-stream modifies the first 
 non-filter
 +# overlay node below it to point to base's backing node (or NULL if @base 
 was
 +# not specified) instead of modifying @device itself.
>>>
>>> Not to @base's backing node, but to @base itself (or actually, to
>>> above_base's backing node, which is initially @base, but may have
>>> changed when the job is completed).
>>
>> Oh, yes.
>>
>> (I thought I had noticed that already at some point and fixed it
>> locally...  But apparently not.)
>>
>>> Should we also document what using a filter node for @base means?
>>
>> Hm.  What does it mean?  I think the more interesting case is what it
>> means if above_base is a filter, right?
>>
>> Maybe we can put in somewhere in the “If a base file is specified then
>> sectors are not copied from that base file and its backing chain.”  But
>> the more I think about it, the less I know what we could add to it.
>> What happens if there are filters above @base is that their data isn’t
>> copied, because that’s exactly the data in @base.
> 
> The interesting part is probably the graph reconfiguration at the end of
> the job. Which is actually already documented:
> 
> # When streaming completes the image file will have the base
> # file as its backing file.
> 
> Of course, this is not entirely correct any more (because the base may
> have changed).
> 
> If @base is a filter, what backing file path do we write into the top
> layer? A json: filename including the filter?

Yes.

Or, actually.  Now that I read the code...  It takes @base’s filename
before the stream job and then uses that.  So if @base has changed
during the job, then it still uses the old filename.

But that’s not really due to this series.

> Is this worth mentioning
> or do you consider it obvious?

Hm.  I consider it obvious, yes.  @base becomes @top’s backing file (at
least without any graph changes while the job is running), so naturally
what’s written into the image header is @base’s filename – which is a
json:{} filename.

On second thought, @backing-file’s description mysteriously states that
“QEMU will automatically determine the backing file string to use”.
Which makes sense because it would clearly not make sense to describe
what’s actually happening, which is to use @base’s filename at job start
regardless of whether it’s still there at the end of the job.

So I suppose I have the choice of either documenting exactly what’s
happening, even though it doesn’t make much sense, or just not, keeping
it mysterious.

So all in all, I believe the biggest surprise about what’s written into
the top layer isn’t that it may be a json:{} filename, but the filename
of a node that maybe doesn’t even exist anymore?  (Oh, no, please don’t
tell me you can delete it and get an invalid pointer read...)

The more I think about it, the more I think there are problems beyond
the scope of this series here. :/

Max



signature.asc
Description: OpenPGP digital signature


Re: [PATCH 2/3] block: add logging facility for long standing IO requests

2020-08-20 Thread David Edmondson
On Monday, 2020-08-10 at 13:14:46 +03, Denis V. Lunev wrote:

> +strftime(buf, sizeof(buf), "%m-%d %H:%M:%S",

"%F %T" would include the year, which can be useful.

> + localtime_r(&start_time_host_s, &tm));
> +
> +bs = blk_bs(blk_stats2blk(stats));
> +qemu_log("long %s[%ld] IO request: %d.03%d since %s.%03d bs: %s(%s, 
> %s)\n",
> + block_account_type(cookie->type), cookie->bytes,
> + (int)(latency_ms / 1000), (int)(latency_ms % 1000), buf,
> + (int)((cookie->start_time_ns / 100) % 1000),

Is there a reason not to use %u rather than casting?

dme.
-- 
We wanna wait, but here we go again.



Re: [PATCH v8 0/6] block: seriously improve savevm/loadvm performance

2020-08-20 Thread Denis V. Lunev
On 7/9/20 4:26 PM, Denis V. Lunev wrote:
> This series do standard basic things:
> - it creates intermediate buffer for all writes from QEMU migration code
>   to QCOW2 image,
> - this buffer is sent to disk asynchronously, allowing several writes to
>   run in parallel.
>
> In general, migration code is fantastically inefficient (by observation),
> buffers are not aligned and sent in arbitrary pieces, a lot of the time
> less than 100 bytes at a chunk, which results in read-modify-write
> operations with non-cached operations. It should also be noted that all
> operations are performed into unallocated image blocks, which also suffer
> due to partial writes to such new clusters.
>
> This patch series is an implementation of idea discussed in the RFC
> posted by Denis Plotnikov
> https://lists.gnu.org/archive/html/qemu-devel/2020-04/msg01925.html
> Results with this series over NVME are better than original code
> original rfcthis
> cached:  1.79s  2.38s   1.27s
> non-cached:  3.29s  1.31s   0.81s
>
> Changes from v7:
> - dropped lock from LoadVMState
> - fixed assert in last patch
> - dropped patch 1 as queued
>
> Changes from v6:
> - blk_load_vmstate kludges added (patchew problem fixed)
>
> Changes from v5:
> - loadvm optimizations added with Vladimir comments included
>
> Changes from v4:
> - added patch 4 with blk_save_vmstate() cleanup
> - added R-By
> - bdrv_flush_vmstate -> bdrv_finalize_vmstate
> - fixed return code of bdrv_co_do_save_vmstate
> - fixed typos in comments (Eric, thanks!)
> - fixed patchew warnings
>
> Changes from v3:
> - rebased to master
> - added patch 3 which removes aio_task_pool_wait_one()
> - added R-By to patch 1
> - patch 4 is rewritten via bdrv_run_co
> - error path in blk_save_vmstate() is rewritten to call bdrv_flush_vmstate
>   unconditionally
> - added some comments
> - fixes initialization in bdrv_co_vmstate_save_task_entry as suggested
>
> Changes from v2:
> - code moved from QCOW2 level to generic block level
> - created bdrv_flush_vmstate helper to fix 022, 029 tests
> - added recursive for bs->file in bdrv_co_flush_vmstate (fix 267)
> - fixed blk_save_vmstate helper
> - fixed coroutine wait as Vladimir suggested with waiting fixes from me
>
> Changes from v1:
> - patchew warning fixed
> - fixed validation that only 1 waiter is allowed in patch 1
>
> Signed-off-by: Denis V. Lunev 
> CC: Kevin Wolf 
> CC: Max Reitz 
> CC: Stefan Hajnoczi 
> CC: Fam Zheng 
> CC: Juan Quintela 
> CC: "Dr. David Alan Gilbert" 
> CC: Vladimir Sementsov-Ogievskiy 
> CC: Denis Plotnikov 
>
>
ping



Re: [PATCH v2 for 5.2 0/3] block: add logging facility for long standing IO requests

2020-08-20 Thread Denis V. Lunev
On 8/12/20 5:00 PM, Stefan Hajnoczi wrote:
> On Mon, Aug 10, 2020 at 01:14:44PM +0300, Denis V. Lunev wrote:
>> There are severe delays with IO requests processing if QEMU is running in
>> virtual machine or over software defined storage. Such delays potentially
>> results in unpredictable guest behavior. For example, guests over IDE or
>> SATA drive could remount filesystem read-only if write is performed
>> longer than 10 seconds.
>>
>> Such reports are very complex to process. Some good starting point for this
>> seems quite reasonable. This patch provides one. It adds logging of such
>> potentially dangerous long IO operations.
>>
>> Changed from v2:
>> - removed accidentally added slirp subproject ID
>> - added comment describing timeout selection to patch 3
>>
>> Changes from v1:
>> - fixed conversions using macros suggested by Stefan
>> - fixed option declaration
>> - enabled by default with patch 3
>>
>> Signed-off-by: Denis V. Lunev 
>> CC: Vladimir Sementsov-Ogievskiy 
>> CC: Kevin Wolf 
>> CC: Max Reitz 
>> CC: Stefan Hajnoczi 
> Reviewed-by: Stefan Hajnoczi 
ping



Re: [PATCH v7 4/7] scripts: add coroutine-wrapper.py

2020-08-20 Thread Paolo Bonzini
On 10/06/20 12:03, Vladimir Sementsov-Ogievskiy wrote:
> +{struct_name} s = {{
> +.poll_state.bs = {bs},
> +.poll_state.in_progress = true,
> +
> +{ func.gen_block('.{name} = {name},') }
> +}};
> +
> +s.poll_state.co = qemu_coroutine_create({name}_entry, &s);
> +
> +return bdrv_poll_co(&s.poll_state);
> +}}
> +}}"""
> +
> +
> +def gen_wrappers_file(input_code: str) -> str:
> +res = gen_header()
> +for func in func_decl_iter(input_code):
> +res += '\n\n\n'
> +res += gen_wrapper(func)
> +
> +return prettify(res)  # prettify to wrap long lines
> +
> +
> +if __name__ == '__main__':
> +print(gen_wrappers_file(sys.stdin.read()))
> -- 

For Meson support, you'll have to move the "cat" inside the script.  But
since func_decl_iter can work separately on each file, it's enough to do
something like

for filename in sys.argv:
    with open(filename, 'r') as f:
        print(gen_wrappers_file(f.read()))

Paolo




Re: [PATCH v7 0/7] coroutines: generate wrapper code

2020-08-20 Thread Paolo Bonzini
On 20/08/20 03:33, Eric Blake wrote:
>>>
>>
>> OK, will do. Thanks for taking a look!
> 
> As this series touched Makefile to add a generated .c, you'll also need
> to rebase that part to apply on top of Paolo's meson conversion (cc'ing
> him if you need help figuring it out)

It should be trivial to do so using custom_target.

Paolo