Re: [PATCH] doc: Describe missing generic -blockdev options

2019-10-15 Thread Peter Maydell
On Tue, 15 Oct 2019 at 13:40, Kevin Wolf  wrote:
>
> We added more generic options after introducing -blockdev and forgot to
> update the documentation (man page and --help output) accordingly. Do
> that now.
>
> Signed-off-by: Kevin Wolf 
> ---
>  qemu-options.hx | 19 ++++++++++++++++++-
>  1 file changed, 18 insertions(+), 1 deletion(-)
>
> diff --git a/qemu-options.hx b/qemu-options.hx
> index 793d70ff93..9f6aa3dde3 100644
> --- a/qemu-options.hx
> +++ b/qemu-options.hx
> @@ -849,7 +849,8 @@ ETEXI
>  DEF("blockdev", HAS_ARG, QEMU_OPTION_blockdev,
>  "-blockdev [driver=]driver[,node-name=N][,discard=ignore|unmap]\n"
>  "  [,cache.direct=on|off][,cache.no-flush=on|off]\n"
> -"  [,read-only=on|off][,detect-zeroes=on|off|unmap]\n"
> +"  [,read-only=on|off][,auto-read-only=on|off]\n"
> +"  [,force-share=on|off][,detect-zeroes=on|off|unmap]\n"
>  "  [,driver specific parameters...]\n"
>  "configure a block backend\n", QEMU_ARCH_ALL)
>  STEXI
> @@ -885,6 +886,22 @@ name is not intended to be predictable and changes between QEMU invocations.
>  For the top level, an explicit node name must be specified.
>  @item read-only
>  Open the node read-only. Guest write attempts will fail.
> +
> +Note that some block drivers support only read-only access, either generally or
> +in certain configurations. In this case, the default value
> +@option{read-only=off} does not work and the option must be specified
> +explicitly.
> +@item auto-read-only
> +If @option{auto-read-only=on} is set, QEMU is allowed not to open the image
> +read-write even if @option{read-only=off} is requested, but fall back to
> +read-only instead (and switch between the modes later), e.g. depending on
> +whether the image file is writable or whether a writing user is attached to the
> +node.
> +@item force-share
> +Override the image locking system of QEMU and force the node to allowing
> +sharing all permissions with other uses.

Grammar nit: "to allow sharing"; but maybe the phrasing could
be clarified anyway -- I'm not entirely sure what "sharing
permissions" would be. The first part of the sentence suggests
this option is "force the image file to be opened even if some
other QEMU instance has it open already", but the second half
sounds like "don't lock the image, so that some other use later
is allowed to open it"? Or is it both, or something else?

> +
> +Enabling @option{force-share=on} requires @option{read-only=on}.

thanks
-- PMM



Re: [PATCH] doc: Describe missing generic -blockdev options

2019-10-15 Thread Eric Blake

On 10/15/19 7:38 AM, Kevin Wolf wrote:

We added more generic options after introducing -blockdev and forgot to
update the documentation (man page and --help output) accordingly. Do
that now.

Signed-off-by: Kevin Wolf 
---
  qemu-options.hx | 19 ++++++++++++++++++-
  1 file changed, 18 insertions(+), 1 deletion(-)




@@ -885,6 +886,22 @@ name is not intended to be predictable and changes between QEMU invocations.
  For the top level, an explicit node name must be specified.
  @item read-only
  Open the node read-only. Guest write attempts will fail.
+
+Note that some block drivers support only read-only access, either generally or
+in certain configurations. In this case, the default value
+@option{read-only=off} does not work and the option must be specified
+explicitly.
+@item auto-read-only
+If @option{auto-read-only=on} is set, QEMU is allowed not to open the image
+read-write even if @option{read-only=off} is requested, but fall back to
+read-only instead (and switch between the modes later), e.g. depending on
+whether the image file is writable or whether a writing user is attached to the
+node.


Hard to read.  Maybe:

If @option{auto-read-only=on} is set, QEMU may fall back to read-only 
usage even when @option{read-only=off} is requested, or even switch 
between modes as needed, e.g. depending on whether the image file is 
writable or whether a writing user is attached to the node.


--
Eric Blake, Principal Software Engineer
Red Hat, Inc.   +1-919-301-3226
Virtualization:  qemu.org | libvirt.org



Re: [PATCH] blockdev: Use error_report() in hmp_commit()

2019-10-15 Thread Eric Blake

On 10/15/19 7:39 AM, Kevin Wolf wrote:

Instead of using monitor_printf() to report errors, hmp_commit() should
use error_report() like other places do.

Signed-off-by: Kevin Wolf 
---
  blockdev.c | 7 +++----
  1 file changed, 3 insertions(+), 4 deletions(-)



Reviewed-by: Eric Blake 

--
Eric Blake, Principal Software Engineer
Red Hat, Inc.   +1-919-301-3226
Virtualization:  qemu.org | libvirt.org



Re: [PATCH] doc: Describe missing generic -blockdev options

2019-10-15 Thread Kevin Wolf
Am 15.10.2019 um 15:55 hat Peter Maydell geschrieben:
> On Tue, 15 Oct 2019 at 13:40, Kevin Wolf  wrote:
> >
> > We added more generic options after introducing -blockdev and forgot to
> > update the documentation (man page and --help output) accordingly. Do
> > that now.
> >
> > Signed-off-by: Kevin Wolf 
> > ---
> >  qemu-options.hx | 19 ++++++++++++++++++-
> >  1 file changed, 18 insertions(+), 1 deletion(-)
> >
> > diff --git a/qemu-options.hx b/qemu-options.hx
> > index 793d70ff93..9f6aa3dde3 100644
> > --- a/qemu-options.hx
> > +++ b/qemu-options.hx
> > @@ -849,7 +849,8 @@ ETEXI
> >  DEF("blockdev", HAS_ARG, QEMU_OPTION_blockdev,
> >  "-blockdev [driver=]driver[,node-name=N][,discard=ignore|unmap]\n"
> >  "  [,cache.direct=on|off][,cache.no-flush=on|off]\n"
> > -"  [,read-only=on|off][,detect-zeroes=on|off|unmap]\n"
> > +"  [,read-only=on|off][,auto-read-only=on|off]\n"
> > +"  [,force-share=on|off][,detect-zeroes=on|off|unmap]\n"
> >  "  [,driver specific parameters...]\n"
> >  "configure a block backend\n", QEMU_ARCH_ALL)
> >  STEXI
> > @@ -885,6 +886,22 @@ name is not intended to be predictable and changes between QEMU invocations.
> >  For the top level, an explicit node name must be specified.
> >  @item read-only
> >  Open the node read-only. Guest write attempts will fail.
> > +
> > +Note that some block drivers support only read-only access, either generally or
> > +in certain configurations. In this case, the default value
> > +@option{read-only=off} does not work and the option must be specified
> > +explicitly.
> > +@item auto-read-only
> > +If @option{auto-read-only=on} is set, QEMU is allowed not to open the image
> > +read-write even if @option{read-only=off} is requested, but fall back to
> > +read-only instead (and switch between the modes later), e.g. depending on
> > +whether the image file is writable or whether a writing user is attached to the
> > +node.
> > +@item force-share
> > +Override the image locking system of QEMU and force the node to allowing
> > +sharing all permissions with other uses.
> 
> Grammar nit: "to allow sharing"; but maybe the phrasing could
> be clarified anyway -- I'm not entirely sure what "sharing
> permissions" would be. The first part of the sentence suggests
> this option is "force the image file to be opened even if some
> other QEMU instance has it open already", but the second half
> sounds like "don't lock the image, so that some other use later
> is allowed to open it"? Or is it both, or something else?

It's more the latter. Open the image file and allow other instances to
have it open as well (existing and future instances), but still error
out if the other instance doesn't allow sharing.

I'm open to suggestions on how to phrase this better.

Kevin



Re: [PATCH] doc: Describe missing generic -blockdev options

2019-10-15 Thread Eric Blake

On 10/15/19 9:05 AM, Kevin Wolf wrote:


+@item force-share
+Override the image locking system of QEMU and force the node to allowing
+sharing all permissions with other uses.


Grammar nit: "to allow sharing"; but maybe the phrasing could
be clarified anyway -- I'm not entirely sure what "sharing
permissions" would be. The first part of the sentence suggests
this option is "force the image file to be opened even if some
other QEMU instance has it open already", but the second half
sounds like "don't lock the image, so that some other use later
is allowed to open it"? Or is it both, or something else?


It's more the latter. Open the image file and allow other instances to
have it open as well (existing and future instances), but still error
out if the other instance doesn't allow sharing.

I'm open to suggestions on how to phrase this better.


Here's a shot (although I'm not 100% certain I've captured the nuances 
correctly):


Override the image locking system of QEMU by forcing the node to utilize 
weaker shared access for permissions where it would normally request 
exclusive access.  When there is the potential for multiple instances to 
have the same file open (whether this invocation of qemu is the first or 
the second instance), both instances must permit shared access for the 
second instance to succeed at opening the file.
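
To make the "both sides must permit sharing" point concrete, here is a toy model in Python. It is only an illustration of the rule described in this thread, not QEMU's implementation; the permission names, the Opener class and can_coexist() are all made up for the example.

```python
from dataclasses import dataclass

# Illustrative permission names only; this is not QEMU's actual set.
ALL_PERMS = frozenset({"consistent-read", "write", "resize"})


@dataclass(frozen=True)
class Opener:
    perm: frozenset    # permissions this user of the image needs
    shared: frozenset  # permissions it tolerates other users holding


def can_coexist(a, b):
    """Both users may have the image open only if each one's needs are
    covered by what the other is willing to share."""
    return a.perm <= b.shared and b.perm <= a.shared


writer = Opener(frozenset({"consistent-read", "write"}),
                frozenset({"consistent-read"}))
reader = Opener(frozenset({"consistent-read"}),
                frozenset({"consistent-read"}))
# Modelled after force-share=on: read-only, shares everything.
forced = Opener(frozenset({"consistent-read"}), ALL_PERMS)
# A user that shares nothing at all.
exclusive = Opener(frozenset({"consistent-read"}), frozenset())

print(can_coexist(writer, reader))     # False: reader does not share "write"
print(can_coexist(writer, forced))     # True: forced shares everything
print(can_coexist(exclusive, forced))  # False: the other side refuses to share
```

The last case is the one Kevin describes: forcing sharing on our side does not help if the other instance does not allow it.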


--
Eric Blake, Principal Software Engineer
Red Hat, Inc.   +1-919-301-3226
Virtualization:  qemu.org | libvirt.org



[PATCH v2 02/21] iotests/qcow2.py: Split feature fields into bits

2019-10-15 Thread Max Reitz
Print the feature fields as a set of bits so that filtering is easier.
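
As a rough illustration of the new output format (a standalone sketch, not the code from this patch; feature_bits is a hypothetical helper name), a 64-bit feature field can be turned into a list of set bit indices like this:

```python
def feature_bits(mask, width=64):
    """Return the indices of the bits set in a feature field, lowest first."""
    return [bit for bit in range(width) if mask & (1 << bit)]


print(feature_bits(0x0))                 # []
print(feature_bits(0x8000000000000000))  # [63]
print(feature_bits(0x3))                 # [0, 1]
```

A test can then filter on individual bit numbers instead of matching a whole hexadecimal value.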

Signed-off-by: Max Reitz 
---
 tests/qemu-iotests/031.out  | 36 +--
 tests/qemu-iotests/036.out  | 18 +-
 tests/qemu-iotests/039.out  | 22 ++--
 tests/qemu-iotests/060.out  | 20 +--
 tests/qemu-iotests/061.out  | 72 ++---
 tests/qemu-iotests/137.out  |  2 +-
 tests/qemu-iotests/qcow2.py | 18 +++---
 7 files changed, 99 insertions(+), 89 deletions(-)

diff --git a/tests/qemu-iotests/031.out b/tests/qemu-iotests/031.out
index 68a74d03b9..d535e407bc 100644
--- a/tests/qemu-iotests/031.out
+++ b/tests/qemu-iotests/031.out
@@ -18,9 +18,9 @@ refcount_table_offset     0x10000
 refcount_table_clusters   1
 nb_snapshots              0
 snapshot_offset           0x0
-incompatible_features     0x0
-compatible_features       0x0
-autoclear_features        0x0
+incompatible_features     []
+compatible_features       []
+autoclear_features        []
 refcount_order            4
 header_length             72
 
@@ -46,9 +46,9 @@ refcount_table_offset     0x10000
 refcount_table_clusters   1
 nb_snapshots              0
 snapshot_offset           0x0
-incompatible_features     0x0
-compatible_features       0x0
-autoclear_features        0x0
+incompatible_features     []
+compatible_features       []
+autoclear_features        []
 refcount_order            4
 header_length             72
 
@@ -74,9 +74,9 @@ refcount_table_offset     0x10000
 refcount_table_clusters   1
 nb_snapshots              0
 snapshot_offset           0x0
-incompatible_features     0x0
-compatible_features       0x0
-autoclear_features        0x0
+incompatible_features     []
+compatible_features       []
+autoclear_features        []
 refcount_order            4
 header_length             72
 
@@ -109,9 +109,9 @@ refcount_table_offset     0x10000
 refcount_table_clusters   1
 nb_snapshots              0
 snapshot_offset           0x0
-incompatible_features     0x0
-compatible_features       0x0
-autoclear_features        0x0
+incompatible_features     []
+compatible_features       []
+autoclear_features        []
 refcount_order            4
 header_length             104
 
@@ -142,9 +142,9 @@ refcount_table_offset     0x10000
 refcount_table_clusters   1
 nb_snapshots              0
 snapshot_offset           0x0
-incompatible_features     0x0
-compatible_features       0x0
-autoclear_features        0x0
+incompatible_features     []
+compatible_features       []
+autoclear_features        []
 refcount_order            4
 header_length             104
 
@@ -175,9 +175,9 @@ refcount_table_offset     0x10000
 refcount_table_clusters   1
 nb_snapshots              0
 snapshot_offset           0x0
-incompatible_features     0x0
-compatible_features       0x0
-autoclear_features        0x0
+incompatible_features     []
+compatible_features       []
+autoclear_features        []
 refcount_order            4
 header_length             104
 
diff --git a/tests/qemu-iotests/036.out b/tests/qemu-iotests/036.out
index e489b44386..15229a9604 100644
--- a/tests/qemu-iotests/036.out
+++ b/tests/qemu-iotests/036.out
@@ -16,9 +16,9 @@ refcount_table_offset     0x10000
 refcount_table_clusters   1
 nb_snapshots              0
 snapshot_offset           0x0
-incompatible_features     0x8000000000000000
-compatible_features       0x0
-autoclear_features        0x0
+incompatible_features     [63]
+compatible_features       []
+autoclear_features        []
 refcount_order            4
 header_length             104
 
@@ -50,9 +50,9 @@ refcount_table_offset     0x10000
 refcount_table_clusters   1
 nb_snapshots              0
 snapshot_offset           0x0
-incompatible_features     0x0
-compatible_features       0x0
-autoclear_features        0x8000000000000000
+incompatible_features     []
+compatible_features       []
+autoclear_features        [63]
 refcount_order            4
 header_length             104
 
@@ -78,9 +78,9 @@ refcount_table_offset     0x10000
 refcount_table_clusters   1
 nb_snapshots              0
 snapshot_offset           0x0
-incompatible_features     0x0
-compatible_features       0x0
-autoclear_features        0x0
+incompatible_features     []
+compatible_features       []
+autoclear_features        []
 refcount_order            4
 header_length             104
 
diff --git a/tests/qemu-iotests/039.out b/tests/qemu-iotests/039.out
index 2e356d51b6..bdafa3ace3 100644
--- a/tests/qemu-iotests/039.out
+++ b/tests/qemu-iotests/039.out
@@ -4,7 +4,7 @@ QA output created by 039
 Formatting 'TEST_DIR/t.IMGFMT', fmt=IMGFMT size=134217728
 wrote 512/512 bytes at offset 0
 512 bytes, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
-incompatible_features     0x0
+incompatible_features     []
 No errors were found on the image.
 
 == Creating a dirty image file ==
@@ -12,7 +12,7 @@ Formatting 'TEST_DIR/t.IMGFMT', fmt=IMGFMT size=134217728
 wrote 512/512 bytes at offset 0
 512 bytes, X ops; XX:XX:XX.X (XXX 

[PATCH v2 05/21] iotests: Replace IMGOPTS by _unsupported_imgopts

2019-10-15 Thread Max Reitz
Some tests require compat=1.1 and thus set IMGOPTS='compat=1.1'
globally.  That is not how it should be done; instead, they should
simply set _unsupported_imgopts to compat=0.10 (compat=1.1 is the
default anyway).

This makes the tests heed user-specified $IMGOPTS.  Some do not work
with all image options, though, so we need to disable them accordingly.
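
The refcount_bits pattern used for test 060 below is a little dense, so here is a small Python sketch of what it accepts and rejects. This assumes that _unsupported_imgopts simply greps the image options for each pattern and skips the test on a match (the helper itself is not part of this patch), and it spells the grep pattern in Python's regex syntax; is_unsupported is a made-up name.

```python
import re

# Python spelling of the grep pattern
#   refcount_bits=\([^1]\|.\([^6]\|$\)\)
# i.e. "refcount_bits is set to something other than 16".
PATTERNS = [
    re.compile(r"compat=0.10"),
    re.compile(r"refcount_bits=([^1]|.([^6]|$))"),
]


def is_unsupported(imgopts):
    """True if any pattern matches, i.e. the test would be skipped."""
    return any(p.search(imgopts) for p in PATTERNS)


for opts in ("refcount_bits=16", "compat=1.1,refcount_bits=16",
             "refcount_bits=1", "refcount_bits=8", "refcount_bits=64",
             "compat=0.10"):
    print(opts, "->", "skip" if is_unsupported(opts) else "run")
```

Only option strings with refcount_bits=16 (or no refcount_bits at all) and without compat=0.10 are left to run.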

Signed-off-by: Max Reitz 
---
 tests/qemu-iotests/036 | 3 +--
 tests/qemu-iotests/060 | 4 ++--
 tests/qemu-iotests/062 | 3 ++-
 tests/qemu-iotests/066 | 3 ++-
 tests/qemu-iotests/068 | 3 ++-
 tests/qemu-iotests/098 | 4 ++--
 6 files changed, 11 insertions(+), 9 deletions(-)

diff --git a/tests/qemu-iotests/036 b/tests/qemu-iotests/036
index 5f929ad3be..bbaf0ef45b 100755
--- a/tests/qemu-iotests/036
+++ b/tests/qemu-iotests/036
@@ -43,9 +43,8 @@ trap "_cleanup; exit \$status" 0 1 2 3 15
 # This tests qcow2-specific low-level functionality
 _supported_fmt qcow2
 _supported_proto file
-
 # Only qcow2v3 and later supports feature bits
-IMGOPTS="compat=1.1"
+_unsupported_imgopts 'compat=0.10'
 
 echo
 echo === Image with unknown incompatible feature bit ===
diff --git a/tests/qemu-iotests/060 b/tests/qemu-iotests/060
index b91d8321bb..9c2ef42522 100755
--- a/tests/qemu-iotests/060
+++ b/tests/qemu-iotests/060
@@ -48,6 +48,8 @@ _filter_io_error()
 _supported_fmt qcow2
 _supported_proto file
 _supported_os Linux
+# These tests only work for compat=1.1 images with refcount_bits=16
+_unsupported_imgopts 'compat=0.10' 'refcount_bits=\([^1]\|.\([^6]\|$\)\)'
 
 rt_offset=65536  # 0x10000 (XXX: just an assumption)
 rb_offset=131072 # 0x20000 (XXX: just an assumption)
@@ -55,8 +57,6 @@ l1_offset=196608 # 0x30000 (XXX: just an assumption)
 l2_offset=262144 # 0x40000 (XXX: just an assumption)
 l2_offset_after_snapshot=524288 # 0x80000 (XXX: just an assumption)
 
-IMGOPTS="compat=1.1"
-
 OPEN_RW="open -o overlap-check=all $TEST_IMG"
 # Overlap checks are done before write operations only, therefore opening an
 # image read-only makes the overlap-check option irrelevant
diff --git a/tests/qemu-iotests/062 b/tests/qemu-iotests/062
index d5f818fcce..ac0d2a9a3b 100755
--- a/tests/qemu-iotests/062
+++ b/tests/qemu-iotests/062
@@ -40,8 +40,9 @@ trap "_cleanup; exit \$status" 0 1 2 3 15
 # This tests qocw2-specific low-level functionality
 _supported_fmt qcow2
 _supported_proto generic
+# We need zero clusters and snapshots
+_unsupported_imgopts 'compat=0.10' 'refcount_bits=1[^0-9]'
 
-IMGOPTS="compat=1.1"
 IMG_SIZE=64M
 
 echo
diff --git a/tests/qemu-iotests/066 b/tests/qemu-iotests/066
index 28f8c98412..00eb80d89e 100755
--- a/tests/qemu-iotests/066
+++ b/tests/qemu-iotests/066
@@ -39,9 +39,10 @@ trap "_cleanup; exit \$status" 0 1 2 3 15
 # This tests qocw2-specific low-level functionality
 _supported_fmt qcow2
 _supported_proto generic
+# We need zero clusters and snapshots
+_unsupported_imgopts 'compat=0.10' 'refcount_bits=1[^0-9]'
 
 # Intentionally create an unaligned image
-IMGOPTS="compat=1.1"
 IMG_SIZE=$((64 * 1024 * 1024 + 512))
 
 echo
diff --git a/tests/qemu-iotests/068 b/tests/qemu-iotests/068
index 22f5ca3ba6..65650fca9a 100755
--- a/tests/qemu-iotests/068
+++ b/tests/qemu-iotests/068
@@ -39,8 +39,9 @@ trap "_cleanup; exit \$status" 0 1 2 3 15
 # This tests qocw2-specific low-level functionality
 _supported_fmt qcow2
 _supported_proto generic
+# Internal snapshots are (currently) impossible with refcount_bits=1
+_unsupported_imgopts 'compat=0.10' 'refcount_bits=1[^0-9]'
 
-IMGOPTS="compat=1.1"
 IMG_SIZE=128K
 
 case "$QEMU_DEFAULT_MACHINE" in
diff --git a/tests/qemu-iotests/098 b/tests/qemu-iotests/098
index 1c1d1c468f..700068b328 100755
--- a/tests/qemu-iotests/098
+++ b/tests/qemu-iotests/098
@@ -40,8 +40,8 @@ trap "_cleanup; exit \$status" 0 1 2 3 15
 
 _supported_fmt qcow2
 _supported_proto file
-
-IMGOPTS="compat=1.1"
+# The code path we want to test here only works for compat=1.1 images
+_unsupported_imgopts 'compat=0.10'
 
 for event in l1_update empty_image_prepare reftable_update refblock_alloc; do
 
-- 
2.21.0




[PATCH v2 01/21] iotests/qcow2.py: Add dump-header-exts

2019-10-15 Thread Max Reitz
This is useful for tests that want to whitelist fields from dump-header
(with grep) but still print all header extensions.

Signed-off-by: Max Reitz 
---
 tests/qemu-iotests/qcow2.py | 5 +
 1 file changed, 5 insertions(+)

diff --git a/tests/qemu-iotests/qcow2.py b/tests/qemu-iotests/qcow2.py
index b392972d1b..d813b4fc81 100755
--- a/tests/qemu-iotests/qcow2.py
+++ b/tests/qemu-iotests/qcow2.py
@@ -154,6 +154,10 @@ def cmd_dump_header(fd):
     h.dump()
     h.dump_extensions()
 
+def cmd_dump_header_exts(fd):
+    h = QcowHeader(fd)
+    h.dump_extensions()
+
 def cmd_set_header(fd, name, value):
     try:
         value = int(value, 0)
@@ -230,6 +234,7 @@ def cmd_set_feature_bit(fd, group, bit):
 
 cmds = [
     [ 'dump-header',          cmd_dump_header,          0, 'Dump image header and header extensions' ],
+    [ 'dump-header-exts',     cmd_dump_header_exts,     0, 'Dump image header extensions' ],
     [ 'set-header',           cmd_set_header,           2, 'Set a field in the header'],
     [ 'add-header-ext',       cmd_add_header_ext,       2, 'Add a header extension' ],
     [ 'add-header-ext-stdio', cmd_add_header_ext_stdio, 1, 'Add a header extension, data from stdin' ],
-- 
2.21.0




[PATCH v2 00/21] iotests: Allow ./check -o data_file

2019-10-15 Thread Max Reitz
Hi,

The cover letter from v1 (explaining the motivation behind this series
and the general structure) is here:

https://lists.nongnu.org/archive/html/qemu-block/2019-09/msg01323.html


For v2, I’ve tried to address Maxim’s comments:
- Patch 1 through 3: New
- Patch 4: Only print feature bits instead of blacklisting stuff that we
           don’t need
- Patch 5:
  - Fix typo
  - Add comment why 098 needs compat=1.1
- Patch 16: Use _check_test_img
- Patch 17: Use the new _filter_json_filename
- Patch 18: Rethink the incompatible feature filter approach: Instead of
            filtering out the data_file bit, just check whether the
            dirty bit is present (because that is all we want to know)
- Patch 19: Use the new _filter_json_filename
- Patch 20: Rebase conflicts due to the changes to patch 5
- Patch 21:
  - Add and use _get_data_file
  - Add a comment how the data_file_filter in _filter_qemu_img_map works


git-backport-diff against v1:

Key:
[----] : patches are identical
[####] : number of functional differences between upstream/downstream patch
[down] : patch is downstream-only
The flags [FC] indicate (F)unctional and (C)ontextual differences, respectively

001/21:[down] 'iotests/qcow2.py: Add dump-header-exts'
002/21:[down] 'iotests/qcow2.py: Split feature fields into bits'
003/21:[down] 'iotests: Add _filter_json_filename'
004/21:[0060] [FC] 'iotests: Filter refcount_order in 036'
005/21:[0003] [FC] 'iotests: Replace IMGOPTS by _unsupported_imgopts'
006/21:[----] [--] 'iotests: Drop compat=1.1 in 050'
007/21:[----] [--] 'iotests: Let _make_test_img parse its parameters'
008/21:[----] [--] 'iotests: Add -o and --no-opts to _make_test_img'
009/21:[----] [--] 'iotests: Inject space into -ocompat=0.10 in 051'
010/21:[----] [--] 'iotests: Replace IMGOPTS= by -o'
011/21:[----] [--] 'iotests: Replace IMGOPTS='' by --no-opts'
012/21:[----] [--] 'iotests: Drop IMGOPTS use in 267'
013/21:[----] [--] 'iotests: Avoid qemu-img create'
014/21:[----] [--] 'iotests: Use _rm_test_img for deleting test images'
015/21:[----] [--] 'iotests: Avoid cp/mv of test images'
016/21:[0004] [FC] 'iotests: Make 091 work with data_file'
017/21:[0004] [FC] 'iotests: Make 110 work with data_file'
018/21:[0002] [FC] 'iotests: Make 137 work with data_file'
019/21:[0004] [FC] 'iotests: Make 198 work with data_file'
020/21:[0002] [FC] 'iotests: Disable data_file where it cannot be used'
021/21:[0034] [FC] 'iotests: Allow check -o data_file'


Max Reitz (21):
  iotests/qcow2.py: Add dump-header-exts
  iotests/qcow2.py: Split feature fields into bits
  iotests: Add _filter_json_filename
  iotests: Filter refcount_order in 036
  iotests: Replace IMGOPTS by _unsupported_imgopts
  iotests: Drop compat=1.1 in 050
  iotests: Let _make_test_img parse its parameters
  iotests: Add -o and --no-opts to _make_test_img
  iotests: Inject space into -ocompat=0.10 in 051
  iotests: Replace IMGOPTS= by -o
  iotests: Replace IMGOPTS='' by --no-opts
  iotests: Drop IMGOPTS use in 267
  iotests: Avoid qemu-img create
  iotests: Use _rm_test_img for deleting test images
  iotests: Avoid cp/mv of test images
  iotests: Make 091 work with data_file
  iotests: Make 110 work with data_file
  iotests: Make 137 work with data_file
  iotests: Make 198 work with data_file
  iotests: Disable data_file where it cannot be used
  iotests: Allow check -o data_file

 tests/qemu-iotests/005   |  2 +-
 tests/qemu-iotests/007   |  5 ++-
 tests/qemu-iotests/014   |  2 +
 tests/qemu-iotests/015   |  5 ++-
 tests/qemu-iotests/019   |  6 +--
 tests/qemu-iotests/020   |  6 +--
 tests/qemu-iotests/024   | 10 ++---
 tests/qemu-iotests/026   |  5 ++-
 tests/qemu-iotests/028   |  2 +-
 tests/qemu-iotests/029   |  7 ++--
 tests/qemu-iotests/031   |  9 ++--
 tests/qemu-iotests/031.out   | 36 
 tests/qemu-iotests/036   | 15 ---
 tests/qemu-iotests/036.out   | 66 -
 tests/qemu-iotests/039   | 27 +---
 tests/qemu-iotests/039.out   | 22 +-
 tests/qemu-iotests/043   |  4 +-
 tests/qemu-iotests/046   |  2 +
 tests/qemu-iotests/048   |  4 +-
 tests/qemu-iotests/050   |  8 +---
 tests/qemu-iotests/051   |  7 ++--
 tests/qemu-iotests/053   |  4 +-
 tests/qemu-iotests/058   |  7 ++--
 tests/qemu-iotests/059   | 20 -
 tests/qemu-iotests/060   | 12 +++---
 tests/qemu-iotests/060.out   | 20 -
 tests/qemu-iotests/061   | 61 ++-
 tests/qemu-iotests/061.out   | 72 
 tests/qemu-iotests/062   |  3 +-
 tests/qemu-iotests/063   | 18 
 tests/qemu-iotests/063.out   |  3 +-
 tests/qemu-iotests/066   |  3 +-
 tests/qemu-iotests/067   |  6 ++-
 tests/qemu-iotests/068   |  4 +-
 tests/qemu-iotests/069   |  

[PATCH v2 03/21] iotests: Add _filter_json_filename

2019-10-15 Thread Max Reitz
Signed-off-by: Max Reitz 
---
 tests/qemu-iotests/common.filter | 24 
 1 file changed, 24 insertions(+)

diff --git a/tests/qemu-iotests/common.filter b/tests/qemu-iotests/common.filter
index 9f418b4881..63bc6f6f26 100644
--- a/tests/qemu-iotests/common.filter
+++ b/tests/qemu-iotests/common.filter
@@ -227,5 +227,29 @@ _filter_qmp_empty_return()
     grep -v '{"return": {}}'
 }
 
+_filter_json_filename()
+{
+    $PYTHON -c 'import sys
+result, *fnames = sys.stdin.read().split("json:{")
+depth = 0
+for fname in fnames:
+    depth += 1 # For the opening brace in the split separator
+    for chr_i, chr in enumerate(fname):
+        if chr == "{":
+            depth += 1
+        elif chr == "}":
+            depth -= 1
+            if depth == 0:
+                break
+
+    # json:{} filenames may be nested; filter out everything from
+    # inside the outermost one
+    if depth == 0:
+        chr_i += 1 # First character past the filename
+        result += "json:{ /* filtered */ }" + fname[chr_i:]
+
+sys.stdout.write(result)'
+}
+
 # make sure this script returns success
 true
-- 
2.21.0
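
For readers who want to try the brace-matching logic outside the test harness, here is the same idea wrapped in a plain Python function. This is a sketch for experimentation only; the function name and the sample strings are made up, and the patch itself runs the code inline via $PYTHON -c on stdin.

```python
def filter_json_filename(text):
    """Replace the body of every outermost json:{...} filename in text."""
    result, *fnames = text.split("json:{")
    depth = 0
    for fname in fnames:
        depth += 1  # for the opening brace consumed by the split separator
        for chr_i, ch in enumerate(fname):
            if ch == "{":
                depth += 1
            elif ch == "}":
                depth -= 1
                if depth == 0:
                    break

        # json:{} filenames may be nested; only emit output once the
        # outermost one has been closed
        if depth == 0:
            chr_i += 1  # first character past the filename
            result += "json:{ /* filtered */ }" + fname[chr_i:]
    return result


sample = ('image: json:{"driver": "qcow2", '
          '"file": {"driver": "file", "filename": "t.qcow2"}}\n'
          'file format: qcow2\n')
print(filter_json_filename(sample), end="")
# image: json:{ /* filtered */ }
# file format: qcow2
```

Carrying depth across the split pieces is what makes a json:{ nested inside another json:{} filename disappear together with its parent.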




[PATCH v3 1/5] qcow2: Allow writing compressed data of multiple clusters

2019-10-15 Thread Andrey Shinkevich
QEMU currently supports writing compressed data only when its size equals
one cluster. This patch allows writing QCOW2 compressed data that
exceeds one cluster: the buffered data is split into separate clusters,
and each cluster is written compressed using the existing functionality.
To tell the block layer that all data is to be written compressed, we
introduce the 'compress' command line option. When that option is set, the
generic layer aligns the written data to the cluster size.
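
The splitting step can be pictured with a short Python sketch. This only illustrates the idea and is not the code in block/qcow2.c; split_into_clusters is a hypothetical name, and the sketch assumes the request already starts on a cluster boundary, with only the last chunk allowed to be short.

```python
def split_into_clusters(offset, nbytes, cluster_size):
    """Yield (offset, length) pieces of at most one cluster each."""
    if offset % cluster_size:
        raise ValueError("offset must be cluster-aligned")
    end = offset + nbytes
    while offset < end:
        chunk = min(cluster_size, end - offset)
        yield offset, chunk
        offset += chunk


# A 3.5-cluster write with 64 KiB clusters:
print(list(split_into_clusters(0, 229376, 65536)))
# [(0, 65536), (65536, 65536), (131072, 65536), (196608, 32768)]
```

Each piece would then go through the existing single-cluster compressed write path.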

Suggested-by: Pavel Butsykin 
Suggested-by: Vladimir Sementsov-Ogievskiy 
Suggested-by: Roman Kagan 
Signed-off-by: Andrey Shinkevich 
---
 block.c   |  12 +-
 block/io.c|   2 +-
 block/qcow2.c | 106 ++
 block/qcow2.h |   1 +
 blockdev.c|   4 ++
 include/block/block.h |   1 +
 include/block/block_int.h |   2 +
 qapi/block-core.json  |   6 ++-
 qemu-options.hx   |   6 ++-
 9 files changed, 108 insertions(+), 32 deletions(-)

diff --git a/block.c b/block.c
index 5944124..4cfbea2 100644
--- a/block.c
+++ b/block.c
@@ -1418,6 +1418,11 @@ QemuOptsList bdrv_runtime_opts = {
             .type = QEMU_OPT_BOOL,
             .help = "always accept other writers (default: off)",
         },
+        {
+            .name = BDRV_OPT_COMPRESS,
+            .type = QEMU_OPT_BOOL,
+            .help = "compress all writes to the image (default: off)",
+        },
         { /* end of list */ }
     },
 };
@@ -2983,6 +2988,11 @@ static BlockDriverState *bdrv_open_inherit(const char *filename,
         flags &= ~BDRV_O_RDWR;
     }
 
+    if (!g_strcmp0(qdict_get_try_str(options, BDRV_OPT_COMPRESS), "on") ||
+        qdict_get_try_bool(options, BDRV_OPT_COMPRESS, false)) {
+        bs->all_write_compressed = true;
+    }
+
     if (flags & BDRV_O_SNAPSHOT) {
         snapshot_options = qdict_new();
         bdrv_temp_snapshot_options(&snapshot_flags, snapshot_options,
@@ -3208,7 +3218,7 @@ static int bdrv_reset_options_allowed(BlockDriverState *bs,
      * in bdrv_reopen_prepare() so they can be left out of @new_opts */
     const char *const common_options[] = {
         "node-name", "discard", "cache.direct", "cache.no-flush",
-        "read-only", "auto-read-only", "detect-zeroes", NULL
+        "read-only", "auto-read-only", "detect-zeroes", "compress", NULL
     };
 
     for (e = qdict_first(bs->options); e; e = qdict_next(bs->options, e)) {
diff --git a/block/io.c b/block/io.c
index f8c3596..6a5509c 100644
--- a/block/io.c
+++ b/block/io.c
@@ -1922,7 +1922,7 @@ static int coroutine_fn bdrv_aligned_pwritev(BdrvChild *child,
     } else if (flags & BDRV_REQ_ZERO_WRITE) {
         bdrv_debug_event(bs, BLKDBG_PWRITEV_ZERO);
         ret = bdrv_co_do_pwrite_zeroes(bs, offset, bytes, flags);
-    } else if (flags & BDRV_REQ_WRITE_COMPRESSED) {
+    } else if (flags & BDRV_REQ_WRITE_COMPRESSED || bs->all_write_compressed) {
         ret = bdrv_driver_pwritev_compressed(bs, offset, bytes,
                                              qiov, qiov_offset);
     } else if (bytes <= max_transfer) {
diff --git a/block/qcow2.c b/block/qcow2.c
index 7961c05..9a85d73 100644
--- a/block/qcow2.c
+++ b/block/qcow2.c
@@ -1787,6 +1787,10 @@ static void qcow2_refresh_limits(BlockDriverState *bs, Error **errp)
         /* Encryption works on a sector granularity */
         bs->bl.request_alignment = qcrypto_block_get_sector_size(s->crypto);
     }
+    if (bs->all_write_compressed) {
+        bs->bl.request_alignment = MAX(bs->bl.request_alignment,
+                                       s->cluster_size);
+    }
     bs->bl.pwrite_zeroes_alignment = s->cluster_size;
     bs->bl.pdiscard_alignment = s->cluster_size;
 }
@@ -4152,10 +4156,8 @@ fail:
     return ret;
 }
 
-/* XXX: put compressed sectors first, then all the cluster aligned
-   tables to avoid losing bytes in alignment */
 static coroutine_fn int
-qcow2_co_pwritev_compressed_part(BlockDriverState *bs,
+qcow2_co_pwritev_compressed_task(BlockDriverState *bs,
                                  uint64_t offset, uint64_t bytes,
                                  QEMUIOVector *qiov, size_t qiov_offset)
 {
@@ -4165,32 +4167,11 @@ qcow2_co_pwritev_compressed_part(BlockDriverState *bs,
     uint8_t *buf, *out_buf;
     uint64_t cluster_offset;
 
-    if (has_data_file(bs)) {
-        return -ENOTSUP;
-    }
-
-    if (bytes == 0) {
-        /* align end of file to a sector boundary to ease reading with
-           sector based I/Os */
-        int64_t len = bdrv_getlength(bs->file->bs);
-        if (len < 0) {
-            return len;
-        }
-        return bdrv_co_truncate(bs->file, len, PREALLOC_MODE_OFF, NULL);
-    }
-
-    if (offset_into_cluster(s, offset)) {
-        return -EINVAL;
-    }
+    assert(bytes == s->cluster_size || (bytes < s->cluster_size &&
+           (offset + bytes == bs->total_sectors << BDRV_SECTOR_BITS)));
 
     buf = qemu_blockalign(bs, s->cluster_size);
- 

[PATCH v3 2/5] tests/qemu-iotests: add case to write compressed data of multiple clusters

2019-10-15 Thread Andrey Shinkevich
Add a test case to iotest 214 that checks that compressed data spanning
more than one cluster can be written.
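
The sizes in the expected output can be cross-checked with a few lines of Python. This assumes a cluster size of 0x10000 (64 KiB), which is what the 524288-byte uncompressed write of eight clusters implies; the variable names are made up for the example.

```python
cluster_size = 0x10000  # 64 KiB, assumed from the 8-cluster write below

uncompressed_size = 8 * cluster_size                      # first write
compressed_size = 3 * cluster_size + (cluster_size >> 1)  # 3.5 clusters
second_offset = 4 * cluster_size                          # second compressed write

print(uncompressed_size)  # 524288
print(compressed_size)    # 229376
print(second_offset)      # 262144
```

These match the "wrote 524288/524288", "wrote 229376/229376" and "at offset 262144" lines in 214.out.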

Signed-off-by: Andrey Shinkevich 
---
 tests/qemu-iotests/214 | 35 +++
 tests/qemu-iotests/214.out | 15 +++
 2 files changed, 50 insertions(+)

diff --git a/tests/qemu-iotests/214 b/tests/qemu-iotests/214
index 21ec8a2..0003dc2 100755
--- a/tests/qemu-iotests/214
+++ b/tests/qemu-iotests/214
@@ -89,6 +89,41 @@ _check_test_img -r all
 $QEMU_IO -c "read  -P 0x11  0 4M" "$TEST_IMG" 2>&1 | _filter_qemu_io | _filter_testdir
 $QEMU_IO -c "read  -P 0x22 4M 4M" "$TEST_IMG" 2>&1 | _filter_qemu_io | _filter_testdir
 
+echo
+echo "=== Write compressed data of multiple clusters ==="
+echo
+cluster_size=0x10000
+_make_test_img 2M -o cluster_size=$cluster_size
+
+echo "Uncompressed data:"
+let data_size="8 * $cluster_size"
+$QEMU_IO -c "write -P 0xaa 0 $data_size" "$TEST_IMG" \
+ 2>&1 | _filter_qemu_io | _filter_testdir
+$QEMU_IMG info "$TEST_IMG" | sed -n '/disk size:/ s/^ *//p'
+
+_make_test_img 2M -o cluster_size=$cluster_size
+let data_size="3 * $cluster_size + ($cluster_size >> 1)"
+# Set compress=on. That will align the written data
+# by the cluster size and will write them compressed.
+QEMU_IO_OPTIONS=$QEMU_IO_OPTIONS_NO_FMT \
+$QEMU_IO -c "write -P 0xbb 0 $data_size" --image-opts \
+ driver=$IMGFMT,compress=on,file.filename=$TEST_IMG \
+ 2>&1 | _filter_qemu_io | _filter_testdir
+
+let offset="4 * $cluster_size"
+QEMU_IO_OPTIONS=$QEMU_IO_OPTIONS_NO_FMT \
+$QEMU_IO -c "write -P 0xcc $offset $data_size" "json:{\
+'driver': '$IMGFMT',
+'file': {
+'driver': 'file',
+'filename': '$TEST_IMG'
+},
+'compress': true
+}" | _filter_qemu_io | _filter_testdir
+
+echo "After the multiple cluster data have been written compressed,"
+$QEMU_IMG info "$TEST_IMG" | sed -n '/disk size:/ s/^ *//p'
+
 # success, all done
 echo '*** done'
 rm -f $seq.full
diff --git a/tests/qemu-iotests/214.out b/tests/qemu-iotests/214.out
index 0fcd8dc..09a2e9a 100644
--- a/tests/qemu-iotests/214.out
+++ b/tests/qemu-iotests/214.out
@@ -32,4 +32,19 @@ read 4194304/4194304 bytes at offset 0
 4 MiB, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
 read 4194304/4194304 bytes at offset 4194304
 4 MiB, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
+
+=== Write compressed data of multiple clusters ===
+
+Formatting 'TEST_DIR/t.IMGFMT', fmt=IMGFMT size=2097152
+Uncompressed data:
+wrote 524288/524288 bytes at offset 0
+512 KiB, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
+disk size: 772 KiB
+Formatting 'TEST_DIR/t.IMGFMT', fmt=IMGFMT size=2097152
+wrote 229376/229376 bytes at offset 0
+224 KiB, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
+wrote 229376/229376 bytes at offset 262144
+224 KiB, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
+After the multiple cluster data have been written compressed,
+disk size: 268 KiB
 *** done
-- 
1.8.3.1




[PATCH v2 3/3] tests: More iotest 223 improvements

2019-10-15 Thread Eric Blake
Run the core of the test twice, once without iothreads, and again
with, for more coverage of both setups.

Suggested-by: Nir Soffer 
Signed-off-by: Eric Blake 
---
 tests/qemu-iotests/223 | 16 ++-
 tests/qemu-iotests/223.out | 85 +-
 2 files changed, 97 insertions(+), 4 deletions(-)

diff --git a/tests/qemu-iotests/223 b/tests/qemu-iotests/223
index 2ba3d8124b4f..8b43ddb02b2c 100755
--- a/tests/qemu-iotests/223
+++ b/tests/qemu-iotests/223
@@ -117,10 +117,19 @@ _send_qemu_cmd $QEMU_HANDLE 
'{"execute":"qmp_capabilities"}' "return"
 _send_qemu_cmd $QEMU_HANDLE '{"execute":"blockdev-add",
   "arguments":{"driver":"qcow2", "node-name":"n",
 "file":{"driver":"file", "filename":"'"$TEST_IMG"'"}}}' "return"
-_send_qemu_cmd $QEMU_HANDLE '{"execute":"x-blockdev-set-iothread",
-  "arguments":{"node-name":"n", "iothread":"io0"}}' "return"
 _send_qemu_cmd $QEMU_HANDLE '{"execute":"block-dirty-bitmap-disable",
   "arguments":{"node":"n", "name":"b"}}' "return"
+
+for attempt in normal iothread; do
+
+echo
+echo "=== Set up NBD with $attempt access ==="
+echo
+if [ $attempt = iothread ]; then
+_send_qemu_cmd $QEMU_HANDLE '{"execute":"x-blockdev-set-iothread",
+  "arguments":{"node-name":"n", "iothread":"io0"}}' "return"
+fi
+
 _send_qemu_cmd $QEMU_HANDLE '{"execute":"nbd-server-add",
   "arguments":{"device":"n"}}' "error" # Attempt add without server
 _send_qemu_cmd $QEMU_HANDLE '{"execute":"nbd-server-start",
@@ -180,6 +189,9 @@ _send_qemu_cmd $QEMU_HANDLE '{"execute":"nbd-server-remove",
   "arguments":{"name":"n2"}}' "error" # Attempt duplicate clean
 _send_qemu_cmd $QEMU_HANDLE '{"execute":"nbd-server-stop"}' "return"
 _send_qemu_cmd $QEMU_HANDLE '{"execute":"nbd-server-stop"}' "error" # Again
+
+done
+
 _send_qemu_cmd $QEMU_HANDLE '{"execute":"quit"}' "return"
 wait=yes _cleanup_qemu

diff --git a/tests/qemu-iotests/223.out b/tests/qemu-iotests/223.out
index 8bfc5072ea9d..ed543047956f 100644
--- a/tests/qemu-iotests/223.out
+++ b/tests/qemu-iotests/223.out
@@ -28,10 +28,91 @@ wrote 2097152/2097152 bytes at offset 2097152
 {"return": {}}
 {"execute":"blockdev-add", "arguments":{"driver":"qcow2", "node-name":"n", 
"file":{"driver":"file", "filename":"TEST_DIR/t.qcow2"}}}
 {"return": {}}
-{"execute":"x-blockdev-set-iothread", "arguments":{"node-name":"n", 
"iothread":"io0"}}
-{"return": {}}
 {"execute":"block-dirty-bitmap-disable", "arguments":{"node":"n", "name":"b"}}
 {"return": {}}
+
+=== Set up NBD with normal access ===
+
+{"execute":"nbd-server-add", "arguments":{"device":"n"}}
+{"error": {"class": "GenericError", "desc": "NBD server not running"}}
+{"execute":"nbd-server-start", "arguments":{"addr":{"type":"unix", 
"data":{"path":"TEST_DIR/nbd"
+{"return": {}}
+{"execute":"nbd-server-start", "arguments":{"addr":{"type":"unix", 
"data":{"path":"TEST_DIR/nbd1"
+{"error": {"class": "GenericError", "desc": "NBD server already running"}}
+exports available: 0
+{"execute":"nbd-server-add", "arguments":{"device":"n", "bitmap":"b"}}
+{"return": {}}
+{"execute":"nbd-server-add", "arguments":{"device":"nosuch"}}
+{"error": {"class": "GenericError", "desc": "Cannot find device=nosuch nor 
node_name=nosuch"}}
+{"execute":"nbd-server-add", "arguments":{"device":"n"}}
+{"error": {"class": "GenericError", "desc": "NBD server already has export 
named 'n'"}}
+{"execute":"nbd-server-add", "arguments":{"device":"n", "name":"n2", 
"bitmap":"b2"}}
+{"error": {"class": "GenericError", "desc": "Enabled bitmap 'b2' incompatible 
with readonly export"}}
+{"execute":"nbd-server-add", "arguments":{"device":"n", "name":"n2", 
"bitmap":"b3"}}
+{"error": {"class": "GenericError", "desc": "Bitmap 'b3' is not found"}}
+{"execute":"nbd-server-add", "arguments":{"device":"n", "name":"n2", 
"writable":true, "bitmap":"b2"}}
+{"return": {}}
+exports available: 2
+ export: 'n'
+  size:  4194304
+  flags: 0x58f ( readonly flush fua df multi cache )
+  min block: 1
+  opt block: 4096
+  max block: 33554432
+  available meta contexts: 2
+   base:allocation
+   qemu:dirty-bitmap:b
+ export: 'n2'
+  size:  4194304
+  flags: 0xced ( flush fua trim zeroes df cache fast-zero )
+  min block: 1
+  opt block: 4096
+  max block: 33554432
+  available meta contexts: 2
+   base:allocation
+   qemu:dirty-bitmap:b2
+
+=== Contrast normal status to large granularity dirty-bitmap ===
+
+read 512/512 bytes at offset 512
+512 bytes, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
+read 524288/524288 bytes at offset 524288
+512 KiB, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
+read 1048576/1048576 bytes at offset 1048576
+1 MiB, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
+read 2097152/2097152 bytes at offset 2097152
+2 MiB, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
+[{ "start": 0, "length": 4096, "depth": 0, "zero": false, "data": true, 
"offset": OFFSET},
+{ "start": 4096, "length": 1044480, "depth": 0, "zero": true, "data": false, 
"offset": OFFSET},
+{ "start": 1048576, "length": 3145728, 

[PATCH v2 2/3] iotests: Include QMP input in .out files

2019-10-15 Thread Eric Blake
We generally include relevant HMP input in .out files, by virtue of
the fact that HMP echoes its input.  But QMP does not, so we have to
explicitly inject it in the output stream, in order to make it easier
to read .out files to see what behavior is being tested (especially
true where the output file is a sequence of {'return': {}}).

Suggested-by: Max Reitz 
Signed-off-by: Eric Blake 
---
 tests/qemu-iotests/common.qemu |  9 
 tests/qemu-iotests/085.out | 26 ++
 tests/qemu-iotests/094.out |  4 ++
 tests/qemu-iotests/095.out |  2 +
 tests/qemu-iotests/109.out | 88 ++
 tests/qemu-iotests/117.out |  5 ++
 tests/qemu-iotests/127.out |  4 ++
 tests/qemu-iotests/140.out |  5 ++
 tests/qemu-iotests/141.out | 26 ++
 tests/qemu-iotests/143.out |  3 ++
 tests/qemu-iotests/144.out |  5 ++
 tests/qemu-iotests/153.out | 11 +
 tests/qemu-iotests/156.out | 11 +
 tests/qemu-iotests/161.out |  8 
 tests/qemu-iotests/173.out |  4 ++
 tests/qemu-iotests/182.out |  8 
 tests/qemu-iotests/183.out | 11 +
 tests/qemu-iotests/185.out | 18 +++
 tests/qemu-iotests/191.out |  8 
 tests/qemu-iotests/200.out |  1 +
 tests/qemu-iotests/223.out | 19 
 tests/qemu-iotests/229.out |  3 ++
 tests/qemu-iotests/249.out |  6 +++
 23 files changed, 285 insertions(+)

diff --git a/tests/qemu-iotests/common.qemu b/tests/qemu-iotests/common.qemu
index 8d2021a7eb0c..abc231743e82 100644
--- a/tests/qemu-iotests/common.qemu
+++ b/tests/qemu-iotests/common.qemu
@@ -123,6 +123,9 @@ _timed_wait_for()
 # until either timeout, or a response.  If it is not set, or <=0,
 # then the command is only sent once.
 #
+# If neither $silent nor $mismatch_only is set, and $cmd begins with '{',
+# echo the command before sending it the first time.
+#
 # If $qemu_error_no_exit is set, then even if the expected response
 # is not seen, we will not exit.  $QEMU_STATUS[$1] will be set it -1 in
 # that case.
@@ -152,6 +155,12 @@ _send_qemu_cmd()
 shift $(($# - 2))
 fi

+# Display QMP being sent, but not HMP (since HMP already echoes its
+# input back to output); decide based on leading '{'
+if [ -z "$silent" ] && [ -z "$mismatch_only" ] &&
+[ "$cmd" != "${cmd#{}" ]; then
+echo "${cmd}" | _filter_testdir
+fi
 while [ ${count} -gt 0 ]
 do
 echo "${cmd}" >&${QEMU_IN[${h}]}
diff --git a/tests/qemu-iotests/085.out b/tests/qemu-iotests/085.out
index 2a5f256cd3ec..e92f125b63c4 100644
--- a/tests/qemu-iotests/085.out
+++ b/tests/qemu-iotests/085.out
@@ -7,48 +7,61 @@ Formatting 'TEST_DIR/t.IMGFMT', fmt=IMGFMT size=134217728

 === Sending capabilities ===

+{ 'execute': 'qmp_capabilities' }
 {"return": {}}

 === Create a single snapshot on virtio0 ===

+{ 'execute': 'blockdev-snapshot-sync', 'arguments': { 'device': 'virtio0', 
'snapshot-file':'TEST_DIR/1-snapshot-v0.qcow2', 'format': 'qcow2' } }
 Formatting 'TEST_DIR/1-snapshot-v0.qcow2', fmt=qcow2 size=134217728 
backing_file=TEST_DIR/t.qcow2.1 backing_fmt=qcow2 cluster_size=65536 
lazy_refcounts=off refcount_bits=16
 {"return": {}}

 === Invalid command - missing device and nodename ===

+{ 'execute': 'blockdev-snapshot-sync', 'arguments': { 
'snapshot-file':'TEST_DIR/1-snapshot-v0.qcow2', 'format': 'qcow2' } }
 {"error": {"class": "GenericError", "desc": "Cannot find device= nor 
node_name="}}

 === Invalid command - missing snapshot-file ===

+{ 'execute': 'blockdev-snapshot-sync', 'arguments': { 'device': 'virtio0', 
'format': 'qcow2' } }
 {"error": {"class": "GenericError", "desc": "Parameter 'snapshot-file' is 
missing"}}


 === Create several transactional group snapshots ===

+{ 'execute': 'transaction', 'arguments': {'actions': [ { 'type': 
'blockdev-snapshot-sync', 'data' : { 'device': 'virtio0', 'snapshot-file': 
'TEST_DIR/2-snapshot-v0.qcow2' } }, { 'type': 'blockdev-snapshot-sync', 'data' 
: { 'device': 'virtio1', 'snapshot-file': 'TEST_DIR/2-snapshot-v1.qcow2' } } ] 
} }
 Formatting 'TEST_DIR/2-snapshot-v0.qcow2', fmt=qcow2 size=134217728 
backing_file=TEST_DIR/1-snapshot-v0.qcow2 backing_fmt=qcow2 cluster_size=65536 
lazy_refcounts=off refcount_bits=16
 Formatting 'TEST_DIR/2-snapshot-v1.qcow2', fmt=qcow2 size=134217728 
backing_file=TEST_DIR/t.qcow2.2 backing_fmt=qcow2 cluster_size=65536 
lazy_refcounts=off refcount_bits=16
 {"return": {}}
+{ 'execute': 'transaction', 'arguments': {'actions': [ { 'type': 
'blockdev-snapshot-sync', 'data' : { 'device': 'virtio0', 'snapshot-file': 
'TEST_DIR/3-snapshot-v0.qcow2' } }, { 'type': 'blockdev-snapshot-sync', 'data' 
: { 'device': 'virtio1', 'snapshot-file': 'TEST_DIR/3-snapshot-v1.qcow2' } } ] 
} }
 Formatting 'TEST_DIR/3-snapshot-v0.qcow2', fmt=qcow2 size=134217728 
backing_file=TEST_DIR/2-snapshot-v0.qcow2 backing_fmt=qcow2 cluster_size=65536 
lazy_refcounts=off refcount_bits=16
 Formatting 'TEST_DIR/3-snapshot-v1.qcow2', fmt=qcow2 

[PATCH v3 0/5] qcow2: advanced compression options

2019-10-15 Thread Andrey Shinkevich
New enhancements for writing compressed data to QCOW2 images.

The preceding patches have been queued in Max's block branch:

Based-on: <20190916175324.18478-1-vsement...@virtuozzo.com>

v2:
Instead of introducing multiple key options for many drivers, the
'compress' option has been introduced in the generic block layer,
as suggested by Roman Kagan. Discussed in the thread with ID
<1570026166-748566-1-git-send-email-andrey.shinkev...@virtuozzo.com>

Andrey Shinkevich (5):
  qcow2: Allow writing compressed data of multiple clusters
  tests/qemu-iotests: add case to write compressed data of multiple
clusters
  block: support compressed write for copy-on-read
  block-stream: add compress option
  tests/qemu-iotests: add case for block-stream compress

 block.c|  12 -
 block/io.c |  23 +++---
 block/qcow2.c  | 106 +
 block/qcow2.h  |   1 +
 block/stream.c |  10 -
 block/trace-events |   2 +-
 blockdev.c |  16 ++-
 include/block/block.h  |   1 +
 include/block/block_int.h  |   2 +
 qapi/block-core.json   |   6 ++-
 qemu-options.hx|   6 ++-
 tests/qemu-iotests/030 |  51 +-
 tests/qemu-iotests/030.out |   4 +-
 tests/qemu-iotests/214 |  35 +++
 tests/qemu-iotests/214.out |  15 +++
 15 files changed, 246 insertions(+), 44 deletions(-)

-- 
1.8.3.1




[PATCH v3 5/5] tests/qemu-iotests: add case for block-stream compress

2019-10-15 Thread Andrey Shinkevich
Add a case to iotest 030 that tests the 'compress' option for a
block-stream job.

Signed-off-by: Andrey Shinkevich 
---
 tests/qemu-iotests/030 | 51 +-
 tests/qemu-iotests/030.out |  4 ++--
 2 files changed, 52 insertions(+), 3 deletions(-)

diff --git a/tests/qemu-iotests/030 b/tests/qemu-iotests/030
index f3766f2..f0f0e26 100755
--- a/tests/qemu-iotests/030
+++ b/tests/qemu-iotests/030
@@ -21,7 +21,8 @@
 import time
 import os
 import iotests
-from iotests import qemu_img, qemu_io
+from iotests import qemu_img, qemu_io, qemu_img_pipe
+import json
 
 backing_img = os.path.join(iotests.test_dir, 'backing.img')
 mid_img = os.path.join(iotests.test_dir, 'mid.img')
@@ -956,6 +957,54 @@ class TestSetSpeed(iotests.QMPTestCase):
 
 self.cancel_and_wait(resume=True)
 
+class TestCompressed(iotests.QMPTestCase):
+test_img_init_size = 0
+
+def setUp(self):
+qemu_img('create', '-f', iotests.imgfmt, backing_img, '1M')
+qemu_img('create', '-f', iotests.imgfmt, '-o',
+ 'backing_file=%s' % backing_img, mid_img)
+qemu_img('create', '-f', iotests.imgfmt, '-o',
+ 'backing_file=%s' % mid_img, test_img)
+qemu_io('-c', 'write -P 0x1 0 512k', backing_img)
+top = json.loads(qemu_img_pipe('info', '--output=json', test_img))
+self.test_img_init_size = top['actual-size']
+self.vm = iotests.VM().add_drive(test_img, "backing.node-name=mid," +
+ "backing.backing.node-name=base," +
+ "compress=on")
+self.vm.launch()
+
+def tearDown(self):
+self.vm.shutdown()
+os.remove(test_img)
+os.remove(mid_img)
+os.remove(backing_img)
+
+def test_stream_compress(self):
+self.assert_no_active_block_jobs()
+
+result = self.vm.qmp('block-stream', device='mid', job_id='stream-mid')
+self.assert_qmp(result, 'return', {})
+
+self.wait_until_completed(drive='stream-mid')
+# Remove other 'JOB_STATUS_CHANGE' events for the job 'stream-mid'
+self.vm.get_qmp_events(wait=True)
+
+result = self.vm.qmp('block-stream', device='drive0',
+ job_id='stream-top')
+self.assert_qmp(result, 'return', {})
+
+self.wait_until_completed(drive='stream-top')
+self.vm.shutdown()
+
+top = json.loads(qemu_img_pipe('info', '--output=json', test_img))
+mid = json.loads(qemu_img_pipe('info', '--output=json', mid_img))
+base = json.loads(qemu_img_pipe('info', '--output=json', backing_img))
+
+self.assertEqual(mid['actual-size'], base['actual-size'])
+self.assertLess(top['actual-size'], mid['actual-size'])
+self.assertLess(self.test_img_init_size, top['actual-size'])
+
 if __name__ == '__main__':
 iotests.main(supported_fmts=['qcow2', 'qed'],
  supported_protocols=['file'])
diff --git a/tests/qemu-iotests/030.out b/tests/qemu-iotests/030.out
index 6d9bee1..af8dac1 100644
--- a/tests/qemu-iotests/030.out
+++ b/tests/qemu-iotests/030.out
@@ -1,5 +1,5 @@
-...
+
 --
-Ran 27 tests
+Ran 28 tests
 
 OK
-- 
1.8.3.1




[PATCH v3 4/5] block-stream: add compress option

2019-10-15 Thread Andrey Shinkevich
Allow data compression during a block-stream job over a backup backing chain.

Signed-off-by: Andrey Shinkevich 
---
 block/stream.c | 10 --
 blockdev.c | 12 +++-
 2 files changed, 19 insertions(+), 3 deletions(-)

diff --git a/block/stream.c b/block/stream.c
index 5562ccb..25f9324 100644
--- a/block/stream.c
+++ b/block/stream.c
@@ -41,10 +41,16 @@ typedef struct StreamBlockJob {
 static int coroutine_fn stream_populate(BlockBackend *blk,
 int64_t offset, uint64_t bytes)
 {
+BlockDriverState *bs = blk_bs(blk);
+int flags = BDRV_REQ_COPY_ON_READ | BDRV_REQ_PREFETCH;
+
+if (bs->all_write_compressed) {
+flags |= BDRV_REQ_WRITE_COMPRESSED;
+}
+
 assert(bytes < SIZE_MAX);
 
-return blk_co_preadv(blk, offset, bytes, NULL,
- BDRV_REQ_COPY_ON_READ | BDRV_REQ_PREFETCH);
+return blk_co_preadv(blk, offset, bytes, NULL, flags);
 }
 
 static void stream_abort(Job *job)
diff --git a/blockdev.c b/blockdev.c
index 2103730..fd824da 100644
--- a/blockdev.c
+++ b/blockdev.c
@@ -471,7 +471,7 @@ static BlockBackend *blockdev_init(const char *file, QDict 
*bs_opts,
 int bdrv_flags = 0;
 int on_read_error, on_write_error;
 bool account_invalid, account_failed;
-bool writethrough, read_only;
+bool writethrough, read_only, compress;
 BlockBackend *blk;
 BlockDriverState *bs;
 ThrottleConfig cfg;
@@ -570,6 +570,7 @@ static BlockBackend *blockdev_init(const char *file, QDict 
*bs_opts,
 }
 
 read_only = qemu_opt_get_bool(opts, BDRV_OPT_READ_ONLY, false);
+compress = qemu_opt_get_bool(opts, BDRV_OPT_COMPRESS, false);
 
 /* init */
 if ((!file || !*file) && !qdict_size(bs_opts)) {
@@ -595,6 +596,8 @@ static BlockBackend *blockdev_init(const char *file, QDict 
*bs_opts,
 qdict_set_default_str(bs_opts, BDRV_OPT_READ_ONLY,
   read_only ? "on" : "off");
 qdict_set_default_str(bs_opts, BDRV_OPT_AUTO_READ_ONLY, "on");
+qdict_set_default_str(bs_opts, BDRV_OPT_COMPRESS,
+  compress ? "on" : "off");
 assert((bdrv_flags & BDRV_O_CACHE_MASK) == 0);
 
 if (runstate_check(RUN_STATE_INMIGRATE)) {
@@ -3308,6 +3311,13 @@ void qmp_block_stream(bool has_job_id, const char 
*job_id, const char *device,
 goto out;
 }
 
+if (bs->all_write_compressed &&
+bs->drv->bdrv_co_pwritev_compressed_part == NULL) {
+error_setg(errp, "Compression is not supported for this drive %s",
+   bdrv_get_device_name(bs));
+goto out;
+}
+
 /* backing_file string overrides base bs filename */
 base_name = has_backing_file ? backing_file : base_name;
 
-- 
1.8.3.1




[PATCH v3 3/5] block: support compressed write for copy-on-read

2019-10-15 Thread Andrey Shinkevich
Support data compression during a block-stream job over a backup
backing chain, as implemented in the following patch, 'block-stream:
add compress option'.

Signed-off-by: Anton Nefedov 
Signed-off-by: Denis V. Lunev 
Signed-off-by: Andrey Shinkevich 
---
 block/io.c | 21 -
 block/trace-events |  2 +-
 2 files changed, 17 insertions(+), 6 deletions(-)

diff --git a/block/io.c b/block/io.c
index 6a5509c..fc7f157 100644
--- a/block/io.c
+++ b/block/io.c
@@ -1264,12 +1264,13 @@ static int coroutine_fn 
bdrv_co_do_copy_on_readv(BdrvChild *child,
  * allocating cluster in the image file.  Note that this value may exceed
  * BDRV_REQUEST_MAX_BYTES (even when the original read did not), which
  * is one reason we loop rather than doing it all at once.
+ * Also, this is crucial for compressed copy-on-read.
  */
 bdrv_round_to_clusters(bs, offset, bytes, &cluster_offset, &cluster_bytes);
 skip_bytes = offset - cluster_offset;
 
 trace_bdrv_co_do_copy_on_readv(bs, offset, bytes,
-   cluster_offset, cluster_bytes);
+   cluster_offset, cluster_bytes, flags);
 
 while (cluster_bytes) {
 int64_t pnum;
@@ -1328,9 +1329,15 @@ static int coroutine_fn 
bdrv_co_do_copy_on_readv(BdrvChild *child,
 /* This does not change the data on the disk, it is not
  * necessary to flush even in cache=writethrough mode.
  */
-ret = bdrv_driver_pwritev(bs, cluster_offset, pnum,
-  &local_qiov, 0,
-  BDRV_REQ_WRITE_UNCHANGED);
+if (flags & BDRV_REQ_WRITE_COMPRESSED) {
+ret = bdrv_driver_pwritev_compressed(bs, cluster_offset,
+ pnum, &local_qiov,
+ qiov_offset);
+} else {
+ret = bdrv_driver_pwritev(bs, cluster_offset, pnum,
+  &local_qiov, 0,
+  BDRV_REQ_WRITE_UNCHANGED);
+}
 }
 
 if (ret < 0) {
@@ -1396,7 +1403,11 @@ static int coroutine_fn bdrv_aligned_preadv(BdrvChild 
*child,
  * to pass through to drivers.  For now, there aren't any
  * passthrough flags.  */
 assert(!(flags & ~(BDRV_REQ_NO_SERIALISING | BDRV_REQ_COPY_ON_READ |
-   BDRV_REQ_PREFETCH)));
+   BDRV_REQ_PREFETCH | BDRV_REQ_WRITE_COMPRESSED)));
+
+/* write compressed only makes sense with copy on read */
+assert(!(flags & BDRV_REQ_WRITE_COMPRESSED) ||
+   (flags & BDRV_REQ_COPY_ON_READ));
 
 /* Handle Copy on Read and associated serialisation */
 if (flags & BDRV_REQ_COPY_ON_READ) {
diff --git a/block/trace-events b/block/trace-events
index 3aa27e6..f444548 100644
--- a/block/trace-events
+++ b/block/trace-events
@@ -14,7 +14,7 @@ blk_root_detach(void *child, void *blk, void *bs) "child %p 
blk %p bs %p"
 bdrv_co_preadv(void *bs, int64_t offset, int64_t nbytes, unsigned int flags) 
"bs %p offset %"PRId64" nbytes %"PRId64" flags 0x%x"
 bdrv_co_pwritev(void *bs, int64_t offset, int64_t nbytes, unsigned int flags) 
"bs %p offset %"PRId64" nbytes %"PRId64" flags 0x%x"
 bdrv_co_pwrite_zeroes(void *bs, int64_t offset, int count, int flags) "bs %p 
offset %"PRId64" count %d flags 0x%x"
-bdrv_co_do_copy_on_readv(void *bs, int64_t offset, unsigned int bytes, int64_t 
cluster_offset, int64_t cluster_bytes) "bs %p offset %"PRId64" bytes %u 
cluster_offset %"PRId64" cluster_bytes %"PRId64
+bdrv_co_do_copy_on_readv(void *bs, int64_t offset, unsigned int bytes, int64_t 
cluster_offset, int64_t cluster_bytes, int flags) "bs %p offset %"PRId64" bytes 
%u cluster_offset %"PRId64" cluster_bytes %"PRId64" flags 0x%x"
 bdrv_co_copy_range_from(void *src, uint64_t src_offset, void *dst, uint64_t 
dst_offset, uint64_t bytes, int read_flags, int write_flags) "src %p offset 
%"PRIu64" dst %p offset %"PRIu64" bytes %"PRIu64" rw flags 0x%x 0x%x"
 bdrv_co_copy_range_to(void *src, uint64_t src_offset, void *dst, uint64_t 
dst_offset, uint64_t bytes, int read_flags, int write_flags) "src %p offset 
%"PRIu64" dst %p offset %"PRIu64" bytes %"PRIu64" rw flags 0x%x 0x%x"
 
-- 
1.8.3.1




[PATCH v2 0/3] tests: More iotest 223 improvements

2019-10-15 Thread Eric Blake
[subject line kept for continuity with v1, but now touches much more]

Max suggested that, instead of special-casing just 223 to trace QMP
input as well as output, we should patch common.qemu to do
it for all tests.  That in turn revealed that test 173 has been broken
since v3.0.  Max also suggested that 223 use a for loop rather than
massive code duplication, which does indeed look nicer.

Eric Blake (3):
  iotests: Fix 173
  iotests: Include QMP input in .out files
  tests: More iotest 223 improvements

 tests/qemu-iotests/common.qemu |   9 +++
 tests/qemu-iotests/085.out |  26 +
 tests/qemu-iotests/094.out |   4 ++
 tests/qemu-iotests/095.out |   2 +
 tests/qemu-iotests/109.out |  88 +
 tests/qemu-iotests/117.out |   5 ++
 tests/qemu-iotests/127.out |   4 ++
 tests/qemu-iotests/140.out |   5 ++
 tests/qemu-iotests/141.out |  26 +
 tests/qemu-iotests/143.out |   3 +
 tests/qemu-iotests/144.out |   5 ++
 tests/qemu-iotests/153.out |  11 
 tests/qemu-iotests/156.out |  11 
 tests/qemu-iotests/161.out |   8 +++
 tests/qemu-iotests/173 |   4 +-
 tests/qemu-iotests/173.out |  10 +++-
 tests/qemu-iotests/182.out |   8 +++
 tests/qemu-iotests/183.out |  11 
 tests/qemu-iotests/185.out |  18 ++
 tests/qemu-iotests/191.out |   8 +++
 tests/qemu-iotests/200.out |   1 +
 tests/qemu-iotests/223 |  16 +-
 tests/qemu-iotests/223.out | 100 +
 tests/qemu-iotests/229.out |   3 +
 tests/qemu-iotests/249.out |   6 ++
 25 files changed, 387 insertions(+), 5 deletions(-)

-- 
2.21.0
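
Patch 2/3 of this series decides whether a command is QMP or HMP by checking for a leading '{'. A minimal standalone sketch of that prefix-stripping comparison follows; the `is_qmp_cmd` helper name and the sample commands are assumptions made for illustration and are not part of common.qemu:

```shell
# Classify a monitor command by its first character, using the same
# prefix-stripping comparison as the patch.
is_qmp_cmd() {
    cmd=$1
    # ${cmd#{} removes one leading '{' if present, so the result
    # differs from $cmd only when the command started with '{'.
    [ "$cmd" != "${cmd#{}" ]
}

for cmd in '{"execute":"quit"}' "{ 'execute': 'qmp_capabilities' }" 'info block'; do
    if is_qmp_cmd "$cmd"; then
        echo "QMP: $cmd"
    else
        echo "HMP: $cmd"
    fi
done
```

The comparison needs no external process such as grep, and an empty command is classified as HMP because stripping the prefix leaves it unchanged.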




[PATCH v2 16/21] iotests: Make 091 work with data_file

2019-10-15 Thread Max Reitz
The image end offset as reported by qemu-img check is different when
using an external data file; we do not care about its value here, so we
can just filter it.  Incidentally, common.rc already has _check_test_img
for us which does exactly that.

Signed-off-by: Max Reitz 
---
 tests/qemu-iotests/091 | 2 +-
 tests/qemu-iotests/091.out | 2 --
 2 files changed, 1 insertion(+), 3 deletions(-)

diff --git a/tests/qemu-iotests/091 b/tests/qemu-iotests/091
index f4b44659ae..0874fa84c8 100755
--- a/tests/qemu-iotests/091
+++ b/tests/qemu-iotests/091
@@ -101,7 +101,7 @@ echo "Check image pattern"
 ${QEMU_IO} -c "read -P 0x22 0 4M" "${TEST_IMG}" | _filter_testdir | 
_filter_qemu_io
 
 echo "Running 'qemu-img check -r all \$TEST_IMG'"
-"${QEMU_IMG}" check -r all "${TEST_IMG}" 2>&1 | _filter_testdir | _filter_qemu
+_check_test_img -r all
 
 echo "*** done"
 rm -f $seq.full
diff --git a/tests/qemu-iotests/091.out b/tests/qemu-iotests/091.out
index 5017f8c2d9..5ec7b00f13 100644
--- a/tests/qemu-iotests/091.out
+++ b/tests/qemu-iotests/091.out
@@ -23,6 +23,4 @@ read 4194304/4194304 bytes at offset 0
 4 MiB, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
 Running 'qemu-img check -r all $TEST_IMG'
 No errors were found on the image.
-80/16384 = 0.49% allocated, 0.00% fragmented, 0.00% compressed clusters
-Image end offset: 5570560
 *** done
-- 
2.21.0




[PATCH v2 08/21] iotests: Add -o and --no-opts to _make_test_img

2019-10-15 Thread Max Reitz
Blindly overriding IMGOPTS is suboptimal as this discards user-specified
options.  Whatever options the test needs should simply be appended.

Some tests do this (with IMGOPTS=$(_optstr_add "$IMGOPTS" "...")), but
that is cumbersome.  It’s simpler to just give _make_test_img an -o
parameter with which tests can add options.

Some tests actually must override the user-specified options, though,
for example when creating an image in a different format than the test
$IMGFMT.  For such cases, --no-opts allows clearing the current option
list.

Signed-off-by: Max Reitz 
Reviewed-by: Maxim Levitsky 
---
 tests/qemu-iotests/common.rc | 13 +
 1 file changed, 13 insertions(+)

diff --git a/tests/qemu-iotests/common.rc b/tests/qemu-iotests/common.rc
index 3e7adc4834..f3784077de 100644
--- a/tests/qemu-iotests/common.rc
+++ b/tests/qemu-iotests/common.rc
@@ -287,6 +287,7 @@ _make_test_img()
 local use_backing=0
 local backing_file=""
 local object_options=""
+local opts_param=false
 local misc_params=()
 
 if [ -n "$TEST_IMG_FILE" ]; then
@@ -307,6 +308,10 @@ _make_test_img()
 if [ "$use_backing" = "1" -a -z "$backing_file" ]; then
 backing_file=$param
 continue
+elif $opts_param; then
+optstr=$(_optstr_add "$optstr" "$param")
+opts_param=false
+continue
 fi
 
 case "$param" in
@@ -314,6 +319,14 @@ _make_test_img()
 use_backing=1
 ;;
 
+-o)
+opts_param=true
+;;
+
+--no-opts)
+optstr=""
+;;
+
 *)
 misc_params=("${misc_params[@]}" "$param")
 ;;
-- 
2.21.0
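
The parameter handling added above can be tried out in isolation. The following is a simplified sketch, not the real _make_test_img: `optstr_add` is a hypothetical stand-in for _optstr_add that only joins strings with a comma, and `parse_img_params` is an invented name for a stripped-down version of the loop.

```shell
# Hypothetical stand-in for _optstr_add from common.rc: join two
# comma-separated option strings, skipping whichever side is empty.
optstr_add() {
    if [ -n "$1" ] && [ -n "$2" ]; then
        printf '%s,%s\n' "$1" "$2"
    else
        printf '%s%s\n' "$1" "$2"
    fi
}

# Simplified model of the parameter loop: "-o ARG" appends ARG to the
# option string, "--no-opts" clears it, anything else is kept as-is.
parse_img_params() {
    optstr=$1          # initial options, e.g. from the user's IMGOPTS
    shift
    opts_param=false
    misc_params=""
    for param in "$@"; do
        if $opts_param; then
            optstr=$(optstr_add "$optstr" "$param")
            opts_param=false
            continue
        fi
        case "$param" in
            -o)        opts_param=true ;;
            --no-opts) optstr="" ;;
            *)         misc_params="${misc_params:+$misc_params }$param" ;;
        esac
    done
    printf 'opts=[%s] misc=[%s]\n' "$optstr" "$misc_params"
}

parse_img_params "cluster_size=4k" -o compat=0.10 64M
parse_img_params "cluster_size=4k" --no-opts -o compat=0.10 64M
```

In this sketch the first call keeps the user-supplied option and appends the new one, while the second discards it before appending. An argument written as `-ocompat=0.10` (no space) does not match the `-o` case and ends up among the miscellaneous parameters.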




[PATCH v2 11/21] iotests: Replace IMGOPTS='' by --no-opts

2019-10-15 Thread Max Reitz
Signed-off-by: Max Reitz 
Reviewed-by: Maxim Levitsky 
---
 tests/qemu-iotests/071 | 4 ++--
 tests/qemu-iotests/174 | 2 +-
 tests/qemu-iotests/178 | 4 ++--
 tests/qemu-iotests/197 | 4 ++--
 tests/qemu-iotests/215 | 4 ++--
 5 files changed, 9 insertions(+), 9 deletions(-)

diff --git a/tests/qemu-iotests/071 b/tests/qemu-iotests/071
index fab52b..4e31943244 100755
--- a/tests/qemu-iotests/071
+++ b/tests/qemu-iotests/071
@@ -58,7 +58,7 @@ echo
 echo "=== Testing blkverify through filename ==="
 echo
 
-TEST_IMG="$TEST_IMG.base" IMGOPTS="" IMGFMT="raw" _make_test_img $IMG_SIZE |\
+TEST_IMG="$TEST_IMG.base" IMGFMT="raw" _make_test_img --no-opts $IMG_SIZE |\
 _filter_imgfmt
 _make_test_img $IMG_SIZE
 $QEMU_IO -c "open -o 
driver=raw,file.driver=blkverify,file.raw.filename=$TEST_IMG.base $TEST_IMG" \
@@ -73,7 +73,7 @@ echo
 echo "=== Testing blkverify through file blockref ==="
 echo
 
-TEST_IMG="$TEST_IMG.base" IMGOPTS="" IMGFMT="raw" _make_test_img $IMG_SIZE |\
+TEST_IMG="$TEST_IMG.base" IMGFMT="raw" _make_test_img --no-opts $IMG_SIZE |\
 _filter_imgfmt
 _make_test_img $IMG_SIZE
 $QEMU_IO -c "open -o 
driver=raw,file.driver=blkverify,file.raw.filename=$TEST_IMG.base,file.test.driver=$IMGFMT,file.test.file.filename=$TEST_IMG"
 \
diff --git a/tests/qemu-iotests/174 b/tests/qemu-iotests/174
index 0a952a73fd..e2f14a38c6 100755
--- a/tests/qemu-iotests/174
+++ b/tests/qemu-iotests/174
@@ -40,7 +40,7 @@ _unsupported_fmt raw
 
 
 size=256K
-IMGFMT=raw IMGKEYSECRET= IMGOPTS= _make_test_img $size | _filter_imgfmt
+IMGFMT=raw IMGKEYSECRET= _make_test_img --no-opts $size | _filter_imgfmt
 
 echo
 echo "== reading wrong format should fail =="
diff --git a/tests/qemu-iotests/178 b/tests/qemu-iotests/178
index 21231cadd3..75b5e8f314 100755
--- a/tests/qemu-iotests/178
+++ b/tests/qemu-iotests/178
@@ -62,8 +62,8 @@ $QEMU_IMG measure -O foo "$TEST_IMG" # unknown image file 
format
 
 make_test_img_with_fmt() {
 # Shadow global variables within this function
-local IMGFMT="$1" IMGOPTS=""
-_make_test_img "$2"
+local IMGFMT="$1"
+_make_test_img --no-opts "$2"
 }
 
 qemu_io_with_fmt() {
diff --git a/tests/qemu-iotests/197 b/tests/qemu-iotests/197
index 1d4f6786db..4d3d08ad6f 100755
--- a/tests/qemu-iotests/197
+++ b/tests/qemu-iotests/197
@@ -66,8 +66,8 @@ if [ "$IMGFMT" = "vpc" ]; then
 fi
 _make_test_img 4G
 $QEMU_IO -c "write -P 55 3G 1k" "$TEST_IMG" | _filter_qemu_io
-IMGPROTO=file IMGFMT=qcow2 IMGOPTS= TEST_IMG_FILE="$TEST_WRAP" \
-_make_test_img -F "$IMGFMT" -b "$TEST_IMG" | _filter_img_create
+IMGPROTO=file IMGFMT=qcow2 TEST_IMG_FILE="$TEST_WRAP" \
+_make_test_img --no-opts -F "$IMGFMT" -b "$TEST_IMG" | _filter_img_create
 $QEMU_IO -f qcow2 -c "write -z -u 1M 64k" "$TEST_WRAP" | _filter_qemu_io
 
 # Ensure that a read of two clusters, but where one is already allocated,
diff --git a/tests/qemu-iotests/215 b/tests/qemu-iotests/215
index 2eb377d682..55a1874dcd 100755
--- a/tests/qemu-iotests/215
+++ b/tests/qemu-iotests/215
@@ -63,8 +63,8 @@ if [ "$IMGFMT" = "vpc" ]; then
 fi
 _make_test_img 4G
 $QEMU_IO -c "write -P 55 3G 1k" "$TEST_IMG" | _filter_qemu_io
-IMGPROTO=file IMGFMT=qcow2 IMGOPTS= TEST_IMG_FILE="$TEST_WRAP" \
-_make_test_img -F "$IMGFMT" -b "$TEST_IMG" | _filter_img_create
+IMGPROTO=file IMGFMT=qcow2 TEST_IMG_FILE="$TEST_WRAP" \
+_make_test_img --no-opts -F "$IMGFMT" -b "$TEST_IMG" | _filter_img_create
 $QEMU_IO -f qcow2 -c "write -z -u 1M 64k" "$TEST_WRAP" | _filter_qemu_io
 
 # Ensure that a read of two clusters, but where one is already allocated,
-- 
2.21.0




[PATCH v2 15/21] iotests: Avoid cp/mv of test images

2019-10-15 Thread Max Reitz
This will not work with external data files, so try to get tests working
without it as far as possible.

Signed-off-by: Max Reitz 
Reviewed-by: Maxim Levitsky 
---
 tests/qemu-iotests/063 | 12 
 tests/qemu-iotests/063.out |  3 ++-
 tests/qemu-iotests/085 |  9 +++--
 tests/qemu-iotests/085.out |  8 
 4 files changed, 13 insertions(+), 19 deletions(-)

diff --git a/tests/qemu-iotests/063 b/tests/qemu-iotests/063
index eef2b8a534..c750b3806e 100755
--- a/tests/qemu-iotests/063
+++ b/tests/qemu-iotests/063
@@ -51,15 +51,13 @@ _unsupported_imgopts "subformat=monolithicFlat" \
 _make_test_img 4M
 
 echo "== Testing conversion with -n fails with no target file =="
-# check .orig file does not exist
-rm -f "$TEST_IMG.orig"
 if $QEMU_IMG convert -f $IMGFMT -O $IMGFMT -n "$TEST_IMG" "$TEST_IMG.orig" 
>/dev/null 2>&1; then
 exit 1
 fi
 
 echo "== Testing conversion with -n succeeds with a target file =="
-rm -f "$TEST_IMG.orig"
-cp "$TEST_IMG" "$TEST_IMG.orig"
+_rm_test_img "$TEST_IMG.orig"
+TEST_IMG="$TEST_IMG.orig" _make_test_img 4M
 if ! $QEMU_IMG convert -f $IMGFMT -O $IMGFMT -n "$TEST_IMG" "$TEST_IMG.orig" ; 
then
 exit 1
 fi
@@ -85,10 +83,8 @@ fi
 _check_test_img
 
 echo "== Testing conversion to a smaller file fails =="
-rm -f "$TEST_IMG.orig"
-mv "$TEST_IMG" "$TEST_IMG.orig"
-_make_test_img 2M
-if $QEMU_IMG convert -f $IMGFMT -O $IMGFMT -n "$TEST_IMG.orig" "$TEST_IMG" 
>/dev/null 2>&1; then
+TEST_IMG="$TEST_IMG.target" _make_test_img 2M
+if $QEMU_IMG convert -f $IMGFMT -O $IMGFMT -n "$TEST_IMG" "$TEST_IMG.target" 
>/dev/null 2>&1; then
 exit 1
 fi
 
diff --git a/tests/qemu-iotests/063.out b/tests/qemu-iotests/063.out
index 7b691b2c9e..890b719bf0 100644
--- a/tests/qemu-iotests/063.out
+++ b/tests/qemu-iotests/063.out
@@ -2,11 +2,12 @@ QA output created by 063
 Formatting 'TEST_DIR/t.IMGFMT', fmt=IMGFMT size=4194304
 == Testing conversion with -n fails with no target file ==
 == Testing conversion with -n succeeds with a target file ==
+Formatting 'TEST_DIR/t.IMGFMT.orig', fmt=IMGFMT size=4194304
 == Testing conversion to raw is the same after conversion with -n ==
 == Testing conversion back to original format ==
 No errors were found on the image.
 == Testing conversion to a smaller file fails ==
-Formatting 'TEST_DIR/t.IMGFMT', fmt=IMGFMT size=2097152
+Formatting 'TEST_DIR/t.IMGFMT.target', fmt=IMGFMT size=2097152
 == Regression testing for copy offloading bug ==
 Formatting 'TEST_DIR/t.IMGFMT', fmt=IMGFMT size=1048576
 Formatting 'TEST_DIR/t.IMGFMT.target', fmt=IMGFMT size=1048576
diff --git a/tests/qemu-iotests/085 b/tests/qemu-iotests/085
index bbea1252d2..46981dbb64 100755
--- a/tests/qemu-iotests/085
+++ b/tests/qemu-iotests/085
@@ -105,8 +105,7 @@ add_snapshot_image()
 {
 base_image="${TEST_DIR}/$((${1}-1))-${snapshot_virt0}"
 snapshot_file="${TEST_DIR}/${1}-${snapshot_virt0}"
-_make_test_img -u -b "${base_image}" "$size"
-mv "${TEST_IMG}" "${snapshot_file}"
+TEST_IMG=$snapshot_file _make_test_img -u -b "${base_image}" "$size"
 do_blockdev_add "$1" "'backing': null, " "${snapshot_file}"
 }
 
@@ -122,10 +121,8 @@ blockdev_snapshot()
 
 size=128M
 
-_make_test_img $size
-mv "${TEST_IMG}" "${TEST_IMG}.1"
-_make_test_img $size
-mv "${TEST_IMG}" "${TEST_IMG}.2"
+TEST_IMG="$TEST_IMG.1" _make_test_img $size
+TEST_IMG="$TEST_IMG.2" _make_test_img $size
 
 echo
 echo === Running QEMU ===
diff --git a/tests/qemu-iotests/085.out b/tests/qemu-iotests/085.out
index 2a5f256cd3..313198f182 100644
--- a/tests/qemu-iotests/085.out
+++ b/tests/qemu-iotests/085.out
@@ -1,6 +1,6 @@
 QA output created by 085
-Formatting 'TEST_DIR/t.IMGFMT', fmt=IMGFMT size=134217728
-Formatting 'TEST_DIR/t.IMGFMT', fmt=IMGFMT size=134217728
+Formatting 'TEST_DIR/t.IMGFMT.1', fmt=IMGFMT size=134217728
+Formatting 'TEST_DIR/t.IMGFMT.2', fmt=IMGFMT size=134217728
 
 === Running QEMU ===
 
@@ -55,10 +55,10 @@ Formatting 'TEST_DIR/10-snapshot-v1.qcow2', fmt=qcow2 size=134217728 backing_fil
 
 === Create a couple of snapshots using blockdev-snapshot ===
 
-Formatting 'TEST_DIR/t.IMGFMT', fmt=IMGFMT size=134217728 backing_file=TEST_DIR/10-snapshot-v0.IMGFMT
+Formatting 'TEST_DIR/11-snapshot-v0.IMGFMT', fmt=IMGFMT size=134217728 backing_file=TEST_DIR/10-snapshot-v0.IMGFMT
 {"return": {}}
 {"return": {}}
-Formatting 'TEST_DIR/t.IMGFMT', fmt=IMGFMT size=134217728 backing_file=TEST_DIR/11-snapshot-v0.IMGFMT
+Formatting 'TEST_DIR/12-snapshot-v0.IMGFMT', fmt=IMGFMT size=134217728 backing_file=TEST_DIR/11-snapshot-v0.IMGFMT
 {"return": {}}
 {"return": {}}
 
-- 
2.21.0




[PATCH v2 09/21] iotests: Inject space into -ocompat=0.10 in 051

2019-10-15 Thread Max Reitz
It did not matter before, but now that _make_test_img understands -o, we
should use it properly here.

Signed-off-by: Max Reitz 
Reviewed-by: Maxim Levitsky 
---
 tests/qemu-iotests/051 | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/tests/qemu-iotests/051 b/tests/qemu-iotests/051
index 53bcdbc911..9cd1d60d45 100755
--- a/tests/qemu-iotests/051
+++ b/tests/qemu-iotests/051
@@ -157,7 +157,7 @@ echo
 echo === With version 2 images enabling lazy refcounts must fail ===
 echo
 
-_make_test_img -ocompat=0.10 $size
+_make_test_img -o compat=0.10 $size
 
 run_qemu -drive file="$TEST_IMG",format=qcow2,lazy-refcounts=on
 run_qemu -drive file="$TEST_IMG",format=qcow2,lazy-refcounts=off
-- 
2.21.0




Re: [PATCH] blockdev: Use error_report() in hmp_commit()

2019-10-15 Thread Philippe Mathieu-Daudé

On 10/15/19 2:39 PM, Kevin Wolf wrote:

Instead of using monitor_printf() to report errors, hmp_commit() should
use error_report() like other places do.

Signed-off-by: Kevin Wolf 
---
  blockdev.c | 7 +++----
  1 file changed, 3 insertions(+), 4 deletions(-)

diff --git a/blockdev.c b/blockdev.c
index f89e48fc79..e2358966c3 100644
--- a/blockdev.c
+++ b/blockdev.c
@@ -1088,11 +1088,11 @@ void hmp_commit(Monitor *mon, const QDict *qdict)
  
  blk = blk_by_name(device);

  if (!blk) {
-monitor_printf(mon, "Device '%s' not found\n", device);
+error_report("Device '%s' not found", device);
  return;
  }
  if (!blk_is_available(blk)) {
-monitor_printf(mon, "Device '%s' has no medium\n", device);
+error_report("Device '%s' has no medium", device);
  return;
  }
  
@@ -1105,8 +1105,7 @@ void hmp_commit(Monitor *mon, const QDict *qdict)

  aio_context_release(aio_context);
  }
  if (ret < 0) {
-monitor_printf(mon, "'commit' error for '%s': %s\n", device,
-   strerror(-ret));
+error_report("'commit' error for '%s': %s", device, strerror(-ret));
  }
  }
  



Reviewed-by: Philippe Mathieu-Daudé 



Re: [PATCH v2 00/20] nvme: support NVMe v1.3d, SGLs and multiple namespaces

2019-10-15 Thread no-reply
Patchew URL: https://patchew.org/QEMU/20191015103900.313928-1-...@irrelevant.dk/



Hi,

This series seems to have some coding style problems. See output below for
more information:

Subject: [PATCH v2 00/20] nvme: support NVMe v1.3d, SGLs and multiple namespaces
Type: series
Message-id: 20191015103900.313928-1-...@irrelevant.dk

=== TEST SCRIPT BEGIN ===
#!/bin/bash
git rev-parse base > /dev/null || exit 0
git config --local diff.renamelimit 0
git config --local diff.renames True
git config --local diff.algorithm histogram
./scripts/checkpatch.pl --mailback base..
=== TEST SCRIPT END ===

Switched to a new branch 'test'
c68f7e0 nvme: handle dma errors
855f2b8 nvme: make lba data size configurable
68fc575 nvme: remove redundant NvmeCmd pointer parameter
eb585d1 nvme: bump controller pci device id
227280c nvme: support multiple namespaces
ccc877b nvme: add support for scatter gather lists
76d6fe6 nvme: allow multiple aios per command
73227cb nvme: refactor prp mapping
df5fd9f nvme: bump supported specification version to 1.3
c85c0ff nvme: add missing mandatory features
1188552 nvme: add logging to error information log page
714808c nvme: add support for the asynchronous event request command
88bdfce nvme: add support for the get log page command
7716649 nvme: refactor device realization
7d2d51e nvme: add support for the abort command
4ec0e81 nvme: allow completion queues in the cmb
68f00db nvme: populate the mandatory subnqn and ver fields
f08d66a nvme: add missing fields in the identify controller data structure
315a6eb nvme: move device parameters to separate struct
b94cf4a nvme: remove superfluous breaks

=== OUTPUT BEGIN ===
1/20 Checking commit b94cf4aea07b (nvme: remove superfluous breaks)
2/20 Checking commit 315a6eb1f09f (nvme: move device parameters to separate struct)
ERROR: Macros with complex values should be enclosed in parenthesis
#177: FILE: hw/block/nvme.h:6:
+#define DEFINE_NVME_PROPERTIES(_state, _props) \
+DEFINE_PROP_STRING("serial", _state, _props.serial), \
+DEFINE_PROP_UINT32("cmb_size_mb", _state, _props.cmb_size_mb, 0), \
+DEFINE_PROP_UINT32("num_queues", _state, _props.num_queues, 64)

total: 1 errors, 0 warnings, 181 lines checked

Patch 2/20 has style problems, please review.  If any of these errors
are false positives report them to the maintainer, see
CHECKPATCH in MAINTAINERS.

3/20 Checking commit f08d66aa761b (nvme: add missing fields in the identify controller data structure)
4/20 Checking commit 68f00db57e87 (nvme: populate the mandatory subnqn and ver fields)
5/20 Checking commit 4ec0e81a8ca5 (nvme: allow completion queues in the cmb)
6/20 Checking commit 7d2d51e5da89 (nvme: add support for the abort command)
7/20 Checking commit 7716649c3d6d (nvme: refactor device realization)
8/20 Checking commit 88bdfce1a599 (nvme: add support for the get log page command)
9/20 Checking commit 714808cd3ef8 (nvme: add support for the asynchronous event request command)
10/20 Checking commit 11885522fa87 (nvme: add logging to error information log page)
11/20 Checking commit c85c0ff5ea35 (nvme: add missing mandatory features)
12/20 Checking commit df5fd9f283a4 (nvme: bump supported specification version to 1.3)
13/20 Checking commit 73227cb3c83c (nvme: refactor prp mapping)
14/20 Checking commit 76d6fe6ea1cf (nvme: allow multiple aios per command)
15/20 Checking commit ccc877b6f72b (nvme: add support for scatter gather lists)
16/20 Checking commit 227280c8d08c (nvme: support multiple namespaces)
WARNING: added, moved or deleted file(s), does MAINTAINERS need updating?
#42: 
new file mode 100644

total: 0 errors, 1 warnings, 801 lines checked

Patch 16/20 has style problems, please review.  If any of these errors
are false positives report them to the maintainer, see
CHECKPATCH in MAINTAINERS.
17/20 Checking commit eb585d1231e3 (nvme: bump controller pci device id)
18/20 Checking commit 68fc575b3fc7 (nvme: remove redundant NvmeCmd pointer parameter)
19/20 Checking commit 855f2b86dd6c (nvme: make lba data size configurable)
20/20 Checking commit c68f7e0d0c55 (nvme: handle dma errors)
WARNING: line over 80 characters
#77: FILE: hw/block/nvme.c:257:
+if (nvme_addr_read(n, prp_ent, (void *) prp_list, prp_trans)) {

WARNING: line over 80 characters
#103: FILE: hw/block/nvme.c:428:
+if (nvme_addr_read(n, addr, segment, nsgld * sizeof(NvmeSglDescriptor))) {

total: 0 errors, 2 warnings, 148 lines checked

Patch 20/20 has style problems, please review.  If any of these errors
are false positives report them to the maintainer, see
CHECKPATCH in MAINTAINERS.
=== OUTPUT END ===

Test command exited with code: 1


The full log is available at
http://patchew.org/logs/20191015103900.313928-1-...@irrelevant.dk/testing.checkpatch/?type=message.
---
Email generated automatically by Patchew [https://patchew.org/].
Please send your feedback to patchew-de...@redhat.com

Re: [PATCH v2 6/6] tests/qemu-iotests: add case for block-stream compress

2019-10-15 Thread Andrey Shinkevich


On 03/10/2019 17:58, Vladimir Sementsov-Ogievskiy wrote:
> 02.10.2019 17:22, Andrey Shinkevich wrote:
>> Add a test case to the iotest #030 that checks 'compress' option for a
>> block-stream job.
>>
>> Signed-off-by: Andrey Shinkevich 
>> ---
>>tests/qemu-iotests/030 | 49 
>> +-
>>tests/qemu-iotests/030.out |  4 ++--
>>2 files changed, 50 insertions(+), 3 deletions(-)
>>
>> diff --git a/tests/qemu-iotests/030 b/tests/qemu-iotests/030
>> index f3766f2..13fe5a2 100755
>> --- a/tests/qemu-iotests/030
>> +++ b/tests/qemu-iotests/030
>> @@ -21,7 +21,8 @@
>>import time
>>import os
>>import iotests
>> -from iotests import qemu_img, qemu_io
>> +from iotests import qemu_img, qemu_io, qemu_img_pipe
>> +import json
>>
>>backing_img = os.path.join(iotests.test_dir, 'backing.img')
>>mid_img = os.path.join(iotests.test_dir, 'mid.img')
>> @@ -956,6 +957,52 @@ class TestSetSpeed(iotests.QMPTestCase):
>>
>>self.cancel_and_wait(resume=True)
>>
>> +class TestCompressed(iotests.QMPTestCase):
>> +
>> +    def setUp(self):
>> +        qemu_img('create', '-f', iotests.imgfmt, backing_img, '1M')
>> +        qemu_img('create', '-f', iotests.imgfmt, '-o',
>> +                 'backing_file=%s' % backing_img, mid_img)
>> +        qemu_img('create', '-f', iotests.imgfmt, '-o',
>> +                 'backing_file=%s' % mid_img, test_img)
>> +        qemu_io('-c', 'write -P 0x1 0 512k', backing_img)
>> +        self.vm = iotests.VM().add_drive(test_img, "backing.node-name=mid," +
>> +                                         "backing.backing.node-name=base")
>> +        self.vm.launch()
> 
> Why can't you just add a test case to the TestSingleDrive class?

Their setUp() functions differ.

> 
>> +
>> +    def tearDown(self):
>> +        self.vm.shutdown()
>> +        os.remove(test_img)
>> +        os.remove(mid_img)
>> +        os.remove(backing_img)
>> +
>> +    def test_stream_compress(self):
>> +        self.assert_no_active_block_jobs()
>> +
>> +        result = self.vm.qmp('block-stream', device='mid', job_id='stream-mid')
>> +        self.assert_qmp(result, 'return', {})
>> +
>> +        self.wait_until_completed(drive='stream-mid')
>> +        for event in self.vm.get_qmp_events(wait=True):
>> +            if event['event'] == 'BLOCK_JOB_COMPLETED':
>> +                self.dictpath(event, 'data/device')
>> +                self.assert_qmp_absent(event, 'data/error')
> 
> The COMPLETED event has certainly already been waited for by wait_until_completed
> 
>> +
>> +        result = self.vm.qmp('block-stream', device='drive0', base=mid_img,
>> +                             job_id='stream-top', compress=True)
>> +        self.assert_qmp(result, 'return', {})
>> +
>> +        self.wait_until_completed(drive='stream-top')
>> +        self.assert_no_active_block_jobs()
> 
> this assertion is done in wait_until_completed
> 
>> +        self.vm.shutdown()
>> +
>> +        top = json.loads(qemu_img_pipe('info', '--output=json', test_img))
>> +        mid = json.loads(qemu_img_pipe('info', '--output=json', mid_img))
>> +        base = json.loads(qemu_img_pipe('info', '--output=json', backing_img))
>> +
>> +        self.assertEqual(mid['actual-size'], base['actual-size'])
>> +        self.assertLess(top['actual-size'], mid['actual-size'])
>>if __name__ == '__main__':
>>iotests.main(supported_fmts=['qcow2', 'qed'],
>> supported_protocols=['file'])
>> diff --git a/tests/qemu-iotests/030.out b/tests/qemu-iotests/030.out
>> index 6d9bee1..af8dac1 100644
>> --- a/tests/qemu-iotests/030.out
>> +++ b/tests/qemu-iotests/030.out
>> @@ -1,5 +1,5 @@
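>> -...........................
>> +............................
>>  ----------------------------------------------------------------------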
>> -Ran 27 tests
>> +Ran 28 tests
>>
>>OK
>>
> 
> 

-- 
With the best regards,
Andrey Shinkevich


[PATCH v2 1/3] iotests: Fix 173

2019-10-15 Thread Eric Blake
This test has been broken since 3.0.  It used TEST_IMG to influence
the name of a file created during _make_test_img, but commit 655ae6bb
changed things so that the wrong file name is being created, which
then caused _launch_qemu to fail.  In the meantime, the set of events
issued for the actions of the test has increased.

Why haven't we noticed the failure? Because the test rarely gets run:
'./check -qcow2 173' is insufficient (that defaults to using file protocol)
'./check -nfs 173' is insufficient (that defaults to using raw format)
so the test is only run with:
./check -qcow2 -nfs 173

Note that we already have a number of other problems with -nfs:
./check -nfs (fails 18/30)
./check -qcow2 -nfs (fails 45/76 after this patch)
and it's not on my priority list to fix those.  Rather, I found this
because of my next patch's work on tests using _send_qemu_cmd.

Fixes: 655ae6b
Signed-off-by: Eric Blake 
---
 tests/qemu-iotests/173 | 4 ++--
 tests/qemu-iotests/173.out | 6 +++++-
 2 files changed, 7 insertions(+), 3 deletions(-)

diff --git a/tests/qemu-iotests/173 b/tests/qemu-iotests/173
index 9e2fa2e73cb9..29dcaa1960df 100755
--- a/tests/qemu-iotests/173
+++ b/tests/qemu-iotests/173
@@ -47,9 +47,9 @@ size=100M
 BASE_IMG="${TEST_DIR}/image.base"
 TOP_IMG="${TEST_DIR}/image.snp1"

-TEST_IMG="${BASE_IMG}" _make_test_img $size
+TEST_IMG_FILE="${BASE_IMG}" _make_test_img $size

-TEST_IMG="${TOP_IMG}" _make_test_img $size
+TEST_IMG_FILE="${TOP_IMG}" _make_test_img $size

 echo
 echo === Running QEMU, using block-stream to find backing image ===
diff --git a/tests/qemu-iotests/173.out b/tests/qemu-iotests/173.out
index f477a0099a32..e83d17ec2f64 100644
--- a/tests/qemu-iotests/173.out
+++ b/tests/qemu-iotests/173.out
@@ -7,6 +7,10 @@ Formatting 'TEST_DIR/image.snp1', fmt=IMGFMT size=104857600
 {"return": {}}
 {"return": {}}
 {"return": {}}
+{"timestamp": {"seconds":  TIMESTAMP, "microseconds":  TIMESTAMP}, "event": "JOB_STATUS_CHANGE", "data": {"status": "created", "id": "disk2"}}
+{"timestamp": {"seconds":  TIMESTAMP, "microseconds":  TIMESTAMP}, "event": "JOB_STATUS_CHANGE", "data": {"status": "running", "id": "disk2"}}
 {"return": {}}
-{"timestamp": {"seconds":  TIMESTAMP, "microseconds":  TIMESTAMP}, "event": "BLOCK_JOB_COMPLETED", "data": {"device": "disk2", "len": 104857600, "offset": 104857600, "speed": 0, "type": "stream"}}
+{"timestamp": {"seconds":  TIMESTAMP, "microseconds":  TIMESTAMP}, "event": "JOB_STATUS_CHANGE", "data": {"status": "waiting", "id": "disk2"}}
+{"timestamp": {"seconds":  TIMESTAMP, "microseconds":  TIMESTAMP}, "event": "JOB_STATUS_CHANGE", "data": {"status": "pending", "id": "disk2"}}
+{"timestamp": {"seconds":  TIMESTAMP, "microseconds":  TIMESTAMP}, "event": "BLOCK_JOB_COMPLETED", "data": {"device": "disk2", "len": 0, "offset": 0, "speed": 0, "type": "stream"}}
 *** done
-- 
2.21.0




Re: [PULL 0/2] Tracing patches

2019-10-15 Thread Philippe Mathieu-Daudé

On 10/15/19 2:24 PM, Peter Maydell wrote:

On Mon, 14 Oct 2019 at 09:57, Stefan Hajnoczi  wrote:


The following changes since commit 98b2e3c9ab3abfe476a2b02f8f51813edb90e72d:

   Merge remote-tracking branch 'remotes/stefanha/tags/block-pull-request' into staging (2019-10-08 16:08:35 +0100)

are available in the Git repository at:

   https://github.com/stefanha/qemu.git tags/tracing-pull-request

for you to fetch changes up to a1f4fc951a277c49a25418cafb028ec5529707fa:

   trace: avoid "is" with a literal Python 3.8 warnings (2019-10-14 09:54:46 +0100)


Pull request



Stefan Hajnoczi (2):
   trace: add --group=all to tracing.txt
   trace: avoid "is" with a literal Python 3.8 warnings




Applied, thanks.


Buh, v2 missed :(



Re: [RFC PATCH 23/23] qcow2: Add the 'extended_l2' option and the QCOW2_INCOMPAT_EXTL2 bit

2019-10-15 Thread Eric Blake

On 10/15/19 10:23 AM, Alberto Garcia wrote:

Now that the implementation of subclusters is complete we can finally
add the necessary options to create and read images with this feature,
which we call "extended L2 entries".

Signed-off-by: Alberto Garcia 
---



+++ b/qapi/block-core.json
@@ -85,6 +85,7 @@
'compat': 'str',
'*data-file': 'str',
'*data-file-raw': 'bool',
+  '*extended-l2': 'bool',
'*lazy-refcounts': 'bool',
'*corrupt': 'bool',
'refcount-bits': 'int',


Missing documentation for the new member.

--
Eric Blake, Principal Software Engineer
Red Hat, Inc.   +1-919-301-3226
Virtualization:  qemu.org | libvirt.org



Re: [RFC PATCH 00/23] Add subcluster allocation to qcow2

2019-10-15 Thread Eric Blake

On 10/15/19 10:23 AM, Alberto Garcia wrote:

Hi,

this series adds a new feature to the qcow2 on-disk format called
"Extended L2 Entries", which allows us to do subcluster allocation.

This cover letter explains the reasons behind this proposal, the
changes to the on-disk format, test results and pending work. If you
are curious you can also have a look at previous discussions about
this feature:




=== Changes to the on-disk format ===

An L2 entry is 64 bits wide, with this format (for uncompressed
clusters):

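 63    56 55    48 47    40 39    32 31    24 23    16 15     8 7      0
 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
 **<----> <--------------------------------------------------><------->*
   Rsrved              host cluster offset of data             Reserved
   (6 bits)                    (47 bits)                       (8 bits)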

 bit 63: refcount == 1   (QCOW_OFLAG_COPIED)
 bit 62: compressed = 1  (QCOW_OFLAG_COMPRESSED)
 bit  0: all zeros   (QCOW_OFLAG_ZERO)

If Extended L2 Entries are enabled, bit 0 becomes reserved and must be
unset, and this 64-bit bitmap follows the entry:

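 63    56 55    48 47    40 39    32 31    24 23    16 15     8 7      0
 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
 <---------------------------------> <--------------------------------->
      subcluster reads as zeros            subcluster is allocated
              (32 bits)                           (32 bits)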


I like the grouping - you can then do a 4-byte read and comparison to 0 
to see if the entire cluster reads as zeroes or is unallocated.


With 32k clusters, this results in 1k subclusters.  In cluster 1 (offset 
32k), which bits map where?  (The obvious choices are that sub-cluster 
32k maps to bit 0, 33k maps to bit 1, ...; or that sub-cluster 32k maps 
to bit 31, 33k maps to bit 30, ...)


/me reads ahead

okay, in patch 5, you said you map the most significant bit to the first 
subcluster. That feels backwards to me; I wonder if the math is any easier 
if you map sub-clusters starting from the least-significant, because 
then you get:


bit = (address >> cluster_size) & 32

rather than

bit = 31 - ((address >> cluster_size) & 32)
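To make the two candidate mappings concrete, here is a minimal Python sketch. It assumes 32 subclusters per cluster, and it reads the expressions above as a shift by log2 of the subcluster size followed by a mask of 31. The function names are illustrative and are not taken from the series.

```python
SUBCLUSTERS_PER_CLUSTER = 32  # fixed by the proposal


def subcluster_index(address, cluster_size):
    """Index (0..31) of the subcluster that contains a guest address."""
    subcluster_size = cluster_size // SUBCLUSTERS_PER_CLUSTER
    # For power-of-two sizes this equals (address >> subcluster_bits) & 31.
    return (address % cluster_size) // subcluster_size


def bit_lsb_first(address, cluster_size):
    """Mapping where the first subcluster uses bit 0."""
    return subcluster_index(address, cluster_size)


def bit_msb_first(address, cluster_size):
    """Mapping where the first subcluster uses bit 31."""
    return SUBCLUSTERS_PER_CLUSTER - 1 - subcluster_index(address, cluster_size)
```

With 32k clusters, address 33k falls in subcluster 1 of cluster 1, so the first mapping selects bit 1 and the second selects bit 30.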



Some comments about the results:

- The smallest allowed cluster size for an image with subclusters is
   16 KB (in this case the subclusters size is 512 bytes), hence the
   missing values in the 4 KB and 8 KB rows.


Again reading ahead, I see that patch 5 requires a 16k minimum cluster 
for using extended L2.  Could we still permit clusters smaller than 
that, but merely document that subclusters are always a minimum of 512 
bytes and therefore for an 8k cluster we only use 16 bits (leaving the 
other 16 bits zero)?  But I'm also fine with the simplicity of just 
stating that subclusters require at least 16k clusters.




=== To do ===

A couple of things are missing from this series:

- The ability to efficiently zero individual subclusters using
   qcow2_co_pwrite_zeroes(). At the moment only full clusters can be
   zeroed with this method.

- Alternatively we could get rid of the individual "all zeroes" bits
   altogether and have 64 subclusters per cluster. We would still have
   the QCOW_OFLAG_ZERO bit in the standard cluster descriptor.


I think you've got more flexibility with the two bits per sub-cluster 
than you would with just 1 bit and 64 subclusters, so I don't think this 
direction is going to get us far.




- The number of subclusters per cluster is always 32. It would be
   trivial to allow configuring this, but I don't see any use case.


Agreed.



- Tests: I have a few written that I'll add in future revisions of
   this series.

- handle_alloc_space() works at the subclusters level. That is, if you
   have an unallocated 2MB cluster with 64KB subclusters, no backing
   image and you write 4KB of data, QEMU won't write zeroes to the
   affected subcluster(s) and will use handle_alloc_space() instead.
   The other subclusters won't be touched and will remain unallocated.
   This behavior is consistent with how subclusters work and saves disk
   space, but offers slightly lower performance (see test results
   above). Theoretically we could offer a setting to configure this,
   but I'm not convinced that this is very useful.

===

As usual, feedback is welcome,


Looks promising!

How do subclusters interact with external data files?

--
Eric Blake, Principal Software Engineer
Red Hat, Inc.   +1-919-301-3226
Virtualization:  qemu.org | libvirt.org



[RFC PATCH 00/23] Add subcluster allocation to qcow2

2019-10-15 Thread Alberto Garcia
Hi,

this series adds a new feature to the qcow2 on-disk format called
"Extended L2 Entries", which allows us to do subcluster allocation.

This cover letter explains the reasons behind this proposal, the
changes to the on-disk format, test results and pending work. If you
are curious you can also have a look at previous discussions about
this feature:

   https://lists.gnu.org/archive/html/qemu-block/2017-04/msg00178.html
   https://lists.gnu.org/archive/html/qemu-block/2019-06/msg01155.html

This is the first proper version of the patches, and I believe that
the implementation is complete. However since I'm proposing a change
to the on-disk format I'm labeling this as RFC because I'm expecting
some debate. I'll remove the RFC tag and add more tests in future
revisions.

=== Problem ===

A qcow2 image is divided into units of constant size called clusters,
and among other things it contains metadata that maps guest addresses
to host addresses (the so-called L1 and L2 tables).

There are two basic problems that result from this:

1) Reading from or writing to a qcow2 image involves reading the
   corresponding entry on the L2 table that maps the guest address to
   the host address. This is very slow because it involves two I/O
   operations: one on the L2 table and the other one on the actual
   data cluster.

2) A cluster is the smallest unit of allocation. Therefore writing a
   mere 512 bytes to an empty disk requires allocating a complete
   cluster and filling it with zeroes (or with data from the backing
   image if there is one). This wastes more disk space and also has a
   negative impact on I/O.

Problem (1) can be solved by caching the L2 tables in memory. The
maximum amount of disk space used by L2 tables depends on the virtual
disk size and the cluster size:

   max_l2_size = virtual_disk_size * 8 / cluster_size

Because of this, the only way to reduce the size of the L2 tables is
by increasing the cluster size (which can be any power of two between
512 bytes and 2 MB). But then we hit problem (2): I/O is slower and
more disk space is wasted.
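As a rough illustration of the formula above, the following Python sketch evaluates max_l2_size for a few cluster sizes. The 1 TiB disk size and the helper name are arbitrary choices made for this example only.

```python
KIB = 1024
MIB = 1024 * KIB
TIB = 1024 * 1024 * MIB

L2_ENTRY_SIZE = 8  # bytes; a standard L2 entry is 64 bits wide


def max_l2_size(virtual_disk_size, cluster_size):
    """Upper bound, in bytes, on the space used by all L2 tables."""
    return virtual_disk_size * L2_ENTRY_SIZE // cluster_size


if __name__ == "__main__":
    for cluster_size in (64 * KIB, 512 * KIB, 2 * MIB):
        size = max_l2_size(1 * TIB, cluster_size)
        print(cluster_size // KIB, "KiB clusters:", size // MIB, "MiB of L2 tables")
```

Each doubling of the cluster size halves the L2 metadata, which is why large clusters are attractive for large images despite problem (2).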

=== The proposal ===

The proposal is to extend the qcow2 format by allowing subcluster
allocation. The on-disk format remains essentially the same, except
that each data cluster is internally divided into 32 subclusters of
equal size.

The way it works in practice is with a new optional feature called
"Extended L2 Entries", that needs to be enabled when an image is
created. With this, each entry on an L2 table is accompanied by a
bitmap indicating the allocation state of each one of the subclusters
for that cluster. The size of an L2 entry doubles from 64 to 128 bits.

Other than L2 entries, all other data structures remain unchanged, but
for data clusters the smallest unit of allocation is now the
subcluster. Reference counting is still at the cluster level, because
there is no way to reference individual subclusters. Copy-on-write on
internal snapshots needs to copy complete clusters, so that scenario
would not benefit from this change.

I see two main use cases for this feature:

a) The qcow2 image is not too large / the L2 cache is not a problem,
   but you want to increase the allocation performance. In this case
   you can have a 128KB cluster with 4KB subclusters (with 4KB being a
   common block size in ext4 and other filesystems)

b) The qcow2 image is very large and you want to save metadata space
   in order to have a smaller L2 cache. In this case you can go for
   the maximum cluster size (2MB) but you want to have smaller
   subclusters to increase the allocation performance and optimize the
   disk usage.
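The two use cases can be put into numbers with a small Python sketch. It assumes 32 subclusters per cluster and 128-bit extended L2 entries, as described above; the helper names and the 1 TiB disk size are illustrative only.

```python
KIB = 1024
MIB = 1024 * KIB
TIB = 1024 * 1024 * MIB

SUBCLUSTERS_PER_CLUSTER = 32
EXTENDED_L2_ENTRY_SIZE = 16  # bytes: 64-bit entry plus 64-bit bitmap


def subcluster_size(cluster_size):
    """Smallest unit of allocation for data clusters with extended L2 entries."""
    return cluster_size // SUBCLUSTERS_PER_CLUSTER


def max_l2_size_extended(virtual_disk_size, cluster_size):
    """Upper bound on L2 metadata when every entry is 128 bits wide."""
    return virtual_disk_size * EXTENDED_L2_ENTRY_SIZE // cluster_size


if __name__ == "__main__":
    # Use case (a): 128 KiB clusters give 4 KiB subclusters.
    print(subcluster_size(128 * KIB) // KIB, "KiB subclusters")
    # Use case (b): 2 MiB clusters give 64 KiB subclusters.
    print(subcluster_size(2 * MIB) // KIB, "KiB subclusters")
    print(max_l2_size_extended(1 * TIB, 2 * MIB) // MIB, "MiB of L2 tables")
```

Under these assumptions, a 2 MiB cluster with extended entries allocates in 64 KiB units while needing only 8 MiB of L2 tables for a 1 TiB disk.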

=== Changes to the on-disk format ===

An L2 entry is 64 bits wide, with this format (for uncompressed
clusters):

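63    56 55    48 47    40 39    32 31    24 23    16 15     8 7      0
00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
**<----> <--------------------------------------------------><------->*
  Rsrved              host cluster offset of data             Reserved
  (6 bits)                    (47 bits)                       (8 bits)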

bit 63: refcount == 1   (QCOW_OFLAG_COPIED)
bit 62: compressed = 1  (QCOW_OFLAG_COMPRESSED)
bit  0: all zeros   (QCOW_OFLAG_ZERO)

If Extended L2 Entries are enabled, bit 0 becomes reserved and must be
unset, and this 64-bit bitmap follows the entry:

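63    56 55    48 47    40 39    32 31    24 23    16 15     8 7      0
00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
<---------------------------------> <--------------------------------->
     subcluster reads as zeros            subcluster is allocated
             (32 bits)                           (32 bits)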

All this applies to uncompressed clusters. Compressed clusters are not
divided into subclusters, the cluster descriptor remains exactly the
same, and the 64-bit bitmap is not used (i.e. all bits are always 0).
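The layout described in this section can be sketched in Python as follows. The flag values follow the bit positions listed above. HOST_OFFSET_MASK, decode_l2_entry() and split_bitmap() are names made up for this illustration, and the sketch deliberately says nothing about which bit inside each 32-bit half belongs to which subcluster.

```python
QCOW_OFLAG_COPIED = 1 << 63      # bit 63: refcount == 1
QCOW_OFLAG_COMPRESSED = 1 << 62  # bit 62: compressed cluster
QCOW_OFLAG_ZERO = 1 << 0         # bit 0: all zeros (standard entries only)

# Assumption from the diagram: the 47-bit host offset field spans bits 9..55.
HOST_OFFSET_MASK = ((1 << 47) - 1) << 9


def decode_l2_entry(entry, extended_l2=False):
    """Decode the 64-bit descriptor of an uncompressed cluster."""
    if extended_l2 and entry & QCOW_OFLAG_ZERO:
        raise ValueError("bit 0 is reserved with Extended L2 Entries")
    return {
        "copied": bool(entry & QCOW_OFLAG_COPIED),
        "compressed": bool(entry & QCOW_OFLAG_COMPRESSED),
        "all_zeros": bool(entry & QCOW_OFLAG_ZERO),
        "host_offset": entry & HOST_OFFSET_MASK,
    }


def split_bitmap(bitmap):
    """Split the 64-bit bitmap into (reads_as_zeros, allocated) halves."""
    reads_as_zeros = (bitmap >> 32) & 0xFFFFFFFF  # bits 63..32
    allocated = bitmap & 0xFFFFFFFF               # bits 31..0
    return reads_as_zeros, allocated
```

Because each half is a plain 32-bit word, a whole cluster with no allocated subclusters is simply the case where the second value is 0.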

=== Test results ===

I made all tests on an SSD drive, 

[RFC PATCH 21/23] qcow2: Add subcluster support to handle_alloc_space()

2019-10-15 Thread Alberto Garcia
The bdrv_co_pwrite_zeroes() call here fills complete clusters with
zeroes, but it can happen that some subclusters are not part of the
write request or the copy-on-write. This patch makes sure that only
the affected subclusters are overwritten.

A potential improvement would be to also fill with zeroes the other
subclusters if we can guarantee that we are not overwriting existing
data. However this would waste more disk space, so we should first
evaluate if it's really worth doing.
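The range computation introduced by this patch can be modelled in a few lines of Python. This is only a sketch: CowRegion stands in for the cow_start and cow_end members of QCowL2Meta, and the sketch assumes that their offsets are relative to the start of the allocation.

```python
from collections import namedtuple

KIB = 1024
MIB = 1024 * KIB

# Stand-in for the COW regions in QCowL2Meta (offset relative to alloc_offset).
CowRegion = namedtuple("CowRegion", ["offset", "nb_bytes"])


def zero_range(alloc_offset, cow_start, cow_end):
    """Return (start_offset, nb_bytes) of the area to fill with zeroes."""
    start_offset = alloc_offset + cow_start.offset
    nb_bytes = cow_end.offset + cow_end.nb_bytes - cow_start.offset
    return start_offset, nb_bytes


if __name__ == "__main__":
    # 4 KiB write at 200 KiB into a cluster with 64 KiB subclusters:
    # the COW regions cover the rest of subcluster #3 (192 KiB..256 KiB).
    head = CowRegion(offset=192 * KIB, nb_bytes=8 * KIB)
    tail = CowRegion(offset=204 * KIB, nb_bytes=52 * KIB)
    print(zero_range(10 * MIB, head, tail))
```

In this example only the 64 KiB of the affected subcluster are zeroed, instead of the whole cluster.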

Signed-off-by: Alberto Garcia 
---
 block/qcow2.c | 9 +++++----
 1 file changed, 5 insertions(+), 4 deletions(-)

diff --git a/block/qcow2.c b/block/qcow2.c
index c222cd261d..c54278ab0b 100644
--- a/block/qcow2.c
+++ b/block/qcow2.c
@@ -2194,6 +2194,9 @@ static int handle_alloc_space(BlockDriverState *bs, QCowL2Meta *l2meta)
 
 for (m = l2meta; m != NULL; m = m->next) {
 int ret;
+uint64_t start_offset = m->alloc_offset + m->cow_start.offset;
+uint64_t nb_bytes = m->cow_end.offset + m->cow_end.nb_bytes -
+m->cow_start.offset;
 
 if (!m->cow_start.nb_bytes && !m->cow_end.nb_bytes) {
 continue;
@@ -2208,16 +2211,14 @@ static int handle_alloc_space(BlockDriverState *bs, QCowL2Meta *l2meta)
  * efficiently zero out the whole clusters
  */
 
-ret = qcow2_pre_write_overlap_check(bs, 0, m->alloc_offset,
-m->nb_clusters * s->cluster_size,
+ret = qcow2_pre_write_overlap_check(bs, 0, start_offset, nb_bytes,
 true);
 if (ret < 0) {
 return ret;
 }
 
 BLKDBG_EVENT(bs->file, BLKDBG_CLUSTER_ALLOC_SPACE);
-ret = bdrv_co_pwrite_zeroes(s->data_file, m->alloc_offset,
-m->nb_clusters * s->cluster_size,
+ret = bdrv_co_pwrite_zeroes(s->data_file, start_offset, nb_bytes,
 BDRV_REQ_NO_FALLBACK);
 if (ret < 0) {
 if (ret != -ENOTSUP && ret != -EAGAIN) {
-- 
2.20.1




[RFC PATCH 14/23] qcow2: Add subcluster support to qcow2_get_cluster_offset()

2019-10-15 Thread Alberto Garcia
The logic of this function remains pretty much the same, except that
it uses count_contiguous_subclusters(), which combines the logic of
count_contiguous_clusters() / count_contiguous_clusters_unallocated()
and checks individual subclusters.
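The counting logic can be modelled in Python roughly as follows. This is a simplified sketch and not the code from the patch: each cluster is represented as a (host_offset, subcluster_types) pair, and the type names are plain strings chosen for the example.

```python
NO_OFFSET_TYPES = ("unallocated", "zero_plain")  # types without a host offset


def count_contiguous_subclusters(clusters, sc_index, cluster_size):
    """Count subclusters of the same type, starting at clusters[0][sc_index].

    clusters is a list of (host_offset, [type, type, ...]) tuples.
    Allocated clusters must also be contiguous in the image file.
    """
    expected_offset, first_types = clusters[0]
    sc_type = first_types[sc_index]
    if sc_type == "compressed":
        return 1  # compressed clusters are always counted one by one
    check_offset = sc_type not in NO_OFFSET_TYPES

    count = 0
    for i, (host_offset, types) in enumerate(clusters):
        if check_offset and host_offset != expected_offset:
            break
        # Only the first cluster starts in the middle, at sc_index.
        for j in range(sc_index if i == 0 else 0, len(types)):
            if types[j] != sc_type:
                return count
            count += 1
        expected_offset += cluster_size
    return count
```

The search stops at the first subcluster of a different type or at the first cluster that is not contiguous in the image file, whichever comes first.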

Signed-off-by: Alberto Garcia 
---
 block/qcow2-cluster.c | 111 --
 1 file changed, 52 insertions(+), 59 deletions(-)

diff --git a/block/qcow2-cluster.c b/block/qcow2-cluster.c
index 8df0f67316..71d4cc518a 100644
--- a/block/qcow2-cluster.c
+++ b/block/qcow2-cluster.c
@@ -372,66 +372,51 @@ fail:
 }
 
 /*
- * Checks how many clusters in a given L2 slice are contiguous in the image
- * file. As soon as one of the flags in the bitmask stop_flags changes compared
- * to the first cluster, the search is stopped and the cluster is not counted
- * as contiguous. (This allows it, for example, to stop at the first compressed
- * cluster which may require a different handling)
+ * Return the number of contiguous subclusters of the exact same type
+ * in a given L2 slice, starting from cluster @l2_index, subcluster
+ * @sc_index. At most @nb_clusters are checked. Allocated clusters are
+ * also required to be contiguous in the image file.
  */
-static int count_contiguous_clusters(BlockDriverState *bs, int nb_clusters,
-int cluster_size, uint64_t *l2_slice, int l2_index, uint64_t stop_flags)
+static int count_contiguous_subclusters(BlockDriverState *bs, int nb_clusters,
+unsigned sc_index, uint64_t *l2_slice,
+int l2_index)
 {
 BDRVQcow2State *s = bs->opaque;
-int i;
-QCow2ClusterType first_cluster_type;
-uint64_t mask = stop_flags | L2E_OFFSET_MASK | QCOW_OFLAG_COMPRESSED;
-uint64_t first_entry = get_l2_entry(s, l2_slice, l2_index);
-uint64_t offset = first_entry & mask;
-
-first_cluster_type = qcow2_get_cluster_type(bs, first_entry);
-if (first_cluster_type == QCOW2_CLUSTER_UNALLOCATED) {
-return 0;
+int i, j, count = 0;
+uint64_t l2_entry = get_l2_entry(s, l2_slice, l2_index);
+uint64_t l2_bitmap = get_l2_bitmap(s, l2_slice, l2_index);
+uint64_t expected_offset = l2_entry & L2E_OFFSET_MASK;
+bool check_offset = true;
+QCow2ClusterType type =
+qcow2_get_subcluster_type(bs, l2_entry, l2_bitmap, sc_index);
+
+assert(type != QCOW2_CLUSTER_INVALID); /* The caller should check this */
+
+if (type == QCOW2_CLUSTER_COMPRESSED) {
+return 1; /* Compressed clusters are always counted one by one */
 }
 
-/* must be allocated */
-assert(first_cluster_type == QCOW2_CLUSTER_NORMAL ||
-   first_cluster_type == QCOW2_CLUSTER_ZERO_ALLOC);
-
-for (i = 0; i < nb_clusters; i++) {
-uint64_t l2_entry = get_l2_entry(s, l2_slice, l2_index + i) & mask;
-if (offset + (uint64_t) i * cluster_size != l2_entry) {
-break;
-}
+if (type == QCOW2_CLUSTER_UNALLOCATED || type == QCOW2_CLUSTER_ZERO_PLAIN) {
+check_offset = false;
 }
 
-return i;
-}
-
-/*
- * Checks how many consecutive unallocated clusters in a given L2
- * slice have the same cluster type.
- */
-static int count_contiguous_clusters_unallocated(BlockDriverState *bs,
- int nb_clusters,
- uint64_t *l2_slice,
- int l2_index,
- QCow2ClusterType wanted_type)
-{
-BDRVQcow2State *s = bs->opaque;
-int i;
-
-assert(wanted_type == QCOW2_CLUSTER_ZERO_PLAIN ||
-   wanted_type == QCOW2_CLUSTER_UNALLOCATED);
 for (i = 0; i < nb_clusters; i++) {
-uint64_t entry = get_l2_entry(s, l2_slice, l2_index + i);
-QCow2ClusterType type = qcow2_get_cluster_type(bs, entry);
-
-if (type != wanted_type) {
-break;
+l2_entry = get_l2_entry(s, l2_slice, l2_index + i);
+l2_bitmap = get_l2_bitmap(s, l2_slice, l2_index + i);
+if (check_offset && expected_offset != (l2_entry & L2E_OFFSET_MASK)) {
+goto out;
+}
+for (j = (i == 0) ? sc_index : 0; j < s->subclusters_per_cluster; j++) {
+if (qcow2_get_subcluster_type(bs, l2_entry, l2_bitmap, j) != type) {
+goto out;
+}
+count++;
 }
+expected_offset += s->cluster_size;
 }
 
-return i;
+out:
+return count;
 }
 
 static int coroutine_fn do_perform_cow_read(BlockDriverState *bs,
@@ -514,8 +499,8 @@ int qcow2_get_cluster_offset(BlockDriverState *bs, uint64_t offset,
  unsigned int *bytes, uint64_t *cluster_offset)
 {
 BDRVQcow2State *s = bs->opaque;
-unsigned int l2_index;
-uint64_t l1_index, l2_offset, *l2_slice;
+unsigned int l2_index, sc_index;
+uint64_t l1_index, l2_offset, *l2_slice, l2_bitmap;
 int c;
 

[RFC PATCH 13/23] qcow2: Add subcluster support to calculate_l2_meta()

2019-10-15 Thread Alberto Garcia
If an image has subclusters then there are more copy-on-write
scenarios that we need to consider. Let's say we have a write request
from the middle of subcluster #3 until the end of the cluster:

   - If the cluster is new, then subclusters #0 to #3 from the old
 cluster must be copied into the new one.

   - If the cluster is new but the old cluster was unallocated, then
 only subcluster #3 needs copy-on-write. #0 to #2 are marked as
 unallocated in the bitmap of the new L2 entry.

   - If we are overwriting an old cluster and subcluster #3 is
 unallocated or has the all-zeroes bit set then we need
 copy-on-write on subcluster #3.

   - If we are overwriting an old cluster and subcluster #3 was
 allocated then there is no need to copy-on-write.
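The four scenarios can be written down as a small decision function. The Python sketch below only covers the head of the write (the part of the cluster before the first written byte). The function and parameter names are hypothetical and are not taken from the patch.

```python
def cow_head_range(write_start, subcluster_size, cluster_is_new,
                   old_cluster_allocated, subcluster_allocated):
    """Return the (start, end) byte range inside the cluster that needs
    copy-on-write before the write; start == end means no COW.

    subcluster_allocated refers to the first written subcluster of the
    old cluster and is False if it is unallocated or reads as zeroes.
    """
    sc_start = write_start - write_start % subcluster_size
    if cluster_is_new:
        if old_cluster_allocated:
            # Everything before the write comes from the old cluster.
            return (0, write_start)
        # Earlier subclusters stay unallocated in the new bitmap.
        return (sc_start, write_start)
    if not subcluster_allocated:
        # Overwriting in place, but this subcluster has no data yet.
        return (sc_start, write_start)
    # Overwriting an allocated subcluster: nothing to copy.
    return (write_start, write_start)
```

For a write that starts in the middle of subcluster #3, the first case copies everything from the start of the cluster, the second and third copy only the first half of subcluster #3, and the last copies nothing.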

Signed-off-by: Alberto Garcia 
---
 block/qcow2-cluster.c | 136 +-
 1 file changed, 108 insertions(+), 28 deletions(-)

diff --git a/block/qcow2-cluster.c b/block/qcow2-cluster.c
index 67f90e415d..8df0f67316 100644
--- a/block/qcow2-cluster.c
+++ b/block/qcow2-cluster.c
@@ -1034,14 +1034,16 @@ void qcow2_alloc_cluster_abort(BlockDriverState *bs, QCowL2Meta *m)
  * If @keep_old is true it means that the clusters were already
  * allocated and will be overwritten. If false then the clusters are
  * new and we have to decrease the reference count of the old ones.
+ *
+ * Returns 1 on success, -errno on failure.
  */
-static void calculate_l2_meta(BlockDriverState *bs, uint64_t host_offset,
-  uint64_t guest_offset, uint64_t bytes,
-  uint64_t *l2_slice, QCowL2Meta **m, bool keep_old)
+static int calculate_l2_meta(BlockDriverState *bs, uint64_t host_offset,
+ uint64_t guest_offset, uint64_t bytes,
+ uint64_t *l2_slice, QCowL2Meta **m, bool keep_old)
 {
 BDRVQcow2State *s = bs->opaque;
-int l2_index = offset_to_l2_slice_index(s, guest_offset);
-uint64_t l2_entry;
+int sc_index, l2_index = offset_to_l2_slice_index(s, guest_offset);
+uint64_t l2_entry, l2_bitmap;
 unsigned cow_start_from, cow_end_to;
 unsigned cow_start_to = offset_into_cluster(s, guest_offset);
 unsigned cow_end_from = cow_start_to + bytes;
@@ -1049,38 +1051,108 @@ static void calculate_l2_meta(BlockDriverState *bs, uint64_t host_offset,
 QCowL2Meta *old_m = *m;
 QCow2ClusterType type;
 
-/* Return if there's no COW (all clusters are normal and we keep them) */
+/* Return if there's no COW (all subclusters are normal and we are
+ * keeping the clusters) */
 if (keep_old) {
+unsigned first_sc = cow_start_to / s->subcluster_size;
+unsigned last_sc = (cow_end_from - 1) / s->subcluster_size;
 int i;
-for (i = 0; i < nb_clusters; i++) {
-l2_entry = get_l2_entry(s, l2_slice, l2_index + i);
-if (qcow2_get_cluster_type(bs, l2_entry) != QCOW2_CLUSTER_NORMAL) {
+for (i = first_sc; i <= last_sc; i++) {
+unsigned c = i / s->subclusters_per_cluster;
+unsigned sc = i % s->subclusters_per_cluster;
+l2_entry = get_l2_entry(s, l2_slice, l2_index + c);
+l2_bitmap = get_l2_bitmap(s, l2_slice, l2_index + c);
+type = qcow2_get_subcluster_type(bs, l2_entry, l2_bitmap, sc);
+if (type == QCOW2_CLUSTER_INVALID) {
+l2_index += c; /* Point to the invalid entry */
+goto fail;
+}
+if (type != QCOW2_CLUSTER_NORMAL) {
 break;
 }
 }
-if (i == nb_clusters) {
-return;
+if (i == last_sc + 1) {
+return 1;
 }
 }
 
 /* Get the L2 entry from the first cluster */
 l2_entry = get_l2_entry(s, l2_slice, l2_index);
-type = qcow2_get_cluster_type(bs, l2_entry);
+l2_bitmap = get_l2_bitmap(s, l2_slice, l2_index);
+sc_index = offset_to_sc_index(s, guest_offset);
+type = qcow2_get_subcluster_type(bs, l2_entry, l2_bitmap, sc_index);
 
-if (type == QCOW2_CLUSTER_NORMAL && keep_old) {
-cow_start_from = cow_start_to;
+if (type == QCOW2_CLUSTER_INVALID) {
+goto fail;
+}
+
+if (!keep_old) {
+switch (type) {
+case QCOW2_CLUSTER_NORMAL:
+case QCOW2_CLUSTER_COMPRESSED:
+case QCOW2_CLUSTER_ZERO_ALLOC:
+case QCOW2_CLUSTER_UNALLOCATED_SUBCLUSTER:
+cow_start_from = 0;
+break;
+case QCOW2_CLUSTER_ZERO_PLAIN:
+case QCOW2_CLUSTER_UNALLOCATED:
+cow_start_from = sc_index << s->subcluster_bits;
+break;
+default:
+g_assert_not_reached();
+}
 } else {
-cow_start_from = 0;
+switch (type) {
+case QCOW2_CLUSTER_NORMAL:
+cow_start_from = cow_start_to;
+break;
+case QCOW2_CLUSTER_ZERO_ALLOC:
+case 

[RFC PATCH 17/23] qcow2: Add subcluster support to check_refcounts_l2()

2019-10-15 Thread Alberto Garcia
Setting the QCOW_OFLAG_ZERO bit of the L2 entry is forbidden if an
image has subclusters. Instead, the individual 'all zeroes' bits must
be used.
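
A minimal sketch of the rule (not part of the patch): the struct and
function names below are hypothetical, QCOW_OFLAG_ZERO is bit 0 of the L2
entry as defined in block/qcow2.h, and the per-subcluster "reads as
zeroes" bit layout is assumed to be the one introduced elsewhere in this
series.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define OFLAG_ZERO   (1ULL << 0)            /* QCOW_OFLAG_ZERO */
#define SUB_ZERO(x)  ((1ULL << 63) >> (x))  /* assumed layout, x in [0..31] */

/* Hypothetical container for an L2 entry and its subcluster bitmap */
typedef struct {
    uint64_t entry;
    uint64_t bitmap;
} L2Pair;

/* Build the bitmap with all 32 "reads as zeroes" bits set */
static uint64_t all_zeroes_bitmap(void)
{
    uint64_t bitmap = 0;
    for (unsigned i = 0; i < 32; i++) {
        bitmap |= SUB_ZERO(i);
    }
    return bitmap;
}

/* Mark a whole cluster as reading zeroes without allocating it */
static L2Pair zero_cluster(bool has_subclusters)
{
    L2Pair p = { 0, 0 };
    if (has_subclusters) {
        p.bitmap = all_zeroes_bitmap();  /* entry itself stays 0 */
    } else {
        p.entry = OFLAG_ZERO;
    }
    return p;
}
```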

Signed-off-by: Alberto Garcia 
---
 block/qcow2-refcount.c | 9 +++--
 1 file changed, 7 insertions(+), 2 deletions(-)

diff --git a/block/qcow2-refcount.c b/block/qcow2-refcount.c
index a2c4d36378..3eda523e25 100644
--- a/block/qcow2-refcount.c
+++ b/block/qcow2-refcount.c
@@ -1685,8 +1685,13 @@ static int check_refcounts_l2(BlockDriverState *bs, BdrvCheckResult *res,
 int ign = active ? QCOW2_OL_ACTIVE_L2 :
QCOW2_OL_INACTIVE_L2;
 
-l2_entry = QCOW_OFLAG_ZERO;
-set_l2_entry(s, l2_table, i, l2_entry);
+if (has_subclusters(s)) {
+set_l2_entry(s, l2_table, i, 0);
+set_l2_bitmap(s, l2_table, i,
+  QCOW_L2_BITMAP_ALL_ZEROES);
+} else {
+set_l2_entry(s, l2_table, i, QCOW_OFLAG_ZERO);
+}
 ret = qcow2_pre_write_overlap_check(bs, ign,
 l2e_offset, l2_entry_size(s), false);
 if (ret < 0) {
-- 
2.20.1




[RFC PATCH 11/23] qcow2: Add qcow2_get_subcluster_type()

2019-10-15 Thread Alberto Garcia
This function returns the type of an individual subcluster. If an
image does not have subclusters then this returns the exact same value
as qcow2_get_cluster_type().

The information in standard and extended L2 entries is encoded in a
slightly different way, but all existing QCow2ClusterType values are
also valid for subclusters and have the same meanings (although they
typically only apply to the requested subcluster).

There are two important exceptions to this:

  a) QCOW2_CLUSTER_COMPRESSED means that the whole cluster is
 compressed. We do not support compression at the subcluster
 level.

  b) QCOW2_CLUSTER_UNALLOCATED means that the cluster is unallocated,
 that is, the offset field of the L2 entry does not point to a
 host cluster. All subclusters are obviously unallocated too but
 any of them could be of type QCOW2_CLUSTER_ZERO_PLAIN.

In addition to that, extended L2 entries allow one new scenario where
the cluster is normally allocated but an individual subcluster is not.
This is very different from (b) and because of that this patch adds a
new value called QCOW2_CLUSTER_UNALLOCATED_SUBCLUSTER.

Finally, this patch adds QCOW2_CLUSTER_INVALID to detect the cases
where an L2 entry has a value that violates the spec. The caller is
responsible for handling these situations.

To prevent compatibility problems with images that have invalid values
but are currently being read by QEMU without causing side effects,
QCOW2_CLUSTER_INVALID is only returned for images with extended L2
entries.
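
The mapping described above can be sketched as a standalone function
(not part of the patch). The enum and function names are hypothetical
stand-ins for QCow2ClusterType and qcow2_get_subcluster_type(), the
compressed case is left out, and whether the cluster is allocated is
passed in as a boolean rather than derived from the L2 entry.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Bit layout from this patch: "reads as zeroes" bits in the upper 32
 * bits of the bitmap, "allocated" bits in the lower 32 */
#define SUB_ZERO(x)  ((1ULL << 63) >> (x))
#define SUB_ALLOC(x) ((1ULL << 31) >> (x))

/* Hypothetical stand-in for QCow2ClusterType */
typedef enum {
    SC_UNALLOCATED,
    SC_UNALLOCATED_SUBCLUSTER,
    SC_ZERO_PLAIN,
    SC_ZERO_ALLOC,
    SC_NORMAL,
    SC_INVALID,
} SubclusterType;

static SubclusterType subcluster_type(bool cluster_allocated,
                                      uint64_t l2_bitmap, unsigned sc_index)
{
    assert(sc_index < 32);
    bool sc_zero  = l2_bitmap & SUB_ZERO(sc_index);
    bool sc_alloc = l2_bitmap & SUB_ALLOC(sc_index);

    if (sc_zero && sc_alloc) {
        return SC_INVALID;             /* both bits set violates the spec */
    }
    if (cluster_allocated) {
        if (sc_alloc) {
            return SC_NORMAL;
        }
        return sc_zero ? SC_ZERO_ALLOC : SC_UNALLOCATED_SUBCLUSTER;
    }
    if (sc_alloc) {
        return SC_INVALID;             /* data bit without a host cluster */
    }
    return sc_zero ? SC_ZERO_PLAIN : SC_UNALLOCATED;
}
```

For example, an allocated cluster whose bitmap has only the allocation
bit of subcluster 3 set yields SC_NORMAL for index 3 and
SC_UNALLOCATED_SUBCLUSTER for every other index.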

Signed-off-by: Alberto Garcia 
---
 block/qcow2.h | 62 +++
 1 file changed, 62 insertions(+)

diff --git a/block/qcow2.h b/block/qcow2.h
index d9fe883fe0..60e4bf963e 100644
--- a/block/qcow2.h
+++ b/block/qcow2.h
@@ -74,6 +74,15 @@
 
 #define QCOW_MAX_SUBCLUSTERS_PER_CLUSTER 32
 
+/* The subcluster X [0..31] reads as zeroes */
+#define QCOW_OFLAG_SUB_ZERO(X)    ((1ULL << 63) >> (X))
+/* The subcluster X [0..31] is allocated */
+#define QCOW_OFLAG_SUB_ALLOC(X)   ((1ULL << 31) >> (X))
+/* L2 entry bitmap with all "read as zeroes" bits set */
+#define QCOW_L2_BITMAP_ALL_ZEROES 0xFFFFFFFF00000000ULL
+/* L2 entry bitmap with all allocation bits set */
+#define QCOW_L2_BITMAP_ALL_ALLOC  0x00000000FFFFFFFFULL
+
 #define MIN_CLUSTER_BITS 9
 #define MAX_CLUSTER_BITS 21
 
@@ -435,10 +444,12 @@ typedef struct QCowL2Meta
 
 typedef enum QCow2ClusterType {
 QCOW2_CLUSTER_UNALLOCATED,
+QCOW2_CLUSTER_UNALLOCATED_SUBCLUSTER,
 QCOW2_CLUSTER_ZERO_PLAIN,
 QCOW2_CLUSTER_ZERO_ALLOC,
 QCOW2_CLUSTER_NORMAL,
 QCOW2_CLUSTER_COMPRESSED,
+QCOW2_CLUSTER_INVALID,
 } QCow2ClusterType;
 
 typedef enum QCow2MetadataOverlap {
@@ -618,6 +629,57 @@ static inline QCow2ClusterType qcow2_get_cluster_type(BlockDriverState *bs,
 }
 }
 
+/* In an image without subclusters this returns the same value as
+ * qcow2_get_cluster_type() */
+static inline int qcow2_get_subcluster_type(BlockDriverState *bs,
+uint64_t l2_entry,
+uint64_t l2_bitmap,
+unsigned sc_index)
+{
+BDRVQcow2State *s = bs->opaque;
+QCow2ClusterType type = qcow2_get_cluster_type(bs, l2_entry);
+assert(sc_index < s->subclusters_per_cluster);
+
+if (has_subclusters(s)) {
+bool sc_zero  = l2_bitmap & QCOW_OFLAG_SUB_ZERO(sc_index);
+bool sc_alloc = l2_bitmap & QCOW_OFLAG_SUB_ALLOC(sc_index);
+switch (type) {
+case QCOW2_CLUSTER_COMPRESSED:
+if (l2_bitmap != 0) {
+return QCOW2_CLUSTER_INVALID;
+}
+break;
+case QCOW2_CLUSTER_ZERO_PLAIN:
+case QCOW2_CLUSTER_ZERO_ALLOC:
+return QCOW2_CLUSTER_INVALID;
+case QCOW2_CLUSTER_NORMAL:
+if (!sc_zero && !sc_alloc) {
+return QCOW2_CLUSTER_UNALLOCATED_SUBCLUSTER;
+} else if (!sc_zero && sc_alloc) {
+return QCOW2_CLUSTER_NORMAL;
+} else if (sc_zero && !sc_alloc) {
+return QCOW2_CLUSTER_ZERO_ALLOC;
+} else { /* sc_zero && sc_alloc */
+return QCOW2_CLUSTER_INVALID;
+}
+case QCOW2_CLUSTER_UNALLOCATED:
+if (!sc_zero && !sc_alloc) {
+return QCOW2_CLUSTER_UNALLOCATED;
+} else if (!sc_zero && sc_alloc) {
+return QCOW2_CLUSTER_INVALID;
+} else if (sc_zero && !sc_alloc) {
+return QCOW2_CLUSTER_ZERO_PLAIN;
+} else { /* sc_zero && sc_alloc */
+return QCOW2_CLUSTER_INVALID;
+}
+default:
+g_assert_not_reached();
+}
+}
+
+return type;
+}
+
 /* Check whether refcounts are eager or lazy */
 static inline bool qcow2_need_accurate_refcounts(BDRVQcow2State *s)
 {
-- 
2.20.1




[RFC PATCH 07/23] qcow2: Add subcluster-related fields to BDRVQcow2State

2019-10-15 Thread Alberto Garcia
This patch adds the following new fields to BDRVQcow2State:

- subclusters_per_cluster: Number of subclusters in a cluster
- subcluster_size: The size of each subcluster, in bytes
- subcluster_bits: Number of bits, so 1 << subcluster_bits = subcluster_size

Images without subclusters are treated as if they had exactly one,
with subcluster_size = cluster_size.
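
For illustration only, the relationship between the three fields can be
written as a small standalone function. The struct and function names
are hypothetical, and a portable loop is used here where the patch calls
ctz32().

```c
#include <assert.h>
#include <stdbool.h>

#define MAX_SUBCLUSTERS_PER_CLUSTER 32

/* Hypothetical container for the three new fields */
typedef struct {
    int subclusters_per_cluster;
    int subcluster_size;
    int subcluster_bits;
} SubclusterGeometry;

static SubclusterGeometry subcluster_geometry(int cluster_bits,
                                              bool has_subclusters)
{
    SubclusterGeometry g;
    int cluster_size = 1 << cluster_bits;

    g.subclusters_per_cluster =
        has_subclusters ? MAX_SUBCLUSTERS_PER_CLUSTER : 1;
    g.subcluster_size = cluster_size / g.subclusters_per_cluster;

    /* subcluster_size is a power of two, so find its log2 */
    g.subcluster_bits = 0;
    while ((1 << g.subcluster_bits) < g.subcluster_size) {
        g.subcluster_bits++;
    }
    return g;
}
```

With 64 KiB clusters (cluster_bits = 16) this gives 32 subclusters of
2048 bytes and subcluster_bits = 11; without subclusters it gives a
single 65536-byte subcluster and subcluster_bits = 16.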

Signed-off-by: Alberto Garcia 
---
 block/qcow2.c | 5 +
 block/qcow2.h | 5 +
 2 files changed, 10 insertions(+)

diff --git a/block/qcow2.c b/block/qcow2.c
index 4d16393e61..be9854c5ea 100644
--- a/block/qcow2.c
+++ b/block/qcow2.c
@@ -1341,6 +1341,11 @@ static int coroutine_fn qcow2_do_open(BlockDriverState *bs, QDict *options,
 }
 }
 
+s->subclusters_per_cluster =
+has_subclusters(s) ? QCOW_MAX_SUBCLUSTERS_PER_CLUSTER : 1;
+s->subcluster_size = s->cluster_size / s->subclusters_per_cluster;
+s->subcluster_bits = ctz32(s->subcluster_size);
+
 /* Check support for various header values */
 if (header.refcount_order > 6) {
 error_setg(errp, "Reference count entry width too large; may not "
diff --git a/block/qcow2.h b/block/qcow2.h
index 6d6fc57f41..e6486a2cf8 100644
--- a/block/qcow2.h
+++ b/block/qcow2.h
@@ -72,6 +72,8 @@
 /* The cluster reads as all zeros */
 #define QCOW_OFLAG_ZERO (1ULL << 0)
 
+#define QCOW_MAX_SUBCLUSTERS_PER_CLUSTER 32
+
 #define MIN_CLUSTER_BITS 9
 #define MAX_CLUSTER_BITS 21
 
@@ -274,6 +276,9 @@ typedef struct BDRVQcow2State {
 int cluster_bits;
 int cluster_size;
 int l2_slice_size;
+int subcluster_bits;
+int subcluster_size;
+int subclusters_per_cluster;
 int l2_bits;
 int l2_size;
 int l1_size;
-- 
2.20.1




[RFC PATCH 08/23] qcow2: Add offset_to_sc_index()

2019-10-15 Thread Alberto Garcia
For a given offset, return the subcluster number within its cluster
(i.e. with 32 subclusters per cluster it returns a number between 0
and 31).
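
A standalone version of the same calculation, for illustration only: the
function name is hypothetical and the two state fields are passed as
plain parameters instead of being read from BDRVQcow2State.

```c
#include <assert.h>
#include <stdint.h>

/* Subcluster number of @offset within its cluster. Assumes that
 * subclusters_per_cluster is a power of two, so the mask works. */
static int sc_index_for_offset(int64_t offset, int subcluster_bits,
                               int subclusters_per_cluster)
{
    assert(subclusters_per_cluster > 0);
    return (offset >> subcluster_bits) & (subclusters_per_cluster - 1);
}
```

With 64 KiB clusters and 32 subclusters of 2 KiB each (subcluster_bits =
11), offsets 0 to 2047 map to subcluster 0, offset 2048 maps to 1, offset
65535 maps to 31, and offset 65536 wraps around to 0 in the next cluster.
For an image without subclusters the result is always 0.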

Signed-off-by: Alberto Garcia 
---
 block/qcow2.h | 5 +
 1 file changed, 5 insertions(+)

diff --git a/block/qcow2.h b/block/qcow2.h
index e6486a2cf8..c450267c88 100644
--- a/block/qcow2.h
+++ b/block/qcow2.h
@@ -556,6 +556,11 @@ static inline int offset_to_l2_slice_index(BDRVQcow2State *s, int64_t offset)
 return (offset >> s->cluster_bits) & (s->l2_slice_size - 1);
 }
 
+static inline int offset_to_sc_index(BDRVQcow2State *s, int64_t offset)
+{
+return (offset >> s->subcluster_bits) & (s->subclusters_per_cluster - 1);
+}
+
 static inline int64_t qcow2_vm_state_offset(BDRVQcow2State *s)
 {
 return (int64_t)s->l1_vm_state_index << (s->cluster_bits + s->l2_bits);
-- 
2.20.1




[RFC PATCH 18/23] qcow2: Add subcluster support to expand_zero_clusters_in_l1()

2019-10-15 Thread Alberto Garcia
Two changes are needed in order to add subcluster support to this
function: deallocated clusters must have their bitmaps cleared, and
expanded clusters must have all the "subcluster allocated" bits set.

Signed-off-by: Alberto Garcia 
---
 block/qcow2-cluster.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/block/qcow2-cluster.c b/block/qcow2-cluster.c
index bf32447d18..dc72f0e595 100644
--- a/block/qcow2-cluster.c
+++ b/block/qcow2-cluster.c
@@ -2033,6 +2033,7 @@ static int expand_zero_clusters_in_l1(BlockDriverState *bs, uint64_t *l1_table,
 /* not backed; therefore we can simply deallocate the
  * cluster */
 set_l2_entry(s, l2_slice, j, 0);
+set_l2_bitmap(s, l2_slice, j, 0);
 l2_dirty = true;
 continue;
 }
@@ -2099,6 +2100,7 @@ static int expand_zero_clusters_in_l1(BlockDriverState *bs, uint64_t *l1_table,
 } else {
 set_l2_entry(s, l2_slice, j, offset);
 }
+set_l2_bitmap(s, l2_slice, j, QCOW_L2_BITMAP_ALL_ALLOC);
 l2_dirty = true;
 }
 
-- 
2.20.1




[RFC PATCH 02/23] qcow2: Split cluster_needs_cow() out of count_cow_clusters()

2019-10-15 Thread Alberto Garcia
We are going to need it in other places.

Signed-off-by: Alberto Garcia 
---
 block/qcow2-cluster.c | 34 +++---
 1 file changed, 19 insertions(+), 15 deletions(-)

diff --git a/block/qcow2-cluster.c b/block/qcow2-cluster.c
index fe2523ed66..f462e169c0 100644
--- a/block/qcow2-cluster.c
+++ b/block/qcow2-cluster.c
@@ -1068,6 +1068,24 @@ static void calculate_l2_meta(BlockDriverState *bs, uint64_t host_offset,
 QLIST_INSERT_HEAD(&s->cluster_allocs, *m, next_in_flight);
 }
 
+/* Returns true if writing to a cluster requires COW */
+static bool cluster_needs_cow(BlockDriverState *bs, uint64_t l2_entry)
+{
+switch (qcow2_get_cluster_type(bs, l2_entry)) {
+case QCOW2_CLUSTER_NORMAL:
+if (l2_entry & QCOW_OFLAG_COPIED) {
+return false;
+}
+case QCOW2_CLUSTER_UNALLOCATED:
+case QCOW2_CLUSTER_COMPRESSED:
+case QCOW2_CLUSTER_ZERO_PLAIN:
+case QCOW2_CLUSTER_ZERO_ALLOC:
+return true;
+default:
+abort();
+}
+}
+
 /*
  * Returns the number of contiguous clusters that can be used for an allocating
  * write, but require COW to be performed (this includes yet unallocated space,
@@ -1080,25 +1098,11 @@ static int count_cow_clusters(BlockDriverState *bs, int nb_clusters,
 
 for (i = 0; i < nb_clusters; i++) {
 uint64_t l2_entry = be64_to_cpu(l2_slice[l2_index + i]);
-QCow2ClusterType cluster_type = qcow2_get_cluster_type(bs, l2_entry);
-
-switch(cluster_type) {
-case QCOW2_CLUSTER_NORMAL:
-if (l2_entry & QCOW_OFLAG_COPIED) {
-goto out;
-}
+if (!cluster_needs_cow(bs, l2_entry)) {
 break;
-case QCOW2_CLUSTER_UNALLOCATED:
-case QCOW2_CLUSTER_COMPRESSED:
-case QCOW2_CLUSTER_ZERO_PLAIN:
-case QCOW2_CLUSTER_ZERO_ALLOC:
-break;
-default:
-abort();
 }
 }
 
-out:
 assert(i <= nb_clusters);
 return i;
 }
-- 
2.20.1




[PATCH v2 04/21] iotests: Filter refcount_order in 036

2019-10-15 Thread Max Reitz
This test can run just fine with other values for refcount_bits, so we
should filter the value from qcow2.py's dump-header.  In fact, we can
filter everything but the feature bits and header extensions, because
that is what the test is about.

(036 currently ignores user-specified image options, but that will be
fixed in the next patch.)

Signed-off-by: Max Reitz 
---
 tests/qemu-iotests/036 |  9 ---
 tests/qemu-iotests/036.out | 48 --
 2 files changed, 6 insertions(+), 51 deletions(-)

diff --git a/tests/qemu-iotests/036 b/tests/qemu-iotests/036
index f06ff67408..5f929ad3be 100755
--- a/tests/qemu-iotests/036
+++ b/tests/qemu-iotests/036
@@ -55,7 +55,8 @@ $PYTHON qcow2.py "$TEST_IMG" set-feature-bit incompatible 63
 
 # Without feature table
 $PYTHON qcow2.py "$TEST_IMG" del-header-ext 0x6803f857
-$PYTHON qcow2.py "$TEST_IMG" dump-header
+$PYTHON qcow2.py "$TEST_IMG" dump-header | grep features
+$PYTHON qcow2.py "$TEST_IMG" dump-header-exts
 _img_info
 
 # With feature table containing bit 63
@@ -103,14 +104,16 @@ echo === Create image with unknown autoclear feature bit ===
 echo
 _make_test_img 64M
 $PYTHON qcow2.py "$TEST_IMG" set-feature-bit autoclear 63
-$PYTHON qcow2.py "$TEST_IMG" dump-header
+$PYTHON qcow2.py "$TEST_IMG" dump-header | grep features
+$PYTHON qcow2.py "$TEST_IMG" dump-header-exts
 
 echo
 echo === Repair image ===
 echo
 _check_test_img -r all
 
-$PYTHON qcow2.py "$TEST_IMG" dump-header
+$PYTHON qcow2.py "$TEST_IMG" dump-header | grep features
+$PYTHON qcow2.py "$TEST_IMG" dump-header-exts
 
 # success, all done
 echo "*** done"
diff --git a/tests/qemu-iotests/036.out b/tests/qemu-iotests/036.out
index 15229a9604..0b52b934e1 100644
--- a/tests/qemu-iotests/036.out
+++ b/tests/qemu-iotests/036.out
@@ -3,25 +3,9 @@ QA output created by 036
 === Image with unknown incompatible feature bit ===
 
 Formatting 'TEST_DIR/t.IMGFMT', fmt=IMGFMT size=67108864
-magic 0x514649fb
-version   3
-backing_file_offset   0x0
-backing_file_size 0x0
-cluster_bits  16
-size  67108864
-crypt_method  0
-l1_size   1
-l1_table_offset   0x3
-refcount_table_offset 0x1
-refcount_table_clusters   1
-nb_snapshots  0
-snapshot_offset   0x0
 incompatible_features [63]
 compatible_features   []
 autoclear_features[]
-refcount_order4
-header_length 104
-
 qemu-img: Could not open 'TEST_DIR/t.IMGFMT': Unsupported IMGFMT feature(s): Unknown incompatible feature: 8000
 qemu-img: Could not open 'TEST_DIR/t.IMGFMT': Unsupported IMGFMT feature(s): Test feature
 
@@ -37,25 +21,9 @@ qemu-img: Could not open 'TEST_DIR/t.IMGFMT': Unsupported IMGFMT feature(s): tes
 === Create image with unknown autoclear feature bit ===
 
 Formatting 'TEST_DIR/t.IMGFMT', fmt=IMGFMT size=67108864
-magic 0x514649fb
-version   3
-backing_file_offset   0x0
-backing_file_size 0x0
-cluster_bits  16
-size  67108864
-crypt_method  0
-l1_size   1
-l1_table_offset   0x3
-refcount_table_offset 0x1
-refcount_table_clusters   1
-nb_snapshots  0
-snapshot_offset   0x0
 incompatible_features []
 compatible_features   []
 autoclear_features[63]
-refcount_order4
-header_length 104
-
 Header extension:
 magic 0x6803f857
 length192
@@ -65,25 +33,9 @@ data  
 === Repair image ===
 
 No errors were found on the image.
-magic 0x514649fb
-version   3
-backing_file_offset   0x0
-backing_file_size 0x0
-cluster_bits  16
-size  67108864
-crypt_method  0
-l1_size   1
-l1_table_offset   0x3
-refcount_table_offset 0x1
-refcount_table_clusters   1
-nb_snapshots  0
-snapshot_offset   0x0
 incompatible_features []
 compatible_features   []
 autoclear_features[]
-refcount_order4
-header_length 104
-
 Header extension:
 magic 0x6803f857
 length192
-- 
2.21.0




[PATCH v2 13/21] iotests: Avoid qemu-img create

2019-10-15 Thread Max Reitz
Use _make_test_img whenever possible.  This way, we will not ignore
user-specified image options.

Signed-off-by: Max Reitz 
Reviewed-by: Maxim Levitsky 
---
 tests/qemu-iotests/094 | 2 +-
 tests/qemu-iotests/111 | 3 +--
 tests/qemu-iotests/123 | 2 +-
 tests/qemu-iotests/153 | 2 +-
 tests/qemu-iotests/200 | 4 ++--
 5 files changed, 6 insertions(+), 7 deletions(-)

diff --git a/tests/qemu-iotests/094 b/tests/qemu-iotests/094
index 9343e09492..d645952d54 100755
--- a/tests/qemu-iotests/094
+++ b/tests/qemu-iotests/094
@@ -45,7 +45,7 @@ _supported_proto nbd
 _unsupported_imgopts "subformat=monolithicFlat" "subformat=twoGbMaxExtentFlat"
 
 _make_test_img 64M
-$QEMU_IMG create -f $IMGFMT "$TEST_DIR/source.$IMGFMT" 64M | _filter_img_create
+TEST_IMG_FILE="$TEST_DIR/source.$IMGFMT" IMGPROTO=file _make_test_img 64M
 
 _launch_qemu -drive if=none,id=src,file="$TEST_DIR/source.$IMGFMT",format=raw \
  -nodefaults
diff --git a/tests/qemu-iotests/111 b/tests/qemu-iotests/111
index 490a5bbcb5..3b43d1bd83 100755
--- a/tests/qemu-iotests/111
+++ b/tests/qemu-iotests/111
@@ -41,8 +41,7 @@ _supported_fmt qed qcow qcow2 vmdk
 _supported_proto file
 _unsupported_imgopts "subformat=monolithicFlat" "subformat=twoGbMaxExtentFlat"
 
-$QEMU_IMG create -f $IMGFMT -b "$TEST_IMG.inexistent" "$TEST_IMG" 2>&1 \
-| _filter_testdir | _filter_imgfmt
+_make_test_img -b "$TEST_IMG.inexistent"
 
 # success, all done
 echo '*** done'
diff --git a/tests/qemu-iotests/123 b/tests/qemu-iotests/123
index d33950eb54..74d40d0478 100755
--- a/tests/qemu-iotests/123
+++ b/tests/qemu-iotests/123
@@ -44,7 +44,7 @@ _supported_os Linux
 SRC_IMG="$TEST_DIR/source.$IMGFMT"
 
 _make_test_img 1M
-$QEMU_IMG create -f $IMGFMT "$SRC_IMG" 1M | _filter_img_create
+TEST_IMG_FILE=$SRC_IMG IMGPROTO=file _make_test_img 1M
 
 $QEMU_IO -c 'write -P 42 0 1M' "$SRC_IMG" | _filter_qemu_io
 
diff --git a/tests/qemu-iotests/153 b/tests/qemu-iotests/153
index c969a1a16f..e59090259c 100755
--- a/tests/qemu-iotests/153
+++ b/tests/qemu-iotests/153
@@ -98,7 +98,7 @@ for opts1 in "" "read-only=on" "read-only=on,force-share=on"; do
 
 echo
 echo "== Creating test image =="
-$QEMU_IMG create -f $IMGFMT "${TEST_IMG}" -b ${TEST_IMG}.base | _filter_img_create
+_make_test_img -b "${TEST_IMG}.base"
 
 echo
 echo "== Launching QEMU, opts: '$opts1' =="
diff --git a/tests/qemu-iotests/200 b/tests/qemu-iotests/200
index 72d431f251..d904885136 100755
--- a/tests/qemu-iotests/200
+++ b/tests/qemu-iotests/200
@@ -46,8 +46,8 @@ _supported_proto file
 BACKING_IMG="${TEST_DIR}/backing.img"
 TEST_IMG="${TEST_DIR}/test.img"
 
-${QEMU_IMG} create -f $IMGFMT "${BACKING_IMG}" 512M | _filter_img_create
-${QEMU_IMG} create -f $IMGFMT -F $IMGFMT "${TEST_IMG}" -b "${BACKING_IMG}" 512M | _filter_img_create
+TEST_IMG="$BACKING_IMG" _make_test_img 512M
+_make_test_img -F $IMGFMT -b "$BACKING_IMG" 512M
 
 ${QEMU_IO} -c "write -P 0xa5 512 300M" "${BACKING_IMG}" | _filter_qemu_io
 
-- 
2.21.0




[PATCH v2 14/21] iotests: Use _rm_test_img for deleting test images

2019-10-15 Thread Max Reitz
Just rm will not delete external data files.  Use _rm_test_img every
time we delete a test image.

(In the process, clean up the indentation of every _cleanup() this patch
touches.)

((Also, use quotes consistently.  I am happy to see unquoted instances
like "rm -rf $TEST_DIR/..." go.))

Signed-off-by: Max Reitz 
---
 tests/qemu-iotests/005 |  2 +-
 tests/qemu-iotests/019 |  6 +++---
 tests/qemu-iotests/020 |  6 +++---
 tests/qemu-iotests/024 | 10 +-
 tests/qemu-iotests/028 |  2 +-
 tests/qemu-iotests/029 |  2 +-
 tests/qemu-iotests/043 |  4 +++-
 tests/qemu-iotests/048 |  2 +-
 tests/qemu-iotests/050 |  4 ++--
 tests/qemu-iotests/053 |  4 ++--
 tests/qemu-iotests/058 |  2 +-
 tests/qemu-iotests/059 |  2 +-
 tests/qemu-iotests/061 |  2 +-
 tests/qemu-iotests/063 |  6 --
 tests/qemu-iotests/069 |  2 +-
 tests/qemu-iotests/074 |  2 +-
 tests/qemu-iotests/080 |  2 +-
 tests/qemu-iotests/081 |  6 +++---
 tests/qemu-iotests/085 |  9 ++---
 tests/qemu-iotests/088 |  2 +-
 tests/qemu-iotests/092 |  2 +-
 tests/qemu-iotests/094 |  2 +-
 tests/qemu-iotests/095 |  5 +++--
 tests/qemu-iotests/099 |  7 ---
 tests/qemu-iotests/109 |  4 ++--
 tests/qemu-iotests/110 |  4 ++--
 tests/qemu-iotests/122 |  6 --
 tests/qemu-iotests/123 |  2 +-
 tests/qemu-iotests/141 |  4 +++-
 tests/qemu-iotests/142 |  2 +-
 tests/qemu-iotests/144 |  4 +++-
 tests/qemu-iotests/153 | 10 +++---
 tests/qemu-iotests/156 |  8 ++--
 tests/qemu-iotests/159 |  2 +-
 tests/qemu-iotests/160 |  3 ++-
 tests/qemu-iotests/161 |  4 ++--
 tests/qemu-iotests/170 |  2 +-
 tests/qemu-iotests/172 |  6 +++---
 tests/qemu-iotests/173 |  3 ++-
 tests/qemu-iotests/178 |  2 +-
 tests/qemu-iotests/182 |  2 +-
 tests/qemu-iotests/183 |  2 +-
 tests/qemu-iotests/185 |  4 ++--
 tests/qemu-iotests/187 |  6 +++---
 tests/qemu-iotests/190 |  2 +-
 tests/qemu-iotests/191 |  6 +++---
 tests/qemu-iotests/195 |  2 +-
 tests/qemu-iotests/197 |  2 +-
 tests/qemu-iotests/200 |  3 ++-
 tests/qemu-iotests/215 |  2 +-
 tests/qemu-iotests/225 |  2 +-
 tests/qemu-iotests/229 |  3 ++-
 tests/qemu-iotests/232 |  4 +++-
 tests/qemu-iotests/243 |  2 +-
 tests/qemu-iotests/244 |  4 ++--
 tests/qemu-iotests/247 |  4 +++-
 tests/qemu-iotests/249 |  4 ++--
 tests/qemu-iotests/252 |  2 +-
 58 files changed, 119 insertions(+), 96 deletions(-)

diff --git a/tests/qemu-iotests/005 b/tests/qemu-iotests/005
index 58442762fe..2b651f2c37 100755
--- a/tests/qemu-iotests/005
+++ b/tests/qemu-iotests/005
@@ -62,7 +62,7 @@ if [ "$IMGFMT" = "raw" ]; then
 if ! truncate --size=5T "$TEST_IMG"; then
 _notrun "file system on $TEST_DIR does not support large enough files"
 fi
-rm "$TEST_IMG"
+_rm_test_img "$TEST_IMG"
 fi
 
 echo
diff --git a/tests/qemu-iotests/019 b/tests/qemu-iotests/019
index b4f5234609..813a84acac 100755
--- a/tests/qemu-iotests/019
+++ b/tests/qemu-iotests/019
@@ -30,9 +30,9 @@ status=1  # failure is the default!
 
 _cleanup()
 {
-   _cleanup_test_img
-rm -f "$TEST_IMG.base"
-rm -f "$TEST_IMG.orig"
+_cleanup_test_img
+_rm_test_img "$TEST_IMG.base"
+_rm_test_img "$TEST_IMG.orig"
 }
 trap "_cleanup; exit \$status" 0 1 2 3 15
 
diff --git a/tests/qemu-iotests/020 b/tests/qemu-iotests/020
index f41b92f35f..20f8f185d0 100755
--- a/tests/qemu-iotests/020
+++ b/tests/qemu-iotests/020
@@ -28,9 +28,9 @@ status=1  # failure is the default!
 
 _cleanup()
 {
-   _cleanup_test_img
-rm -f "$TEST_IMG.base"
-rm -f "$TEST_IMG.orig"
+_cleanup_test_img
+_rm_test_img "$TEST_IMG.base"
+_rm_test_img "$TEST_IMG.orig"
 }
 trap "_cleanup; exit \$status" 0 1 2 3 15
 
diff --git a/tests/qemu-iotests/024 b/tests/qemu-iotests/024
index 23298c6f59..e2e766241e 100755
--- a/tests/qemu-iotests/024
+++ b/tests/qemu-iotests/024
@@ -29,12 +29,12 @@ status=1# failure is the default!
 _cleanup()
 {
 _cleanup_test_img
-rm -f "$TEST_DIR/t.$IMGFMT.base_old"
-rm -f "$TEST_DIR/t.$IMGFMT.base_new"
+_rm_test_img "$TEST_DIR/t.$IMGFMT.base_old"
+_rm_test_img "$TEST_DIR/t.$IMGFMT.base_new"
 
-rm -f "$TEST_DIR/subdir/t.$IMGFMT"
-rm -f "$TEST_DIR/subdir/t.$IMGFMT.base_old"
-rm -f "$TEST_DIR/subdir/t.$IMGFMT.base_new"
+_rm_test_img "$TEST_DIR/subdir/t.$IMGFMT"
+_rm_test_img "$TEST_DIR/subdir/t.$IMGFMT.base_old"
+_rm_test_img "$TEST_DIR/subdir/t.$IMGFMT.base_new"
 rmdir "$TEST_DIR/subdir" 2> /dev/null
 }
 trap "_cleanup; exit \$status" 0 1 2 3 15
diff --git a/tests/qemu-iotests/028 b/tests/qemu-iotests/028
index 71301ec6e5..caf1258647 100755
--- a/tests/qemu-iotests/028
+++ b/tests/qemu-iotests/028
@@ -32,7 +32,7 @@ status=1  # failure is the default!
 _cleanup()
 {
 _cleanup_qemu
-rm -f "${TEST_IMG}.copy"
+_rm_test_img "${TEST_IMG}.copy"
 _cleanup_test_img
 }
 trap "_cleanup; exit \$status" 0 1 2 3 15
diff --git a/tests/qemu-iotests/029 b/tests/qemu-iotests/029
index 94c2713132..9254ede5e5 100755
--- a/tests/qemu-iotests/029
+++ b/tests/qemu-iotests/029
@@ 

[PATCH v2 10/21] iotests: Replace IMGOPTS= by -o

2019-10-15 Thread Max Reitz
Tests should not overwrite all user-supplied image options, but only add
to them (which will effectively overwrite conflicting values).  Accomplish
this by passing options to _make_test_img via -o instead of $IMGOPTS.

For some tests, there is no functional change because they already only
appended options to IMGOPTS.  For these, this patch is just a
simplification.

For others, this is a change, so they now heed user-specified $IMGOPTS.
Some of those tests do not work with all image options, though, so we
need to disable them accordingly.

Signed-off-by: Max Reitz 
---
 tests/qemu-iotests/031 |  9 ---
 tests/qemu-iotests/039 | 24 ++
 tests/qemu-iotests/059 | 18 ++---
 tests/qemu-iotests/060 |  6 ++---
 tests/qemu-iotests/061 | 57 ++
 tests/qemu-iotests/079 |  3 +--
 tests/qemu-iotests/106 |  2 +-
 tests/qemu-iotests/108 |  2 +-
 tests/qemu-iotests/112 | 32 
 tests/qemu-iotests/115 |  3 +--
 tests/qemu-iotests/121 |  6 ++---
 tests/qemu-iotests/125 |  2 +-
 tests/qemu-iotests/137 |  2 +-
 tests/qemu-iotests/138 |  3 +--
 tests/qemu-iotests/175 |  2 +-
 tests/qemu-iotests/190 |  2 +-
 tests/qemu-iotests/191 |  3 +--
 tests/qemu-iotests/220 |  4 ++-
 tests/qemu-iotests/243 |  6 +++--
 tests/qemu-iotests/244 | 10 +---
 tests/qemu-iotests/250 |  3 +--
 tests/qemu-iotests/265 |  2 +-
 22 files changed, 100 insertions(+), 101 deletions(-)

diff --git a/tests/qemu-iotests/031 b/tests/qemu-iotests/031
index a3c25ec237..c44fcf91bb 100755
--- a/tests/qemu-iotests/031
+++ b/tests/qemu-iotests/031
@@ -40,19 +40,22 @@ trap "_cleanup; exit \$status" 0 1 2 3 15
 # This tests qcow2-specific low-level functionality
 _supported_fmt qcow2
 _supported_proto file
+# We want to test compat=0.10, which does not support refcount widths
+# other than 16
+_unsupported_imgopts 'refcount_bits=\([^1]\|.\([^6]\|$\)\)'
 
 CLUSTER_SIZE=65536
 
 # qcow2.py output depends on the exact options used, so override the command
 # line here as an exception
-for IMGOPTS in "compat=0.10" "compat=1.1"; do
+for compat in "compat=0.10" "compat=1.1"; do
 
 echo
-echo = Testing with -o $IMGOPTS =
+echo = Testing with -o $compat =
 echo
 echo === Create image with unknown header extension ===
 echo
-_make_test_img 64M
+_make_test_img -o $compat 64M
 $PYTHON qcow2.py "$TEST_IMG" add-header-ext 0x12345678 "This is a test header extension"
 $PYTHON qcow2.py "$TEST_IMG" dump-header
 _check_test_img
diff --git a/tests/qemu-iotests/039 b/tests/qemu-iotests/039
index 325da63a4c..99563bf126 100755
--- a/tests/qemu-iotests/039
+++ b/tests/qemu-iotests/039
@@ -50,8 +50,7 @@ size=128M
 echo
 echo "== Checking that image is clean on shutdown =="
 
-IMGOPTS="compat=1.1,lazy_refcounts=on"
-_make_test_img $size
+_make_test_img -o "compat=1.1,lazy_refcounts=on" $size
 
 $QEMU_IO -c "write -P 0x5a 0 512" "$TEST_IMG" | _filter_qemu_io
 
@@ -62,8 +61,7 @@ _check_test_img
 echo
 echo "== Creating a dirty image file =="
 
-IMGOPTS="compat=1.1,lazy_refcounts=on"
-_make_test_img $size
+_make_test_img -o "compat=1.1,lazy_refcounts=on" $size
 
 _NO_VALGRIND \
 $QEMU_IO -c "write -P 0x5a 0 512" \
@@ -98,8 +96,7 @@ $QEMU_IO -c "read -P 0x5a 0 512" "$TEST_IMG" | _filter_qemu_io
 echo
 echo "== Opening a dirty image read/write should repair it =="
 
-IMGOPTS="compat=1.1,lazy_refcounts=on"
-_make_test_img $size
+_make_test_img -o "compat=1.1,lazy_refcounts=on" $size
 
 _NO_VALGRIND \
 $QEMU_IO -c "write -P 0x5a 0 512" \
@@ -117,8 +114,7 @@ $PYTHON qcow2.py "$TEST_IMG" dump-header | grep incompatible_features
 echo
 echo "== Creating an image file with lazy_refcounts=off =="
 
-IMGOPTS="compat=1.1,lazy_refcounts=off"
-_make_test_img $size
+_make_test_img -o "compat=1.1,lazy_refcounts=off" $size
 
 _NO_VALGRIND \
 $QEMU_IO -c "write -P 0x5a 0 512" \
@@ -132,11 +128,9 @@ _check_test_img
 echo
 echo "== Committing to a backing file with lazy_refcounts=on =="
 
-IMGOPTS="compat=1.1,lazy_refcounts=on"
-TEST_IMG="$TEST_IMG".base _make_test_img $size
+TEST_IMG="$TEST_IMG".base _make_test_img -o "compat=1.1,lazy_refcounts=on" $size
 
-IMGOPTS="compat=1.1,lazy_refcounts=on,backing_file=$TEST_IMG.base"
-_make_test_img $size
+_make_test_img -o "compat=1.1,lazy_refcounts=on,backing_file=$TEST_IMG.base" $size
 
 $QEMU_IO -c "write 0 512" "$TEST_IMG" | _filter_qemu_io
 $QEMU_IMG commit "$TEST_IMG"
@@ -151,8 +145,7 @@ TEST_IMG="$TEST_IMG".base _check_test_img
 echo
 echo "== Changing lazy_refcounts setting at runtime =="
 
-IMGOPTS="compat=1.1,lazy_refcounts=off"
-_make_test_img $size
+_make_test_img -o "compat=1.1,lazy_refcounts=off" $size
 
 _NO_VALGRIND \
 $QEMU_IO -c "reopen -o lazy-refcounts=on" \
@@ -164,8 +157,7 @@ $QEMU_IO -c "reopen -o lazy-refcounts=on" \
 $PYTHON qcow2.py "$TEST_IMG" dump-header | grep incompatible_features
 _check_test_img
 
-IMGOPTS="compat=1.1,lazy_refcounts=on"
-_make_test_img $size
+_make_test_img -o 

[PATCH v2 20/21] iotests: Disable data_file where it cannot be used

2019-10-15 Thread Max Reitz
Signed-off-by: Max Reitz 
---
 tests/qemu-iotests/007 | 5 +++--
 tests/qemu-iotests/014 | 2 ++
 tests/qemu-iotests/015 | 5 +++--
 tests/qemu-iotests/026 | 5 -
 tests/qemu-iotests/029 | 5 +++--
 tests/qemu-iotests/031 | 6 +++---
 tests/qemu-iotests/036 | 5 +++--
 tests/qemu-iotests/039 | 3 +++
 tests/qemu-iotests/046 | 2 ++
 tests/qemu-iotests/048 | 2 ++
 tests/qemu-iotests/051 | 5 +++--
 tests/qemu-iotests/058 | 5 +++--
 tests/qemu-iotests/060 | 6 --
 tests/qemu-iotests/061 | 6 --
 tests/qemu-iotests/062 | 2 +-
 tests/qemu-iotests/066 | 2 +-
 tests/qemu-iotests/067 | 6 --
 tests/qemu-iotests/068 | 5 +++--
 tests/qemu-iotests/071 | 3 +++
 tests/qemu-iotests/073 | 2 ++
 tests/qemu-iotests/074 | 2 ++
 tests/qemu-iotests/080 | 5 +++--
 tests/qemu-iotests/090 | 2 ++
 tests/qemu-iotests/098 | 6 --
 tests/qemu-iotests/099 | 3 ++-
 tests/qemu-iotests/103 | 5 +++--
 tests/qemu-iotests/108 | 6 --
 tests/qemu-iotests/112 | 5 +++--
 tests/qemu-iotests/114 | 2 ++
 tests/qemu-iotests/121 | 3 +++
 tests/qemu-iotests/138 | 2 ++
 tests/qemu-iotests/156 | 2 ++
 tests/qemu-iotests/176 | 7 +--
 tests/qemu-iotests/191 | 2 ++
 tests/qemu-iotests/201 | 6 +++---
 tests/qemu-iotests/214 | 3 ++-
 tests/qemu-iotests/217 | 3 ++-
 tests/qemu-iotests/220 | 5 +++--
 tests/qemu-iotests/243 | 6 --
 tests/qemu-iotests/244 | 5 +++--
 tests/qemu-iotests/250 | 2 ++
 tests/qemu-iotests/267 | 5 +++--
 42 files changed, 117 insertions(+), 52 deletions(-)

diff --git a/tests/qemu-iotests/007 b/tests/qemu-iotests/007
index 7d3544b479..160683adf8 100755
--- a/tests/qemu-iotests/007
+++ b/tests/qemu-iotests/007
@@ -41,8 +41,9 @@ trap "_cleanup; exit \$status" 0 1 2 3 15
 _supported_fmt qcow2
 _supported_proto generic
 # refcount_bits must be at least 4 so we can create ten internal snapshots
-# (1 bit supports none, 2 bits support two, 4 bits support 14)
-_unsupported_imgopts 'refcount_bits=\(1\|2\)[^0-9]'
+# (1 bit supports none, 2 bits support two, 4 bits support 14);
+# snapshots are generally impossible with external data files
+_unsupported_imgopts 'refcount_bits=\(1\|2\)[^0-9]' data_file
 
 echo
 echo "creating image"
diff --git a/tests/qemu-iotests/014 b/tests/qemu-iotests/014
index 2f728a1956..e1221c0fff 100755
--- a/tests/qemu-iotests/014
+++ b/tests/qemu-iotests/014
@@ -43,6 +43,8 @@ trap "_cleanup; exit \$status" 0 1 2 3 15
 _supported_fmt qcow2
 _supported_proto file
 _supported_os Linux
+# Compression and snapshots do not work with external data files
+_unsupported_imgopts data_file
 
 TEST_OFFSETS="0 4294967296"
 TEST_OPS="writev read write readv"
diff --git a/tests/qemu-iotests/015 b/tests/qemu-iotests/015
index eec5387f3d..4d8effd0ae 100755
--- a/tests/qemu-iotests/015
+++ b/tests/qemu-iotests/015
@@ -40,8 +40,9 @@ trap "_cleanup; exit \$status" 0 1 2 3 15
 # actually any format that supports snapshots
 _supported_fmt qcow2
 _supported_proto generic
-# Internal snapshots are (currently) impossible with refcount_bits=1
-_unsupported_imgopts 'refcount_bits=1[^0-9]'
+# Internal snapshots are (currently) impossible with refcount_bits=1,
+# and generally impossible with external data files
+_unsupported_imgopts 'refcount_bits=1[^0-9]' data_file
 
 echo
 echo "creating image"
diff --git a/tests/qemu-iotests/026 b/tests/qemu-iotests/026
index 3430029ed6..a4aa74764f 100755
--- a/tests/qemu-iotests/026
+++ b/tests/qemu-iotests/026
@@ -49,7 +49,10 @@ _supported_cache_modes writethrough none
 # 32 and 64 bits do not work either, however, due to different leaked cluster
 # count on error.
 # Thus, the only remaining option is refcount_bits=16.
-_unsupported_imgopts 'refcount_bits=\([^1]\|.\([^6]\|$\)\)'
+#
+# As for data_file, none of the refcount tests can work for it.
+_unsupported_imgopts 'refcount_bits=\([^1]\|.\([^6]\|$\)\)' \
+data_file
 
 echo "Errors while writing 128 kB"
 echo
diff --git a/tests/qemu-iotests/029 b/tests/qemu-iotests/029
index 9254ede5e5..2161a4b87a 100755
--- a/tests/qemu-iotests/029
+++ b/tests/qemu-iotests/029
@@ -42,8 +42,9 @@ trap "_cleanup; exit \$status" 0 1 2 3 15
 _supported_fmt qcow2
 _supported_proto generic
 _unsupported_proto vxhs
-# Internal snapshots are (currently) impossible with refcount_bits=1
-_unsupported_imgopts 'refcount_bits=1[^0-9]'
+# Internal snapshots are (currently) impossible with refcount_bits=1,
+# and generally impossible with external data files
+_unsupported_imgopts 'refcount_bits=1[^0-9]' data_file
 
 offset_size=24
 offset_l1_size=36
diff --git a/tests/qemu-iotests/031 b/tests/qemu-iotests/031
index c44fcf91bb..646ecd593f 100755
--- a/tests/qemu-iotests/031
+++ b/tests/qemu-iotests/031
@@ -40,9 +40,9 @@ trap "_cleanup; exit \$status" 0 1 2 3 15
 # This tests qcow2-specific low-level functionality
 _supported_fmt qcow2
 _supported_proto file
-# We want to test compat=0.10, which does not support refcount widths
-# other than 16
-_unsupported_imgopts 'refcount_bits=\([^1]\|.\([^6]\|$\)\)'
+# We want to test compat=0.10, which does not support external 

Re: [PATCH v2 1/2] nbd: Don't send oversize strings

2019-10-15 Thread Eric Blake

On 10/11/19 2:32 AM, Vladimir Sementsov-Ogievskiy wrote:

11.10.2019 0:00, Eric Blake wrote:

Qemu as server currently won't accept export names larger than 256
bytes, nor create dirty bitmap names longer than 1023 bytes, so most
uses of qemu as client or server have no reason to get anywhere near
the NBD spec maximum of a 4k limit per string.

However, we weren't actually enforcing things, ignoring when the
remote side violates the protocol on input, and also having several
code paths where we send oversize strings on output (for example,
qemu-nbd --description could easily send more than 4k).  Tighten
things up as follows:

client:
- Perform bounds check on export name and dirty bitmap request prior
to handing it to server
- Validate that copied server replies are not too long (ignoring
NBD_INFO_* replies that are not copied is not too bad)
server:
- Perform bounds check on export name and description prior to
advertising it to client
- Reject client name or metadata query that is too long
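
The checks listed above all reduce to comparing a length against the
4k per-string limit before sending or copying. A minimal sketch,
assuming hypothetical helper names (nbd_string_fits() and
nbd_reply_len_ok() are illustrations, not functions from this patch):

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* NBD spec limit per string, as discussed in this thread (4k). */
#define NBD_MAX_STRING_SIZE 4096

/*
 * Hypothetical helper: true if a NUL-terminated string (export name,
 * description, metadata query) is short enough to be sent.
 */
static bool nbd_string_fits(const char *str)
{
    assert(str != NULL);
    return strlen(str) <= NBD_MAX_STRING_SIZE;
}

/*
 * Hypothetical helper: validate a length field received from the peer
 * before copying. 'remaining' is the number of payload bytes left in
 * the reply; 'claimed_len' is the string length the peer announced.
 */
static bool nbd_reply_len_ok(size_t remaining, size_t claimed_len)
{
    return claimed_len <= remaining && claimed_len <= NBD_MAX_STRING_SIZE;
}
```

The sender-side check guards what we put on the wire; the receiver-side
check rejects both a length that overruns the reply and one that
exceeds the protocol limit.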

Signed-off-by: Eric Blake 
---



+++ b/include/block/nbd.h
@@ -232,6 +232,7 @@ enum {
* going larger would require an audit of more code to make sure we
* aren't overflowing some other buffer. */


This comment says, that we restrict export name to 256...


Yes, because we still stack-allocate the name in places, but 4k is too 
large for stack allocation.  But we're inconsistent on where we use the 
smaller 256-limit; the server won't serve an image that large, but 
doesn't prevent a client from requesting a 4k name export (even though 
that export will not be present).




+++ b/blockdev-nbd.c
@@ -162,6 +162,11 @@ void qmp_nbd_server_add(const char *device, bool has_name, const char *name,
   name = device;
   }

+if (strlen(name) > NBD_MAX_STRING_SIZE) {
+error_setg(errp, "export name '%s' too long", name);
+return;
+}


Hmm, no: here we restrict the name to 4096 bytes, but we will not allow a
client to request more than 256. It seems that, to update the server part
correctly, we should drop NBD_MAX_NAME_SIZE and do the audit mentioned in
the comment above its definition.


Yeah, I guess it's time to just get rid of NBD_MAX_NAME_SIZE, and move 
away from stack allocations.  Should I do that as a followup to this 
patch, or spin a v3?



+++ b/nbd/client.c
@@ -289,8 +289,8 @@ static int nbd_receive_list(QIOChannel *ioc, char **name, char **description,
   return -1;
   }
   len -= sizeof(namelen);
-if (len < namelen) {
-error_setg(errp, "incorrect option name length");
+if (len < namelen || namelen > NBD_MAX_STRING_SIZE) {
+error_setg(errp, "incorrect list name length");


The new wording made me go back up and read the comment describing what the
function does. The comment is good, but without it, this sounds like the
name of the list to me...


Maybe:

incorrect name length in server's list response




   nbd_send_opt_abort(ioc);
   return -1;
   }
@@ -303,6 +303,11 @@ static int nbd_receive_list(QIOChannel *ioc, char **name, char **description,
   local_name[namelen] = '\0';
   len -= namelen;
   if (len) {
+if (len > NBD_MAX_STRING_SIZE) {
+error_setg(errp, "incorrect list description length");


and

incorrect description length in server's list response



@@ -648,6 +657,7 @@ static int nbd_send_meta_query(QIOChannel *ioc, uint32_t opt,
   if (query) {
   query_len = strlen(query);
   data_len += sizeof(query_len) + query_len;
+assert(query_len <= NBD_MAX_STRING_SIZE);
   } else {
   assert(opt == NBD_OPT_LIST_META_CONTEXT);
   }


you may assert export_len as well..


It was asserted earlier, but doing it again might not hurt, especially 
if I do the followup patch getting rid of NBD_MAX_NAME_SIZE




@@ -1561,6 +1569,8 @@ NBDExport *nbd_export_new(BlockDriverState *bs, uint64_t dev_offset,
   exp->export_bitmap = bm;
   exp->export_bitmap_context = g_strdup_printf("qemu:dirty-bitmap:%s",
bitmap);
+/* See BME_MAX_NAME_SIZE in block/qcow2-bitmap.c */


Hmm. BME_MAX_NAME_SIZE is checked only when creating persistent bitmaps; for
non-persistent bitmaps the name length is actually unlimited. So we should
either limit all bitmap names to 1023 (hopefully this will not break existing
scenarios) or report an error here (or earlier) instead of asserting.


I'm leaning towards limiting ALL bitmaps to the same length (as we've 
already debated the idea of being able to convert an existing bitmap 
from transient to persistent).




We may also want
QEMU_BUILD_BUG_ON(NBD_MAX_STRING_SIZE < BME_MAX_NAME_SIZE + sizeof("qemu:dirty-bitmap:") - 1)


Except that BME_MAX_NAME_SIZE is not (currently) in a public .h file.
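
For illustration only, the suggested build-time check could look like
the sketch below. It uses plain C11 static_assert as a stand-in for
QEMU_BUILD_BUG_ON(), and redefines both limits locally with the values
quoted in this thread (4096 and 1023); max_bitmap_context_len() is a
hypothetical helper added just to make the arithmetic visible.

```c
#include <assert.h>
#include <stddef.h>

/* Values as quoted in this thread; redefined here only for the sketch. */
#define NBD_MAX_STRING_SIZE 4096
#define BME_MAX_NAME_SIZE   1023
#define DIRTY_BITMAP_PREFIX "qemu:dirty-bitmap:"

/* Fails the build if the longest context name could exceed the limit. */
static_assert(BME_MAX_NAME_SIZE + sizeof(DIRTY_BITMAP_PREFIX) - 1
              <= NBD_MAX_STRING_SIZE,
              "longest dirty-bitmap context name must fit in an NBD string");

/* Hypothetical helper: length of the longest possible context name. */
static inline size_t max_bitmap_context_len(void)
{
    /* sizeof() counts the trailing NUL, hence the -1. */
    return BME_MAX_NAME_SIZE + sizeof(DIRTY_BITMAP_PREFIX) - 1;
}
```

With these values the longest context name is 1041 bytes, well under
the 4k limit, so the check would only trip if one of the limits changed.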

--
Eric Blake, Principal Software Engineer
Red Hat, Inc.   +1-919-301-3226
Virtualization:  qemu.org | libvirt.org



[RFC PATCH 09/23] qcow2: Add l2_entry_size()

2019-10-15 Thread Alberto Garcia
qcow2 images with subclusters have 128-bit L2 entries. The first 64
bits contain the same information as traditional images and the last
64 bits form a bitmap with the status of each individual subcluster.

Because of that we cannot assume that L2 entries are sizeof(uint64_t)
anymore. This function returns the proper value for the image.
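
As a rough sketch of the idea (not the code from this patch): the entry
size depends on a per-image property, and every table size computation
multiplies by it. Qcow2Sketch and l2_table_bytes() are hypothetical
stand-ins for BDRVQcow2State and the call sites changed below.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for the relevant part of BDRVQcow2State. */
typedef struct {
    bool extended_l2;   /* true if the image has subclusters */
    int l2_size;        /* number of entries in one L2 table */
} Qcow2Sketch;

/* 64 bits for traditional entries, 128 bits for extended ones. */
static inline size_t l2_entry_size(const Qcow2Sketch *s)
{
    return s->extended_l2 ? 2 * sizeof(uint64_t) : sizeof(uint64_t);
}

/* Size in bytes of a whole L2 table for this image. */
static inline size_t l2_table_bytes(const Qcow2Sketch *s)
{
    assert(s->l2_size > 0);
    return (size_t)s->l2_size * l2_entry_size(s);
}
```

For a fixed table size in bytes, an image with extended entries holds
half as many entries per table as a traditional one.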

Signed-off-by: Alberto Garcia 
---
 block/qcow2-cluster.c  | 12 ++--
 block/qcow2-refcount.c | 14 --
 block/qcow2.c  |  6 +++---
 block/qcow2.h  |  5 +
 4 files changed, 22 insertions(+), 15 deletions(-)

diff --git a/block/qcow2-cluster.c b/block/qcow2-cluster.c
index b2045d51bf..67f90e415d 100644
--- a/block/qcow2-cluster.c
+++ b/block/qcow2-cluster.c
@@ -209,7 +209,7 @@ static int l2_load(BlockDriverState *bs, uint64_t offset,
uint64_t l2_offset, uint64_t **l2_slice)
 {
 BDRVQcow2State *s = bs->opaque;
-int start_of_slice = sizeof(uint64_t) *
+int start_of_slice = l2_entry_size(s) *
 (offset_to_l2_index(s, offset) - offset_to_l2_slice_index(s, offset));
 
 return qcow2_cache_get(bs, s->l2_table_cache, l2_offset + start_of_slice,
@@ -277,7 +277,7 @@ static int l2_allocate(BlockDriverState *bs, int l1_index)
 
 /* allocate a new l2 entry */
 
-l2_offset = qcow2_alloc_clusters(bs, s->l2_size * sizeof(uint64_t));
+l2_offset = qcow2_alloc_clusters(bs, s->l2_size * l2_entry_size(s));
 if (l2_offset < 0) {
 ret = l2_offset;
 goto fail;
@@ -301,7 +301,7 @@ static int l2_allocate(BlockDriverState *bs, int l1_index)
 
 /* allocate a new entry in the l2 cache */
 
-slice_size2 = s->l2_slice_size * sizeof(uint64_t);
+slice_size2 = s->l2_slice_size * l2_entry_size(s);
 n_slices = s->cluster_size / slice_size2;
 
 trace_qcow2_l2_allocate_get_empty(bs, l1_index);
@@ -365,7 +365,7 @@ fail:
 }
 s->l1_table[l1_index] = old_l2_offset;
 if (l2_offset > 0) {
-qcow2_free_clusters(bs, l2_offset, s->l2_size * sizeof(uint64_t),
+qcow2_free_clusters(bs, l2_offset, s->l2_size * l2_entry_size(s),
 QCOW2_DISCARD_ALWAYS);
 }
 return ret;
@@ -708,7 +708,7 @@ static int get_cluster_table(BlockDriverState *bs, uint64_t offset,
 
 /* Then decrease the refcount of the old table */
 if (l2_offset) {
-qcow2_free_clusters(bs, l2_offset, s->l2_size * sizeof(uint64_t),
+qcow2_free_clusters(bs, l2_offset, s->l2_size * l2_entry_size(s),
 QCOW2_DISCARD_OTHER);
 }
 
@@ -1880,7 +1880,7 @@ static int expand_zero_clusters_in_l1(BlockDriverState *bs, uint64_t *l1_table,
 int ret;
 int i, j;
 
-slice_size2 = s->l2_slice_size * sizeof(uint64_t);
+slice_size2 = s->l2_slice_size * l2_entry_size(s);
 n_slices = s->cluster_size / slice_size2;
 
 if (!is_active_l1) {
diff --git a/block/qcow2-refcount.c b/block/qcow2-refcount.c
index 14f71df7da..a2c4d36378 100644
--- a/block/qcow2-refcount.c
+++ b/block/qcow2-refcount.c
@@ -1253,7 +1253,7 @@ int qcow2_update_snapshot_refcount(BlockDriverState *bs,
 l2_slice = NULL;
 l1_table = NULL;
 l1_size2 = l1_size * sizeof(uint64_t);
-slice_size2 = s->l2_slice_size * sizeof(uint64_t);
+slice_size2 = s->l2_slice_size * l2_entry_size(s);
 n_slices = s->cluster_size / slice_size2;
 
 s->cache_discards = true;
@@ -1604,7 +1604,7 @@ static int check_refcounts_l2(BlockDriverState *bs, BdrvCheckResult *res,
 int i, l2_size, nb_csectors, ret;
 
 /* Read L2 table from disk */
-l2_size = s->l2_size * sizeof(uint64_t);
+l2_size = s->l2_size * l2_entry_size(s);
 l2_table = g_malloc(l2_size);
 
 ret = bdrv_pread(bs->file, l2_offset, l2_table, l2_size);
@@ -1679,15 +1679,16 @@ static int check_refcounts_l2(BlockDriverState *bs, BdrvCheckResult *res,
 fix & BDRV_FIX_ERRORS ? "Repairing" : "ERROR",
 offset);
 if (fix & BDRV_FIX_ERRORS) {
+int idx = i * (l2_entry_size(s) / sizeof(uint64_t));
 uint64_t l2e_offset =
-l2_offset + (uint64_t)i * sizeof(uint64_t);
+l2_offset + (uint64_t)i * l2_entry_size(s);
 int ign = active ? QCOW2_OL_ACTIVE_L2 :
QCOW2_OL_INACTIVE_L2;
 
 l2_entry = QCOW_OFLAG_ZERO;
 set_l2_entry(s, l2_table, i, l2_entry);
 ret = qcow2_pre_write_overlap_check(bs, ign,
-l2e_offset, sizeof(uint64_t), false);
+l2e_offset, l2_entry_size(s), false);
 if (ret < 0) {
 fprintf(stderr, "ERROR: Overlap check failed\n");
 res->check_errors++;
@@ -1697,7 +1698,8 @@ static int 

[RFC PATCH 04/23] qcow2: Add get_l2_entry() and set_l2_entry()

2019-10-15 Thread Alberto Garcia
The size of an L2 entry is 64 bits, but if we want to have subclusters
we need extended L2 entries. This means that we have to access L2
tables and slices differently depending on whether an image has
extended L2 entries or not.

This patch replaces all l2_slice[] accesses with calls to
get_l2_entry() and set_l2_entry().

Signed-off-by: Alberto Garcia 
---
 block/qcow2-cluster.c  | 65 ++
 block/qcow2-refcount.c | 17 +--
 block/qcow2.h  | 12 
 3 files changed, 55 insertions(+), 39 deletions(-)

diff --git a/block/qcow2-cluster.c b/block/qcow2-cluster.c
index 70b2e32f7e..b2045d51bf 100644
--- a/block/qcow2-cluster.c
+++ b/block/qcow2-cluster.c
@@ -379,12 +379,13 @@ fail:
  * cluster which may require a different handling)
  */
 static int count_contiguous_clusters(BlockDriverState *bs, int nb_clusters,
-int cluster_size, uint64_t *l2_slice, uint64_t stop_flags)
+int cluster_size, uint64_t *l2_slice, int l2_index, uint64_t stop_flags)
 {
+BDRVQcow2State *s = bs->opaque;
 int i;
 QCow2ClusterType first_cluster_type;
 uint64_t mask = stop_flags | L2E_OFFSET_MASK | QCOW_OFLAG_COMPRESSED;
-uint64_t first_entry = be64_to_cpu(l2_slice[0]);
+uint64_t first_entry = get_l2_entry(s, l2_slice, l2_index);
 uint64_t offset = first_entry & mask;
 
 first_cluster_type = qcow2_get_cluster_type(bs, first_entry);
@@ -397,7 +398,7 @@ static int count_contiguous_clusters(BlockDriverState *bs, int nb_clusters,
first_cluster_type == QCOW2_CLUSTER_ZERO_ALLOC);
 
 for (i = 0; i < nb_clusters; i++) {
-uint64_t l2_entry = be64_to_cpu(l2_slice[i]) & mask;
+uint64_t l2_entry = get_l2_entry(s, l2_slice, l2_index + i) & mask;
 if (offset + (uint64_t) i * cluster_size != l2_entry) {
 break;
 }
@@ -413,14 +414,16 @@ static int count_contiguous_clusters(BlockDriverState *bs, int nb_clusters,
 static int count_contiguous_clusters_unallocated(BlockDriverState *bs,
  int nb_clusters,
  uint64_t *l2_slice,
+ int l2_index,
  QCow2ClusterType wanted_type)
 {
+BDRVQcow2State *s = bs->opaque;
 int i;
 
 assert(wanted_type == QCOW2_CLUSTER_ZERO_PLAIN ||
wanted_type == QCOW2_CLUSTER_UNALLOCATED);
 for (i = 0; i < nb_clusters; i++) {
-uint64_t entry = be64_to_cpu(l2_slice[i]);
+uint64_t entry = get_l2_entry(s, l2_slice, l2_index + i);
 QCow2ClusterType type = qcow2_get_cluster_type(bs, entry);
 
 if (type != wanted_type) {
@@ -566,7 +569,7 @@ int qcow2_get_cluster_offset(BlockDriverState *bs, uint64_t offset,
 /* find the cluster offset for the given disk offset */
 
 l2_index = offset_to_l2_slice_index(s, offset);
-*cluster_offset = be64_to_cpu(l2_slice[l2_index]);
+*cluster_offset = get_l2_entry(s, l2_slice, l2_index);
 
 nb_clusters = size_to_clusters(s, bytes_needed);
 /* bytes_needed <= *bytes + offset_in_cluster, both of which are unsigned
@@ -601,14 +604,14 @@ int qcow2_get_cluster_offset(BlockDriverState *bs, uint64_t offset,
 case QCOW2_CLUSTER_UNALLOCATED:
 /* how many empty clusters ? */
 c = count_contiguous_clusters_unallocated(bs, nb_clusters,
-  &l2_slice[l2_index], type);
+  l2_slice, l2_index, type);
 *cluster_offset = 0;
 break;
 case QCOW2_CLUSTER_ZERO_ALLOC:
 case QCOW2_CLUSTER_NORMAL:
 /* how many allocated clusters ? */
 c = count_contiguous_clusters(bs, nb_clusters, s->cluster_size,
-  &l2_slice[l2_index], QCOW_OFLAG_ZERO);
+  l2_slice, l2_index, QCOW_OFLAG_ZERO);
 *cluster_offset &= L2E_OFFSET_MASK;
 if (offset_into_cluster(s, *cluster_offset)) {
 qcow2_signal_corruption(bs, true, -1, -1,
@@ -761,7 +764,7 @@ int qcow2_alloc_compressed_cluster_offset(BlockDriverState *bs,
 
 /* Compression can't overwrite anything. Fail if the cluster was already
  * allocated. */
-cluster_offset = be64_to_cpu(l2_slice[l2_index]);
+cluster_offset = get_l2_entry(s, l2_slice, l2_index);
 if (cluster_offset & L2E_OFFSET_MASK) {
 qcow2_cache_put(s->l2_table_cache, (void **) &l2_slice);
 return -EIO;
@@ -786,7 +789,7 @@ int qcow2_alloc_compressed_cluster_offset(BlockDriverState *bs,
 
 BLKDBG_EVENT(bs->file, BLKDBG_L2_UPDATE_COMPRESSED);
 qcow2_cache_entry_mark_dirty(s->l2_table_cache, l2_slice);
-l2_slice[l2_index] = cpu_to_be64(cluster_offset);
+set_l2_entry(s, l2_slice, l2_index, cluster_offset);
 qcow2_cache_put(s->l2_table_cache, (void **) &l2_slice);
 
 *host_offset = cluster_offset & 

[RFC PATCH 16/23] qcow2: Add subcluster support to discard_in_l2_slice()

2019-10-15 Thread Alberto Garcia
Setting the QCOW_OFLAG_ZERO bit of the L2 entry is forbidden if an
image has subclusters. Instead, the individual 'all zeroes' bits must
be used.
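
The resulting rule can be summarized as a small decision function. This
is only a sketch: L2Update and discarded_l2_value() are hypothetical
names, and the value used for QCOW_L2_BITMAP_ALL_ZEROES (upper 32 bits
set) is an assumption about the bitmap layout, not something defined in
this patch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define QCOW_OFLAG_ZERO           (1ULL << 0)
/* Assumption: the 'all zeroes' bits occupy the upper half of the bitmap. */
#define QCOW_L2_BITMAP_ALL_ZEROES 0xFFFFFFFF00000000ULL

/* Hypothetical pair of values written back for one discarded cluster. */
typedef struct {
    uint64_t entry;
    uint64_t bitmap;   /* only meaningful for images with subclusters */
} L2Update;

static L2Update discarded_l2_value(bool has_subclusters, int qcow_version,
                                   bool full_discard)
{
    L2Update u = { 0, 0 };

    assert(qcow_version >= 2);
    if (has_subclusters) {
        /* QCOW_OFLAG_ZERO is forbidden here; use the bitmap instead. */
        u.bitmap = full_discard ? 0 : QCOW_L2_BITMAP_ALL_ZEROES;
    } else if (!full_discard && qcow_version >= 3) {
        u.entry = QCOW_OFLAG_ZERO;
    }
    return u;
}
```

In every case where the image has subclusters the entry itself is
written as 0, which is what keeps the forbidden flag from being set.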

Signed-off-by: Alberto Garcia 
---
 block/qcow2-cluster.c | 6 +-
 1 file changed, 5 insertions(+), 1 deletion(-)

diff --git a/block/qcow2-cluster.c b/block/qcow2-cluster.c
index c554b1a88c..bf32447d18 100644
--- a/block/qcow2-cluster.c
+++ b/block/qcow2-cluster.c
@@ -1769,7 +1769,11 @@ static int discard_in_l2_slice(BlockDriverState *bs, uint64_t offset,
 
 /* First remove L2 entries */
 qcow2_cache_entry_mark_dirty(s->l2_table_cache, l2_slice);
-if (!full_discard && s->qcow_version >= 3) {
+if (has_subclusters(s)) {
+set_l2_entry(s, l2_slice, l2_index + i, 0);
+set_l2_bitmap(s, l2_slice, l2_index + i,
+  full_discard ? 0 : QCOW_L2_BITMAP_ALL_ZEROES);
+} else if (!full_discard && s->qcow_version >= 3) {
 set_l2_entry(s, l2_slice, l2_index + i, QCOW_OFLAG_ZERO);
 } else {
 set_l2_entry(s, l2_slice, l2_index + i, 0);
-- 
2.20.1




[PATCH v2 06/21] iotests: Drop compat=1.1 in 050

2019-10-15 Thread Max Reitz
IMGOPTS can never be empty for qcow2, because the check script adds
compat=1.1 unless the user specified a compat option themselves.
Thus, this block does not do anything and can be dropped.

Signed-off-by: Max Reitz 
Reviewed-by: Maxim Levitsky 
---
 tests/qemu-iotests/050 | 4 
 1 file changed, 4 deletions(-)

diff --git a/tests/qemu-iotests/050 b/tests/qemu-iotests/050
index 211fc00797..272ecab195 100755
--- a/tests/qemu-iotests/050
+++ b/tests/qemu-iotests/050
@@ -41,10 +41,6 @@ trap "_cleanup; exit \$status" 0 1 2 3 15
 _supported_fmt qcow2 qed
 _supported_proto file
 
-if test "$IMGFMT" = qcow2 && test $IMGOPTS = ""; then
-  IMGOPTS=compat=1.1
-fi
-
 echo
 echo "== Creating images =="
 
-- 
2.21.0




[PATCH v2 18/21] iotests: Make 137 work with data_file

2019-10-15 Thread Max Reitz
When using an external data file, there are no refcounts for data
clusters.  We thus have to adjust the corruption test in this patch to
not be based around a data cluster allocation, but the L2 table
allocation (L2 tables are still refcounted with external data files).

Furthermore, we should not print qcow2.py's list of incompatible
features because it differs depending on whether there is an external
data file or not.

With those two changes, the test will work both with and without an
external data file (once that option works with the iotests at all).

Signed-off-by: Max Reitz 
---
 tests/qemu-iotests/137 | 15 +++
 tests/qemu-iotests/137.out |  6 ++
 2 files changed, 13 insertions(+), 8 deletions(-)

diff --git a/tests/qemu-iotests/137 b/tests/qemu-iotests/137
index 6cf2997577..7ae86892f7 100755
--- a/tests/qemu-iotests/137
+++ b/tests/qemu-iotests/137
@@ -138,14 +138,21 @@ $QEMU_IO \
 "$TEST_IMG" 2>&1 | _filter_qemu_io
 
 # The dirty bit must not be set
-$PYTHON qcow2.py "$TEST_IMG" dump-header | grep incompatible_features
+# (Filter the external data file bit)
+if $PYTHON qcow2.py "$TEST_IMG" dump-header | grep incompatible_features \
+| grep -q '\<0\>'
+then
+echo 'ERROR: Dirty bit set'
+else
+echo 'OK: Dirty bit not set'
+fi
 
 # Similarly we can test whether corruption detection has been enabled:
-# Create L1/L2, overwrite first entry in refcount block, allocate something.
+# Create L1, overwrite refcounts, force allocation of L2 by writing
+# data.
 # Disabling the checks should fail, so the corruption must be detected.
 _make_test_img 64M
-$QEMU_IO -c "write 0 64k" "$TEST_IMG" | _filter_qemu_io
-poke_file "$TEST_IMG" "$((0x2))" "\x00\x00"
+poke_file "$TEST_IMG" "$((0x2))" "\x00\x00\x00\x00\x00\x00\x00\x00"
 $QEMU_IO \
 -c "reopen -o overlap-check=none,lazy-refcounts=42" \
 -c "write 64k 64k" \
diff --git a/tests/qemu-iotests/137.out b/tests/qemu-iotests/137.out
index bd4523a853..86377c80cd 100644
--- a/tests/qemu-iotests/137.out
+++ b/tests/qemu-iotests/137.out
@@ -36,11 +36,9 @@ qemu-io: Unsupported value 'blubb' for qcow2 option 'overlap-check'. Allowed are
 wrote 512/512 bytes at offset 0
 512 bytes, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
 ./common.rc: Killed  ( VALGRIND_QEMU="${VALGRIND_QEMU_IO}" _qemu_proc_exec "${VALGRIND_LOGFILE}" "$QEMU_IO_PROG" $QEMU_IO_ARGS "$@" )
-incompatible_features []
+OK: Dirty bit not set
 Formatting 'TEST_DIR/t.IMGFMT', fmt=IMGFMT size=67108864
-wrote 65536/65536 bytes at offset 0
-64 KiB, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
 qemu-io: Parameter 'lazy-refcounts' expects 'on' or 'off'
-qcow2: Marking image as corrupt: Preventing invalid write on metadata (overlaps with qcow2_header); further corruption events will be suppressed
+qcow2: Marking image as corrupt: Preventing invalid allocation of L2 table at offset 0; further corruption events will be suppressed
 write failed: Input/output error
 *** done
-- 
2.21.0




[PATCH v2 21/21] iotests: Allow check -o data_file

2019-10-15 Thread Max Reitz
The problem with allowing the data_file option is that you want to use a
different data file per image used in the test.  Therefore, we need to
allow patterns like -o data_file='$TEST_IMG.data_file'.

Then, we need to filter it out from qemu-img map, qemu-img create, and
remove the data file in _rm_test_img.

Signed-off-by: Max Reitz 
---
 tests/qemu-iotests/common.filter | 23 +--
 tests/qemu-iotests/common.rc | 22 +-
 2 files changed, 42 insertions(+), 3 deletions(-)

diff --git a/tests/qemu-iotests/common.filter b/tests/qemu-iotests/common.filter
index 63bc6f6f26..9dd05689d1 100644
--- a/tests/qemu-iotests/common.filter
+++ b/tests/qemu-iotests/common.filter
@@ -121,7 +121,13 @@ _filter_actual_image_size()
 # replace driver-specific options in the "Formatting..." line
 _filter_img_create()
 {
-$SED -e "s#$REMOTE_TEST_DIR#TEST_DIR#g" \
+data_file_filter=()
+if data_file=$(_get_data_file "$TEST_IMG"); then
+data_file_filter=(-e "s# data_file=$data_file##")
+fi
+
+$SED "${data_file_filter[@]}" \
+-e "s#$REMOTE_TEST_DIR#TEST_DIR#g" \
 -e "s#$IMGPROTO:$TEST_DIR#TEST_DIR#g" \
 -e "s#$TEST_DIR#TEST_DIR#g" \
 -e "s#$IMGFMT#IMGFMT#g" \
@@ -204,9 +210,22 @@ _filter_img_info()
 # human and json output
 _filter_qemu_img_map()
 {
+# Assuming the data_file value in $IMGOPTS contains a '$TEST_IMG',
+# create a filter that replaces the data file name by $TEST_IMG.
+# Example:
+#   In $IMGOPTS: 'data_file=$TEST_IMG.data_file'
+#   Then data_file_pattern == '\(.*\).data_file'
+#   And  data_file_filter  == -e 's#\(.*\).data_file#\1#
+data_file_filter=()
+if data_file_pattern=$(_get_data_file '\\(.*\\)'); then
+data_file_filter=(-e "s#$data_file_pattern#\\1#")
+fi
+
 $SED -e 's/\([0-9a-fx]* *[0-9a-fx]* *\)[0-9a-fx]* */\1/g' \
 -e 's/"offset": [0-9]\+/"offset": OFFSET/g' \
--e 's/Mapped to *//' | _filter_testdir | _filter_imgfmt
+-e 's/Mapped to *//' \
+"${data_file_filter[@]}" \
+| _filter_testdir | _filter_imgfmt
 }
 
 _filter_nbd()
diff --git a/tests/qemu-iotests/common.rc b/tests/qemu-iotests/common.rc
index f3784077de..bed789a691 100644
--- a/tests/qemu-iotests/common.rc
+++ b/tests/qemu-iotests/common.rc
@@ -277,6 +277,20 @@ _stop_nbd_server()
 fi
 }
 
+# Gets the data_file value from IMGOPTS and replaces the '$TEST_IMG'
+# pattern by '$1'
+# Caution: The replacement is done with sed, so $1 must be escaped
+#  properly.  (The delimiter is '#'.)
+_get_data_file()
+{
+if ! echo "$IMGOPTS" | grep -q 'data_file='; then
+return 1
+fi
+
+echo "$IMGOPTS" | sed -e 's/.*data_file=\([^,]*\).*/\1/' \
+| sed -e "s#\\\$TEST_IMG#$1#"
+}
+
 _make_test_img()
 {
 # extra qemu-img options can be added by tests
@@ -297,7 +311,8 @@ _make_test_img()
 fi
 
 if [ -n "$IMGOPTS" ]; then
-optstr=$(_optstr_add "$optstr" "$IMGOPTS")
+imgopts_expanded=$(echo "$IMGOPTS" | sed -e "s#\\\$TEST_IMG#$img_name#")
+optstr=$(_optstr_add "$optstr" "$imgopts_expanded")
 fi
 if [ -n "$IMGKEYSECRET" ]; then
 object_options="--object secret,id=keysec0,data=$IMGKEYSECRET"
@@ -376,6 +391,11 @@ _rm_test_img()
 # Remove all the extents for vmdk
 "$QEMU_IMG" info "$img" 2>/dev/null | grep 'filename:' | cut -f 2 -d: \
 | xargs -I {} rm -f "{}"
+elif [ "$IMGFMT" = "qcow2" ]; then
+# Remove external data file
+if data_file=$(_get_data_file "$img"); then
+rm -f "$data_file"
+fi
 fi
 rm -f "$img"
 }
-- 
2.21.0




[RFC PATCH 10/23] qcow2: Update get/set_l2_entry() and add get/set_l2_bitmap()

2019-10-15 Thread Alberto Garcia
Extended L2 entries are 128-bit wide: 64 bits for the entry itself and
64 bits for the subcluster allocation bitmap.

To support them correctly, get/set_l2_entry() need to take the entry
width into account when calculating the position of an entry within
the L2 slice.
This patch also adds the get/set_l2_bitmap() functions that are used
to access the bitmaps. For convenience, these functions are no-ops
when used in traditional qcow2 images.
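
A self-contained sketch of the indexing scheme described above. It is
not the patch code: Qcow2Sketch is a hypothetical stand-in for
BDRVQcow2State, and load_be64()/store_be64() are portable replacements
for be64_to_cpu()/cpu_to_be64() so the example builds outside QEMU.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for the relevant part of BDRVQcow2State. */
typedef struct {
    bool extended_l2;   /* true if L2 entries are 128 bits wide */
} Qcow2Sketch;

/* Read a big-endian 64-bit value regardless of host byte order. */
static uint64_t load_be64(const uint64_t *p)
{
    const unsigned char *b = (const unsigned char *)p;
    uint64_t v = 0;
    for (int i = 0; i < 8; i++) {
        v = (v << 8) | b[i];
    }
    return v;
}

/* Write a big-endian 64-bit value regardless of host byte order. */
static void store_be64(uint64_t *p, uint64_t v)
{
    unsigned char *b = (unsigned char *)p;
    for (int i = 7; i >= 0; i--) {
        b[i] = (unsigned char)(v & 0xff);
        v >>= 8;
    }
}

/* Index of the first 64-bit word of entry 'idx' within the slice. */
static size_t first_word(const Qcow2Sketch *s, int idx)
{
    assert(idx >= 0);
    return (size_t)idx * (s->extended_l2 ? 2 : 1);
}

static uint64_t get_l2_entry(const Qcow2Sketch *s, const uint64_t *slice,
                             int idx)
{
    return load_be64(&slice[first_word(s, idx)]);
}

static void set_l2_entry(const Qcow2Sketch *s, uint64_t *slice, int idx,
                         uint64_t entry)
{
    store_be64(&slice[first_word(s, idx)], entry);
}

/* The bitmap is the second word; traditional images have none. */
static uint64_t get_l2_bitmap(const Qcow2Sketch *s, const uint64_t *slice,
                              int idx)
{
    return s->extended_l2 ? load_be64(&slice[first_word(s, idx) + 1]) : 0;
}

/* No-op for traditional images, as described in the commit message. */
static void set_l2_bitmap(const Qcow2Sketch *s, uint64_t *slice, int idx,
                          uint64_t bitmap)
{
    if (s->extended_l2) {
        store_be64(&slice[first_word(s, idx) + 1], bitmap);
    }
}
```

Keeping the scaling inside the accessors means callers can continue to
pass a plain entry index, whichever entry width the image uses.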

Signed-off-by: Alberto Garcia 
---
 block/qcow2.h | 22 ++
 1 file changed, 22 insertions(+)

diff --git a/block/qcow2.h b/block/qcow2.h
index 9a7648af47..d9fe883fe0 100644
--- a/block/qcow2.h
+++ b/block/qcow2.h
@@ -504,15 +504,37 @@ static inline size_t l2_entry_size(BDRVQcow2State *s)
 static inline uint64_t get_l2_entry(BDRVQcow2State *s, uint64_t *l2_slice,
 int idx)
 {
+idx *= l2_entry_size(s) / sizeof(uint64_t);
 return be64_to_cpu(l2_slice[idx]);
 }
 
+static inline uint64_t get_l2_bitmap(BDRVQcow2State *s, uint64_t *l2_slice,
+ int idx)
+{
+if (has_subclusters(s)) {
+idx *= l2_entry_size(s) / sizeof(uint64_t);
+return be64_to_cpu(l2_slice[idx + 1]);
+} else {
+return 0;
+}
+}
+
 static inline void set_l2_entry(BDRVQcow2State *s, uint64_t *l2_slice,
 int idx, uint64_t entry)
 {
+idx *= l2_entry_size(s) / sizeof(uint64_t);
 l2_slice[idx] = cpu_to_be64(entry);
 }
 
+static inline void set_l2_bitmap(BDRVQcow2State *s, uint64_t *l2_slice,
+ int idx, uint64_t bitmap)
+{
+if (has_subclusters(s)) {
+idx *= l2_entry_size(s) / sizeof(uint64_t);
+l2_slice[idx + 1] = cpu_to_be64(bitmap);
+}
+}
+
 static inline bool has_data_file(BlockDriverState *bs)
 {
 BDRVQcow2State *s = bs->opaque;
-- 
2.20.1




[RFC PATCH 03/23] qcow2: Process QCOW2_CLUSTER_ZERO_ALLOC clusters in handle_copied()

2019-10-15 Thread Alberto Garcia
When writing to a qcow2 file there are two functions that take a
virtual offset and return a host offset, possibly allocating new
clusters if necessary:

   - handle_copied() looks for normal data clusters that are already
 allocated and have a reference count of 1. In those clusters we
 can simply write the data and there is no need to perform any
 copy-on-write.

   - handle_alloc() looks for clusters that do need copy-on-write,
 either because they haven't been allocated yet, because their
 reference count is != 1 or because they are ZERO_ALLOC clusters.

The ZERO_ALLOC case is a bit special because those clusters are
already allocated and could just as well be dealt with in
handle_copied() (as long as copy-on-write is performed when required).

In fact, there is extra code specifically for them in handle_alloc()
that tries to reuse the existing allocation if possible and frees them
otherwise.

This patch changes the handling of ZERO_ALLOC clusters so the
semantics of these two functions are now like this:

   - handle_copied() looks for clusters that are already allocated and
 which we can overwrite (NORMAL and ZERO_ALLOC clusters with a
 reference count of 1).

   - handle_alloc() looks for clusters for which we need a new
 allocation (all other cases).

One important difference after this change is that clusters found in
handle_copied() may now require copy-on-write, but this will be
necessary anyway once we add support for subclusters.
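
The new split between the two functions can be sketched as a simple
classification. ClusterType and can_overwrite_in_place() are
hypothetical names for this illustration; in the driver, a reference
count of 1 is what the QCOW_OFLAG_COPIED bit checked in the diff below
indicates.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical mirror of the cluster types named in this patch. */
typedef enum {
    CLUSTER_UNALLOCATED,
    CLUSTER_ZERO_PLAIN,
    CLUSTER_ZERO_ALLOC,
    CLUSTER_NORMAL,
    CLUSTER_COMPRESSED,
} ClusterType;

/*
 * true:  the cluster is already allocated and can be overwritten, so
 *        it belongs to handle_copied().
 * false: a new allocation is needed, so it belongs to handle_alloc().
 */
static bool can_overwrite_in_place(ClusterType type, bool refcount_is_one)
{
    assert(type >= CLUSTER_UNALLOCATED && type <= CLUSTER_COMPRESSED);
    switch (type) {
    case CLUSTER_NORMAL:
    case CLUSTER_ZERO_ALLOC:
        return refcount_is_one;
    default:
        return false;
    }
}
```

Note that "can be overwritten" no longer implies "no copy-on-write": a
ZERO_ALLOC cluster reuses its allocation but still needs the parts
outside the request to be filled in.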

Signed-off-by: Alberto Garcia 
---
 block/qcow2-cluster.c | 177 +++---
 1 file changed, 96 insertions(+), 81 deletions(-)

diff --git a/block/qcow2-cluster.c b/block/qcow2-cluster.c
index f462e169c0..70b2e32f7e 100644
--- a/block/qcow2-cluster.c
+++ b/block/qcow2-cluster.c
@@ -1021,7 +1021,8 @@ void qcow2_alloc_cluster_abort(BlockDriverState *bs, QCowL2Meta *m)
 
 /*
  * For a given write request, create a new QCowL2Meta structure and
- * add it to @m.
+ * add it to @m. If the write request does not need copy-on-write or
+ * changes to the L2 metadata then this function does nothing.
  *
  * @host_offset points to the beginning of the first cluster.
  *
@@ -1034,15 +1035,51 @@ void qcow2_alloc_cluster_abort(BlockDriverState *bs, QCowL2Meta *m)
  */
 static void calculate_l2_meta(BlockDriverState *bs, uint64_t host_offset,
   uint64_t guest_offset, uint64_t bytes,
-  QCowL2Meta **m, bool keep_old)
+  uint64_t *l2_slice, QCowL2Meta **m, bool keep_old)
 {
 BDRVQcow2State *s = bs->opaque;
-unsigned cow_start_from = 0;
+int l2_index = offset_to_l2_slice_index(s, guest_offset);
+uint64_t l2_entry;
+unsigned cow_start_from, cow_end_to;
 unsigned cow_start_to = offset_into_cluster(s, guest_offset);
 unsigned cow_end_from = cow_start_to + bytes;
-unsigned cow_end_to = ROUND_UP(cow_end_from, s->cluster_size);
 unsigned nb_clusters = size_to_clusters(s, cow_end_from);
 QCowL2Meta *old_m = *m;
+QCow2ClusterType type;
+
+/* Return if there's no COW (all clusters are normal and we keep them) */
+if (keep_old) {
+int i;
+for (i = 0; i < nb_clusters; i++) {
+l2_entry = be64_to_cpu(l2_slice[l2_index + i]);
+if (qcow2_get_cluster_type(bs, l2_entry) != QCOW2_CLUSTER_NORMAL) {
+break;
+}
+}
+if (i == nb_clusters) {
+return;
+}
+}
+
+/* Get the L2 entry from the first cluster */
+l2_entry = be64_to_cpu(l2_slice[l2_index]);
+type = qcow2_get_cluster_type(bs, l2_entry);
+
+if (type == QCOW2_CLUSTER_NORMAL && keep_old) {
+cow_start_from = cow_start_to;
+} else {
+cow_start_from = 0;
+}
+
+/* Get the L2 entry from the last cluster */
+l2_entry = be64_to_cpu(l2_slice[l2_index + nb_clusters - 1]);
+type = qcow2_get_cluster_type(bs, l2_entry);
+
+if (type == QCOW2_CLUSTER_NORMAL && keep_old) {
+cow_end_to = cow_end_from;
+} else {
+cow_end_to = ROUND_UP(cow_end_from, s->cluster_size);
+}
 
 *m = g_malloc0(sizeof(**m));
 **m = (QCowL2Meta) {
@@ -1068,18 +1105,18 @@ static void calculate_l2_meta(BlockDriverState *bs, uint64_t host_offset,
 QLIST_INSERT_HEAD(&s->cluster_allocs, *m, next_in_flight);
 }
 
-/* Returns true if writing to a cluster requires COW */
+/* Returns true if the cluster is unallocated or has refcount > 1 */
 static bool cluster_needs_cow(BlockDriverState *bs, uint64_t l2_entry)
 {
 switch (qcow2_get_cluster_type(bs, l2_entry)) {
 case QCOW2_CLUSTER_NORMAL:
+case QCOW2_CLUSTER_ZERO_ALLOC:
 if (l2_entry & QCOW_OFLAG_COPIED) {
 return false;
 }
 case QCOW2_CLUSTER_UNALLOCATED:
 case QCOW2_CLUSTER_COMPRESSED:
 case QCOW2_CLUSTER_ZERO_PLAIN:
-case QCOW2_CLUSTER_ZERO_ALLOC:
 return true;
 default:
 

[RFC PATCH 01/23] qcow2: Add calculate_l2_meta()

2019-10-15 Thread Alberto Garcia
handle_alloc() creates a QCowL2Meta structure in order to update the
image metadata and perform the necessary copy-on-write operations.

This patch moves that code to a separate function so it can be used
from other places.

Signed-off-by: Alberto Garcia 
---
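For readers following the arithmetic, here is a minimal standalone sketch of
how the two copy-on-write regions are derived from a write request. It only
mirrors the calculation in calculate_l2_meta() as introduced by this patch;
the CowRegions struct and the cow_regions() helper are hypothetical names
used for illustration, and the cluster size is assumed to be a power of two.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for the COW fields stored in QCowL2Meta. */
typedef struct CowRegions {
    unsigned start_offset; /* start of the head COW region (always 0 here) */
    unsigned start_bytes;  /* bytes to copy before the written data */
    unsigned end_offset;   /* first byte after the written data */
    unsigned end_bytes;    /* bytes to copy up to the next cluster boundary */
    unsigned nb_clusters;  /* number of clusters touched by the request */
} CowRegions;

/*
 * Offsets in the result are relative to the start of the first cluster,
 * as in the patch. cluster_size must be a power of two.
 */
static CowRegions cow_regions(uint64_t guest_offset, unsigned bytes,
                              unsigned cluster_size)
{
    assert(cluster_size != 0 && (cluster_size & (cluster_size - 1)) == 0);
    assert(bytes != 0);

    /* offset_into_cluster(s, guest_offset) */
    unsigned cow_start_to = (unsigned)(guest_offset & (cluster_size - 1));
    unsigned cow_end_from = cow_start_to + bytes;
    /* ROUND_UP(cow_end_from, cluster_size) */
    unsigned cow_end_to =
        (cow_end_from + cluster_size - 1) & ~(cluster_size - 1);

    CowRegions r = {
        .start_offset = 0,
        .start_bytes  = cow_start_to,
        .end_offset   = cow_end_from,
        .end_bytes    = cow_end_to - cow_end_from,
        /* size_to_clusters(s, cow_end_from) */
        .nb_clusters  = cow_end_to / cluster_size,
    };
    return r;
}
```

With 64 KiB clusters, an 8192-byte write at guest offset 69632 gives a head
region of 4096 bytes and a tail region of 53248 bytes starting at offset
12288, all within a single cluster.
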
 block/qcow2-cluster.c | 76 +--
 1 file changed, 52 insertions(+), 24 deletions(-)

diff --git a/block/qcow2-cluster.c b/block/qcow2-cluster.c
index 8d5fa1539c..fe2523ed66 100644
--- a/block/qcow2-cluster.c
+++ b/block/qcow2-cluster.c
@@ -1019,6 +1019,55 @@ void qcow2_alloc_cluster_abort(BlockDriverState *bs, QCowL2Meta *m)
 QCOW2_DISCARD_NEVER);
 }
 
+/*
+ * For a given write request, create a new QCowL2Meta structure and
+ * add it to @m.
+ *
+ * @host_offset points to the beginning of the first cluster.
+ *
+ * @guest_offset and @bytes indicate the offset and length of the
+ * request.
+ *
+ * If @keep_old is true it means that the clusters were already
+ * allocated and will be overwritten. If false then the clusters are
+ * new and we have to decrease the reference count of the old ones.
+ */
+static void calculate_l2_meta(BlockDriverState *bs, uint64_t host_offset,
+  uint64_t guest_offset, uint64_t bytes,
+  QCowL2Meta **m, bool keep_old)
+{
+BDRVQcow2State *s = bs->opaque;
+unsigned cow_start_from = 0;
+unsigned cow_start_to = offset_into_cluster(s, guest_offset);
+unsigned cow_end_from = cow_start_to + bytes;
+unsigned cow_end_to = ROUND_UP(cow_end_from, s->cluster_size);
+unsigned nb_clusters = size_to_clusters(s, cow_end_from);
+QCowL2Meta *old_m = *m;
+
+*m = g_malloc0(sizeof(**m));
+**m = (QCowL2Meta) {
+.next   = old_m,
+
+.alloc_offset   = host_offset,
+.offset = start_of_cluster(s, guest_offset),
+.nb_clusters= nb_clusters,
+
+.keep_old_clusters = keep_old,
+
+.cow_start = {
+.offset = cow_start_from,
+.nb_bytes   = cow_start_to - cow_start_from,
+},
+.cow_end = {
+.offset = cow_end_from,
+.nb_bytes   = cow_end_to - cow_end_from,
+},
+};
+
+qemu_co_queue_init(&(*m)->dependent_requests);
+QLIST_INSERT_HEAD(&s->cluster_allocs, *m, next_in_flight);
+}
+
 /*
  * Returns the number of contiguous clusters that can be used for an allocating
  * write, but require COW to be performed (this includes yet unallocated space,
@@ -1414,35 +1463,14 @@ static int handle_alloc(BlockDriverState *bs, uint64_t guest_offset,
 uint64_t requested_bytes = *bytes + offset_into_cluster(s, guest_offset);
 int avail_bytes = MIN(INT_MAX, nb_clusters << s->cluster_bits);
 int nb_bytes = MIN(requested_bytes, avail_bytes);
-QCowL2Meta *old_m = *m;
-
-*m = g_malloc0(sizeof(**m));
-
-**m = (QCowL2Meta) {
-.next   = old_m,
-
-.alloc_offset   = alloc_cluster_offset,
-.offset = start_of_cluster(s, guest_offset),
-.nb_clusters= nb_clusters,
-
-.keep_old_clusters  = keep_old_clusters,
-
-.cow_start = {
-.offset = 0,
-.nb_bytes   = offset_into_cluster(s, guest_offset),
-},
-.cow_end = {
-.offset = nb_bytes,
-.nb_bytes   = avail_bytes - nb_bytes,
-},
-};
-qemu_co_queue_init(&(*m)->dependent_requests);
-QLIST_INSERT_HEAD(&s->cluster_allocs, *m, next_in_flight);
 
 *host_offset = alloc_cluster_offset + offset_into_cluster(s, guest_offset);
 *bytes = MIN(*bytes, nb_bytes - offset_into_cluster(s, guest_offset));
 assert(*bytes != 0);
 
+calculate_l2_meta(bs, alloc_cluster_offset, guest_offset, *bytes,
+  m, keep_old_clusters);
+
 return 1;
 
 fail:
-- 
2.20.1




Re: [PULL 0/2] Tracing patches

2019-10-15 Thread Peter Maydell
On Tue, 15 Oct 2019 at 16:38, Philippe Mathieu-Daudé  wrote:
>
> On 10/15/19 2:24 PM, Peter Maydell wrote:
> > On Mon, 14 Oct 2019 at 09:57, Stefan Hajnoczi  wrote:
> >>
> >> The following changes since commit 
> >> 98b2e3c9ab3abfe476a2b02f8f51813edb90e72d:
> >>
> >>Merge remote-tracking branch 'remotes/stefanha/tags/block-pull-request' 
> >> into staging (2019-10-08 16:08:35 +0100)
> >>
> >> are available in the Git repository at:
> >>
> >>https://github.com/stefanha/qemu.git tags/tracing-pull-request
> >>
> >> for you to fetch changes up to a1f4fc951a277c49a25418cafb028ec5529707fa:
> >>
> >>trace: avoid "is" with a literal Python 3.8 warnings (2019-10-14 
> >> 09:54:46 +0100)
> >>
> >> 
> >> Pull request
> >>
> >> 
> >>
> >> Stefan Hajnoczi (2):
> >>trace: add --group=all to tracing.txt
> >>trace: avoid "is" with a literal Python 3.8 warnings
> >>
> >
> >
> > Applied, thanks.
>
> Buh, v2 missed :(

Oops. I don't necessarily notice updated pullreq versions unless
somebody follows up to the v1 cover letter to say the pull is out of date.

thanks
-- PMM



[PATCH v2 12/21] iotests: Drop IMGOPTS use in 267

2019-10-15 Thread Max Reitz
Overwriting IMGOPTS means ignoring all user-supplied options, which is
not what we want.  Replace the current IMGOPTS use by a new BACKING_FILE
variable.

Signed-off-by: Max Reitz 
---
 tests/qemu-iotests/267 | 12 
 1 file changed, 8 insertions(+), 4 deletions(-)

diff --git a/tests/qemu-iotests/267 b/tests/qemu-iotests/267
index d37a67c012..eda45449d4 100755
--- a/tests/qemu-iotests/267
+++ b/tests/qemu-iotests/267
@@ -68,7 +68,11 @@ size=128M
 
 run_test()
 {
-_make_test_img $size
+if [ -n "$BACKING_FILE" ]; then
+_make_test_img -b "$BACKING_FILE" $size
+else
+_make_test_img $size
+fi
 printf "savevm snap0\ninfo snapshots\nloadvm snap0\n" | run_qemu "$@" | _filter_date
 }
 
@@ -119,12 +123,12 @@ echo
 
 TEST_IMG="$TEST_IMG.base" _make_test_img $size
 
-IMGOPTS="backing_file=$TEST_IMG.base" \
+BACKING_FILE="$TEST_IMG.base" \
 run_test -blockdev driver=file,filename="$TEST_IMG.base",node-name=backing-file \
  -blockdev driver=file,filename="$TEST_IMG",node-name=file \
  -blockdev driver=$IMGFMT,file=file,backing=backing-file,node-name=fmt
 
-IMGOPTS="backing_file=$TEST_IMG.base" \
+BACKING_FILE="$TEST_IMG.base" \
 run_test -blockdev driver=file,filename="$TEST_IMG.base",node-name=backing-file \
  -blockdev driver=$IMGFMT,file=backing-file,node-name=backing-fmt \
  -blockdev driver=file,filename="$TEST_IMG",node-name=file \
@@ -141,7 +145,7 @@ echo
 echo "=== -blockdev with NBD server on the backing file ==="
 echo
 
-IMGOPTS="backing_file=$TEST_IMG.base" _make_test_img $size
+_make_test_img -b "$TEST_IMG.base" $size
 cat <

[PATCH v2 19/21] iotests: Make 198 work with data_file

2019-10-15 Thread Max Reitz
We do not care about the json:{} filenames here, so we can just filter
them out and thus make the test work both with and without external data
files.

Signed-off-by: Max Reitz 
---
 tests/qemu-iotests/198 | 6 --
 tests/qemu-iotests/198.out | 4 ++--
 2 files changed, 6 insertions(+), 4 deletions(-)

diff --git a/tests/qemu-iotests/198 b/tests/qemu-iotests/198
index c8f824cfae..fb0d5a29d3 100755
--- a/tests/qemu-iotests/198
+++ b/tests/qemu-iotests/198
@@ -92,13 +92,15 @@ echo
 echo "== checking image base =="
 $QEMU_IMG info --image-opts $IMGSPECBASE | _filter_img_info --format-specific \
 | sed -e "/^disk size:/ D" -e '/refcount bits:/ D' -e '/compat:/ D' \
-  -e '/lazy refcounts:/ D' -e '/corrupt:/ D'
+  -e '/lazy refcounts:/ D' -e '/corrupt:/ D' -e '/^\s*data file/ D' \
+| _filter_json_filename
 
 echo
 echo "== checking image layer =="
 $QEMU_IMG info --image-opts $IMGSPECLAYER | _filter_img_info --format-specific \
 | sed -e "/^disk size:/ D" -e '/refcount bits:/ D' -e '/compat:/ D' \
-  -e '/lazy refcounts:/ D' -e '/corrupt:/ D'
+  -e '/lazy refcounts:/ D' -e '/corrupt:/ D' -e '/^\s*data file/ D' \
+| _filter_json_filename
 
 
 # success, all done
diff --git a/tests/qemu-iotests/198.out b/tests/qemu-iotests/198.out
index e86b175e39..831ce3a289 100644
--- a/tests/qemu-iotests/198.out
+++ b/tests/qemu-iotests/198.out
@@ -32,7 +32,7 @@ read 16777216/16777216 bytes at offset 0
 16 MiB, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
 
 == checking image base ==
-image: json:{"encrypt.key-secret": "sec0", "driver": "IMGFMT", "file": {"driver": "file", "filename": "TEST_DIR/t.IMGFMT.base"}}
+image: json:{ /* filtered */ }
 file format: IMGFMT
 virtual size: 16 MiB (16777216 bytes)
 Format specific information:
@@ -74,7 +74,7 @@ Format specific information:
 master key iters: 1024
 
 == checking image layer ==
-image: json:{"encrypt.key-secret": "sec1", "driver": "IMGFMT", "file": {"driver": "file", "filename": "TEST_DIR/t.IMGFMT"}}
+image: json:{ /* filtered */ }
 file format: IMGFMT
 virtual size: 16 MiB (16777216 bytes)
 backing file: TEST_DIR/t.IMGFMT.base
-- 
2.21.0




[PATCH v2 17/21] iotests: Make 110 work with data_file

2019-10-15 Thread Max Reitz
The only difference is that the json:{} filename of the image looks
different.  We do not actually care about that filename in this test; we
only care (1) that there is a json:{} filename, and (2) whether the
backing filename can be constructed.

So just filter out the json:{} data, thus making this test pass both
with and without data_file.

Signed-off-by: Max Reitz 
---
 tests/qemu-iotests/110 | 7 +--
 tests/qemu-iotests/110.out | 4 ++--
 2 files changed, 7 insertions(+), 4 deletions(-)

diff --git a/tests/qemu-iotests/110 b/tests/qemu-iotests/110
index f78df0e6e1..139c02c2cf 100755
--- a/tests/qemu-iotests/110
+++ b/tests/qemu-iotests/110
@@ -67,6 +67,7 @@ echo
 # Across blkdebug without a config file, you cannot reconstruct filenames, so
 # qemu is incapable of knowing the directory of the top image from the filename
 # alone. However, using bdrv_dirname(), it should still work.
+# (Filter out the json:{} filename so this test works with external data files)
 TEST_IMG="json:{
 'driver': '$IMGFMT',
 'file': {
@@ -82,7 +83,8 @@ TEST_IMG="json:{
 }
 ]
 }
-}" _img_info | _filter_img_info | grep -v 'backing file format'
+}" _img_info | _filter_img_info | grep -v 'backing file format' \
+| _filter_json_filename
 
 echo
 echo '=== Backing name is always relative to the backed image ==='
@@ -114,7 +116,8 @@ TEST_IMG="json:{
 }
 ]
 }
-}" _img_info | _filter_img_info | grep -v 'backing file format'
+}" _img_info | _filter_img_info | grep -v 'backing file format' \
+| _filter_json_filename
 
 
 # success, all done
diff --git a/tests/qemu-iotests/110.out b/tests/qemu-iotests/110.out
index f60b26390e..f835553a99 100644
--- a/tests/qemu-iotests/110.out
+++ b/tests/qemu-iotests/110.out
@@ -11,7 +11,7 @@ backing file: t.IMGFMT.base (actual path: TEST_DIR/t.IMGFMT.base)
 
 === Non-reconstructable filename ===
 
-image: json:{"driver": "IMGFMT", "file": {"set-state.0.event": "read_aio", "image": {"driver": "file", "filename": "TEST_DIR/t.IMGFMT"}, "driver": "blkdebug", "set-state.0.new_state": 42}}
+image: json:{ /* filtered */ }
 file format: IMGFMT
 virtual size: 64 MiB (67108864 bytes)
 backing file: t.IMGFMT.base (actual path: TEST_DIR/t.IMGFMT.base)
@@ -22,7 +22,7 @@ Formatting 'TEST_DIR/t.IMGFMT', fmt=IMGFMT size=67108864 
backing_file=t.IMGFMT.b
 
 === Nodes without a common directory ===
 
-image: json:{"driver": "IMGFMT", "file": {"children": [{"driver": "file", "filename": "TEST_DIR/t.IMGFMT"}, {"driver": "file", "filename": "TEST_DIR/t.IMGFMT.copy"}], "driver": "quorum", "vote-threshold": 1}}
+image: json:{ /* filtered */ }
 file format: IMGFMT
 virtual size: 64 MiB (67108864 bytes)
 backing file: t.IMGFMT.base (cannot determine actual path)
-- 
2.21.0




[RFC PATCH 23/23] qcow2: Add the 'extended_l2' option and the QCOW2_INCOMPAT_EXTL2 bit

2019-10-15 Thread Alberto Garcia
Now that the implementation of subclusters is complete we can finally
add the necessary options to create and read images with this feature,
which we call "extended L2 entries".

Signed-off-by: Alberto Garcia 
---
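The cluster size restriction enforced by this patch follows from the
subcluster count: each subcluster must be at least as large as the minimum
cluster size. A minimal sketch of that check, assuming a 512-byte minimum
(MIN_CLUSTER_BITS of 9) and 32 subclusters per cluster as described in the
documentation patch of this series; the EX_* constants and function names
are hypothetical and exist only for this illustration.

```c
#include <assert.h>
#include <stdbool.h>

/* Assumed values: 512-byte minimum cluster, 32 subclusters per cluster. */
#define EX_MIN_CLUSTER_BITS 9
#define EX_SUBCLUSTERS_PER_CLUSTER 32

/* Size of one subcluster for a given (power-of-two) cluster size. */
static unsigned ex_subcluster_size(unsigned cluster_size)
{
    assert(cluster_size != 0 && (cluster_size & (cluster_size - 1)) == 0);
    return cluster_size / EX_SUBCLUSTERS_PER_CLUSTER;
}

/* True if an image with extended L2 entries may use this cluster size. */
static bool ex_extended_l2_cluster_size_ok(unsigned cluster_size)
{
    unsigned min_cluster_size =
        (1u << EX_MIN_CLUSTER_BITS) * EX_SUBCLUSTERS_PER_CLUSTER;

    assert(cluster_size != 0 && (cluster_size & (cluster_size - 1)) == 0);
    return cluster_size >= min_cluster_size;
}
```

Under these assumptions the smallest accepted cluster size is 16384 bytes,
which matches the cluster_bits >= 14 note in the format documentation, and
the default 64 KiB cluster yields 2048-byte subclusters.
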
 block/qcow2.c|  47 ++
 block/qcow2.h|   8 ++-
 include/block/block_int.h|   1 +
 qapi/block-core.json |   2 +
 tests/qemu-iotests/031.out   |   8 +--
 tests/qemu-iotests/036.out   |   4 +-
 tests/qemu-iotests/049.out   | 102 +++
 tests/qemu-iotests/060.out   |   1 +
 tests/qemu-iotests/061.out   |  20 +++---
 tests/qemu-iotests/065   |  18 --
 tests/qemu-iotests/082.out   |  48 ---
 tests/qemu-iotests/085.out   |  38 ++--
 tests/qemu-iotests/144.out   |   4 +-
 tests/qemu-iotests/182.out   |   2 +-
 tests/qemu-iotests/185.out   |   8 +--
 tests/qemu-iotests/198.out   |   2 +
 tests/qemu-iotests/206.out   |   4 ++
 tests/qemu-iotests/242.out   |   5 ++
 tests/qemu-iotests/255.out   |   8 +--
 tests/qemu-iotests/common.filter |   1 +
 20 files changed, 221 insertions(+), 110 deletions(-)

diff --git a/block/qcow2.c b/block/qcow2.c
index 2eb032aed7..44d97d30b1 100644
--- a/block/qcow2.c
+++ b/block/qcow2.c
@@ -1346,6 +1346,12 @@ static int coroutine_fn qcow2_do_open(BlockDriverState *bs, QDict *options,
 s->subcluster_size = s->cluster_size / s->subclusters_per_cluster;
 s->subcluster_bits = ctz32(s->subcluster_size);
 
+if (s->subcluster_size < (1 << MIN_CLUSTER_BITS)) {
+error_setg(errp, "Unsupported subcluster size: %d", s->subcluster_size);
+ret = -EINVAL;
+goto fail;
+}
+
 /* Check support for various header values */
 if (header.refcount_order > 6) {
 error_setg(errp, "Reference count entry width too large; may not "
@@ -2646,6 +2652,11 @@ int qcow2_update_header(BlockDriverState *bs)
 .bit  = QCOW2_COMPAT_LAZY_REFCOUNTS_BITNR,
 .name = "lazy refcounts",
 },
+{
+.type = QCOW2_FEAT_TYPE_INCOMPATIBLE,
+.bit  = QCOW2_INCOMPAT_EXTL2_BITNR,
+.name = "extended L2 entries",
+},
 };
 
 ret = header_ext_add(buf, QCOW2_EXT_MAGIC_FEATURE_TABLE,
@@ -3138,6 +3149,27 @@ qcow2_co_create(BlockdevCreateOptions *create_options, Error **errp)
 goto out;
 }
 
+if (!qcow2_opts->has_extended_l2) {
+qcow2_opts->extended_l2 = false;
+}
+if (qcow2_opts->extended_l2) {
+unsigned min_cluster_size =
+(1 << MIN_CLUSTER_BITS) * QCOW_MAX_SUBCLUSTERS_PER_CLUSTER;
+if (version < 3) {
+error_setg(errp, "Extended L2 entries are only supported with "
+   "compatibility level 1.1 and above (use version=v3 or "
+   "greater)");
+ret = -EINVAL;
+goto out;
+}
+if (cluster_size < min_cluster_size) {
+error_setg(errp, "Extended L2 entries are only supported with "
+   "cluster sizes of at least %u bytes", min_cluster_size);
+ret = -EINVAL;
+goto out;
+}
+}
+
 if (!qcow2_opts->has_refcount_bits) {
 qcow2_opts->refcount_bits = 16;
 }
@@ -3232,6 +3264,11 @@ qcow2_co_create(BlockdevCreateOptions *create_options, Error **errp)
 cpu_to_be64(QCOW2_AUTOCLEAR_DATA_FILE_RAW);
 }
 
+if (qcow2_opts->extended_l2) {
+header->incompatible_features |=
+cpu_to_be64(QCOW2_INCOMPAT_EXTL2);
+}
+
 ret = blk_pwrite(blk, 0, header, cluster_size, 0);
 g_free(header);
 if (ret < 0) {
@@ -3409,6 +3446,7 @@ static int coroutine_fn qcow2_co_create_opts(const char 
*filename, QemuOpts *opt
 { BLOCK_OPT_BACKING_FMT,"backing-fmt" },
 { BLOCK_OPT_CLUSTER_SIZE,   "cluster-size" },
 { BLOCK_OPT_LAZY_REFCOUNTS, "lazy-refcounts" },
+{ BLOCK_OPT_EXTL2,  "extended-l2" },
 { BLOCK_OPT_REFCOUNT_BITS,  "refcount-bits" },
 { BLOCK_OPT_ENCRYPT,BLOCK_OPT_ENCRYPT_FORMAT },
 { BLOCK_OPT_COMPAT_LEVEL,   "version" },
@@ -4612,6 +4650,9 @@ static ImageInfoSpecific 
*qcow2_get_specific_info(BlockDriverState *bs,
 .corrupt= s->incompatible_features &
   QCOW2_INCOMPAT_CORRUPT,
 .has_corrupt= true,
+.has_extended_l2= true,
+.extended_l2= s->incompatible_features &
+  QCOW2_INCOMPAT_EXTL2,
 .refcount_bits  = s->refcount_bits,
 .has_bitmaps= !!bitmaps,
 .bitmaps= bitmaps,
@@ -5205,6 +5246,12 @@ static QemuOptsList qcow2_create_opts = {
 .help = "Postpone refcount updates",
 .def_value_str = "off"
 

[RFC PATCH 20/23] qcow2: Update L2 bitmap in qcow2_alloc_cluster_link_l2()

2019-10-15 Thread Alberto Garcia
The L2 bitmap needs to be updated after each write to indicate what
new subclusters are now allocated.

This needs to happen even if the cluster was already allocated and the
L2 entry was otherwise valid.

Signed-off-by: Alberto Garcia 
---
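A minimal sketch of the bitmap update for a single cluster, to make the
intent concrete. It marks every subcluster whose start offset falls inside
the written range as allocated and clears its "reads as zeros" bit, as the
loop in this patch does. The EX_* macros are hypothetical: the allocation
bit for subcluster sc is assumed to be bit (31 - sc), following the
documentation patch in this series, and the zero bit is assumed to sit 32
bits above it. The real QCOW_OFLAG_SUB_ALLOC() and QCOW_OFLAG_SUB_ZERO()
macros may be defined differently.

```c
#include <assert.h>
#include <stdint.h>

#define EX_SUBCLUSTERS 32

/* Hypothetical bit helpers; the bit ordering is an assumption. */
#define EX_SUB_ALLOC(sc) (1ULL << (31 - (sc)))
#define EX_SUB_ZERO(sc)  (EX_SUB_ALLOC(sc) << 32)

/*
 * Return the updated bitmap for one cluster after writing the byte range
 * [written_from, written_to), given relative to the start of the cluster.
 */
static uint64_t ex_mark_written(uint64_t l2_bitmap, uint64_t written_from,
                                uint64_t written_to, uint64_t cluster_size)
{
    uint64_t subcluster_size = cluster_size / EX_SUBCLUSTERS;
    int sc;

    assert(subcluster_size != 0 && cluster_size % EX_SUBCLUSTERS == 0);
    assert(written_from <= written_to && written_to <= cluster_size);

    for (sc = 0; sc < EX_SUBCLUSTERS; sc++) {
        uint64_t sc_off = sc * subcluster_size;
        if (sc_off >= written_from && sc_off < written_to) {
            l2_bitmap |= EX_SUB_ALLOC(sc);  /* now allocated */
            l2_bitmap &= ~EX_SUB_ZERO(sc);  /* no longer reads as zeros */
        }
    }
    return l2_bitmap;
}
```

With 64 KiB clusters (2048-byte subclusters), writing the first 4096 bytes
of an empty cluster sets the allocation bits of subclusters 0 and 1, which
is 0xC0000000 under the assumed bit ordering.
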
 block/qcow2-cluster.c | 16 
 1 file changed, 16 insertions(+)

diff --git a/block/qcow2-cluster.c b/block/qcow2-cluster.c
index 75579c1470..9a4bf672b3 100644
--- a/block/qcow2-cluster.c
+++ b/block/qcow2-cluster.c
@@ -980,6 +980,22 @@ int qcow2_alloc_cluster_link_l2(BlockDriverState *bs, QCowL2Meta *m)
 
 set_l2_entry(s, l2_slice, l2_index + i, QCOW_OFLAG_COPIED |
  (cluster_offset + (i << s->cluster_bits)));
+
+/* Update bitmap with the subclusters that were just written */
+if (has_subclusters(s)) {
+uint64_t written_from = m->cow_start.offset;
+uint64_t written_to = m->cow_end.offset + m->cow_end.nb_bytes;
+uint64_t l2_bitmap = get_l2_bitmap(s, l2_slice, l2_index + i);
+int sc;
+for (sc = 0; sc < s->subclusters_per_cluster; sc++) {
+uint64_t sc_off = i * s->cluster_size + sc * s->subcluster_size;
+if (sc_off >= written_from && sc_off < written_to) {
+l2_bitmap |= QCOW_OFLAG_SUB_ALLOC(sc);
+l2_bitmap &= ~QCOW_OFLAG_SUB_ZERO(sc);
+}
+}
+set_l2_bitmap(s, l2_slice, l2_index + i, l2_bitmap);
+}
  }
 
 
-- 
2.20.1




[PATCH v2 07/21] iotests: Let _make_test_img parse its parameters

2019-10-15 Thread Max Reitz
This will allow us to add more options than just -b.

Signed-off-by: Max Reitz 
Reviewed-by: Maxim Levitsky 
---
 tests/qemu-iotests/common.rc | 28 
 1 file changed, 20 insertions(+), 8 deletions(-)

diff --git a/tests/qemu-iotests/common.rc b/tests/qemu-iotests/common.rc
index 12b4751848..3e7adc4834 100644
--- a/tests/qemu-iotests/common.rc
+++ b/tests/qemu-iotests/common.rc
@@ -282,12 +282,12 @@ _make_test_img()
 # extra qemu-img options can be added by tests
 # at least one argument (the image size) needs to be added
 local extra_img_options=""
-local image_size=$*
 local optstr=""
 local img_name=""
 local use_backing=0
 local backing_file=""
 local object_options=""
+local misc_params=()
 
 if [ -n "$TEST_IMG_FILE" ]; then
 img_name=$TEST_IMG_FILE
@@ -303,11 +303,23 @@ _make_test_img()
 optstr=$(_optstr_add "$optstr" "key-secret=keysec0")
 fi
 
-if [ "$1" = "-b" ]; then
-use_backing=1
-backing_file=$2
-image_size=$3
-fi
+for param; do
+if [ "$use_backing" = "1" -a -z "$backing_file" ]; then
+backing_file=$param
+continue
+fi
+
+case "$param" in
+-b)
+use_backing=1
+;;
+
+*)
+misc_params=("${misc_params[@]}" "$param")
+;;
+esac
+done
+
 if [ \( "$IMGFMT" = "qcow2" -o "$IMGFMT" = "qed" \) -a -n "$CLUSTER_SIZE" ]; then
 optstr=$(_optstr_add "$optstr" "cluster_size=$CLUSTER_SIZE")
 fi
@@ -323,9 +335,9 @@ _make_test_img()
 # XXX(hch): have global image options?
 (
  if [ $use_backing = 1 ]; then
-$QEMU_IMG create $object_options -f $IMGFMT $extra_img_options -b "$backing_file" "$img_name" $image_size 2>&1
+$QEMU_IMG create $object_options -f $IMGFMT $extra_img_options -b "$backing_file" "$img_name" "${misc_params[@]}" 2>&1
  else
-$QEMU_IMG create $object_options -f $IMGFMT $extra_img_options "$img_name" $image_size 2>&1
+$QEMU_IMG create $object_options -f $IMGFMT $extra_img_options "$img_name" "${misc_params[@]}" 2>&1
  fi
 ) | _filter_img_create
 
-- 
2.21.0




[RFC PATCH 06/23] qcow2: Add dummy has_subclusters() function

2019-10-15 Thread Alberto Garcia
This function will be used by the qcow2 code to check if an image has
subclusters or not.

At the moment this simply returns false. Once all patches needed for
subcluster support are ready then QEMU will be able to create and
read images with subclusters and this function will return the actual
value.

Signed-off-by: Alberto Garcia 
---
 block/qcow2.h | 6 ++
 1 file changed, 6 insertions(+)

diff --git a/block/qcow2.h b/block/qcow2.h
index 0b68c55c01..6d6fc57f41 100644
--- a/block/qcow2.h
+++ b/block/qcow2.h
@@ -485,6 +485,12 @@ typedef enum QCow2MetadataOverlap {
 
 #define INV_OFFSET (-1ULL)
 
+static inline bool has_subclusters(BDRVQcow2State *s)
+{
+/* FIXME: Return false until this feature is complete */
+return false;
+}
+
 static inline uint64_t get_l2_entry(BDRVQcow2State *s, uint64_t *l2_slice,
 int idx)
 {
-- 
2.20.1




[RFC PATCH 12/23] qcow2: Handle QCOW2_CLUSTER_UNALLOCATED_SUBCLUSTER

2019-10-15 Thread Alberto Garcia
In the previous patch we added a new QCow2ClusterType named
QCOW2_CLUSTER_UNALLOCATED_SUBCLUSTER. There are a couple of places
where this new value needs to be handled, and that is what this patch
does.

Signed-off-by: Alberto Garcia 
---
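To summarize the resulting behaviour of qcow2_co_block_status(), here is a
minimal sketch of how a cluster type maps to block status flags after this
patch. The enum, the EX_BLOCK_* flag values and the function name are
hypothetical stand-ins, not the QEMU definitions, and setting an
"offset valid" flag in the first branch is an assumption because that line
is not visible in the quoted hunk.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-ins for QCow2ClusterType and the BDRV_BLOCK_* flags. */
typedef enum ExClusterType {
    EX_UNALLOCATED,
    EX_UNALLOCATED_SUBCLUSTER,
    EX_ZERO_PLAIN,
    EX_ZERO_ALLOC,
    EX_NORMAL,
    EX_COMPRESSED,
} ExClusterType;

#define EX_BLOCK_DATA         0x01
#define EX_BLOCK_ZERO         0x02
#define EX_BLOCK_OFFSET_VALID 0x04

static int ex_block_status(ExClusterType type, bool encrypted)
{
    int status = 0;

    assert(type >= EX_UNALLOCATED && type <= EX_COMPRESSED);

    /* A host offset can be reported whenever the cluster is allocated,
     * even if the requested subcluster itself is not. */
    if ((type == EX_NORMAL || type == EX_ZERO_ALLOC ||
         type == EX_UNALLOCATED_SUBCLUSTER) && !encrypted) {
        status |= EX_BLOCK_OFFSET_VALID;
    }

    if (type == EX_ZERO_PLAIN || type == EX_ZERO_ALLOC) {
        status |= EX_BLOCK_ZERO;
    } else if (type != EX_UNALLOCATED &&
               type != EX_UNALLOCATED_SUBCLUSTER) {
        status |= EX_BLOCK_DATA;
    }
    return status;
}
```

The point of the new case is visible in the first and last branches: an
unallocated subcluster inside an allocated cluster still has a valid host
offset, but it must not be reported as containing data.
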
 block/qcow2.c | 10 +++---
 1 file changed, 7 insertions(+), 3 deletions(-)

diff --git a/block/qcow2.c b/block/qcow2.c
index 131711d6fa..c222cd261d 100644
--- a/block/qcow2.c
+++ b/block/qcow2.c
@@ -1922,8 +1922,8 @@ static int coroutine_fn qcow2_co_block_status(BlockDriverState *bs,
 
 *pnum = bytes;
 
-if ((ret == QCOW2_CLUSTER_NORMAL || ret == QCOW2_CLUSTER_ZERO_ALLOC) &&
-!s->crypto) {
+if ((ret == QCOW2_CLUSTER_NORMAL || ret == QCOW2_CLUSTER_ZERO_ALLOC ||
+ ret == QCOW2_CLUSTER_UNALLOCATED_SUBCLUSTER) && !s->crypto) {
 index_in_cluster = offset & (s->cluster_size - 1);
 *map = cluster_offset | index_in_cluster;
 *file = s->data_file->bs;
@@ -1931,7 +1931,8 @@ static int coroutine_fn qcow2_co_block_status(BlockDriverState *bs,
 }
 if (ret == QCOW2_CLUSTER_ZERO_PLAIN || ret == QCOW2_CLUSTER_ZERO_ALLOC) {
 status |= BDRV_BLOCK_ZERO;
-} else if (ret != QCOW2_CLUSTER_UNALLOCATED) {
+} else if (ret != QCOW2_CLUSTER_UNALLOCATED &&
+   ret != QCOW2_CLUSTER_UNALLOCATED_SUBCLUSTER) {
 status |= BDRV_BLOCK_DATA;
 }
 if (s->metadata_preallocation && (status & BDRV_BLOCK_DATA) &&
@@ -2009,6 +2010,7 @@ static coroutine_fn int qcow2_co_preadv_part(BlockDriverState *bs,
 
 switch (ret) {
 case QCOW2_CLUSTER_UNALLOCATED:
+case QCOW2_CLUSTER_UNALLOCATED_SUBCLUSTER:
 
 if (bs->backing) {
 BLKDBG_EVENT(bs->file, BLKDBG_READ_BACKING_AIO);
@@ -3542,6 +3544,7 @@ static coroutine_fn int qcow2_co_pwrite_zeroes(BlockDriverState *bs,
 nr = s->cluster_size;
 ret = qcow2_get_cluster_offset(bs, offset, &nr, &off);
 if (ret != QCOW2_CLUSTER_UNALLOCATED &&
+ret != QCOW2_CLUSTER_UNALLOCATED_SUBCLUSTER &&
 ret != QCOW2_CLUSTER_ZERO_PLAIN &&
 ret != QCOW2_CLUSTER_ZERO_ALLOC) {
 qemu_co_mutex_unlock(&s->lock);
@@ -3612,6 +3615,7 @@ qcow2_co_copy_range_from(BlockDriverState *bs,
 
 switch (ret) {
 case QCOW2_CLUSTER_UNALLOCATED:
+case QCOW2_CLUSTER_UNALLOCATED_SUBCLUSTER:
 if (bs->backing && bs->backing->bs) {
 int64_t backing_length = bdrv_getlength(bs->backing->bs);
 if (src_offset >= backing_length) {
-- 
2.20.1




[RFC PATCH 05/23] qcow2: Document the Extended L2 Entries feature

2019-10-15 Thread Alberto Garcia
Subcluster allocation in qcow2 is implemented by extending the
existing L2 table entries and adding additional information to
indicate the allocation status of each subcluster.

This patch documents the changes to the qcow2 format and how they
affect the calculation of the L2 cache size.

Signed-off-by: Alberto Garcia 
---
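As a worked illustration of how the larger entry size changes the table
lookup, the following sketch computes the L1 and L2 indexes for a guest
offset with and without extended L2 entries, using the formulas from
qcow2.txt. The struct and function names are hypothetical and only serve
this example.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical result type for the two table indexes. */
typedef struct ExL2Location {
    uint64_t l1_index;
    uint64_t l2_index;
} ExL2Location;

/* Entries per L2 table: 64-bit entries, or 128-bit ones when extended. */
static uint64_t ex_l2_entries(uint64_t cluster_size, bool extended_l2)
{
    uint64_t entry_size = extended_l2 ? 2 * sizeof(uint64_t)
                                      : sizeof(uint64_t);
    assert(cluster_size >= entry_size);
    return cluster_size / entry_size;
}

/* Map a guest offset to its L1 and L2 table indexes. */
static ExL2Location ex_locate(uint64_t offset, uint64_t cluster_size,
                              bool extended_l2)
{
    uint64_t l2_entries = ex_l2_entries(cluster_size, extended_l2);
    uint64_t cluster_nr = offset / cluster_size;
    ExL2Location loc = {
        .l1_index = cluster_nr / l2_entries,
        .l2_index = cluster_nr % l2_entries,
    };
    return loc;
}
```

With 64 KiB clusters a standard L2 table holds 8192 entries and an extended
one holds 4096, so guest cluster number 5000 lives at L1 index 0, L2 index
5000 in a standard image and at L1 index 1, L2 index 904 in an image with
extended L2 entries. Each L2 table covers half as much guest data, which is
why the L2 cache size calculation changes.
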
 docs/interop/qcow2.txt | 68 --
 docs/qcow2-cache.txt   | 19 +++-
 2 files changed, 83 insertions(+), 4 deletions(-)

diff --git a/docs/interop/qcow2.txt b/docs/interop/qcow2.txt
index af5711e533..d34261f955 100644
--- a/docs/interop/qcow2.txt
+++ b/docs/interop/qcow2.txt
@@ -39,6 +39,9 @@ The first cluster of a qcow2 image contains the file header:
 as the maximum cluster size and won't be able to open images
 with larger cluster sizes.
 
+Note: if the image has Extended L2 Entries then cluster_bits
+must be at least 14 (i.e. 16384 byte clusters).
+
  24 - 31:   size
 Virtual disk size in bytes.
 
@@ -109,7 +112,12 @@ in the description of a field.
 An External Data File Name header extension may
 be present if this bit is set.
 
-Bits 3-63:  Reserved (set to 0)
+Bit 3:  Extended L2 Entries.  If this bit is set then
+L2 table entries use an extended format that
+allows subcluster-based allocation. See the
+Extended L2 Entries section for more details.
+
+Bits 4-63:  Reserved (set to 0)
 
  80 -  87:  compatible_features
 Bitmask of compatible features. An implementation can
@@ -437,7 +445,7 @@ cannot be relaxed without an incompatible layout change).
 Given an offset into the virtual disk, the offset into the image file can be
 obtained as follows:
 
-l2_entries = (cluster_size / sizeof(uint64_t))
+l2_entries = (cluster_size / sizeof(uint64_t))[*]
 
 l2_index = (offset / cluster_size) % l2_entries
 l1_index = (offset / cluster_size) / l2_entries
@@ -447,6 +455,8 @@ obtained as follows:
 
 return cluster_offset + (offset % cluster_size)
 
+[*] this changes if Extended L2 Entries are enabled, see next section
+
 L1 table entry:
 
 Bit  0 -  8:Reserved (set to 0)
@@ -487,7 +497,8 @@ Standard Cluster Descriptor:
 nor is data read from the backing file if the cluster is
 unallocated.
 
-With version 2, this is always 0.
+With version 2 or with extended L2 entries (see the next
+section), this is always 0.
 
  1 -  8:Reserved (set to 0)
 
@@ -524,6 +535,57 @@ file (except if bit 0 in the Standard Cluster Descriptor is set). If there is
 no backing file or the backing file is smaller than the image, they shall read
 zeros for all parts that are not covered by the backing file.
 
+== Extended L2 Entries ==
+
+An image uses Extended L2 Entries if bit 3 is set on the incompatible_features
+field of the header.
+
+In these images standard data clusters are divided into 32 subclusters of the
+same size. They are contiguous and start from the beginning of the cluster.
+Subclusters can be allocated independently and the L2 entry contains information
+indicating the status of each one of them. Compressed data clusters don't have
+subclusters so they are treated like in images without this feature.
+
+The size of an extended L2 entry is 128 bits so the number of entries per table
+is calculated using this formula:
+
+l2_entries = (cluster_size / (2 * sizeof(uint64_t)))
+
+The first 64 bits have the same format as the standard L2 table entry described
+in the previous section, with the exception of bit 0 of the standard cluster
+descriptor.
+
+The last 64 bits contain a subcluster allocation bitmap with this format:
+
+Subcluster Allocation Bitmap (for standard clusters):
+
+Bit  0 -  31:   Allocation status (one bit per subcluster)
+
+1: the subcluster is allocated. In this case the
+   host cluster offset field must contain a valid
+   offset.
+0: the subcluster is not allocated. In this case
+   read requests shall go to the backing file or
+   return zeros if there is no backing file data.
+
+Bits are assigned starting from the most significant one.
+(i.e. bit x is used for subcluster 31 - x)
+
+32 -  63    Subcluster reads as zeros (one bit per subcluster)
+
+1: the subcluster reads as zeros. In this case the
+   allocation status bit must be unset. The host
+   cluster offset field may or may not be set.
+   

[RFC PATCH 22/23] qcow2: Restrict qcow2_co_pwrite_zeroes() to full clusters only

2019-10-15 Thread Alberto Garcia
Ideally it should be possible to zero individual subclusters using
this function, but this is currently not implemented.

Signed-off-by: Alberto Garcia 
---
 block/qcow2.c | 6 ++
 1 file changed, 6 insertions(+)

diff --git a/block/qcow2.c b/block/qcow2.c
index c54278ab0b..2eb032aed7 100644
--- a/block/qcow2.c
+++ b/block/qcow2.c
@@ -3544,6 +3544,12 @@ static coroutine_fn int qcow2_co_pwrite_zeroes(BlockDriverState *bs,
 bytes = s->cluster_size;
 nr = s->cluster_size;
 ret = qcow2_get_cluster_offset(bs, offset, &nr, &off);
+/* TODO: allow zeroing separate subclusters, we only allow
+ * zeroing full clusters at the moment. */
+if (nr != bytes) {
+qemu_co_mutex_unlock(&s->lock);
+return -ENOTSUP;
+}
 if (ret != QCOW2_CLUSTER_UNALLOCATED &&
 ret != QCOW2_CLUSTER_UNALLOCATED_SUBCLUSTER &&
 ret != QCOW2_CLUSTER_ZERO_PLAIN &&
-- 
2.20.1




[RFC PATCH 19/23] qcow2: Fix offset calculation in handle_dependencies()

2019-10-15 Thread Alberto Garcia
l2meta_cow_start() and l2meta_cow_end() are not necessarily
cluster-aligned if the image has subclusters, so update the
calculation of old_start and old_end to guarantee that no two requests
try to write on the same cluster.

Signed-off-by: Alberto Garcia 
---
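A minimal sketch of the overlap test after this change, to show why the
rounding matters. The old allocation's COW range is widened to whole
clusters before it is compared with the new request. The function name and
parameters are hypothetical, and the cluster size is assumed to be a power
of two.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * Return true if the request [start, start + bytes) touches any cluster
 * that the in-flight allocation [old_cow_start, old_cow_end) also touches.
 */
static bool ex_requests_conflict(uint64_t start, uint64_t bytes,
                                 uint64_t old_cow_start, uint64_t old_cow_end,
                                 uint64_t cluster_size)
{
    assert(cluster_size != 0 && (cluster_size & (cluster_size - 1)) == 0);
    assert(old_cow_start <= old_cow_end);

    uint64_t end = start + bytes;
    /* start_of_cluster(s, l2meta_cow_start(old_alloc)) */
    uint64_t old_start = old_cow_start & ~(cluster_size - 1);
    /* ROUND_UP(l2meta_cow_end(old_alloc), s->cluster_size) */
    uint64_t old_end =
        (old_cow_end + cluster_size - 1) & ~(cluster_size - 1);

    if (end <= old_start || start >= old_end) {
        return false; /* No intersection */
    }
    return true;
}
```

For example, with 64 KiB clusters an in-flight allocation whose COW range
is [2048, 4096) and a new 1024-byte request at offset 8192 do not overlap
byte-wise, yet they share cluster 0, so the rounded comparison reports a
conflict. A request starting at offset 65536 is in the next cluster and
does not conflict.
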
 block/qcow2-cluster.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/block/qcow2-cluster.c b/block/qcow2-cluster.c
index dc72f0e595..75579c1470 100644
--- a/block/qcow2-cluster.c
+++ b/block/qcow2-cluster.c
@@ -1262,8 +1262,8 @@ static int handle_dependencies(BlockDriverState *bs, uint64_t guest_offset,
 
 uint64_t start = guest_offset;
 uint64_t end = start + bytes;
-uint64_t old_start = l2meta_cow_start(old_alloc);
-uint64_t old_end = l2meta_cow_end(old_alloc);
+uint64_t old_start = start_of_cluster(s, l2meta_cow_start(old_alloc));
+uint64_t old_end = ROUND_UP(l2meta_cow_end(old_alloc), s->cluster_size);
 
 if (end <= old_start || start >= old_end) {
 /* No intersection */
-- 
2.20.1




[RFC PATCH 15/23] qcow2: Add subcluster support to zero_in_l2_slice()

2019-10-15 Thread Alberto Garcia
Setting the QCOW_OFLAG_ZERO bit of the L2 entry is forbidden if an
image has subclusters. Instead, the individual 'all zeroes' bits must
be used.

Signed-off-by: Alberto Garcia 
---
 block/qcow2-cluster.c | 14 ++
 1 file changed, 10 insertions(+), 4 deletions(-)

diff --git a/block/qcow2-cluster.c b/block/qcow2-cluster.c
index 71d4cc518a..c554b1a88c 100644
--- a/block/qcow2-cluster.c
+++ b/block/qcow2-cluster.c
@@ -1849,7 +1849,7 @@ static int zero_in_l2_slice(BlockDriverState *bs, uint64_t offset,
 assert(nb_clusters <= INT_MAX);
 
 for (i = 0; i < nb_clusters; i++) {
-uint64_t old_offset;
+uint64_t old_offset, l2_entry = 0;
 QCow2ClusterType cluster_type;
 
 old_offset = get_l2_entry(s, l2_slice, l2_index + i);
@@ -1866,12 +1866,18 @@ static int zero_in_l2_slice(BlockDriverState *bs, uint64_t offset,
 
 qcow2_cache_entry_mark_dirty(s->l2_table_cache, l2_slice);
 if (cluster_type == QCOW2_CLUSTER_COMPRESSED || unmap) {
-set_l2_entry(s, l2_slice, l2_index + i, QCOW_OFLAG_ZERO);
 qcow2_free_any_clusters(bs, old_offset, 1, QCOW2_DISCARD_REQUEST);
 } else {
-uint64_t entry = get_l2_entry(s, l2_slice, l2_index + i);
-set_l2_entry(s, l2_slice, l2_index + i, entry | QCOW_OFLAG_ZERO);
+l2_entry = get_l2_entry(s, l2_slice, l2_index + i);
 }
+
+if (has_subclusters(s)) {
+set_l2_bitmap(s, l2_slice, l2_index + i, QCOW_L2_BITMAP_ALL_ZEROES);
+} else {
+l2_entry |= QCOW_OFLAG_ZERO;
+}
+
+set_l2_entry(s, l2_slice, l2_index + i, l2_entry);
 }
 
 qcow2_cache_put(s->l2_table_cache, (void **) &l2_slice);
-- 
2.20.1




Re: [PATCH v2 1/2] nbd: Don't send oversize strings

2019-10-15 Thread Vladimir Sementsov-Ogievskiy
15.10.2019 18:07, Eric Blake wrote:
> On 10/11/19 2:32 AM, Vladimir Sementsov-Ogievskiy wrote:
>> 11.10.2019 0:00, Eric Blake wrote:
>>> Qemu as server currently won't accept export names larger than 256
>>> bytes, nor create dirty bitmap names longer than 1023 bytes, so most
>>> uses of qemu as client or server have no reason to get anywhere near
>>> the NBD spec maximum of a 4k limit per string.
>>>
>>> However, we weren't actually enforcing things, ignoring when the
>>> remote side violates the protocol on input, and also having several
>>> code paths where we send oversize strings on output (for example,
>>> qemu-nbd --description could easily send more than 4k).  Tighten
>>> things up as follows:
>>>
>>> client:
>>> - Perform bounds check on export name and dirty bitmap request prior
>>>     to handing it to server
>>> - Validate that copied server replies are not too long (ignoring
>>>     NBD_INFO_* replies that are not copied is not too bad)
>>> server:
>>> - Perform bounds check on export name and description prior to
>>>     advertising it to client
>>> - Reject client name or metadata query that is too long
>>>
>>> Signed-off-by: Eric Blake 
>>> ---
> 
>>> +++ b/include/block/nbd.h
>>> @@ -232,6 +232,7 @@ enum {
>>>     * going larger would require an audit of more code to make sure we
>>>     * aren't overflowing some other buffer. */
>>
>> This comment says, that we restrict export name to 256...
> 
> Yes, because we still stack-allocate the name in places, but 4k is too large 
> for stack allocation.  But we're inconsistent on where we use the smaller 
> 256-limit; the server won't serve an image that large, but doesn't prevent a 
> client from requesting a 4k name export (even though that export will not be 
> present).
> 
> 
>>> +++ b/blockdev-nbd.c
>>> @@ -162,6 +162,11 @@ void qmp_nbd_server_add(const char *device, bool 
>>> has_name, const char *name,
>>>    name = device;
>>>    }
>>>
>>> +    if (strlen(name) > NBD_MAX_STRING_SIZE) {
>>> +    error_setg(errp, "export name '%s' too long", name);
>>> +    return;
>>> +    }
>>
>> Hmmm, no, so here we restrict to 4096, but, we will not allow client to 
>> request more than
>> 256. Seems, to correctly update server-part, we should drop 
>> NBD_MAX_NAME_SIZE and do the
>> audit mentioned in the comment above its definition.
> 
> Yeah, I guess it's time to just get rid of NBD_MAX_NAME_SIZE, and move away 
> from stack allocations.  Should I do that as a followup to this patch, or 
> spin a v3?

Hmm. It's OK too.

With
  - fixed mem-leak in nbd_process_options
  - s/x_dirty_bitmap/x-dirty-bitmap in nbd_process_options in error message
  - following your new wordings

Reviewed-by: Vladimir Sementsov-Ogievskiy 

However, this patch introduces a possible crash point, the assertion on the
bitmap name below, so it would be better to fix that before this patch or
immediately after it. Still, a bitmap with a name longer than 4k is unlikely.
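
To illustrate the concern about asserting on a user-influenced string, here is a
minimal sketch of the alternative: validate the length and report an error
instead of aborting. The helper name string_fits_nbd_limit() and the macro
MAX_STRING_SIZE are hypothetical and not part of the patch; only the 4k limit
comes from the discussion above.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

/* Per-string limit discussed in this thread (4k); the macro name is made up. */
#define MAX_STRING_SIZE 4096

/*
 * Hypothetical helper: returns true if the string may be sent on the wire.
 * A caller would turn a false result into an error reply rather than
 * hitting an assertion.
 */
static bool string_fits_nbd_limit(const char *s)
{
    return strlen(s) <= MAX_STRING_SIZE;
}
```

A caller would check this where the bitmap or export name first enters the
process, so that later code paths can keep their assertions as pure sanity
checks.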

> 
>>> +++ b/nbd/client.c
>>> @@ -289,8 +289,8 @@ static int nbd_receive_list(QIOChannel *ioc, char 
>>> **name, char **description,
>>>    return -1;
>>>    }
>>>    len -= sizeof(namelen);
>>> -    if (len < namelen) {
>>> -    error_setg(errp, "incorrect option name length");
>>> +    if (len < namelen || namelen > NBD_MAX_STRING_SIZE) {
>>> +    error_setg(errp, "incorrect list name length");
>>
>> New wording made me go above and read the comment, what functions does. 
>> Comment is good, but without
>> it, it sounds like name of the list for me...
> 
> Maybe:
> 
> incorrect name length in server's list response

Yes, this is better, thanks

> 
>>
>>>    nbd_send_opt_abort(ioc);
>>>    return -1;
>>>    }
>>> @@ -303,6 +303,11 @@ static int nbd_receive_list(QIOChannel *ioc, char 
>>> **name, char **description,
>>>    local_name[namelen] = '\0';
>>>    len -= namelen;
>>>    if (len) {
>>> +    if (len > NBD_MAX_STRING_SIZE) {
>>> +    error_setg(errp, "incorrect list description length");
> 
> and
> 
> incorrect description length in server's list response
> 
> 
>>> @@ -648,6 +657,7 @@ static int nbd_send_meta_query(QIOChannel *ioc, 
>>> uint32_t opt,
>>>    if (query) {
>>>    query_len = strlen(query);
>>>    data_len += sizeof(query_len) + query_len;
>>> +    assert(query_len <= NBD_MAX_STRING_SIZE);
>>>    } else {
>>>    assert(opt == NBD_OPT_LIST_META_CONTEXT);
>>>    }
>>
>> you may assert export_len as well..
> 
> It was asserted earlier, but doing it again might not hurt, especially if I 
> do the followup patch getting rid of NBD_MAX_NAME_SIZE
> 
> 
>>> @@ -1561,6 +1569,8 @@ NBDExport *nbd_export_new(BlockDriverState *bs, 
>>> uint64_t dev_offset,
>>>    exp->export_bitmap = bm;
>>>    exp->export_bitmap_context = 
>>> g_strdup_printf("qemu:dirty-bitmap:%s",
>>>     bitmap);
>>> +    /* See 

Re: [PULL 00/15] Block layer patches

2019-10-15 Thread Peter Maydell
On Mon, 14 Oct 2019 at 17:03, Kevin Wolf  wrote:
>
> The following changes since commit 22dbfdecc3c52228d3489da3fe81da92b21197bf:
>
>   Merge remote-tracking branch 'remotes/awilliam/tags/vfio-update-20191010.0' 
> into staging (2019-10-14 15:09:08 +0100)
>
> are available in the Git repository at:
>
>   git://repo.or.cz/qemu/kevin.git tags/for-upstream
>
> for you to fetch changes up to a1406a9262a087d9ec9627b88da13c4590b61dae:
>
>   iotests: Test large write request to qcow2 file (2019-10-14 17:12:48 +0200)
>
> 
> Block layer patches:
>
> - block: Fix crash with qcow2 partial cluster COW with small cluster
>   sizes (misaligned write requests with BDRV_REQ_NO_FALLBACK)
> - qcow2: Fix integer overflow potentially causing corruption with huge
>   requests
> - vhdx: Detect truncated image files
> - tools: Support help options for --object
> - Various block-related replay improvements
> - iotests/028: Fix for long $TEST_DIRs


Applied, thanks.

Please update the changelog at https://wiki.qemu.org/ChangeLog/4.2
for any user-visible changes.

-- PMM



Re: [PATCH v2 00/20] nvme: support NVMe v1.3d, SGLs and multiple namespaces

2019-10-15 Thread no-reply
Patchew URL: https://patchew.org/QEMU/20191015103900.313928-1-...@irrelevant.dk/



Hi,

This series failed the docker-mingw@fedora build test. Please find the testing 
commands and
their output below. If you have Docker installed, you can probably reproduce it
locally.

=== TEST SCRIPT BEGIN ===
#! /bin/bash
export ARCH=x86_64
make docker-image-fedora V=1 NETWORK=1
time make docker-test-mingw@fedora J=14 NETWORK=1
=== TEST SCRIPT END ===

  CC  hw/misc/imx7_gpr.o
  CC  hw/misc/mst_fpga.o
/tmp/qemu-test/src/hw/block/nvme.c: In function 'nvme_map_prp':
/tmp/qemu-test/src/hw/block/nvme.c:232:42: error: cast to pointer from integer 
of different size [-Werror=int-to-pointer-cast]
 trace_nvme_err_addr_read((void *) prp2);
  ^
/tmp/qemu-test/src/hw/block/nvme.c:258:50: error: cast to pointer from integer 
of different size [-Werror=int-to-pointer-cast]
 trace_nvme_err_addr_read((void *) prp_ent);
  ^
/tmp/qemu-test/src/hw/block/nvme.c: In function 'nvme_map_sgl':
/tmp/qemu-test/src/hw/block/nvme.c:414:42: error: cast to pointer from integer 
of different size [-Werror=int-to-pointer-cast]
 trace_nvme_err_addr_read((void *) addr);
  ^
/tmp/qemu-test/src/hw/block/nvme.c:429:38: error: cast to pointer from integer 
of different size [-Werror=int-to-pointer-cast]
 trace_nvme_err_addr_read((void *) addr);
  ^
/tmp/qemu-test/src/hw/block/nvme.c:478:38: error: cast to pointer from integer 
of different size [-Werror=int-to-pointer-cast]
 trace_nvme_err_addr_read((void *) addr);
  ^
/tmp/qemu-test/src/hw/block/nvme.c:493:34: error: cast to pointer from integer 
of different size [-Werror=int-to-pointer-cast]
 trace_nvme_err_addr_read((void *) addr);
  ^
/tmp/qemu-test/src/hw/block/nvme.c: In function 'nvme_post_cqes':
/tmp/qemu-test/src/hw/block/nvme.c:847:39: error: cast to pointer from integer 
of different size [-Werror=int-to-pointer-cast]
 trace_nvme_err_addr_write((void *) addr);
   ^
/tmp/qemu-test/src/hw/block/nvme.c: In function 'nvme_process_sq':
/tmp/qemu-test/src/hw/block/nvme.c:1971:38: error: cast to pointer from integer 
of different size [-Werror=int-to-pointer-cast]
 trace_nvme_err_addr_read((void *) addr);
  ^
cc1: all warnings being treated as errors
make: *** [/tmp/qemu-test/src/rules.mak:69: hw/block/nvme.o] Error 1
make: *** Waiting for unfinished jobs
Traceback (most recent call last):
  File "./tests/docker/docker.py", line 662, in 
---
raise CalledProcessError(retcode, cmd)
subprocess.CalledProcessError: Command '['sudo', '-n', 'docker', 'run', 
'--label', 'com.qemu.instance.uuid=8aa0a85fff1f457c9dc7c826d7b3189d', '-u', 
'1001', '--security-opt', 'seccomp=unconfined', '--rm', '-e', 'TARGET_LIST=', 
'-e', 'EXTRA_CONFIGURE_OPTS=', '-e', 'V=', '-e', 'J=14', '-e', 'DEBUG=', '-e', 
'SHOW_ENV=', '-e', 'CCACHE_DIR=/var/tmp/ccache', '-v', 
'/home/patchew/.cache/qemu-docker-ccache:/var/tmp/ccache:z', '-v', 
'/var/tmp/patchew-tester-tmp-2g1bl41s/src/docker-src.2019-10-15-13.13.48.993:/var/tmp/qemu:z,ro',
 'qemu:fedora', '/var/tmp/qemu/run', 'test-mingw']' returned non-zero exit 
status 2.
filter=--filter=label=com.qemu.instance.uuid=8aa0a85fff1f457c9dc7c826d7b3189d
make[1]: *** [docker-run] Error 1
make[1]: Leaving directory `/var/tmp/patchew-tester-tmp-2g1bl41s/src'
make: *** [docker-run-test-mingw@fedora] Error 2

real    5m56.522s
user    0m7.913s


The full log is available at
http://patchew.org/logs/20191015103900.313928-1-...@irrelevant.dk/testing.docker-mingw@fedora/?type=message.
---
Email generated automatically by Patchew [https://patchew.org/].
Please send your feedback to patchew-de...@redhat.com
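
For readers unfamiliar with this class of error: the warning fires when a
64-bit integer is cast to a pointer type that is narrower on the build target.
A minimal sketch of one common way around it, assuming the value is only
needed for printing, is to format the integer directly. The function
format_addr() is a hypothetical stand-in and is not the trace helper named in
the log.

```c
#include <assert.h>
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>

/*
 * Hypothetical stand-in for a helper that logs a guest address.
 * The 64-bit value is printed with PRIx64 instead of being cast to
 * (void *), which would truncate it where pointers are 32 bits wide.
 */
static int format_addr(char *buf, size_t len, uint64_t addr)
{
    return snprintf(buf, len, "0x%" PRIx64, addr);
}
```

Whether the series should change the trace event signature or the call sites
is a decision for the patch author; the sketch only shows the portable
formatting idiom.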

Re: [PATCH 2/2] core: replace getpagesize() with qemu_real_host_page_size

2019-10-15 Thread Wei Yang
On Tue, Oct 15, 2019 at 02:45:15PM +0300, Yuval Shaia wrote:
>On Sun, Oct 13, 2019 at 10:11:45AM +0800, Wei Yang wrote:
>> There are three page sizes in qemu:
>> 
>>   real host page size
>>   host page size
>>   target page size
>> 
>> All of them have a dedicated variable to represent them. For the last two, we
>> use the same form in the whole qemu project, while for the first one we
>> use two forms: qemu_real_host_page_size and getpagesize().
>> 
>> qemu_real_host_page_size is defined to be a replacement of
>> getpagesize(), so let it serve the role.
>> 
>> [Note] Not fully tested for some arch or device.
>> 
>> Signed-off-by: Wei Yang 
>> ---
>>  accel/kvm/kvm-all.c|  6 +++---
>>  backends/hostmem.c |  2 +-
>>  block.c|  4 ++--
>>  block/file-posix.c |  9 +
>>  block/io.c |  2 +-
>>  block/parallels.c  |  2 +-
>>  block/qcow2-cache.c|  2 +-
>>  contrib/vhost-user-gpu/vugbm.c |  2 +-
>>  exec.c |  6 +++---
>>  hw/intc/s390_flic_kvm.c|  2 +-
>>  hw/ppc/mac_newworld.c  |  2 +-
>>  hw/ppc/spapr_pci.c |  2 +-
>>  hw/rdma/vmw/pvrdma_main.c  |  2 +-
>
>for pvrdma stuff:
>
>Reviewed-by: Yuval Shaia 
>Tested-by: Yuval Shaia 

Thanks

>
>>  hw/vfio/spapr.c|  7 ---
>>  include/exec/ram_addr.h|  2 +-
>>  include/qemu/osdep.h   |  4 ++--
>>  migration/migration.c  |  2 +-
>>  migration/postcopy-ram.c   |  4 ++--
>>  monitor/misc.c |  2 +-
>>  target/ppc/kvm.c   |  2 +-
>>  tests/vhost-user-bridge.c  |  8 
>>  util/mmap-alloc.c  | 10 +-
>>  util/oslib-posix.c |  4 ++--
>>  util/oslib-win32.c |  2 +-
>>  util/vfio-helpers.c| 12 ++--
>>  25 files changed, 52 insertions(+), 50 deletions(-)
>> 
>> diff --git a/accel/kvm/kvm-all.c b/accel/kvm/kvm-all.c
>> index d2d96d73e8..140b0bd8f6 100644
>> --- a/accel/kvm/kvm-all.c
>> +++ b/accel/kvm/kvm-all.c
>> @@ -52,7 +52,7 @@
>>  /* KVM uses PAGE_SIZE in its definition of KVM_COALESCED_MMIO_MAX. We
>>   * need to use the real host PAGE_SIZE, as that's what KVM will use.
>>   */
>> -#define PAGE_SIZE getpagesize()
>> +#define PAGE_SIZE qemu_real_host_page_size
>>  
>>  //#define DEBUG_KVM
>>  
>> @@ -507,7 +507,7 @@ static int 
>> kvm_get_dirty_pages_log_range(MemoryRegionSection *section,
>>  {
>>  ram_addr_t start = section->offset_within_region +
>> memory_region_get_ram_addr(section->mr);
>> -ram_addr_t pages = int128_get64(section->size) / getpagesize();
>> +ram_addr_t pages = int128_get64(section->size) / 
>> qemu_real_host_page_size;
>>  
>>  cpu_physical_memory_set_dirty_lebitmap(bitmap, start, pages);
>>  return 0;
>> @@ -1841,7 +1841,7 @@ static int kvm_init(MachineState *ms)
>>   * even with KVM.  TARGET_PAGE_SIZE is assumed to be the minimum
>>   * page size for the system though.
>>   */
>> -assert(TARGET_PAGE_SIZE <= getpagesize());
>> +assert(TARGET_PAGE_SIZE <= qemu_real_host_page_size);
>>  
>>  s->sigmask_len = 8;
>>  
>> diff --git a/backends/hostmem.c b/backends/hostmem.c
>> index 6d333dc23c..e773bdfa6e 100644
>> --- a/backends/hostmem.c
>> +++ b/backends/hostmem.c
>> @@ -304,7 +304,7 @@ size_t host_memory_backend_pagesize(HostMemoryBackend 
>> *memdev)
>>  #else
>>  size_t host_memory_backend_pagesize(HostMemoryBackend *memdev)
>>  {
>> -return getpagesize();
>> +return qemu_real_host_page_size;
>>  }
>>  #endif
>>  
>> diff --git a/block.c b/block.c
>> index 5944124845..98f47e2902 100644
>> --- a/block.c
>> +++ b/block.c
>> @@ -106,7 +106,7 @@ size_t bdrv_opt_mem_align(BlockDriverState *bs)
>>  {
>>  if (!bs || !bs->drv) {
>>  /* page size or 4k (hdd sector size) should be on the safe side */
>> -return MAX(4096, getpagesize());
>> +return MAX(4096, qemu_real_host_page_size);
>>  }
>>  
>>  return bs->bl.opt_mem_alignment;
>> @@ -116,7 +116,7 @@ size_t bdrv_min_mem_align(BlockDriverState *bs)
>>  {
>>  if (!bs || !bs->drv) {
>>  /* page size or 4k (hdd sector size) should be on the safe side */
>> -return MAX(4096, getpagesize());
>> +return MAX(4096, qemu_real_host_page_size);
>>  }
>>  
>>  return bs->bl.min_mem_alignment;
>> diff --git a/block/file-posix.c b/block/file-posix.c
>> index f12c06de2d..f60ac3f93f 100644
>> --- a/block/file-posix.c
>> +++ b/block/file-posix.c
>> @@ -322,7 +322,7 @@ static void raw_probe_alignment(BlockDriverState *bs, 
>> int fd, Error **errp)
>>  {
>>  BDRVRawState *s = bs->opaque;
>>  char *buf;
>> -size_t max_align = MAX(MAX_BLOCKSIZE, getpagesize());
>> +size_t max_align = MAX(MAX_BLOCKSIZE, qemu_real_host_page_size);
>>  size_t alignments[] = {1, 512, 1024, 2048, 4096};
>>  
>>  /* For SCSI generic devices the alignment is not really used.
>> @@ 

Re: [PATCH v3 0/5] qcow2: advanced compression options

2019-10-15 Thread no-reply
Patchew URL: 
https://patchew.org/QEMU/1571163625-642312-1-git-send-email-andrey.shinkev...@virtuozzo.com/



Hi,

This series failed the docker-mingw@fedora build test. Please find the testing 
commands and
their output below. If you have Docker installed, you can probably reproduce it
locally.

=== TEST SCRIPT BEGIN ===
#! /bin/bash
export ARCH=x86_64
make docker-image-fedora V=1 NETWORK=1
time make docker-test-mingw@fedora J=14 NETWORK=1
=== TEST SCRIPT END ===

  CC  block/blklogwrites.o
  CC  block/block-backend.o
/tmp/qemu-test/src/block/qcow2.c: In function 
'qcow2_co_pwritev_compressed_part':
/tmp/qemu-test/src/block/qcow2.c:4244:9: error: 'ret' may be used uninitialized 
in this function [-Werror=maybe-uninitialized]
 int ret;
 ^~~
cc1: all warnings being treated as errors
make: *** [/tmp/qemu-test/src/rules.mak:69: block/qcow2.o] Error 1
make: *** Waiting for unfinished jobs
Traceback (most recent call last):
  File "./tests/docker/docker.py", line 664, in 
---
  File "/opt/rh/rh-python36/root/usr/lib64/python3.6/subprocess.py", line 291, 
in check_call
raise CalledProcessError(retcode, cmd)
subprocess.CalledProcessError: Command '['sudo', '-n', 'docker', 'run', 
'--label', 'com.qemu.instance.uuid=4299392cefd911e9addb68b59973b7d0', '-u', 
'1003', '--security-opt', 'seccomp=unconfined', '--rm', '-e', 'TARGET_LIST=', 
'-e', 'EXTRA_CONFIGURE_OPTS=', '-e', 'V=', '-e', 'J=14', '-e', 'DEBUG=', '-e', 
'SHOW_ENV=', '-e', 'CCACHE_DIR=/var/tmp/ccache', '-v', 
'/home/patchew2/.cache/qemu-docker-ccache:/var/tmp/ccache:z', '-v', 
'/var/tmp/patchew-tester-tmp-r2c14at8/src/docker-src.2019-10-16-01.53.08.3890:/var/tmp/qemu:z,ro',
 'qemu:fedora', '/var/tmp/qemu/run', 'test-mingw']' returned non-zero exit 
status 2.
make[1]: *** [docker-run] Error 1
make[1]: Leaving directory `/var/tmp/patchew-tester-tmp-r2c14at8/src'
make: *** [docker-run-test-mingw@fedora] Error 2

real    2m50.343s
user    0m8.261s


The full log is available at
http://patchew.org/logs/1571163625-642312-1-git-send-email-andrey.shinkev...@virtuozzo.com/testing.docker-mingw@fedora/?type=message.
---
Email generated automatically by Patchew [https://patchew.org/].
Please send your feedback to patchew-de...@redhat.com
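
The function named in the log is not shown here, so the following is only a
generic sketch of how a "may be used uninitialized" diagnostic typically
arises and is silenced: a result variable that is assigned only inside a loop
which can run zero times. All names below are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical worker: returns 0 on success, a negative value on failure. */
static int process_chunk(size_t index)
{
    return index < 1000 ? 0 : -1;
}

static int process_all(size_t count)
{
    int ret = 0; /* initialised because the loop body may never execute */

    for (size_t i = 0; i < count; i++) {
        ret = process_chunk(i);
        if (ret < 0) {
            break;
        }
    }
    return ret;
}
```

Without the initialiser, a call with count == 0 would return an indeterminate
value, which is the situation the compiler is warning about.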

Re: [PATCH v2 00/21] iotests: Allow ./check -o data_file

2019-10-15 Thread no-reply
Patchew URL: https://patchew.org/QEMU/20191015142729.18123-1-mre...@redhat.com/



Hi,

This series seems to have some coding style problems. See output below for
more information:

Subject: [PATCH v2 00/21] iotests: Allow ./check -o data_file
Type: series
Message-id: 20191015142729.18123-1-mre...@redhat.com

=== TEST SCRIPT BEGIN ===
#!/bin/bash
git rev-parse base > /dev/null || exit 0
git config --local diff.renamelimit 0
git config --local diff.renames True
git config --local diff.algorithm histogram
./scripts/checkpatch.pl --mailback base..
=== TEST SCRIPT END ===

Updating 3c8cf5a9c21ff8782164d1def7f44bd888713384
Switched to a new branch 'test'
7e75916 iotests: Allow check -o data_file
a21918d iotests: Disable data_file where it cannot be used
1eb7209 iotests: Make 198 work with data_file
02453ff iotests: Make 137 work with data_file
cdb651c iotests: Make 110 work with data_file
1b30e90 iotests: Make 091 work with data_file
26ebffa iotests: Avoid cp/mv of test images
5d6ba79 iotests: Use _rm_test_img for deleting test images
4c20fa0 iotests: Avoid qemu-img create
944555b iotests: Drop IMGOPTS use in 267
9037b83 iotests: Replace IMGOPTS='' by --no-opts
e62282b iotests: Replace IMGOPTS= by -o
26d39b5 iotests: Inject space into -ocompat=0.10 in 051
99d129e iotests: Add -o and --no-opts to _make_test_img
301f2c3 iotests: Let _make_test_img parse its parameters
53a8dea iotests: Drop compat=1.1 in 050
85b18f8 iotests: Replace IMGOPTS by _unsupported_imgopts
476fb23 iotests: Filter refcount_order in 036
67b9119 iotests: Add _filter_json_filename
fbf9402 iotests/qcow2.py: Split feature fields into bits
afe3486 iotests/qcow2.py: Add dump-header-exts

=== OUTPUT BEGIN ===
1/21 Checking commit afe348661672 (iotests/qcow2.py: Add dump-header-exts)
ERROR: line over 90 characters
#32: FILE: tests/qemu-iotests/qcow2.py:237:
+[ 'dump-header-exts', cmd_dump_header_exts, 0, 'Dump image header 
extensions' ],

total: 1 errors, 0 warnings, 17 lines checked

Patch 1/21 has style problems, please review.  If any of these errors
are false positives report them to the maintainer, see
CHECKPATCH in MAINTAINERS.

2/21 Checking commit fbf940255d05 (iotests/qcow2.py: Split feature fields into 
bits)
3/21 Checking commit 67b9119032ad (iotests: Add _filter_json_filename)
4/21 Checking commit 476fb233c777 (iotests: Filter refcount_order in 036)
5/21 Checking commit 85b18f83a826 (iotests: Replace IMGOPTS by 
_unsupported_imgopts)
6/21 Checking commit 53a8dea8fb7b (iotests: Drop compat=1.1 in 050)
7/21 Checking commit 301f2c32204c (iotests: Let _make_test_img parse its 
parameters)
8/21 Checking commit 99d129e91dbe (iotests: Add -o and --no-opts to 
_make_test_img)
9/21 Checking commit 26d39b59dfe1 (iotests: Inject space into -ocompat=0.10 in 
051)
10/21 Checking commit e62282b2ad38 (iotests: Replace IMGOPTS= by -o)
11/21 Checking commit 9037b83425c4 (iotests: Replace IMGOPTS='' by --no-opts)
12/21 Checking commit 944555b5c283 (iotests: Drop IMGOPTS use in 267)
13/21 Checking commit 4c20fa09b6c5 (iotests: Avoid qemu-img create)
14/21 Checking commit 5d6ba791204b (iotests: Use _rm_test_img for deleting test 
images)
15/21 Checking commit 26ebffafbd87 (iotests: Avoid cp/mv of test images)
16/21 Checking commit 1b30e9035908 (iotests: Make 091 work with data_file)
17/21 Checking commit cdb651c3c22b (iotests: Make 110 work with data_file)
18/21 Checking commit 02453ff71311 (iotests: Make 137 work with data_file)
19/21 Checking commit 1eb720910a65 (iotests: Make 198 work with data_file)
20/21 Checking commit a21918dcdf92 (iotests: Disable data_file where it cannot 
be used)
21/21 Checking commit 7e7591696382 (iotests: Allow check -o data_file)
=== OUTPUT END ===

Test command exited with code: 1


The full log is available at
http://patchew.org/logs/20191015142729.18123-1-mre...@redhat.com/testing.checkpatch/?type=message.
---
Email generated automatically by Patchew [https://patchew.org/].
Please send your feedback to patchew-de...@redhat.com

Re: [PATCH 2/2] core: replace getpagesize() with qemu_real_host_page_size

2019-10-15 Thread Wei Yang
On Sun, Oct 13, 2019 at 08:28:41PM +1100, David Gibson wrote:
>On Sun, Oct 13, 2019 at 10:11:45AM +0800, Wei Yang wrote:
>> There are three page sizes in qemu:
>> 
>>   real host page size
>>   host page size
>>   target page size
>> 
>> All of them have a dedicated variable to represent them. For the last two, we
>> use the same form in the whole qemu project, while for the first one we
>> use two forms: qemu_real_host_page_size and getpagesize().
>> 
>> qemu_real_host_page_size is defined to be a replacement of
>> getpagesize(), so let it serve the role.
>> 
>> [Note] Not fully tested for some arch or device.
>> 
>> Signed-off-by: Wei Yang 
>
>Reviewed-by: David Gibson 
>
>Although the chances of someone messing this up again are almost 100%.
>

Hi, David

I think putting a check in checkpatch.pl may be a good way to prevent it.

I have just drafted a patch; I hope you like it.
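
As background for anyone skimming the thread, here is a minimal sketch of the
idea behind a cached page-size variable: query the host once and reuse the
value, rather than calling getpagesize() at every use. The names
real_host_page_size, init_real_host_page_size() and round_up_to_host_page()
are hypothetical stand-ins, and the sketch assumes a POSIX host.

```c
#include <assert.h>
#include <stdint.h>
#include <unistd.h>

/* Hypothetical stand-in for a cached "real host page size" global. */
static uintptr_t real_host_page_size;

/* Query the host once, early at startup. */
static void init_real_host_page_size(void)
{
    long sz = sysconf(_SC_PAGESIZE);

    assert(sz > 0);
    real_host_page_size = (uintptr_t)sz;
}

/* Round a length up to whole host pages using the cached value. */
static uintptr_t round_up_to_host_page(uintptr_t len)
{
    return (len + real_host_page_size - 1) & ~(real_host_page_size - 1);
}
```

With a single variable in use everywhere, a checkpatch rule only has to flag
new getpagesize() calls.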

>-- 
>David Gibson   | I'll have my music baroque, and my code
>david AT gibson.dropbear.id.au | minimalist, thank you.  NOT _the_ _other_
>   | _way_ _around_!
>http://www.ozlabs.org/~dgibson



-- 
Wei Yang
Help you, Help me



Re: [PULL 1/1] test-bdrv-drain: fix iothread_join() hang

2019-10-15 Thread Stefan Hajnoczi
On Mon, Oct 14, 2019 at 01:11:41PM +0200, Paolo Bonzini wrote:
> On 14/10/19 10:52, Stefan Hajnoczi wrote:
> > tests/test-bdrv-drain can hang in tests/iothread.c:iothread_run():
> > 
> >   while (!atomic_read(&iothread->stopping)) {
> >   aio_poll(iothread->ctx, true);
> >   }
> > 
> > The iothread_join() function works as follows:
> > 
> >   void iothread_join(IOThread *iothread)
> >   {
> >   iothread->stopping = true;
> >   aio_notify(iothread->ctx);
> >   qemu_thread_join(&iothread->thread);
> > 
> > If iothread_run() checks iothread->stopping before the iothread_join()
> > thread sets stopping to true, then aio_notify() may be optimized away
> > and iothread_run() hangs forever in aio_poll().
> > 
> > The correct way to change iothread->stopping is from a BH that executes
> > within iothread_run().  This ensures that iothread->stopping is checked
> > after we set it to true.
> > 
> > This was already fixed for ./iothread.c (note this is a different source
> > file!) by commit 2362a28ea11c145e1a13ae79342d76dc118a72a6 ("iothread:
> > fix iothread_stop() race condition"), but not for tests/iothread.c.
> 
> Aha, I did have some kind of dejavu when sending the patch I have just
> sent; let's see if this also fixes the test-aio-multithread assertion
> failure.
> 
> Note that with this change the atomic read of iothread->stopping can go
> away; I can send a separate patch later.

Yes, I thought about the atomic_read() later as well.

Stefan




Re: [PULL 01/19] util/hbitmap: strict hbitmap_reset

2019-10-15 Thread Kevin Wolf
Am 14.10.2019 um 20:10 hat John Snow geschrieben:
> 
> 
> On 10/11/19 7:18 PM, John Snow wrote:
> > 
> > 
> > On 10/11/19 5:48 PM, Eric Blake wrote:
> >> On 10/11/19 4:25 PM, John Snow wrote:
> >>> From: Vladimir Sementsov-Ogievskiy 
> >>>
> >>> hbitmap_reset has an unobvious property: it rounds requested region up.
> >>> It may provoke bugs, like in recently fixed write-blocking mode of
> >>> mirror: user calls reset on unaligned region, not keeping in mind that
> >>> there are possible unrelated dirty bytes, covered by rounded-up region
> >>> and information of this unrelated "dirtiness" will be lost.
> >>>
> >>> Make hbitmap_reset strict: assert that arguments are aligned, allowing
> >>> only one exception when @start + @count == hb->orig_size. It's needed
> >>> to comfort users of hbitmap_next_dirty_area, which cares about
> >>> hb->orig_size.
> >>>
> >>> Signed-off-by: Vladimir Sementsov-Ogievskiy 
> >>> Reviewed-by: Max Reitz 
> >>> Message-Id: <20190806152611.280389-1-vsement...@virtuozzo.com>
> >>> [Maintainer edit: Max's suggestions from on-list. --js]
> >>> Signed-off-by: John Snow 
> >>> ---
> >>>   include/qemu/hbitmap.h | 5 +
> >>>   tests/test-hbitmap.c   | 2 +-
> >>>   util/hbitmap.c | 4 
> >>>   3 files changed, 10 insertions(+), 1 deletion(-)
> >>>
> >>
> >>> +++ b/util/hbitmap.c
> >>> @@ -476,6 +476,10 @@ void hbitmap_reset(HBitmap *hb, uint64_t start,
> >>> uint64_t count)
> >>>   /* Compute range in the last layer.  */
> >>>   uint64_t first;
> >>>   uint64_t last = start + count - 1;
> >>> +    uint64_t gran = 1ULL << hb->granularity;
> >>> +
> >>> +    assert(!(start & (gran - 1)));
> >>> +    assert(!(count & (gran - 1)) || (start + count == hb->orig_size));
> >>
> >> I know I'm replying a bit late (since this is now a pull request), but
> >> would it be worth using the dedicated macro:
> >>
> >> assert(QEMU_IS_ALIGNED(start, gran));
> >> assert(QEMU_IS_ALIGNED(count, gran) || start + count == hb->orig_size);
> >>
> >> instead of open-coding it?  (I would also drop the extra () around the
> >> right half of ||). If we want it, that would now be a followup patch.
> 
> I've noticed that seasoned C programmers hate extra parentheses a lot.
> I've noticed that I cannot remember operator precedence enough to ever
> feel like this is actually an improvement.
> 
> Something about a nice weighted tree of ((expr1) || (expr2)) feels
> soothing to my weary eyes. So, if it's not terribly important, I'd
> prefer to leave it as-is.

I don't mind the parentheses, but I do prefer QEMU_IS_ALIGNED() to the
open-coded version. Would that be a viable compromise?
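
To make the comparison concrete, here is a small sketch showing that the two
spellings accept the same arguments when the granularity is a power of two.
IS_ALIGNED below is a local stand-in, assumed to be a plain modulo test; it is
not copied from the QEMU headers, and both function names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Local stand-in for an alignment macro; assumed to be a modulo test. */
#define IS_ALIGNED(n, m) (((n) % (m)) == 0)

/* The check as open-coded in the patch: bit masks on a power-of-two gran. */
static bool reset_args_ok_open_coded(uint64_t start, uint64_t count,
                                     uint64_t gran, uint64_t orig_size)
{
    return !(start & (gran - 1)) &&
           (!(count & (gran - 1)) || (start + count == orig_size));
}

/* The same check written with the alignment macro. */
static bool reset_args_ok_macro(uint64_t start, uint64_t count,
                                uint64_t gran, uint64_t orig_size)
{
    return IS_ALIGNED(start, gran) &&
           (IS_ALIGNED(count, gran) || start + count == orig_size);
}
```

The macro form states the intent directly, while the mask form relies on the
reader knowing that gran is a power of two.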

Kevin



Re: [PULL 1/2] trace: add --group=all to tracing.txt

2019-10-15 Thread Stefan Hajnoczi
On Mon, Oct 14, 2019 at 11:08:25AM +0200, Philippe Mathieu-Daudé wrote:
> Hi Stefan,
> 
> On 10/14/19 10:57 AM, Stefan Hajnoczi wrote:
> > tracetool needs to know the group name ("all", "root", or a specific
> > subdirectory).  Also remove the stdin redirection because tracetool.py
> > needs the path to the trace-events file.  Update the documentation.
> > 
> > Fixes: 2098c56a9bc5901e145fa5d4759f075808811685
> > ("trace: move setting of group name into Makefiles")
> > Launchpad: https://bugs.launchpad.net/bugs/1844814
> 
> Sorry I didn't notice that earlier, but on
> https://wiki.qemu.org/Contribute/SubmitAPatch#Write_a_meaningful_commit_message
> we recommend using the 'Buglink' tag.
> Not sure it's worth resending another pull request...

Sure, it hasn't been merged yet so I can send a v2.

Stefan




[PATCH v2 07/20] nvme: refactor device realization

2019-10-15 Thread Klaus Jensen
This patch splits up nvme_realize into multiple individual functions,
each initializing a different subset of the device.

Signed-off-by: Klaus Jensen 
---
 hw/block/nvme.c | 176 +++-
 hw/block/nvme.h |  22 ++
 2 files changed, 135 insertions(+), 63 deletions(-)

diff --git a/hw/block/nvme.c b/hw/block/nvme.c
index 84e4f2ea7a15..1fdb3b8655ed 100644
--- a/hw/block/nvme.c
+++ b/hw/block/nvme.c
@@ -43,6 +43,8 @@
 #include "trace.h"
 #include "nvme.h"
 
+#define NVME_MAX_QS PCI_MSIX_FLAGS_QSIZE
+
 #define NVME_GUEST_ERR(trace, fmt, ...) \
 do { \
 (trace_##trace)(__VA_ARGS__); \
@@ -1336,67 +1338,106 @@ static const MemoryRegionOps nvme_cmb_ops = {
 },
 };
 
-static void nvme_realize(PCIDevice *pci_dev, Error **errp)
+static int nvme_check_constraints(NvmeCtrl *n, Error **errp)
 {
-NvmeCtrl *n = NVME(pci_dev);
-NvmeIdCtrl *id = &n->id_ctrl;
-
-int i;
-int64_t bs_size;
-uint8_t *pci_conf;
-
-if (!n->params.num_queues) {
-error_setg(errp, "num_queues can't be zero");
-return;
-}
+NvmeParams *params = &n->params;
 
 if (!n->conf.blk) {
-error_setg(errp, "drive property not set");
-return;
+error_setg(errp, "nvme: block backend not configured");
+return 1;
 }
 
-bs_size = blk_getlength(n->conf.blk);
-if (bs_size < 0) {
-error_setg(errp, "could not get backing file size");
-return;
+if (!params->serial) {
+error_setg(errp, "nvme: serial not configured");
+return 1;
 }
 
-if (!n->params.serial) {
-error_setg(errp, "serial property not set");
-return;
+if ((params->num_queues < 1 || params->num_queues > NVME_MAX_QS)) {
+error_setg(errp, "nvme: invalid queue configuration");
+return 1;
 }
+
+return 0;
+}
+
+static int nvme_init_blk(NvmeCtrl *n, Error **errp)
+{
 blkconf_blocksizes(&n->conf);
 if (!blkconf_apply_backend_options(&n->conf, blk_is_read_only(n->conf.blk),
-   false, errp)) {
-return;
+false, errp)) {
+return 1;
 }
 
-pci_conf = pci_dev->config;
-pci_conf[PCI_INTERRUPT_PIN] = 1;
-pci_config_set_prog_interface(pci_dev->config, 0x2);
-pci_config_set_class(pci_dev->config, PCI_CLASS_STORAGE_EXPRESS);
-pcie_endpoint_cap_init(pci_dev, 0x80);
+return 0;
+}
 
+static void nvme_init_state(NvmeCtrl *n)
+{
 n->num_namespaces = 1;
 n->reg_size = pow2ceil(0x1004 + 2 * (n->params.num_queues + 1) * 4);
-n->ns_size = bs_size / (uint64_t)n->num_namespaces;
-
 n->namespaces = g_new0(NvmeNamespace, n->num_namespaces);
 n->sq = g_new0(NvmeSQueue *, n->params.num_queues);
 n->cq = g_new0(NvmeCQueue *, n->params.num_queues);
+}
+
+static void nvme_init_cmb(NvmeCtrl *n, PCIDevice *pci_dev)
+{
+NVME_CMBLOC_SET_BIR(n->bar.cmbloc, 2);
+NVME_CMBLOC_SET_OFST(n->bar.cmbloc, 0);
+
+NVME_CMBSZ_SET_SQS(n->bar.cmbsz, 1);
+NVME_CMBSZ_SET_CQS(n->bar.cmbsz, 1);
+NVME_CMBSZ_SET_LISTS(n->bar.cmbsz, 0);
+NVME_CMBSZ_SET_RDS(n->bar.cmbsz, 1);
+NVME_CMBSZ_SET_WDS(n->bar.cmbsz, 1);
+NVME_CMBSZ_SET_SZU(n->bar.cmbsz, 2);
+NVME_CMBSZ_SET_SZ(n->bar.cmbsz, n->params.cmb_size_mb);
+
+n->cmbloc = n->bar.cmbloc;
+n->cmbsz = n->bar.cmbsz;
+
+n->cmbuf = g_malloc0(NVME_CMBSZ_GETSIZE(n->bar.cmbsz));
+memory_region_init_io(&n->ctrl_mem, OBJECT(n), &nvme_cmb_ops, n,
+"nvme-cmb", NVME_CMBSZ_GETSIZE(n->bar.cmbsz));
+pci_register_bar(pci_dev, NVME_CMBLOC_BIR(n->bar.cmbloc),
+PCI_BASE_ADDRESS_SPACE_MEMORY | PCI_BASE_ADDRESS_MEM_TYPE_64 |
+PCI_BASE_ADDRESS_MEM_PREFETCH, &n->ctrl_mem);
+}
+
+static void nvme_init_pci(NvmeCtrl *n, PCIDevice *pci_dev)
+{
+uint8_t *pci_conf = pci_dev->config;
 
-memory_region_init_io(&n->iomem, OBJECT(n), &nvme_mmio_ops, n,
-  "nvme", n->reg_size);
+pci_conf[PCI_INTERRUPT_PIN] = 1;
+pci_config_set_prog_interface(pci_conf, 0x2);
+pci_config_set_vendor_id(pci_conf, PCI_VENDOR_ID_INTEL);
+pci_config_set_device_id(pci_conf, 0x5845);
+pci_config_set_class(pci_conf, PCI_CLASS_STORAGE_EXPRESS);
+pcie_endpoint_cap_init(pci_dev, 0x80);
+
+memory_region_init_io(&n->iomem, OBJECT(n), &nvme_mmio_ops, n, "nvme",
+n->reg_size);
 pci_register_bar(pci_dev, 0,
 PCI_BASE_ADDRESS_SPACE_MEMORY | PCI_BASE_ADDRESS_MEM_TYPE_64,
 &n->iomem);
 msix_init_exclusive_bar(pci_dev, n->params.num_queues, 4, NULL);
 
+if (n->params.cmb_size_mb) {
+nvme_init_cmb(n, pci_dev);
+}
+}
+
+static void nvme_init_ctrl(NvmeCtrl *n)
+{
+NvmeIdCtrl *id = &n->id_ctrl;
+NvmeParams *params = &n->params;
+uint8_t *pci_conf = n->parent_obj.config;
+
 id->vid = cpu_to_le16(pci_get_word(pci_conf + PCI_VENDOR_ID));
 id->ssvid = cpu_to_le16(pci_get_word(pci_conf + PCI_SUBSYSTEM_VENDOR_ID));
 strpadcpy((char *)id->mn, sizeof(id->mn), "QEMU NVMe 

[PATCH v2 00/20] nvme: support NVMe v1.3d, SGLs and multiple namespaces

2019-10-15 Thread Klaus Jensen
Hi,

(Quick note to Fam): most of this series is irrelevant to you as the
maintainer of the nvme block driver, but patch "nvme: add support for
scatter gather lists" touches block/nvme.c due to changes in the shared
NvmeCmd struct.

Anyway, v2 comes with a good bunch of changes. Compared to v1[1], I have
squashed some commits in the beginning of the series and heavily
refactored "nvme: support multiple block requests per request" into the
new commit "nvme: allow multiple aios per command".

I have also removed the original implementation of the Abort command
(commit "nvme: add support for the abort command") as it is currently
too tricky to test reliably. It has been replaced by a stub that,
besides a trivial sanity check, just fails to abort the given command.
*Some* implementation of the Abort command is mandatory, but given the
"best effort" nature of the command this is acceptable for now. When the
device gains support for arbitration it should be less tricky to test.

The support for multiple namespaces is now backwards compatible. The
nvme device still accepts a 'drive' parameter, but for multiple
namespaces the use of 'nvme-ns' devices is required. I also integrated
some feedback from Paul so the device supports non-consecutive namespace
ids.

I have also added some new commits at the end:

  - "nvme: bump controller pci device id" makes sure the Linux kernel
doesn't apply any quirks to the controller that it no longer has.
  - "nvme: handle dma errors" won't actually do anything before this[2]
fix to include/hw/pci/pci.h is merged. With these two patches added,
the device reliably passes some additional nasty tests from blktests
(block/011 "disable PCI device while doing I/O" and block/019 "break
PCI link device while doing I/O"). Before this patch, block/011
would pass from time to time if you were lucky, but would at least
mess up the controller pretty badly, causing a reset in the best
case.


  [1]: https://patchwork.kernel.org/project/qemu-devel/list/?series=142383
  [2]: https://patchwork.kernel.org/patch/11184911/


Klaus Jensen (20):
  nvme: remove superfluous breaks
  nvme: move device parameters to separate struct
  nvme: add missing fields in the identify controller data structure
  nvme: populate the mandatory subnqn and ver fields
  nvme: allow completion queues in the cmb
  nvme: add support for the abort command
  nvme: refactor device realization
  nvme: add support for the get log page command
  nvme: add support for the asynchronous event request command
  nvme: add logging to error information log page
  nvme: add missing mandatory features
  nvme: bump supported specification version to 1.3
  nvme: refactor prp mapping
  nvme: allow multiple aios per command
  nvme: add support for scatter gather lists
  nvme: support multiple namespaces
  nvme: bump controller pci device id
  nvme: remove redundant NvmeCmd pointer parameter
  nvme: make lba data size configurable
  nvme: handle dma errors

 block/nvme.c   |   18 +-
 hw/block/Makefile.objs |2 +-
 hw/block/nvme-ns.c |  139 +++
 hw/block/nvme-ns.h |   60 ++
 hw/block/nvme.c| 1863 +---
 hw/block/nvme.h|  219 -
 hw/block/trace-events  |   37 +-
 include/block/nvme.h   |  132 ++-
 8 files changed, 2094 insertions(+), 376 deletions(-)
 create mode 100644 hw/block/nvme-ns.c
 create mode 100644 hw/block/nvme-ns.h

-- 
2.23.0




[PATCH v2 06/20] nvme: add support for the abort command

2019-10-15 Thread Klaus Jensen
Required for compliance with NVMe revision 1.2.1. See NVM Express 1.2.1,
Section 5.1 ("Abort command").

The Abort command is a best effort command; for now, the device always
fails to abort the given command.

Signed-off-by: Klaus Jensen 
---
 hw/block/nvme.c | 16 
 1 file changed, 16 insertions(+)

diff --git a/hw/block/nvme.c b/hw/block/nvme.c
index daa2367b0863..84e4f2ea7a15 100644
--- a/hw/block/nvme.c
+++ b/hw/block/nvme.c
@@ -741,6 +741,18 @@ static uint16_t nvme_identify(NvmeCtrl *n, NvmeCmd *cmd)
 }
 }
 
+static uint16_t nvme_abort(NvmeCtrl *n, NvmeCmd *cmd, NvmeRequest *req)
+{
+uint16_t sqid = le32_to_cpu(cmd->cdw10) & 0xffff;
+
+req->cqe.result = 1;
+if (nvme_check_sqid(n, sqid)) {
+return NVME_INVALID_FIELD | NVME_DNR;
+}
+
+return NVME_SUCCESS;
+}
+
 static inline void nvme_set_timestamp(NvmeCtrl *n, uint64_t ts)
 {
 trace_nvme_setfeat_timestamp(ts);
@@ -859,6 +871,7 @@ static uint16_t nvme_set_feature(NvmeCtrl *n, NvmeCmd *cmd, 
NvmeRequest *req)
 trace_nvme_err_invalid_setfeat(dw10);
 return NVME_INVALID_FIELD | NVME_DNR;
 }
+
 return NVME_SUCCESS;
 }
 
@@ -875,6 +888,8 @@ static uint16_t nvme_admin_cmd(NvmeCtrl *n, NvmeCmd *cmd, 
NvmeRequest *req)
 return nvme_create_cq(n, cmd);
 case NVME_ADM_CMD_IDENTIFY:
 return nvme_identify(n, cmd);
+case NVME_ADM_CMD_ABORT:
+return nvme_abort(n, cmd, req);
 case NVME_ADM_CMD_SET_FEATURES:
 return nvme_set_feature(n, cmd, req);
 case NVME_ADM_CMD_GET_FEATURES:
@@ -1388,6 +1403,7 @@ static void nvme_realize(PCIDevice *pci_dev, Error **errp)
 id->ieee[2] = 0xb3;
 id->ver = cpu_to_le32(0x00010201);
 id->oacs = cpu_to_le16(0);
+id->acl = 3;
 id->frmw = 7 << 1;
 id->lpa = 1 << 0;
 id->sqes = (0x6 << 4) | 0x6;
-- 
2.23.0
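
To illustrate the stub behavior described in the commit message above, here
is a minimal standalone sketch. DemoCtrl, demo_check_sqid and demo_abort are
hypothetical stand-ins, not the QEMU definitions, and the status values are
placeholders. The sketch assumes the convention from the NVMe specification
that bit 0 of completion queue entry Dword 0 is set to 1 when the command was
not aborted.

```c
#include <stddef.h>
#include <stdint.h>

#define DEMO_NUM_QUEUES    4       /* hypothetical queue count */
#define DEMO_SUCCESS       0x0000  /* placeholder status codes */
#define DEMO_INVALID_FIELD 0x0002

typedef struct DemoCtrl {
    const void *sq[DEMO_NUM_QUEUES]; /* non-NULL when the queue exists */
} DemoCtrl;

/* Returns 0 if sqid names an existing submission queue, -1 otherwise. */
static int demo_check_sqid(const DemoCtrl *n, uint16_t sqid)
{
    return sqid < DEMO_NUM_QUEUES && n->sq[sqid] != NULL ? 0 : -1;
}

/*
 * Stub Abort: take the SQID from CDW10[15:0], always report
 * "command not aborted" (bit 0 of the result set), and only fail the
 * Abort command itself when the SQID is invalid.
 */
static uint16_t demo_abort(const DemoCtrl *n, uint32_t cdw10, uint32_t *result)
{
    uint16_t sqid = cdw10 & 0xffff;

    *result = 1;
    if (demo_check_sqid(n, sqid)) {
        return DEMO_INVALID_FIELD;
    }

    return DEMO_SUCCESS;
}
```

A caller would pass the raw CDW10 value; the upper 16 bits (the command
identifier to abort) are deliberately ignored by the stub.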




[PATCH v2 02/20] nvme: move device parameters to separate struct

2019-10-15 Thread Klaus Jensen
Move device configuration parameters to separate struct to make it
explicit what is configurable and what is set internally.

Signed-off-by: Klaus Jensen 
---
 hw/block/nvme.c | 44 ++--
 hw/block/nvme.h | 16 +---
 2 files changed, 35 insertions(+), 25 deletions(-)

diff --git a/hw/block/nvme.c b/hw/block/nvme.c
index c06e3ca31905..277700fdcc58 100644
--- a/hw/block/nvme.c
+++ b/hw/block/nvme.c
@@ -64,12 +64,12 @@ static void nvme_addr_read(NvmeCtrl *n, hwaddr addr, void 
*buf, int size)
 
 static int nvme_check_sqid(NvmeCtrl *n, uint16_t sqid)
 {
-return sqid < n->num_queues && n->sq[sqid] != NULL ? 0 : -1;
+return sqid < n->params.num_queues && n->sq[sqid] != NULL ? 0 : -1;
 }
 
 static int nvme_check_cqid(NvmeCtrl *n, uint16_t cqid)
 {
-return cqid < n->num_queues && n->cq[cqid] != NULL ? 0 : -1;
+return cqid < n->params.num_queues && n->cq[cqid] != NULL ? 0 : -1;
 }
 
 static void nvme_inc_cq_tail(NvmeCQueue *cq)
@@ -631,7 +631,7 @@ static uint16_t nvme_create_cq(NvmeCtrl *n, NvmeCmd *cmd)
 trace_nvme_err_invalid_create_cq_addr(prp1);
 return NVME_INVALID_FIELD | NVME_DNR;
 }
-if (unlikely(vector > n->num_queues)) {
+if (unlikely(vector > n->params.num_queues)) {
 trace_nvme_err_invalid_create_cq_vector(vector);
 return NVME_INVALID_IRQ_VECTOR | NVME_DNR;
 }
@@ -783,7 +783,8 @@ static uint16_t nvme_get_feature(NvmeCtrl *n, NvmeCmd *cmd, 
NvmeRequest *req)
 trace_nvme_getfeat_vwcache(result ? "enabled" : "disabled");
 break;
 case NVME_NUMBER_OF_QUEUES:
-result = cpu_to_le32((n->num_queues - 2) | ((n->num_queues - 2) << 
16));
+result = cpu_to_le32((n->params.num_queues - 2) |
+((n->params.num_queues - 2) << 16));
 trace_nvme_getfeat_numq(result);
 break;
 case NVME_TIMESTAMP:
@@ -827,9 +828,10 @@ static uint16_t nvme_set_feature(NvmeCtrl *n, NvmeCmd 
*cmd, NvmeRequest *req)
 case NVME_NUMBER_OF_QUEUES:
 trace_nvme_setfeat_numq((dw11 & 0xFFFF) + 1,
 ((dw11 >> 16) & 0xFFFF) + 1,
-n->num_queues - 1, n->num_queues - 1);
-req->cqe.result =
-cpu_to_le32((n->num_queues - 2) | ((n->num_queues - 2) << 16));
+n->params.num_queues - 1,
+n->params.num_queues - 1);
+req->cqe.result = cpu_to_le32((n->params.num_queues - 2) |
+((n->params.num_queues - 2) << 16));
 break;
 case NVME_TIMESTAMP:
 return nvme_set_feature_timestamp(n, cmd);
@@ -900,12 +902,12 @@ static void nvme_clear_ctrl(NvmeCtrl *n)
 
 blk_drain(n->conf.blk);
 
-for (i = 0; i < n->num_queues; i++) {
+for (i = 0; i < n->params.num_queues; i++) {
 if (n->sq[i] != NULL) {
 nvme_free_sq(n->sq[i], n);
 }
 }
-for (i = 0; i < n->num_queues; i++) {
+for (i = 0; i < n->params.num_queues; i++) {
 if (n->cq[i] != NULL) {
 nvme_free_cq(n->cq[i], n);
 }
@@ -1308,7 +1310,7 @@ static void nvme_realize(PCIDevice *pci_dev, Error **errp)
 int64_t bs_size;
 uint8_t *pci_conf;
 
-if (!n->num_queues) {
+if (!n->params.num_queues) {
 error_setg(errp, "num_queues can't be zero");
 return;
 }
@@ -1324,7 +1326,7 @@ static void nvme_realize(PCIDevice *pci_dev, Error **errp)
 return;
 }
 
-if (!n->serial) {
+if (!n->params.serial) {
 error_setg(errp, "serial property not set");
 return;
 }
@@ -1341,25 +1343,25 @@ static void nvme_realize(PCIDevice *pci_dev, Error 
**errp)
 pcie_endpoint_cap_init(pci_dev, 0x80);
 
 n->num_namespaces = 1;
-n->reg_size = pow2ceil(0x1004 + 2 * (n->num_queues + 1) * 4);
+n->reg_size = pow2ceil(0x1004 + 2 * (n->params.num_queues + 1) * 4);
 n->ns_size = bs_size / (uint64_t)n->num_namespaces;
 
 n->namespaces = g_new0(NvmeNamespace, n->num_namespaces);
-n->sq = g_new0(NvmeSQueue *, n->num_queues);
-n->cq = g_new0(NvmeCQueue *, n->num_queues);
+n->sq = g_new0(NvmeSQueue *, n->params.num_queues);
+n->cq = g_new0(NvmeCQueue *, n->params.num_queues);
 
 memory_region_init_io(&n->iomem, OBJECT(n), &nvme_mmio_ops, n,
   "nvme", n->reg_size);
 pci_register_bar(pci_dev, 0,
 PCI_BASE_ADDRESS_SPACE_MEMORY | PCI_BASE_ADDRESS_MEM_TYPE_64,
 &n->iomem);
-msix_init_exclusive_bar(pci_dev, n->num_queues, 4, NULL);
+msix_init_exclusive_bar(pci_dev, n->params.num_queues, 4, NULL);
 
 id->vid = cpu_to_le16(pci_get_word(pci_conf + PCI_VENDOR_ID));
 id->ssvid = cpu_to_le16(pci_get_word(pci_conf + PCI_SUBSYSTEM_VENDOR_ID));
 strpadcpy((char *)id->mn, sizeof(id->mn), "QEMU NVMe Ctrl", ' ');
 strpadcpy((char *)id->fr, sizeof(id->fr), "1.0", ' ');
-strpadcpy((char *)id->sn, sizeof(id->sn), n->serial, ' ');
+strpadcpy((char *)id->sn, 

[PATCH v2 03/20] nvme: add missing fields in the identify controller data structure

2019-10-15 Thread Klaus Jensen
Not used by the device model but added for completeness. See NVM Express
1.2.1, Section 5.11 ("Identify command"), Figure 90.

Signed-off-by: Klaus Jensen 
---
 include/block/nvme.h | 34 +-
 1 file changed, 29 insertions(+), 5 deletions(-)

diff --git a/include/block/nvme.h b/include/block/nvme.h
index 3ec8efcc435e..1b0accd4fe2b 100644
--- a/include/block/nvme.h
+++ b/include/block/nvme.h
@@ -543,7 +543,13 @@ typedef struct NvmeIdCtrl {
 uint8_t ieee[3];
 uint8_t cmic;
 uint8_t mdts;
-uint8_t rsvd255[178];
+uint16_t cntlid;
+uint32_t ver;
+uint16_t rtd3r;
+uint32_t rtd3e;
+uint32_t oaes;
+uint32_t ctratt;
+uint8_t rsvd255[156];
 uint16_t oacs;
 uint8_t acl;
 uint8_t aerl;
@@ -551,10 +557,22 @@ typedef struct NvmeIdCtrl {
 uint8_t lpa;
 uint8_t elpe;
 uint8_t npss;
-uint8_t rsvd511[248];
+uint8_t avscc;
+uint8_t apsta;
+uint16_t wctemp;
+uint16_t cctemp;
+uint16_t mtfa;
+uint32_t hmpre;
+uint32_t hmmin;
+uint8_t tnvmcap[16];
+uint8_t unvmcap[16];
+uint32_t rpmbs;
+uint8_t rsvd319[4];
+uint16_t kas;
+uint8_t rsvd511[190];
 uint8_t sqes;
 uint8_t cqes;
-uint16_t rsvd515;
+uint16_t maxcmd;
 uint32_t nn;
 uint16_t oncs;
 uint16_t fuses;
@@ -562,8 +580,14 @@ typedef struct NvmeIdCtrl {
 uint8_t vwc;
 uint16_t awun;
 uint16_t awupf;
-uint8_t rsvd703[174];
-uint8_t rsvd2047[1344];
+uint8_t nvscc;
+uint8_t rsvd531;
+uint16_t acwu;
+uint16_t rsvd535;
+uint32_t sgls;
+uint8_t rsvd767[228];
+uint8_t subnqn[256];
+uint8_t rsvd2047[1024];
 NvmePSD psd[32];
 uint8_t vs[1024];
 } NvmeIdCtrl;
-- 
2.23.0




[PATCH v2 04/20] nvme: populate the mandatory subnqn and ver fields

2019-10-15 Thread Klaus Jensen
Required for compliance with NVMe revision 1.2.1 or later. See NVM
Express 1.2.1, Section 5.11 ("Identify command"), Figure 90 and Section
7.9 ("NVMe Qualified Names").

This also bumps the supported version to 1.2.1.

Signed-off-by: Klaus Jensen 
---
 hw/block/nvme.c | 11 ---
 1 file changed, 8 insertions(+), 3 deletions(-)

diff --git a/hw/block/nvme.c b/hw/block/nvme.c
index 277700fdcc58..16f0fba10b08 100644
--- a/hw/block/nvme.c
+++ b/hw/block/nvme.c
@@ -9,9 +9,9 @@
  */
 
 /**
- * Reference Specs: http://www.nvmexpress.org, 1.2, 1.1, 1.0e
+ * Reference Specification: NVM Express 1.2.1
  *
- *  http://www.nvmexpress.org/resources/
+ *   https://nvmexpress.org/resources/specifications/
  */
 
 /**
@@ -1366,6 +1366,7 @@ static void nvme_realize(PCIDevice *pci_dev, Error **errp)
 id->ieee[0] = 0x00;
 id->ieee[1] = 0x02;
 id->ieee[2] = 0xb3;
+id->ver = cpu_to_le32(0x00010201);
 id->oacs = cpu_to_le16(0);
 id->frmw = 7 << 1;
 id->lpa = 1 << 0;
@@ -1373,6 +1374,10 @@ static void nvme_realize(PCIDevice *pci_dev, Error 
**errp)
 id->cqes = (0x4 << 4) | 0x4;
 id->nn = cpu_to_le32(n->num_namespaces);
 id->oncs = cpu_to_le16(NVME_ONCS_WRITE_ZEROS | NVME_ONCS_TIMESTAMP);
+
+strcpy((char *) id->subnqn, "nqn.2019-08.org.qemu:");
+pstrcat((char *) id->subnqn, sizeof(id->subnqn), n->params.serial);
+
 id->psd[0].mp = cpu_to_le16(0x9c4);
 id->psd[0].enlat = cpu_to_le32(0x10);
 id->psd[0].exlat = cpu_to_le32(0x4);
@@ -1387,7 +1392,7 @@ static void nvme_realize(PCIDevice *pci_dev, Error **errp)
 NVME_CAP_SET_CSS(n->bar.cap, 1);
 NVME_CAP_SET_MPSMAX(n->bar.cap, 4);
 
-n->bar.vs = 0x00010200;
+n->bar.vs = 0x00010201;
 n->bar.intmc = n->bar.intms = 0;
 
 if (n->params.cmb_size_mb) {
-- 
2.23.0
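
As a small illustration of the two fields populated above, the following
sketch decodes a version value such as 0x00010201 into major, minor and
tertiary numbers (bits 31:16, 15:8 and 7:0) and builds the subsystem NQN
string. The demo_* helpers are hypothetical, and snprintf is used here as a
standalone substitute for the strcpy/pstrcat pair in the patch.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Split a version value into major (31:16), minor (15:8), tertiary (7:0). */
static void demo_decode_ver(uint32_t ver, unsigned *mjr, unsigned *mnr,
                            unsigned *ter)
{
    *mjr = ver >> 16;
    *mnr = (ver >> 8) & 0xff;
    *ter = ver & 0xff;
}

/* Write "nqn.2019-08.org.qemu:<serial>" into buf, truncating if needed. */
static void demo_build_subnqn(char *buf, size_t len, const char *serial)
{
    snprintf(buf, len, "nqn.2019-08.org.qemu:%s", serial);
}
```

With ver = 0x00010201 the decoded triple is 1, 2, 1, matching the "1.2.1"
in the commit message.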




[PATCH v2 10/20] nvme: add logging to error information log page

2019-10-15 Thread Klaus Jensen
This adds the nvme_set_error_page function which allows errors to be
written to the error information log page. The functionality is largely
unused in the device, but with this in place we can at least try to push
new contributions to use it.

NOTE: In violation of the specification the Error Count field is *not*
retained across power off conditions because the device currently has no
place to store this kind of persistent state.

Cribbed from Keith's qemu-nvme tree.

Signed-off-by: Klaus Jensen 
---
 hw/block/nvme.c | 22 --
 hw/block/nvme.h |  2 ++
 2 files changed, 22 insertions(+), 2 deletions(-)

diff --git a/hw/block/nvme.c b/hw/block/nvme.c
index 5cdee37582f9..32381d7df655 100644
--- a/hw/block/nvme.c
+++ b/hw/block/nvme.c
@@ -161,6 +161,22 @@ static void nvme_irq_deassert(NvmeCtrl *n, NvmeCQueue *cq)
 }
 }
 
+static void nvme_set_error_page(NvmeCtrl *n, uint16_t sqid, uint16_t cid,
+uint16_t status, uint16_t location, uint64_t lba, uint32_t nsid)
+{
+NvmeErrorLog *elp;
+
+elp = &n->elpes[n->elp_index];
+elp->error_count = n->error_count++;
+elp->sqid = sqid;
+elp->cid = cid;
+elp->status_field = status;
+elp->param_error_location = location;
+elp->lba = lba;
+elp->nsid = nsid;
+n->elp_index = (n->elp_index + 1) % n->params.elpe;
+}
+
 static uint16_t nvme_map_prp(QEMUSGList *qsg, QEMUIOVector *iov, uint64_t prp1,
  uint64_t prp2, uint32_t len, NvmeCtrl *n)
 {
@@ -386,7 +402,9 @@ static void nvme_rw_cb(void *opaque, int ret)
 req->status = NVME_SUCCESS;
 } else {
 block_acct_failed(blk_get_stats(n->conf.blk), &req->acct);
-req->status = NVME_INTERNAL_DEV_ERROR;
+nvme_set_error_page(n, sq->sqid, cpu_to_le16(req->cid),
+NVME_INTERNAL_DEV_ERROR, 0, 0, 1);
+req->status = NVME_INTERNAL_DEV_ERROR | NVME_MORE;
 }
 if (req->has_sg) {
 qemu_sglist_destroy(&req->qsg);
@@ -678,7 +696,7 @@ static uint16_t nvme_smart_info(NvmeCtrl *n, NvmeCmd *cmd, 
uint8_t rae,
 smart.host_read_commands[0] = cpu_to_le64(read_commands);
 smart.host_write_commands[0] = cpu_to_le64(write_commands);
 
-smart.number_of_error_log_entries[0] = cpu_to_le64(0);
+smart.number_of_error_log_entries[0] = cpu_to_le64(n->error_count);
 smart.temperature[0] = n->temperature & 0xff;
 smart.temperature[1] = (n->temperature >> 8) & 0xff;
 
diff --git a/hw/block/nvme.h b/hw/block/nvme.h
index 3fc36f577b46..d74b0e0f9b2c 100644
--- a/hw/block/nvme.h
+++ b/hw/block/nvme.h
@@ -100,6 +100,8 @@ typedef struct NvmeCtrl {
 uint64_t timestamp_set_qemu_clock_ms; /* QEMU clock time */
 uint64_t starttime_ms;
 uint16_t temperature;
+uint8_t elp_index;
+uint64_t error_count;
 
 QEMUTimer   *aer_timer;
 uint8_t aer_mask;
-- 
2.23.0
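
The error log in this patch behaves like a ring buffer: each new entry goes
into the slot at elp_index, which then advances modulo the slot count. A
minimal standalone sketch of that idea follows. The DemoErrorLog and
DemoErrCtrl types and the fixed DEMO_ELPE slot count are assumptions made for
illustration; they are not the structures from the patch.

```c
#include <stdint.h>

#define DEMO_ELPE 4  /* hypothetical number of error log slots */

typedef struct DemoErrorLog {
    uint64_t error_count;
    uint16_t sqid;
    uint16_t cid;
    uint16_t status_field;
} DemoErrorLog;

typedef struct DemoErrCtrl {
    DemoErrorLog elpes[DEMO_ELPE];
    uint8_t elp_index;     /* next slot to overwrite */
    uint64_t error_count;  /* running count, not persisted */
} DemoErrCtrl;

/* Record an error in the next slot and advance the index with wrap-around. */
static void demo_set_error_page(DemoErrCtrl *n, uint16_t sqid, uint16_t cid,
                                uint16_t status)
{
    DemoErrorLog *elp = &n->elpes[n->elp_index];

    elp->error_count = n->error_count++;
    elp->sqid = sqid;
    elp->cid = cid;
    elp->status_field = status;
    n->elp_index = (n->elp_index + 1) % DEMO_ELPE;
}
```

Once more than DEMO_ELPE errors have been logged, the oldest entries are
overwritten while the running error count keeps increasing.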




[PATCH v2 09/20] nvme: add support for the asynchronous event request command

2019-10-15 Thread Klaus Jensen
Required for compliance with NVMe revision 1.2.1. See NVM Express 1.2.1,
Section 5.2 ("Asynchronous Event Request command").

Mostly imported from Keith's qemu-nvme tree. Modified to not enqueue
events if something of the same type is already queued (but not cleared
by the host).

Signed-off-by: Klaus Jensen 
---
 hw/block/nvme.c   | 180 --
 hw/block/nvme.h   |  13 ++-
 hw/block/trace-events |   8 ++
 include/block/nvme.h  |   4 +-
 4 files changed, 196 insertions(+), 9 deletions(-)

diff --git a/hw/block/nvme.c b/hw/block/nvme.c
index 4412a3bea3bc..5cdee37582f9 100644
--- a/hw/block/nvme.c
+++ b/hw/block/nvme.c
@@ -334,6 +334,46 @@ static void nvme_enqueue_req_completion(NvmeCQueue *cq, 
NvmeRequest *req)
 timer_mod(cq->timer, qemu_clock_get_ns(QEMU_CLOCK_VIRTUAL) + 500);
 }
 
+static void nvme_enqueue_event(NvmeCtrl *n, uint8_t event_type,
+uint8_t event_info, uint8_t log_page)
+{
+NvmeAsyncEvent *event;
+
+trace_nvme_enqueue_event(event_type, event_info, log_page);
+
+/*
+ * Do not enqueue the event if something of this type is already queued.
+ * This bounds the size of the event queue and makes sure it does not grow
+ * indefinitely when events are not processed by the host (i.e. does not
+ * issue any AERs).
+ */
+if (n->aer_mask_queued & (1 << event_type)) {
+trace_nvme_enqueue_event_masked(event_type);
+return;
+}
+n->aer_mask_queued |= (1 << event_type);
+
+event = g_new(NvmeAsyncEvent, 1);
+event->result = (NvmeAerResult) {
+.event_type = event_type,
+.event_info = event_info,
+.log_page   = log_page,
+};
+
+QTAILQ_INSERT_TAIL(&n->aer_queue, event, entry);
+
+timer_mod(n->aer_timer, qemu_clock_get_ns(QEMU_CLOCK_VIRTUAL) + 500);
+}
+
+static void nvme_clear_events(NvmeCtrl *n, uint8_t event_type)
+{
+n->aer_mask &= ~(1 << event_type);
+if (!QTAILQ_EMPTY(&n->aer_queue)) {
+timer_mod(n->aer_timer,
+qemu_clock_get_ns(QEMU_CLOCK_VIRTUAL) + 500);
+}
+}
+
 static void nvme_rw_cb(void *opaque, int ret)
 {
 NvmeRequest *req = opaque;
@@ -578,7 +618,7 @@ static uint16_t nvme_create_sq(NvmeCtrl *n, NvmeCmd *cmd)
 return NVME_SUCCESS;
 }
 
-static uint16_t nvme_error_info(NvmeCtrl *n, NvmeCmd *cmd,
+static uint16_t nvme_error_info(NvmeCtrl *n, NvmeCmd *cmd, uint8_t rae,
 uint32_t buf_len, uint64_t off, NvmeRequest *req)
 {
 uint32_t trans_len;
@@ -591,12 +631,16 @@ static uint16_t nvme_error_info(NvmeCtrl *n, NvmeCmd *cmd,
 
 trans_len = MIN(sizeof(*n->elpes) * (n->params.elpe + 1) - off, buf_len);
 
+if (!rae) {
+nvme_clear_events(n, NVME_AER_TYPE_ERROR);
+}
+
 return nvme_dma_read_prp(n, (uint8_t *) n->elpes + off, trans_len, prp1,
 prp2);
 }
 
-static uint16_t nvme_smart_info(NvmeCtrl *n, NvmeCmd *cmd, uint32_t buf_len,
-uint64_t off, NvmeRequest *req)
+static uint16_t nvme_smart_info(NvmeCtrl *n, NvmeCmd *cmd, uint8_t rae,
+uint32_t buf_len, uint64_t off, NvmeRequest *req)
 {
 uint64_t prp1 = le64_to_cpu(cmd->prp1);
 uint64_t prp2 = le64_to_cpu(cmd->prp2);
@@ -646,6 +690,10 @@ static uint16_t nvme_smart_info(NvmeCtrl *n, NvmeCmd *cmd, 
uint32_t buf_len,
 smart.power_on_hours[0] = cpu_to_le64(
 (((current_ms - n->starttime_ms) / 1000) / 60) / 60);
 
+if (!rae) {
+nvme_clear_events(n, NVME_AER_TYPE_SMART);
+}
+
 return nvme_dma_read_prp(n, (uint8_t *) &smart + off, trans_len, prp1,
 prp2);
 }
@@ -698,9 +746,9 @@ static uint16_t nvme_get_log(NvmeCtrl *n, NvmeCmd *cmd, 
NvmeRequest *req)
 
 switch (lid) {
 case NVME_LOG_ERROR_INFO:
-return nvme_error_info(n, cmd, len, off, req);
+return nvme_error_info(n, cmd, rae, len, off, req);
 case NVME_LOG_SMART_INFO:
-return nvme_smart_info(n, cmd, len, off, req);
+return nvme_smart_info(n, cmd, rae, len, off, req);
 case NVME_LOG_FW_SLOT_INFO:
 return nvme_fw_log_info(n, cmd, len, off, req);
 default:
@@ -958,6 +1006,9 @@ static uint16_t nvme_get_feature(NvmeCtrl *n, NvmeCmd 
*cmd, NvmeRequest *req)
 break;
 case NVME_TIMESTAMP:
 return nvme_get_feature_timestamp(n, cmd);
+case NVME_ASYNCHRONOUS_EVENT_CONF:
+result = cpu_to_le32(n->features.async_config);
+break;
 default:
 trace_nvme_err_invalid_getfeat(dw10);
 return NVME_INVALID_FIELD | NVME_DNR;
@@ -993,6 +1044,12 @@ static uint16_t nvme_set_feature(NvmeCtrl *n, NvmeCmd 
*cmd, NvmeRequest *req)
 switch (dw10) {
 case NVME_TEMPERATURE_THRESHOLD:
 n->features.temp_thresh = dw11;
+
+if (n->features.temp_thresh <= n->temperature) {
+nvme_enqueue_event(n, NVME_AER_TYPE_SMART,
+NVME_AER_INFO_SMART_TEMP_THRESH, NVME_LOG_SMART_INFO);
+}
+
 break;
 
 case NVME_VOLATILE_WRITE_CACHE:
@@ -1008,6 +1065,9 @@ static uint16_t nvme_set_feature(NvmeCtrl *n, 
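
The "do not enqueue if something of the same type is already queued" rule
from the commit message can be sketched with a per-type bitmask. This is a
simplified standalone illustration: DemoAer, demo_enqueue_event and
demo_dequeue_event are hypothetical names, the queue is reduced to a counter,
and the point at which the bit is cleared again is an assumption, since that
part of the patch is not shown above.

```c
#include <stdbool.h>
#include <stdint.h>

typedef struct DemoAer {
    uint8_t queued_mask;  /* bit N set: an event of type N is queued */
    unsigned queue_len;   /* stand-in for the real event queue */
} DemoAer;

/* Returns true if the event was queued, false if dropped as a duplicate. */
static bool demo_enqueue_event(DemoAer *a, uint8_t event_type)
{
    uint8_t bit = (uint8_t)(1u << event_type);

    if (a->queued_mask & bit) {
        return false;  /* same type already pending; queue stays bounded */
    }
    a->queued_mask |= bit;
    a->queue_len++;
    return true;
}

/* Assumed counterpart: called when an event of this type leaves the queue. */
static void demo_dequeue_event(DemoAer *a, uint8_t event_type)
{
    a->queued_mask &= (uint8_t)~(1u << event_type);
    a->queue_len--;
}
```

Because at most one event per type can be pending, the queue length is
bounded by the number of event types even if the host never issues an AER.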

[PATCH v2 08/20] nvme: add support for the get log page command

2019-10-15 Thread Klaus Jensen
Add support for the Get Log Page command and basic implementations
of the mandatory Error Information, SMART/Health Information and
Firmware Slot Information log pages.

In violation of the specification, the SMART/Health Information log page
does not persist information over the lifetime of the controller because
the device has no place to store such persistent state.

Required for compliance with NVMe revision 1.2.1. See NVM Express 1.2.1,
Section 5.10 ("Get Log Page command").

Signed-off-by: Klaus Jensen 
---
 hw/block/nvme.c   | 150 +-
 hw/block/nvme.h   |   9 ++-
 hw/block/trace-events |   2 +
 include/block/nvme.h  |   2 +-
 4 files changed, 160 insertions(+), 3 deletions(-)

diff --git a/hw/block/nvme.c b/hw/block/nvme.c
index 1fdb3b8655ed..4412a3bea3bc 100644
--- a/hw/block/nvme.c
+++ b/hw/block/nvme.c
@@ -44,6 +44,7 @@
 #include "nvme.h"
 
 #define NVME_MAX_QS PCI_MSIX_FLAGS_QSIZE
+#define NVME_TEMPERATURE 0x143
 
 #define NVME_GUEST_ERR(trace, fmt, ...) \
 do { \
@@ -577,6 +578,137 @@ static uint16_t nvme_create_sq(NvmeCtrl *n, NvmeCmd *cmd)
 return NVME_SUCCESS;
 }
 
+static uint16_t nvme_error_info(NvmeCtrl *n, NvmeCmd *cmd,
+uint32_t buf_len, uint64_t off, NvmeRequest *req)
+{
+uint32_t trans_len;
+uint64_t prp1 = le64_to_cpu(cmd->prp1);
+uint64_t prp2 = le64_to_cpu(cmd->prp2);
+
+if (off > sizeof(*n->elpes) * (n->params.elpe + 1)) {
+return NVME_INVALID_FIELD | NVME_DNR;
+}
+
+trans_len = MIN(sizeof(*n->elpes) * (n->params.elpe + 1) - off, buf_len);
+
+return nvme_dma_read_prp(n, (uint8_t *) n->elpes + off, trans_len, prp1,
+prp2);
+}
+
+static uint16_t nvme_smart_info(NvmeCtrl *n, NvmeCmd *cmd, uint32_t buf_len,
+uint64_t off, NvmeRequest *req)
+{
+uint64_t prp1 = le64_to_cpu(cmd->prp1);
+uint64_t prp2 = le64_to_cpu(cmd->prp2);
+uint32_t nsid = le32_to_cpu(cmd->nsid);
+
+uint32_t trans_len;
+time_t current_ms;
+uint64_t units_read = 0, units_written = 0, read_commands = 0,
+write_commands = 0;
+NvmeSmartLog smart;
+BlockAcctStats *s;
+
+if (!nsid || (nsid != 0xffffffff && nsid > n->num_namespaces)) {
+trace_nvme_err_invalid_ns(nsid, n->num_namespaces);
+return NVME_INVALID_NSID | NVME_DNR;
+}
+
+s = blk_get_stats(n->conf.blk);
+
+units_read = s->nr_bytes[BLOCK_ACCT_READ] >> BDRV_SECTOR_BITS;
+units_written = s->nr_bytes[BLOCK_ACCT_WRITE] >> BDRV_SECTOR_BITS;
+read_commands = s->nr_ops[BLOCK_ACCT_READ];
+write_commands = s->nr_ops[BLOCK_ACCT_WRITE];
+
+if (off > sizeof(smart)) {
+return NVME_INVALID_FIELD | NVME_DNR;
+}
+
+trans_len = MIN(sizeof(smart) - off, buf_len);
+
+memset(&smart, 0x0, sizeof(smart));
+
+smart.data_units_read[0] = cpu_to_le64(units_read / 1000);
+smart.data_units_written[0] = cpu_to_le64(units_written / 1000);
+smart.host_read_commands[0] = cpu_to_le64(read_commands);
+smart.host_write_commands[0] = cpu_to_le64(write_commands);
+
+smart.number_of_error_log_entries[0] = cpu_to_le64(0);
+smart.temperature[0] = n->temperature & 0xff;
+smart.temperature[1] = (n->temperature >> 8) & 0xff;
+
+if (n->features.temp_thresh <= n->temperature) {
+smart.critical_warning |= NVME_SMART_TEMPERATURE;
+}
+
+current_ms = qemu_clock_get_ms(QEMU_CLOCK_VIRTUAL);
+smart.power_on_hours[0] = cpu_to_le64(
+(((current_ms - n->starttime_ms) / 1000) / 60) / 60);
+
+return nvme_dma_read_prp(n, (uint8_t *) &smart + off, trans_len, prp1,
+prp2);
+}
+
+static uint16_t nvme_fw_log_info(NvmeCtrl *n, NvmeCmd *cmd, uint32_t buf_len,
+uint64_t off, NvmeRequest *req)
+{
+uint32_t trans_len;
+uint64_t prp1 = le64_to_cpu(cmd->prp1);
+uint64_t prp2 = le64_to_cpu(cmd->prp2);
+NvmeFwSlotInfoLog fw_log;
+
+if (off > sizeof(fw_log)) {
+return NVME_INVALID_FIELD | NVME_DNR;
+}
+
+memset(&fw_log, 0, sizeof(NvmeFwSlotInfoLog));
+
+trans_len = MIN(sizeof(fw_log) - off, buf_len);
+
+return nvme_dma_read_prp(n, (uint8_t *) &fw_log + off, trans_len, prp1,
+prp2);
+}
+
+static uint16_t nvme_get_log(NvmeCtrl *n, NvmeCmd *cmd, NvmeRequest *req)
+{
+uint32_t dw10 = le32_to_cpu(cmd->cdw10);
+uint32_t dw11 = le32_to_cpu(cmd->cdw11);
+uint32_t dw12 = le32_to_cpu(cmd->cdw12);
+uint32_t dw13 = le32_to_cpu(cmd->cdw13);
+uint16_t lid = dw10 & 0xff;
+uint8_t  rae = (dw10 >> 15) & 0x1;
+uint32_t numdl, numdu;
+uint64_t off, lpol, lpou;
+size_t   len;
+
+numdl = (dw10 >> 16);
+numdu = (dw11 & 0xffff);
+lpol = dw12;
+lpou = dw13;
+
+len = (((numdu << 16) | numdl) + 1) << 2;
+off = (lpou << 32ULL) | lpol;
+
+if (off & 0x3) {
+return NVME_INVALID_FIELD | NVME_DNR;
+}
+
+trace_nvme_get_log(req->cid, lid, rae, len, off);
+
+switch (lid) {
+case NVME_LOG_ERROR_INFO:
+return nvme_error_info(n, cmd, len, off, req);
+case 
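
The length and offset arithmetic in nvme_get_log above can be tried out in
isolation. The sketch below is a standalone approximation: DemoLogArgs and
demo_parse_get_log are hypothetical names, and unlike the patch the
intermediate values are widened to 64 bits before shifting, which is a design
choice of this sketch rather than something taken from the source.

```c
#include <stdint.h>

typedef struct DemoLogArgs {
    uint8_t lid;   /* log page identifier, CDW10[7:0] */
    uint8_t rae;   /* retain asynchronous event, CDW10[15] */
    uint64_t len;  /* transfer length in bytes */
    uint64_t off;  /* log page offset in bytes */
} DemoLogArgs;

/* Decode the Get Log Page command dwords into lid, rae, length and offset. */
static DemoLogArgs demo_parse_get_log(uint32_t dw10, uint32_t dw11,
                                      uint32_t dw12, uint32_t dw13)
{
    uint32_t numdl = dw10 >> 16;     /* number of dwords, lower 16 bits */
    uint32_t numdu = dw11 & 0xffff;  /* number of dwords, upper 16 bits */
    DemoLogArgs a;

    a.lid = dw10 & 0xff;
    a.rae = (dw10 >> 15) & 0x1;
    /* The dword count is zero-based, hence the +1; << 2 converts to bytes. */
    a.len = ((((uint64_t)numdu << 16) | numdl) + 1) << 2;
    a.off = ((uint64_t)dw13 << 32) | dw12;

    return a;
}
```

For example, NUMDL = 0x3ff with NUMDU = 0 describes 0x400 dwords, that is a
4096-byte transfer.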

[PATCH v2 11/20] nvme: add missing mandatory features

2019-10-15 Thread Klaus Jensen
Add support for returning a reasonable response to Get/Set Features of
mandatory features.

Signed-off-by: Klaus Jensen 
---
 hw/block/nvme.c   | 51 ---
 hw/block/trace-events |  2 ++
 include/block/nvme.h  |  3 ++-
 3 files changed, 52 insertions(+), 4 deletions(-)

diff --git a/hw/block/nvme.c b/hw/block/nvme.c
index 32381d7df655..e7d46dcc6afe 100644
--- a/hw/block/nvme.c
+++ b/hw/block/nvme.c
@@ -1007,12 +1007,24 @@ static uint16_t nvme_get_feature_timestamp(NvmeCtrl *n, 
NvmeCmd *cmd)
 static uint16_t nvme_get_feature(NvmeCtrl *n, NvmeCmd *cmd, NvmeRequest *req)
 {
 uint32_t dw10 = le32_to_cpu(cmd->cdw10);
+uint32_t dw11 = le32_to_cpu(cmd->cdw11);
 uint32_t result;
 
+trace_nvme_getfeat(dw10);
+
 switch (dw10) {
+case NVME_ARBITRATION:
+result = cpu_to_le32(n->features.arbitration);
+break;
+case NVME_POWER_MANAGEMENT:
+result = cpu_to_le32(n->features.power_mgmt);
+break;
 case NVME_TEMPERATURE_THRESHOLD:
 result = cpu_to_le32(n->features.temp_thresh);
 break;
+case NVME_ERROR_RECOVERY:
+result = cpu_to_le32(n->features.err_rec);
+break;
 case NVME_VOLATILE_WRITE_CACHE:
 result = blk_enable_write_cache(n->conf.blk);
 trace_nvme_getfeat_vwcache(result ? "enabled" : "disabled");
@@ -1024,6 +1036,19 @@ static uint16_t nvme_get_feature(NvmeCtrl *n, NvmeCmd 
*cmd, NvmeRequest *req)
 break;
 case NVME_TIMESTAMP:
 return nvme_get_feature_timestamp(n, cmd);
+case NVME_INTERRUPT_COALESCING:
+result = cpu_to_le32(n->features.int_coalescing);
+break;
+case NVME_INTERRUPT_VECTOR_CONF:
+if ((dw11 & 0xffff) > n->params.num_queues) {
+return NVME_INVALID_FIELD | NVME_DNR;
+}
+
+result = cpu_to_le32(n->features.int_vector_config[dw11 & 0xffff]);
+break;
+case NVME_WRITE_ATOMICITY:
+result = cpu_to_le32(n->features.write_atomicity);
+break;
 case NVME_ASYNCHRONOUS_EVENT_CONF:
 result = cpu_to_le32(n->features.async_config);
 break;
@@ -1059,6 +1084,8 @@ static uint16_t nvme_set_feature(NvmeCtrl *n, NvmeCmd 
*cmd, NvmeRequest *req)
 uint32_t dw10 = le32_to_cpu(cmd->cdw10);
 uint32_t dw11 = le32_to_cpu(cmd->cdw11);
 
+trace_nvme_setfeat(dw10, dw11);
+
 switch (dw10) {
 case NVME_TEMPERATURE_THRESHOLD:
 n->features.temp_thresh = dw11;
@@ -1086,6 +1113,13 @@ static uint16_t nvme_set_feature(NvmeCtrl *n, NvmeCmd 
*cmd, NvmeRequest *req)
 case NVME_ASYNCHRONOUS_EVENT_CONF:
 n->features.async_config = dw11;
 break;
+case NVME_ARBITRATION:
+case NVME_POWER_MANAGEMENT:
+case NVME_ERROR_RECOVERY:
+case NVME_INTERRUPT_COALESCING:
+case NVME_INTERRUPT_VECTOR_CONF:
+case NVME_WRITE_ATOMICITY:
+return NVME_FEAT_NOT_CHANGABLE | NVME_DNR;
 default:
 trace_nvme_err_invalid_setfeat(dw10);
 return NVME_INVALID_FIELD | NVME_DNR;
@@ -1709,6 +1743,14 @@ static void nvme_init_state(NvmeCtrl *n)
 n->starttime_ms = qemu_clock_get_ms(QEMU_CLOCK_VIRTUAL);
 n->temperature = NVME_TEMPERATURE;
 n->features.temp_thresh = 0x14d;
+n->features.int_vector_config = g_malloc0_n(n->params.num_queues,
+sizeof(*n->features.int_vector_config));
+
+/* disable coalescing (not supported) */
+for (int i = 0; i < n->params.num_queues; i++) {
+n->features.int_vector_config[i] = i | (1 << 16);
+}
+
 n->aer_reqs = g_new0(NvmeRequest *, n->params.aerl + 1);
 }
 
@@ -1786,15 +1828,17 @@ static void nvme_init_ctrl(NvmeCtrl *n)
 id->nn = cpu_to_le32(n->num_namespaces);
 id->oncs = cpu_to_le16(NVME_ONCS_WRITE_ZEROS | NVME_ONCS_TIMESTAMP);
 
+
+if (blk_enable_write_cache(n->conf.blk)) {
+id->vwc = 1;
+}
+
 strcpy((char *) id->subnqn, "nqn.2019-08.org.qemu:");
 pstrcat((char *) id->subnqn, sizeof(id->subnqn), n->params.serial);
 
 id->psd[0].mp = cpu_to_le16(0x9c4);
 id->psd[0].enlat = cpu_to_le32(0x10);
 id->psd[0].exlat = cpu_to_le32(0x4);
-if (blk_enable_write_cache(n->conf.blk)) {
-id->vwc = 1;
-}
 
 n->bar.cap = 0;
 NVME_CAP_SET_MQES(n->bar.cap, 0x7ff);
@@ -1866,6 +1910,7 @@ static void nvme_exit(PCIDevice *pci_dev)
 g_free(n->sq);
 g_free(n->elpes);
 g_free(n->aer_reqs);
+g_free(n->features.int_vector_config);
 
 if (n->params.cmb_size_mb) {
 g_free(n->cmbuf);
diff --git a/hw/block/trace-events b/hw/block/trace-events
index 6ddb13d34061..a20a68d85d5a 100644
--- a/hw/block/trace-events
+++ b/hw/block/trace-events
@@ -41,6 +41,8 @@ nvme_del_cq(uint16_t cqid) "deleted completion queue, 
sqid=%"PRIu16""
 nvme_identify_ctrl(void) "identify controller"
 nvme_identify_ns(uint16_t ns) "identify namespace, nsid=%"PRIu16""
 nvme_identify_nslist(uint16_t ns) "identify namespace list, nsid=%"PRIu16""
+nvme_getfeat(uint32_t fid) "fid 0x%"PRIx32""

[PATCH v2 15/20] nvme: add support for scatter gather lists

2019-10-15 Thread Klaus Jensen
For now, support the Data Block, Segment and Last Segment descriptor
types.

See NVM Express 1.3d, Section 4.4 ("Scatter Gather List (SGL)").

Signed-off-by: Klaus Jensen 
---
 block/nvme.c  |  18 +-
 hw/block/nvme.c   | 380 --
 hw/block/trace-events |   3 +
 include/block/nvme.h  |  62 ++-
 4 files changed, 398 insertions(+), 65 deletions(-)

diff --git a/block/nvme.c b/block/nvme.c
index 5be3a39b632e..8825c19c72c2 100644
--- a/block/nvme.c
+++ b/block/nvme.c
@@ -440,7 +440,7 @@ static void nvme_identify(BlockDriverState *bs, int 
namespace, Error **errp)
 error_setg(errp, "Cannot map buffer for DMA");
 goto out;
 }
-cmd.prp1 = cpu_to_le64(iova);
+cmd.dptr.prp.prp1 = cpu_to_le64(iova);
 
 if (nvme_cmd_sync(bs, s->queues[0], &cmd)) {
 error_setg(errp, "Failed to identify controller");
@@ -529,7 +529,7 @@ static bool nvme_add_io_queue(BlockDriverState *bs, Error 
**errp)
 }
 cmd = (NvmeCmd) {
 .opcode = NVME_ADM_CMD_CREATE_CQ,
-.prp1 = cpu_to_le64(q->cq.iova),
+.dptr.prp.prp1 = cpu_to_le64(q->cq.iova),
 .cdw10 = cpu_to_le32(((queue_size - 1) << 16) | (n & 0xFFFF)),
 .cdw11 = cpu_to_le32(0x3),
 };
@@ -540,7 +540,7 @@ static bool nvme_add_io_queue(BlockDriverState *bs, Error 
**errp)
 }
 cmd = (NvmeCmd) {
 .opcode = NVME_ADM_CMD_CREATE_SQ,
-.prp1 = cpu_to_le64(q->sq.iova),
+.dptr.prp.prp1 = cpu_to_le64(q->sq.iova),
 .cdw10 = cpu_to_le32(((queue_size - 1) << 16) | (n & 0xFFFF)),
 .cdw11 = cpu_to_le32(0x1 | (n << 16)),
 };
@@ -889,16 +889,16 @@ try_map:
 case 0:
 abort();
 case 1:
-cmd->prp1 = pagelist[0];
-cmd->prp2 = 0;
+cmd->dptr.prp.prp1 = pagelist[0];
+cmd->dptr.prp.prp2 = 0;
 break;
 case 2:
-cmd->prp1 = pagelist[0];
-cmd->prp2 = pagelist[1];
+cmd->dptr.prp.prp1 = pagelist[0];
+cmd->dptr.prp.prp2 = pagelist[1];
 break;
 default:
-cmd->prp1 = pagelist[0];
-cmd->prp2 = cpu_to_le64(req->prp_list_iova + sizeof(uint64_t));
+cmd->dptr.prp.prp1 = pagelist[0];
+cmd->dptr.prp.prp2 = cpu_to_le64(req->prp_list_iova + 
sizeof(uint64_t));
 break;
 }
 trace_nvme_cmd_map_qiov(s, cmd, req, qiov, entries);
diff --git a/hw/block/nvme.c b/hw/block/nvme.c
index f4b9bd36a04e..0a5cd079df9a 100644
--- a/hw/block/nvme.c
+++ b/hw/block/nvme.c
@@ -296,6 +296,198 @@ unmap:
 return status;
 }
 
+static uint16_t nvme_map_sgl_data(NvmeCtrl *n, QEMUSGList *qsg,
+NvmeSglDescriptor *segment, uint64_t nsgld, uint32_t *len,
+NvmeRequest *req)
+{
+dma_addr_t addr, trans_len;
+
+for (int i = 0; i < nsgld; i++) {
+if (NVME_SGL_TYPE(segment[i].type) != SGL_DESCR_TYPE_DATA_BLOCK) {
+trace_nvme_err_invalid_sgl_descriptor(req->cid,
+NVME_SGL_TYPE(segment[i].type));
+return NVME_SGL_DESCRIPTOR_TYPE_INVALID | NVME_DNR;
+}
+
+if (*len == 0) {
+if (!NVME_CTRL_SGLS_EXCESS_LENGTH(n->id_ctrl.sgls)) {
+trace_nvme_err_invalid_sgl_excess_length(req->cid);
+return NVME_DATA_SGL_LENGTH_INVALID | NVME_DNR;
+}
+
+break;
+}
+
+addr = le64_to_cpu(segment[i].addr);
+trans_len = MIN(*len, le64_to_cpu(segment[i].len));
+
+if (nvme_addr_is_cmb(n, addr)) {
+/*
+ * All data and metadata, if any, associated with a particular
+ * command shall be located in either the CMB or host memory. Thus,
+ * if an address is found to be in the CMB and we have already
+ * mapped data that is in host memory, the use is invalid.
+ */
+if (!nvme_req_is_cmb(req) && qsg->size) {
+return NVME_INVALID_USE_OF_CMB | NVME_DNR;
+}
+
+nvme_req_set_cmb(req);
+} else {
+/*
+ * Similarly, if the address does not reference the CMB, but we
+ * have already established that the request has data or metadata
+ * in the CMB, the use is invalid.
+ */
+if (nvme_req_is_cmb(req)) {
+return NVME_INVALID_USE_OF_CMB | NVME_DNR;
+}
+}
+
+qemu_sglist_add(qsg, addr, trans_len);
+
+*len -= trans_len;
+}
+
+return NVME_SUCCESS;
+}
+
+static uint16_t nvme_map_sgl(NvmeCtrl *n, QEMUSGList *qsg,
+NvmeSglDescriptor sgl, uint32_t len, NvmeRequest *req)
+{
+const int MAX_NSGLD = 256;
+
+NvmeSglDescriptor segment[MAX_NSGLD];
+uint64_t nsgld;
+uint16_t status;
+bool sgl_in_cmb = false;
+hwaddr addr = le64_to_cpu(sgl.addr);
+
+trace_nvme_map_sgl(req->cid, NVME_SGL_TYPE(sgl.type), req->nlb, len);
+
+pci_dma_sglist_init(qsg, &n->parent_obj, 1);
+
+/*
+ * If the entire transfer can be described with a 

Re: [PULL 0/1] Block patches

2019-10-15 Thread Peter Maydell
On Mon, 14 Oct 2019 at 09:52, Stefan Hajnoczi  wrote:
>
> The following changes since commit 98b2e3c9ab3abfe476a2b02f8f51813edb90e72d:
>
>   Merge remote-tracking branch 'remotes/stefanha/tags/block-pull-request' 
> into staging (2019-10-08 16:08:35 +0100)
>
> are available in the Git repository at:
>
>   https://github.com/stefanha/qemu.git tags/block-pull-request
>
> for you to fetch changes up to 69de48445a0d6169f1e2a6c5bfab994e1c810e33:
>
>   test-bdrv-drain: fix iothread_join() hang (2019-10-14 09:48:01 +0100)
>
> 
> Pull request
>
> 
>
> Stefan Hajnoczi (1):
>   test-bdrv-drain: fix iothread_join() hang
>

Applied, thanks.

Please update the changelog at https://wiki.qemu.org/ChangeLog/4.2
for any user-visible changes.

-- PMM



[PATCH v2 01/20] nvme: remove superfluous breaks

2019-10-15 Thread Klaus Jensen
These break statements were left over when commit 3036a626e9ef ("nvme:
add Get/Set Feature Timestamp support") was merged.

Signed-off-by: Klaus Jensen 
---
 hw/block/nvme.c | 4 
 1 file changed, 4 deletions(-)

diff --git a/hw/block/nvme.c b/hw/block/nvme.c
index 12d825425016..c06e3ca31905 100644
--- a/hw/block/nvme.c
+++ b/hw/block/nvme.c
@@ -788,7 +788,6 @@ static uint16_t nvme_get_feature(NvmeCtrl *n, NvmeCmd *cmd, 
NvmeRequest *req)
 break;
 case NVME_TIMESTAMP:
 return nvme_get_feature_timestamp(n, cmd);
-break;
 default:
 trace_nvme_err_invalid_getfeat(dw10);
 return NVME_INVALID_FIELD | NVME_DNR;
@@ -832,11 +831,8 @@ static uint16_t nvme_set_feature(NvmeCtrl *n, NvmeCmd 
*cmd, NvmeRequest *req)
 req->cqe.result =
 cpu_to_le32((n->num_queues - 2) | ((n->num_queues - 2) << 16));
 break;
-
 case NVME_TIMESTAMP:
 return nvme_set_feature_timestamp(n, cmd);
-break;
-
 default:
 trace_nvme_err_invalid_setfeat(dw10);
 return NVME_INVALID_FIELD | NVME_DNR;
-- 
2.23.0




[PATCH v2 05/20] nvme: allow completion queues in the cmb

2019-10-15 Thread Klaus Jensen
Allow completion queues in the controller memory buffer.

This also inlines the nvme_addr_{read,write} functions and adds an
nvme_addr_is_cmb helper.

Signed-off-by: Klaus Jensen 
---
 hw/block/nvme.c | 38 +-
 1 file changed, 29 insertions(+), 9 deletions(-)

diff --git a/hw/block/nvme.c b/hw/block/nvme.c
index 16f0fba10b08..daa2367b0863 100644
--- a/hw/block/nvme.c
+++ b/hw/block/nvme.c
@@ -52,14 +52,34 @@
 
 static void nvme_process_sq(void *opaque);
 
-static void nvme_addr_read(NvmeCtrl *n, hwaddr addr, void *buf, int size)
+static inline bool nvme_addr_is_cmb(NvmeCtrl *n, hwaddr addr)
 {
-if (n->cmbsz && addr >= n->ctrl_mem.addr &&
-addr < (n->ctrl_mem.addr + int128_get64(n->ctrl_mem.size))) {
-memcpy(buf, (void *)&n->cmbuf[addr - n->ctrl_mem.addr], size);
-} else {
-pci_dma_read(&n->parent_obj, addr, buf, size);
+hwaddr low = n->ctrl_mem.addr;
+hwaddr hi  = n->ctrl_mem.addr + int128_get64(n->ctrl_mem.size);
+
+return addr >= low && addr < hi;
+}
+
+static inline void nvme_addr_read(NvmeCtrl *n, hwaddr addr, void *buf,
+int size)
+{
+if (n->cmbsz && nvme_addr_is_cmb(n, addr)) {
+memcpy(buf, (void *) &n->cmbuf[addr - n->ctrl_mem.addr], size);
+return;
 }
+
+pci_dma_read(&n->parent_obj, addr, buf, size);
+}
+
+static inline void nvme_addr_write(NvmeCtrl *n, hwaddr addr, void *buf,
+int size)
+{
+if (n->cmbsz && nvme_addr_is_cmb(n, addr)) {
+memcpy((void *) &n->cmbuf[addr - n->ctrl_mem.addr], buf, size);
+return;
+}
+
+pci_dma_write(&n->parent_obj, addr, buf, size);
 }
 
 static int nvme_check_sqid(NvmeCtrl *n, uint16_t sqid)
@@ -281,6 +301,7 @@ static void nvme_post_cqes(void *opaque)
 
 QTAILQ_FOREACH_SAFE(req, &cq->req_list, entry, next) {
 NvmeSQueue *sq;
+NvmeCqe *cqe = &req->cqe;
 hwaddr addr;
 
 if (nvme_cq_full(cq)) {
@@ -294,8 +315,7 @@ static void nvme_post_cqes(void *opaque)
 req->cqe.sq_head = cpu_to_le16(sq->head);
 addr = cq->dma_addr + cq->tail * n->cqe_size;
 nvme_inc_cq_tail(cq);
-pci_dma_write(&n->parent_obj, addr, (void *)&req->cqe,
-sizeof(req->cqe));
+nvme_addr_write(n, addr, (void *) cqe, sizeof(*cqe));
 QTAILQ_INSERT_TAIL(&sq->req_list, req, entry);
 }
 if (cq->tail != cq->head) {
@@ -1401,7 +1421,7 @@ static void nvme_realize(PCIDevice *pci_dev, Error **errp)
 NVME_CMBLOC_SET_OFST(n->bar.cmbloc, 0);
 
 NVME_CMBSZ_SET_SQS(n->bar.cmbsz, 1);
-NVME_CMBSZ_SET_CQS(n->bar.cmbsz, 0);
+NVME_CMBSZ_SET_CQS(n->bar.cmbsz, 1);
 NVME_CMBSZ_SET_LISTS(n->bar.cmbsz, 0);
 NVME_CMBSZ_SET_RDS(n->bar.cmbsz, 1);
 NVME_CMBSZ_SET_WDS(n->bar.cmbsz, 1);
-- 
2.23.0
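
The address dispatch that patch 05 introduces can be illustrated outside of QEMU. The following is a minimal sketch, not the device code: `Ctrl`, `addr_is_cmb` and `addr_write` are hypothetical stand-ins, and a flat byte array replaces the PCI DMA path. It only shows the half-open range check and the choice between the CMB buffer and the fallback.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical stand-in for the controller state; not QEMU's NvmeCtrl. */
typedef struct {
    uint64_t cmb_base;  /* guest-physical base address of the CMB */
    uint64_t cmb_size;  /* size of the CMB in bytes; 0 means disabled */
    uint8_t *cmb;       /* backing buffer for the CMB */
    uint8_t *host;      /* flat array standing in for host memory (assumption) */
} Ctrl;

/* True if addr lies in the half-open range [cmb_base, cmb_base + cmb_size). */
static bool addr_is_cmb(const Ctrl *c, uint64_t addr)
{
    uint64_t low = c->cmb_base;
    uint64_t hi = c->cmb_base + c->cmb_size;

    return addr >= low && addr < hi;
}

/* Copy into the CMB when the address falls inside it, else into "host memory". */
static void addr_write(Ctrl *c, uint64_t addr, const void *buf, size_t size)
{
    if (c->cmb_size && addr_is_cmb(c, addr)) {
        memcpy(&c->cmb[addr - c->cmb_base], buf, size);
        return;
    }

    memcpy(&c->host[addr], buf, size);
}
```

With a 16-byte CMB at 0x1000, a write to 0x1004 lands at offset 4 of the CMB buffer, while a write to address 8 goes to the host array.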




[PATCH v2 12/20] nvme: bump supported specification version to 1.3

2019-10-15 Thread Klaus Jensen
Add the new Namespace Identification Descriptor List (CNS 03h) and track
creation of queues to enable the controller to return Command Sequence
Error if Set Features is called for Number of Queues after any queues
have been created.

Signed-off-by: Klaus Jensen 
---
 hw/block/nvme.c   | 82 +++
 hw/block/nvme.h   |  1 +
 hw/block/trace-events |  8 +++--
 include/block/nvme.h  | 30 +---
 4 files changed, 100 insertions(+), 21 deletions(-)

diff --git a/hw/block/nvme.c b/hw/block/nvme.c
index e7d46dcc6afe..1e2320b38b14 100644
--- a/hw/block/nvme.c
+++ b/hw/block/nvme.c
@@ -9,20 +9,22 @@
  */
 
 /**
- * Reference Specification: NVM Express 1.2.1
+ * Reference Specification: NVM Express 1.3d
  *
  *   https://nvmexpress.org/resources/specifications/
  */
 
 /**
  * Usage: add options:
- *  -drive file=,if=none,id=
- *  -device nvme,drive=,serial=,id=, \
- *  cmb_size_mb=, \
- *  num_queues=
+ * -drive file=,if=none,id=
+ * -device nvme,drive=,serial=,id=
  *
- * Note cmb_size_mb denotes size of CMB in MB. CMB is assumed to be at
- * offset 0 in BAR2 and supports only WDS, RDS and SQS for now.
+ * Advanced optional options:
+ *
+ *   num_queues=  : Maximum number of IO Queues.
+ *  Default: 64
+ *   cmb_size_mb= : Size of Controller Memory Buffer in MBs.
+ *  Default: 0 (disabled)
  */
 
 #include "qemu/osdep.h"
@@ -345,6 +347,8 @@ static void nvme_post_cqes(void *opaque)
 static void nvme_enqueue_req_completion(NvmeCQueue *cq, NvmeRequest *req)
 {
 assert(cq->cqid == req->sq->cqid);
+
+trace_nvme_enqueue_req_completion(req->cid, cq->cqid, req->status);
 QTAILQ_REMOVE(&req->sq->out_req_list, req, entry);
 QTAILQ_INSERT_TAIL(&cq->req_list, req, entry);
 timer_mod(cq->timer, qemu_clock_get_ns(QEMU_CLOCK_VIRTUAL) + 500);
@@ -530,6 +534,7 @@ static void nvme_free_sq(NvmeSQueue *sq, NvmeCtrl *n)
 if (sq->sqid) {
 g_free(sq);
 }
+n->qs_created--;
 }
 
 static uint16_t nvme_del_sq(NvmeCtrl *n, NvmeCmd *cmd)
@@ -596,6 +601,7 @@ static void nvme_init_sq(NvmeSQueue *sq, NvmeCtrl *n, 
uint64_t dma_addr,
 cq = n->cq[cqid];
 QTAILQ_INSERT_TAIL(&(cq->sq_list), sq, entry);
 n->sq[sqid] = sq;
+n->qs_created++;
 }
 
 static uint16_t nvme_create_sq(NvmeCtrl *n, NvmeCmd *cmd)
@@ -742,7 +748,8 @@ static uint16_t nvme_get_log(NvmeCtrl *n, NvmeCmd *cmd, 
NvmeRequest *req)
 uint32_t dw11 = le32_to_cpu(cmd->cdw11);
 uint32_t dw12 = le32_to_cpu(cmd->cdw12);
 uint32_t dw13 = le32_to_cpu(cmd->cdw13);
-uint16_t lid = dw10 & 0xff;
+uint8_t  lid = dw10 & 0xff;
+uint8_t  lsp = (dw10 >> 8) & 0xf;
 uint8_t  rae = (dw10 >> 15) & 0x1;
 uint32_t numdl, numdu;
 uint64_t off, lpol, lpou;
@@ -760,7 +767,7 @@ static uint16_t nvme_get_log(NvmeCtrl *n, NvmeCmd *cmd, 
NvmeRequest *req)
 return NVME_INVALID_FIELD | NVME_DNR;
 }
 
-trace_nvme_get_log(req->cid, lid, rae, len, off);
+trace_nvme_get_log(req->cid, lid, lsp, rae, len, off);
 
 switch (lid) {
 case NVME_LOG_ERROR_INFO:
@@ -784,6 +791,7 @@ static void nvme_free_cq(NvmeCQueue *cq, NvmeCtrl *n)
 if (cq->cqid) {
 g_free(cq);
 }
+n->qs_created--;
 }
 
 static uint16_t nvme_del_cq(NvmeCtrl *n, NvmeCmd *cmd)
@@ -824,6 +832,7 @@ static void nvme_init_cq(NvmeCQueue *cq, NvmeCtrl *n, 
uint64_t dma_addr,
 msix_vector_use(&n->parent_obj, cq->vector);
 n->cq[cqid] = cq;
 cq->timer = timer_new_ns(QEMU_CLOCK_VIRTUAL, nvme_post_cqes, cq);
+n->qs_created++;
 }
 
 static uint16_t nvme_create_cq(NvmeCtrl *n, NvmeCmd *cmd)
@@ -897,7 +906,7 @@ static uint16_t nvme_identify_ns(NvmeCtrl *n, NvmeIdentify 
*c)
 prp1, prp2);
 }
 
-static uint16_t nvme_identify_nslist(NvmeCtrl *n, NvmeIdentify *c)
+static uint16_t nvme_identify_ns_list(NvmeCtrl *n, NvmeIdentify *c)
 {
 static const int data_len = 4 * KiB;
 uint32_t min_nsid = le32_to_cpu(c->nsid);
@@ -907,7 +916,7 @@ static uint16_t nvme_identify_nslist(NvmeCtrl *n, 
NvmeIdentify *c)
 uint16_t ret;
 int i, j = 0;
 
-trace_nvme_identify_nslist(min_nsid);
+trace_nvme_identify_ns_list(min_nsid);
 
 list = g_malloc0(data_len);
 for (i = 0; i < n->num_namespaces; i++) {
@@ -924,6 +933,41 @@ static uint16_t nvme_identify_nslist(NvmeCtrl *n, 
NvmeIdentify *c)
 return ret;
 }
 
+static uint16_t nvme_identify_ns_descr_list(NvmeCtrl *n, NvmeCmd *c)
+{
+static const int len = 4096;
+
+struct ns_descr {
+uint8_t nidt;
+uint8_t nidl;
+uint8_t rsvd2[2];
+uint8_t nid[16];
+};
+
+uint32_t nsid = le32_to_cpu(c->nsid);
+uint64_t prp1 = le64_to_cpu(c->prp1);
+uint64_t prp2 = le64_to_cpu(c->prp2);
+
+struct ns_descr *list;
+uint16_t ret;
+
+trace_nvme_identify_ns_descr_list(nsid);
+
+if (unlikely(nsid == 0 || nsid > n->num_namespaces)) {
+
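The queue accounting described in the commit message of patch 12 can be sketched as follows. This is a simplified illustration under stated assumptions: `Ctrl`, the helper names and the status values are hypothetical, and the exact threshold used by the patch is not visible in this excerpt (the admin queues may be counted as well). The sketch simply rejects the request once any queue exists.

```c
#include <assert.h>
#include <stdint.h>

/* Illustrative status values for this sketch only; not the NVMe encodings. */
enum { ST_SUCCESS = 0, ST_CMD_SEQ_ERROR = 1 };

/* Hypothetical controller state; only the fields this sketch needs. */
typedef struct {
    uint32_t qs_created;  /* queues currently in existence */
    uint32_t num_queues;  /* value negotiated via Set Features */
} Ctrl;

/* Called whenever a submission or completion queue is initialised. */
static void queue_created(Ctrl *c)
{
    c->qs_created++;
}

/* Called whenever a submission or completion queue is freed. */
static void queue_deleted(Ctrl *c)
{
    c->qs_created--;
}

/* Set Features (Number of Queues): refuse once queues have been created. */
static int set_number_of_queues(Ctrl *c, uint32_t requested)
{
    if (c->qs_created > 0) {
        return ST_CMD_SEQ_ERROR;
    }

    c->num_queues = requested;
    return ST_SUCCESS;
}
```

Keeping a single counter that both the create and free paths update means the Set Features handler needs no knowledge of which queues exist, only whether any do.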

[PATCH v2 18/20] nvme: remove redundant NvmeCmd pointer parameter

2019-10-15 Thread Klaus Jensen
The command struct is available in the NvmeRequest that we generally
pass around anyway.

Signed-off-by: Klaus Jensen 
---
 hw/block/nvme.c | 219 +++-
 1 file changed, 106 insertions(+), 113 deletions(-)

diff --git a/hw/block/nvme.c b/hw/block/nvme.c
index bcd801c345b6..67f92bf5a3ac 100644
--- a/hw/block/nvme.c
+++ b/hw/block/nvme.c
@@ -574,14 +574,14 @@ static uint16_t nvme_dma_write_sgl(NvmeCtrl *n, uint8_t 
*ptr, uint32_t len,
 }
 
 static uint16_t nvme_dma_write(NvmeCtrl *n, uint8_t *ptr, uint32_t len,
-NvmeCmd *cmd, NvmeRequest *req)
+NvmeRequest *req)
 {
-if (NVME_CMD_FLAGS_PSDT(cmd->flags)) {
-return nvme_dma_write_sgl(n, ptr, len, cmd->dptr.sgl, req);
+if (NVME_CMD_FLAGS_PSDT(req->cmd.flags)) {
+return nvme_dma_write_sgl(n, ptr, len, req->cmd.dptr.sgl, req);
 }
 
-uint64_t prp1 = le64_to_cpu(cmd->dptr.prp.prp1);
-uint64_t prp2 = le64_to_cpu(cmd->dptr.prp.prp2);
+uint64_t prp1 = le64_to_cpu(req->cmd.dptr.prp.prp1);
+uint64_t prp2 = le64_to_cpu(req->cmd.dptr.prp.prp2);
 
 return nvme_dma_write_prp(n, ptr, len, prp1, prp2, req);
 }
@@ -624,7 +624,7 @@ out:
 }
 
 static uint16_t nvme_dma_read_sgl(NvmeCtrl *n, uint8_t *ptr, uint32_t len,
-NvmeSglDescriptor sgl, NvmeCmd *cmd, NvmeRequest *req)
+NvmeSglDescriptor sgl, NvmeRequest *req)
 {
 QEMUSGList qsg;
 uint16_t err = NVME_SUCCESS;
@@ -662,29 +662,29 @@ out:
 }
 
 static uint16_t nvme_dma_read(NvmeCtrl *n, uint8_t *ptr, uint32_t len,
-NvmeCmd *cmd, NvmeRequest *req)
+NvmeRequest *req)
 {
-if (NVME_CMD_FLAGS_PSDT(cmd->flags)) {
-return nvme_dma_read_sgl(n, ptr, len, cmd->dptr.sgl, cmd, req);
+if (NVME_CMD_FLAGS_PSDT(req->cmd.flags)) {
+return nvme_dma_read_sgl(n, ptr, len, req->cmd.dptr.sgl, req);
 }
 
-uint64_t prp1 = le64_to_cpu(cmd->dptr.prp.prp1);
-uint64_t prp2 = le64_to_cpu(cmd->dptr.prp.prp2);
+uint64_t prp1 = le64_to_cpu(req->cmd.dptr.prp.prp1);
+uint64_t prp2 = le64_to_cpu(req->cmd.dptr.prp.prp2);
 
 return nvme_dma_read_prp(n, ptr, len, prp1, prp2, req);
 }
 
-static uint16_t nvme_map(NvmeCtrl *n, NvmeCmd *cmd, NvmeRequest *req)
+static uint16_t nvme_map(NvmeCtrl *n, NvmeRequest *req)
 {
 uint32_t len = req->nlb << nvme_ns_lbads(req->ns);
 uint64_t prp1, prp2;
 
-if (NVME_CMD_FLAGS_PSDT(cmd->flags)) {
-return nvme_map_sgl(n, &req->qsg, cmd->dptr.sgl, len, req);
+if (NVME_CMD_FLAGS_PSDT(req->cmd.flags)) {
+return nvme_map_sgl(n, &req->qsg, req->cmd.dptr.sgl, len, req);
 }
 
-prp1 = le64_to_cpu(cmd->dptr.prp.prp1);
-prp2 = le64_to_cpu(cmd->dptr.prp.prp2);
+prp1 = le64_to_cpu(req->cmd.dptr.prp.prp1);
+prp2 = le64_to_cpu(req->cmd.dptr.prp.prp2);
 
 return nvme_map_prp(n, &req->qsg, prp1, prp2, len, req);
 }
@@ -1045,7 +1045,7 @@ static uint16_t nvme_check_rw(NvmeCtrl *n, NvmeRequest 
*req)
 return NVME_SUCCESS;
 }
 
-static uint16_t nvme_flush(NvmeCtrl *n, NvmeCmd *cmd, NvmeRequest *req)
+static uint16_t nvme_flush(NvmeCtrl *n, NvmeRequest *req)
 {
 NvmeNamespace *ns = req->ns;
 
@@ -1057,12 +1057,12 @@ static uint16_t nvme_flush(NvmeCtrl *n, NvmeCmd *cmd, 
NvmeRequest *req)
 return NVME_NO_COMPLETE;
 }
 
-static uint16_t nvme_write_zeros(NvmeCtrl *n, NvmeCmd *cmd, NvmeRequest *req)
+static uint16_t nvme_write_zeros(NvmeCtrl *n, NvmeRequest *req)
 {
 NvmeAIO *aio;
 
 NvmeNamespace *ns = req->ns;
-NvmeRwCmd *rw = (NvmeRwCmd *) cmd;
+NvmeRwCmd *rw = (NvmeRwCmd *) &req->cmd;
 
 int64_t offset;
 size_t count;
@@ -1092,9 +1092,9 @@ static uint16_t nvme_write_zeros(NvmeCtrl *n, NvmeCmd 
*cmd, NvmeRequest *req)
 return NVME_NO_COMPLETE;
 }
 
-static uint16_t nvme_rw(NvmeCtrl *n, NvmeCmd *cmd, NvmeRequest *req)
+static uint16_t nvme_rw(NvmeCtrl *n, NvmeRequest *req)
 {
-NvmeRwCmd *rw = (NvmeRwCmd *) cmd;
+NvmeRwCmd *rw = (NvmeRwCmd *) &req->cmd;
 NvmeNamespace *ns = req->ns;
 int status;
 
@@ -1114,7 +1114,7 @@ static uint16_t nvme_rw(NvmeCtrl *n, NvmeCmd *cmd, 
NvmeRequest *req)
 return status;
 }
 
-status = nvme_map(n, cmd, req);
+status = nvme_map(n, req);
 if (status) {
 block_acct_invalid(blk_get_stats(ns->conf.blk), acct);
 return status;
@@ -1126,11 +1126,12 @@ static uint16_t nvme_rw(NvmeCtrl *n, NvmeCmd *cmd, 
NvmeRequest *req)
 return NVME_NO_COMPLETE;
 }
 
-static uint16_t nvme_io_cmd(NvmeCtrl *n, NvmeCmd *cmd, NvmeRequest *req)
+static uint16_t nvme_io_cmd(NvmeCtrl *n, NvmeRequest *req)
 {
-uint32_t nsid = le32_to_cpu(cmd->nsid);
+uint32_t nsid = le32_to_cpu(req->cmd.nsid);
 
-trace_nvme_io_cmd(req->cid, nsid, le16_to_cpu(req->sq->sqid), cmd->opcode);
+trace_nvme_io_cmd(req->cid, nsid, le16_to_cpu(req->sq->sqid),
+req->cmd.opcode);
 
 req->ns = nvme_ns(n, nsid);
 
@@ -1139,16 +1140,16 @@ static uint16_t nvme_io_cmd(NvmeCtrl *n, NvmeCmd *cmd, 
NvmeRequest *req)
 return NVME_INVALID_NSID | NVME_DNR;
 }
 
- 

[PATCH v2 14/20] nvme: allow multiple aios per command

2019-10-15 Thread Klaus Jensen
This refactors how the device issues asynchronous block backend
requests. The NvmeRequest now holds a queue of NvmeAIOs that are
associated with the command. This allows multiple aios to be issued for
a command. Only when all requests have been completed will the device
post a completion queue entry.

Because the device is currently guaranteed to only issue a single aio
request per command, the benefit is not immediately obvious. But this
functionality is required to support metadata.

Signed-off-by: Klaus Jensen 
Signed-off-by: Klaus Jensen 
---
 hw/block/nvme.c   | 455 +-
 hw/block/nvme.h   | 165 ---
 hw/block/trace-events |   8 +
 3 files changed, 511 insertions(+), 117 deletions(-)

diff --git a/hw/block/nvme.c b/hw/block/nvme.c
index cbc0b6a660b6..f4b9bd36a04e 100644
--- a/hw/block/nvme.c
+++ b/hw/block/nvme.c
@@ -25,6 +25,8 @@
  *  Default: 64
  *   cmb_size_mb= : Size of Controller Memory Buffer in MBs.
  *  Default: 0 (disabled)
+ *   mdts= : Maximum Data Transfer Size (power of two)
+ *  Default: 7
  */
 
 #include "qemu/osdep.h"
@@ -56,6 +58,7 @@
 } while (0)
 
 static void nvme_process_sq(void *opaque);
+static void nvme_aio_cb(void *opaque, int ret);
 
 static inline bool nvme_addr_is_cmb(NvmeCtrl *n, hwaddr addr)
 {
@@ -197,7 +200,7 @@ static uint16_t nvme_map_prp(NvmeCtrl *n, QEMUSGList *qsg, 
uint64_t prp1,
 }
 
 if (nvme_addr_is_cmb(n, prp1)) {
-req->is_cmb = true;
+nvme_req_set_cmb(req);
 }
 
 pci_dma_sglist_init(qsg, &n->parent_obj, num_prps);
@@ -255,8 +258,8 @@ static uint16_t nvme_map_prp(NvmeCtrl *n, QEMUSGList *qsg, 
uint64_t prp1,
 }
 
 addr_is_cmb = nvme_addr_is_cmb(n, prp_ent);
-if ((req->is_cmb && !addr_is_cmb) ||
-(!req->is_cmb && addr_is_cmb)) {
+if ((nvme_req_is_cmb(req) && !addr_is_cmb) ||
+(!nvme_req_is_cmb(req) && addr_is_cmb)) {
 status = NVME_INVALID_USE_OF_CMB | NVME_DNR;
 goto unmap;
 }
@@ -269,8 +272,8 @@ static uint16_t nvme_map_prp(NvmeCtrl *n, QEMUSGList *qsg, 
uint64_t prp1,
 }
 } else {
 bool addr_is_cmb = nvme_addr_is_cmb(n, prp2);
-if ((req->is_cmb && !addr_is_cmb) ||
-(!req->is_cmb && addr_is_cmb)) {
+if ((nvme_req_is_cmb(req) && !addr_is_cmb) ||
+(!nvme_req_is_cmb(req) && addr_is_cmb)) {
 status = NVME_INVALID_USE_OF_CMB | NVME_DNR;
 goto unmap;
 }
@@ -312,7 +315,7 @@ static uint16_t nvme_dma_write_prp(NvmeCtrl *n, uint8_t 
*ptr, uint32_t len,
 return status;
 }
 
-if (req->is_cmb) {
+if (nvme_req_is_cmb(req)) {
 QEMUIOVector iov;
 
 qemu_iovec_init(&iov, qsg.nsg);
@@ -341,19 +344,18 @@ static uint16_t nvme_dma_write_prp(NvmeCtrl *n, uint8_t 
*ptr, uint32_t len,
 static uint16_t nvme_dma_read_prp(NvmeCtrl *n, uint8_t *ptr, uint32_t len,
 uint64_t prp1, uint64_t prp2, NvmeRequest *req)
 {
-QEMUSGList qsg;
 uint16_t status = NVME_SUCCESS;
 
-status = nvme_map_prp(n, &qsg, prp1, prp2, len, req);
+status = nvme_map_prp(n, &req->qsg, prp1, prp2, len, req);
 if (status) {
 return status;
 }
 
-if (req->is_cmb) {
+if (nvme_req_is_cmb(req)) {
 QEMUIOVector iov;
 
-qemu_iovec_init(&iov, qsg.nsg);
-dma_to_cmb(n, &qsg, &iov);
+qemu_iovec_init(&iov, req->qsg.nsg);
+dma_to_cmb(n, &req->qsg, &iov);
 
 if (unlikely(qemu_iovec_from_buf(&iov, 0, ptr, len) != len)) {
 trace_nvme_err_invalid_dma();
@@ -365,17 +367,137 @@ static uint16_t nvme_dma_read_prp(NvmeCtrl *n, uint8_t 
*ptr, uint32_t len,
 goto out;
 }
 
-if (unlikely(dma_buf_read(ptr, len, &qsg))) {
+if (unlikely(dma_buf_read(ptr, len, &req->qsg))) {
 trace_nvme_err_invalid_dma();
 status = NVME_INVALID_FIELD | NVME_DNR;
 }
 
 out:
-qemu_sglist_destroy(&qsg);
+qemu_sglist_destroy(&req->qsg);
 
 return status;
 }
 
+static uint16_t nvme_map(NvmeCtrl *n, NvmeCmd *cmd, NvmeRequest *req)
+{
+NvmeNamespace *ns = req->ns;
+
+uint32_t len = req->nlb << nvme_ns_lbads(ns);
+uint64_t prp1 = le64_to_cpu(cmd->prp1);
+uint64_t prp2 = le64_to_cpu(cmd->prp2);
+
+return nvme_map_prp(n, &req->qsg, prp1, prp2, len, req);
+}
+
+static void nvme_aio_destroy(NvmeAIO *aio)
+{
+if (aio->iov.nalloc) {
+qemu_iovec_destroy(&aio->iov);
+}
+
+g_free(aio);
+}
+
+static NvmeAIO *nvme_aio_new(BlockBackend *blk, int64_t offset,
+QEMUSGList *qsg, NvmeRequest *req, NvmeAIOCompletionFunc *cb)
+{
+NvmeAIO *aio = g_malloc0(sizeof(*aio));
+
+*aio = (NvmeAIO) {
+.blk = blk,
+.offset = offset,
+.req = req,
+.qsg = qsg,
+.cb = cb,
+};
+
+if (qsg && 
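The completion rule stated in the commit message of patch 14 (post a completion queue entry only when every aio of the command has finished) can be sketched as below. This is a deliberately reduced model: the patch keeps a queue of NvmeAIOs on the request, whereas this sketch assumes a plain counter, and `Request`, `aio_issue` and `aio_complete` are hypothetical names.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical request state; a counter replaces the patch's queue of AIOs. */
typedef struct {
    int outstanding;   /* aios issued but not yet completed */
    int status;        /* first non-zero error seen, 0 on success */
    bool cqe_posted;   /* set once the completion entry would be posted */
} Request;

/* Account for one more in-flight aio belonging to this request. */
static void aio_issue(Request *req)
{
    req->outstanding++;
}

/* Completion callback: record the first error, finish on the last aio. */
static void aio_complete(Request *req, int ret)
{
    if (ret && !req->status) {
        req->status = ret;
    }

    if (--req->outstanding == 0) {
        req->cqe_posted = true;
    }
}
```

Recording only the first error is one possible policy; the excerpt does not show which status the device reports when several aios fail.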

[PATCH v2 20/20] nvme: handle dma errors

2019-10-15 Thread Klaus Jensen
Handling DMA errors gracefully is required for the device to pass the
block/011 test ("disable PCI device while doing I/O") in the blktests
suite.

With this patch the device passes the test by retrying "critical"
transfers (posting of completion entries and processing of submission
queue entries).

If DMA errors occur at any other point in the execution of the command
(say, while mapping the PRPs or SGLs), the command is aborted with a
Data Transfer Error status code.

Signed-off-by: Klaus Jensen 
---
 hw/block/nvme.c   | 63 +--
 hw/block/trace-events |  2 ++
 include/block/nvme.h  |  2 +-
 3 files changed, 52 insertions(+), 15 deletions(-)

diff --git a/hw/block/nvme.c b/hw/block/nvme.c
index d0103c16cfe9..00c5b843295b 100644
--- a/hw/block/nvme.c
+++ b/hw/block/nvme.c
@@ -71,26 +71,26 @@ static inline bool nvme_addr_is_cmb(NvmeCtrl *n, hwaddr 
addr)
 return addr >= low && addr < hi;
 }
 
-static inline void nvme_addr_read(NvmeCtrl *n, hwaddr addr, void *buf,
+static inline int nvme_addr_read(NvmeCtrl *n, hwaddr addr, void *buf,
 int size)
 {
 if (n->cmbsz && nvme_addr_is_cmb(n, addr)) {
 memcpy(buf, (void *) &n->cmbuf[addr - n->ctrl_mem.addr], size);
-return;
+return 0;
 }
 
-pci_dma_read(&n->parent_obj, addr, buf, size);
+return pci_dma_read(&n->parent_obj, addr, buf, size);
 }
 
-static inline void nvme_addr_write(NvmeCtrl *n, hwaddr addr, void *buf,
+static inline int nvme_addr_write(NvmeCtrl *n, hwaddr addr, void *buf,
 int size)
 {
 if (n->cmbsz && nvme_addr_is_cmb(n, addr)) {
 memcpy((void *) &n->cmbuf[addr - n->ctrl_mem.addr], buf, size);
-return;
+return 0;
 }
 
-pci_dma_write(&n->parent_obj, addr, buf, size);
+return pci_dma_write(&n->parent_obj, addr, buf, size);
 }
 
 static int nvme_check_sqid(NvmeCtrl *n, uint16_t sqid)
@@ -228,7 +228,11 @@ static uint16_t nvme_map_prp(NvmeCtrl *n, QEMUSGList *qsg, 
uint64_t prp1,
 
 nents = (len + n->page_size - 1) >> n->page_bits;
 prp_trans = MIN(n->max_prp_ents, nents) * sizeof(uint64_t);
-nvme_addr_read(n, prp2, (void *) prp_list, prp_trans);
+if (nvme_addr_read(n, prp2, (void *) prp_list, prp_trans)) {
+trace_nvme_err_addr_read((void *) prp2);
+status = NVME_DATA_TRANSFER_ERROR;
+goto unmap;
+}
 while (len != 0) {
 bool addr_is_cmb;
 uint64_t prp_ent = le64_to_cpu(prp_list[i]);
@@ -250,7 +254,11 @@ static uint16_t nvme_map_prp(NvmeCtrl *n, QEMUSGList *qsg, 
uint64_t prp1,
 i = 0;
 nents = (len + n->page_size - 1) >> n->page_bits;
 prp_trans = MIN(n->max_prp_ents, nents) * sizeof(uint64_t);
-nvme_addr_read(n, prp_ent, (void *) prp_list, prp_trans);
+if (nvme_addr_read(n, prp_ent, (void *) prp_list, 
prp_trans)) {
+trace_nvme_err_addr_read((void *) prp_ent);
+status = NVME_DATA_TRANSFER_ERROR;
+goto unmap;
+}
 prp_ent = le64_to_cpu(prp_list[i]);
 }
 
@@ -402,7 +410,11 @@ static uint16_t nvme_map_sgl(NvmeCtrl *n, QEMUSGList *qsg,
 
 /* read the segment in chunks of 256 descriptors (4k) */
 while (nsgld > MAX_NSGLD) {
-nvme_addr_read(n, addr, segment, sizeof(segment));
+if (nvme_addr_read(n, addr, segment, sizeof(segment))) {
+trace_nvme_err_addr_read((void *) addr);
+status = NVME_DATA_TRANSFER_ERROR;
+goto unmap;
+}
 
 status = nvme_map_sgl_data(n, qsg, segment, MAX_NSGLD, &len, req);
 if (status) {
@@ -413,7 +425,11 @@ static uint16_t nvme_map_sgl(NvmeCtrl *n, QEMUSGList *qsg,
 addr += MAX_NSGLD * sizeof(NvmeSglDescriptor);
 }
 
-nvme_addr_read(n, addr, segment, nsgld * sizeof(NvmeSglDescriptor));
+if (nvme_addr_read(n, addr, segment, nsgld * 
sizeof(NvmeSglDescriptor))) {
+trace_nvme_err_addr_read((void *) addr);
+status = NVME_DATA_TRANSFER_ERROR;
+goto unmap;
+}
 
 sgl = segment[nsgld - 1];
 addr = le64_to_cpu(sgl.addr);
@@ -458,7 +474,11 @@ static uint16_t nvme_map_sgl(NvmeCtrl *n, QEMUSGList *qsg,
 nsgld = le64_to_cpu(sgl.len) / sizeof(NvmeSglDescriptor);
 
 while (nsgld > MAX_NSGLD) {
-nvme_addr_read(n, addr, segment, sizeof(segment));
+if (nvme_addr_read(n, addr, segment, sizeof(segment))) {
+trace_nvme_err_addr_read((void *) addr);
+status = NVME_DATA_TRANSFER_ERROR;
+goto unmap;
+}
 
 status = nvme_map_sgl_data(n, qsg, segment, MAX_NSGLD, &len, req);
 if (status) {
@@ -469,7 +489,11 @@ static uint16_t nvme_map_sgl(NvmeCtrl *n, QEMUSGList *qsg,
 addr += 
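The error propagation that patch 20 adds (the read helper returns an int, and callers turn a failure into a Data Transfer Error status) can be sketched in isolation. Everything here is an assumption made for illustration: `FakeBus` models a bounded memory region instead of PCI DMA, and the status values are placeholders rather than the NVMe encodings.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Illustrative status values for this sketch only. */
enum { ST_SUCCESS = 0, ST_DATA_TRANSFER_ERROR = 1 };

/* Hypothetical bus: a bounded byte array that can refuse out-of-range reads. */
typedef struct {
    const uint8_t *mem;
    uint64_t size;
} FakeBus;

/* Returns 0 on success and non-zero on failure, like the patched helper. */
static int bus_read(const FakeBus *bus, uint64_t addr, void *buf, uint64_t len)
{
    if (len > bus->size || addr > bus->size - len) {
        return -1;
    }

    memcpy(buf, bus->mem + addr, len);
    return 0;
}

/* Caller side: a failed read aborts the mapping with a transfer error. */
static uint16_t load_prp_list(const FakeBus *bus, uint64_t addr,
                              uint64_t *list, unsigned nents)
{
    if (bus_read(bus, addr, list, nents * sizeof(uint64_t))) {
        return ST_DATA_TRANSFER_ERROR;
    }

    return ST_SUCCESS;
}
```

The point of the change is that the helper no longer swallows failures, so every call site decides what status the command ends with.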

[PATCH v2 16/20] nvme: support multiple namespaces

2019-10-15 Thread Klaus Jensen
This adds support for multiple namespaces by introducing a new 'nvme-ns'
device model. The nvme device creates a bus named from the device name
('id'). The nvme-ns devices then connect to this bus and register
themselves with the nvme device.

This changes how an nvme device is created. Example with two namespaces:

  -drive file=nvme0n1.img,if=none,id=disk1
  -drive file=nvme0n2.img,if=none,id=disk2
  -device nvme,serial=deadbeef,id=nvme0
  -device nvme-ns,drive=disk1,bus=nvme0,nsid=1
  -device nvme-ns,drive=disk2,bus=nvme0,nsid=2

The drive property is kept on the nvme device to keep the change
backward compatible, but the property is now optional. Specifying a
drive for the nvme device will always create the namespace with nsid 1.

Signed-off-by: Klaus Jensen 
Signed-off-by: Klaus Jensen 
---
 hw/block/Makefile.objs |   2 +-
 hw/block/nvme-ns.c | 139 +++
 hw/block/nvme-ns.h |  58 +++
 hw/block/nvme.c| 212 +
 hw/block/nvme.h|  51 +-
 hw/block/trace-events  |   5 +-
 6 files changed, 352 insertions(+), 115 deletions(-)
 create mode 100644 hw/block/nvme-ns.c
 create mode 100644 hw/block/nvme-ns.h

diff --git a/hw/block/Makefile.objs b/hw/block/Makefile.objs
index f5f643f0cc06..d44a2f4b780d 100644
--- a/hw/block/Makefile.objs
+++ b/hw/block/Makefile.objs
@@ -7,7 +7,7 @@ common-obj-$(CONFIG_PFLASH_CFI02) += pflash_cfi02.o
 common-obj-$(CONFIG_XEN) += xen-block.o
 common-obj-$(CONFIG_ECC) += ecc.o
 common-obj-$(CONFIG_ONENAND) += onenand.o
-common-obj-$(CONFIG_NVME_PCI) += nvme.o
+common-obj-$(CONFIG_NVME_PCI) += nvme.o nvme-ns.o
 
 obj-$(CONFIG_SH4) += tc58128.o
 
diff --git a/hw/block/nvme-ns.c b/hw/block/nvme-ns.c
new file mode 100644
index ..aa76bb63ef45
--- /dev/null
+++ b/hw/block/nvme-ns.c
@@ -0,0 +1,139 @@
+#include "qemu/osdep.h"
+#include "qemu/units.h"
+#include "qemu/cutils.h"
+#include "qemu/log.h"
+#include "hw/block/block.h"
+#include "hw/pci/msix.h"
+#include "sysemu/sysemu.h"
+#include "sysemu/block-backend.h"
+#include "qapi/error.h"
+
+#include "hw/qdev-properties.h"
+#include "hw/qdev-core.h"
+
+#include "nvme.h"
+#include "nvme-ns.h"
+
+static int nvme_ns_init(NvmeNamespace *ns)
+{
+NvmeIdNs *id_ns = &ns->id_ns;
+
+id_ns->lbaf[0].ds = BDRV_SECTOR_BITS;
+id_ns->nuse = id_ns->ncap = id_ns->nsze =
+cpu_to_le64(nvme_ns_nlbas(ns));
+
+return 0;
+}
+
+static int nvme_ns_init_blk(NvmeNamespace *ns, NvmeIdCtrl *id, Error **errp)
+{
+blkconf_blocksizes(&ns->conf);
+
+if (!blkconf_apply_backend_options(&ns->conf,
+blk_is_read_only(ns->conf.blk), false, errp)) {
+return 1;
+}
+
+ns->size = blk_getlength(ns->conf.blk);
+if (ns->size < 0) {
+error_setg_errno(errp, -ns->size, "blk_getlength");
+return 1;
+}
+
+if (!blk_enable_write_cache(ns->conf.blk)) {
+id->vwc = 0;
+}
+
+return 0;
+}
+
+static int nvme_ns_check_constraints(NvmeNamespace *ns, Error **errp)
+{
+if (!ns->conf.blk) {
+error_setg(errp, "block backend not configured");
+return 1;
+}
+
+return 0;
+}
+
+int nvme_ns_setup(NvmeCtrl *n, NvmeNamespace *ns, Error **errp)
+{
+Error *local_err = NULL;
+
+if (nvme_ns_check_constraints(ns, &local_err)) {
+error_propagate_prepend(errp, local_err,
+"nvme_ns_check_constraints: ");
+return 1;
+}
+
+if (nvme_ns_init_blk(ns, &n->id_ctrl, &local_err)) {
+error_propagate_prepend(errp, local_err, "nvme_ns_init_blk: ");
+return 1;
+}
+
+nvme_ns_init(ns);
+if (nvme_register_namespace(n, ns, &local_err)) {
+error_propagate_prepend(errp, local_err, "nvme_register_namespace: ");
+return 1;
+}
+
+return 0;
+}
+
+static void nvme_ns_realize(DeviceState *dev, Error **errp)
+{
+NvmeNamespace *ns = NVME_NS(dev);
+BusState *s = qdev_get_parent_bus(dev);
+NvmeCtrl *n = NVME(s->parent);
+Error *local_err = NULL;
+
+if (nvme_ns_setup(n, ns, &local_err)) {
+error_propagate_prepend(errp, local_err, "nvme_ns_setup: ");
+return;
+}
+}
+
+static Property nvme_ns_props[] = {
+DEFINE_BLOCK_PROPERTIES(NvmeNamespace, conf),
+DEFINE_NVME_NS_PROPERTIES(NvmeNamespace, params),
+DEFINE_PROP_END_OF_LIST(),
+};
+
+static void nvme_ns_class_init(ObjectClass *oc, void *data)
+{
+DeviceClass *dc = DEVICE_CLASS(oc);
+
+set_bit(DEVICE_CATEGORY_STORAGE, dc->categories);
+
+dc->bus_type = TYPE_NVME_BUS;
+dc->realize = nvme_ns_realize;
+dc->props = nvme_ns_props;
+dc->desc = "virtual nvme namespace";
+}
+
+static void nvme_ns_instance_init(Object *obj)
+{
+NvmeNamespace *ns = NVME_NS(obj);
+char *bootindex = g_strdup_printf("/namespace@%d,0", ns->params.nsid);
+
+device_add_bootindex_property(obj, &ns->conf.bootindex, "bootindex",
+bootindex, DEVICE(obj), &error_abort);
+
+g_free(bootindex);
+}
+
+static const TypeInfo nvme_ns_info = {
+.name = TYPE_NVME_NS,

[PATCH v2 13/20] nvme: refactor prp mapping

2019-10-15 Thread Klaus Jensen
Instead of handling both QSGs and IOVs in multiple places, simply use
QSGs everywhere by assuming that the request does not involve the
controller memory buffer (CMB). If the request is found to involve the
CMB, convert the QSG to an IOV and issue the I/O. The QSG is converted
to an IOV by the dma helpers anyway, so the CMB path is not unfairly
affected by this simplifying change.

As a side-effect, this patch also allows PRPs to be located in the CMB.
The logic ensures that if some of the PRP is in the CMB, all of it must
be located there, as per the specification.

Signed-off-by: Klaus Jensen 
---
 hw/block/nvme.c   | 255 --
 hw/block/nvme.h   |   4 +-
 hw/block/trace-events |   1 +
 include/block/nvme.h  |   1 +
 4 files changed, 174 insertions(+), 87 deletions(-)

diff --git a/hw/block/nvme.c b/hw/block/nvme.c
index 1e2320b38b14..cbc0b6a660b6 100644
--- a/hw/block/nvme.c
+++ b/hw/block/nvme.c
@@ -179,138 +179,200 @@ static void nvme_set_error_page(NvmeCtrl *n, uint16_t 
sqid, uint16_t cid,
 n->elp_index = (n->elp_index + 1) % n->params.elpe;
 }
 
-static uint16_t nvme_map_prp(QEMUSGList *qsg, QEMUIOVector *iov, uint64_t prp1,
- uint64_t prp2, uint32_t len, NvmeCtrl *n)
+static uint16_t nvme_map_prp(NvmeCtrl *n, QEMUSGList *qsg, uint64_t prp1,
+uint64_t prp2, uint32_t len, NvmeRequest *req)
 {
 hwaddr trans_len = n->page_size - (prp1 % n->page_size);
 trans_len = MIN(len, trans_len);
 int num_prps = (len >> n->page_bits) + 1;
+uint16_t status = NVME_SUCCESS;
+bool prp_list_in_cmb = false;
+
+trace_nvme_map_prp(req->cid, req->cmd.opcode, trans_len, len, prp1, prp2,
+num_prps);
 
 if (unlikely(!prp1)) {
 trace_nvme_err_invalid_prp();
 return NVME_INVALID_FIELD | NVME_DNR;
-} else if (n->cmbsz && prp1 >= n->ctrl_mem.addr &&
-   prp1 < n->ctrl_mem.addr + int128_get64(n->ctrl_mem.size)) {
-qsg->nsg = 0;
-qemu_iovec_init(iov, num_prps);
-qemu_iovec_add(iov, (void *)&n->cmbuf[prp1 - n->ctrl_mem.addr], 
trans_len);
-} else {
-pci_dma_sglist_init(qsg, &n->parent_obj, num_prps);
-qemu_sglist_add(qsg, prp1, trans_len);
 }
+
+if (nvme_addr_is_cmb(n, prp1)) {
+req->is_cmb = true;
+}
+
+pci_dma_sglist_init(qsg, &n->parent_obj, num_prps);
+qemu_sglist_add(qsg, prp1, trans_len);
+
 len -= trans_len;
 if (len) {
 if (unlikely(!prp2)) {
 trace_nvme_err_invalid_prp2_missing();
+status = NVME_INVALID_FIELD | NVME_DNR;
 goto unmap;
 }
+
 if (len > n->page_size) {
 uint64_t prp_list[n->max_prp_ents];
 uint32_t nents, prp_trans;
 int i = 0;
 
+if (nvme_addr_is_cmb(n, prp2)) {
+prp_list_in_cmb = true;
+}
+
 nents = (len + n->page_size - 1) >> n->page_bits;
 prp_trans = MIN(n->max_prp_ents, nents) * sizeof(uint64_t);
-nvme_addr_read(n, prp2, (void *)prp_list, prp_trans);
+nvme_addr_read(n, prp2, (void *) prp_list, prp_trans);
 while (len != 0) {
+bool addr_is_cmb;
 uint64_t prp_ent = le64_to_cpu(prp_list[i]);
 
 if (i == n->max_prp_ents - 1 && len > n->page_size) {
 if (unlikely(!prp_ent || prp_ent & (n->page_size - 1))) {
 trace_nvme_err_invalid_prplist_ent(prp_ent);
+status = NVME_INVALID_FIELD | NVME_DNR;
+goto unmap;
+}
+
+addr_is_cmb = nvme_addr_is_cmb(n, prp_ent);
+if ((prp_list_in_cmb && !addr_is_cmb) ||
+(!prp_list_in_cmb && addr_is_cmb)) {
+status = NVME_INVALID_USE_OF_CMB | NVME_DNR;
 goto unmap;
 }
 
 i = 0;
 nents = (len + n->page_size - 1) >> n->page_bits;
 prp_trans = MIN(n->max_prp_ents, nents) * sizeof(uint64_t);
-nvme_addr_read(n, prp_ent, (void *)prp_list,
-prp_trans);
+nvme_addr_read(n, prp_ent, (void *) prp_list, prp_trans);
 prp_ent = le64_to_cpu(prp_list[i]);
 }
 
 if (unlikely(!prp_ent || prp_ent & (n->page_size - 1))) {
 trace_nvme_err_invalid_prplist_ent(prp_ent);
+status = NVME_INVALID_FIELD | NVME_DNR;
 goto unmap;
 }
 
-trans_len = MIN(len, n->page_size);
-if (qsg->nsg){
-qemu_sglist_add(qsg, prp_ent, trans_len);
-} else {
-qemu_iovec_add(iov, (void *)&n->cmbuf[prp_ent - 
n->ctrl_mem.addr], trans_len);
+addr_is_cmb = nvme_addr_is_cmb(n, prp_ent);
+ 
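The length arithmetic at the top of nvme_map_prp in patch 13 can be checked with a small standalone sketch. The two helpers below are hypothetical names; they reproduce only the computation visible in the diff: PRP1 covers the data up to the end of its memory page, and PRP2 is treated as a pointer to a PRP list only when more than one page of data remains.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Bytes covered by PRP1: from prp1 to the end of its page, capped at len. */
static uint64_t first_prp_len(uint64_t prp1, uint64_t len, uint64_t page_size)
{
    uint64_t to_page_end = page_size - (prp1 % page_size);

    return len < to_page_end ? len : to_page_end;
}

/* True if the data left after PRP1 exceeds one page, so PRP2 names a list. */
static bool needs_prp_list(uint64_t prp1, uint64_t len, uint64_t page_size)
{
    uint64_t rest = len - first_prp_len(prp1, len, page_size);

    return rest > page_size;
}
```

For a 4096-byte page, a 4096-byte transfer starting 512 bytes into a page uses PRP1 for 3584 bytes and PRP2 as a plain data pointer for the remaining 512 bytes.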

[PATCH v2 19/20] nvme: make lba data size configurable

2019-10-15 Thread Klaus Jensen
Signed-off-by: Klaus Jensen 
---
 hw/block/nvme-ns.c | 2 +-
 hw/block/nvme-ns.h | 4 +++-
 hw/block/nvme.c| 1 +
 3 files changed, 5 insertions(+), 2 deletions(-)

diff --git a/hw/block/nvme-ns.c b/hw/block/nvme-ns.c
index aa76bb63ef45..70ff622a5729 100644
--- a/hw/block/nvme-ns.c
+++ b/hw/block/nvme-ns.c
@@ -18,7 +18,7 @@ static int nvme_ns_init(NvmeNamespace *ns)
 {
 NvmeIdNs *id_ns = &ns->id_ns;
 
-id_ns->lbaf[0].ds = BDRV_SECTOR_BITS;
+id_ns->lbaf[0].ds = ns->params.lbads;
 id_ns->nuse = id_ns->ncap = id_ns->nsze =
 cpu_to_le64(nvme_ns_nlbas(ns));
 
diff --git a/hw/block/nvme-ns.h b/hw/block/nvme-ns.h
index 64dd054cf6a9..aa1c81d85cde 100644
--- a/hw/block/nvme-ns.h
+++ b/hw/block/nvme-ns.h
@@ -6,10 +6,12 @@
 OBJECT_CHECK(NvmeNamespace, (obj), TYPE_NVME_NS)
 
 #define DEFINE_NVME_NS_PROPERTIES(_state, _props) \
-DEFINE_PROP_UINT32("nsid", _state, _props.nsid, 0)
+DEFINE_PROP_UINT32("nsid", _state, _props.nsid, 0), \
+DEFINE_PROP_UINT8("lbads", _state, _props.lbads, 9)
 
 typedef struct NvmeNamespaceParams {
 uint32_t nsid;
+uint8_t  lbads;
 } NvmeNamespaceParams;
 
 typedef struct NvmeNamespace {
diff --git a/hw/block/nvme.c b/hw/block/nvme.c
index 67f92bf5a3ac..d0103c16cfe9 100644
--- a/hw/block/nvme.c
+++ b/hw/block/nvme.c
@@ -2602,6 +2602,7 @@ static void nvme_realize(PCIDevice *pci_dev, Error **errp)
 if (n->namespace.conf.blk) {
 ns = &n->namespace;
 ns->params.nsid = 1;
+ns->params.lbads = 9;
 
 if (nvme_ns_setup(n, ns, &local_err)) {
 error_propagate_prepend(errp, local_err, "nvme_ns_setup: ");
-- 
2.23.0
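
Patch 19 carries no description, so a short sketch of what the lbads value means may help. It is the base-2 logarithm of the LBA data size, so the default of 9 gives 512-byte blocks, and a transfer of nlb blocks covers nlb << lbads bytes (the same shift nvme_map uses elsewhere in this series). The helper names are hypothetical, and 64-bit results are an assumption made here to avoid overflow in the example.

```c
#include <assert.h>
#include <stdint.h>

/* Size of one logical block in bytes for a given lbads (log2 of the size). */
static uint64_t lba_size(uint8_t lbads)
{
    return (uint64_t)1 << lbads;
}

/* Total bytes covered by nlb logical blocks. */
static uint64_t transfer_len(uint32_t nlb, uint8_t lbads)
{
    return (uint64_t)nlb << lbads;
}
```

With lbads=12 the namespace would expose 4096-byte blocks, so the same 8-block command moves 32768 bytes instead of 4096.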



