Re: [PATCH] iotests: Fix up python style in 300

2021-02-25 Thread Vladimir Sementsov-Ogievskiy

16.02.2021 02:21, John Snow wrote:

On 2/15/21 5:05 PM, Eric Blake wrote:

Break some long lines, and relax our type hints to be more generic to
any JSON, in order to more easily permit the additional JSON depth now
possible in migration parameters.  Detected by iotest 297.

Fixes: ca4bfec41d56
  (qemu-iotests: 300: Add test case for modifying persistence of bitmap)
Reported-by: Kevin Wolf 
Signed-off-by: Eric Blake 


Reviewed-by: John Snow 


---
  tests/qemu-iotests/300 | 10 ++
  1 file changed, 6 insertions(+), 4 deletions(-)

diff --git a/tests/qemu-iotests/300 b/tests/qemu-iotests/300
index 63036f6a6e13..adb927629747 100755
--- a/tests/qemu-iotests/300
+++ b/tests/qemu-iotests/300
@@ -22,7 +22,7 @@
  import os
  import random
  import re
-from typing import Dict, List, Optional, Union
+from typing import Dict, List, Optional

  import iotests

@@ -30,7 +30,7 @@ import iotests
  # pylint: disable=wrong-import-order
  import qemu

-BlockBitmapMapping = List[Dict[str, Union[str, List[Dict[str, str]]]]]
+BlockBitmapMapping = List[Dict[str, object]]



Assuming iotest 297 didn't yap about this, I think this has the necessary power 
for this file and we don't have to work any harder.

If in the future you try to treat e.g. bmap['persistent'] as a particular kind of value 
(string? bool? int?) mypy will likely complain about that a little, saying it has no 
insight into the type beyond "object".

If *that* becomes annoying, you can degrade this type to use 'Any' instead of 
'object' and even those checks will cease.
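For readers unfamiliar with the distinction, here is a hedged sketch (the sample mapping data is hypothetical; only the type alias mirrors the patch): with `Dict[str, object]`, mypy requires narrowing before a value can be used as a concrete type, while `Dict[str, Any]` would silence that check entirely.

```python
from typing import Dict, List

# Type alias as relaxed by the patch.
BlockBitmapMapping = List[Dict[str, object]]

# Hypothetical mapping, loosely shaped like the migration parameter in test 300.
mapping: BlockBitmapMapping = [{
    'node-name': 'node-a',
    'alias': 'alias-a',
    'bitmaps': [{'name': 'bmap-a'}],
}]

# Under Dict[str, object], mypy rejects mapping[0]['alias'].upper()
# ('"object" has no attribute "upper"'); narrowing is required first:
alias = mapping[0]['alias']
assert isinstance(alias, str)
upper_alias = alias.upper()
print(upper_alias)  # ALIAS-A

# With Dict[str, Any], the .upper() call would type-check without narrowing
# (and so would any other attribute access, valid or not).
```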


Probably at some future moment we'll have generated Python types for QAPI
structures? :)




  mig_sock = os.path.join(iotests.sock_dir, 'mig_sock')

@@ -602,7 +602,8 @@ class TestCrossAliasMigration(TestDirtyBitmapMigration):

  class TestAliasTransformMigration(TestDirtyBitmapMigration):
  """
-    Tests the 'transform' option which modifies bitmap persistence on migration.
+    Tests the 'transform' option which modifies bitmap persistence on
+    migration.
  """

  src_node_name = 'node-a'
@@ -674,7 +675,8 @@ class TestAliasTransformMigration(TestDirtyBitmapMigration):
  bitmaps = self.vm_b.query_bitmaps()

  for node in bitmaps:
-            bitmaps[node] = sorted(((bmap['name'], bmap['persistent']) for bmap in bitmaps[node]))
+            bitmaps[node] = sorted(((bmap['name'], bmap['persistent'])
+                                    for bmap in bitmaps[node]))

  self.assertEqual(bitmaps,
   {'node-a': [('bmap-a', True), ('bmap-b', False)],







--
Best regards,
Vladimir



Re: [PATCH] iotests: Fix up python style in 300

2021-02-25 Thread Vladimir Sementsov-Ogievskiy

16.02.2021 01:05, Eric Blake wrote:

Break some long lines, and relax our type hints to be more generic to
any JSON, in order to more easily permit the additional JSON depth now
possible in migration parameters.  Detected by iotest 297.

Fixes: ca4bfec41d56
  (qemu-iotests: 300: Add test case for modifying persistence of bitmap)
Reported-by: Kevin Wolf
Signed-off-by: Eric Blake



Reviewed-by: Vladimir Sementsov-Ogievskiy 

--
Best regards,
Vladimir



[RFC PATCH 2/3] hw/block/pflash: Move code around

2021-02-25 Thread Philippe Mathieu-Daudé
First do the block checks, so we know if it is read-only or not.
Then create the MemoryRegion. This will allow optimization in
the next commit.

Signed-off-by: Philippe Mathieu-Daudé 
---
 hw/block/pflash_cfi01.c | 24 
 hw/block/pflash_cfi02.c | 18 +-
 2 files changed, 21 insertions(+), 21 deletions(-)

diff --git a/hw/block/pflash_cfi01.c b/hw/block/pflash_cfi01.c
index 22287a1522e..a5fa8d8b74a 100644
--- a/hw/block/pflash_cfi01.c
+++ b/hw/block/pflash_cfi01.c
@@ -731,18 +731,6 @@ static void pflash_cfi01_realize(DeviceState *dev, Error **errp)
 }
 device_len = sector_len_per_device * blocks_per_device;
 
-    memory_region_init_rom_device(
-        &pfl->mem, OBJECT(dev),
-        &pflash_cfi01_ops,
-        pfl,
-        pfl->name, total_len, errp);
-    if (*errp) {
-        return;
-    }
-
-    pfl->storage = memory_region_get_ram_ptr(&pfl->mem);
-    sysbus_init_mmio(SYS_BUS_DEVICE(dev), &pfl->mem);
-
 if (pfl->blk) {
 uint64_t perm;
 pfl->ro = !blk_supports_write_perm(pfl->blk);
@@ -755,6 +743,18 @@ static void pflash_cfi01_realize(DeviceState *dev, Error **errp)
 pfl->ro = 0;
 }
 
+    memory_region_init_rom_device(
+        &pfl->mem, OBJECT(dev),
+        &pflash_cfi01_ops,
+        pfl,
+        pfl->name, total_len, errp);
+    if (*errp) {
+        return;
+    }
+
+    pfl->storage = memory_region_get_ram_ptr(&pfl->mem);
+    sysbus_init_mmio(SYS_BUS_DEVICE(dev), &pfl->mem);
+
 if (pfl->blk) {
 if (!blk_check_size_and_read_all(pfl->blk, pfl->storage, total_len,
  errp)) {
diff --git a/hw/block/pflash_cfi02.c b/hw/block/pflash_cfi02.c
index 7962cff7455..4f62ce8917d 100644
--- a/hw/block/pflash_cfi02.c
+++ b/hw/block/pflash_cfi02.c
@@ -791,15 +791,6 @@ static void pflash_cfi02_realize(DeviceState *dev, Error **errp)
 return;
 }
 
-    memory_region_init_rom_device(&pfl->orig_mem, OBJECT(pfl),
-                                  &pflash_cfi02_ops, pfl, pfl->name,
-                                  pfl->chip_len, errp);
-    if (*errp) {
-        return;
-    }
-
-    pfl->storage = memory_region_get_ram_ptr(&pfl->orig_mem);
-
 if (pfl->blk) {
 uint64_t perm;
 pfl->ro = !blk_supports_write_perm(pfl->blk);
@@ -812,6 +803,15 @@ static void pflash_cfi02_realize(DeviceState *dev, Error **errp)
 pfl->ro = 0;
 }
 
+    memory_region_init_rom_device(&pfl->orig_mem, OBJECT(pfl),
+                                  &pflash_cfi02_ops, pfl, pfl->name,
+                                  pfl->chip_len, errp);
+    if (*errp) {
+        return;
+    }
+
+    pfl->storage = memory_region_get_ram_ptr(&pfl->orig_mem);
+
 if (pfl->blk) {
 if (!blk_check_size_and_read_all(pfl->blk, pfl->storage,
  pfl->chip_len, errp)) {
-- 
2.26.2




[RFC PATCH 1/3] exec/memory: Introduce memory_region_init_rom_device_from_file()

2021-02-25 Thread Philippe Mathieu-Daudé
Introduce memory_region_init_rom_device_from_file(), which mmaps
the backing file of ROM devices. This allows reducing QEMU's memory
footprint, as the same file can be shared between multiple instances
of QEMU.

Signed-off-by: Philippe Mathieu-Daudé 
---
 include/exec/memory.h | 85 +
 softmmu/memory.c  | 98 +++
 2 files changed, 183 insertions(+)

diff --git a/include/exec/memory.h b/include/exec/memory.h
index c6fb714e499..bacf7495003 100644
--- a/include/exec/memory.h
+++ b/include/exec/memory.h
@@ -487,6 +487,9 @@ struct MemoryRegion {
 const char *name;
 unsigned ioeventfd_nb;
 MemoryRegionIoeventfd *ioeventfds;
+#ifndef CONFIG_POSIX
+gchar *contents;
+#endif
 };
 
 struct IOMMUMemoryRegion {
@@ -1131,6 +1134,43 @@ void memory_region_init_rom_device_nomigrate(MemoryRegion *mr,
  uint64_t size,
  Error **errp);
 
+/**
+ * memory_region_init_rom_device_from_file_nomigrate:
+ * Initialize a ROM memory region from the specified backing file.
+ * Writes are handled via callbacks.
+ *
+ * Note that this function does not do anything to cause the data in the
+ * RAM side of the memory region to be migrated; that is the responsibility
+ * of the caller.
+ *
+ * @mr: the #MemoryRegion to be initialized.
+ * @owner: the object that tracks the region's reference count
+ * @ops: callbacks for write access handling (must not be NULL).
+ * @opaque: passed to the read and write callbacks of the @ops structure.
+ * @name: Region name, becomes part of RAMBlock name used in migration stream
+ *        must be unique within any device
+ * @size: size of the region.
+ * @ram_flags: specify the properties of the ram block, which can be one
+ * or bit-or of following values
+ * - RAM_SHARED: mmap the backing file or device with MAP_SHARED
+ * - RAM_PMEM: the backend @mem_path is persistent memory
+ * Other bits are ignored.
+ * @path: specify the backing file
+ * @readonly: true to open @path for reading, false for read/write.
+ * @errp: pointer to Error*, to store an error if it happens.
+ */
+void memory_region_init_rom_device_from_file_nomigrate(MemoryRegion *mr,
+   Object *owner,
+   const MemoryRegionOps *ops,
+   void *opaque,
+   const char *name,
+   uint64_t size,
+   uint64_t align,
+   uint32_t ram_flags,
+   const char *path,
+   bool readonly,
+   Error **errp);
+
 /**
  * memory_region_init_iommu: Initialize a memory region of a custom type
  * that translates addresses
@@ -1249,6 +1289,51 @@ void memory_region_init_rom_device(MemoryRegion *mr,
Error **errp);
 
 
+/**
+ * memory_region_init_rom_device_from_file:
+ * Initialize a ROM memory region from the specified backing file.
+ * Writes are handled via callbacks.
+ *
+ * This function initializes a memory region backed by RAM for reads
+ * and callbacks for writes, and arranges for the RAM backing to
+ * be migrated (by calling vmstate_register_ram()
+ * if @owner is a DeviceState, or vmstate_register_ram_global() if
+ * @owner is NULL).
+ *
+ * TODO: Currently we restrict @owner to being either NULL (for
+ * global RAM regions with no owner) or devices, so that we can
+ * give the RAM block a unique name for migration purposes.
+ * We should lift this restriction and allow arbitrary Objects.
+ * If you pass a non-NULL non-device @owner then we will assert.
+ *
+ * @mr: the #MemoryRegion to be initialized.
+ * @owner: the object that tracks the region's reference count
+ * @ops: callbacks for write access handling (must not be NULL).
+ * @opaque: passed to the read and write callbacks of the @ops structure.
+ * @name: Region name, becomes part of RAMBlock name used in migration stream
+ *        must be unique within any device
+ * @size: size of the region.
+ * @ram_flags: specify the properties of the ram block, which can be one
+ * or bit-or of following values
+ * - RAM_SHARED: mmap the backing file or device with MAP_SHARED
+ * - RAM_PMEM: the backend @mem_path is persistent memory
+ * Other bits are ignored.
+ * @path: specify the backing file
+ * @readonly: true to open @path for reading, false for read/write.
+ * @errp: pointer to Error*, to store an error if it happens.
+ */
+void 

[RFC PATCH 3/3] hw/block/pflash: use memory_region_init_rom_device_from_file()

2021-02-25 Thread Philippe Mathieu-Daudé
If the block drive is read-only we will model a "protected" flash
device. We can thus use memory_region_init_rom_device_from_file(),
which mmaps the backing file when creating the MemoryRegion.
If the same backing file is used by multiple QEMU instances, this
reduces the memory footprint (this is often the case with the
CODE flash image from OVMF and AAVMF).

Suggested-by: Stefan Hajnoczi 
Signed-off-by: Philippe Mathieu-Daudé 
---
 hw/block/pflash_cfi01.c | 20 ++--
 hw/block/pflash_cfi02.c | 18 ++
 2 files changed, 28 insertions(+), 10 deletions(-)

diff --git a/hw/block/pflash_cfi01.c b/hw/block/pflash_cfi01.c
index a5fa8d8b74a..5757391df1c 100644
--- a/hw/block/pflash_cfi01.c
+++ b/hw/block/pflash_cfi01.c
@@ -743,11 +743,19 @@ static void pflash_cfi01_realize(DeviceState *dev, Error **errp)
 pfl->ro = 0;
 }
 
-    memory_region_init_rom_device(
-        &pfl->mem, OBJECT(dev),
-        &pflash_cfi01_ops,
-        pfl,
-        pfl->name, total_len, errp);
+    if (pfl->blk && pfl->ro) {
+        memory_region_init_rom_device_from_file(&pfl->mem, OBJECT(dev),
+                                                &pflash_cfi01_ops, pfl,
+                                                pfl->name, total_len,
+                                                qemu_real_host_page_size,
+                                                RAM_SHARED,
+                                                blk_bs(pfl->blk)->filename,
+                                                true, errp);
+    } else {
+        memory_region_init_rom_device(&pfl->mem, OBJECT(dev),
+                                      &pflash_cfi01_ops, pfl,
+                                      pfl->name, total_len, errp);
+    }
 if (*errp) {
 return;
 }
@@ -755,7 +763,7 @@ static void pflash_cfi01_realize(DeviceState *dev, Error **errp)
     pfl->storage = memory_region_get_ram_ptr(&pfl->mem);
     sysbus_init_mmio(SYS_BUS_DEVICE(dev), &pfl->mem);
 
-if (pfl->blk) {
+if (pfl->blk && !pfl->ro) {
 if (!blk_check_size_and_read_all(pfl->blk, pfl->storage, total_len,
  errp)) {
vmstate_unregister_ram(&pfl->mem, DEVICE(pfl));
diff --git a/hw/block/pflash_cfi02.c b/hw/block/pflash_cfi02.c
index 4f62ce8917d..d57f64d7732 100644
--- a/hw/block/pflash_cfi02.c
+++ b/hw/block/pflash_cfi02.c
@@ -803,16 +803,26 @@ static void pflash_cfi02_realize(DeviceState *dev, Error **errp)
 pfl->ro = 0;
 }
 
-    memory_region_init_rom_device(&pfl->orig_mem, OBJECT(pfl),
-                                  &pflash_cfi02_ops, pfl, pfl->name,
-                                  pfl->chip_len, errp);
+    if (pfl->blk && pfl->ro) {
+        memory_region_init_rom_device_from_file(&pfl->orig_mem, OBJECT(pfl),
+                                                &pflash_cfi02_ops, pfl,
+                                                pfl->name, pfl->chip_len,
+                                                qemu_real_host_page_size,
+                                                RAM_SHARED,
+                                                blk_bs(pfl->blk)->filename,
+                                                true, errp);
+    } else {
+        memory_region_init_rom_device(&pfl->orig_mem, OBJECT(pfl),
+                                      &pflash_cfi02_ops, pfl, pfl->name,
+                                      pfl->chip_len, errp);
+    }
 if (*errp) {
 return;
 }
 
pfl->storage = memory_region_get_ram_ptr(&pfl->orig_mem);
 
-if (pfl->blk) {
+if (pfl->blk && !pfl->ro) {
 if (!blk_check_size_and_read_all(pfl->blk, pfl->storage,
  pfl->chip_len, errp)) {
vmstate_unregister_ram(&pfl->orig_mem, DEVICE(pfl));
-- 
2.26.2




[RFC PATCH 0/3] hw/block/pflash: Mmap read-only backend files with MAP_SHARED

2021-02-25 Thread Philippe Mathieu-Daudé
Hi,

This series aims to reduce the memory footprint of flash devices
when the backing file is read-only.

When a backing file is read-only, the model considers the flash
to be in "protected" mode. No writes are allowed, but the MMIO
state machine is still usable.

This series introduces a new memory region helper to mmap files
and use it with the pflash device (only with read-only backing
files).

The goal is to reduce QEMU's memory footprint when multiple VMs
are instantiated using the same read-only backing file, which is
the case with the CODE flash from OVMF and AAVMF.

Previous attempts:

- Huawei
https://www.mail-archive.com/qemu-devel@nongnu.org/msg607292.html
- Tencent
https://www.mail-archive.com/qemu-devel@nongnu.org/msg742066.html
- Oracle
https://www.mail-archive.com/qemu-devel@nongnu.org/msg760065.html

This is an RFC because it is yet another approach to tackling this
technical debt, and it is very lightly tested.

Regards,

Phil.

Philippe Mathieu-Daudé (3):
  exec/memory: Introduce memory_region_init_rom_device_from_file()
  hw/block/pflash: Move code around
  hw/block/pflash: use memory_region_init_rom_device_from_file()

 include/exec/memory.h   | 85 +++
 hw/block/pflash_cfi01.c | 34 --
 hw/block/pflash_cfi02.c | 30 -
 softmmu/memory.c| 98 +
 4 files changed, 224 insertions(+), 23 deletions(-)

-- 
2.26.2





Re: [PATCH v2 03/31] qapi/qom: Add ObjectOptions for iothread

2021-02-25 Thread Eric Blake
On 2/24/21 7:52 AM, Kevin Wolf wrote:
> Add an ObjectOptions union that will eventually describe the options of
> all user creatable object types. As unions can't exist without any
> branches, also add the first object type.
> 
> This adds a QAPI schema for the properties of the iothread object.
> 
> Signed-off-by: Kevin Wolf 
> ---
>  qapi/qom.json | 53 +++
>  1 file changed, 53 insertions(+)
> 
> diff --git a/qapi/qom.json b/qapi/qom.json
> index 96c91c1faf..bf2ecb34be 100644
> --- a/qapi/qom.json
> +++ b/qapi/qom.json
> @@ -202,6 +202,59 @@
>'returns': [ 'ObjectPropertyInfo' ],
>'allow-preconfig': true }
>  
> +##
> +# @IothreadProperties:
> +#
> +# Properties for iothread objects.
> +#
> +# @poll-max-ns: the maximum number of nanoseconds to busy wait for events.
> +#   0 means polling is disabled (default: 32768 on POSIX hosts,
> +#   0 otherwise)
> +#
> +# @poll-grow: the multiplier used to increase the polling time when the
> +# algorithm detects it is missing events due to not polling long
> +# enough. 0 selects a default behaviour (default: 0)
> +#
> +# @poll-shrink: the divisor used to decrease the polling time when the
> +#   algorithm detects it is spending too long polling without
> +#   encountering events. 0 selects a default behaviour (default: 0)

Matches PollParamInfo declarations in iothread.c.

> +#
> +# Since: 2.0

How did you determine this value?  (I'm not questioning it being
something other than 6.0, because we have indeed supported QMP
configuration of these values via the untyped magic previously present
in add-object).

> +##
> +{ 'struct': 'IothreadProperties',
> +  'data': { '*poll-max-ns': 'int',
> +'*poll-grow': 'int',
> +'*poll-shrink': 'int' } }

These are correctly typed per the code in iothread.c, but it does raise
the question of whether a signed 64-bit value is the best choice, or if
we might later want to revisit things to pick more constrained types.  I
don't think such an audit should hold up this series, though.

> +
> +##
> +# @ObjectType:
> +#
> +# Since: 6.0
> +##
> +{ 'enum': 'ObjectType',
> +  'data': [
> +'iothread'
> +  ] }

Will be fun to watch this grow over the series.

> +
> +##
> +# @ObjectOptions:
> +#
> +# Describes the options of a user creatable QOM object.
> +#
> +# @qom-type: the class name for the object to be created
> +#
> +# @id: the name of the new object
> +#
> +# Since: 6.0
> +##
> +{ 'union': 'ObjectOptions',
> +  'base': { 'qom-type': 'ObjectType',
> +'id': 'str' },
> +  'discriminator': 'qom-type',
> +  'data': {
> +  'iothread':   'IothreadProperties'
> +  } }
> +
>  ##
>  # @object-add:
>  #
> 


Reviewed-by: Eric Blake 

-- 
Eric Blake, Principal Software Engineer
Red Hat, Inc.   +1-919-301-3226
Virtualization:  qemu.org | libvirt.org




Re: [PATCH v2 02/31] qapi/qom: Drop deprecated 'props' from object-add

2021-02-25 Thread Eric Blake
On 2/24/21 7:52 AM, Kevin Wolf wrote:
> The option has been deprecated in QEMU 5.0, remove it.
> 
> Signed-off-by: Kevin Wolf 
> ---
>  qapi/qom.json|  6 +-
>  docs/system/deprecated.rst   |  5 -
>  docs/system/removed-features.rst |  5 +
>  qom/qom-qmp-cmds.c   | 21 -
>  4 files changed, 6 insertions(+), 31 deletions(-)
> 

Reviewed-by: Eric Blake 

-- 
Eric Blake, Principal Software Engineer
Red Hat, Inc.   +1-919-301-3226
Virtualization:  qemu.org | libvirt.org




Re: [PATCH v2 01/31] tests: Drop 'props' from object-add calls

2021-02-25 Thread Eric Blake
On 2/24/21 7:52 AM, Kevin Wolf wrote:
> The 'props' option has been deprecated in 5.0 in favour of a flattened
> object-add command. Time to change our test cases to drop the deprecated
> option.
> 
> Signed-off-by: Kevin Wolf 
> ---

Reviewed-by: Eric Blake 

-- 
Eric Blake, Principal Software Engineer
Red Hat, Inc.   +1-919-301-3226
Virtualization:  qemu.org | libvirt.org




Re: [PATCH] nbd: server: Report holes for raw images

2021-02-25 Thread Vladimir Sementsov-Ogievskiy

19.02.2021 19:58, Eric Blake wrote:

On 2/19/21 10:42 AM, Eric Blake wrote:


To me, data=false looks compatible with NBD_STATE_HOLE. From user point
of view, getting same results from qemu-nbd and qemu-img is more
important than being more correct about allocation status.


More to the point, here is our inconsistency:

In nbd/server.c, we turn !BDRV_BLOCK_ALLOCATED into NBD_STATE_HOLE

In block/nbd.c, we turn !NBD_STATE_HOLE into BDRV_BLOCK_DATA

The fact that we are not doing a round-trip conversion means that one of
the two places is wrong.  And your argument that the server side is
wrong makes sense to me.


In fact, when I went back and researched when this was introduced (see
commit e7b1948d51 in 2018), we may have been aware of the inconsistency
between client and server, but didn't make up our minds at the time:
https://lists.gnu.org/archive/html/qemu-devel/2018-03/msg03465.html
"? Hm, don't remember, what we decided about DATA/HOLE flags mapping.."



I'll wait a few days for any other reviewer commentary before taking
this through my NBD tree.






I can add the following.

First, a link to my research on block_status in QEMU:
https://lists.gnu.org/archive/html/qemu-devel/2020-04/msg05136.html

And about HOLE and ZERO..

As I've noted in the research above, SCSI may return HOLE & !ZERO:

from SCSI:
Logical Block Provisioning Read Zeros (LBPRZ) bit
1 If the logical block provisioning read zeros (LBPRZ) bit is set to one,
then, for an unmapped LBA specified by a read operation, the device server
shall send user data with all bits set to zero to the data-in buffer.
0 If the TPRZ bit is set to zero, then, for an unmapped LBA specified by a
read operation, the device server may send user data with all bits set to
any value to the data-in buffer.

So we can have an unmapped area that can be read as any random data. The same
thing can be said about the null-co driver with read-zeroes=false.

Also, qcow2 supports ALLOCATED ZERO clusters, which read as zero but have data
allocated - it is reasonable to report them as ZERO & !HOLE.

And of course UNALLOCATED ZERO clusters in qcow2 and lseek holes are reasonable
to report as ZERO & HOLE, because they read as zero and "future writes to that
area may cause fragmentation or encounter an NBD_ENOSPC"..

So, all combinations are reasonable; we just need to fix the QEMU NBD server to
report correct statuses in all these cases.

It seems that the ZERO/HOLE specification is a lot more reasonable than what we
have with ZERO/DATA/ALLOCATED in QEMU, and maybe the right way forward is to
move the internal block_status to use NBD terms.
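[As a sketch, the fixed server-side mapping and the combinations listed above can be written out like this. The flag values mirror qemu's block.h and nbd.h, but treat them as illustrative assumptions rather than authoritative definitions:]

```python
# Illustrative flag values (mirroring qemu's headers; treat as assumptions).
BDRV_BLOCK_DATA = 0x01
BDRV_BLOCK_ZERO = 0x02
NBD_STATE_HOLE = 1 << 0
NBD_STATE_ZERO = 1 << 1

def to_nbd_flags(ret: int) -> int:
    # After Nir's fix: HOLE tracks !DATA instead of !ALLOCATED.
    return ((0 if ret & BDRV_BLOCK_DATA else NBD_STATE_HOLE) |
            (NBD_STATE_ZERO if ret & BDRV_BLOCK_ZERO else 0))

# Unmapped SCSI LBA with no read-zeros guarantee: reads as arbitrary data
hole_only = to_nbd_flags(0)                                 # HOLE, !ZERO
# Reads as zero with data present (e.g. an allocated zero cluster)
zero_only = to_nbd_flags(BDRV_BLOCK_DATA | BDRV_BLOCK_ZERO) # ZERO, !HOLE
# lseek hole / unallocated zero cluster
zero_hole = to_nbd_flags(BDRV_BLOCK_ZERO)                   # ZERO and HOLE

print(hole_only, zero_only, zero_hole)  # 1 2 3
```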


And thanks for CCing me. Hmm, maybe, I'll suggest myself as co-maintainer for 
NBD?

--
Best regards,
Vladimir



Re: [PATCH] nbd: server: Report holes for raw images

2021-02-25 Thread Vladimir Sementsov-Ogievskiy

19.02.2021 19:07, Nir Soffer wrote:

When querying image extents for raw image, qemu-nbd reports holes as
zero:

$ qemu-nbd -t -r -f raw empty-6g.raw

$ qemu-img map --output json nbd://localhost
[{ "start": 0, "length": 6442450944, "depth": 0, "zero": true, "data": true, 
"offset": 0}]

$ qemu-img map --output json empty-6g.raw
[{ "start": 0, "length": 6442450944, "depth": 0, "zero": true, "data": false, 
"offset": 0}]

Turns out that qemu-img map reports a hole based on BDRV_BLOCK_DATA, but
nbd server reports a hole based on BDRV_BLOCK_ALLOCATED.

The NBD protocol says:

 NBD_STATE_HOLE (bit 0): if set, the block represents a hole (and
 future writes to that area may cause fragmentation or encounter an
 NBD_ENOSPC error); if clear, the block is allocated or the server
 could not otherwise determine its status.


So, it makes sense to report HOLE for lseek holes on raw files (as writes may
increase disk usage) and for qcow2 ZERO clusters: again, writing will lead to
allocation of a NORMAL cluster.

Reviewed-by: Vladimir Sementsov-Ogievskiy 



qemu-img manual says:

 whether the sectors contain actual data or not (boolean field data;
 if false, the sectors are either unallocated or stored as
 optimized all-zero clusters);

To me, data=false looks compatible with NBD_STATE_HOLE. From user point
of view, getting same results from qemu-nbd and qemu-img is more
important than being more correct about allocation status.

Changing nbd server to report holes using BDRV_BLOCK_DATA makes qemu-nbd
results compatible with qemu-img map:

$ qemu-img map --output json nbd://localhost
[{ "start": 0, "length": 6442450944, "depth": 0, "zero": true, "data": false, 
"offset": 0}]

Signed-off-by: Nir Soffer 
---
  nbd/server.c   | 4 ++--
  tests/qemu-iotests/241.out | 4 ++--
  2 files changed, 4 insertions(+), 4 deletions(-)

diff --git a/nbd/server.c b/nbd/server.c
index 7229f487d2..86a44a9b41 100644
--- a/nbd/server.c
+++ b/nbd/server.c
@@ -2087,8 +2087,8 @@ static int blockstatus_to_extents(BlockDriverState *bs, uint64_t offset,
  return ret;
  }
  
-    flags = (ret & BDRV_BLOCK_ALLOCATED ? 0 : NBD_STATE_HOLE) |
-            (ret & BDRV_BLOCK_ZERO      ? NBD_STATE_ZERO : 0);
+    flags = (ret & BDRV_BLOCK_DATA ? 0 : NBD_STATE_HOLE) |
+            (ret & BDRV_BLOCK_ZERO ? NBD_STATE_ZERO : 0);
  
  if (nbd_extent_array_add(ea, num, flags) < 0) {

  return 0;
diff --git a/tests/qemu-iotests/241.out b/tests/qemu-iotests/241.out
index 75f9f465e5..3f8c173cc8 100644
--- a/tests/qemu-iotests/241.out
+++ b/tests/qemu-iotests/241.out
@@ -5,7 +5,7 @@ QA output created by 241
size:  1024
min block: 1
  [{ "start": 0, "length": 1000, "depth": 0, "zero": false, "data": true, 
"offset": OFFSET},
-{ "start": 1000, "length": 24, "depth": 0, "zero": true, "data": true, 
"offset": OFFSET}]
+{ "start": 1000, "length": 24, "depth": 0, "zero": true, "data": false, 
"offset": OFFSET}]
  1 KiB (0x400) bytes allocated at offset 0 bytes (0x0)
  
  === Exporting unaligned raw image, forced server sector alignment ===

@@ -23,6 +23,6 @@ WARNING: Image format was not specified for 'TEST_DIR/t.raw' and probing guessed
size:  1024
min block: 1
  [{ "start": 0, "length": 1000, "depth": 0, "zero": false, "data": true, "offset": OFFSET},
-{ "start": 1000, "length": 24, "depth": 0, "zero": true, "data": true, "offset": OFFSET}]
+{ "start": 1000, "length": 24, "depth": 0, "zero": true, "data": false, "offset": OFFSET}]
  1 KiB (0x400) bytes allocated at offset 0 bytes (0x0)
  *** done




--
Best regards,
Vladimir



Re: [PATCH 00/14] deprecations: remove many old deprecations

2021-02-25 Thread Jim Fehlig

Adding xen-devel and Ian to cc.

On 2/24/21 6:11 AM, Daniel P. Berrangé wrote:

The following features have been deprecated for well over the two
release cycles we promise


This reminded me of a bug report we received late last year when updating to 
5.2.0. 'virsh setvcpus' suddenly stopped working for Xen HVM guests. Turns out 
libxl uses cpu-add under the covers.




   ``-usbdevice`` (since 2.10.0)
  ``-drive file=json:{...{'driver':'file'}}`` (since 3.0)
   ``-vnc acl`` (since 4.0.0)
  ``-mon ...,control=readline,pretty=on|off`` (since 4.1)
   ``migrate_set_downtime`` and ``migrate_set_speed`` (since 2.8.0)
   ``query-named-block-nodes`` result ``encryption_key_missing`` (since 2.10.0)
   ``query-block`` result ``inserted.encryption_key_missing`` (since 2.10.0)
   ``migrate-set-cache-size`` and ``query-migrate-cache-size`` (since 2.11.0)
  ``query-named-block-nodes`` and ``query-block`` result dirty-bitmaps[i].status (since 4.0)
   ``query-cpus`` (since 2.12.0)
   ``query-cpus-fast`` ``arch`` output member (since 3.0.0)
   ``query-events`` (since 4.0)
   chardev client socket with ``wait`` option (since 4.0)
  ``acl_show``, ``acl_reset``, ``acl_policy``, ``acl_add``, ``acl_remove`` (since 4.0.0)
   ``ide-drive`` (since 4.2)
   ``scsi-disk`` (since 4.2)

AFAICT, libvirt has ceased to use all of these too.


A quick grep of the libxl code shows it uses -usbdevice, query-cpus, and 
scsi-disk.


There are many more similarly old deprecations not (yet) tackled.


The Xen tools maintainers will need to be more vigilant about the deprecations.
I don't follow Xen development closely enough to know if this topic has already
been discussed.


Regards,
Jim




block/throttle and burst bucket

2021-02-25 Thread Peter Lieven
Hi,


I was wondering if there is a way to check from the outside (QMP etc.) whether
a throttled block device has exceeded the iops_max_length seconds of bursting
up to iops_max and is now hard-limited to the supplied iops limit?


Would it also be a good idea to extend the accounting to account for requests
that had to wait before being sent out to the backend device?


Thanks,

Peter





Re: [PATCH v5 0/9] block: Add retry for werror=/rerror= mechanism

2021-02-25 Thread Vladimir Sementsov-Ogievskiy

23.02.2021 16:41, Eric Blake wrote:

On 2/23/21 3:40 AM, Stefan Hajnoczi wrote:

On Fri, Feb 05, 2021 at 06:13:06PM +0800, Jiahui Cen wrote:

This patch series proposes to extend the werror=/rerror= mechanism to add
a 'retry' feature. It can automatically retry failed I/O requests on error
without sending the error back to the guest, and the guest can get back to
running smoothly when I/O is recovered.


This patch series implements a retry followed by werror/rerror=report
after a timeout. This mechanism could be made more generic (and the code
could be simplified) by removing the new werror/rerror=retry action and
instead implementing the retry/timeout followed by *any* werror=/rerror=
policy chosen by the user.

In other words, if the retry interval is non-zero, retry the request and
check for timeouts. When the timeout is reached, obey the
werror=/rerror= action.

This is more flexible than hard-coding werror=retry to mean retry
timeout followed by werror=report.

For example:

   werror=stop,write-retry-interval=1000,write-retry-timeout=15000,
   rerror=report,read-retry-interval=1000,read-retry-timeout=15000

Failed write requests will be retried once a second for 15 seconds.
If the timeout is reached the guest is stopped.

Failed read requests will be retried once a second for 15 seconds. If
the timeout is reached the error is reported to the guest.
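A generic retry-until-timeout policy of the kind described above could be sketched as follows (all names hypothetical; a simplified model, not QEMU code):

```python
import time
from enum import Enum

class Action(Enum):
    REPORT = 'report'   # propagate the error to the guest
    STOP = 'stop'       # pause the VM

def submit_with_retry(op, interval_ms: int, timeout_ms: int,
                      on_timeout: Action):
    # Sketch of the proposal: retry every `interval_ms` until `timeout_ms`
    # elapses, then fall back to the configured werror=/rerror= action
    # instead of a hard-coded report.
    deadline = time.monotonic() + timeout_ms / 1000
    while True:
        try:
            return op()
        except IOError as err:
            if time.monotonic() >= deadline:
                if on_timeout is Action.STOP:
                    raise RuntimeError('vm stopped') from err
                raise  # Action.REPORT: error goes to the guest
            time.sleep(interval_ms / 1000)

# A request that fails twice, then succeeds:
attempts = {'n': 0}
def flaky():
    attempts['n'] += 1
    if attempts['n'] < 3:
        raise IOError('transient failure')
    return 'ok'

result = submit_with_retry(flaky, interval_ms=1, timeout_ms=5000,
                           on_timeout=Action.REPORT)
print(result)  # ok
```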


You may also want to look at what the NBD block device already
implements for retries, and see if making retry generic to the block
layer in general can do everything already possible in the NBD code, at
which point the NBD code can be simplified.  Vladimir (added in cc) is
the best point of contact there.



Hi!

There are some things specific to retrying in the NBD client code that may not
make sense for generic code:

We try to handle not just any generic error but TCP connection problems. So

1. With NBD retries we wanted to retry only requests that failed due to
   connection loss. So, if the NBD server successfully replies with -EIO to a
   client request, we don't retry and just report -EIO. Such things can be
   distinguished only inside the NBD driver.

2. The timeout is specific too: we start a timer when we detect the connection
   loss for the first time. During the specified time all current requests wait
   for reconnect; they'll be retried if the connection is re-established.
   On timeout, all current requests (waiting for reconnect) report failure, and
   all further requests fail immediately (until a successful reconnect).

Still, of course, the NBD driver is over-complicated by the reconnect feature
(especially with the reconnect thread :) and it would be cool to split some
parts out to be separate...


--
Best regards,
Vladimir



Re: [RFC PATCH v2 3/4] block: Support multiple reopening with x-blockdev-reopen

2021-02-25 Thread Vladimir Sementsov-Ogievskiy

24.02.2021 15:33, Kevin Wolf wrote:

Am 09.02.2021 um 09:03 hat Vladimir Sementsov-Ogievskiy geschrieben:

08.02.2021 21:44, Alberto Garcia wrote:

Signed-off-by: Alberto Garcia 
---
   qapi/block-core.json   |  2 +-
   include/block/block.h  |  1 +
   block.c| 16 +--
   blockdev.c | 85 +-
   tests/qemu-iotests/155 |  9 ++--
   tests/qemu-iotests/165 |  4 +-
   tests/qemu-iotests/245 | 27 +++-
   tests/qemu-iotests/248 |  2 +-
   tests/qemu-iotests/248.out |  2 +-
   tests/qemu-iotests/298 |  4 +-
   10 files changed, 89 insertions(+), 63 deletions(-)

diff --git a/qapi/block-core.json b/qapi/block-core.json
index c0e7c23331..b9fcf20a81 100644
--- a/qapi/block-core.json
+++ b/qapi/block-core.json
@@ -4177,7 +4177,7 @@
   # Since: 4.0
   ##
   { 'command': 'x-blockdev-reopen',
-  'data': 'BlockdevOptions', 'boxed': true }
+  'data': { 'options': ['BlockdevOptions'] } }


Do we also want to drop x- prefix?


libvirt really wants to have a stable blockdev-reopen interface in 6.0
because enabling the incremental backup code depends on this (they just
toggle the readonly flag if I understand correctly, so most of the work
we're currently doing isn't even relevant at this moment for libvirt).


Do you know what the use case is exactly? If they do it to remove a dirty bitmap
from a backing image after a snapshot operation, perhaps we'd better improve the
block-dirty-bitmap-remove command to be able to reopen a read-only image?

(I just recently faced such a task)



Given that the soft freeze is coming closer (March 16), I wonder if we
should just make this API change and declare the interface stable. We
can then make Vladimir's fixes and the file reopening on top of it - if
it's in time for 6.0, that would be good, but if not we could move it to
6.1 without impacting libvirt.

I think we're reasonable confident that the QAPI interfaces are right,
even if maybe not that all aspects of the implementation are right yet.

What do you think?



I think it's OK. We have had it since 4.0. What do we gain by keeping the x-
prefix for years? Even the latest change from updating one device to several
could easily have been done with the help of an 'alternate' if the command
were already stable.


--
Best regards,
Vladimir



Re: [PATCH 2/5] block: Fix BDRV_BLOCK_RAW status to honor alignment

2021-02-25 Thread Vladimir Sementsov-Ogievskiy

25.02.2021 19:03, Eric Blake wrote:

On 2/25/21 8:55 AM, Vladimir Sementsov-Ogievskiy wrote:

18.02.2021 23:15, Eric Blake wrote:

Previous patches mentioned how the blkdebug filter driver demonstrates
a bug present in our NBD server (for example, commit b0245d64); the
bug is also present with the raw format driver when probing
occurs. Basically, if we specify a particular alignment > 1, but defer
the actual block status to the underlying file, and the underlying
file has a smaller alignment, then the use of BDRV_BLOCK_RAW to defer
to the underlying file can end up with status split at an alignment
unacceptable to our layer.  Many callers don't care, but NBD does - it
is a violation of the NBD protocol to send status or read results
split on an unaligned boundary (in 737d3f5244, we taught our 4.0
client to be tolerant of such violations because the problem was even
more pronounced with qemu 3.1 as server; but we still need to fix our
server for the sake of other stricter clients).




+ * - BDRV_BLOCK_ALLOCATED is set if the flag is set for at least one
subregion


Hmm about this..

We already have a mess around ALLOCATED:

  [1] for format drivers it means that "read is handled by this layer,
not by backing", i.e. data (or zero) is placed exactly on that layer of
the backing chain


If we're reading at a given granularity, then the 4k read is satisfied
at this layer even if portions of the read came from lower layers.  So
the logic works here.


Hmm.. I can't agree. By that reasoning we could say that everything is satisfied at this 
layer: even if there is no data in it, we read from this layer and it somehow takes data 
from lower layers.

It's all about terminology, of course, and we can use the same terms for different 
things. For me, ALLOCATED for a format layer works as follows:

The whole layer is split into clusters. Each cluster is either ALLOCATED 
(the format layer produces data on read somehow, without touching the backing child) or 
UNALLOCATED (the format layer just calls read() on the backing child with the same offset).

And before your patch, a block_status request never combined clusters with 
different ALLOCATED status.

The ALLOCATED status of blocks at some layer of the backing chain is significant for 
block jobs, and if we have several sequential chunks with different ALLOCATED 
status, we can't just consider all of them ALLOCATED, because in some 
scenarios that will lead to data loss.
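
A toy model (not qemu code) of the data-loss hazard described above: if two sub-clusters with different ALLOCATED status are coalesced into one region reported ALLOCATED, a stream-like job that skips regions already allocated in the top layer will skip data that actually lives only below it.

```python
# Per sub-cluster: True if the top layer satisfies reads itself
# (ALLOCATED in sense [1]), False if reads fall through to backing.
top_allocated = [True, False]
backing_data = ["unused", "backing-data"]   # what lives in the backing file

# Coalescing with "ALLOCATED if any sub-region is ALLOCATED":
merged = any(top_allocated)                 # -> True for the whole region

# A stream job copies up only regions NOT allocated in the top layer,
# so the merged answer makes it skip the whole region...
copied_merged = [] if merged else backing_data

# ...while per-cluster status would still copy the second sub-cluster:
copied_precise = [d for alloc, d in zip(top_allocated, backing_data)
                  if not alloc]
```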





  [2] for protocol drivers it's up to the driver, which may always report
ALLOCATED (being more compatible with format drivers) or it may
sometimes return UNALLOCATED to show that data is not allocated in FS..


We've been moving away from this particular overload.  What's more, most
protocol drivers that set it at all set it for every single byte,
because protocol layers don't have a notion of a backing file; which
means that if it is set at all, it will be set for every byte anyway, so
the logic works here.



And this way, bdrv_co_block_status_aligned() is compatible with protocol
drivers but not with format drivers (as you can't combine
"go-to-backing" information of several flags: for some scenarios it's
safer to consider the whole region ALLOCATED and for others it's safer
to consider it UNALLOCATED).

For example, for the stream target it's safer to consider the target block
UNALLOCATED and do an extra copy-on-read operation. And for the stream base
it's safer to consider the block ALLOCATED (and again do extra copying, not
missing something significant).


I think, to avoid increasing the mess, we should first split [1] from
[2] somehow..
Assume we change it to BDRV_BLOCK_PROTOCOL_ALLOCATED and
BDRV_BLOCK_GO_TO_BACKING.


Maybe it is indeed worth splitting out two different flags to fully
distinguish between the two overloaded meanings, but that seems like an
independent patch to this series.



Then, for BDRV_BLOCK_PROTOCOL_ALLOCATED we probably can just report
BDRV_BLOCK_PROTOCOL_ALLOCATED if at least one of extents reports
BDRV_BLOCK_PROTOCOL_ALLOCATED. (probably we don't need
BDRV_BLOCK_PROTOCOL_ALLOCATED at all and can drop this logic)

But for BDRV_BLOCK_GO_TO_BACKING we'll have to also add
BDRV_BLOCK_GO_TO_BACKING_VALID and report

  * BDRV_BLOCK_GO_TO_BACKING | BDRV_BLOCK_GO_TO_BACKING_VALID if all
extents report BDRV_BLOCK_GO_TO_BACKING
  
  * BDRV_BLOCK_GO_TO_BACKING if all extents report no

BDRV_BLOCK_GO_TO_BACKING

  *  if some extents report BDRV_BLOCK_GO_TO_BACKING but others
not.


Hmm.. And, I think that there is a problem in the NBD protocol. Actually,
with the allocation-depth context we started to report internal layers of
the backing chain. And if we have layers with different request-alignment
it's not correct to report allocation-depth "aligned" to the top image's
request-alignment.. So, for allocation-depth to work correctly we should
extend the NBD protocol to allow unaligned chunks in the BLOCK_STATUS report.


The NBD protocol says that base:allocation must obey allocation rules.
If we want to declare that "because qemu:allocation-depth is an
extension, we choose 

Re: [PATCH 1/5] iotests: Update 241 to expose backing layer fragmentation

2021-02-25 Thread Vladimir Sementsov-Ogievskiy

25.02.2021 18:52, Eric Blake wrote:

On 2/25/21 8:57 AM, Vladimir Sementsov-Ogievskiy wrote:

25.02.2021 16:50, Vladimir Sementsov-Ogievskiy wrote:

18.02.2021 23:15, Eric Blake wrote:

Previous commits (such as 6e280648, 75d34eb9) have mentioned that our
NBD server still sends unaligned fragments when an active layer with
large advertised minimum block size is backed by another layer with a
smaller block size. Expand the test to actually cover these scenarios,
by using two different approaches: qcow2 encryption (which forces
512-byte alignment) with an unaligned raw backing file, and blkdebug
with a 4k alignment.





Now I don't think that aligning qemu:allocation-depth information is a
correct thing to do.


Why not?  First, it's very rare that you'd have a qcow2 image with
mandated 4k minimum block size, backed by another qcow2 image with 512
block size (blkdebug made it possible to expose the bug, but I could not
find a way in common day-to-day usage), so we really aren't impacting
normal users.  Second, from the perspective of copying backing chains
over NBD, what difference does it make if we have the backing chain:

A (granularity 512) <- B (granularity 512) <- C (granularity 4k)

with the allocation pattern:

A: -A-A-A-A-A-A-A-A
B: --BB--BB--BB--BB
C: 

and report the allocation depth as:



instead of

03220322

The former may be imprecise, but it obeys our bounds, and in all
reality, if all we have access to is 4k chunks, any decisions we make
about how to handle that 4k block should be based on the fact that at
least some of the data was allocated in our backing file, and not
treating the entire 4k as unallocated merely because the first 512 bytes
are neither in A nor B.
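
Eric's depth string can be reproduced with a small model (a sketch, not qemu code; C's allocation pattern was lost in the archive and is assumed all-unallocated here, which is consistent with the quoted depths): depth 0 means allocated nowhere in the chain, 1 means the active layer C, 2 the middle layer B, 3 the base A.

```python
# One character per 512-byte sector; '-' = unallocated at that layer.
A = "-A-A-A-A-A-A-A-A"
B = "--BB--BB--BB--BB"
C = "-" * 16                 # assumed: nothing allocated in the active layer

def depth(i):
    # First layer (top-down) that allocates the sector wins.
    for d, layer in enumerate((C, B, A), start=1):
        if layer[i] != "-":
            return d
    return 0                 # unallocated everywhere in the chain

depths = "".join(str(depth(i)) for i in range(16))
# Repeats the "03220322" pattern from the message across all 16 sectors.
```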



I'm not sure about the NBD client, but in qemu block jobs the decision may be 
different for different tasks, as I mentioned in my answer on [2/5].
For example, block-stream will skip chunks allocated in the top layer, because there is 
nothing to do: the data is already in top. But if we imagine that top may return ALLOCATED 
for something that is not ALLOCATED, the stream logic is broken.. Probably that's a 
bad example.

I agree that this is a rare case anyway and we probably shouldn't care too 
much. But we should at least describe it in the allocation-depth specification.


--
Best regards,
Vladimir



Re: [PATCH 2/5] block: Fix BDRV_BLOCK_RAW status to honor alignment

2021-02-25 Thread Eric Blake
On 2/25/21 8:55 AM, Vladimir Sementsov-Ogievskiy wrote:
> 18.02.2021 23:15, Eric Blake wrote:
>> Previous patches mentioned how the blkdebug filter driver demonstrates
>> a bug present in our NBD server (for example, commit b0245d64); the
>> bug is also present with the raw format driver when probing
>> occurs. Basically, if we specify a particular alignment > 1, but defer
>> the actual block status to the underlying file, and the underlying
>> file has a smaller alignment, then the use of BDRV_BLOCK_RAW to defer
>> to the underlying file can end up with status split at an alignment
>> unacceptable to our layer.  Many callers don't care, but NBD does - it
>> is a violation of the NBD protocol to send status or read results
>> split on an unaligned boundary (in 737d3f5244, we taught our 4.0
>> client to be tolerant of such violations because the problem was even
>> more pronounced with qemu 3.1 as server; but we still need to fix our
>> server for the sake of other stricter clients).
>>

>> + * - BDRV_BLOCK_ALLOCATED is set if the flag is set for at least one
>> subregion
> 
> Hmm about this..
> 
> We already have a mess around ALLOCATED:
> 
>  [1] for format drivers it means that "read is handled by this layer,
> not by backing", i.e. data (or zero) is placed exactly on that layer of
> backing-chain

If we're reading at a given granularity, then the 4k read is satisfied
at this layer even if portions of the read came from lower layers.  So
the logic works here.

> 
>  [2] for protocol drivers it's up to the driver, which may always report
> ALLOCATED (being more compatible with format drivers) or it may
> sometimes return UNALLOCATED to show that data is not allocated in FS..

We've been moving away from this particular overload.  What's more, most
protocol drivers that set it at all set it for every single byte,
because protocol layers don't have a notion of a backing file; which
means that if it is set at all, it will be set for every byte anyway, so
the logic works here.

> 
> And this way, bdrv_co_block_status_aligned() is compatible with protocol
> drivers but not with format drivers (as you can't combine
> "go-to-backing" information of several flags: for some scenarios it's
> safer to consider the whole region ALLOCATED and for others it's safer
> to consider it UNALLOCATED).
> 
> For example for stream target it's safer to consider target block
> UNALLOCATED and do extra copy-on-read operation. And for stream base
> it's safer to consider block ALLOCATED (and again do extra copying, not
> missing something significant).
> 
> 
> I think, to avoid increasing the mess, we should first split [1] from
> [2] somehow..
> Assume we change it to BDRV_BLOCK_PROTOCOL_ALLOCATED and
> BDRV_BLOCK_GO_TO_BACKING.

Maybe it is indeed worth splitting out two different flags to fully
distinguish between the two overloaded meanings, but that seems like an
independent patch to this series.

> 
> Then, for BDRV_BLOCK_PROTOCOL_ALLOCATED we probably can just report
> BDRV_BLOCK_PROTOCOL_ALLOCATED if at least one of extents reports
> BDRV_BLOCK_PROTOCOL_ALLOCATED. (probably we don't need
> BDRV_BLOCK_PROTOCOL_ALLOCATED at all and can drop this logic)
> 
> But for BDRV_BLOCK_GO_TO_BACKING we'll have to also add
> BDRV_BLOCK_GO_TO_BACKING_VALID and report
> 
>  * BDRV_BLOCK_GO_TO_BACKING | BDRV_BLOCK_GO_TO_BACKING_VALID if all
> extents report BDRV_BLOCK_GO_TO_BACKING
>  
>  * BDRV_BLOCK_GO_TO_BACKING if all extents report no
> BDRV_BLOCK_GO_TO_BACKING
> 
>  *  if some extents report BDRV_BLOCK_GO_TO_BACKING but others
> not.
> 
> 
> Hmm.. And, I think that there is a problem in the NBD protocol. Actually,
> with allocation-depth context we started to report internal layers of
> backing chain. And if we have layers with different request-alignment
> it's not correct to report allocation-depth "aligned" to top image
> request-alignment.. So, for allocation-depth to work correctly we should
> extend NBD protocol to allow unaligned chunks in BLOCK_STATUS report.

The NBD protocol says that base:allocation must obey allocation rules.
If we want to declare that "because qemu:allocation-depth is an
extension, we choose to NOT obey allocation rules, and if your client
connects to our extension, it MUST be prepared for what would normally
be non-compliant responses to NBD_CMD_BLOCK_STATUS", then we are free to
do so (it is our extension, after all).  Particularly since the only way
I could actually trigger it was with blkdebug (most format layers
support byte-level access, even when layered on top of a protocol layer
with a 512 or 4k minimum byte access).

So if you think it is better for me to respin the patch to fix ONLY
base:allocation bugs, but not qemu:allocation-depth, and instead merely
document the issue there, I could be persuaded to do so.

> 
> So, finally:
> 
> 1. making bitmap export extents aligned is a separate simple thing -
> that's OK
> 
> 2. making base:allocation aligned is possible due to good NBD_STATE_HOLE

Re: [PATCH 13/14] block: remove 'dirty-bitmaps' field from 'BlockInfo' struct

2021-02-25 Thread Vladimir Sementsov-Ogievskiy

24.02.2021 16:11, Daniel P. Berrangé wrote:

The same data is available in the 'BlockDeviceInfo' struct.

Signed-off-by: Daniel P. Berrangé


Reviewed-by: Vladimir Sementsov-Ogievskiy 

--
Best regards,
Vladimir



Re: [PATCH 1/5] iotests: Update 241 to expose backing layer fragmentation

2021-02-25 Thread Eric Blake
On 2/25/21 8:57 AM, Vladimir Sementsov-Ogievskiy wrote:
> 25.02.2021 16:50, Vladimir Sementsov-Ogievskiy wrote:
>> 18.02.2021 23:15, Eric Blake wrote:
>>> Previous commits (such as 6e280648, 75d34eb9) have mentioned that our
>>> NBD server still sends unaligned fragments when an active layer with
>>> large advertised minimum block size is backed by another layer with a
>>> smaller block size. Expand the test to actually cover these scenarios,
>>> by using two different approaches: qcow2 encryption (which forces
>>> 512-byte alignment) with an unaligned raw backing file, and blkdebug
>>> with a 4k alignment.
>>>

> 
> Now I don't think that aligning qemu:allocation-depth information is a
> correct thing to do.

Why not?  First, it's very rare that you'd have a qcow2 image with
mandated 4k minimum block size, backed by another qcow2 image with 512
block size (blkdebug made it possible to expose the bug, but I could not
find a way in common day-to-day usage), so we really aren't impacting
normal users.  Second, from the perspective of copying backing chains
over NBD, what difference does it make if we have the backing chain:

A (granularity 512) <- B (granularity 512) <- C (granularity 4k)

with the allocation pattern:

A: -A-A-A-A-A-A-A-A
B: --BB--BB--BB--BB
C: 

and report the allocation depth as:

   

instead of

   03220322

The former may be imprecise, but it obeys our bounds, and in all
reality, if all we have access to is 4k chunks, any decisions we make
about how to handle that 4k block should be based on the fact that at
least some of the data was allocated in our backing file, and not
treating the entire 4k as unallocated merely because the first 512 bytes
are neither in A nor B.

-- 
Eric Blake, Principal Software Engineer
Red Hat, Inc.   +1-919-301-3226
Virtualization:  qemu.org | libvirt.org




Re: [PATCH 12/14] block: remove dirty bitmaps 'status' field

2021-02-25 Thread Vladimir Sementsov-Ogievskiy

24.02.2021 16:11, Daniel P. Berrangé wrote:

The same information is available via the 'recording' and 'busy' fields.

Signed-off-by: Daniel P. Berrangé


Reviewed-by: Vladimir Sementsov-Ogievskiy 

--
Best regards,
Vladimir



Re: [PATCH 1/5] iotests: Update 241 to expose backing layer fragmentation

2021-02-25 Thread Eric Blake
On 2/25/21 7:50 AM, Vladimir Sementsov-Ogievskiy wrote:
> 18.02.2021 23:15, Eric Blake wrote:
>> Previous commits (such as 6e280648, 75d34eb9) have mentioned that our
>> NBD server still sends unaligned fragments when an active layer with
>> large advertised minimum block size is backed by another layer with a
>> smaller block size. Expand the test to actually cover these scenarios,
>> by using two different approaches: qcow2 encryption (which forces
>> 512-byte alignment) with an unaligned raw backing file, and blkdebug
>> with a 4k alignment.
>>
>> The encryption test passes with the desired results, but only because
>> the client side works around the server's non-compliance; if you
>> repeat the test manually with tracing turned on, you will see the
>> server sending a status for 1000 bytes of data then 1048 bytes of
>> hole, which is not aligned. But reverting commit 737d3f5244 shows that
>> it is indeed the client working around the bug in the server.
>>
>> Meanwhile, the blkdebug test gives incorrect results: remember, when
>> using x-dirty-bitmap with qemu-img map as a way to sniff alternative
>> metadata contexts, the meanings of "data" and "zero" are determined by
> 
> How tired I am of this abuse :) It seems the total amount of comments
> about it in code and commit messages justifies creating a more intuitive
> interface.. Do you have an idea in mind?

Yes: 'nbdinfo' as part of the libnbd project ;)

Sadly, libnbd is not available on all our common porting targets yet,
and nbdinfo is less than a year old (so even distros that have libnbd
1.0 are too old).

> 
>> that context.  Our client workaround is assuming that the fragmented
>> replies can be merged according to base:allocation rules, but those
>> rules do not work for other contexts (merging dirty and clean bitmap
>> should produce dirty; merging allocated and unallocated should produce
>> allocated; see the FIXME for more about the decoded values we expect).
> 
> You could instead keep the test output correct (without FIXME marks) but
> add the test to the "disabled" group and drop it from the group when fixed.

Either way, it's fixed by the end of the series.

> 
>>
>> Signed-off-by: Eric Blake 
> 
> Reviewed-by: Vladimir Sementsov-Ogievskiy 
> 

Thanks!

-- 
Eric Blake, Principal Software Engineer
Red Hat, Inc.   +1-919-301-3226
Virtualization:  qemu.org | libvirt.org




Re: [PATCH 1/5] iotests: Update 241 to expose backing layer fragmentation

2021-02-25 Thread Vladimir Sementsov-Ogievskiy

25.02.2021 16:50, Vladimir Sementsov-Ogievskiy wrote:

18.02.2021 23:15, Eric Blake wrote:

Previous commits (such as 6e280648, 75d34eb9) have mentioned that our
NBD server still sends unaligned fragments when an active layer with
large advertised minimum block size is backed by another layer with a
smaller block size. Expand the test to actually cover these scenarios,
by using two different approaches: qcow2 encryption (which forces
512-byte alignment) with an unaligned raw backing file, and blkdebug
with a 4k alignment.

The encryption test passes with the desired results, but only because
the client side works around the server's non-compliance; if you
repeat the test manually with tracing turned on, you will see the
server sending a status for 1000 bytes of data then 1048 bytes of
hole, which is not aligned. But reverting commit 737d3f5244 shows that
it is indeed the client working around the bug in the server.

Meanwhile, the blkdebug test gives incorrect results: remember, when
using x-dirty-bitmap with qemu-img map as a way to sniff alternative
metadata contexts, the meanings of "data" and "zero" are determined by


How tired I am of this abuse :) It seems the total amount of comments
about it in code and commit messages justifies creating a more intuitive
interface.. Do you have an idea in mind?


that context.  Our client workaround is assuming that the fragmented
replies can be merged according to base:allocation rules, but those
rules do not work for other contexts (merging dirty and clean bitmap
should produce dirty; merging allocated and unallocated should produce
allocated; see the FIXME for more about the decoded values we expect).


You could instead keep the test output correct (without FIXME marks) but
add the test to the "disabled" group and drop it from the group when fixed.



Signed-off-by: Eric Blake 


Reviewed-by: Vladimir Sementsov-Ogievskiy 



Now I don't think that aligning qemu:allocation-depth information is a correct 
thing to do.


--
Best regards,
Vladimir



Re: [PATCH 2/5] block: Fix BDRV_BLOCK_RAW status to honor alignment

2021-02-25 Thread Vladimir Sementsov-Ogievskiy

18.02.2021 23:15, Eric Blake wrote:

Previous patches mentioned how the blkdebug filter driver demonstrates
a bug present in our NBD server (for example, commit b0245d64); the
bug is also present with the raw format driver when probing
occurs. Basically, if we specify a particular alignment > 1, but defer
the actual block status to the underlying file, and the underlying
file has a smaller alignment, then the use of BDRV_BLOCK_RAW to defer
to the underlying file can end up with status split at an alignment
unacceptable to our layer.  Many callers don't care, but NBD does - it
is a violation of the NBD protocol to send status or read results
split on an unaligned boundary (in 737d3f5244, we taught our 4.0
client to be tolerant of such violations because the problem was even
more pronounced with qemu 3.1 as server; but we still need to fix our
server for the sake of other stricter clients).

This patch lays the groundwork - it adjusts bdrv_block_status to
ensure that any time one layer defers to another via BDRV_BLOCK_RAW,
the deferral is either truncated down to an aligned boundary, or
multiple sub-aligned requests are coalesced into a single
representative answer (using an implicit hole beyond EOF as
needed). Iotest 241 exposes the effect (when format probing occurred,
we don't want block status to subdivide the initial sector, and thus
not any other sector either). Similarly, iotest 221 is a good
candidate to amend to specifically track the effects; a change to a
hole at EOF is not visible if an upper layer does not want alignment
smaller than 512. However, this patch alone is not a complete fix - it
is still possible to have an active layer with large alignment
constraints backed by another layer with smaller constraints; so the
next patch will complete the task.

In particular, the next patch will introduce some mutual recursion
(bdrv_co_common_block_status_above will call this new function rather
than directly into bdrv_co_block_status), so some conditions added
here (such as a NULL pointer for map or handling a length-0 request)
are not reachable until that point.

There is one interesting corner case: prior to this patch, ALLOCATED
was never returned without either DATA or ZERO also set. But now, if
we have two subregions where the first reports status 0 (not
allocated), and the second reports ZERO|ALLOCATED but not DATA
(preallocated, read sees zero but underlying file has indeterminate
contents), then we can end up with a result of just
ALLOCATED. However, looking at callers of bdrv_is_allocated does not
find any problem with this new return value. What's more, this
situation doesn't really happen until the next patch adds support for
getting aligned status from backing files (as the use of aligned
status in this patch tends to be limited to just the protocol child of
a format driver, yet protocol drivers tend to report fully allocated,
and only format drivers have unallocated clusters that defer to a
backing file in the first place).
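
The corner case can be illustrated with a toy model of the coalescing rules (a sketch with made-up flag values, not the actual block/io.c code): DATA and ALLOCATED are set if any subregion sets them, while ZERO survives only if every subregion sets it.

```python
from functools import reduce
from operator import and_, or_

DATA, ZERO, ALLOCATED = 0x1, 0x2, 0x4    # hypothetical bit values

def coalesce(subregions):
    """Combine per-subregion status flags into one aligned answer."""
    any_bits = reduce(or_, subregions)
    all_bits = reduce(and_, subregions)
    # DATA/ALLOCATED: set for at least one subregion; ZERO: set for all.
    return (any_bits & (DATA | ALLOCATED)) | (all_bits & ZERO)

# Subregion 1: status 0 (unallocated).  Subregion 2: preallocated,
# reads as zeroes but no stored data -> ZERO | ALLOCATED.
result = coalesce([0, ZERO | ALLOCATED])
# The combined answer is ALLOCATED alone, without DATA or ZERO.
```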

Signed-off-by: Eric Blake 
---
  block/io.c | 142 +++--
  tests/qemu-iotests/221 |  13 
  tests/qemu-iotests/221.out |   9 +++
  tests/qemu-iotests/241.out |   3 +-
  4 files changed, 161 insertions(+), 6 deletions(-)

diff --git a/block/io.c b/block/io.c
index ca2dca30070e..4bca775c96b4 100644
--- a/block/io.c
+++ b/block/io.c
@@ -2325,6 +2325,132 @@ int bdrv_flush_all(void)
  return result;
  }

+static int coroutine_fn bdrv_co_block_status(BlockDriverState *bs,
+ bool want_zero,
+ int64_t offset, int64_t bytes,
+ int64_t *pnum, int64_t *map,
+ BlockDriverState **file);
+
+/*
+ * Returns an aligned allocation status of the specified disk region.
+ *
+ * Wrapper around bdrv_co_block_status() which requires the initial
+ * @offset and @count to be aligned to @align (must be power of 2),
+ * and guarantees the resulting @pnum will also be aligned; this may
+ * require piecing together multiple sub-aligned queries into an
+ * appropriate coalesced answer, as follows:
+ *
+ * - BDRV_BLOCK_DATA is set if the flag is set for at least one subregion
+ * - BDRV_BLOCK_ZERO is set only if the flag is set for all subregions
+ * - BDRV_BLOCK_OFFSET_VALID is set only if all subregions are contiguous
+ *   from the same file (@map and @file are then from the first subregion)
+ * - BDRV_BLOCK_ALLOCATED is set if the flag is set for at least one subregion


Hmm about this..

We already have a mess around ALLOCATED:

 [1] for format drivers it means that "read is handled by this layer, not by 
backing", i.e. data (or zero) is placed exactly on that layer of the backing chain

 [2] for protocol drivers it's up to the driver, which may always report 
ALLOCATED (being more compatible with format drivers) or it may sometimes 
return UNALLOCATED to show that 

Re: [PATCH v2 01/22] block: add eMMC block device type

2021-02-25 Thread Cédric Le Goater
On 2/24/21 8:13 PM, Sai Pavan Boddu wrote:
> Hi Cedric,
> 
> 
>> -Original Message-
>> From: Cédric Le Goater 
>> Sent: Wednesday, February 24, 2021 7:25 PM
>> To: Stefan Hajnoczi ; Sai Pavan Boddu
>> 
>> Cc: Philippe Mathieu-Daudé ; Markus Armbruster
>> ; Kevin Wolf ; Max Reitz
>> ; Vladimir Sementsov-Ogievskiy
>> ; Eric Blake ; Joel Stanley
>> ; Vincent Palatin ; Dr. David Alan
>> Gilbert ; Thomas Huth ; Peter
>> Maydell ; Alistair Francis
>> ; Edgar Iglesias ; Luc Michel
>> ; Paolo Bonzini ; qemu-
>> de...@nongnu.org; qemu-block@nongnu.org
>> Subject: Re: [PATCH v2 01/22] block: add eMMC block device type
>>
>> On 2/24/21 12:40 PM, Stefan Hajnoczi wrote:
>>> On Tue, Feb 23, 2021 at 05:35:20PM +, Sai Pavan Boddu wrote:
 Hi Philippe,

> -Original Message-
> From: Philippe Mathieu-Daudé 
> Sent: Monday, February 22, 2021 5:34 PM
> To: Sai Pavan Boddu ; Markus Armbruster
> ; Kevin Wolf ; Max Reitz
> ; Vladimir Sementsov-Ogievskiy
> ; Eric Blake ; Joel
> Stanley ; Cédric Le Goater ; Vincent
> Palatin ; Dr. David Alan Gilbert
> ; Thomas Huth ; Stefan
> Hajnoczi ; Peter Maydell
> ; Alistair Francis
> ; Edgar Iglesias ; Luc
> Michel ; Paolo Bonzini
> 
> Cc: Sai Pavan Boddu ; qemu-de...@nongnu.org;
> qemu- bl...@nongnu.org
> Subject: Re: [PATCH v2 01/22] block: add eMMC block device type
>
> On 2/22/21 9:20 AM, Sai Pavan Boddu wrote:
>> From: Vincent Palatin 
>>
>> Add new block device type.
>>
>> Signed-off-by: Vincent Palatin 
>> [SPB: Rebased over 5.1 version]
>> Signed-off-by: Sai Pavan Boddu 
>> Signed-off-by: Joel Stanley 
>> Signed-off-by: Cédric Le Goater 
>> Reviewed-by: Alistair Francis 
>> ---
>>  include/sysemu/blockdev.h | 1 +
>>  blockdev.c| 1 +
>>  2 files changed, 2 insertions(+)
>>
>> diff --git a/include/sysemu/blockdev.h b/include/sysemu/blockdev.h
>> index 3b5fcda..eefae9f 100644
>> --- a/include/sysemu/blockdev.h
>> +++ b/include/sysemu/blockdev.h
>> @@ -24,6 +24,7 @@ typedef enum {
>>   */
>>  IF_NONE = 0,
>>  IF_IDE, IF_SCSI, IF_FLOPPY, IF_PFLASH, IF_MTD, IF_SD,
>> IF_VIRTIO, IF_XEN,
>> +IF_EMMC,
>>  IF_COUNT
>>  } BlockInterfaceType;
>>
>> diff --git a/blockdev.c b/blockdev.c index cd438e6..390d43c 100644
>> --- a/blockdev.c
>> +++ b/blockdev.c
>> @@ -83,6 +83,7 @@ static const char *const if_name[IF_COUNT] = {
>>  [IF_SD] = "sd",
>>  [IF_VIRTIO] = "virtio",
>>  [IF_XEN] = "xen",
>> +[IF_EMMC] = "emmc",
>>  };
>
> We don't need to introduce support for the legacy -drive magic.
>
> -device should be enough for this device, right?
[Sai Pavan Boddu] I was looking to use -device for emmc. But I see we
>> anyway need blockdev support for this, which would require us to use -drive.

 Can you give some pointers, how to approach this ?
>>>
>>> It is probably not necessary to add a new IF_ constant. Would this work:
>>>
>>>   -drive if=none,id=emmc0,file=test.img,format=raw
>>>   -device emmc,...,drive=emmc0
>>>
>>> Or the more modern:
>>>
>>>   -blockdev node-name=emmc0,driver=file,filename=test.img
>>>   -device emmc,...,drive=emmc0
>>>
>>> ?
>>>
>>> (The syntax might need small tweaks but is shows the general idea.)
>>
>> Yes. This is better.
>>
>> We could have an "emmc" device inheriting from "sd-card". The "emmc"
>> property would not be necessary anymore and may be, we could cleanup up
>> some parts doing :
>>
>> if (sd->emmc) { /* eMMC */
>> ...
>> } else {
>>
>> }
>>
>> with SDCardClass handlers. the SWITCH_FUNCTION command is a good
>> candidate, CMD8 also.
> [Sai Pavan Boddu] Nice, this approach looks clean.
> But we may still be depending on the emmc property. Not sure!

It could be an SDCardClass attribute instead, since it's a constant for
the emmc device.

C. 

> 
> I would get back with v3, your review on those patches would be great.
> 
> Thanks & Regards,
> Sai Pavan
>>
>> C.
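
As a rough analogy for the SDCardClass approach discussed above (Python standing in for QOM's C class structs; all names are hypothetical), per-class handlers let command dispatch replace scattered `if (sd->emmc)` branches:

```python
class SDCardClass:
    """Stand-in for the C SDCardClass struct with per-class handlers."""
    is_emmc = False                      # class attribute, as suggested

    def switch_function(self, arg):
        return "sd switch 0x%x" % arg    # SD-specific SWITCH_FUNCTION path

class EMMCClass(SDCardClass):
    is_emmc = True

    def switch_function(self, arg):
        return "emmc switch 0x%x" % arg  # eMMC-specific path, no if-branch

def handle_switch(card, arg):
    # Common code just dispatches; no `if card.is_emmc` branching needed.
    return card.switch_function(arg)

reply = handle_switch(EMMCClass(), 0x03B90100)
```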




Re: [PATCH 1/5] iotests: Update 241 to expose backing layer fragmentation

2021-02-25 Thread Vladimir Sementsov-Ogievskiy

18.02.2021 23:15, Eric Blake wrote:

Previous commits (such as 6e280648, 75d34eb9) have mentioned that our
NBD server still sends unaligned fragments when an active layer with
large advertised minimum block size is backed by another layer with a
smaller block size. Expand the test to actually cover these scenarios,
by using two different approaches: qcow2 encryption (which forces
512-byte alignment) with an unaligned raw backing file, and blkdebug
with a 4k alignment.

The encryption test passes with the desired results, but only because
the client side works around the server's non-compliance; if you
repeat the test manually with tracing turned on, you will see the
server sending a status for 1000 bytes of data then 1048 bytes of
hole, which is not aligned. But reverting commit 737d3f5244 shows that
it is indeed the client working around the bug in the server.

Meanwhile, the blkdebug test gives incorrect results: remember, when
using x-dirty-bitmap with qemu-img map as a way to sniff alternative
metadata contexts, the meanings of "data" and "zero" are determined by


How tired I am of this abuse :) It seems the total amount of comments
about it in code and commit messages justifies creating a more intuitive
interface.. Do you have an idea in mind?


that context.  Our client workaround is assuming that the fragmented
replies can be merged according to base:allocation rules, but those
rules do not work for other contexts (merging dirty and clean bitmap
should produce dirty; merging allocated and unallocated should produce
allocated; see the FIXME for more about the decoded values we expect).
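
A small sketch (not qemu code) of why the base:allocation merge rule cannot be reused for other metadata contexts: when fragments disagree, each context has its own "safe" direction.

```python
# One boolean per fragment, meaning "this context's bit is set here".

def merge_dirty(fragments):
    # Dirty-bitmap context: dirty + clean must merge to dirty.
    return any(fragments)

def merge_allocated(fragments):
    # Allocation context: allocated + unallocated must merge to allocated.
    return any(fragments)

def merge_zero(fragments):
    # base:allocation "zero" bit: only safe if every fragment reads as zero.
    return all(fragments)

mixed = [True, False]   # two fragments that disagree
```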


You could instead keep the test output correct (without FIXME marks) but
add the test to the "disabled" group and drop it from the group when fixed.



Signed-off-by: Eric Blake 


Reviewed-by: Vladimir Sementsov-Ogievskiy 


---
  tests/qemu-iotests/241 | 60 ++
  tests/qemu-iotests/241.out | 21 +
  2 files changed, 76 insertions(+), 5 deletions(-)

diff --git a/tests/qemu-iotests/241 b/tests/qemu-iotests/241
index c962c8b6075d..5217af82dc65 100755
--- a/tests/qemu-iotests/241
+++ b/tests/qemu-iotests/241
@@ -3,7 +3,7 @@
  #
  # Test qemu-nbd vs. unaligned images
  #
-# Copyright (C) 2018-2019 Red Hat, Inc.
+# Copyright (C) 2018-2021 Red Hat, Inc.
  #
  # This program is free software; you can redistribute it and/or modify
  # it under the terms of the GNU General Public License as published by
@@ -27,7 +27,7 @@ status=1 # failure is the default!
  _cleanup()
  {
  _cleanup_test_img
-rm -f "$TEST_DIR/server.log"
+rm -f "$TEST_DIR/server.log" "$TEST_IMG_FILE.mid" "$TEST_IMG_FILE.qcow2"
  nbd_server_stop
  }
  trap "_cleanup; exit \$status" 0 1 2 3 15
@@ -37,7 +37,7 @@ trap "_cleanup; exit \$status" 0 1 2 3 15
  . ./common.filter
  . ./common.nbd

-_supported_fmt raw
+_supported_fmt raw # although the test also requires use of qcow2
  _supported_proto nbd
  _supported_os Linux
  _require_command QEMU_NBD
@@ -89,11 +89,61 @@ $QEMU_IMG map --output=json "$TEST_IMG" | 
_filter_qemu_img_map
  $QEMU_IO -c map "$TEST_IMG"
  nbd_server_stop

-# Not tested yet: we also want to ensure that qemu as NBD client does
+echo
+echo "=== Encrypted qcow2 file backed by unaligned raw image ==="
+echo
+
+# Enabling encryption in qcow2 forces 512-alignment
+SECRET=secret,id=sec0,data=12345
+$QEMU_IMG create -f qcow2 -b "$TEST_IMG_FILE" -F raw --object "$SECRET" \
+  -o encrypt.format=luks,encrypt.key-secret=sec0,encrypt.iter-time=10 \
+  "$TEST_IMG_FILE.qcow2" 2k | _filter_img_create
+nbd_server_start_unix_socket --object "$SECRET" --image-opts \
+  driver=qcow2,file.filename="$TEST_IMG_FILE.qcow2",encrypt.key-secret=sec0
+
+$QEMU_NBD_PROG --list -k $nbd_unix_socket | grep '\(size\|min\)'
+$QEMU_IMG map --output=json "$TEST_IMG" | _filter_qemu_img_map
+$QEMU_IO -c map "$TEST_IMG"
+nbd_server_stop
+
+echo
+echo "=== Use blkdebug for larger alignment than backing layer ==="
+echo
+
+$QEMU_IMG create -f qcow2 -o cluster_size=1024 \
+  "$TEST_IMG_FILE" 2k | _filter_img_create
+$QEMU_IMG create -f qcow2 -b "$TEST_IMG_FILE" -F qcow2 -o cluster_size=512 \
+  "$TEST_IMG_FILE.mid" | _filter_img_create
+$QEMU_IMG bitmap --add -g 512 --enable "$TEST_IMG_FILE.mid" b0
+$QEMU_IO -f qcow2 -c 'w 512 512' "$TEST_IMG_FILE.mid" | _filter_qemu_io
+$QEMU_IMG create -f qcow2 -b "$TEST_IMG_FILE.mid" -F qcow2 \
+  "$TEST_IMG_FILE.qcow2" 4k | _filter_img_create
+FILE=image.file.filename="$TEST_IMG_FILE.qcow2"
+nbd_server_start_unix_socket -B b0 -A --image-opts \
+  driver=blkdebug,align=4096,image.driver=qcow2,image.file.driver=file,"$FILE"
+
+TEST_IMG="driver=nbd,server.type=unix,server.path=$nbd_unix_socket"
+$QEMU_NBD_PROG --list -k $nbd_unix_socket | grep '\(size\|min\)'
+# FIXME: this should report a single 4k block of "data":false which translates
+# to the dirty bitmap being set for at least part of the region; "data":true
+# is wrong unless the entire 4k is clean.
+$QEMU_IMG map --output=json 

Re: [PATCH 06/14] machine: remove 'query-cpus' QMP command

2021-02-25 Thread Wainer dos Santos Moschetta

Hi,

On 2/24/21 10:11 AM, Daniel P. Berrangé wrote:

The newer 'query-cpus-fast' command avoids side effects on the guest
execution. Note that some of the field names are different in the
'query-cpus-fast' command.

Signed-off-by: Daniel P. Berrangé 
---
  docs/system/deprecated.rst |   5 -
  docs/system/removed-features.rst   |   5 +
  hw/core/machine-hmp-cmds.c |   8 +-
  hw/core/machine-qmp-cmds.c |  79 --
  qapi/machine.json  | 161 +
  tests/acceptance/pc_cpu_hotplug_props.py   |   2 +-
  tests/acceptance/x86_cpu_model_versions.py |   2 +-
  tests/migration/guestperf/engine.py|   2 +-
  tests/qtest/numa-test.c|   6 +-
  tests/qtest/qmp-test.c |   6 +-
  tests/qtest/test-x86-cpuid-compat.c|   4 +-
  11 files changed, 22 insertions(+), 258 deletions(-)

diff --git a/docs/system/deprecated.rst b/docs/system/deprecated.rst
index e214f0a9cf..484f017119 100644
--- a/docs/system/deprecated.rst
+++ b/docs/system/deprecated.rst
@@ -192,11 +192,6 @@ Since the ``dirty-bitmaps`` field is optionally present in both the old and
  new locations, clients must use introspection to learn where to anticipate
  the field if/when it does appear in command output.
  
-``query-cpus`` (since 2.12.0)
-'''''''''''''''''''''''''''''
-
-The ``query-cpus`` command is replaced by the ``query-cpus-fast`` command.
-
  ``query-cpus-fast`` ``arch`` output member (since 3.0.0)
  ''''''''''''''''''''''''''''''''''''''''''''''''''''''''
  
diff --git a/docs/system/removed-features.rst b/docs/system/removed-features.rst

index 2c5513dcc7..ad146daf9b 100644
--- a/docs/system/removed-features.rst
+++ b/docs/system/removed-features.rst
@@ -95,6 +95,11 @@ Use ``migrate_set_parameter`` and ``info migrate_parameters`` instead.
  
  Use ``migrate_set_parameter`` instead.
  
+``query-cpus`` (removed in 6.0)
+'''''''''''''''''''''''''''''''
+
+The ``query-cpus`` command is replaced by the ``query-cpus-fast`` command.
+
  Human Monitor Protocol (HMP) commands
  -------------------------------------
  
diff --git a/hw/core/machine-hmp-cmds.c b/hw/core/machine-hmp-cmds.c

index 6357be9c6b..58248cffa3 100644
--- a/hw/core/machine-hmp-cmds.c
+++ b/hw/core/machine-hmp-cmds.c
@@ -130,7 +130,7 @@ void hmp_info_numa(Monitor *mon, const QDict *qdict)
  {
  int i, nb_numa_nodes;
  NumaNodeMem *node_mem;
-CpuInfoList *cpu_list, *cpu;
+CpuInfoFastList *cpu_list, *cpu;
  MachineState *ms = MACHINE(qdev_get_machine());
  
  nb_numa_nodes = ms->numa_state ? ms->numa_state->num_nodes : 0;

@@ -139,7 +139,7 @@ void hmp_info_numa(Monitor *mon, const QDict *qdict)
  return;
  }
  
-cpu_list = qmp_query_cpus(&error_abort);
+cpu_list = qmp_query_cpus_fast(&error_abort);
  node_mem = g_new0(NumaNodeMem, nb_numa_nodes);
  
  query_numa_node_mem(node_mem, ms);

@@ -148,7 +148,7 @@ void hmp_info_numa(Monitor *mon, const QDict *qdict)
  for (cpu = cpu_list; cpu; cpu = cpu->next) {
  if (cpu->value->has_props && cpu->value->props->has_node_id &&
  cpu->value->props->node_id == i) {
-monitor_printf(mon, " %" PRIi64, cpu->value->CPU);
+monitor_printf(mon, " %" PRIi64, cpu->value->cpu_index);
  }
  }
  monitor_printf(mon, "\n");
@@ -157,6 +157,6 @@ void hmp_info_numa(Monitor *mon, const QDict *qdict)
  monitor_printf(mon, "node %d plugged: %" PRId64 " MB\n", i,
 node_mem[i].node_plugged_mem >> 20);
  }
-qapi_free_CpuInfoList(cpu_list);
+qapi_free_CpuInfoFastList(cpu_list);
  g_free(node_mem);
  }
diff --git a/hw/core/machine-qmp-cmds.c b/hw/core/machine-qmp-cmds.c
index 44e979e503..af60cd969d 100644
--- a/hw/core/machine-qmp-cmds.c
+++ b/hw/core/machine-qmp-cmds.c
@@ -24,85 +24,6 @@
  #include "sysemu/runstate.h"
  #include "sysemu/sysemu.h"
  
-CpuInfoList *qmp_query_cpus(Error **errp)

-{
-MachineState *ms = MACHINE(qdev_get_machine());
-MachineClass *mc = MACHINE_GET_CLASS(ms);
-CpuInfoList *head = NULL, **tail = &head;
-CPUState *cpu;
-
-CPU_FOREACH(cpu) {
-CpuInfo *value;
-#if defined(TARGET_I386)
-X86CPU *x86_cpu = X86_CPU(cpu);
-CPUX86State *env = &x86_cpu->env;
-#elif defined(TARGET_PPC)
-PowerPCCPU *ppc_cpu = POWERPC_CPU(cpu);
-CPUPPCState *env = &ppc_cpu->env;
-#elif defined(TARGET_SPARC)
-SPARCCPU *sparc_cpu = SPARC_CPU(cpu);
-CPUSPARCState *env = &sparc_cpu->env;
-#elif defined(TARGET_RISCV)
-RISCVCPU *riscv_cpu = RISCV_CPU(cpu);
-CPURISCVState *env = &riscv_cpu->env;
-#elif defined(TARGET_MIPS)
-MIPSCPU *mips_cpu = MIPS_CPU(cpu);
-CPUMIPSState *env = &mips_cpu->env;
-#elif defined(TARGET_TRICORE)
-TriCoreCPU *tricore_cpu = TRICORE_CPU(cpu);
-CPUTriCoreState *env = &tricore_cpu->env;
-#elif defined(TARGET_S390X)
-S390CPU 

[PATCH v2 3/3] block/qcow2: introduce inflight writes counters: fix discard

2021-02-25 Thread Vladimir Sementsov-Ogievskiy
There is a bug in qcow2: a host cluster can be discarded (its refcount
becomes 0) and reused during a data write. In this case the data write
may pollute another (recently allocated) cluster or even metadata.

To fix the issue introduce rw-lock: take read-lock on data writing and
write-lock on discard.

Enable qcow2-discard-during-rewrite as it is fixed.
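
The locking discipline above (shared lock around data writes, exclusive lock
around discard) can be illustrated with a small toy model. This is an
editor's sketch only: QEMU's CoRwlock is coroutine-based, while the model
below uses ordinary threads, and `ToyRwLock`/`data_write` are hypothetical
names, not QEMU code.

```python
import threading
import time

class ToyRwLock:
    """Minimal reader-writer lock standing in for QEMU's CoRwlock."""

    def __init__(self):
        self._cond = threading.Condition()
        self._readers = 0
        self._writer = False

    def rdlock(self):
        with self._cond:
            while self._writer:
                self._cond.wait()
            self._readers += 1

    def rdunlock(self):
        with self._cond:
            self._readers -= 1
            self._cond.notify_all()

    def wrlock(self):
        # Discard must wait until no data write holds the lock shared.
        with self._cond:
            while self._writer or self._readers:
                self._cond.wait()
            self._writer = True

    def wrunlock(self):
        with self._cond:
            self._writer = False
            self._cond.notify_all()

lock = ToyRwLock()
events = []
write_started = threading.Event()

def data_write():
    lock.rdlock()               # like the data-write path taking rdlock
    write_started.set()
    time.sleep(0.05)            # data write in progress
    events.append("write done")
    lock.rdunlock()

t = threading.Thread(target=data_write)
t.start()
write_started.wait()
lock.wrlock()                   # like the discard path: blocks until
events.append("discard")        # the in-flight write has finished
lock.wrunlock()
t.join()
assert events == ["write done", "discard"]
```

The discard is serialized after every write that was already in flight,
which is exactly the property the fix relies on: the host cluster cannot
be released and reused while a write to it is still running.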

Signed-off-by: Vladimir Sementsov-Ogievskiy 
---
 block/qcow2.h  |  2 ++
 block/qcow2-cluster.c  |  4 
 block/qcow2.c  | 18 +-
 .../tests/qcow2-discard-during-rewrite |  2 +-
 4 files changed, 24 insertions(+), 2 deletions(-)

diff --git a/block/qcow2.h b/block/qcow2.h
index 0678073b74..7ebb6e2677 100644
--- a/block/qcow2.h
+++ b/block/qcow2.h
@@ -420,6 +420,8 @@ typedef struct BDRVQcow2State {
  * is to convert the image with the desired compression type set.
  */
 Qcow2CompressionType compression_type;
+
+CoRwlock discard_rw_lock;
 } BDRVQcow2State;
 
 typedef struct Qcow2COWRegion {
diff --git a/block/qcow2-cluster.c b/block/qcow2-cluster.c
index bd0597842f..e16775dd59 100644
--- a/block/qcow2-cluster.c
+++ b/block/qcow2-cluster.c
@@ -1938,6 +1938,8 @@ int qcow2_cluster_discard(BlockDriverState *bs, uint64_t offset,
 int64_t cleared;
 int ret;
 
+qemu_co_rwlock_wrlock(&s->discard_rw_lock);
+
 /* Caller must pass aligned values, except at image end */
 assert(QEMU_IS_ALIGNED(offset, s->cluster_size));
 assert(QEMU_IS_ALIGNED(end_offset, s->cluster_size) ||
@@ -1965,6 +1967,8 @@ fail:
 s->cache_discards = false;
 qcow2_process_discards(bs, ret);
 
+qemu_co_rwlock_unlock(&s->discard_rw_lock);
+
 return ret;
 }
 
diff --git a/block/qcow2.c b/block/qcow2.c
index d9f49a52e7..e1a5d89aa1 100644
--- a/block/qcow2.c
+++ b/block/qcow2.c
@@ -1897,6 +1897,7 @@ static int qcow2_open(BlockDriverState *bs, QDict *options, int flags,
 
 /* Initialise locks */
 qemu_co_mutex_init(&s->lock);
+qemu_co_rwlock_init(&s->discard_rw_lock);
 
 if (qemu_in_coroutine()) {
 /* From bdrv_co_create.  */
@@ -2536,12 +2537,14 @@ static coroutine_fn int qcow2_co_pwritev_task(BlockDriverState *bs,
 }
 }
 
+qemu_co_rwlock_unlock(&s->discard_rw_lock);
 qemu_co_mutex_lock(&s->lock);
 
 ret = qcow2_handle_l2meta(bs, , true);
 goto out_locked;
 
 out_unlocked:
+qemu_co_rwlock_unlock(&s->discard_rw_lock);
 qemu_co_mutex_lock(&s->lock);
 
 out_locked:
@@ -2605,6 +2608,8 @@ static coroutine_fn int qcow2_co_pwritev_part(
 goto out_locked;
 }
 
+qemu_co_rwlock_rdlock(&s->discard_rw_lock);
+
 qemu_co_mutex_unlock(&s->lock);
 
 if (!aio && cur_bytes != bytes) {
@@ -4097,10 +4102,15 @@ qcow2_co_copy_range_to(BlockDriverState *bs,
 goto fail;
 }
 
+qemu_co_rwlock_rdlock(&s->discard_rw_lock);
 qemu_co_mutex_unlock(&s->lock);
+
 ret = bdrv_co_copy_range_to(src, src_offset, s->data_file, host_offset,
 cur_bytes, read_flags, write_flags);
+
+qemu_co_rwlock_unlock(&s->discard_rw_lock);
 qemu_co_mutex_lock(&s->lock);
+
 if (ret < 0) {
 goto fail;
 }
@@ -4536,13 +4546,19 @@ qcow2_co_pwritev_compressed_task(BlockDriverState *bs,
 }
 
 ret = qcow2_pre_write_overlap_check(bs, 0, cluster_offset, out_len, true);
-qemu_co_mutex_unlock(&s->lock);
 if (ret < 0) {
+qemu_co_mutex_unlock(&s->lock);
 goto fail;
 }
 
+qemu_co_rwlock_rdlock(&s->discard_rw_lock);
+qemu_co_mutex_unlock(&s->lock);
+
 BLKDBG_EVENT(s->data_file, BLKDBG_WRITE_COMPRESSED);
 ret = bdrv_co_pwrite(s->data_file, cluster_offset, out_len, out_buf, 0);
+
+qemu_co_rwlock_unlock(&s->discard_rw_lock);
+
 if (ret < 0) {
 goto fail;
 }
diff --git a/tests/qemu-iotests/tests/qcow2-discard-during-rewrite b/tests/qemu-iotests/tests/qcow2-discard-during-rewrite
index dd9964ef3f..5df0048757 100755
--- a/tests/qemu-iotests/tests/qcow2-discard-during-rewrite
+++ b/tests/qemu-iotests/tests/qcow2-discard-during-rewrite
@@ -1,5 +1,5 @@
 #!/usr/bin/env bash
-# group: quick disabled
+# group: quick
 #
 # Test discarding (and reusing) host cluster during writing data to it.
 #
-- 
2.29.2




[PATCH v2 1/3] qemu-io: add aio_discard

2021-02-25 Thread Vladimir Sementsov-Ogievskiy
Add aio_discard command like existing aio_write. It will be used in
further test.

Signed-off-by: Vladimir Sementsov-Ogievskiy 
---
 qemu-io-cmds.c | 117 +
 1 file changed, 117 insertions(+)

diff --git a/qemu-io-cmds.c b/qemu-io-cmds.c
index 97611969cb..28b5c3c092 100644
--- a/qemu-io-cmds.c
+++ b/qemu-io-cmds.c
@@ -1332,6 +1332,7 @@ struct aio_ctx {
 BlockBackend *blk;
 QEMUIOVector qiov;
 int64_t offset;
+int64_t discard_bytes;
 char *buf;
 bool qflag;
 bool vflag;
@@ -1343,6 +1344,34 @@ struct aio_ctx {
 struct timespec t1;
 };
 
+static void aio_discard_done(void *opaque, int ret)
+{
+struct aio_ctx *ctx = opaque;
+struct timespec t2;
+
+clock_gettime(CLOCK_MONOTONIC, &t2);
+
+
+if (ret < 0) {
+printf("aio_discard failed: %s\n", strerror(-ret));
+block_acct_failed(blk_get_stats(ctx->blk), &ctx->acct);
+goto out;
+}
+
+block_acct_done(blk_get_stats(ctx->blk), &ctx->acct);
+
+if (ctx->qflag) {
+goto out;
+}
+
+/* Finally, report back -- -C gives a parsable format */
+t2 = tsub(t2, ctx->t1);
+print_report("discarded", &t2, ctx->offset, ctx->discard_bytes,
+ ctx->discard_bytes, 1, ctx->Cflag);
+out:
+g_free(ctx);
+}
+
 static void aio_write_done(void *opaque, int ret)
 {
 struct aio_ctx *ctx = opaque;
@@ -1671,6 +1700,93 @@ static int aio_write_f(BlockBackend *blk, int argc, char **argv)
 return 0;
 }
 
+static void aio_discard_help(void)
+{
+printf(
+"\n"
+" asynchronously discards a range of bytes from the given offset\n"
+"\n"
+" Example:\n"
+" 'aio_discard 0 64k' - discards 64K at start of a disk\n"
+"\n"
+" Note that due to its asynchronous nature, this command will be\n"
+" considered successful once the request is submitted, independently\n"
+" of potential I/O errors or pattern mismatches.\n"
+" -C, -- report statistics in a machine parsable format\n"
+" -i, -- treat request as invalid, for exercising stats\n"
+" -q, -- quiet mode, do not show I/O statistics\n"
+"\n");
+}
+
+static int aio_discard_f(BlockBackend *blk, int argc, char **argv);
+
+static const cmdinfo_t aio_discard_cmd = {
+.name   = "aio_discard",
+.cfunc  = aio_discard_f,
+.perm   = BLK_PERM_WRITE,
+.argmin = 2,
+.argmax = -1,
+.args   = "[-Ciq] off len",
+.oneline= "asynchronously discards a number of bytes",
+.help   = aio_discard_help,
+};
+
+static int aio_discard_f(BlockBackend *blk, int argc, char **argv)
+{
+int ret;
+int c;
+struct aio_ctx *ctx = g_new0(struct aio_ctx, 1);
+
+ctx->blk = blk;
+while ((c = getopt(argc, argv, "Ciq")) != -1) {
+switch (c) {
+case 'C':
+ctx->Cflag = true;
+break;
+case 'q':
+ctx->qflag = true;
+break;
+case 'i':
+printf("injecting invalid discard request\n");
+block_acct_invalid(blk_get_stats(blk), BLOCK_ACCT_UNMAP);
+g_free(ctx);
+return 0;
+default:
+g_free(ctx);
+qemuio_command_usage(&aio_discard_cmd);
+return -EINVAL;
+}
+}
+
+if (optind != argc - 2) {
+g_free(ctx);
+qemuio_command_usage(&aio_discard_cmd);
+return -EINVAL;
+}
+
+ctx->offset = cvtnum(argv[optind]);
+if (ctx->offset < 0) {
+ret = ctx->offset;
+print_cvtnum_err(ret, argv[optind]);
+g_free(ctx);
+return ret;
+}
+optind++;
+
+ctx->discard_bytes = cvtnum(argv[optind]);
+if (ctx->discard_bytes < 0) {
+ret = ctx->discard_bytes;
+print_cvtnum_err(ret, argv[optind]);
+g_free(ctx);
+return ret;
+}
+
+blk_aio_pdiscard(blk, ctx->offset, ctx->discard_bytes,
+ aio_discard_done, ctx);
+
+return 0;
+}
+
 static int aio_flush_f(BlockBackend *blk, int argc, char **argv)
 {
 BlockAcctCookie cookie;
@@ -2494,6 +2610,7 @@ static void __attribute((constructor)) init_qemuio_commands(void)
 qemuio_add_command(_cmd);
 qemuio_add_command(_cmd);
 qemuio_add_command(_cmd);
+qemuio_add_command(&aio_discard_cmd);
 qemuio_add_command(&aio_read_cmd);
 qemuio_add_command(&aio_write_cmd);
 qemuio_add_command(&aio_flush_cmd);
-- 
2.29.2




[PATCH v2 2/3] iotests: add qcow2-discard-during-rewrite

2021-02-25 Thread Vladimir Sementsov-Ogievskiy
Simple test:
 - start writing to allocated cluster A
 - discard this cluster
 - write to another unallocated cluster B (it's allocated in same place
   where A was allocated)
 - continue writing to A

For now, the last action pollutes cluster B, which is a bug fixed by the
following commit.

For now, add test to "disabled" group, so that it doesn't run
automatically.
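
The four steps above can be modeled with a toy cluster allocator. This is
an illustrative sketch only (`ToyQcow2` and its methods are hypothetical
names, not qcow2 code) showing why the late write to A lands in B:

```python
class ToyQcow2:
    """Toy model of the race: host clusters are list slots; discard
    frees a slot, allocation reuses the lowest free slot. There is no
    locking, so a write that started before the discard lands in
    whatever now occupies that host cluster."""

    def __init__(self, n_clusters=4):
        self.host = [None] * n_clusters       # host cluster contents
        self.free = list(range(n_clusters))   # free-list of host clusters
        self.map = {}                         # guest cluster -> host cluster

    def allocate(self, guest):
        host = self.free.pop(0)
        self.map[guest] = host
        return host

    def discard(self, guest):
        # Refcount drops to 0: the host cluster goes back on the free-list.
        self.free.insert(0, self.map.pop(guest))

    def write(self, host, data):
        self.host[host] = data


img = ToyQcow2()
host_a = img.allocate("A")      # start writing to allocated cluster A...
img.discard("A")                # ...discard this cluster...
host_b = img.allocate("B")      # ...B is allocated in the same place...
img.write(host_b, "data-B")
img.write(host_a, "data-A")     # ...the delayed write to A pollutes B
assert host_a == host_b
assert img.host[img.map["B"]] == "data-A"   # B's data was overwritten
```

The final assertion is precisely the pollution this iotest reproduces
against the real driver.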

Signed-off-by: Vladimir Sementsov-Ogievskiy 
---
 .../tests/qcow2-discard-during-rewrite| 99 +++
 .../tests/qcow2-discard-during-rewrite.out| 17 
 2 files changed, 116 insertions(+)
 create mode 100755 tests/qemu-iotests/tests/qcow2-discard-during-rewrite
 create mode 100644 tests/qemu-iotests/tests/qcow2-discard-during-rewrite.out

diff --git a/tests/qemu-iotests/tests/qcow2-discard-during-rewrite b/tests/qemu-iotests/tests/qcow2-discard-during-rewrite
new file mode 100755
index 00..dd9964ef3f
--- /dev/null
+++ b/tests/qemu-iotests/tests/qcow2-discard-during-rewrite
@@ -0,0 +1,99 @@
+#!/usr/bin/env bash
+# group: quick disabled
+#
+# Test discarding (and reusing) host cluster during writing data to it.
+#
+# Copyright (c) 2021 Virtuozzo International GmbH.
+#
+# This program is free software; you can redistribute it and/or modify
+# it under the terms of the GNU General Public License as published by
+# the Free Software Foundation; either version 2 of the License, or
+# (at your option) any later version.
+#
+# This program is distributed in the hope that it will be useful,
+# but WITHOUT ANY WARRANTY; without even the implied warranty of
+# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+# GNU General Public License for more details.
+#
+# You should have received a copy of the GNU General Public License
+# along with this program.  If not, see <http://www.gnu.org/licenses/>.
+#
+
+# creator
+owner=vsement...@virtuozzo.com
+
+seq=`basename $0`
+echo "QA output created by $seq"
+
+status=1   # failure is the default!
+
+_cleanup()
+{
+_cleanup_test_img
+}
+trap "_cleanup; exit \$status" 0 1 2 3 15
+
+# get standard environment, filters and checks
+. ./../common.rc
+. ./../common.filter
+
+_supported_fmt qcow2
+_supported_proto file fuse
+_supported_os Linux
+
+size=1M
+_make_test_img $size
+
+(
+cat <

[PATCH v2(RFC) 0/3] qcow2: fix parallel rewrite and discard

2021-02-25 Thread Vladimir Sementsov-Ogievskiy
Hi all! It turns out that nothing prevents discarding and reallocating a host
cluster during data writing. This way a data write can pollute another
newly allocated cluster of data or metadata.

OK, v2 is a try to solve the problem with CoRwlock.. And it is marked
RFC, because of a lot of iotest failures.. Some of problems with v2:

1. It's more complicated to make a test, as everything is blocking
and I can't just interrupt a write and do a discard.. I had to implement
aio_discard in qemu-io and rewrite the test into several portions of io
commands split by "sleep 1".. OK, it's not a big problem, and I've
solved it.

2. iotest 7 fails with several leaked clusters. It seems to depend on
the fact that discards may run in parallel with writes. Iotest 7 takes
snapshots, so I think the L1 table is already updated by the moment the
discard is finally unlocked.. But I didn't dig into it, these are all
assumptions.

3. iotest 13 (and I think a lot more iotests) crashes on
assert(!to->locks_held); .. So with this assertion we can't keep rwlock
locked during data writing...

  #3  in __assert_fail () from /lib64/libc.so.6
  #4  in qemu_aio_coroutine_enter (ctx=0x55762120b700, co=0x55762121d700)
  at ../util/qemu-coroutine.c:158
  #5  in aio_co_enter (ctx=0x55762120b700, co=0x55762121d700) at ../util/async.c:628
  #6  in aio_co_wake (co=0x55762121d700) at ../util/async.c:612
  #7  in thread_pool_co_cb (opaque=0x7f17950daab0, ret=0) at ../util/thread-pool.c:279
  #8  in thread_pool_completion_bh (opaque=0x5576211e5070) at ../util/thread-pool.c:188
  #9  in aio_bh_call (bh=0x557621205df0) at ../util/async.c:136
  #10 in aio_bh_poll (ctx=0x55762120b700) at ../util/async.c:164
  #11 in aio_poll (ctx=0x55762120b700, blocking=true) at ../util/aio-posix.c:659
  #12 in blk_prw (blk=0x557621205790, offset=4303351808,
      buf=0x55762123e000 '\364' , ..., bytes=12288,
      co_entry=0x557620d9dc97 , flags=0) at ../block/block-backend.c:1335
  #13 in blk_pwrite (blk=0x557621205790, offset=4303351808, buf=0x55762123e000,
      count=12288, flags=0) at ../block/block-backend.c:1501


So now I think that v1 is simpler.. It's more complicated (but not too
much) in code. But it keeps discards and data writes from blocking each
other and avoids yields in critical sections.

Vladimir Sementsov-Ogievskiy (3):
  qemu-io: add aio_discard
  iotests: add qcow2-discard-during-rewrite
  block/qcow2: introduce inflight writes counters: fix discard

 block/qcow2.h |   2 +
 block/qcow2-cluster.c |   4 +
 block/qcow2.c |  18 ++-
 qemu-io-cmds.c| 117 ++
 .../tests/qcow2-discard-during-rewrite|  99 +++
 .../tests/qcow2-discard-during-rewrite.out|  17 +++
 6 files changed, 256 insertions(+), 1 deletion(-)
 create mode 100755 tests/qemu-iotests/tests/qcow2-discard-during-rewrite
 create mode 100644 tests/qemu-iotests/tests/qcow2-discard-during-rewrite.out

-- 
2.29.2




Re: [PATCH 0/3] hw/block/nvme: mdts/zasl cleanup

2021-02-25 Thread Klaus Jensen
On Feb 23 06:00, Keith Busch wrote:
> These look good.
> 
> Reviewed-by: Keith Busch 

Thanks, applied to nvme-next.


signature.asc
Description: PGP signature


[PATCH v2] blockjob: report a better error message

2021-02-25 Thread Stefano Garzarella
When a block job fails, we report the strerror(-job->job.ret) error
message even if the job set an error object.
Let's report a better error message using error_get_pretty(job->job.err).

If an error object was not set, strerror(-job->ret) is used as fallback,
as explained in include/qemu/job.h:

typedef struct Job {
...
/**
 * Error object for a failed job.
 * If job->ret is nonzero and an error object was not set, it will be set
 * to strerror(-job->ret) during job_completed.
 */
Error *err;
}

In block_job_query() there can be a transient state where 'job.err' is
not yet set by a scheduled bottom half. In that case we use
strerror(-job->ret) as before.
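
The resulting message selection can be sketched as follows. This is a
simplified model of the logic, not the QEMU source; `job_error_message`
is a hypothetical name:

```python
import os

def job_error_message(ret, err_pretty=None):
    """Prefer the job's error object; fall back to strerror(-ret).

    ret mirrors job->job.ret (0 on success, -errno on failure);
    err_pretty stands in for error_get_pretty(job->job.err) when the
    error object has been set."""
    if ret == 0:
        return None                  # job succeeded: no error field
    if err_pretty is not None:
        return err_pretty            # error_get_pretty(job->job.err)
    return os.strerror(-ret)         # transient: err not set yet

assert job_error_message(0) is None
assert job_error_message(-5) == os.strerror(5)
assert job_error_message(-28, "Failed to flush the L2 table cache") == \
    "Failed to flush the L2 table cache"
```

The middle assertion is the old behaviour kept as a fallback; the last
one is the improvement, where a specific message survives to the client.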

Suggested-by: Kevin Wolf 
Signed-off-by: Stefano Garzarella 
---

Notes:
v2:
- fixed potential issue in block_job_query() [Kevin]
- updated commit message

 blockjob.c | 10 +++---
 1 file changed, 7 insertions(+), 3 deletions(-)

diff --git a/blockjob.c b/blockjob.c
index f2feff051d..ef968017a2 100644
--- a/blockjob.c
+++ b/blockjob.c
@@ -318,8 +318,12 @@ BlockJobInfo *block_job_query(BlockJob *job, Error **errp)
 info->status= job->job.status;
 info->auto_finalize = job->job.auto_finalize;
 info->auto_dismiss  = job->job.auto_dismiss;
-info->has_error = job->job.ret != 0;
-info->error = job->job.ret ? g_strdup(strerror(-job->job.ret)) : NULL;
+if (job->job.ret) {
+info->has_error = true;
+info->error = job->job.err ?
+g_strdup(error_get_pretty(job->job.err)) :
+g_strdup(strerror(-job->job.ret));
+}
 return info;
 }
 
@@ -356,7 +360,7 @@ static void block_job_event_completed(Notifier *n, void *opaque)
 }
 
 if (job->job.ret < 0) {
-msg = strerror(-job->job.ret);
+msg = error_get_pretty(job->job.err);
 }
 
 qapi_event_send_block_job_completed(job_type(&job->job),
-- 
2.29.2




[PATCH v1 2/2] block/qcow2: introduce inflight writes counters: fix discard

2021-02-25 Thread Vladimir Sementsov-Ogievskiy
There is a bug in qcow2: a host cluster can be discarded (its refcount
becomes 0) and reused during a data write. In this case the data write
may pollute another (recently allocated) cluster or even metadata.

To fix the issue let's track inflight writes to host cluster in the
hash table and consider new counter when discarding and reusing host
clusters.

Enable qcow2-discard-during-rewrite as it is fixed.
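
The counter scheme can be sketched as a small model. This is a
simplification for illustration only: the real code also records the
qcow2 discard type and runs under s->lock, and `InflightWrites` is a
hypothetical name:

```python
class InflightWrites:
    """Toy model of per-cluster in-flight write counters.

    A hash map from cluster index to a counter of writes in flight;
    a discard that hits a busy cluster is deferred until the last
    write drops the counter to zero (the "slow path")."""

    def __init__(self):
        self.counters = {}            # cluster index -> in-flight writes
        self.pending_discard = set()  # refcount hit zero while busy
        self.freed = []               # clusters actually released

    def write_start(self, cluster):
        self.counters[cluster] = self.counters.get(cluster, 0) + 1

    def write_end(self, cluster):
        assert self.counters[cluster] > 0
        self.counters[cluster] -= 1
        if self.counters[cluster] == 0:
            del self.counters[cluster]
            if cluster in self.pending_discard:
                self.pending_discard.discard(cluster)
                self.freed.append(cluster)   # slow path: release now

    def discard(self, cluster):
        if cluster in self.counters:
            self.pending_discard.add(cluster)  # defer: write in flight
        else:
            self.freed.append(cluster)         # fast path: release now


m = InflightWrites()
m.write_start(5)
m.discard(5)            # deferred: a write to cluster 5 is in flight
assert m.freed == []
m.write_end(5)
assert m.freed == [5]   # released only after the write completed
```

A cluster is never put back on the free list while a write to it is in
flight, so it cannot be reallocated out from under that write.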

Signed-off-by: Vladimir Sementsov-Ogievskiy 
---
 block/qcow2.h |   9 +
 block/qcow2-refcount.c| 154 +-
 block/qcow2.c |  26 ++-
 .../tests/qcow2-discard-during-rewrite|   2 +-
 4 files changed, 186 insertions(+), 5 deletions(-)

diff --git a/block/qcow2.h b/block/qcow2.h
index 0678073b74..fea2525a76 100644
--- a/block/qcow2.h
+++ b/block/qcow2.h
@@ -420,6 +420,8 @@ typedef struct BDRVQcow2State {
  * is to convert the image with the desired compression type set.
  */
 Qcow2CompressionType compression_type;
+
+GHashTable *inflight_writes_counters;
 } BDRVQcow2State;
 
 typedef struct Qcow2COWRegion {
@@ -896,6 +898,13 @@ int qcow2_shrink_reftable(BlockDriverState *bs);
 int64_t qcow2_get_last_cluster(BlockDriverState *bs, int64_t size);
 int qcow2_detect_metadata_preallocation(BlockDriverState *bs);
 
+int qcow2_inflight_writes_inc_locked(BlockDriverState *bs, int64_t offset,
+ int64_t length);
+int qcow2_inflight_writes_dec(BlockDriverState *bs, int64_t offset,
+  int64_t length);
+int qcow2_inflight_writes_dec_locked(BlockDriverState *bs, int64_t offset,
+ int64_t length);
+
 /* qcow2-cluster.c functions */
 int qcow2_grow_l1_table(BlockDriverState *bs, uint64_t min_size,
 bool exact_size);
diff --git a/block/qcow2-refcount.c b/block/qcow2-refcount.c
index 8e649b008e..0ecb1167a6 100644
--- a/block/qcow2-refcount.c
+++ b/block/qcow2-refcount.c
@@ -799,6 +799,145 @@ found:
 }
 }
 
+/*
+ * Qcow2InFlightRefcount is a type for values of s->inflight_writes_counters
+ * hash map. Its keys are cluster indices.
+ */
+typedef struct Qcow2InFlightRefcount {
+/*
+ * Number of in-flight writes to the cluster, always > 0; when it
+ * becomes 0 the entry is removed from s->inflight_writes_counters.
+ */
+uint64_t inflight_writes_cnt;
+
+/* Cluster refcount is known to be zero */
+bool refcount_zero;
+
+/* Cluster refcount was made zero with this discard-type */
+enum qcow2_discard_type type;
+} Qcow2InFlightRefcount;
+
+static Qcow2InFlightRefcount *find_infl_wr(BDRVQcow2State *s,
+   int64_t cluster_index)
+{
+Qcow2InFlightRefcount *infl;
+
+if (!s->inflight_writes_counters) {
+return NULL;
+}
+
+infl = g_hash_table_lookup(s->inflight_writes_counters, &cluster_index);
+
+if (infl) {
+assert(infl->inflight_writes_cnt > 0);
+}
+
+return infl;
+}
+
+/*
+ * Returns true if there are any in-flight writes to the cluster blocking
+ * its reallocation.
+ */
+static bool has_infl_wr(BDRVQcow2State *s, int64_t cluster_index)
+{
+return !!find_infl_wr(s, cluster_index);
+}
+
+static int update_inflight_write_cnt(BlockDriverState *bs,
+ int64_t offset, int64_t length,
+ bool decrease, bool locked)
+{
+BDRVQcow2State *s = bs->opaque;
+int64_t start, last, cluster_offset;
+
+if (locked) {
+qemu_co_mutex_assert_locked(&s->lock);
+}
+
+start = start_of_cluster(s, offset);
+last = start_of_cluster(s, offset + length - 1);
+for (cluster_offset = start; cluster_offset <= last;
+ cluster_offset += s->cluster_size)
+{
+int64_t cluster_index = cluster_offset >> s->cluster_bits;
+Qcow2InFlightRefcount *infl = find_infl_wr(s, cluster_index);
+
+if (!infl) {
+infl = g_new0(Qcow2InFlightRefcount, 1);
+g_hash_table_insert(s->inflight_writes_counters,
+g_memdup(_index, 
sizeof(cluster_index)),
+infl);
+}
+
+if (decrease) {
+assert(infl->inflight_writes_cnt >= 1);
+infl->inflight_writes_cnt--;
+} else {
+infl->inflight_writes_cnt++;
+}
+
+if (infl->inflight_writes_cnt == 0) {
+bool refcount_zero = infl->refcount_zero;
+enum qcow2_discard_type type = infl->type;
+
+g_hash_table_remove(s->inflight_writes_counters, &cluster_index);
+
+if (refcount_zero) {
+/*
+ * Slow path. We must reset normal refcount to actually release
+ * the cluster.
+ */
+int ret;
+
+if (!locked) {
+qemu_co_mutex_lock(&s->lock);
+}
+

[PATCH v1 1/2] iotests: add qcow2-discard-during-rewrite

2021-02-25 Thread Vladimir Sementsov-Ogievskiy
Simple test:
 - start writing to allocated cluster A
 - discard this cluster
 - write to another unallocated cluster B (it's allocated in same place
   where A was allocated)
 - continue writing to A

For now, the last action pollutes cluster B, which is a bug fixed by the
following commit.

For now, add test to "disabled" group, so that it doesn't run
automatically.

Signed-off-by: Vladimir Sementsov-Ogievskiy 
---
 .../tests/qcow2-discard-during-rewrite| 72 +++
 .../tests/qcow2-discard-during-rewrite.out| 21 ++
 2 files changed, 93 insertions(+)
 create mode 100755 tests/qemu-iotests/tests/qcow2-discard-during-rewrite
 create mode 100644 tests/qemu-iotests/tests/qcow2-discard-during-rewrite.out

diff --git a/tests/qemu-iotests/tests/qcow2-discard-during-rewrite b/tests/qemu-iotests/tests/qcow2-discard-during-rewrite
new file mode 100755
index 00..7f0d8a107a
--- /dev/null
+++ b/tests/qemu-iotests/tests/qcow2-discard-during-rewrite
@@ -0,0 +1,72 @@
+#!/usr/bin/env bash
+# group: quick disabled
+#
+# Test discarding (and reusing) host cluster during writing data to it.
+#
+# Copyright (c) 2021 Virtuozzo International GmbH.
+#
+# This program is free software; you can redistribute it and/or modify
+# it under the terms of the GNU General Public License as published by
+# the Free Software Foundation; either version 2 of the License, or
+# (at your option) any later version.
+#
+# This program is distributed in the hope that it will be useful,
+# but WITHOUT ANY WARRANTY; without even the implied warranty of
+# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+# GNU General Public License for more details.
+#
+# You should have received a copy of the GNU General Public License
+# along with this program.  If not, see <http://www.gnu.org/licenses/>.
+#
+
+# creator
+owner=vsement...@virtuozzo.com
+
+seq=`basename $0`
+echo "QA output created by $seq"
+
+status=1   # failure is the default!
+
+_cleanup()
+{
+_cleanup_test_img
+}
+trap "_cleanup; exit \$status" 0 1 2 3 15
+
+# get standard environment, filters and checks
+. ./../common.rc
+. ./../common.filter
+
+_supported_fmt qcow2
+_supported_proto file fuse
+_supported_os Linux
+
+size=1M
+_make_test_img $size
+
+(
+cat <

Re: [PATCH v4 0/8] hw/sh4: Kconfig cleanups

2021-02-25 Thread Paolo Bonzini

On 22/02/21 15:15, Philippe Mathieu-Daudé wrote:

Missing review: 1 (license)

Since v3:
- Include full MIT license text (Peter)

Since v2:
- Added missing TC58128/SH_PCI Kconfig entries (Peter)

Since v1:
- Addressed Peter Maydell review comments from
https://www.mail-archive.com/qemu-block@nongnu.org/msg80599.html

Philippe Mathieu-Daudé (8):
   hw/sh4: Add missing license
   hw/sh4: Add missing Kconfig dependency on SH7750 for the R2D board
   hw/intc: Introduce SH_INTC Kconfig entry
   hw/char: Introduce SH_SCI Kconfig entry
   hw/timer: Introduce SH_TIMER Kconfig entry
   hw/block: Introduce TC58128 eeprom Kconfig entry
   hw/pci-host: Introduce SH_PCI Kconfig entry
   hw/sh4: Remove now unused CONFIG_SH4 from Kconfig

  include/hw/sh4/sh.h   | 31 ---
  hw/block/tc58128.c| 26 ++
  hw/{sh4 => pci-host}/sh_pci.c |  0
  MAINTAINERS   |  6 ++
  hw/block/Kconfig  |  3 +++
  hw/block/meson.build  |  2 +-
  hw/char/Kconfig   |  3 +++
  hw/char/meson.build   |  2 +-
  hw/intc/Kconfig   |  3 +++
  hw/intc/meson.build   |  2 +-
  hw/pci-host/Kconfig   |  4 
  hw/pci-host/meson.build   |  1 +
  hw/sh4/Kconfig| 12 ++--
  hw/sh4/meson.build|  1 -
  hw/timer/Kconfig  |  4 
  hw/timer/meson.build  |  2 +-
  16 files changed, 88 insertions(+), 14 deletions(-)
  rename hw/{sh4 => pci-host}/sh_pci.c (100%)



Acked-by: Paolo Bonzini 

Paolo




[PATCH v1 0/2] qcow2: fix parallel rewrite and discard

2021-02-25 Thread Vladimir Sementsov-Ogievskiy
Hi all! It turns out that nothing prevents discarding and reallocating a host
cluster during data writing. This way a data write can pollute another
freshly allocated cluster of data or metadata.

Here is my suggestion to fix it basing on improved refcounts model. Look
at 02 for details.

I don't insist on this version, and will soon send a v2, based on
CoRwLock, as Kevin suggested, which should look simpler. Still, with v1
we keep the possibility of a relatively async discard.. It doesn't seem
worth the complexity.. But I'd like to share my idea of additional
"runtime" reference counters for clusters, as it may be needed later if
we face problems with a more restrictive CoRwLock, or maybe for some
other task. So here it is.

Vladimir Sementsov-Ogievskiy (2):
  iotests: add qcow2-discard-during-rewrite
  block/qcow2: introduce inflight writes counters: fix discard

 block/qcow2.h |   9 +
 block/qcow2-refcount.c| 154 +-
 block/qcow2.c |  26 ++-
 .../tests/qcow2-discard-during-rewrite|  72 
 .../tests/qcow2-discard-during-rewrite.out|  21 +++
 5 files changed, 278 insertions(+), 4 deletions(-)
 create mode 100755 tests/qemu-iotests/tests/qcow2-discard-during-rewrite
 create mode 100644 tests/qemu-iotests/tests/qcow2-discard-during-rewrite.out

-- 
2.29.2




Re: [PATCH v4] virtio-blk: Respect discard granularity

2021-02-25 Thread Stefano Garzarella

On Thu, Feb 25, 2021 at 09:12:39AM +0900, Akihiko Odaki wrote:

Report the configured granularity for discard operation to the
guest. If this is not set use the block size.

Since until now we have ignored the configured discard granularity
and always reported the block size, let's add
'report-discard-granularity' property and disable it for older
machine types to avoid migration issues.

Signed-off-by: Akihiko Odaki 
---
hw/block/virtio-blk.c  | 8 +++-
hw/core/machine.c  | 4 +++-
include/hw/virtio/virtio-blk.h | 1 +
3 files changed, 11 insertions(+), 2 deletions(-)


Reviewed-by: Stefano Garzarella