Re: [PATCH v5 2/2] hw/block/nvme: add the dataset management command

2020-10-22 Thread Klaus Jensen
On Oct 22 23:02, Philippe Mathieu-Daudé wrote:
> Hi Klaus,
> 

Hi Philippe,

Thanks for your comments!

> On 10/22/20 8:49 PM, Klaus Jensen wrote:
> > From: Klaus Jensen 
> > 
> > Add support for the Dataset Management command and the Deallocate
> > attribute. Deallocation results in discards being sent to the underlying
> > block device. Whether or not the blocks are actually deallocated is
> > affected by the same factors as Write Zeroes (see previous commit).
> > 
> >   format | discard | dsm (512b)  dsm (4kb)  dsm (64kb)
> 
> Please use B/KiB units which are unambiguous (kb is for kbits)
> (if you queue this yourself, you can fix when applying, no need
> to repost).
> 

Thanks, I'll change it.

> >  --
> >qcow2ignore   n   n  n
> >qcow2unmapn   n  y
> >raw  ignore   n   n  n
> >raw  unmapn   y  y
> > 
> > Again, a raw format and 4kb LBAs are preferable.
> > 
> > In order to set the Namespace Preferred Deallocate Granularity and
> > Alignment fields (NPDG and NPDA), choose a sane minimum discard
> > granularity of 4kb. If we are using a passthru device supporting discard
> > at a 512b granularity, the user should set the discard_granularity property
> 
> Ditto.
> 
> > explicitly. NPDG and NPDA will also account for the cluster_size of the
> > block driver if required (i.e. for QCOW2).
> > 
> > See NVM Express 1.3d, Section 6.7 ("Dataset Management command").
> > 
> > Signed-off-by: Klaus Jensen 
> > ---
> >   hw/block/nvme.h  |   2 +
> >   include/block/nvme.h |   7 ++-
> >   hw/block/nvme-ns.c   |  36 +--
> >   hw/block/nvme.c  | 101 ++-
> >   4 files changed, 140 insertions(+), 6 deletions(-)
> > 
> > diff --git a/hw/block/nvme.h b/hw/block/nvme.h
> > index e080a2318a50..574333caa3f9 100644
> > --- a/hw/block/nvme.h
> > +++ b/hw/block/nvme.h
> > @@ -28,6 +28,7 @@ typedef struct NvmeRequest {
> >   struct NvmeNamespace*ns;
> >   BlockAIOCB  *aiocb;
> >   uint16_tstatus;
> > +void*opaque;
> >   NvmeCqe cqe;
> >   NvmeCmd cmd;
> >   BlockAcctCookie acct;
> > @@ -60,6 +61,7 @@ static inline const char *nvme_io_opc_str(uint8_t opc)
> >   case NVME_CMD_WRITE:return "NVME_NVM_CMD_WRITE";
> >   case NVME_CMD_READ: return "NVME_NVM_CMD_READ";
> >   case NVME_CMD_WRITE_ZEROES: return "NVME_NVM_CMD_WRITE_ZEROES";
> > +case NVME_CMD_DSM:  return "NVME_NVM_CMD_DSM";
> >   default:return "NVME_NVM_CMD_UNKNOWN";
> >   }
> >   }
> > diff --git a/include/block/nvme.h b/include/block/nvme.h
> > index 966c3bb304bd..e95ff6ca9b37 100644
> > --- a/include/block/nvme.h
> > +++ b/include/block/nvme.h
> > @@ -990,7 +990,12 @@ typedef struct QEMU_PACKED NvmeIdNs {
> >   uint16_tnabspf;
> >   uint16_tnoiob;
> >   uint8_t nvmcap[16];
> > -uint8_t rsvd64[40];
> > +uint16_tnpwg;
> > +uint16_tnpwa;
> > +uint16_tnpdg;
> > +uint16_tnpda;
> > +uint16_tnows;
> > +uint8_t rsvd74[30];
> >   uint8_t nguid[16];
> >   uint64_teui64;
> >   NvmeLBAFlbaf[16];
> 
> If you consider "block/nvme.h" shared by 2 different subsystems,
> it is better to add the changes in a separate patch. That way
> the changes can be acked individually.
> 

Sure. Some other stuff here warrants a v6 I think, so I will split it.

> > diff --git a/hw/block/nvme-ns.c b/hw/block/nvme-ns.c
> > index f1cc734c60f5..840651db7256 100644
> > --- a/hw/block/nvme-ns.c
> > +++ b/hw/block/nvme-ns.c
> > @@ -28,10 +28,14 @@
> >   #include "nvme.h"
> >   #include "nvme-ns.h"
> > -static void nvme_ns_init(NvmeNamespace *ns)
> > +#define MIN_DISCARD_GRANULARITY (4 * KiB)
> > +
> > +static int nvme_ns_init(NvmeNamespace *ns, Error **errp)
> 
> Hmm the Error* argument could be squashed in "hw/block/nvme:
> support multiple namespaces". Else better split patch in dumb
> units IMHO (maybe a reviewer's taste).
> 

Yeah, I guess I can squash that in.

> >   {
> > +BlockDriverInfo bdi;
> >   NvmeIdNs *id_ns = &ns->id_ns;
> >   int lba_index = NVME_ID_NS_FLBAS_INDEX(ns->id_ns.flbas);
> > +int npdg, ret;
> >   ns->id_ns.dlfeat = 0x9;
> > @@ -43,8 +47,25 @@ static void nvme_ns_init(NvmeNamespace *ns)
> >   id_ns->ncap = id_ns->nsze;
> >   id_ns->nuse = id_ns->ncap;
> > -/* support DULBE */
> > -id_ns->nsfeat |= 0x4;
> > +/* support DULBE and I/O optimization fields */
> > +id_ns->nsfeat |= (0x4 | 0x10);
> 
> The comment helps, but isn't needed if you use explicit definitions
> for these flags. You already introduced the NVME_ID_NS_NSFEAT_DULBE
> and NVME_ID_NS_FLBAS_EXTENDED but they are restricted to extract bits.
> 

Re: [PULL 22/23] hw/sd: Fix incorrect populated function switch status data structure

2020-10-22 Thread Bin Meng
Hi Niek,

On Thu, Oct 22, 2020 at 11:20 PM Niek Linnenbank
 wrote:
>
> Hi Bin, Philippe,
>
> If I'm correct, the acceptance tests for orange pi need to be run with a flag
> ARMBIAN_ARTIFACTS_CACHED set that explicitly allows them to be run using the
> armbian mirror. So if you pass that flag on the same command that Philippe
> gave, the tests should run.

Thank you for the hints. Actually I noticed the environment variable
ARMBIAN_ARTIFACTS_CACHED when looking at the test code, but after I
turned on the flag it still could not download the test asset from the
apt.armbian.com website.

> I have a follow-up question and I'm interested to hear your opinion on that 
> Philippe. Should we perhaps update the orange pi tests (and maybe others) so 
> they use a reliable mirror that we can control, for example a github repo? I 
> would be happy to create a repo for that, at least for the orange pi tests. 
> But maybe there is already something planned as a more general solution for 
> artifacts of other machines as well?
>

Regards,
Bin



[RFC PATCH] hw/nvme: Move NVMe emulation out of hw/block/ directory

2020-10-22 Thread Philippe Mathieu-Daudé
Like IDE once was, NVMe emulation is becoming an active
subsystem. Move it into its own namespace.

Signed-off-by: Philippe Mathieu-Daudé 
---
sent as RFC, in case it helps NVMe developers. As nvme-next
has patches queued, if the idea of moving seems useful, this
patch can be resent later.
---
 meson.build   |   1 +
 hw/{block/nvme.h => nvme/nvme-internal.h} |   4 +-
 hw/{block/nvme.c => nvme/core.c}  |   2 +-
 MAINTAINERS   |   2 +-
 hw/Kconfig|   1 +
 hw/block/Kconfig  |   5 -
 hw/block/meson.build  |   1 -
 hw/block/trace-events | 124 -
 hw/meson.build|   1 +
 hw/nvme/Kconfig   |   4 +
 hw/nvme/meson.build   |   1 +
 hw/nvme/trace-events  | 125 ++
 12 files changed, 137 insertions(+), 134 deletions(-)
 rename hw/{block/nvme.h => nvme/nvme-internal.h} (98%)
 rename hw/{block/nvme.c => nvme/core.c} (99%)
 create mode 100644 hw/nvme/Kconfig
 create mode 100644 hw/nvme/meson.build
 create mode 100644 hw/nvme/trace-events

diff --git a/meson.build b/meson.build
index 7627a0ae46e..24234ebd473 100644
--- a/meson.build
+++ b/meson.build
@@ -1356,6 +1356,7 @@
 'hw/misc',
 'hw/misc/macio',
 'hw/net',
+'hw/nvme',
 'hw/nvram',
 'hw/pci',
 'hw/pci-host',
diff --git a/hw/block/nvme.h b/hw/nvme/nvme-internal.h
similarity index 98%
rename from hw/block/nvme.h
rename to hw/nvme/nvme-internal.h
index 52ba794f2e9..824788d9c6e 100644
--- a/hw/block/nvme.h
+++ b/hw/nvme/nvme-internal.h
@@ -1,5 +1,5 @@
-#ifndef HW_NVME_H
-#define HW_NVME_H
+#ifndef HW_NVME_INTERNAL_H
+#define HW_NVME_INTERNAL_H
 
 #include "block/nvme.h"
 
diff --git a/hw/block/nvme.c b/hw/nvme/core.c
similarity index 99%
rename from hw/block/nvme.c
rename to hw/nvme/core.c
index 44fa5b90769..04391fbb083 100644
--- a/hw/block/nvme.c
+++ b/hw/nvme/core.c
@@ -67,8 +67,8 @@
 #include "qemu/log.h"
 #include "qemu/module.h"
 #include "qemu/cutils.h"
+#include "nvme-internal.h"
 #include "trace.h"
-#include "nvme.h"
 
 #define NVME_MAX_IOQPAIRS 0x
 #define NVME_DB_SIZE  4
diff --git a/MAINTAINERS b/MAINTAINERS
index 6a197bd358d..7132bbe3ff4 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -1875,7 +1875,7 @@ M: Keith Busch 
 M: Klaus Jensen 
 L: qemu-block@nongnu.org
 S: Supported
-F: hw/block/nvme*
+F: hw/nvme/
 F: tests/qtest/nvme-test.c
 T: git git://git.infradead.org/qemu-nvme.git nvme-next
 
diff --git a/hw/Kconfig b/hw/Kconfig
index 4de1797ffda..4ef9ca40ab0 100644
--- a/hw/Kconfig
+++ b/hw/Kconfig
@@ -21,6 +21,7 @@ source mem/Kconfig
 source misc/Kconfig
 source net/Kconfig
 source nubus/Kconfig
+source nvme/Kconfig
 source nvram/Kconfig
 source pci-bridge/Kconfig
 source pci-host/Kconfig
diff --git a/hw/block/Kconfig b/hw/block/Kconfig
index 2d17f481adc..c2213173ffe 100644
--- a/hw/block/Kconfig
+++ b/hw/block/Kconfig
@@ -22,11 +22,6 @@ config ECC
 config ONENAND
 bool
 
-config NVME_PCI
-bool
-default y if PCI_DEVICES
-depends on PCI
-
 config VIRTIO_BLK
 bool
 default y
diff --git a/hw/block/meson.build b/hw/block/meson.build
index 78cad8f7cba..96697f739c0 100644
--- a/hw/block/meson.build
+++ b/hw/block/meson.build
@@ -13,7 +13,6 @@
 softmmu_ss.add(when: 'CONFIG_SWIM', if_true: files('swim.c'))
 softmmu_ss.add(when: 'CONFIG_XEN', if_true: files('xen-block.c'))
 softmmu_ss.add(when: 'CONFIG_SH4', if_true: files('tc58128.c'))
-softmmu_ss.add(when: 'CONFIG_NVME_PCI', if_true: files('nvme.c'))
 
 specific_ss.add(when: 'CONFIG_VIRTIO_BLK', if_true: files('virtio-blk.c'))
 specific_ss.add(when: 'CONFIG_VHOST_USER_BLK', if_true: 
files('vhost-user-blk.c'))
diff --git a/hw/block/trace-events b/hw/block/trace-events
index ec94c56a416..31be146 100644
--- a/hw/block/trace-events
+++ b/hw/block/trace-events
@@ -27,130 +27,6 @@ virtio_blk_submit_multireq(void *vdev, void *mrb, int 
start, int num_reqs, uint6
 hd_geometry_lchs_guess(void *blk, int cyls, int heads, int secs) "blk %p LCHS 
%d %d %d"
 hd_geometry_guess(void *blk, uint32_t cyls, uint32_t heads, uint32_t secs, int 
trans) "blk %p CHS %u %u %u trans %d"
 
-# nvme.c
-# nvme traces for successful events
-pci_nvme_irq_msix(uint32_t vector) "raising MSI-X IRQ vector %u"
-pci_nvme_irq_pin(void) "pulsing IRQ pin"
-pci_nvme_irq_masked(void) "IRQ is masked"
-pci_nvme_dma_read(uint64_t prp1, uint64_t prp2) "DMA read, prp1=0x%"PRIx64" 
prp2=0x%"PRIx64""
-pci_nvme_map_addr(uint64_t addr, uint64_t len) "addr 0x%"PRIx64" len %"PRIu64""
-pci_nvme_map_addr_cmb(uint64_t addr, uint64_t len) "addr 0x%"PRIx64" len 
%"PRIu64""
-pci_nvme_map_prp(uint64_t trans_len, uint32_t len, uint64_t prp1, uint64_t 
prp2, int num_prps) "trans_len %"PRIu64" len %"PRIu32" prp1 0x%"PRIx64" prp2 
0x%"PRIx64" num_prps %d"
-pci_nvme_io_cmd(uint16_t cid, uint32_t nsid, uint16_t sqid, uint8_t opcode) 
"cid 

Re: [PATCH v4 2/7] nbd: Add new qemu:allocation-depth metadata context

2020-10-22 Thread Eric Blake
On 10/14/20 6:52 AM, Vladimir Sementsov-Ogievskiy wrote:

>>   docs/interop/nbd.txt | 27 ++---
> 
> [..]
> 
>> +In the allocation depth context, bits 0 and 1 form a tri-state value:
>> +
>> +    bits 0-1: 00: NBD_STATE_DEPTH_UNALLOC, the extent is unallocated
>> +  01: NBD_STATE_DEPTH_LOCAL, the extent is allocated in the
>> +  top level of the image
> 
> Hmm. I always thought that "image" == file, so backing chain is a chain
> of images,
> not a several levels of one image. If it is so, than it should be "the
> top level image".
> And "levels of the image" may designate internal qcow2 snapshots
> unrelated here..

It's fuzzy.  From the guest point of view, we are serving a single guest
image by use of multiple files in the host.  I will do s/level/layer/,
to match the wording I already had on the next line:

>   10: NBD_STATE_DEPTH_BACKING, the extent is inherited from a
>   backing layer

-- 
Eric Blake, Principal Software Engineer
Red Hat, Inc.   +1-919-301-3226
Virtualization:  qemu.org | libvirt.org




Re: [PATCH v5 2/2] hw/block/nvme: add the dataset management command

2020-10-22 Thread Keith Busch
On Thu, Oct 22, 2020 at 11:02:04PM +0200, Philippe Mathieu-Daudé wrote:
> On 10/22/20 8:49 PM, Klaus Jensen wrote:
> > +static uint16_t nvme_dsm(NvmeCtrl *n, NvmeRequest *req)
> > +{
> > +NvmeNamespace *ns = req->ns;
> > +NvmeDsmCmd *dsm = (NvmeDsmCmd *) &req->cmd;
> > +NvmeDsmRange *range = NULL;
> 
> g_autofree?

Or just allocate the array on the stack. The upper limit is just 512
entries.



Re: [PATCH v2 15/20] iotests: 219: prepare for backup over block-copy

2020-10-22 Thread Vladimir Sementsov-Ogievskiy

23.07.2020 11:35, Max Reitz wrote:

On 01.06.20 20:11, Vladimir Sementsov-Ogievskiy wrote:

The further change of moving backup to be a on block-copy call will


-on?


make copying chunk-size and cluster-size a separate things. So, even


s/a/two/


with 64k cluster sized qcow2 image, default chunk would be 1M.
Test 219 depends on specified chunk-size. Update it for explicit
chunk-size for backup as for mirror.

Signed-off-by: Vladimir Sementsov-Ogievskiy 
---
  tests/qemu-iotests/219 | 13 +++--
  1 file changed, 7 insertions(+), 6 deletions(-)

diff --git a/tests/qemu-iotests/219 b/tests/qemu-iotests/219
index db272c5249..2bbed28f39 100755
--- a/tests/qemu-iotests/219
+++ b/tests/qemu-iotests/219
@@ -203,13 +203,13 @@ with iotests.FilePath('disk.img') as disk_path, \
  # but related to this also automatic state transitions like job
  # completion), but still get pause points often enough to avoid making 
this
  # test very slow, it's important to have the right ratio between speed and
-# buf_size.
+# copy-chunk-size.
  #
-# For backup, buf_size is hard-coded to the source image cluster size 
(64k),
-# so we'll pick the same for mirror. The slice time, i.e. the granularity
-# of the rate limiting is 100ms. With a speed of 256k per second, we can
-# get four pause points per second. This gives us 250ms per iteration,
-# which should be enough to stay deterministic.
+# Chose 64k copy-chunk-size both for mirror (by buf_size) and backup (by
+# x-max-chunk). The slice time, i.e. the granularity of the rate limiting
+# is 100ms. With a speed of 256k per second, we can get four pause points
+# per second. This gives us 250ms per iteration, which should be enough to
+# stay deterministic.


Don’t we also have to limit the number of workers to 1 so we actually
keep 250 ms per iteration instead of just finishing four requests
immediately, then pausing for a second?


Block-copy rate limiter works well enough: it will not start too many requests.




  test_job_lifecycle(vm, 'drive-mirror', has_ready=True, job_args={
  'device': 'drive0-node',
@@ -226,6 +226,7 @@ with iotests.FilePath('disk.img') as disk_path, \
  'target': copy_path,
  'sync': 'full',
  'speed': 262144,
+'x-max-chunk': 65536,
  'auto-finalize': auto_finalize,
  'auto-dismiss': auto_dismiss,
  })







--
Best regards,
Vladimir



Re: [PATCH v2 14/20] iotests: 185: prepare for backup over block-copy

2020-10-22 Thread Vladimir Sementsov-Ogievskiy

23.07.2020 11:19, Max Reitz wrote:

On 01.06.20 20:11, Vladimir Sementsov-Ogievskiy wrote:

The further change of moving backup to be a on block-copy call will


-on?


one :)




make copying chunk-size and cluster-size a separate things. So, even


s/a/two/


with 64k cluster sized qcow2 image, default chunk would be 1M.
The 185 test, however, assumes that with speed limited to 64K, one iteration
results in offset=64K. That will change, as the first iteration would
result in offset=1M independently of speed.

So, let's explicitly specify what the test wants: set max-chunk to 64K, so
that one iteration is 64K. Note that we don't need to limit
max-workers, as block-copy rate limitator will handle the situation and


*limitator


wouldn't start new workers when speed limit is obviously reached.

Signed-off-by: Vladimir Sementsov-Ogievskiy 
---
  tests/qemu-iotests/185 | 3 ++-
  tests/qemu-iotests/185.out | 2 +-
  2 files changed, 3 insertions(+), 2 deletions(-)

diff --git a/tests/qemu-iotests/185 b/tests/qemu-iotests/185
index fd5e6ebe11..6afb3fc82f 100755
--- a/tests/qemu-iotests/185
+++ b/tests/qemu-iotests/185
@@ -182,7 +182,8 @@ _send_qemu_cmd $h \
'target': '$TEST_IMG.copy',
'format': '$IMGFMT',
'sync': 'full',
-  'speed': 65536 } }" \
+  'speed': 65536,
+  'x-max-chunk': 65536 } }" \


Out of curiosity, would it also suffice to disable copy offloading?


Note that x-max-chunk works even with copy offloading enabled; it sets the 
maximum only for background copying, not for all operations.



But anyway:

Reviewed-by: Max Reitz 


  "return"
  
  # If we don't sleep here 'quit' command races with disk I/O

diff --git a/tests/qemu-iotests/185.out b/tests/qemu-iotests/185.out
index ac5ab16bc8..5232647972 100644
--- a/tests/qemu-iotests/185.out
+++ b/tests/qemu-iotests/185.out
@@ -61,7 +61,7 @@ Formatting 'TEST_DIR/t.qcow2.copy', fmt=qcow2 size=67108864 
cluster_size=65536 l
  
  { 'execute': 'qmp_capabilities' }

  {"return": {}}
-{ 'execute': 'drive-backup', 'arguments': { 'device': 'disk', 'target': 
'TEST_DIR/t.IMGFMT.copy', 'format': 'IMGFMT', 'sync': 'full', 'speed': 65536 } }
+{ 'execute': 'drive-backup', 'arguments': { 'device': 'disk', 'target': 
'TEST_DIR/t.IMGFMT.copy', 'format': 'IMGFMT', 'sync': 'full', 'speed': 65536, 
'x-max-chunk': 65536 } }
  Formatting 'TEST_DIR/t.qcow2.copy', fmt=qcow2 size=67108864 
cluster_size=65536 lazy_refcounts=off refcount_bits=16 compression_type=zlib
  {"timestamp": {"seconds":  TIMESTAMP, "microseconds":  TIMESTAMP}, "event": "JOB_STATUS_CHANGE", "data": 
{"status": "created", "id": "disk"}}
  {"timestamp": {"seconds":  TIMESTAMP, "microseconds":  TIMESTAMP}, "event": "JOB_STATUS_CHANGE", "data": 
{"status": "running", "id": "disk"}}







--
Best regards,
Vladimir



Re: [PATCH v2 13/20] iotests: 129: prepare for backup over block-copy

2020-10-22 Thread Vladimir Sementsov-Ogievskiy

23.07.2020 11:03, Max Reitz wrote:

On 01.06.20 20:11, Vladimir Sementsov-Ogievskiy wrote:

After introducing parallel async copy requests instead of plain
cluster-by-cluster copying loop, backup job may finish earlier than
final assertion in do_test_stop. Let's require slow backup explicitly
by specifying speed parameter.


Isn’t the problem really that block_set_io_throttle does absolutely
nothing?  (Which is a long-standing problem with 129.  I personally just
never run it, honestly.)


Hmm, is it better to drop test_drive_backup() from here?




Signed-off-by: Vladimir Sementsov-Ogievskiy 
---
  tests/qemu-iotests/129 | 2 +-
  1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/tests/qemu-iotests/129 b/tests/qemu-iotests/129
index 4db5eca441..bca56b589d 100755
--- a/tests/qemu-iotests/129
+++ b/tests/qemu-iotests/129
@@ -76,7 +76,7 @@ class TestStopWithBlockJob(iotests.QMPTestCase):
  def test_drive_backup(self):
  self.do_test_stop("drive-backup", device="drive0",
target=self.target_img,
-  sync="full")
+  sync="full", speed=1024)
  
  def test_block_commit(self):

  self.do_test_stop("block-commit", device="drive0")







--
Best regards,
Vladimir



Re: [PATCH v5 2/2] hw/block/nvme: add the dataset management command

2020-10-22 Thread Philippe Mathieu-Daudé

Hi Klaus,

On 10/22/20 8:49 PM, Klaus Jensen wrote:

From: Klaus Jensen 

Add support for the Dataset Management command and the Deallocate
attribute. Deallocation results in discards being sent to the underlying
block device. Whether or not the blocks are actually deallocated is
affected by the same factors as Write Zeroes (see previous commit).

  format | discard | dsm (512b)  dsm (4kb)  dsm (64kb)


Please use B/KiB units which are unambiguous (kb is for kbits)
(if you queue this yourself, you can fix when applying, no need
to repost).


 --
   qcow2ignore   n   n  n
   qcow2unmapn   n  y
   raw  ignore   n   n  n
   raw  unmapn   y  y

Again, a raw format and 4kb LBAs are preferable.

In order to set the Namespace Preferred Deallocate Granularity and
Alignment fields (NPDG and NPDA), choose a sane minimum discard
granularity of 4kb. If we are using a passthru device supporting discard
at a 512b granularity, the user should set the discard_granularity property


Ditto.


explicitly. NPDG and NPDA will also account for the cluster_size of the
block driver if required (i.e. for QCOW2).

See NVM Express 1.3d, Section 6.7 ("Dataset Management command").

Signed-off-by: Klaus Jensen 
---
  hw/block/nvme.h  |   2 +
  include/block/nvme.h |   7 ++-
  hw/block/nvme-ns.c   |  36 +--
  hw/block/nvme.c  | 101 ++-
  4 files changed, 140 insertions(+), 6 deletions(-)

diff --git a/hw/block/nvme.h b/hw/block/nvme.h
index e080a2318a50..574333caa3f9 100644
--- a/hw/block/nvme.h
+++ b/hw/block/nvme.h
@@ -28,6 +28,7 @@ typedef struct NvmeRequest {
  struct NvmeNamespace*ns;
  BlockAIOCB  *aiocb;
  uint16_tstatus;
+void*opaque;
  NvmeCqe cqe;
  NvmeCmd cmd;
  BlockAcctCookie acct;
@@ -60,6 +61,7 @@ static inline const char *nvme_io_opc_str(uint8_t opc)
  case NVME_CMD_WRITE:return "NVME_NVM_CMD_WRITE";
  case NVME_CMD_READ: return "NVME_NVM_CMD_READ";
  case NVME_CMD_WRITE_ZEROES: return "NVME_NVM_CMD_WRITE_ZEROES";
+case NVME_CMD_DSM:  return "NVME_NVM_CMD_DSM";
  default:return "NVME_NVM_CMD_UNKNOWN";
  }
  }
diff --git a/include/block/nvme.h b/include/block/nvme.h
index 966c3bb304bd..e95ff6ca9b37 100644
--- a/include/block/nvme.h
+++ b/include/block/nvme.h
@@ -990,7 +990,12 @@ typedef struct QEMU_PACKED NvmeIdNs {
  uint16_tnabspf;
  uint16_tnoiob;
  uint8_t nvmcap[16];
-uint8_t rsvd64[40];
+uint16_tnpwg;
+uint16_tnpwa;
+uint16_tnpdg;
+uint16_tnpda;
+uint16_tnows;
+uint8_t rsvd74[30];
  uint8_t nguid[16];
  uint64_teui64;
  NvmeLBAFlbaf[16];


If you consider "block/nvme.h" shared by 2 different subsystems,
it is better to add the changes in a separate patch. That way
the changes can be acked individually.


diff --git a/hw/block/nvme-ns.c b/hw/block/nvme-ns.c
index f1cc734c60f5..840651db7256 100644
--- a/hw/block/nvme-ns.c
+++ b/hw/block/nvme-ns.c
@@ -28,10 +28,14 @@
  #include "nvme.h"
  #include "nvme-ns.h"
  
-static void nvme_ns_init(NvmeNamespace *ns)

+#define MIN_DISCARD_GRANULARITY (4 * KiB)
+
+static int nvme_ns_init(NvmeNamespace *ns, Error **errp)


Hmm the Error* argument could be squashed in "hw/block/nvme:
support multiple namespaces". Else better split patch in dumb
units IMHO (maybe a reviewer's taste).


  {
+BlockDriverInfo bdi;
  NvmeIdNs *id_ns = &ns->id_ns;
  int lba_index = NVME_ID_NS_FLBAS_INDEX(ns->id_ns.flbas);
+int npdg, ret;
  
  ns->id_ns.dlfeat = 0x9;
  
@@ -43,8 +47,25 @@ static void nvme_ns_init(NvmeNamespace *ns)

  id_ns->ncap = id_ns->nsze;
  id_ns->nuse = id_ns->ncap;
  
-/* support DULBE */

-id_ns->nsfeat |= 0x4;
+/* support DULBE and I/O optimization fields */
+id_ns->nsfeat |= (0x4 | 0x10);


The comment helps, but isn't needed if you use explicit definitions
for these flags. You already introduced the NVME_ID_NS_NSFEAT_DULBE
and NVME_ID_NS_FLBAS_EXTENDED but they are restricted to extract bits.
This is why I personally prefer the registerfields API (see
"hw/registerfields.h").


+
+npdg = ns->blkconf.discard_granularity / ns->blkconf.logical_block_size;
+
+ret = bdrv_get_info(blk_bs(ns->blkconf.blk), &bdi);
+if (ret < 0) {
+error_setg_errno(errp, -ret, "could not get block driver info");
+return ret;
+}
+
+if (bdi.cluster_size &&
+bdi.cluster_size > ns->blkconf.discard_granularity) {
+npdg = bdi.cluster_size / ns->blkconf.logical_block_size;
+}
+
+id_ns->npda = id_ns->npdg = npdg - 1;
+
+return 0;
  }
  
  static int nvme_ns_init_blk(NvmeCtrl *n, 

Re: qcow2 overlay performance

2020-10-22 Thread Yoonho Park
I am still seeing the performance degradation, but I did find something
interesting (and promising) with qemu 5.1.50. Enabling the subcluster
allocation support in qemu 5.1.50 (extended_l2=on) eliminates the
performance degradation of adding an overlay. Without subcluster allocation
enabled, I still see the performance degradation in qemu 5.1.50 when adding
an overlay. For these experiments, I used 64K blocks and 2M qcow2 cluster
size.

On Mon, Oct 19, 2020 at 12:51 PM Alberto Garcia  wrote:

> On Thu 27 Aug 2020 06:29:15 PM CEST, Yoonho Park wrote:
> > Below is the data with the cache disabled ("virsh attach-disk ... --cache
> > none"). I added the previous data for reference. Overall, random read
> > performance was not affected significantly. This makes sense because a
> > cache is probably not going to help random read performance much. BTW how
> > big the cache is by default? Random write performance for 4K blocks seems
> > more "sane" now. Random write performance for 64K blocks is interesting
> > because base image (0 overlay) performance is 2X slower than 1-5
> overlays.
> > We believe this is because the random writes to an overlay actually turn
> > into sequential writes (appends to the overlay). Does this make sense?
> >
> >
> > NO CACHE
> >
> >   4K blocks64K blocks
> >
> > olays rd bw rd iops wr bw  wr iops rd bw rd iops wr bw  wr iops
> >
> > 0 4478  11194684   117157001 890 42050  657
> >
> > 1 4490  11222503   625 56656 885 93483  1460
>
> I haven't been able to reproduce this (I tried the scenarios with 0 and
> 1 overlays), did you figure out anything new or what's the situation?
>
> Berto
>


Re: [PATCH v2 08/20] block/block-copy: add block_copy_cancel

2020-10-22 Thread Vladimir Sementsov-Ogievskiy

22.07.2020 14:28, Max Reitz wrote:

On 01.06.20 20:11, Vladimir Sementsov-Ogievskiy wrote:

Add function to cancel running async block-copy call. It will be used
in backup.

Signed-off-by: Vladimir Sementsov-Ogievskiy 
---
  include/block/block-copy.h |  7 +++
  block/block-copy.c | 22 +++---
  2 files changed, 26 insertions(+), 3 deletions(-)

diff --git a/include/block/block-copy.h b/include/block/block-copy.h
index d40e691123..370a194d3c 100644
--- a/include/block/block-copy.h
+++ b/include/block/block-copy.h
@@ -67,6 +67,13 @@ BlockCopyCallState *block_copy_async(BlockCopyState *s,
  void block_copy_set_speed(BlockCopyState *s, BlockCopyCallState *call_state,
uint64_t speed);
  
+/*

+ * Cancel running block-copy call.
+ * Cancel leaves block-copy state valid: dirty bits are correct and you may use
+ * cancel +  to emulate pause/resume.
+ */
+void block_copy_cancel(BlockCopyCallState *call_state);
+
  BdrvDirtyBitmap *block_copy_dirty_bitmap(BlockCopyState *s);
  void block_copy_set_skip_unallocated(BlockCopyState *s, bool skip);
  
diff --git a/block/block-copy.c b/block/block-copy.c

index 851d9c8aaf..b551feb6c2 100644
--- a/block/block-copy.c
+++ b/block/block-copy.c
@@ -44,6 +44,8 @@ typedef struct BlockCopyCallState {
  bool failed;
  bool finished;
  QemuCoSleepState *sleep_state;
+bool cancelled;
+Coroutine *canceller;
  
  /* OUT parameters */

  bool error_is_read;
@@ -582,7 +584,7 @@ block_copy_dirty_clusters(BlockCopyCallState *call_state)
  assert(QEMU_IS_ALIGNED(offset, s->cluster_size));
  assert(QEMU_IS_ALIGNED(bytes, s->cluster_size));
  
-while (bytes && aio_task_pool_status(aio) == 0) {

+while (bytes && aio_task_pool_status(aio) == 0 && !call_state->cancelled) {
  BlockCopyTask *task;
  int64_t status_bytes;
  
@@ -693,7 +695,7 @@ static int coroutine_fn block_copy_common(BlockCopyCallState *call_state)

  do {
  ret = block_copy_dirty_clusters(call_state);
  
-if (ret == 0) {

+if (ret == 0 && !call_state->cancelled) {
  ret = block_copy_wait_one(call_state->s, call_state->offset,
call_state->bytes);
  }
@@ -707,13 +709,18 @@ static int coroutine_fn 
block_copy_common(BlockCopyCallState *call_state)
   * 2. We have waited for some intersecting block-copy request
   *It may have failed and produced new dirty bits.
   */
-} while (ret > 0);
+} while (ret > 0 && !call_state->cancelled);


Would it be cleaner if block_copy_dirty_cluster() just returned
-ECANCELED?  Or would that pose a problem for its callers or the async
callback?



I'd prefer not to merge the I/O return value with block-copy logic: who knows what 
underlying operations may return. Can't it be _another_ ECANCELED?
And it would be just sugar for the block_copy_dirty_clusters() call; I'll have to 
check ->cancelled after block_copy_wait_one() anyway.
Also, for the next version I'll try to make it more obvious that a finished 
block-copy call is in one of three states:
 - success
 - failed
 - cancelled

Hmm. Also, cancelled should be OK for copy-on-write operations in the filter; it 
just means that we don't need to care anymore.


  if (call_state->cb) {
  call_state->cb(ret, call_state->error_is_read,
 call_state->s->progress_opaque);
  }
  
+if (call_state->canceller) {

+aio_co_wake(call_state->canceller);
+call_state->canceller = NULL;
+}
+
  call_state->finished = true;
  
  return ret;





--
Best regards,
Vladimir



Re: [PATCH v3 12/17] hw/block/nvme: add support for scatter gather lists

2020-10-22 Thread Philippe Mathieu-Daudé

On 9/22/20 10:45 AM, Klaus Jensen wrote:

From: Klaus Jensen 

For now, support the Data Block, Segment and Last Segment descriptor
types.

See NVM Express 1.3d, Section 4.4 ("Scatter Gather List (SGL)").

Signed-off-by: Klaus Jensen 
Reviewed-by: Keith Busch 
---
  include/block/nvme.h  |   6 +-
  hw/block/nvme.c   | 329 ++
  hw/block/trace-events |   4 +
  3 files changed, 279 insertions(+), 60 deletions(-)

diff --git a/include/block/nvme.h b/include/block/nvme.h
index 65e68a82c897..58647bcdad0b 100644
--- a/include/block/nvme.h
+++ b/include/block/nvme.h
@@ -412,9 +412,9 @@ typedef union NvmeCmdDptr {
  } NvmeCmdDptr;
  
  enum NvmePsdt {

-PSDT_PRP = 0x0,
-PSDT_SGL_MPTR_CONTIGUOUS = 0x1,
-PSDT_SGL_MPTR_SGL= 0x2,
+NVME_PSDT_PRP = 0x0,
+NVME_PSDT_SGL_MPTR_CONTIGUOUS = 0x1,
+NVME_PSDT_SGL_MPTR_SGL= 0x2,
  };
  
  typedef struct QEMU_PACKED NvmeCmd {

diff --git a/hw/block/nvme.c b/hw/block/nvme.c
index 3b901efd1ec0..c5d09ff1edf5 100644
--- a/hw/block/nvme.c
+++ b/hw/block/nvme.c
@@ -413,13 +413,262 @@ static uint16_t nvme_map_prp(NvmeCtrl *n, uint64_t prp1, 
uint64_t prp2,
  return NVME_SUCCESS;
  }
  
-static uint16_t nvme_dma_prp(NvmeCtrl *n, uint8_t *ptr, uint32_t len,

- uint64_t prp1, uint64_t prp2, DMADirection dir,
+/*
+ * Map 'nsgld' data descriptors from 'segment'. The function will subtract the
+ * number of bytes mapped in len.
+ */
+static uint16_t nvme_map_sgl_data(NvmeCtrl *n, QEMUSGList *qsg,
+  QEMUIOVector *iov,
+  NvmeSglDescriptor *segment, uint64_t nsgld,
+  size_t *len, NvmeRequest *req)
+{
+dma_addr_t addr, trans_len;
+uint32_t dlen;
+uint16_t status;
+
+for (int i = 0; i < nsgld; i++) {
+uint8_t type = NVME_SGL_TYPE(segment[i].type);
+
+switch (type) {
+case NVME_SGL_DESCR_TYPE_DATA_BLOCK:
+break;
+case NVME_SGL_DESCR_TYPE_SEGMENT:
+case NVME_SGL_DESCR_TYPE_LAST_SEGMENT:
+return NVME_INVALID_NUM_SGL_DESCRS | NVME_DNR;
+default:
+return NVME_SGL_DESCR_TYPE_INVALID | NVME_DNR;
+}
+
+dlen = le32_to_cpu(segment[i].len);
+if (!dlen) {
+continue;
+}
+
+if (*len == 0) {
+/*
+ * All data has been mapped, but the SGL contains additional
+ * segments and/or descriptors. The controller might accept
+ * ignoring the rest of the SGL.
+ */
+uint16_t sgls = le16_to_cpu(n->id_ctrl.sgls);
+if (sgls & NVME_CTRL_SGLS_EXCESS_LENGTH) {
+break;
+}
+
+trace_pci_nvme_err_invalid_sgl_excess_length(nvme_cid(req));
+return NVME_DATA_SGL_LEN_INVALID | NVME_DNR;
+}
+
+trans_len = MIN(*len, dlen);
+addr = le64_to_cpu(segment[i].addr);
+
+if (UINT64_MAX - addr < dlen) {
+return NVME_DATA_SGL_LEN_INVALID | NVME_DNR;
+}
+
+status = nvme_map_addr(n, qsg, iov, addr, trans_len);
+if (status) {
+return status;
+}
+
+*len -= trans_len;
+}
+
+return NVME_SUCCESS;
+}
+
+static uint16_t nvme_map_sgl(NvmeCtrl *n, QEMUSGList *qsg, QEMUIOVector *iov,
+ NvmeSglDescriptor sgl, size_t len,
   NvmeRequest *req)
+{
+/*
+ * Read the segment in chunks of 256 descriptors (one 4k page) to avoid
+ * dynamically allocating a potentially huge SGL. The spec allows the SGL
+ * to be larger (as in number of bytes required to describe the SGL
+ * descriptors and segment chain) than the command transfer size, so it is
+ * not bounded by MDTS.
+ */
+const int SEG_CHUNK_SIZE = 256;
+
+NvmeSglDescriptor segment[SEG_CHUNK_SIZE], *sgld, *last_sgld;
+uint64_t nsgld;
+uint32_t seg_len;
+uint16_t status;
+bool sgl_in_cmb = false;
+hwaddr addr;
+int ret;
+
+sgld = segment;
+addr = le64_to_cpu(sgl.addr);
+
+trace_pci_nvme_map_sgl(nvme_cid(req), NVME_SGL_TYPE(sgl.type), len);
+
+/*
+ * If the entire transfer can be described with a single data block it can
+ * be mapped directly.
+ */
+if (NVME_SGL_TYPE(sgl.type) == NVME_SGL_DESCR_TYPE_DATA_BLOCK) {
+status = nvme_map_sgl_data(n, qsg, iov, sgld, 1, , req);
+if (status) {
+goto unmap;
+}
+
+goto out;
+}
+
+/*
+ * If the segment is located in the CMB, the submission queue of the
+ * request must also reside there.
+ */
+if (nvme_addr_is_cmb(n, addr)) {
+if (!nvme_addr_is_cmb(n, req->sq->dma_addr)) {
+return NVME_INVALID_USE_OF_CMB | NVME_DNR;
+}
+
+sgl_in_cmb = true;
+}
+
+for (;;) {
+switch (NVME_SGL_TYPE(sgld->type)) {
+  

Re: [PATCH v2 11/20] qapi: backup: add x-max-chunk and x-max-workers parameters

2020-10-22 Thread Vladimir Sementsov-Ogievskiy

22.07.2020 15:22, Max Reitz wrote:

On 01.06.20 20:11, Vladimir Sementsov-Ogievskiy wrote:

Add new parameters to configure future backup features. The patch
introduces neither aio backup requests (so we actually have only one
worker) nor requests larger than one cluster. Still, formally we
satisfy these maximums anyway, so add the parameters now to facilitate
a further patch which will really change backup job behavior.

Options are added with x- prefixes, as their only use is in some
very conservative iotests which will be updated soon.

Signed-off-by: Vladimir Sementsov-Ogievskiy 
---
  qapi/block-core.json  |  9 -
  include/block/block_int.h |  7 +++
  block/backup.c| 21 +
  block/replication.c   |  2 +-
  blockdev.c|  5 +
  5 files changed, 42 insertions(+), 2 deletions(-)



[..]


@@ -422,6 +436,11 @@ BlockJob *backup_job_create(const char *job_id, BlockDriverState *bs,
  if (cluster_size < 0) {
  goto error;
  }
+if (max_chunk && max_chunk < cluster_size) {
+error_setg(errp, "Required max-chunk (%" PRIi64") is less than backup "


(missing a space after PRIi64)


+   "cluster size (%" PRIi64 ")", max_chunk, cluster_size);


Should this be noted in the QAPI documentation?


Hmm.. It makes sense, but I don't know what to write: should it be >= job
cluster_size? But then I'd have to describe the logic of
backup_calculate_cluster_size()...


 (And perhaps the fact
that without copy offloading, we’ll never copy anything bigger than one
cluster at a time anyway?)


This is a parameter for background copying. Look at block_copy_task_create():
if call_state has its own max_chunk, it is used instead of the common copy_size
(derived from cluster_size). But at the moment of this patch, the background
process through async block-copy is not yet implemented, so the parameter
doesn't work, which is described in the commit message.
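
The validation in the hunk above and the chunk-size selection described here can be sketched in Python (illustrative names only; the real logic lives in the C code of blockdev.c and block/block-copy.c):

```python
# Illustrative sketch, not the actual QEMU code: the max-chunk
# validation from the hunk above, plus the chunk-size selection that
# block_copy_task_create() is described as performing.

def validate_max_chunk(max_chunk: int, cluster_size: int) -> None:
    # max_chunk == 0 means "no limit requested", so it always passes.
    if max_chunk and max_chunk < cluster_size:
        raise ValueError(f"Required max-chunk ({max_chunk}) is less than "
                         f"backup cluster size ({cluster_size})")


def task_chunk_size(copy_size: int, call_max_chunk: int) -> int:
    # A per-call max_chunk, when set, overrides the common copy_size
    # (which is derived from cluster_size).
    return call_max_chunk if call_max_chunk else copy_size
```

So with a 64 KiB cluster size, a max-chunk of 4096 is rejected, while 0 (unset) or anything >= 64 KiB is accepted.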




+return NULL;
+}
  
  /*


[..]


--
Best regards,
Vladimir



[PATCH v5 07/12] iotests: define group in each iotest

2020-10-22 Thread Vladimir Sementsov-Ogievskiy
We are going to drop the group file. Define groups in the tests themselves
as a preparatory step.

The patch is generated by

cd tests/qemu-iotests

grep '^[0-9]\{3\} ' group | while read line; do
file=$(awk '{print $1}' <<< "$line");
groups=$(sed -e 's/^... //' <<< "$line");
awk "NR==2{print \"# group: $groups\"}1" $file > tmp;
cat tmp > $file;
done
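
The shell pipeline above can be mirrored in Python for clarity (a hypothetical helper, not part of the patch):

```python
# Hypothetical helper mirroring the shell pipeline above: insert a
# "# group: ..." comment as the second line of a test file's text.

def add_group_line(text: str, groups: str) -> str:
    lines = text.splitlines(keepends=True)
    # Line 1 is the shebang; the group comment becomes line 2.
    lines.insert(1, f'# group: {groups}\n')
    return ''.join(lines)
```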

Signed-off-by: Vladimir Sementsov-Ogievskiy 
---
 tests/qemu-iotests/001 | 1 +
 tests/qemu-iotests/002 | 1 +
 tests/qemu-iotests/003 | 1 +
 tests/qemu-iotests/004 | 1 +
 tests/qemu-iotests/005 | 1 +
 tests/qemu-iotests/007 | 1 +
 tests/qemu-iotests/008 | 1 +
 tests/qemu-iotests/009 | 1 +
 tests/qemu-iotests/010 | 1 +
 tests/qemu-iotests/011 | 1 +
 tests/qemu-iotests/012 | 1 +
 tests/qemu-iotests/013 | 1 +
 tests/qemu-iotests/014 | 1 +
 tests/qemu-iotests/015 | 1 +
 tests/qemu-iotests/017 | 1 +
 tests/qemu-iotests/018 | 1 +
 tests/qemu-iotests/019 | 1 +
 tests/qemu-iotests/020 | 1 +
 tests/qemu-iotests/021 | 1 +
 tests/qemu-iotests/022 | 1 +
 tests/qemu-iotests/023 | 1 +
 tests/qemu-iotests/024 | 1 +
 tests/qemu-iotests/025 | 1 +
 tests/qemu-iotests/026 | 1 +
 tests/qemu-iotests/027 | 1 +
 tests/qemu-iotests/028 | 1 +
 tests/qemu-iotests/029 | 1 +
 tests/qemu-iotests/030 | 1 +
 tests/qemu-iotests/031 | 1 +
 tests/qemu-iotests/032 | 1 +
 tests/qemu-iotests/033 | 1 +
 tests/qemu-iotests/034 | 1 +
 tests/qemu-iotests/035 | 1 +
 tests/qemu-iotests/036 | 1 +
 tests/qemu-iotests/037 | 1 +
 tests/qemu-iotests/038 | 1 +
 tests/qemu-iotests/039 | 1 +
 tests/qemu-iotests/040 | 1 +
 tests/qemu-iotests/041 | 1 +
 tests/qemu-iotests/042 | 1 +
 tests/qemu-iotests/043 | 1 +
 tests/qemu-iotests/044 | 1 +
 tests/qemu-iotests/045 | 1 +
 tests/qemu-iotests/046 | 1 +
 tests/qemu-iotests/047 | 1 +
 tests/qemu-iotests/048 | 1 +
 tests/qemu-iotests/049 | 1 +
 tests/qemu-iotests/050 | 1 +
 tests/qemu-iotests/051 | 1 +
 tests/qemu-iotests/052 | 1 +
 tests/qemu-iotests/053 | 1 +
 tests/qemu-iotests/054 | 1 +
 tests/qemu-iotests/055 | 1 +
 tests/qemu-iotests/056 | 1 +
 tests/qemu-iotests/057 | 1 +
 tests/qemu-iotests/058 | 1 +
 tests/qemu-iotests/059 | 1 +
 tests/qemu-iotests/060 | 1 +
 tests/qemu-iotests/061 | 1 +
 tests/qemu-iotests/062 | 1 +
 tests/qemu-iotests/063 | 1 +
 tests/qemu-iotests/064 | 1 +
 tests/qemu-iotests/065 | 1 +
 tests/qemu-iotests/066 | 1 +
 tests/qemu-iotests/068 | 1 +
 tests/qemu-iotests/069 | 1 +
 tests/qemu-iotests/070 | 1 +
 tests/qemu-iotests/071 | 1 +
 tests/qemu-iotests/072 | 1 +
 tests/qemu-iotests/073 | 1 +
 tests/qemu-iotests/074 | 1 +
 tests/qemu-iotests/075 | 1 +
 tests/qemu-iotests/076 | 1 +
 tests/qemu-iotests/077 | 1 +
 tests/qemu-iotests/078 | 1 +
 tests/qemu-iotests/079 | 1 +
 tests/qemu-iotests/080 | 1 +
 tests/qemu-iotests/081 | 1 +
 tests/qemu-iotests/082 | 1 +
 tests/qemu-iotests/083 | 1 +
 tests/qemu-iotests/084 | 1 +
 tests/qemu-iotests/085 | 1 +
 tests/qemu-iotests/086 | 1 +
 tests/qemu-iotests/087 | 1 +
 tests/qemu-iotests/088 | 1 +
 tests/qemu-iotests/089 | 1 +
 tests/qemu-iotests/090 | 1 +
 tests/qemu-iotests/091 | 1 +
 tests/qemu-iotests/092 | 1 +
 tests/qemu-iotests/093 | 1 +
 tests/qemu-iotests/094 | 1 +
 tests/qemu-iotests/095 | 1 +
 tests/qemu-iotests/096 | 1 +
 tests/qemu-iotests/097 | 1 +
 tests/qemu-iotests/098 | 1 +
 tests/qemu-iotests/099 | 1 +
 tests/qemu-iotests/101 | 1 +
 tests/qemu-iotests/102 | 1 +
 tests/qemu-iotests/103 | 1 +
 tests/qemu-iotests/104 | 1 +
 tests/qemu-iotests/105 | 1 +
 tests/qemu-iotests/106 | 1 +
 tests/qemu-iotests/107 | 1 +
 tests/qemu-iotests/108 | 1 +
 tests/qemu-iotests/109 | 1 +
 tests/qemu-iotests/110 | 1 +
 tests/qemu-iotests/111 | 1 +
 tests/qemu-iotests/112 | 1 +
 tests/qemu-iotests/113 | 1 +
 tests/qemu-iotests/114 | 1 +
 tests/qemu-iotests/115 | 1 +
 tests/qemu-iotests/116 | 1 +
 tests/qemu-iotests/117 | 1 +
 tests/qemu-iotests/118 | 1 +
 tests/qemu-iotests/119 | 1 +
 tests/qemu-iotests/120 | 1 +
 tests/qemu-iotests/121 | 1 +
 tests/qemu-iotests/122 | 1 +
 tests/qemu-iotests/123 | 1 +
 tests/qemu-iotests/124 | 1 +
 tests/qemu-iotests/125 | 1 +
 tests/qemu-iotests/126 | 1 +
 tests/qemu-iotests/127 | 1 +
 tests/qemu-iotests/128 | 1 +
 tests/qemu-iotests/129 | 1 +
 tests/qemu-iotests/130 | 1 +
 tests/qemu-iotests/131 | 1 +
 tests/qemu-iotests/132 | 1 +
 tests/qemu-iotests/133 | 1 +
 tests/qemu-iotests/134 | 1 +
 tests/qemu-iotests/135 | 1 +
 tests/qemu-iotests/136 | 1 +
 tests/qemu-iotests/137 | 1 +
 tests/qemu-iotests/138 | 1 +
 tests/qemu-iotests/139 | 1 +
 tests/qemu-iotests/140 | 1 +
 tests/qemu-iotests/141 | 1 +
 tests/qemu-iotests/143 | 1 +
 tests/qemu-iotests/144 | 1 +
 tests/qemu-iotests/145 | 1 +
 tests/qemu-iotests/146 | 1 +
 tests/qemu-iotests/147 | 1 +
 tests/qemu-iotests/148 | 1 +
 tests/qemu-iotests/149 | 1 +
 tests/qemu-iotests/150 | 1 +
 tests/qemu-iotests/151 | 1 +
 tests/qemu-iotests/152 | 1 +
 tests/qemu-iotests/153 | 1 +
 tests/qemu-iotests/154 | 1 +
 tests/qemu-iotests/155 | 1 +
 tests/qemu-iotests/156 | 1 +
 tests/qemu-iotests/157 | 1 +
 

[PATCH v5 11/12] iotests: rewrite check into python

2020-10-22 Thread Vladimir Sementsov-Ogievskiy
Just use the classes introduced in the previous three commits. Behavior
differences are described in those three commits.

Drop the group file, as it becomes unused.

Signed-off-by: Vladimir Sementsov-Ogievskiy 
---
 tests/qemu-iotests/check | 977 ++-
 tests/qemu-iotests/group | 317 -
 2 files changed, 28 insertions(+), 1266 deletions(-)
 delete mode 100644 tests/qemu-iotests/group

diff --git a/tests/qemu-iotests/check b/tests/qemu-iotests/check
index 678b6e4910..48bb3128c3 100755
--- a/tests/qemu-iotests/check
+++ b/tests/qemu-iotests/check
@@ -1,7 +1,8 @@
-#!/usr/bin/env bash
+#!/usr/bin/env python3
 #
-# Copyright (C) 2009 Red Hat, Inc.
-# Copyright (c) 2000-2002,2006 Silicon Graphics, Inc.  All Rights Reserved.
+# Configure environment and run group of tests in it.
+#
+# Copyright (c) 2020 Virtuozzo International GmbH
 #
 # This program is free software; you can redistribute it and/or
 # modify it under the terms of the GNU General Public License as
@@ -14,950 +15,28 @@
 #
 # You should have received a copy of the GNU General Public License
# along with this program.  If not, see <http://www.gnu.org/licenses/>.
-#
-#
-# Control script for QA
-#
-
-status=0
-needwrap=true
-try=0
-n_bad=0
-bad=""
-notrun=""
-casenotrun=""
-interrupt=true
-makecheck=false
-
-_init_error()
-{
-echo "check: $1" >&2
-exit 1
-}
-
-if [ -L "$0" ]
-then
-# called from the build tree
-source_iotests=$(dirname "$(readlink "$0")")
-if [ -z "$source_iotests" ]
-then
-_init_error "failed to obtain source tree name from check symlink"
-fi
-source_iotests=$(cd "$source_iotests"; pwd) || _init_error "failed to enter source tree"
-build_iotests=$(cd "$(dirname "$0")"; pwd)
-else
-# called from the source tree
-source_iotests=$PWD
-# this may be an in-tree build (note that in the following code we may not
-# assume that it truly is and have to test whether the build results
-# actually exist)
-build_iotests=$PWD
-fi
-
-build_root="$build_iotests/../.."
-
-# we need common.env
-if ! . "$build_iotests/common.env"
-then
-_init_error "failed to source common.env (make sure the qemu-iotests are run from tests/qemu-iotests in the build tree)"
-fi
-
-# we need common.config
-if ! . "$source_iotests/common.config"
-then
-_init_error "failed to source common.config"
-fi
-
-_full_imgfmt_details()
-{
-if [ -n "$IMGOPTS" ]; then
-echo "$IMGFMT ($IMGOPTS)"
-else
-echo "$IMGFMT"
-fi
-}
-
-_full_platform_details()
-{
-os=$(uname -s)
-host=$(hostname -s)
-kernel=$(uname -r)
-platform=$(uname -m)
-echo "$os/$platform $host $kernel"
-}
-
-_full_env_details()
-{
-cat < /dev/null)
-if [ -n "$p" -a -x "$p" ]; then
-type -p "$p"
-else
-return 1
-fi
-}
-
-if [ -z "$TEST_DIR" ]; then
-TEST_DIR=$PWD/scratch
-fi
-mkdir -p "$TEST_DIR" || _init_error 'Failed to create TEST_DIR'
-
-tmp_sock_dir=false
-if [ -z "$SOCK_DIR" ]; then
-SOCK_DIR=$(mktemp -d)
-tmp_sock_dir=true
-fi
-mkdir -p "$SOCK_DIR" || _init_error 'Failed to create SOCK_DIR'
-
-diff="diff -u"
-verbose=false
-debug=false
-group=false
-xgroup=false
-imgopts=false
-showme=false
-sortme=false
-expunge=true
-have_test_arg=false
-cachemode=false
-aiomode=false
-
-tmp="${TEST_DIR}"/$$
-rm -f $tmp.list $tmp.tmp $tmp.sed
-
-export IMGFMT=raw
-export IMGFMT_GENERIC=true
-export IMGPROTO=file
-export IMGOPTS=""
-export CACHEMODE="writeback"
-export AIOMODE="threads"
-export QEMU_IO_OPTIONS=""
-export QEMU_IO_OPTIONS_NO_FMT=""
-export CACHEMODE_IS_DEFAULT=true
-export VALGRIND_QEMU=
-export IMGKEYSECRET=
-export IMGOPTSSYNTAX=false
-
-# Save current tty settings, since an aborting qemu call may leave things
-# screwed up
-STTY_RESTORE=
-if test -t 0; then
-STTY_RESTORE=$(stty -g)
-fi
-
-for r
-do
-
-if $group
-then
-# arg after -g
-group_list=$(sed -n <"$source_iotests/group" -e 's/$/ /' -e "/^[0-9][0-9][0-9].* $r /"'{
-s/ .*//p
-}')
-if [ -z "$group_list" ]
-then
-echo "Group \"$r\" is empty or not defined?"
-exit 1
-fi
-[ ! -s $tmp.list ] && touch $tmp.list
-for t in $group_list
-do
-if grep -s "^$t\$" $tmp.list >/dev/null
-then
-:
-else
-echo "$t" >>$tmp.list
-fi
-done
-group=false
-continue
-
-elif $xgroup
-then
-# arg after -x
-# Populate $tmp.list with all tests
-awk '/^[0-9]{3,}/ {print $1}' "${source_iotests}/group" > $tmp.list 2>/dev/null
-group_list=$(sed -n <"$source_iotests/group" -e 's/$/ /' -e "/^[0-9][0-9][0-9].* $r /"'{
-s/ .*//p
-}')
-if [ -z "$group_list" ]
-then
-echo "Group \"$r\" is empty or not defined?"
-exit 1
-fi
-numsed=0
-rm -f $tmp.sed
-for t in $group_list
-

[PATCH v5 09/12] iotests: add testenv.py

2020-10-22 Thread Vladimir Sementsov-Ogievskiy
Add a TestEnv class, which will handle the test environment in the new
python iotests running framework.

Differences from the current ./check interface:
- the -v (verbose) option is dropped, as it is unused

- the -xdiff option is dropped, until somebody complains that it is needed
- same for the -n option

Signed-off-by: Vladimir Sementsov-Ogievskiy 
---
 tests/qemu-iotests/testenv.py | 325 ++
 1 file changed, 325 insertions(+)
 create mode 100755 tests/qemu-iotests/testenv.py

diff --git a/tests/qemu-iotests/testenv.py b/tests/qemu-iotests/testenv.py
new file mode 100755
index 00..97c75f70df
--- /dev/null
+++ b/tests/qemu-iotests/testenv.py
@@ -0,0 +1,325 @@
+#!/usr/bin/env python3
+#
+# Parse command line options to manage test environment variables.
+#
+# Copyright (c) 2020 Virtuozzo International GmbH
+#
+# This program is free software; you can redistribute it and/or modify
+# it under the terms of the GNU General Public License as published by
+# the Free Software Foundation; either version 2 of the License, or
+# (at your option) any later version.
+#
+# This program is distributed in the hope that it will be useful,
+# but WITHOUT ANY WARRANTY; without even the implied warranty of
+# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+# GNU General Public License for more details.
+#
+# You should have received a copy of the GNU General Public License
+# along with this program.  If not, see <http://www.gnu.org/licenses/>.
+#
+
+import os
+import sys
+import tempfile
+from pathlib import Path
+import shutil
+import collections
+import subprocess
+import argparse
+from typing import List, Dict
+
+
+def get_default_machine(qemu_prog: str) -> str:
+outp = subprocess.run([qemu_prog, '-machine', 'help'], check=True,
+  text=True, stdout=subprocess.PIPE).stdout
+
+machines = outp.split('\n')
+default_machine = next(m for m in machines if m.endswith(' (default)'))
+default_machine = default_machine.split(' ', 1)[0]
+
+alias_suf = ' (alias of {})'.format(default_machine)
+alias = next((m for m in machines if m.endswith(alias_suf)), None)
+if alias is not None:
+default_machine = alias.split(' ', 1)[0]
+
+return default_machine
+
+
+class TestEnv:
+"""
+Manage system environment for running tests
+
+The following variables are supported/provided. They are represented by
+lower-cased TestEnv attributes.
+"""
+env_variables = ['PYTHONPATH', 'TEST_DIR', 'SOCK_DIR', 'SAMPLE_IMG_DIR',
+ 'OUTPUT_DIR', 'PYTHON', 'QEMU_PROG', 'QEMU_IMG_PROG',
+ 'QEMU_IO_PROG', 'QEMU_NBD_PROG',
+ 'SOCKET_SCM_HELPER', 'QEMU_OPTIONS', 'QEMU_IMG_OPTIONS',
+ 'QEMU_IO_OPTIONS', 'QEMU_NBD_OPTIONS', 'IMGOPTS',
+ 'IMGFMT', 'IMGPROTO', 'AIOMODE', 'CACHEMODE',
+ 'VALGRIND_QEMU', 'CACHEMODE_IS_DEFAULT', 'IMGFMT_GENERIC',
+ 'IMGOPTSSYNTAX', 'IMGKEYSECRET', 'QEMU_DEFAULT_MACHINE']
+
+def get_env(self) -> Dict[str, str]:
+env = {}
+for v in self.env_variables:
+val = getattr(self, v.lower(), None)
+if val is not None:
+env[v] = val
+
+return env
+
+_argparser = None
+@classmethod
+def get_argparser(cls) -> argparse.ArgumentParser:
+if cls._argparser is not None:
+return cls._argparser
+
+p = argparse.ArgumentParser(description="= test environment options =",
+add_help=False, usage=argparse.SUPPRESS)
+
+p.add_argument('-d', dest='debug', action='store_true', help='debug')
+p.add_argument('-misalign', action='store_true',
+   help='misalign memory allocations')
+
+p.set_defaults(imgfmt='raw', imgproto='file')
+
+format_list = ['raw', 'bochs', 'cloop', 'parallels', 'qcow', 'qcow2',
+   'qed', 'vdi', 'vpc', 'vhdx', 'vmdk', 'luks', 'dmg']
+g = p.add_argument_group(
+'image format options',
+'The following options set the IMGFMT environment variable. '
+'At most one choice is allowed; the default is "raw"')
+g = g.add_mutually_exclusive_group()
+for fmt in format_list:
+g.add_argument('-' + fmt, dest='imgfmt', action='store_const',
+   const=fmt)
+
+protocol_list = ['file', 'rbd', 'sheepdog', 'nbd', 'ssh', 'nfs']
+g = p.add_argument_group(
+'image protocol options',
+'The following options set the IMGPROTO environment variable. '
+'At most one choice is allowed; the default is "file"')
+g = g.add_mutually_exclusive_group()
+for prt in protocol_list:
+g.add_argument('-' + prt, dest='imgproto', action='store_const',
+   const=prt)
+
+g = p.add_mutually_exclusive_group()
+# We don't set default for 

[PATCH RFC v5 12/12] iotests: rename and move 169 and 199 tests

2020-10-22 Thread Vladimir Sementsov-Ogievskiy
Rename the bitmap migration tests and move them to the tests subdirectory
to demonstrate the new human-friendly test naming.

Signed-off-by: Vladimir Sementsov-Ogievskiy 
---
 tests/qemu-iotests/{199 => tests/migrate-bitmaps-postcopy-test}   | 0
 .../{199.out => tests/migrate-bitmaps-postcopy-test.out}  | 0
 tests/qemu-iotests/{169 => tests/migrate-bitmaps-test}| 0
 tests/qemu-iotests/{169.out => tests/migrate-bitmaps-test.out}| 0
 4 files changed, 0 insertions(+), 0 deletions(-)
 rename tests/qemu-iotests/{199 => tests/migrate-bitmaps-postcopy-test} (100%)
 rename tests/qemu-iotests/{199.out => tests/migrate-bitmaps-postcopy-test.out} (100%)
 rename tests/qemu-iotests/{169 => tests/migrate-bitmaps-test} (100%)
 rename tests/qemu-iotests/{169.out => tests/migrate-bitmaps-test.out} (100%)

diff --git a/tests/qemu-iotests/199 
b/tests/qemu-iotests/tests/migrate-bitmaps-postcopy-test
similarity index 100%
rename from tests/qemu-iotests/199
rename to tests/qemu-iotests/tests/migrate-bitmaps-postcopy-test
diff --git a/tests/qemu-iotests/199.out 
b/tests/qemu-iotests/tests/migrate-bitmaps-postcopy-test.out
similarity index 100%
rename from tests/qemu-iotests/199.out
rename to tests/qemu-iotests/tests/migrate-bitmaps-postcopy-test.out
diff --git a/tests/qemu-iotests/169 
b/tests/qemu-iotests/tests/migrate-bitmaps-test
similarity index 100%
rename from tests/qemu-iotests/169
rename to tests/qemu-iotests/tests/migrate-bitmaps-test
diff --git a/tests/qemu-iotests/169.out 
b/tests/qemu-iotests/tests/migrate-bitmaps-test.out
similarity index 100%
rename from tests/qemu-iotests/169.out
rename to tests/qemu-iotests/tests/migrate-bitmaps-test.out
-- 
2.21.3




[PATCH v5 06/12] iotests/294: add shebang line

2020-10-22 Thread Vladimir Sementsov-Ogievskiy
Signed-off-by: Vladimir Sementsov-Ogievskiy 
---
 tests/qemu-iotests/294 | 1 +
 1 file changed, 1 insertion(+)

diff --git a/tests/qemu-iotests/294 b/tests/qemu-iotests/294
index 9c95ed8c9a..4bdb7364af 100755
--- a/tests/qemu-iotests/294
+++ b/tests/qemu-iotests/294
@@ -1,3 +1,4 @@
+#!/usr/bin/env bash
 #
 # Copyright (C) 2019 Red Hat, Inc.
 #
-- 
2.21.3




[PATCH v5 08/12] iotests: add findtests.py

2020-10-22 Thread Vladimir Sementsov-Ogievskiy
Add a python script with the new logic for searching for tests:

Current ./check behavior:
 - tests are named [0-9][0-9][0-9]
 - tests must be registered in the group file (even if a test doesn't
   belong to any group, like 142)

Behavior of findtests.py:
 - the group file is dropped
 - tests are all files in the tests/ subdirectory (except for .out files),
   so there is no longer a need to "register the test"; just create it
   with an appropriate name in the tests/ subdirectory. Old names like
   [0-9][0-9][0-9] (in the root iotests directory) are supported too, but
   not recommended for new tests
 - groups are parsed from the '# group: ' line inside test files
 - an optional file group.local may be used to define some additional
   groups for downstreams
 - the 'disabled' group is used to temporarily disable tests. So instead
   of commenting out tests in the old 'group' file you can now add them
   to the disabled group with the help of the 'group.local' file
 - selecting test ranges like 5-15 is no longer supported
   (to support restarting a failed ./check command from the middle of the
   process, a new argument is added: --start-from)
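
The group parsing described above can be sketched like this (illustrative only; the names do not come from the actual findtests.py):

```python
# Illustrative sketch of the two group sources described above; this is
# not the actual findtests.py code.

def parse_test_groups(test_text: str) -> list:
    # By convention the "# group: ..." comment is the second line of the
    # test file, right after the shebang.
    lines = test_text.splitlines()
    if len(lines) >= 2 and lines[1].startswith('# group: '):
        return lines[1][len('# group: '):].split()
    return []


def parse_group_local(group_local_text: str) -> dict:
    # Each non-comment line is: TEST_NAME TEST_GROUP [TEST_GROUP ...]
    groups = {}
    for line in group_local_text.splitlines():
        line = line.strip()
        if not line or line.startswith('#'):
            continue
        name, *extra = line.split()
        groups[name] = extra
    return groups
```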

Benefits:
 - no rebase conflicts in the group file when porting patches from
   branch to branch
 - no conflicts in upstream when different series want to occupy the
   same test number
 - meaningful names for test files
   For example, with numeric names, when somebody wants to add a test
   about block-stream, they will most probably just create a new test.
   But if a test-block-stream test already existed, they would first
   look at it and maybe just add a test case to it.
   And anyway, meaningful names are better.

This commit doesn't update check behavior (which will be done in a
further commit); still, the documentation is changed as if the new
behavior were already here. Let's live with this small inconsistency
for the following few commits, until the final change.

The file findtests.py is self-executable and may be used for debugging
purposes.

Signed-off-by: Vladimir Sementsov-Ogievskiy 
---
 docs/devel/testing.rst  |  50 ++-
 tests/qemu-iotests/findtests.py | 229 
 2 files changed, 278 insertions(+), 1 deletion(-)
 create mode 100755 tests/qemu-iotests/findtests.py

diff --git a/docs/devel/testing.rst b/docs/devel/testing.rst
index 0c3e79d31c..b2a4f6ce42 100644
--- a/docs/devel/testing.rst
+++ b/docs/devel/testing.rst
@@ -111,7 +111,7 @@ check-block
 ---
 
 ``make check-block`` runs a subset of the block layer iotests (the tests that
-are in the "auto" group in ``tests/qemu-iotests/group``).
+are in the "auto" group).
 See the "QEMU iotests" section below for more information.
 
 GCC gcov support
@@ -224,6 +224,54 @@ another application on the host may have locked the file, possibly leading to a
 test failure.  If using such devices are explicitly desired, consider adding
 ``locking=off`` option to disable image locking.
 
+Test case groups
+
+
+Tests may belong to one or more groups, which you can define in a comment
+inside the test. By convention, test groups are listed on the second line of
+the test file, after the "#!/..." line, like this:
+
+.. code::
+
+  #!/usr/bin/env python3
+  # group: auto quick
+  #
+  ...
+
+An additional way of defining groups is to create a group.local file in
+tests/qemu-iotests. This should be used only for downstream (this file should
+never appear upstream). It may be used for defining some downstream test
+groups or for temporarily disabling tests, like this:
+
+.. code::
+
+  # groups for some company downstream process
+  #
+  # ci - tests to run on build
+  # down - our downstream tests, not for upstream
+  #
+  # Format of each line is:
+  # TEST_NAME TEST_GROUP [TEST_GROUP ]...
+
+  013 ci
+  210 disabled
+  215 disabled
+  our-ugly-workaround-test down ci
+
+The (not exhaustive) list of groups:
+
+- quick : Tests in this group should finish within a few seconds.
+
+- auto : Tests in this group are used during "make check" and should be
+  runnable in any case. That means they should run with every QEMU binary
+  (also non-x86), with every QEMU configuration (i.e. must not fail if
+  an optional feature is not compiled in - but reporting a "skip" is ok),
+  work at least with the qcow2 file format, work with all kind of host
+  filesystems and users (e.g. "nobody" or "root") and must not take too
+  much memory and disk space (since CI pipelines tend to fail otherwise).
+
+- disabled : Tests in this group are disabled and ignored by check.
+
 .. _docker-ref:
 
 Docker based tests
diff --git a/tests/qemu-iotests/findtests.py b/tests/qemu-iotests/findtests.py
new file mode 100755
index 00..b053db48e8
--- /dev/null
+++ b/tests/qemu-iotests/findtests.py
@@ -0,0 +1,229 @@
+#!/usr/bin/env python3
+#
+# Parse command line options to define set of tests to run.
+#
+# Copyright (c) 2020 Virtuozzo International GmbH
+#
+# This program is free software; you can redistribute it and/or modify
+# it under the terms of the GNU General Public 

[PATCH v5 10/12] iotests: add testrunner.py

2020-10-22 Thread Vladimir Sementsov-Ogievskiy
Add a TestRunner class, which will run tests in the new python iotests
running framework.

There are some differences from the current ./check behavior; the most
significant are:
- Consider all tests self-executable: just run them, don't run python
  by hand.
- Elapsed time is cached in a json file
- Elapsed time precision is increased a bit
- use python difflib instead of "diff -w"; to ignore spaces at line
  ends, strip lines by hand. Do not ignore other spaces.

Signed-off-by: Vladimir Sementsov-Ogievskiy 
---
 tests/qemu-iotests/testrunner.py | 351 +++
 1 file changed, 351 insertions(+)
 create mode 100644 tests/qemu-iotests/testrunner.py

diff --git a/tests/qemu-iotests/testrunner.py b/tests/qemu-iotests/testrunner.py
new file mode 100644
index 00..e395877882
--- /dev/null
+++ b/tests/qemu-iotests/testrunner.py
@@ -0,0 +1,351 @@
+# Class for actual tests running.
+#
+# Copyright (c) 2020 Virtuozzo International GmbH
+#
+# This program is free software; you can redistribute it and/or modify
+# it under the terms of the GNU General Public License as published by
+# the Free Software Foundation; either version 2 of the License, or
+# (at your option) any later version.
+#
+# This program is distributed in the hope that it will be useful,
+# but WITHOUT ANY WARRANTY; without even the implied warranty of
+# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+# GNU General Public License for more details.
+#
+# You should have received a copy of the GNU General Public License
+# along with this program.  If not, see <http://www.gnu.org/licenses/>.
+#
+
+import os
+import random
+from pathlib import Path
+import datetime
+import time
+import difflib
+import subprocess
+import collections
+import contextlib
+import json
+import argparse
+import termios
+import sys
+from contextlib import contextmanager
+from typing import List, Optional, Iterator
+
+from testenv import TestEnv
+
+
+def silent_unlink(path: Path) -> None:
+try:
+path.unlink()
+except OSError:
+pass
+
+
+def file_diff(file1: str, file2: str) -> List[str]:
+with open(file1) as f1, open(file2) as f2:
+# We want to ignore spaces at line ends. There is a lot of mess about
+# this in iotests.
+# TODO: fix all tests to not produce extra spaces, fix all .out files
+# and use strict diff here!
+seq1 = [line.rstrip() for line in f1]
+seq2 = [line.rstrip() for line in f2]
+return list(difflib.unified_diff(seq1, seq2, file1, file2))
+
+
+# We want to save current tty settings during test run,
+# since an aborting qemu call may leave things screwed up.
+@contextmanager
+def savetty() -> Iterator:
+isterm = sys.stdin.isatty()
+if isterm:
+fd = sys.stdin.fileno()
+attr = termios.tcgetattr(0)
+
+try:
+yield
+finally:
+if isterm:
+termios.tcsetattr(fd, termios.TCSADRAIN, attr)
+
+
+class LastElapsedTime:
+""" Cache of elapsed times for tests, to show them during a new test run
+
+Use get() at any time. But if you use update(), you should then call
+save(), or use update() inside a with-block.
+"""
+def __init__(self, cache_file: str, env: TestEnv) -> None:
+self.env = env
+self.cache_file = cache_file
+
+try:
+with open(cache_file) as f:
+self.cache = json.load(f)
+except (OSError, ValueError):
+self.cache = {}
+
+def get(self, test: str,
+default: Optional[float] = None) -> Optional[float]:
+if test not in self.cache:
+return default
+
+if self.env.imgproto not in self.cache[test]:
+return default
+
+return self.cache[test][self.env.imgproto].get(self.env.imgfmt,
+   default)
+
+def update(self, test: str, elapsed: float) -> None:
+d = self.cache.setdefault(test, {})
+d = d.setdefault(self.env.imgproto, {})
+d[self.env.imgfmt] = elapsed
+
+def save(self) -> None:
+with open(self.cache_file, 'w') as f:
+json.dump(self.cache, f)
+
+def __enter__(self) -> 'LastElapsedTime':
+return self
+
+def __exit__(self, *args) -> None:
+self.save()
+
+
+TestResult = collections.namedtuple(
+'TestResult',
+['status', 'description', 'elapsed', 'diff', 'casenotrun'],
+defaults=('', '', '', ''))
+
+
+class TestRunner:
+_argparser = None
+@classmethod
+def get_argparser(cls) -> argparse.ArgumentParser:
+if cls._argparser is not None:
+return cls._argparser
+
+p = argparse.ArgumentParser(description="= test running options =",
+add_help=False, usage=argparse.SUPPRESS)
+
+p.add_argument('-makecheck', action='store_true',
+   help='pretty print output for make check')
+
+cls._argparser = p
+return p
+
+def 

[PATCH v5 05/12] iotests/299: make executable

2020-10-22 Thread Vladimir Sementsov-Ogievskiy
Signed-off-by: Vladimir Sementsov-Ogievskiy 
---
 tests/qemu-iotests/299 | 0
 1 file changed, 0 insertions(+), 0 deletions(-)
 mode change 100644 => 100755 tests/qemu-iotests/299

diff --git a/tests/qemu-iotests/299 b/tests/qemu-iotests/299
old mode 100644
new mode 100755
-- 
2.21.3




[PATCH v5 01/12] iotests/277: use dot slash for nbd-fault-injector.py running

2020-10-22 Thread Vladimir Sementsov-Ogievskiy
If you run './check 277', check includes common.config which adjusts
$PATH to include '.' first, and therefore finds nbd-fault-injector.py
on PATH.  But if you run './277' directly, there is nothing to adjust
PATH, and if '.' is not already on your PATH by other means, the test
fails because the executable is not found.  Adjust how we invoke the
helper executable to avoid needing a PATH search in the first place.

Signed-off-by: Vladimir Sementsov-Ogievskiy 
Reviewed-by: Eric Blake 
---
 tests/qemu-iotests/277 | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/tests/qemu-iotests/277 b/tests/qemu-iotests/277
index d34f87021f..a39ce2d873 100755
--- a/tests/qemu-iotests/277
+++ b/tests/qemu-iotests/277
@@ -42,7 +42,7 @@ def make_conf_file(event):
 def start_server_NBD(event):
 make_conf_file(event)
 
-srv = subprocess.Popen(['nbd-fault-injector.py', '--classic-negotiation',
+srv = subprocess.Popen(['./nbd-fault-injector.py', '--classic-negotiation',
nbd_sock, conf_file], stdout=subprocess.PIPE,
stderr=subprocess.STDOUT, universal_newlines=True)
 line = srv.stdout.readline()
-- 
2.21.3




[PATCH v5 02/12] iotests/303: use dot slash for qcow2.py running

2020-10-22 Thread Vladimir Sementsov-Ogievskiy
If you run './check 303', check includes common.config which adjusts
$PATH to include '.' first, and therefore finds qcow2.py on PATH.  But
if you run './303' directly, there is nothing to adjust PATH, and if
'.' is not already on your PATH by other means, the test fails because
the executable is not found.  Adjust how we invoke the helper
executable to avoid needing a PATH search in the first place.

Signed-off-by: Vladimir Sementsov-Ogievskiy 
---
 tests/qemu-iotests/303 | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/tests/qemu-iotests/303 b/tests/qemu-iotests/303
index 6c21774483..11cd9eeb26 100755
--- a/tests/qemu-iotests/303
+++ b/tests/qemu-iotests/303
@@ -56,7 +56,7 @@ qemu_img_create('-f', iotests.imgfmt, disk, '10M')
 
 add_bitmap(1, 0, 6, False)
 add_bitmap(2, 6, 8, True)
-dump = ['qcow2.py', disk, 'dump-header']
+dump = ['./qcow2.py', disk, 'dump-header']
 subprocess.run(dump)
 # Dump the metadata in JSON format
 dump.append('-j')
-- 
2.21.3




[PATCH v5 03/12] iotests: fix some whitespaces in test output files

2020-10-22 Thread Vladimir Sementsov-Ogievskiy
We are going to be stricter about comparing test results with .out
files. So, fix some whitespace now.

Signed-off-by: Vladimir Sementsov-Ogievskiy 
---
 tests/qemu-iotests/175.out |  2 +-
 tests/qemu-iotests/271.out | 12 ++--
 tests/qemu-iotests/287.out | 10 +-
 3 files changed, 12 insertions(+), 12 deletions(-)

diff --git a/tests/qemu-iotests/175.out b/tests/qemu-iotests/175.out
index 39c2ee0f62..40a5bd1ce6 100644
--- a/tests/qemu-iotests/175.out
+++ b/tests/qemu-iotests/175.out
@@ -23,4 +23,4 @@ size=4096, min allocation
 == resize empty image with block_resize ==
 Formatting 'TEST_DIR/t.IMGFMT', fmt=IMGFMT size=0
 size=1048576, min allocation
- *** done
+*** done
diff --git a/tests/qemu-iotests/271.out b/tests/qemu-iotests/271.out
index 92deb7ebb0..81043ba4d7 100644
--- a/tests/qemu-iotests/271.out
+++ b/tests/qemu-iotests/271.out
@@ -500,7 +500,7 @@ L2 entry #0: 0x80050001 0001
 
 Formatting 'TEST_DIR/t.IMGFMT', fmt=IMGFMT size=1048576
 L2 entry #1: 0x8006 0001
-qcow2: Marking image as corrupt: Invalid cluster entry found  (L2 offset: 0x4, L2 index: 0x1); further corruption events will be suppressed
+qcow2: Marking image as corrupt: Invalid cluster entry found (L2 offset: 
0x4, L2 index: 0x1); further corruption events will be suppressed
 write failed: Input/output error
 
 ### Corrupted L2 entries - write test (unallocated) ###
@@ -515,14 +515,14 @@ L2 entry #0: 0x8006 0001
 
 Formatting 'TEST_DIR/t.IMGFMT', fmt=IMGFMT size=1048576
 L2 entry #0: 0x 0001
-qcow2: Marking image as corrupt: Invalid cluster entry found  (L2 offset: 
0x4, L2 index: 0); further corruption events will be suppressed
+qcow2: Marking image as corrupt: Invalid cluster entry found (L2 offset: 
0x4, L2 index: 0); further corruption events will be suppressed
 write failed: Input/output error
 
 # Both 'subcluster is zero' and 'subcluster is allocated' bits set
 
 Formatting 'TEST_DIR/t.IMGFMT', fmt=IMGFMT size=1048576
 L2 entry #1: 0x 00010001
-qcow2: Marking image as corrupt: Invalid cluster entry found  (L2 offset: 
0x4, L2 index: 0x1); further corruption events will be suppressed
+qcow2: Marking image as corrupt: Invalid cluster entry found (L2 offset: 
0x4, L2 index: 0x1); further corruption events will be suppressed
 write failed: Input/output error
 
 ### Compressed cluster with subcluster bitmap != 0 - write test ###
@@ -583,7 +583,7 @@ read 524288/524288 bytes at offset 0
 read 524288/524288 bytes at offset 524288
 512 KiB, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
 Offset  Length  Mapped to   File
-0   0x80   TEST_DIR/t.qcow2.base
+0   0x8 0   TEST_DIR/t.qcow2.base
 # backing file and preallocation=falloc
 Formatting 'TEST_DIR/t.IMGFMT', fmt=IMGFMT size=524288 
backing_file=TEST_DIR/t.IMGFMT.base backing_fmt=raw preallocation=falloc
 Image resized.
@@ -592,7 +592,7 @@ read 524288/524288 bytes at offset 0
 read 524288/524288 bytes at offset 524288
 512 KiB, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
 Offset  Length  Mapped to   File
-0   0x80   TEST_DIR/t.qcow2.base
+0   0x8 0   TEST_DIR/t.qcow2.base
 # backing file and preallocation=full
 Formatting 'TEST_DIR/t.IMGFMT', fmt=IMGFMT size=524288 
backing_file=TEST_DIR/t.IMGFMT.base backing_fmt=raw preallocation=full
 Image resized.
@@ -601,7 +601,7 @@ read 524288/524288 bytes at offset 0
 read 524288/524288 bytes at offset 524288
 512 KiB, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
 Offset  Length  Mapped to   File
-0   0x80   TEST_DIR/t.qcow2.base
+0   0x8 0   TEST_DIR/t.qcow2.base
 
 ### Image resizing with preallocation and backing files ###
 
diff --git a/tests/qemu-iotests/287.out b/tests/qemu-iotests/287.out
index 6b9dfb4af0..49ab6a27d5 100644
--- a/tests/qemu-iotests/287.out
+++ b/tests/qemu-iotests/287.out
@@ -10,22 +10,22 @@ incompatible_features []
 Formatting 'TEST_DIR/t.IMGFMT', fmt=IMGFMT size=67108864
 incompatible_features [3]
 
-=== Testing zlib with incompatible bit set  ===
+=== Testing zlib with incompatible bit set ===
 
 Formatting 'TEST_DIR/t.IMGFMT', fmt=IMGFMT size=67108864
 incompatible_features [3]
 
-=== Testing zstd with incompatible bit unset  ===
+=== Testing zstd with incompatible bit unset ===
 
 Formatting 'TEST_DIR/t.IMGFMT', fmt=IMGFMT size=67108864
 incompatible_features []
 
-=== Testing compression type values  ===
+=== Testing compression type values ===
 
 Formatting 'TEST_DIR/t.IMGFMT', fmt=IMGFMT size=67108864
-   0
+0
 Formatting 'TEST_DIR/t.IMGFMT', fmt=IMGFMT size=67108864
-   1
+1
 
 === Testing simple reading and writing with zstd ===
 
-- 
2.21.3




[PATCH v5 04/12] iotests/283: make executable

2020-10-22 Thread Vladimir Sementsov-Ogievskiy
All other test files are executable, except for this one. Fix that.

Signed-off-by: Vladimir Sementsov-Ogievskiy 
Reviewed-by: Eric Blake 
Reviewed-by: Philippe Mathieu-Daudé 
---
 tests/qemu-iotests/283 | 0
 1 file changed, 0 insertions(+), 0 deletions(-)
 mode change 100644 => 100755 tests/qemu-iotests/283

diff --git a/tests/qemu-iotests/283 b/tests/qemu-iotests/283
old mode 100644
new mode 100755
-- 
2.21.3




[PATCH v5 00/12] Rework iotests/check

2020-10-22 Thread Vladimir Sementsov-Ogievskiy
Hi all!

This series has 3 goals:

 - get rid of group file
 - introduce human-readable names for tests
 - rewrite check into python

v5: [rebase on master]
02: new
03: updated
05,06: new
07: updated
09: rebased on master:
  vxhs removed
  qemu_prog path changed
  build_iotests path calculation changed
  handle mega2560 and gdbsim-r5f562n8 machines

Mark last patch as RFC: it's OK to postpone it for a while.
Also, patches 01-06 are simple cleanups and may be merged separately.

Vladimir Sementsov-Ogievskiy (12):
  iotests/277: use dot slash for nbd-fault-injector.py running
  iotests/303: use dot slash for qcow2.py running
  iotests: fix some whitespaces in test output files
  iotests/283: make executable
  iotests/299: make executable
  iotests/294: add shebang line
  iotests: define group in each iotest
  iotests: add findtests.py
  iotests: add testenv.py
  iotests: add testrunner.py
  iotests: rewrite check into python
  iotests: rename and move 169 and 199 tests

 docs/devel/testing.rst|  50 +-
 tests/qemu-iotests/001|   1 +
 tests/qemu-iotests/002|   1 +
 tests/qemu-iotests/003|   1 +
 tests/qemu-iotests/004|   1 +
 tests/qemu-iotests/005|   1 +
 tests/qemu-iotests/007|   1 +
 tests/qemu-iotests/008|   1 +
 tests/qemu-iotests/009|   1 +
 tests/qemu-iotests/010|   1 +
 tests/qemu-iotests/011|   1 +
 tests/qemu-iotests/012|   1 +
 tests/qemu-iotests/013|   1 +
 tests/qemu-iotests/014|   1 +
 tests/qemu-iotests/015|   1 +
 tests/qemu-iotests/017|   1 +
 tests/qemu-iotests/018|   1 +
 tests/qemu-iotests/019|   1 +
 tests/qemu-iotests/020|   1 +
 tests/qemu-iotests/021|   1 +
 tests/qemu-iotests/022|   1 +
 tests/qemu-iotests/023|   1 +
 tests/qemu-iotests/024|   1 +
 tests/qemu-iotests/025|   1 +
 tests/qemu-iotests/026|   1 +
 tests/qemu-iotests/027|   1 +
 tests/qemu-iotests/028|   1 +
 tests/qemu-iotests/029|   1 +
 tests/qemu-iotests/030|   1 +
 tests/qemu-iotests/031|   1 +
 tests/qemu-iotests/032|   1 +
 tests/qemu-iotests/033|   1 +
 tests/qemu-iotests/034|   1 +
 tests/qemu-iotests/035|   1 +
 tests/qemu-iotests/036|   1 +
 tests/qemu-iotests/037|   1 +
 tests/qemu-iotests/038|   1 +
 tests/qemu-iotests/039|   1 +
 tests/qemu-iotests/040|   1 +
 tests/qemu-iotests/041|   1 +
 tests/qemu-iotests/042|   1 +
 tests/qemu-iotests/043|   1 +
 tests/qemu-iotests/044|   1 +
 tests/qemu-iotests/045|   1 +
 tests/qemu-iotests/046|   1 +
 tests/qemu-iotests/047|   1 +
 tests/qemu-iotests/048|   1 +
 tests/qemu-iotests/049|   1 +
 tests/qemu-iotests/050|   1 +
 tests/qemu-iotests/051|   1 +
 tests/qemu-iotests/052|   1 +
 tests/qemu-iotests/053|   1 +
 tests/qemu-iotests/054|   1 +
 tests/qemu-iotests/055|   1 +
 tests/qemu-iotests/056|   1 +
 tests/qemu-iotests/057|   1 +
 tests/qemu-iotests/058|   1 +
 tests/qemu-iotests/059|   1 +
 tests/qemu-iotests/060|   1 +
 tests/qemu-iotests/061|   1 +
 tests/qemu-iotests/062|   1 +
 tests/qemu-iotests/063|   1 +
 tests/qemu-iotests/064|   1 +
 tests/qemu-iotests/065|   1 +
 tests/qemu-iotests/066|   1 +
 tests/qemu-iotests/068|   1 +
 tests/qemu-iotests/069|   1 +
 tests/qemu-iotests/070|   1 +
 tests/qemu-iotests/071|   1 +
 tests/qemu-iotests/072|   1 +
 tests/qemu-iotests/073|   1 +
 tests/qemu-iotests/074|   1 +
 tests/qemu-iotests/075|   1 

[PATCH v5 1/2] hw/block/nvme: add dulbe support

2020-10-22 Thread Klaus Jensen
From: Klaus Jensen 

Add support for reporting the Deallocated or Unwritten Logical Block
Error (DULBE).

Rely on the block status flags reported by the block layer and consider
any block with the BDRV_BLOCK_ZERO flag to be deallocated.

Multiple factors affect whether a Write Zeroes command results in
deallocation of blocks.

  * the underlying file system block size
  * the blockdev format
  * the 'discard' and 'logical_block_size' parameters

 format | discard | wz (512 B)  wz (4 KiB)  wz (64 KiB)
---
  qcow2ignore   n  n y
  qcow2unmapn  n y
  raw  ignore   n  y y
  raw  unmapn  y y

So, this works best with an image in raw format and 4 KiB LBAs, since
holes can then be punched on a per-block basis (this assumes a file
system with a 4 KiB block size, YMMV). A qcow2 image uses a cluster size
of 64 KiB by default, and blocks will only be marked deallocated if a
full cluster is zeroed or discarded. However, this *is* consistent with
the spec since Write Zeroes "should" deallocate the block if the
Deallocate attribute is set and "may" deallocate if the Deallocate
attribute is not set. Thus, we always try to deallocate (the
BDRV_REQ_MAY_UNMAP flag is always set).
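The table above follows from a simple alignment model: a zeroed LBA can only be deallocated if it covers whole units of the underlying deallocation granularity (the file-system block for raw, the cluster for qcow2). A hypothetical Python sketch of that model, not QEMU code; the 4 KiB file-system block size is an assumption:

```python
def write_zeroes_deallocates(lba_size: int, unit_size: int) -> bool:
    """A per-LBA Write Zeroes can punch a hole only if the LBA covers
    whole deallocation units (fs block for raw, cluster for qcow2).
    Purely illustrative; real behavior depends on the file system and
    block driver."""
    return lba_size >= unit_size and lba_size % unit_size == 0

FS_BLOCK = 4096            # assumed host file-system block size
QCOW2_CLUSTER = 64 * 1024  # qcow2 default cluster size

# raw: per-LBA hole punching works from 4 KiB LBAs upward
print(write_zeroes_deallocates(512, FS_BLOCK))
print(write_zeroes_deallocates(4096, FS_BLOCK))
# qcow2: only a full 64 KiB cluster can be deallocated
print(write_zeroes_deallocates(4096, QCOW2_CLUSTER))
print(write_zeroes_deallocates(64 * 1024, QCOW2_CLUSTER))
```

The four prints (False, True, False, True) match the "wz" columns of the table for raw and qcow2.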

Signed-off-by: Klaus Jensen 
Reviewed-by: Keith Busch 
---
 hw/block/nvme-ns.h|  4 +++
 include/block/nvme.h  |  5 +++
 hw/block/nvme-ns.c|  8 +++--
 hw/block/nvme.c   | 83 +--
 hw/block/trace-events |  4 +++
 5 files changed, 99 insertions(+), 5 deletions(-)

diff --git a/hw/block/nvme-ns.h b/hw/block/nvme-ns.h
index 83734f4606e1..44bf6271b744 100644
--- a/hw/block/nvme-ns.h
+++ b/hw/block/nvme-ns.h
@@ -31,6 +31,10 @@ typedef struct NvmeNamespace {
 NvmeIdNs id_ns;
 
 NvmeNamespaceParams params;
+
+struct {
+uint32_t err_rec;
+} features;
 } NvmeNamespace;
 
 static inline uint32_t nvme_nsid(NvmeNamespace *ns)
diff --git a/include/block/nvme.h b/include/block/nvme.h
index 8a46d9cf015f..966c3bb304bd 100644
--- a/include/block/nvme.h
+++ b/include/block/nvme.h
@@ -687,6 +687,7 @@ enum NvmeStatusCodes {
 NVME_E2E_REF_ERROR  = 0x0284,
 NVME_CMP_FAILURE= 0x0285,
 NVME_ACCESS_DENIED  = 0x0286,
+NVME_DULB   = 0x0287,
 NVME_MORE   = 0x2000,
 NVME_DNR= 0x4000,
 NVME_NO_COMPLETE= 0xffff,
@@ -903,6 +904,9 @@ enum NvmeIdCtrlLpa {
 #define NVME_AEC_NS_ATTR(aec)   ((aec >> 8) & 0x1)
 #define NVME_AEC_FW_ACTIVATION(aec) ((aec >> 9) & 0x1)
 
+#define NVME_ERR_REC_TLER(err_rec)  (err_rec & 0xffff)
+#define NVME_ERR_REC_DULBE(err_rec) (err_rec & 0x10000)
+
 enum NvmeFeatureIds {
 NVME_ARBITRATION= 0x1,
 NVME_POWER_MANAGEMENT   = 0x2,
@@ -1023,6 +1027,7 @@ enum NvmeNsIdentifierType {
 
 
 #define NVME_ID_NS_NSFEAT_THIN(nsfeat)  ((nsfeat & 0x1))
+#define NVME_ID_NS_NSFEAT_DULBE(nsfeat) ((nsfeat >> 2) & 0x1)
 #define NVME_ID_NS_FLBAS_EXTENDED(flbas)((flbas >> 4) & 0x1)
 #define NVME_ID_NS_FLBAS_INDEX(flbas)   ((flbas & 0xf))
 #define NVME_ID_NS_MC_SEPARATE(mc)  ((mc >> 1) & 0x1)
diff --git a/hw/block/nvme-ns.c b/hw/block/nvme-ns.c
index 31c80cdf5b5f..f1cc734c60f5 100644
--- a/hw/block/nvme-ns.c
+++ b/hw/block/nvme-ns.c
@@ -33,9 +33,7 @@ static void nvme_ns_init(NvmeNamespace *ns)
 NvmeIdNs *id_ns = &ns->id_ns;
 int lba_index = NVME_ID_NS_FLBAS_INDEX(ns->id_ns.flbas);
 
-if (blk_get_flags(ns->blkconf.blk) & BDRV_O_UNMAP) {
-ns->id_ns.dlfeat = 0x9;
-}
+ns->id_ns.dlfeat = 0x9;
 
 id_ns->lbaf[lba_index].ds = 31 - clz32(ns->blkconf.logical_block_size);
 
@@ -44,6 +42,9 @@ static void nvme_ns_init(NvmeNamespace *ns)
 /* no thin provisioning */
 id_ns->ncap = id_ns->nsze;
 id_ns->nuse = id_ns->ncap;
+
+/* support DULBE */
+id_ns->nsfeat |= 0x4;
 }
 
 static int nvme_ns_init_blk(NvmeCtrl *n, NvmeNamespace *ns, Error **errp)
@@ -92,6 +93,7 @@ int nvme_ns_setup(NvmeCtrl *n, NvmeNamespace *ns, Error 
**errp)
 }
 
 nvme_ns_init(ns);
+
 if (nvme_register_namespace(n, ns, errp)) {
 return -1;
 }
diff --git a/hw/block/nvme.c b/hw/block/nvme.c
index fa2cba744b57..4ab0705f5a92 100644
--- a/hw/block/nvme.c
+++ b/hw/block/nvme.c
@@ -105,6 +105,7 @@ static const bool nvme_feature_support[NVME_FID_MAX] = {
 
 static const uint32_t nvme_feature_cap[NVME_FID_MAX] = {
 [NVME_TEMPERATURE_THRESHOLD]= NVME_FEAT_CAP_CHANGE,
+[NVME_ERROR_RECOVERY]   = NVME_FEAT_CAP_CHANGE | NVME_FEAT_CAP_NS,
 [NVME_VOLATILE_WRITE_CACHE] = NVME_FEAT_CAP_CHANGE,
 [NVME_NUMBER_OF_QUEUES] = NVME_FEAT_CAP_CHANGE,
 [NVME_ASYNCHRONOUS_EVENT_CONF]  = NVME_FEAT_CAP_CHANGE,
@@ -878,6 +879,41 @@ static inline uint16_t nvme_check_bounds(NvmeCtrl *n, 
NvmeNamespace *ns,
 return NVME_SUCCESS;
 }
 
+static 

[PATCH v5 0/2] hw/block/nvme: dulbe and dsm support

2020-10-22 Thread Klaus Jensen
From: Klaus Jensen 

This adds support for the Deallocated or Unwritten Logical Block error
recovery feature as well as the Dataset Management command.

I wanted to add support for the NPDG and NPDA fields such that the host
could get a hint on how many blocks to request deallocation of for the
deallocation to actually happen, but I cannot find a reliable way to
get the actual block size of the underlying device. If it is an image on
a file system we could typically use the host page size, but if it is a
raw device, we might have 512 byte sectors that we can issue discards
on. And QEMU doesn't seem to provide this without root privileges at
least.

See the two patches for some gotchas.

I also integrated this into my zoned proposal. I'll spare you the v4, nobody
cares anyway. But I put it in my repo[1] for posterity.

  [1]: https://irrelevant.dk/g/pci-nvme.git/tag/?h=zoned-v4.

v5:
  - Restore status code from callback (Keith)

v4:
  - Removed mixed declaration and code (Keith)
  - Set NPDG and NPDA and account for the blockdev cluster size.

Klaus Jensen (2):
  hw/block/nvme: add dulbe support
  hw/block/nvme: add the dataset management command

 hw/block/nvme-ns.h|   4 +
 hw/block/nvme.h   |   2 +
 include/block/nvme.h  |  12 ++-
 hw/block/nvme-ns.c|  40 +++--
 hw/block/nvme.c   | 184 +-
 hw/block/trace-events |   4 +
 6 files changed, 237 insertions(+), 9 deletions(-)

-- 
2.28.0




[PATCH v5 2/2] hw/block/nvme: add the dataset management command

2020-10-22 Thread Klaus Jensen
From: Klaus Jensen 

Add support for the Dataset Management command and the Deallocate
attribute. Deallocation results in discards being sent to the underlying
block device. Whether or not the blocks are actually deallocated is
affected by the same factors as Write Zeroes (see previous commit).

 format | discard | dsm (512 B)  dsm (4 KiB)  dsm (64 KiB)
--
  qcow2ignore   n   n  n
  qcow2unmapn   n  y
  raw  ignore   n   n  n
  raw  unmapn   y  y

Again, a raw format and 4 KiB LBAs are preferable.

In order to set the Namespace Preferred Deallocate Granularity and
Alignment fields (NPDG and NPDA), choose a sane minimum discard
granularity of 4 KiB. If we are using a passthru device supporting discard
at a 512 B granularity, the user should set the discard_granularity property
explicitly. NPDG and NPDA will also account for the cluster_size of the
block driver if required (i.e. for QCOW2).

See NVM Express 1.3d, Section 6.7 ("Dataset Management command").
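The NPDG/NPDA computation described above can be sketched in Python, mirroring the logic added to nvme_ns_init() in the diff below (illustrative only; helper and parameter names are mine):

```python
KiB = 1024

def npdg_field(logical_block_size, discard_granularity, cluster_size=0):
    """Mirror of the NPDG/NPDA calculation in nvme_ns_init(): prefer
    the discard granularity, but widen to the format driver's cluster
    size (e.g. qcow2) when that is larger.  The returned field is
    0's based, hence the '- 1'."""
    npdg = discard_granularity // logical_block_size
    if cluster_size and cluster_size > discard_granularity:
        npdg = cluster_size // logical_block_size
    return npdg - 1

# raw image, 512 B LBAs, default 4 KiB discard granularity: 8 LBAs -> 7
print(npdg_field(512, 4 * KiB))
# qcow2, 4 KiB LBAs, 64 KiB clusters: 16 LBAs -> 15
print(npdg_field(4 * KiB, 4 * KiB, cluster_size=64 * KiB))
```

So a host honoring NPDG would issue discards in multiples of 8 LBAs (raw, 512 B) or 16 LBAs (qcow2, 4 KiB), respectively.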

Signed-off-by: Klaus Jensen 
---
 hw/block/nvme.h  |   2 +
 include/block/nvme.h |   7 ++-
 hw/block/nvme-ns.c   |  36 +--
 hw/block/nvme.c  | 101 ++-
 4 files changed, 140 insertions(+), 6 deletions(-)

diff --git a/hw/block/nvme.h b/hw/block/nvme.h
index e080a2318a50..574333caa3f9 100644
--- a/hw/block/nvme.h
+++ b/hw/block/nvme.h
@@ -28,6 +28,7 @@ typedef struct NvmeRequest {
 struct NvmeNamespace*ns;
 BlockAIOCB  *aiocb;
 uint16_tstatus;
+void*opaque;
 NvmeCqe cqe;
 NvmeCmd cmd;
 BlockAcctCookie acct;
@@ -60,6 +61,7 @@ static inline const char *nvme_io_opc_str(uint8_t opc)
 case NVME_CMD_WRITE:return "NVME_NVM_CMD_WRITE";
 case NVME_CMD_READ: return "NVME_NVM_CMD_READ";
 case NVME_CMD_WRITE_ZEROES: return "NVME_NVM_CMD_WRITE_ZEROES";
+case NVME_CMD_DSM:  return "NVME_NVM_CMD_DSM";
 default:return "NVME_NVM_CMD_UNKNOWN";
 }
 }
diff --git a/include/block/nvme.h b/include/block/nvme.h
index 966c3bb304bd..e95ff6ca9b37 100644
--- a/include/block/nvme.h
+++ b/include/block/nvme.h
@@ -990,7 +990,12 @@ typedef struct QEMU_PACKED NvmeIdNs {
 uint16_tnabspf;
 uint16_tnoiob;
 uint8_t nvmcap[16];
-uint8_t rsvd64[40];
+uint16_tnpwg;
+uint16_tnpwa;
+uint16_tnpdg;
+uint16_tnpda;
+uint16_tnows;
+uint8_t rsvd74[30];
 uint8_t nguid[16];
 uint64_teui64;
 NvmeLBAFlbaf[16];
diff --git a/hw/block/nvme-ns.c b/hw/block/nvme-ns.c
index f1cc734c60f5..840651db7256 100644
--- a/hw/block/nvme-ns.c
+++ b/hw/block/nvme-ns.c
@@ -28,10 +28,14 @@
 #include "nvme.h"
 #include "nvme-ns.h"
 
-static void nvme_ns_init(NvmeNamespace *ns)
+#define MIN_DISCARD_GRANULARITY (4 * KiB)
+
+static int nvme_ns_init(NvmeNamespace *ns, Error **errp)
 {
+BlockDriverInfo bdi;
 NvmeIdNs *id_ns = &ns->id_ns;
 int lba_index = NVME_ID_NS_FLBAS_INDEX(ns->id_ns.flbas);
+int npdg, ret;
 
 ns->id_ns.dlfeat = 0x9;
 
@@ -43,8 +47,25 @@ static void nvme_ns_init(NvmeNamespace *ns)
 id_ns->ncap = id_ns->nsze;
 id_ns->nuse = id_ns->ncap;
 
-/* support DULBE */
-id_ns->nsfeat |= 0x4;
+/* support DULBE and I/O optimization fields */
+id_ns->nsfeat |= (0x4 | 0x10);
+
+npdg = ns->blkconf.discard_granularity / ns->blkconf.logical_block_size;
+
+ret = bdrv_get_info(blk_bs(ns->blkconf.blk), &bdi);
+if (ret < 0) {
+error_setg_errno(errp, -ret, "could not get block driver info");
+return ret;
+}
+
+if (bdi.cluster_size &&
+bdi.cluster_size > ns->blkconf.discard_granularity) {
+npdg = bdi.cluster_size / ns->blkconf.logical_block_size;
+}
+
+id_ns->npda = id_ns->npdg = npdg - 1;
+
+return 0;
 }
 
 static int nvme_ns_init_blk(NvmeCtrl *n, NvmeNamespace *ns, Error **errp)
@@ -59,6 +80,11 @@ static int nvme_ns_init_blk(NvmeCtrl *n, NvmeNamespace *ns, 
Error **errp)
 return -1;
 }
 
+if (ns->blkconf.discard_granularity == -1) {
+ns->blkconf.discard_granularity =
+MAX(ns->blkconf.logical_block_size, MIN_DISCARD_GRANULARITY);
+}
+
 ns->size = blk_getlength(ns->blkconf.blk);
 if (ns->size < 0) {
 error_setg_errno(errp, -ns->size, "could not get blockdev size");
@@ -92,7 +118,9 @@ int nvme_ns_setup(NvmeCtrl *n, NvmeNamespace *ns, Error 
**errp)
 return -1;
 }
 
-nvme_ns_init(ns);
+if (nvme_ns_init(ns, errp)) {
+return -1;
+}
 
 if (nvme_register_namespace(n, ns, errp)) {
 return -1;
diff --git a/hw/block/nvme.c b/hw/block/nvme.c
index 4ab0705f5a92..7acb9e9dc38a 100644
--- a/hw/block/nvme.c
+++ b/hw/block/nvme.c
@@ -959,6 

Re: [PATCH v4 2/2] hw/block/nvme: add the dataset management command

2020-10-22 Thread Klaus Jensen
On Oct 22 10:50, Keith Busch wrote:
> On Thu, Oct 22, 2020 at 07:43:33PM +0200, Klaus Jensen wrote:
> > On Oct 22 08:01, Keith Busch wrote:
> > > On Thu, Oct 22, 2020 at 09:33:13AM +0200, Klaus Jensen wrote:
> > > > +if (--(*discards)) {
> > > > +status = NVME_NO_COMPLETE;
> > > > +} else {
> > > > +g_free(discards);
> > > > +req->opaque = NULL;
> > > 
> > > This case needs a
> > > 
> > > status = req->status;
> > > 
> > > So that we get the error set in the callback.
> > > 
> > 
> > There are no cases that result in a non-zero status code here.
> 
> Your callback has a case that sets NVME_INTERNAL_DEV_ERROR status. That
> would get ignored if the final discard reference is dropped from the
> submission side.
> 

Oh. Crap. You are right. Nice catch!

> +static void nvme_aio_discard_cb(void *opaque, int ret)
> +{
> +NvmeRequest *req = opaque;
> +int *discards = req->opaque;
> +
> +trace_pci_nvme_aio_discard_cb(nvme_cid(req));
> +
> +if (ret) {
> +req->status = NVME_INTERNAL_DEV_ERROR;
> +trace_pci_nvme_err_aio(nvme_cid(req), strerror(ret),
> +   req->status);
> +}
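The pattern under discussion generalizes: when one request fans out into several AIOs sharing a reference count, whichever side drops the last reference must return the status recorded by the callbacks rather than an unconditional success. A hypothetical Python model (not QEMU code; status values are assumed for illustration):

```python
NVME_SUCCESS = 0x0
NVME_NO_COMPLETE = 0xffff
NVME_INTERNAL_DEV_ERROR = 0x6

class Req:
    """One NVMe request fanned out into several discard AIOs."""
    def __init__(self, outstanding):
        self.status = NVME_SUCCESS
        self.outstanding = outstanding

def aio_discard_cb(req, ret):
    """One discard completed: record any error, and only complete the
    request (returning the *recorded* status) on the last reference."""
    if ret != 0:
        req.status = NVME_INTERNAL_DEV_ERROR
    req.outstanding -= 1
    if req.outstanding:
        return NVME_NO_COMPLETE      # others still in flight
    return req.status                # not an unconditional SUCCESS

req = Req(outstanding=2)
print(hex(aio_discard_cb(req, -5)))  # first discard fails
print(hex(aio_discard_cb(req, 0)))   # last completes: error surfaces
```

Returning `req.status` on the final decrement is exactly the missed case: with a plain `NVME_SUCCESS` there, the error recorded by the failing discard would be silently dropped.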




[PATCH v12 12/14] copy-on-read: skip non-guest reads if no copy needed

2020-10-22 Thread Andrey Shinkevich via
If the BDRV_REQ_PREFETCH flag is set, skip the idle read/write
operations in the COR driver. This can be taken into account when
optimizing the COR algorithms. At the moment, that check is made
during the block-stream job.
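The flag condition being discussed can be modeled in a few lines of Python; BDRV_REQ_PREFETCH's 0x200 matches the enum shown in patch 09/14, while the BDRV_REQ_COPY_ON_READ value is assumed for illustration:

```python
BDRV_REQ_COPY_ON_READ = 0x1  # assumed value, for illustration only
BDRV_REQ_PREFETCH = 0x200    # matches the block.h enum in patch 09/14

def cor_must_read(flags):
    """The read towards bs->file is skipped only when PREFETCH is set
    without COPY_ON_READ, i.e. neither data nor a copy is needed."""
    return (flags & (BDRV_REQ_PREFETCH | BDRV_REQ_COPY_ON_READ)) \
        != BDRV_REQ_PREFETCH

print(cor_must_read(BDRV_REQ_PREFETCH))                          # skip
print(cor_must_read(BDRV_REQ_PREFETCH | BDRV_REQ_COPY_ON_READ))  # read
print(cor_must_read(0))                                          # read
```

Only the pure-prefetch case (first call, False) skips the read; a prefetch that still needs the copy, or a plain guest read, proceeds as before.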

Signed-off-by: Andrey Shinkevich 
---
 block/copy-on-read.c | 12 
 1 file changed, 8 insertions(+), 4 deletions(-)

diff --git a/block/copy-on-read.c b/block/copy-on-read.c
index a2b180a..081e661 100644
--- a/block/copy-on-read.c
+++ b/block/copy-on-read.c
@@ -153,10 +153,14 @@ static int coroutine_fn 
cor_co_preadv_part(BlockDriverState *bs,
 }
 }
 
-ret = bdrv_co_preadv_part(bs->file, offset, n, qiov, qiov_offset,
-  local_flags);
-if (ret < 0) {
-return ret;
+/* Skip if neither read nor write are needed */
+if ((local_flags & (BDRV_REQ_PREFETCH | BDRV_REQ_COPY_ON_READ)) !=
+BDRV_REQ_PREFETCH) {
+ret = bdrv_co_preadv_part(bs->file, offset, n, qiov, qiov_offset,
+  local_flags);
+if (ret < 0) {
+return ret;
+}
 }
 
 offset += n;
-- 
1.8.3.1




[PATCH v12 06/14] copy-on-read: pass bottom node name to COR driver

2020-10-22 Thread Andrey Shinkevich via
We are going to use the COR-filter for a block-stream job.
To limit COR operations to the base node in the backing chain during a
stream job, pass the bottom node name (that is, the first non-filter
overlay of the base) to the copy-on-read driver, as the base node
itself may change due to concurrent jobs.
The rest of the functionality will be implemented in the patch that
follows.
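The notion of "first non-filter overlay of the base" can be illustrated with a toy chain walk; the Node class is a minimal stand-in for a BlockDriverState, not QEMU code:

```python
class Node:
    """Minimal stand-in for a BlockDriverState: a name, a backing
    link, and whether the node is a filter (illustrative only)."""
    def __init__(self, name, backing=None, is_filter=False):
        self.name = name
        self.backing = backing
        self.is_filter = is_filter

def bottom_node(base, top):
    """Walk the chain from 'top' towards 'base' and return the last
    non-filter node seen before reaching 'base', i.e. the first
    non-filter overlay of the base."""
    found = None
    node = top
    while node is not None and node is not base:
        if not node.is_filter:
            found = node
        node = node.backing
    return found

# chain: top -> mid -> filter -> base
base = Node("base")
flt = Node("filter", backing=base, is_filter=True)
mid = Node("mid", backing=flt)
top = Node("top", backing=mid)
print(bottom_node(base, top).name)
```

Here "mid" is the bottom node: even if a concurrent job later replaces or drops "base", the COR limit expressed via "mid" stays valid.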

Signed-off-by: Andrey Shinkevich 
---
 block/copy-on-read.c | 16 
 1 file changed, 16 insertions(+)

diff --git a/block/copy-on-read.c b/block/copy-on-read.c
index 618c4c4..3d8e4db 100644
--- a/block/copy-on-read.c
+++ b/block/copy-on-read.c
@@ -24,18 +24,24 @@
 #include "block/block_int.h"
 #include "qemu/module.h"
 #include "qapi/error.h"
+#include "qapi/qmp/qerror.h"
+#include "qapi/qmp/qdict.h"
 #include "block/copy-on-read.h"
 
 
 typedef struct BDRVStateCOR {
 bool active;
+BlockDriverState *bottom_bs;
 } BDRVStateCOR;
 
 
 static int cor_open(BlockDriverState *bs, QDict *options, int flags,
 Error **errp)
 {
+BlockDriverState *bottom_bs = NULL;
 BDRVStateCOR *state = bs->opaque;
+/* Find a bottom node name, if any */
+const char *bottom_node = qdict_get_try_str(options, "bottom");
 
 bs->file = bdrv_open_child(NULL, options, "file", bs, &child_of_bds,
BDRV_CHILD_FILTERED | BDRV_CHILD_PRIMARY,
@@ -51,7 +57,17 @@ static int cor_open(BlockDriverState *bs, QDict *options, 
int flags,
 ((BDRV_REQ_FUA | BDRV_REQ_MAY_UNMAP | BDRV_REQ_NO_FALLBACK) &
 bs->file->bs->supported_zero_flags);
 
+if (bottom_node) {
+bottom_bs = bdrv_lookup_bs(NULL, bottom_node, errp);
+if (!bottom_bs) {
+error_setg(errp, QERR_BASE_NOT_FOUND, bottom_node);
+qdict_del(options, "bottom");
+return -EINVAL;
+}
+qdict_del(options, "bottom");
+}
 state->active = true;
+state->bottom_bs = bottom_bs;
 
 /*
  * We don't need to call bdrv_child_refresh_perms() now as the permissions
-- 
1.8.3.1




[PATCH v12 09/14] block: modify the comment for BDRV_REQ_PREFETCH flag

2020-10-22 Thread Andrey Shinkevich via
Modify the comment for the flag BDRV_REQ_PREFETCH as we are going to
use it alone and pass it to the COR-filter driver for further
processing.

Signed-off-by: Andrey Shinkevich 
---
 include/block/block.h | 8 +---
 1 file changed, 5 insertions(+), 3 deletions(-)

diff --git a/include/block/block.h b/include/block/block.h
index ae7612f..1b6742f 100644
--- a/include/block/block.h
+++ b/include/block/block.h
@@ -81,9 +81,11 @@ typedef enum {
 BDRV_REQ_NO_FALLBACK= 0x100,
 
 /*
- * BDRV_REQ_PREFETCH may be used only together with BDRV_REQ_COPY_ON_READ
- * on read request and means that caller doesn't really need data to be
- * written to qiov parameter which may be NULL.
+ * BDRV_REQ_PREFETCH makes sense only in the context of copy-on-read
+ * (i.e., together with the BDRV_REQ_COPY_ON_READ flag or when a COR
+ * filter is involved), in which case it signals that the COR operation
+ * need not read the data into memory (qiov) but only ensure they are
+ * copied to the top layer (i.e., that COR operation is done).
  */
 BDRV_REQ_PREFETCH  = 0x200,
 /* Mask of valid flags */
-- 
1.8.3.1




[PATCH v12 13/14] stream: skip filters when writing backing file name to QCOW2 header

2020-10-22 Thread Andrey Shinkevich via
Avoid writing a filter JSON file name and a filter format name to the
QCOW2 image when the backing file is changed after the block-stream
job. A user is still able to set the 'backing-file' parameter for a
block-stream job, keeping in mind the possible issue mentioned above.
If the user does not specify the 'backing-file' parameter, QEMU will
assign it automatically.

Signed-off-by: Andrey Shinkevich 
---
 block/stream.c | 15 +--
 blockdev.c |  9 ++---
 2 files changed, 15 insertions(+), 9 deletions(-)

diff --git a/block/stream.c b/block/stream.c
index e0540ee..1ba74ab 100644
--- a/block/stream.c
+++ b/block/stream.c
@@ -65,6 +65,7 @@ static int stream_prepare(Job *job)
 BlockDriverState *bs = blk_bs(bjob->blk);
 BlockDriverState *unfiltered_bs = bdrv_skip_filters(bs);
 BlockDriverState *base = bdrv_filter_or_cow_bs(s->above_base);
+BlockDriverState *base_unfiltered = NULL;
 Error *local_err = NULL;
 int ret = 0;
 
@@ -75,8 +76,18 @@ static int stream_prepare(Job *job)
 const char *base_id = NULL, *base_fmt = NULL;
 if (base) {
 base_id = s->backing_file_str;
-if (base->drv) {
-base_fmt = base->drv->format_name;
+if (base_id) {
+if (base->drv) {
+base_fmt = base->drv->format_name;
+}
+} else {
+base_unfiltered = bdrv_skip_filters(base);
+if (base_unfiltered) {
+base_id = base_unfiltered->filename;
+if (base_unfiltered->drv) {
+base_fmt = base_unfiltered->drv->format_name;
+}
+}
 }
 }
 bdrv_set_backing_hd(unfiltered_bs, base, &local_err);
diff --git a/blockdev.c b/blockdev.c
index c917625..0e9c783 100644
--- a/blockdev.c
+++ b/blockdev.c
@@ -2508,7 +2508,6 @@ void qmp_block_stream(bool has_job_id, const char 
*job_id, const char *device,
 BlockDriverState *base_bs = NULL;
 AioContext *aio_context;
 Error *local_err = NULL;
-const char *base_name = NULL;
 int job_flags = JOB_DEFAULT;
 
 if (!has_on_error) {
@@ -2536,7 +2535,6 @@ void qmp_block_stream(bool has_job_id, const char 
*job_id, const char *device,
 goto out;
 }
 assert(bdrv_get_aio_context(base_bs) == aio_context);
-base_name = base;
 }
 
 if (has_base_node) {
@@ -2551,7 +2549,6 @@ void qmp_block_stream(bool has_job_id, const char 
*job_id, const char *device,
 }
 assert(bdrv_get_aio_context(base_bs) == aio_context);
 bdrv_refresh_filename(base_bs);
-base_name = base_bs->filename;
 }
 
 /* Check for op blockers in the whole chain between bs and base */
@@ -2571,9 +2568,6 @@ void qmp_block_stream(bool has_job_id, const char 
*job_id, const char *device,
 goto out;
 }
 
-/* backing_file string overrides base bs filename */
-base_name = has_backing_file ? backing_file : base_name;
-
 if (has_auto_finalize && !auto_finalize) {
 job_flags |= JOB_MANUAL_FINALIZE;
 }
@@ -2581,7 +2575,8 @@ void qmp_block_stream(bool has_job_id, const char 
*job_id, const char *device,
 job_flags |= JOB_MANUAL_DISMISS;
 }
 
-stream_start(has_job_id ? job_id : NULL, bs, base_bs, base_name,
+stream_start(has_job_id ? job_id : NULL, bs, base_bs,
+ has_backing_file ? backing_file : NULL,
  job_flags, has_speed ? speed : 0, on_error,
  filter_node_name, _err);
 if (local_err) {
-- 
1.8.3.1




[PATCH v12 00/14] Apply COR-filter to the block-stream permanently

2020-10-22 Thread Andrey Shinkevich via
The node insert/remove functions were added at the block generic layer.
COR-filter options structure was added to the QAPI.
The test case #310 was added to check the 'bottom' node limit for COR.
The 'supported_read_flags' member was added to the BDS structure
(with the flags check at the block generic layer for drivers).

v12:
  02: New.
  03: Only the temporary drop filter function left.
  05: New (suggested by Max)
  06: 'base' -> 'bottom' option.
  07: Fixes based on the review of the v11.
  08: New.
  09: The comment ext was modified.
  10: The read flags check at the block generic layer.
  11: COR flag was added.
  12: The condition was fixed.
  13: The 'backing-file' parameter returned. No deprecation.
  14: The COR-filter 'add' function replaced with the 'insert node' generic
  function. Fixes based on the review of the v11.

Andrey Shinkevich (14):
  copy-on-read: support preadv/pwritev_part functions
  block: add insert/remove node functions
  copy-on-read: add filter drop function
  qapi: add filter-node-name to block-stream
  qapi: create BlockdevOptionsCor structure for COR driver
  copy-on-read: pass bottom node name to COR driver
  copy-on-read: limit COR operations to bottom node
  iotests: add #310 to test bottom node in COR driver
  block: modify the comment for BDRV_REQ_PREFETCH flag
  block: include supported_read_flags into BDS structure
  copy-on-read: add support for read flags to COR-filter
  copy-on-read: skip non-guest reads if no copy needed
  stream: skip filters when writing backing file name to QCOW2 header
  block: apply COR-filter to block-stream jobs

 block.c|  49 ++
 block/copy-on-read.c   | 144 +
 block/copy-on-read.h   |  32 +
 block/io.c |  12 +++-
 block/monitor/block-hmp-cmds.c |   4 +-
 block/stream.c | 117 ++---
 blockdev.c |  13 ++--
 include/block/block.h  |  11 +++-
 include/block/block_int.h  |  11 +++-
 qapi/block-core.json   |  27 +++-
 tests/qemu-iotests/030 |  51 ++-
 tests/qemu-iotests/030.out |   4 +-
 tests/qemu-iotests/141.out |   2 +-
 tests/qemu-iotests/245 |  22 +--
 tests/qemu-iotests/310 | 109 +++
 tests/qemu-iotests/310.out |  15 +
 tests/qemu-iotests/group   |   3 +-
 17 files changed, 503 insertions(+), 123 deletions(-)
 create mode 100644 block/copy-on-read.h
 create mode 100755 tests/qemu-iotests/310
 create mode 100644 tests/qemu-iotests/310.out

-- 
1.8.3.1




[PATCH v12 11/14] copy-on-read: add support for read flags to COR-filter

2020-10-22 Thread Andrey Shinkevich via
Add the BDRV_REQ_COPY_ON_READ and BDRV_REQ_PREFETCH flags to the
supported_read_flags of the COR-filter.

Signed-off-by: Andrey Shinkevich 
---
 block/copy-on-read.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/block/copy-on-read.c b/block/copy-on-read.c
index 8178a91..a2b180a 100644
--- a/block/copy-on-read.c
+++ b/block/copy-on-read.c
@@ -50,6 +50,8 @@ static int cor_open(BlockDriverState *bs, QDict *options, int 
flags,
 return -EINVAL;
 }
 
+bs->supported_read_flags = BDRV_REQ_COPY_ON_READ | BDRV_REQ_PREFETCH;
+
 bs->supported_write_flags = BDRV_REQ_WRITE_UNCHANGED |
 (BDRV_REQ_FUA & bs->file->bs->supported_write_flags);
 
-- 
1.8.3.1




[PATCH v12 14/14] block: apply COR-filter to block-stream jobs

2020-10-22 Thread Andrey Shinkevich via
This patch completes the series with the COR-filter insertion for
block-stream operations. Adding the filter makes it possible for
copied regions to be discarded in backing files during the
block-stream job, which reduces disk usage.
The COR-filter insertion incurs changes in iotests case
245:test_block_stream_4, which reopens the backing chain during a
block-stream job. There are changes in iotests case 030 as well.
The iotests case 030:test_stream_parallel was deleted due to multiple
conflicts between concurrent job operations over the same backing
chain. The base backing node for one job is the top node for another
job; it may change due to the filter node inserted into the backing
chain while both jobs are running. Another issue is that parts of
the backing chain are frozen by the running job and may not be
changed by the concurrent job when needed. The concept of parallel
jobs with common nodes is no longer considered vital.

Signed-off-by: Andrey Shinkevich 
---
 block/stream.c | 98 ++
 tests/qemu-iotests/030 | 51 +++-
 tests/qemu-iotests/030.out |  4 +-
 tests/qemu-iotests/141.out |  2 +-
 tests/qemu-iotests/245 | 22 +++
 5 files changed, 87 insertions(+), 90 deletions(-)

diff --git a/block/stream.c b/block/stream.c
index 1ba74ab..f6ed315 100644
--- a/block/stream.c
+++ b/block/stream.c
@@ -17,8 +17,10 @@
 #include "block/blockjob_int.h"
 #include "qapi/error.h"
 #include "qapi/qmp/qerror.h"
+#include "qapi/qmp/qdict.h"
 #include "qemu/ratelimit.h"
 #include "sysemu/block-backend.h"
+#include "block/copy-on-read.h"
 
 enum {
 /*
@@ -33,6 +35,8 @@ typedef struct StreamBlockJob {
 BlockJob common;
 BlockDriverState *base_overlay; /* COW overlay (stream from this) */
 BlockDriverState *above_base;   /* Node directly above the base */
+BlockDriverState *cor_filter_bs;
+BlockDriverState *target_bs;
 BlockdevOnError on_error;
 char *backing_file_str;
 bool bs_read_only;
@@ -44,8 +48,7 @@ static int coroutine_fn stream_populate(BlockBackend *blk,
 {
 assert(bytes < SIZE_MAX);
 
-return blk_co_preadv(blk, offset, bytes, NULL,
- BDRV_REQ_COPY_ON_READ | BDRV_REQ_PREFETCH);
+return blk_co_preadv(blk, offset, bytes, NULL, BDRV_REQ_PREFETCH);
 }
 
 static void stream_abort(Job *job)
@@ -53,23 +56,20 @@ static void stream_abort(Job *job)
 StreamBlockJob *s = container_of(job, StreamBlockJob, common.job);
 
 if (s->chain_frozen) {
-BlockJob *bjob = &s->common;
-bdrv_unfreeze_backing_chain(blk_bs(bjob->blk), s->above_base);
+bdrv_unfreeze_backing_chain(s->cor_filter_bs, s->above_base);
 }
 }
 
 static int stream_prepare(Job *job)
 {
 StreamBlockJob *s = container_of(job, StreamBlockJob, common.job);
-BlockJob *bjob = &s->common;
-BlockDriverState *bs = blk_bs(bjob->blk);
-BlockDriverState *unfiltered_bs = bdrv_skip_filters(bs);
+BlockDriverState *unfiltered_bs = bdrv_skip_filters(s->target_bs);
 BlockDriverState *base = bdrv_filter_or_cow_bs(s->above_base);
 BlockDriverState *base_unfiltered = NULL;
 Error *local_err = NULL;
 int ret = 0;
 
-bdrv_unfreeze_backing_chain(bs, s->above_base);
+bdrv_unfreeze_backing_chain(s->cor_filter_bs, s->above_base);
 s->chain_frozen = false;
 
 if (bdrv_cow_child(unfiltered_bs)) {
@@ -105,15 +105,16 @@ static void stream_clean(Job *job)
 {
 StreamBlockJob *s = container_of(job, StreamBlockJob, common.job);
 BlockJob *bjob = &s->common;
-BlockDriverState *bs = blk_bs(bjob->blk);
 
 /* Reopen the image back in read-only mode if necessary */
 if (s->bs_read_only) {
 /* Give up write permissions before making it read-only */
 blk_set_perm(bjob->blk, 0, BLK_PERM_ALL, &error_abort);
-bdrv_reopen_set_read_only(bs, true, NULL);
+bdrv_reopen_set_read_only(s->target_bs, true, NULL);
 }
 
+bdrv_cor_filter_drop(s->cor_filter_bs);
+
 g_free(s->backing_file_str);
 }
 
@@ -121,9 +122,7 @@ static int coroutine_fn stream_run(Job *job, Error **errp)
 {
 StreamBlockJob *s = container_of(job, StreamBlockJob, common.job);
 BlockBackend *blk = s->common.blk;
-BlockDriverState *bs = blk_bs(blk);
-BlockDriverState *unfiltered_bs = bdrv_skip_filters(bs);
-bool enable_cor = !bdrv_cow_child(s->base_overlay);
+BlockDriverState *unfiltered_bs = bdrv_skip_filters(s->target_bs);
 int64_t len;
 int64_t offset = 0;
 uint64_t delay_ns = 0;
@@ -135,21 +134,12 @@ static int coroutine_fn stream_run(Job *job, Error **errp)
 return 0;
 }
 
-len = bdrv_getlength(bs);
+len = bdrv_getlength(s->target_bs);
 if (len < 0) {
 return len;
 }
 job_progress_set_remaining(&s->common.job, len);
 
-/* Turn on copy-on-read for the whole block device so that guest read
- * requests help us make progress.  Only do this when copying 

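The parallel-job conflict described in the commit message above can be sketched with a toy model of a backing chain (the list-based chain, the node names, and the `insert_filter()` helper are illustrative assumptions, not QEMU APIs):

```python
# Toy model of a backing chain: index 0 is the base, last entry is the
# active layer. A block-stream job copies data into its 'top' node from
# the nodes below it, down to (and excluding) its 'base' node.

def insert_filter(chain, above, filter_name):
    """Insert a COR-filter node directly above node 'above'."""
    i = chain.index(above)
    chain.insert(i + 1, filter_name)

chain = ["base", "mid", "top"]

# Job 1 streams into 'mid' (its base is 'base');
# Job 2 streams into 'top' (its base is 'mid').
job2_base = "mid"

# Job 1 inserts its COR filter above 'mid' while both jobs run ...
insert_filter(chain, "mid", "cor-filter-1")

# ... and now the node directly below job 2's top is no longer its base:
below_top = chain[chain.index("top") - 1]
assert below_top == "cor-filter-1"
assert below_top != job2_base
```

This is the shape of the conflict that made 030:test_stream_parallel untenable: the second job's view of the chain shifts underneath it.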
[PATCH v12 01/14] copy-on-read: support preadv/pwritev_part functions

2020-10-22 Thread Andrey Shinkevich via
Add support for the recently introduced functions bdrv_co_preadv_part()
and bdrv_co_pwritev_part() to the COR-filter driver.

Signed-off-by: Andrey Shinkevich 
Reviewed-by: Vladimir Sementsov-Ogievskiy 
---
 block/copy-on-read.c | 28 
 1 file changed, 16 insertions(+), 12 deletions(-)

diff --git a/block/copy-on-read.c b/block/copy-on-read.c
index 2816e61..cb03e0f 100644
--- a/block/copy-on-read.c
+++ b/block/copy-on-read.c
@@ -74,21 +74,25 @@ static int64_t cor_getlength(BlockDriverState *bs)
 }
 
 
-static int coroutine_fn cor_co_preadv(BlockDriverState *bs,
-  uint64_t offset, uint64_t bytes,
-  QEMUIOVector *qiov, int flags)
+static int coroutine_fn cor_co_preadv_part(BlockDriverState *bs,
+   uint64_t offset, uint64_t bytes,
+   QEMUIOVector *qiov,
+   size_t qiov_offset,
+   int flags)
 {
-return bdrv_co_preadv(bs->file, offset, bytes, qiov,
-  flags | BDRV_REQ_COPY_ON_READ);
+return bdrv_co_preadv_part(bs->file, offset, bytes, qiov, qiov_offset,
+   flags | BDRV_REQ_COPY_ON_READ);
 }
 
 
-static int coroutine_fn cor_co_pwritev(BlockDriverState *bs,
-   uint64_t offset, uint64_t bytes,
-   QEMUIOVector *qiov, int flags)
+static int coroutine_fn cor_co_pwritev_part(BlockDriverState *bs,
+uint64_t offset,
+uint64_t bytes,
+QEMUIOVector *qiov,
+size_t qiov_offset, int flags)
 {
-
-return bdrv_co_pwritev(bs->file, offset, bytes, qiov, flags);
+return bdrv_co_pwritev_part(bs->file, offset, bytes, qiov, qiov_offset,
+flags);
 }
 
 
@@ -137,8 +141,8 @@ static BlockDriver bdrv_copy_on_read = {
 
 .bdrv_getlength = cor_getlength,
 
-.bdrv_co_preadv = cor_co_preadv,
-.bdrv_co_pwritev= cor_co_pwritev,
+.bdrv_co_preadv_part= cor_co_preadv_part,
+.bdrv_co_pwritev_part   = cor_co_pwritev_part,
 .bdrv_co_pwrite_zeroes  = cor_co_pwrite_zeroes,
 .bdrv_co_pdiscard   = cor_co_pdiscard,
 .bdrv_co_pwritev_compressed = cor_co_pwritev_compressed,
-- 
1.8.3.1
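The `_part` variants of the read/write callbacks add a `qiov_offset` parameter: the request fills (or drains) the I/O vector starting at a byte offset into it, instead of requiring the caller to construct a sub-vector first. A minimal sketch of that semantic, using plain byte buffers instead of real QEMUIOVector structures (hypothetical helper name):

```python
def preadv_part(storage, offset, nbytes, buf, buf_offset):
    """Read nbytes from storage[offset:] into buf starting at buf_offset."""
    data = storage[offset:offset + nbytes]
    buf[buf_offset:buf_offset + len(data)] = data
    return len(data)

storage = bytes(range(16))   # backing data 0x00..0x0f
buf = bytearray(8)

# Read 4 bytes from storage offset 2 into the buffer at offset 3.
n = preadv_part(storage, 2, 4, buf, 3)
assert n == 4
assert buf == bytearray([0, 0, 0, 2, 3, 4, 5, 0])
```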




[PATCH v12 05/14] qapi: create BlockdevOptionsCor structure for COR driver

2020-10-22 Thread Andrey Shinkevich via
Create the BlockdevOptionsCor structure for COR driver specific options,
splitting it off from BlockdevOptionsGenericFormat. The structure's only
option, 'bottom', names an image file node that limits the COR operations
in the backing chain.

Suggested-by: Max Reitz 
Signed-off-by: Andrey Shinkevich 
---
 qapi/block-core.json | 21 -
 1 file changed, 20 insertions(+), 1 deletion(-)

diff --git a/qapi/block-core.json b/qapi/block-core.json
index 0a64306..bf465f6 100644
--- a/qapi/block-core.json
+++ b/qapi/block-core.json
@@ -3938,6 +3938,25 @@
   'data': { 'throttle-group': 'str',
 'file' : 'BlockdevRef'
  } }
+
+##
+# @BlockdevOptionsCor:
+#
+# Driver specific block device options for the copy-on-read driver.
+#
+# @bottom: the name of a non-filter node (allocation-bearing layer) that limits
+#  the COR operations in the backing chain (inclusive).
+#  For the block-stream job, it will be the first non-filter overlay of
+#  the base node. We do not involve the base node into the COR
+#  operations because the base may change due to a concurrent
+#  block-commit job on the same backing chain.
+#
+# Since: 5.2
+##
+{ 'struct': 'BlockdevOptionsCor',
+  'base': 'BlockdevOptionsGenericFormat',
+  'data': { '*bottom': 'str' } }
+
 ##
 # @BlockdevOptions:
 #
@@ -3990,7 +4009,7 @@
   'bochs':  'BlockdevOptionsGenericFormat',
   'cloop':  'BlockdevOptionsGenericFormat',
   'compress':   'BlockdevOptionsGenericFormat',
-  'copy-on-read':'BlockdevOptionsGenericFormat',
+  'copy-on-read':'BlockdevOptionsCor',
   'dmg':'BlockdevOptionsGenericFormat',
   'file':   'BlockdevOptionsFile',
   'ftp':'BlockdevOptionsCurlFtp',
-- 
1.8.3.1
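With the new structure, a `blockdev-add` for the copy-on-read driver can carry the optional `bottom` option. A sketch of the QMP arguments as a Python dict, in the style of the iotests (the node names and filenames are examples, not fixed values):

```python
# QMP arguments for blockdev-add with the copy-on-read driver.
# 'bottom' must name a non-filter, allocation-bearing node inside the
# filtered chain (here: the backing node 'base-node').
cor_options = {
    "driver": "copy-on-read",
    "node-name": "cor-node",
    "bottom": "base-node",
    "file": {
        "driver": "qcow2",
        "node-name": "top-node",
        "file": {"driver": "file", "filename": "top.qcow2"},
        "backing": {
            "driver": "qcow2",
            "node-name": "base-node",
            "file": {"driver": "file", "filename": "base.qcow2"},
        },
    },
}

# 'bottom' is optional: without it the filter copies from the whole chain.
assert cor_options["file"]["backing"]["node-name"] == cor_options["bottom"]
```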




[PATCH v12 10/14] block: include supported_read_flags into BDS structure

2020-10-22 Thread Andrey Shinkevich via
Add the new member supported_read_flags to the BlockDriverState
structure. It will control the flags set for copy-on-read operations.
Make the generic block layer check the supported read flags before they
are passed to a block driver.

Suggested-by: Vladimir Sementsov-Ogievskiy 
Signed-off-by: Andrey Shinkevich 
---
 block/io.c| 12 ++--
 include/block/block_int.h |  4 
 2 files changed, 14 insertions(+), 2 deletions(-)

diff --git a/block/io.c b/block/io.c
index 54f0968..78ddf13 100644
--- a/block/io.c
+++ b/block/io.c
@@ -1392,6 +1392,9 @@ static int coroutine_fn bdrv_aligned_preadv(BdrvChild *child,
 if (flags & BDRV_REQ_COPY_ON_READ) {
 int64_t pnum;
 
+/* The flag BDRV_REQ_COPY_ON_READ has reached its addressee */
+flags &= ~BDRV_REQ_COPY_ON_READ;
+
 ret = bdrv_is_allocated(bs, offset, bytes, &pnum);
 if (ret < 0) {
 goto out;
@@ -1413,9 +1416,13 @@ static int coroutine_fn bdrv_aligned_preadv(BdrvChild *child,
 goto out;
 }
 
+if (flags & ~bs->supported_read_flags) {
+abort();
+}
+
 max_bytes = ROUND_UP(MAX(0, total_bytes - offset), align);
 if (bytes <= max_bytes && bytes <= max_transfer) {
-ret = bdrv_driver_preadv(bs, offset, bytes, qiov, qiov_offset, 0);
+ret = bdrv_driver_preadv(bs, offset, bytes, qiov, qiov_offset, flags);
 goto out;
 }
 
@@ -1428,7 +1435,8 @@ static int coroutine_fn bdrv_aligned_preadv(BdrvChild *child,
 
 ret = bdrv_driver_preadv(bs, offset + bytes - bytes_remaining,
  num, qiov,
- qiov_offset + bytes - bytes_remaining, 0);
+ qiov_offset + bytes - bytes_remaining,
+ flags);
 max_bytes -= num;
 } else {
 num = bytes_remaining;
diff --git a/include/block/block_int.h b/include/block/block_int.h
index f782737..474174c 100644
--- a/include/block/block_int.h
+++ b/include/block/block_int.h
@@ -873,6 +873,10 @@ struct BlockDriverState {
 /* I/O Limits */
 BlockLimits bl;
 
+/*
+ * Flags honored during pread
+ */
+unsigned int supported_read_flags;
 /* Flags honored during pwrite (so far: BDRV_REQ_FUA,
  * BDRV_REQ_WRITE_UNCHANGED).
  * If a driver does not support BDRV_REQ_WRITE_UNCHANGED, those
-- 
1.8.3.1
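The gating added to `bdrv_aligned_preadv()` can be sketched as follows: the generic layer consumes `BDRV_REQ_COPY_ON_READ` itself, and then refuses to forward any remaining flag the driver has not declared in `supported_read_flags`. The bit values below are illustrative, not QEMU's actual flag assignments:

```python
BDRV_REQ_COPY_ON_READ = 0x1   # illustrative bit values
BDRV_REQ_PREFETCH     = 0x2

def aligned_preadv(supported_read_flags, flags):
    if flags & BDRV_REQ_COPY_ON_READ:
        # The flag has reached its addressee; the generic layer handles
        # COR itself and strips the flag before calling the driver.
        flags &= ~BDRV_REQ_COPY_ON_READ
    if flags & ~supported_read_flags:
        raise AssertionError("unsupported read flag leaked to the driver")
    return flags  # flags actually forwarded to bdrv_driver_preadv()

# A driver advertising PREFETCH receives it; COR never reaches the driver.
assert aligned_preadv(BDRV_REQ_PREFETCH,
                      BDRV_REQ_COPY_ON_READ | BDRV_REQ_PREFETCH) \
       == BDRV_REQ_PREFETCH
```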




[PATCH v12 02/14] block: add insert/remove node functions

2020-10-22 Thread Andrey Shinkevich via
Provide API for a node insertion to and removal from a backing chain.

Suggested-by: Max Reitz 
Signed-off-by: Andrey Shinkevich 
---
 block.c   | 49 +
 include/block/block.h |  3 +++
 2 files changed, 52 insertions(+)

diff --git a/block.c b/block.c
index 430edf7..502b483 100644
--- a/block.c
+++ b/block.c
@@ -4670,6 +4670,55 @@ static void bdrv_delete(BlockDriverState *bs)
 g_free(bs);
 }
 
+BlockDriverState *bdrv_insert_node(BlockDriverState *bs, QDict *node_options,
+   int flags, Error **errp)
+{
+BlockDriverState *new_node_bs;
+Error *local_err = NULL;
+
+new_node_bs = bdrv_open(NULL, NULL, node_options, flags, errp);
+if (new_node_bs == NULL) {
+error_prepend(errp, "Could not create node: ");
+return NULL;
+}
+
+bdrv_drained_begin(bs);
+bdrv_replace_node(bs, new_node_bs, &local_err);
+bdrv_drained_end(bs);
+
+if (local_err) {
+bdrv_unref(new_node_bs);
+error_propagate(errp, local_err);
+return NULL;
+}
+
+return new_node_bs;
+}
+
+void bdrv_remove_node(BlockDriverState *bs)
+{
+BdrvChild *child;
+BlockDriverState *inferior_bs;
+
+child = bdrv_filter_or_cow_child(bs);
+if (!child) {
+return;
+}
+inferior_bs = child->bs;
+
+/* Retain the BDS until we complete the graph change. */
+bdrv_ref(inferior_bs);
+/* Hold a guest back from writing while permissions are being reset. */
+bdrv_drained_begin(inferior_bs);
+/* Refresh permissions before the graph change. */
+bdrv_child_refresh_perms(bs, child, &error_abort);
+bdrv_replace_node(bs, inferior_bs, &error_abort);
+
+bdrv_drained_end(inferior_bs);
+bdrv_unref(inferior_bs);
+bdrv_unref(bs);
+}
+
 /*
  * Run consistency checks on an image
  *
diff --git a/include/block/block.h b/include/block/block.h
index d16c401..ae7612f 100644
--- a/include/block/block.h
+++ b/include/block/block.h
@@ -350,6 +350,9 @@ void bdrv_append(BlockDriverState *bs_new, BlockDriverState *bs_top,
  Error **errp);
 void bdrv_replace_node(BlockDriverState *from, BlockDriverState *to,
Error **errp);
+BlockDriverState *bdrv_insert_node(BlockDriverState *bs, QDict *node_options,
+   int flags, Error **errp);
+void bdrv_remove_node(BlockDriverState *bs);
 
 int bdrv_parse_aio(const char *mode, int *flags);
 int bdrv_parse_cache_mode(const char *mode, int *flags, bool *writethrough);
-- 
1.8.3.1
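Both helpers wrap the same primitive, `bdrv_replace_node()`: insertion opens a new node and redirects all parents of `bs` to it, while removal redirects all parents of the node back to its filtered child. A toy model of that parent-edge rewiring (dicts instead of BdrvChild edges; the names are illustrative assumptions):

```python
def replace_node(parents, old, new):
    """Redirect every parent edge pointing at 'old' to 'new'."""
    for p in parents:
        if p["child"] is old:
            p["child"] = new

top = {"name": "top", "child": None}
parent = {"child": top}

# Insert: a new filter node takes over all of top's parents ...
filt = {"name": "filter", "child": top}
replace_node([parent], top, filt)
assert parent["child"] is filt

# ... remove: parents of the filter point back at the filtered child.
replace_node([parent], filt, filt["child"])
assert parent["child"] is top
```

The real functions additionally take references, drain the node, and refresh permissions around the graph change, as the patch body shows.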




[PATCH v12 08/14] iotests: add #310 to test bottom node in COR driver

2020-10-22 Thread Andrey Shinkevich via
The test case #310 is similar to #216 by Max Reitz. The difference is
that test #310 passes a bottom node to the COR filter driver.

Signed-off-by: Andrey Shinkevich 
---
 tests/qemu-iotests/310 | 109 +
 tests/qemu-iotests/310.out |  15 +++
 tests/qemu-iotests/group   |   3 +-
 3 files changed, 126 insertions(+), 1 deletion(-)
 create mode 100755 tests/qemu-iotests/310
 create mode 100644 tests/qemu-iotests/310.out

diff --git a/tests/qemu-iotests/310 b/tests/qemu-iotests/310
new file mode 100755
index 000..5ad7ad2
--- /dev/null
+++ b/tests/qemu-iotests/310
@@ -0,0 +1,109 @@
+#!/usr/bin/env python3
+#
+# Copy-on-read tests using a COR filter with a bottom node
+#
+# Copyright (c) 2020 Virtuozzo International GmbH
+#
+# This program is free software; you can redistribute it and/or modify
+# it under the terms of the GNU General Public License as published by
+# the Free Software Foundation; either version 2 of the License, or
+# (at your option) any later version.
+#
+# This program is distributed in the hope that it will be useful,
+# but WITHOUT ANY WARRANTY; without even the implied warranty of
+# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+# GNU General Public License for more details.
+#
+# You should have received a copy of the GNU General Public License
+# along with this program.  If not, see .
+#
+
+import iotests
+from iotests import log, qemu_img, qemu_io_silent
+
+# Need backing file support
+iotests.script_initialize(supported_fmts=['qcow2', 'qcow', 'qed', 'vmdk'],
+  supported_platforms=['linux'])
+
+log('')
+log('=== Copy-on-read across nodes ===')
+log('')
+
+# This test is similar to the 216 one by Max Reitz 
+# The difference is that this test case involves a bottom node to the
+# COR filter driver.
+
+with iotests.FilePath('base.img') as base_img_path, \
+ iotests.FilePath('mid.img') as mid_img_path, \
+ iotests.FilePath('top.img') as top_img_path, \
+ iotests.VM() as vm:
+
+log('--- Setting up images ---')
+log('')
+
+assert qemu_img('create', '-f', iotests.imgfmt, base_img_path, '64M') == 0
+assert qemu_io_silent(base_img_path, '-c', 'write -P 1 0M 1M') == 0
+assert qemu_io_silent(base_img_path, '-c', 'write -P 1 3M 1M') == 0
+assert qemu_img('create', '-f', iotests.imgfmt, '-b', base_img_path,
+'-F', iotests.imgfmt, mid_img_path) == 0
+assert qemu_io_silent(mid_img_path,  '-c', 'write -P 3 2M 1M') == 0
+assert qemu_io_silent(mid_img_path,  '-c', 'write -P 3 4M 1M') == 0
+assert qemu_img('create', '-f', iotests.imgfmt, '-b', mid_img_path,
+'-F', iotests.imgfmt, top_img_path) == 0
+assert qemu_io_silent(top_img_path,  '-c', 'write -P 2 1M 1M') == 0
+
+log('Done')
+
+log('')
+log('--- Doing COR ---')
+log('')
+
+vm.launch()
+
+log(vm.qmp('blockdev-add',
+node_name='node0',
+driver='copy-on-read',
+bottom='node2',
+file={
+'driver': iotests.imgfmt,
+'file': {
+'driver': 'file',
+'filename': top_img_path
+},
+'backing': {
+'node-name': 'node2',
+'driver': iotests.imgfmt,
+'file': {
+'driver': 'file',
+'filename': mid_img_path
+},
+'backing': {
+#'node-name': 'node2',
+'driver': iotests.imgfmt,
+'file': {
+'driver': 'file',
+'filename': base_img_path
+}
+},
+}
+}))
+
+# Trigger COR
+log(vm.qmp('human-monitor-command',
+   command_line='qemu-io node0 "read 0 5M"'))
+
+vm.shutdown()
+
+log('')
+log('--- Checking COR result ---')
+log('')
+
+assert qemu_io_silent(base_img_path, '-c', 'discard 0 4M') == 0
+assert qemu_io_silent(mid_img_path, '-c', 'discard 0M 5M') == 0
+assert qemu_io_silent(top_img_path,  '-c', 'read -P 1 0M 1M') != 0
+assert qemu_io_silent(top_img_path,  '-c', 'read -P 2 1M 1M') == 0
+assert qemu_io_silent(top_img_path,  '-c', 'read -P 3 2M 1M') == 0
+assert qemu_io_silent(top_img_path,  '-c', 'read -P 1 3M 1M') != 0
+assert qemu_io_silent(top_img_path,  '-c', 'read -P 3 4M 1M') == 0
+
+log('Done')
diff --git a/tests/qemu-iotests/310.out b/tests/qemu-iotests/310.out
new file mode 100644

[PATCH v12 07/14] copy-on-read: limit COR operations to bottom node

2020-10-22 Thread Andrey Shinkevich via
Limit COR operations to the bottom node (inclusive) in the backing
chain when the bottom node name is given. This is useful for a block-stream
job with the COR-filter applied. The bottom node is passed instead of the
base because the base itself may change due to concurrent commit jobs on
the same backing chain.

Signed-off-by: Andrey Shinkevich 
---
 block/copy-on-read.c | 42 --
 1 file changed, 40 insertions(+), 2 deletions(-)

diff --git a/block/copy-on-read.c b/block/copy-on-read.c
index 3d8e4db..8178a91 100644
--- a/block/copy-on-read.c
+++ b/block/copy-on-read.c
@@ -123,8 +123,46 @@ static int coroutine_fn cor_co_preadv_part(BlockDriverState *bs,
size_t qiov_offset,
int flags)
 {
-return bdrv_co_preadv_part(bs->file, offset, bytes, qiov, qiov_offset,
-   flags | BDRV_REQ_COPY_ON_READ);
+int64_t n = 0;
+int local_flags;
+int ret;
+BDRVStateCOR *state = bs->opaque;
+
+if (!state->bottom_bs) {
+return bdrv_co_preadv_part(bs->file, offset, bytes, qiov, qiov_offset,
+   flags | BDRV_REQ_COPY_ON_READ);
+}
+
+while (bytes) {
+local_flags = flags;
+
+/* In case of failure, try to copy-on-read anyway */
+ret = bdrv_is_allocated(bs->file->bs, offset, bytes, &n);
+if (!ret || ret < 0) {
+ret = bdrv_is_allocated_above(bdrv_backing_chain_next(bs->file->bs),
+  state->bottom_bs, true, offset,
+  n, &n);
+if (ret == 1 || ret < 0) {
+local_flags |= BDRV_REQ_COPY_ON_READ;
+}
+/* Finish earlier if the end of a backing file has been reached */
+if (ret == 0 && n == 0) {
+break;
+}
+}
+
+ret = bdrv_co_preadv_part(bs->file, offset, n, qiov, qiov_offset,
+  local_flags);
+if (ret < 0) {
+return ret;
+}
+
+offset += n;
+qiov_offset += n;
+bytes -= n;
+}
+
+return 0;
 }
 
 
-- 
1.8.3.1
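The loop above walks the request in extents: for each extent it asks whether the data is already allocated in the top node and, if not, whether it is allocated somewhere between the top and the bottom node (inclusive); only in the latter case does it set `BDRV_REQ_COPY_ON_READ`. A simplified sketch of that decision, with allocation predicates standing in for the block nodes (the helper names and the fixed extent size are assumptions for illustration):

```python
def plan_cor(offset, nbytes, top_alloc, above_bottom_alloc, extent=2):
    """Return [(offset, length, use_cor)] sub-requests for one read."""
    plan = []
    while nbytes:
        n = min(extent, nbytes)
        use_cor = False
        if not top_alloc(offset):
            # Not in the top node: copy only if some node down to
            # (and including) the bottom node holds the data.
            use_cor = above_bottom_alloc(offset)
        plan.append((offset, n, use_cor))
        offset += n
        nbytes -= n
    return plan

# Extent at offset 0 lives in the top node; extent at 2 only below it.
plan = plan_cor(0, 4,
                top_alloc=lambda off: off < 2,
                above_bottom_alloc=lambda off: True)
assert plan == [(0, 2, False), (2, 2, True)]
```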




[PATCH v12 03/14] copy-on-read: add filter drop function

2020-10-22 Thread Andrey Shinkevich via
Provide API for the COR-filter removal. Also, drop the filter child
permissions into an inactive state while the filter node is being
removed. This function may be considered an intermediate solution until
we are able to use bdrv_remove_node(), which will be possible once the
QEMU permission update system has been overhauled.
To insert the filter, the generic block layer function
bdrv_insert_node() can be used.

Signed-off-by: Andrey Shinkevich 
---
 block/copy-on-read.c | 56 
 block/copy-on-read.h | 32 ++
 2 files changed, 88 insertions(+)
 create mode 100644 block/copy-on-read.h

diff --git a/block/copy-on-read.c b/block/copy-on-read.c
index cb03e0f..618c4c4 100644
--- a/block/copy-on-read.c
+++ b/block/copy-on-read.c
@@ -23,11 +23,20 @@
 #include "qemu/osdep.h"
 #include "block/block_int.h"
 #include "qemu/module.h"
+#include "qapi/error.h"
+#include "block/copy-on-read.h"
+
+
+typedef struct BDRVStateCOR {
+bool active;
+} BDRVStateCOR;
 
 
 static int cor_open(BlockDriverState *bs, QDict *options, int flags,
 Error **errp)
 {
+BDRVStateCOR *state = bs->opaque;
+
 bs->file = bdrv_open_child(NULL, options, "file", bs, _of_bds,
BDRV_CHILD_FILTERED | BDRV_CHILD_PRIMARY,
false, errp);
@@ -42,6 +51,13 @@ static int cor_open(BlockDriverState *bs, QDict *options, int flags,
 ((BDRV_REQ_FUA | BDRV_REQ_MAY_UNMAP | BDRV_REQ_NO_FALLBACK) &
 bs->file->bs->supported_zero_flags);
 
+state->active = true;
+
+/*
+ * We don't need to call bdrv_child_refresh_perms() now as the permissions
+ * will be updated later when the filter node gets its parent.
+ */
+
 return 0;
 }
 
@@ -57,6 +73,17 @@ static void cor_child_perm(BlockDriverState *bs, BdrvChild *c,
uint64_t perm, uint64_t shared,
uint64_t *nperm, uint64_t *nshared)
 {
+BDRVStateCOR *s = bs->opaque;
+
+if (!s->active) {
+/*
+ * While the filter is being removed
+ */
+*nperm = 0;
+*nshared = BLK_PERM_ALL;
+return;
+}
+
 *nperm = perm & PERM_PASSTHROUGH;
 *nshared = (shared & PERM_PASSTHROUGH) | PERM_UNCHANGED;
 
@@ -135,6 +162,7 @@ static void cor_lock_medium(BlockDriverState *bs, bool locked)
 
 static BlockDriver bdrv_copy_on_read = {
 .format_name= "copy-on-read",
+.instance_size  = sizeof(BDRVStateCOR),
 
 .bdrv_open  = cor_open,
 .bdrv_child_perm= cor_child_perm,
@@ -154,6 +182,34 @@ static BlockDriver bdrv_copy_on_read = {
 .is_filter  = true,
 };
 
+
+void bdrv_cor_filter_drop(BlockDriverState *cor_filter_bs)
+{
+BdrvChild *child;
+BlockDriverState *bs;
+BDRVStateCOR *s = cor_filter_bs->opaque;
+
+child = bdrv_filter_child(cor_filter_bs);
+if (!child) {
+return;
+}
+bs = child->bs;
+
+/* Retain the BDS until we complete the graph change. */
+bdrv_ref(bs);
+/* Hold a guest back from writing while permissions are being reset. */
+bdrv_drained_begin(bs);
+/* Drop permissions before the graph change. */
+s->active = false;
+bdrv_child_refresh_perms(cor_filter_bs, child, &error_abort);
+bdrv_replace_node(cor_filter_bs, bs, &error_abort);
+
+bdrv_drained_end(bs);
+bdrv_unref(bs);
+bdrv_unref(cor_filter_bs);
+}
+
+
 static void bdrv_copy_on_read_init(void)
 {
 bdrv_register(_copy_on_read);
diff --git a/block/copy-on-read.h b/block/copy-on-read.h
new file mode 100644
index 000..7bf405d
--- /dev/null
+++ b/block/copy-on-read.h
@@ -0,0 +1,32 @@
+/*
+ * Copy-on-read filter block driver
+ *
+ * The filter driver performs Copy-On-Read (COR) operations
+ *
+ * Copyright (c) 2018-2020 Virtuozzo International GmbH.
+ *
+ * Author:
+ *   Andrey Shinkevich 
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program. If not, see .
+ */
+
+#ifndef BLOCK_COPY_ON_READ
+#define BLOCK_COPY_ON_READ
+
+#include "block/block_int.h"
+
+void bdrv_cor_filter_drop(BlockDriverState *cor_filter_bs);
+
+#endif /* BLOCK_COPY_ON_READ */
-- 
1.8.3.1




[PATCH v12 04/14] qapi: add filter-node-name to block-stream

2020-10-22 Thread Andrey Shinkevich via
Provide the possibility to pass the 'filter-node-name' parameter to the
block-stream job as it is done for the commit block job.

Signed-off-by: Andrey Shinkevich 
Reviewed-by: Vladimir Sementsov-Ogievskiy 
---
 block/monitor/block-hmp-cmds.c | 4 ++--
 block/stream.c | 4 +++-
 blockdev.c | 4 +++-
 include/block/block_int.h  | 7 ++-
 qapi/block-core.json   | 6 ++
 5 files changed, 20 insertions(+), 5 deletions(-)

diff --git a/block/monitor/block-hmp-cmds.c b/block/monitor/block-hmp-cmds.c
index d15a2be..e8a58f3 100644
--- a/block/monitor/block-hmp-cmds.c
+++ b/block/monitor/block-hmp-cmds.c
@@ -508,8 +508,8 @@ void hmp_block_stream(Monitor *mon, const QDict *qdict)
 
 qmp_block_stream(true, device, device, base != NULL, base, false, NULL,
  false, NULL, qdict_haskey(qdict, "speed"), speed, true,
- BLOCKDEV_ON_ERROR_REPORT, false, false, false, false,
- &error);
+ BLOCKDEV_ON_ERROR_REPORT, false, NULL, false, false, false,
+ false, &error);
 
 hmp_handle_error(mon, error);
 }
diff --git a/block/stream.c b/block/stream.c
index 8ce6729..e0540ee 100644
--- a/block/stream.c
+++ b/block/stream.c
@@ -221,7 +221,9 @@ static const BlockJobDriver stream_job_driver = {
 void stream_start(const char *job_id, BlockDriverState *bs,
   BlockDriverState *base, const char *backing_file_str,
   int creation_flags, int64_t speed,
-  BlockdevOnError on_error, Error **errp)
+  BlockdevOnError on_error,
+  const char *filter_node_name,
+  Error **errp)
 {
 StreamBlockJob *s;
 BlockDriverState *iter;
diff --git a/blockdev.c b/blockdev.c
index fe6fb5d..c917625 100644
--- a/blockdev.c
+++ b/blockdev.c
@@ -2499,6 +2499,7 @@ void qmp_block_stream(bool has_job_id, const char *job_id, const char *device,
   bool has_backing_file, const char *backing_file,
   bool has_speed, int64_t speed,
   bool has_on_error, BlockdevOnError on_error,
+  bool has_filter_node_name, const char *filter_node_name,
   bool has_auto_finalize, bool auto_finalize,
   bool has_auto_dismiss, bool auto_dismiss,
   Error **errp)
@@ -2581,7 +2582,8 @@ void qmp_block_stream(bool has_job_id, const char *job_id, const char *device,
 }
 
 stream_start(has_job_id ? job_id : NULL, bs, base_bs, base_name,
- job_flags, has_speed ? speed : 0, on_error, &local_err);
+ job_flags, has_speed ? speed : 0, on_error,
+ filter_node_name, &local_err);
 if (local_err) {
 error_propagate(errp, local_err);
 goto out;
diff --git a/include/block/block_int.h b/include/block/block_int.h
index 38cad9d..f782737 100644
--- a/include/block/block_int.h
+++ b/include/block/block_int.h
@@ -1134,6 +1134,9 @@ int is_windows_drive(const char *filename);
  *  See @BlockJobCreateFlags
  * @speed: The maximum speed, in bytes per second, or 0 for unlimited.
  * @on_error: The action to take upon error.
+ * @filter_node_name: The node name that should be assigned to the filter
+ * driver that the stream job inserts into the graph above @bs. NULL means
+ * that a node name should be autogenerated.
  * @errp: Error object.
  *
  * Start a streaming operation on @bs.  Clusters that are unallocated
@@ -1146,7 +1149,9 @@ int is_windows_drive(const char *filename);
 void stream_start(const char *job_id, BlockDriverState *bs,
   BlockDriverState *base, const char *backing_file_str,
   int creation_flags, int64_t speed,
-  BlockdevOnError on_error, Error **errp);
+  BlockdevOnError on_error,
+  const char *filter_node_name,
+  Error **errp);
 
 /**
  * commit_start:
diff --git a/qapi/block-core.json b/qapi/block-core.json
index ee5ebef..0a64306 100644
--- a/qapi/block-core.json
+++ b/qapi/block-core.json
@@ -2542,6 +2542,11 @@
 #'stop' and 'enospc' can only be used if the block device
 #supports io-status (see BlockInfo).  Since 1.3.
 #
+# @filter-node-name: the node name that should be assigned to the
+#filter driver that the stream job inserts into the graph
+#above @device. If this option is not given, a node name is
+#autogenerated. (Since: 5.2)
+#
# @auto-finalize: When false, this job will wait in a PENDING state after it has
 # finished its work, waiting for @block-job-finalize before
 # making any block graph changes.
@@ -2572,6 +2577,7 @@
   'data': { '*job-id': 'str', 'device': 'str', '*base': 'str',
 '*base-node': 'str', '*backing-file': 'str', '*speed': 'int',
 '*on-error': 'BlockdevOnError',
+ 

Re: [PATCH v4 2/2] hw/block/nvme: add the dataset management command

2020-10-22 Thread Keith Busch
On Thu, Oct 22, 2020 at 07:43:33PM +0200, Klaus Jensen wrote:
> On Oct 22 08:01, Keith Busch wrote:
> > On Thu, Oct 22, 2020 at 09:33:13AM +0200, Klaus Jensen wrote:
> > > +if (--(*discards)) {
> > > +status = NVME_NO_COMPLETE;
> > > +} else {
> > > +g_free(discards);
> > > +req->opaque = NULL;
> > 
> > This case needs a
> > 
> > status = req->status;
> > 
> > So that we get the error set in the callback.
> > 
> 
> There are no cases that result in a non-zero status code here.

Your callback has a case that sets NVME_INTERNAL_DEV_ERROR status. That
would get ignored if the final discard reference is dropped from the
submission side.

+static void nvme_aio_discard_cb(void *opaque, int ret)
+{
+NvmeRequest *req = opaque;
+int *discards = req->opaque;
+
+trace_pci_nvme_aio_discard_cb(nvme_cid(req));
+
+if (ret) {
+req->status = NVME_INTERNAL_DEV_ERROR;
+trace_pci_nvme_err_aio(nvme_cid(req), strerror(ret),
+   req->status);
+}
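The issue Keith points at can be reduced to a small model: each outstanding discard holds one reference, and whichever side drops the last reference must report the status recorded by the callbacks; returning a hard-coded success from the submission side would silently discard an error observed in a callback. This is an illustrative model, not the actual QEMU code paths (status values are placeholders):

```python
SUCCESS, NO_COMPLETE, INTERNAL_DEV_ERROR = 0x0, 0xFFFF, 0x6

def discard_cb(req, ret):
    """Per-discard completion: record errors, drop one reference."""
    if ret:
        req["status"] = INTERNAL_DEV_ERROR
    req["refs"] -= 1

def submit(req, results):
    req["refs"] = 1 + len(results)        # +1 for the submission side
    for ret in results:                   # callbacks may run immediately
        discard_cb(req, ret)
    req["refs"] -= 1                      # drop the submission reference
    if req["refs"]:
        return NO_COMPLETE                # completion happens in a callback
    return req["status"]                  # NOT a hard-coded SUCCESS

req = {"status": SUCCESS, "refs": 0}
# All discards complete (one with an error) before submission returns:
assert submit(req, results=[0, -5]) == INTERNAL_DEV_ERROR
```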



Re: [PATCH 0/2] hw/block/nvme: two fixes for create sq/cq

2020-10-22 Thread Klaus Jensen
On Oct 22 08:20, Keith Busch wrote:
> On Thu, Oct 22, 2020 at 03:24:02PM +0200, Klaus Jensen wrote:
> > From: Klaus Jensen 
> > 
> > The first patch is a follow up to "hw/block/nvme: fix prp mapping status
> > codes" and fixes some status codes in the nvme_create_{sq,cq} functions.
> > 
> > The second patch fixes a faulty check on the given queue identifier.
> 
> Looks good.
> 
> Reviewed-by: Keith Busch 

Thanks! Applied to nvme-next.




Re: [PATCH v4 2/2] hw/block/nvme: add the dataset management command

2020-10-22 Thread Klaus Jensen
On Oct 22 08:01, Keith Busch wrote:
> On Thu, Oct 22, 2020 at 09:33:13AM +0200, Klaus Jensen wrote:
> > +if (--(*discards)) {
> > +status = NVME_NO_COMPLETE;
> > +} else {
> > +g_free(discards);
> > +req->opaque = NULL;
> 
> This case needs a
> 
> status = req->status;
> 
> So that we get the error set in the callback.
> 

There are no cases that result in a non-zero status code here. If an LBA
range is invalid we simply continue with the next. In case the DMA
transfer fails, we return the error directly and the normal path takes
care of it. The else block is for when there are no pending aios for
some reason (all invalid ranges or they completed immediately) - in that
case we can just return NVME_SUCCESS directly.

> Otherwise, this looks fine. I am assuming everything still runs single
> threaded since this isn't using atomics.

Yeah, all device code (including callbacks) run on the main thread.




Re: [PATCH v1 0/2] Add timeout mechanism to qmp actions

2020-10-22 Thread Fam Zheng
On Tue, 2020-10-20 at 09:34 +0800, Zhenyu Ye wrote:
> On 2020/10/19 21:25, Paolo Bonzini wrote:
> > On 19/10/20 14:40, Zhenyu Ye wrote:
> > > The kernel backtrace for io_submit in GUEST is:
> > > 
> > >   guest# ./offcputime -K -p `pgrep -nx fio`
> > >   b'finish_task_switch'
> > >   b'__schedule'
> > >   b'schedule'
> > >   b'io_schedule'
> > >   b'blk_mq_get_tag'
> > >   b'blk_mq_get_request'
> > >   b'blk_mq_make_request'
> > >   b'generic_make_request'
> > >   b'submit_bio'
> > >   b'blkdev_direct_IO'
> > >   b'generic_file_read_iter'
> > >   b'aio_read'
> > >   b'io_submit_one'
> > >   b'__x64_sys_io_submit'
> > >   b'do_syscall_64'
> > >   b'entry_SYSCALL_64_after_hwframe'
> > >   -fio (1464)
> > >   40031912
> > > 
> > > And Linux io_uring can avoid the latency problem.

Thanks for the info. What this tells us is basically that the number of
inflight requests is high. It's sad that linux-aio is in practice
implemented as a blocking API.

Host side backtrace will be of more help. Can you get that too?

Fam

> > 
> > What filesystem are you using?
> > 
> 
> On host, the VM image and disk images are based on ext4 filesystem.
> In guest, the '/' uses xfs filesystem, and the disks are raw devices.
> 
> guest# df -hT
> Filesystem  Type  Size  Used Avail Use% Mounted on
> devtmpfsdevtmpfs   16G 0   16G   0% /dev
> tmpfs   tmpfs  16G 0   16G   0% /dev/shm
> tmpfs   tmpfs  16G  976K   16G   1% /run
> /dev/mapper/fedora-root xfs   8.0G  3.2G  4.9G  40% /
> tmpfs   tmpfs  16G 0   16G   0% /tmp
> /dev/sda1   xfs  1014M  181M  834M  18% /boot
> tmpfs   tmpfs 3.2G 0  3.2G   0% /run/user/0
> 
> guest# lsblk
> NAMEMAJ:MIN RM SIZE RO TYPE MOUNTPOINT
> sda   8:00  10G  0 disk
> ├─sda18:10   1G  0 part /boot
> └─sda28:20   9G  0 part
>   ├─fedora-root 253:00   8G  0 lvm  /
>   └─fedora-swap 253:10   1G  0 lvm  [SWAP]
> vda 252:00  10G  0 disk
> vdb 252:16   0  10G  0 disk
> vdc 252:32   0  10G  0 disk
> vdd 252:48   0  10G  0 disk
> 
> Thanks,
> Zhenyu
> 




Re: RFC: tracking valid backing chain issue

2020-10-22 Thread Nikolay Shirokovskiy



On 21.10.2020 13:56, Kevin Wolf wrote:
> Am 20.10.2020 um 12:29 hat Nikolay Shirokovskiy geschrieben:
>>
>>
>> On 20.10.2020 13:23, Nikolay Shirokovskiy wrote:
>>>
>>>
>>> On 20.10.2020 11:50, Kevin Wolf wrote:
 Am 20.10.2020 um 10:21 hat Nikolay Shirokovskiy geschrieben:
> Hi, all.
>
> I recently found a corner case where it is impossible AFAIK to find out
> the valid backing chain after a block commit operation. Imagine committing
> the top image. After the commit reaches the ready state, pivot is sent and
> then mgmt crashes. So far so good. Upon next start, mgmt can either check
> the block job status for a non-auto-dismissed job or, with older qemu,
> inspect the backing chain to infer whether the pivot was successful.
>
> But imagine the qemu process was destroyed too after the mgmt crash. In
> this case there is no way to know what the valid backing chain is. Yes,
> libvirt starts the qemu process with the -no-shutdown flag, so the process
> is not destroyed on guest shutdown, but the process can still crash.

 I don't think this is a problem.

 Between completion of the job and finalising it, both the base node and
 the top node are equivalent. You can access either and you'll always get
 the same data.

 So if libvirt didn't save that the job was already completed, it will
 use the old image file, and it's fine. And if libvirt already sent the
 job-finalize command, it will first have saved that the job was
 completed and therefore use the new image, and it's fine, too.
>>>
>>> So finalizing can't fail? Otherwise libvirt can save that the job is
>>> completed and that the graph has changed while it really hasn't.
>>
>> Hmm, it is not even a matter of qemu. Libvirt can save that the job is
>> completed and then crash before sending the finalize command to qemu. So
>> after a qemu crash and libvirt restart, libvirt would think that the
>> valid backing chain is without the top image, which is not true.
> 
> Why not? During this time the top and base image are equally valid to be
> used as the active image.
> 
> If QEMU hadn't switched from top to base yet when it crashed, it's still
> no problem if libvirt does the switch when restarting QEMU.
> 

Now it's clear. Thanks for the explanation.

Nikolay
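For context, the management flow under discussion can be sketched as a QMP exchange (a sketch only; the device path, job id, and image path are made up for illustration):

```json
{"execute": "block-commit",
 "arguments": {"device": "drive0", "job-id": "commit0",
               "top": "/images/top.qcow2", "auto-finalize": false}}

{"execute": "job-finalize", "arguments": {"id": "commit0"}}
```

The window Kevin describes lies between the job completing and `job-finalize` taking effect: during that window, top and base read identically, so a crash on either side of recording the job state leaves mgmt with a usable image either way.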



Re: [PATCH] MAINTAINERS: Cover "block/nvme.h" file

2020-10-22 Thread Stefan Hajnoczi
On Wed, Jul 01, 2020 at 04:06:34PM +0200, Philippe Mathieu-Daudé wrote:
> The "block/nvme.h" header is shared by both the NVMe block
> driver and the NVMe emulated device. Add the 'F:' entry on
> both sections, so all maintainers/reviewers are notified
> when it is changed.
> 
> Signed-off-by: Philippe Mathieu-Daudé 
> ---
> Cc: Fam Zheng 
> Cc: Keith Busch 
> Cc: Kevin Wolf 
> Cc: Max Reitz 
> Cc: Niklas Cassel 
> Cc: Klaus Jensen 
> Cc: Klaus Jensen 
> Cc: Maxim Levitsky 
> Cc: Dmitry Fomichev 
> Cc: Andrzej Jakowski 
> ---
>  MAINTAINERS | 2 ++
>  1 file changed, 2 insertions(+)

Thanks, applied to my block-next tree:
https://gitlab.com/stefanha/qemu/commits/block-next

Stefan


signature.asc
Description: PGP signature


Re: [PATCH 04/16] util/vfio-helpers: Report error when IOMMU page size is not supported

2020-10-22 Thread Stefan Hajnoczi
On Tue, Oct 20, 2020 at 07:24:16PM +0200, Philippe Mathieu-Daudé wrote:
> This driver uses the host page size to align its memory regions,
> but this size is not always compatible with the IOMMU. Add a
> check that the size matches, and bail out with a hint about the
> minimum page size the driver should use.
> 
> Suggested-by: Alex Williamson 
> Signed-off-by: Philippe Mathieu-Daudé 
> ---
>  util/vfio-helpers.c | 28 ++--
>  util/trace-events   |  1 +
>  2 files changed, 27 insertions(+), 2 deletions(-)

Reviewed-by: Stefan Hajnoczi 




Re: [PATCH 0/2] hw/block/nvme: two fixes for create sq/cq

2020-10-22 Thread Keith Busch
On Thu, Oct 22, 2020 at 03:24:02PM +0200, Klaus Jensen wrote:
> From: Klaus Jensen 
> 
> The first patch is a follow-up to "hw/block/nvme: fix prp mapping status
> codes" and fixes some status codes in the nvme_create_{sq,cq} functions.
> 
> The second patch fixes a faulty check on the given queue identifier.

Looks good.

Reviewed-by: Keith Busch 



Re: [PULL 22/23] hw/sd: Fix incorrect populated function switch status data structure

2020-10-22 Thread Niek Linnenbank
Hi Bin, Philippe,

If I'm correct, the acceptance tests for orange pi need to be run with the
ARMBIAN_ARTIFACTS_CACHED flag set, which explicitly allows them to be run
using the armbian mirror. So if you pass that flag on the same command that
Philippe gave, the tests should run.

I have a follow-up question and I'm interested to hear your opinion on that,
Philippe. Should we perhaps update the orange pi tests (and maybe others)
so they use a reliable mirror that we can control, for example a github
repo? I would be happy to create a repo for that, at least for the orange
pi tests. But maybe there is already something planned as a more general
solution for artifacts of other machines as well?

regards,
Niek

Op do 22 okt. 2020 16:47 schreef Bin Meng :

> Hi Philippe,
>
> On Wed, Oct 21, 2020 at 6:07 PM Philippe Mathieu-Daudé 
> wrote:
> >
> > On 10/21/20 11:57 AM, Bin Meng wrote:
> > > Hi Philippe,
> > >
> > > On Tue, Oct 20, 2020 at 11:18 PM Philippe Mathieu-Daudé <
> f4...@amsat.org> wrote:
> > >>
> > >> Hi Bin,
> > >>
> > >> On 8/21/20 7:29 PM, Philippe Mathieu-Daudé wrote:
> > >>> From: Bin Meng 
> > >>>
> > >>> At present the function switch status data structure bits [399:376]
> > >>> are wrongly populated. These 3 bytes encode function switch status
> > >>> for the 6 function groups, with 4 bits per group, starting from
> > >>> function group 6 at bit 399, then followed by function group 5 at
> > >>> bit 395, and so on.
> > >>>
> > >>> However the code mistakenly fills in the function group 1 status
> > >>> at bit 399. This fixes the code logic.
> > >>>
> > >>> Fixes: a1bb27b1e9 ("SD card emulation (initial implementation)")
> > >>> Signed-off-by: Bin Meng 
> > >>> Reviewed-by: Philippe Mathieu-Daudé 
> > >>> Tested-by: Sai Pavan Boddu 
> > >>> Message-Id: <1598021136-49525-1-git-send-email-bmeng...@gmail.com>
> > >>> Signed-off-by: Philippe Mathieu-Daudé 
> > >>> ---
> > >>>hw/sd/sd.c | 3 ++-
> > >>>1 file changed, 2 insertions(+), 1 deletion(-)
> > >>>
> > >>> diff --git a/hw/sd/sd.c b/hw/sd/sd.c
> > >>> index 7c9d956f113..805e21fc883 100644
> > >>> --- a/hw/sd/sd.c
> > >>> +++ b/hw/sd/sd.c
> > >>> @@ -807,11 +807,12 @@ static void sd_function_switch(SDState *sd,
> uint32_t arg)
> > >>>sd->data[11] = 0x43;
> > >>>sd->data[12] = 0x80;/* Supported group 1 functions */
> > >>>sd->data[13] = 0x03;
> > >>> +
> > >>>for (i = 0; i < 6; i ++) {
> > >>>new_func = (arg >> (i * 4)) & 0x0f;
> > >>>if (mode && new_func != 0x0f)
> > >>>sd->function_group[i] = new_func;
> > >>> -sd->data[14 + (i >> 1)] = new_func << ((i * 4) & 4);
> > >>> +sd->data[16 - (i >> 1)] |= new_func << ((i % 2) * 4);
> > >>
> > >> This patch broke the orangepi machine, reproducible running
> > >> test_arm_orangepi_bionic:
> > >> https://www.mail-archive.com/qemu-devel@nongnu.org/msg739449.html
> > >>
> > >> Can you have a look?
> > >
> > > Yes, I can take a look. Could you please send more details on how to
> > > run "test_arm_orangepi_bionic"?
> >
> > Looking at the previous link, I think this should work:
> >
> > $ make check-venv qemu-system-arm
> > $ AVOCADO_ALLOW_LARGE_STORAGE=1 \
> >tests/venv/bin/python -m \
> >  avocado --show=app,console run \
> >run -t machine:orangepi-pc tests/acceptance
> >
>
> I tried the above command in my build tree, and got:
>
> avocado run: error: unrecognized arguments: tests/acceptance
> Perhaps a plugin is missing; run 'avocado plugins' to list the installed
> ones
>
> I switched to `make check-acceptance` which seems to work, however not
> for orangepi-pc related tests?
>
>  (09/32)
> tests/acceptance/boot_linux_console.py:BootLinuxConsole.test_arm_orangepi:
> SKIP: Test artifacts fetched from unreliable apt.armbian.com
>  (10/32)
> tests/acceptance/boot_linux_console.py:BootLinuxConsole.test_arm_orangepi_initrd:
> SKIP: Test artifacts fetched from unreliable apt.armbian.com
>  (11/32)
> tests/acceptance/boot_linux_console.py:BootLinuxConsole.test_arm_orangepi_sd:
> SKIP: Test artifacts fetched from unreliable apt.armbian.com
>  (12/32)
> tests/acceptance/boot_linux_console.py:BootLinuxConsole.test_arm_orangepi_bionic:
> SKIP: storage limited
>  (13/32)
> tests/acceptance/boot_linux_console.py:BootLinuxConsole.test_arm_orangepi_uboot_netbsd9:
> SKIP: storage limited
>
> Any ideas?
>
> Regards,
> Bin
>


Re: [PATCH v4 2/2] hw/block/nvme: add the dataset management command

2020-10-22 Thread Keith Busch
On Thu, Oct 22, 2020 at 09:33:13AM +0200, Klaus Jensen wrote:
> +if (--(*discards)) {
> +status = NVME_NO_COMPLETE;
> +} else {
> +g_free(discards);
> +req->opaque = NULL;

This case needs a

status = req->status;

So that we get the error set in the callback.

Otherwise, this looks fine. I am assuming everything still runs single
threaded since this isn't using atomics.
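A minimal model of the single-threaded pattern being reviewed — each per-range discard completion decrements a shared counter, and only the last one completes the request — can be sketched outside QEMU (hypothetical status values; this is not the actual nvme.c code):

```python
NVME_NO_COMPLETE = 0xffff        # assumed sentinel: "request still in flight"
NVME_SUCCESS = 0x0
NVME_INTERNAL_DEV_ERROR = 0x6    # stand-in for an error recorded by a callback

class Req:
    """Models NvmeRequest: `opaque` holds the shared outstanding-discard count."""
    def __init__(self, nranges):
        self.status = NVME_SUCCESS
        self.opaque = [nranges]

def discard_range_cb(req, ret):
    """Completion of one discarded LBA range."""
    if ret < 0:
        req.status = NVME_INTERNAL_DEV_ERROR  # record the error for later
    discards = req.opaque
    discards[0] -= 1
    if discards[0]:
        return NVME_NO_COMPLETE   # other ranges still in flight
    req.opaque = None             # last completion: drop the counter and
    return req.status             # surface the recorded status

req = Req(nranges=3)
assert discard_range_cb(req, 0) == NVME_NO_COMPLETE
assert discard_range_cb(req, -5) == NVME_NO_COMPLETE   # -EIO recorded, not lost
assert discard_range_cb(req, 0) == NVME_INTERNAL_DEV_ERROR
```

The `return req.status` in the last-completion branch corresponds to the `status = req->status;` Keith asks for; without it, the last completion would report success even though an earlier range failed.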



Re: [PATCH 15/16] block/nvme: Switch to using the MSIX API

2020-10-22 Thread Stefan Hajnoczi
On Tue, Oct 20, 2020 at 07:24:27PM +0200, Philippe Mathieu-Daudé wrote:
> In preparation for using multiple IRQs, switch to using the recently
> introduced MSIX API. Instead of allocating and assigning IRQ in
> a single step, we now have to use two distinct calls.
> 
> Signed-off-by: Philippe Mathieu-Daudé 
> ---
>  block/nvme.c | 14 --
>  1 file changed, 12 insertions(+), 2 deletions(-)

Reviewed-by: Stefan Hajnoczi 




Re: [PATCH 16/16] util/vfio-helpers: Remove now unused qemu_vfio_pci_init_irq()

2020-10-22 Thread Stefan Hajnoczi
On Tue, Oct 20, 2020 at 07:24:28PM +0200, Philippe Mathieu-Daudé wrote:
> Our only user, the NVMe block driver, switched to the MSIX API.
> As this function is now unused, remove it.
> 
> Signed-off-by: Philippe Mathieu-Daudé 
> ---
>  include/qemu/vfio-helpers.h |  2 --
>  util/vfio-helpers.c | 43 -
>  2 files changed, 45 deletions(-)

Reviewed-by: Stefan Hajnoczi 




Re: [PATCH 14/16] util/vfio-helpers: Introduce qemu_vfio_pci_msix_set_irq()

2020-10-22 Thread Stefan Hajnoczi
On Tue, Oct 20, 2020 at 07:24:26PM +0200, Philippe Mathieu-Daudé wrote:
> Introduce qemu_vfio_pci_msix_set_irq() to set the event
> notifier of a specific MSIX IRQ. All other registered IRQs
> are left unmodified.
> 
> Signed-off-by: Philippe Mathieu-Daudé 
> ---
>  include/qemu/vfio-helpers.h |  2 ++
>  util/vfio-helpers.c | 35 +++
>  util/trace-events   |  1 +
>  3 files changed, 38 insertions(+)

Reviewed-by: Stefan Hajnoczi 




Re: [PULL 22/23] hw/sd: Fix incorrect populated function switch status data structure

2020-10-22 Thread Bin Meng
Hi Philippe,

On Wed, Oct 21, 2020 at 6:07 PM Philippe Mathieu-Daudé  wrote:
>
> On 10/21/20 11:57 AM, Bin Meng wrote:
> > Hi Philippe,
> >
> > On Tue, Oct 20, 2020 at 11:18 PM Philippe Mathieu-Daudé  
> > wrote:
> >>
> >> Hi Bin,
> >>
> >> On 8/21/20 7:29 PM, Philippe Mathieu-Daudé wrote:
> >>> From: Bin Meng 
> >>>
> >>> At present the function switch status data structure bits [399:376]
> >>> are wrongly populated. These 3 bytes encode function switch status
> >>> for the 6 function groups, with 4 bits per group, starting from
> >>> function group 6 at bit 399, then followed by function group 5 at
> >>> bit 395, and so on.
> >>>
> >>> However the code mistakenly fills in the function group 1 status
> >>> at bit 399. This fixes the code logic.
> >>>
> >>> Fixes: a1bb27b1e9 ("SD card emulation (initial implementation)")
> >>> Signed-off-by: Bin Meng 
> >>> Reviewed-by: Philippe Mathieu-Daudé 
> >>> Tested-by: Sai Pavan Boddu 
> >>> Message-Id: <1598021136-49525-1-git-send-email-bmeng...@gmail.com>
> >>> Signed-off-by: Philippe Mathieu-Daudé 
> >>> ---
> >>>hw/sd/sd.c | 3 ++-
> >>>1 file changed, 2 insertions(+), 1 deletion(-)
> >>>
> >>> diff --git a/hw/sd/sd.c b/hw/sd/sd.c
> >>> index 7c9d956f113..805e21fc883 100644
> >>> --- a/hw/sd/sd.c
> >>> +++ b/hw/sd/sd.c
> >>> @@ -807,11 +807,12 @@ static void sd_function_switch(SDState *sd, 
> >>> uint32_t arg)
> >>>sd->data[11] = 0x43;
> >>>sd->data[12] = 0x80;/* Supported group 1 functions */
> >>>sd->data[13] = 0x03;
> >>> +
> >>>for (i = 0; i < 6; i ++) {
> >>>new_func = (arg >> (i * 4)) & 0x0f;
> >>>if (mode && new_func != 0x0f)
> >>>sd->function_group[i] = new_func;
> >>> -sd->data[14 + (i >> 1)] = new_func << ((i * 4) & 4);
> >>> +sd->data[16 - (i >> 1)] |= new_func << ((i % 2) * 4);
> >>
> >> This patch broke the orangepi machine, reproducible running
> >> test_arm_orangepi_bionic:
> >> https://www.mail-archive.com/qemu-devel@nongnu.org/msg739449.html
> >>
> >> Can you have a look?
> >
> > Yes, I can take a look. Could you please send more details on how to
> > run "test_arm_orangepi_bionic"?
>
> Looking at the previous link, I think this should work:
>
> $ make check-venv qemu-system-arm
> $ AVOCADO_ALLOW_LARGE_STORAGE=1 \
>tests/venv/bin/python -m \
>  avocado --show=app,console run \
>run -t machine:orangepi-pc tests/acceptance
>

I tried the above command in my build tree, and got:

avocado run: error: unrecognized arguments: tests/acceptance
Perhaps a plugin is missing; run 'avocado plugins' to list the installed ones

I switched to `make check-acceptance` which seems to work, however not
for orangepi-pc related tests?

 (09/32) 
tests/acceptance/boot_linux_console.py:BootLinuxConsole.test_arm_orangepi:
SKIP: Test artifacts fetched from unreliable apt.armbian.com
 (10/32) 
tests/acceptance/boot_linux_console.py:BootLinuxConsole.test_arm_orangepi_initrd:
SKIP: Test artifacts fetched from unreliable apt.armbian.com
 (11/32) 
tests/acceptance/boot_linux_console.py:BootLinuxConsole.test_arm_orangepi_sd:
SKIP: Test artifacts fetched from unreliable apt.armbian.com
 (12/32) 
tests/acceptance/boot_linux_console.py:BootLinuxConsole.test_arm_orangepi_bionic:
SKIP: storage limited
 (13/32) 
tests/acceptance/boot_linux_console.py:BootLinuxConsole.test_arm_orangepi_uboot_netbsd9:
SKIP: storage limited

Any ideas?

Regards,
Bin
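As a side note, the nibble placement the commit message describes can be double-checked with a small Python sketch, assuming the usual big-endian bit numbering of the 512-bit switch status (byte n holds bits [511-8n : 504-8n]):

```python
def pack_status(funcs):
    """Pack per-group switch results (funcs[0] = group 1 ... funcs[5] = group 6)
    the way the fixed sd_function_switch() does."""
    data = [0] * 64
    for i, new_func in enumerate(funcs):
        data[16 - (i >> 1)] |= (new_func & 0xf) << ((i % 2) * 4)
    return data

def nibble_at(data, top_bit):
    """Read the 4-bit field whose most significant bit is `top_bit`."""
    byte = (511 - top_bit) // 8
    shift = 4 if top_bit % 8 == 7 else 0
    return (data[byte] >> shift) & 0xf

data = pack_status([1, 2, 3, 4, 5, 6])  # give each group a distinct value
assert nibble_at(data, 399) == 6  # group 6 lands at bits [399:396]
assert nibble_at(data, 395) == 5  # group 5 at bits [395:392]
assert nibble_at(data, 379) == 1  # group 1 at bits [379:376]
```

The old expression indexed upward from byte 14 and used `=` rather than `|=`, so results landed in the wrong nibbles and overwrote one another, which is what the patch corrects.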



Re: [PATCH] MAINTAINERS: Cover "block/nvme.h" file

2020-10-22 Thread Philippe Mathieu-Daudé

On 10/22/20 4:20 PM, Philippe Mathieu-Daudé wrote:

Cc'ing qemu-trivial@


Bah it doesn't apply anymore, I'll resend.



On 7/1/20 4:06 PM, Philippe Mathieu-Daudé wrote:

The "block/nvme.h" header is shared by both the NVMe block
driver and the NVMe emulated device. Add the 'F:' entry on
both sections, so all maintainers/reviewers are notified
when it is changed.

Signed-off-by: Philippe Mathieu-Daudé 





Re: [PATCH 13/16] util/vfio-helpers: Introduce qemu_vfio_pci_msix_init_irqs()

2020-10-22 Thread Stefan Hajnoczi
On Tue, Oct 20, 2020 at 07:24:25PM +0200, Philippe Mathieu-Daudé wrote:
> qemu_vfio_pci_init_irq() allows us to initialize any type of IRQ,
> but only one. Introduce qemu_vfio_pci_msix_init_irqs() which is
> specific to the MSIX IRQ type, and allows us to use multiple IRQs
> (thus passing multiple eventfd notifiers).
> All eventfd notifiers are initialized with the special '-1' value
> meaning "un-assigned".
> 
> Signed-off-by: Philippe Mathieu-Daudé 
> ---
>  include/qemu/vfio-helpers.h |  6 +++-
>  util/vfio-helpers.c | 65 -
>  util/trace-events   |  1 +
>  3 files changed, 70 insertions(+), 2 deletions(-)

Reviewed-by: Stefan Hajnoczi 




Re: [PATCH 09/16] util/vfio-helpers: Convert vfio_dump_mapping to trace events

2020-10-22 Thread Stefan Hajnoczi
On Tue, Oct 20, 2020 at 07:24:21PM +0200, Philippe Mathieu-Daudé wrote:
> The QEMU_VFIO_DEBUG definition is only modifiable at build-time.
> Trace events can be enabled at run-time. As we prefer the latter,
> convert qemu_vfio_dump_mappings() to use trace events instead
> of fprintf().
> 
> Reviewed-by: Fam Zheng 
> Signed-off-by: Philippe Mathieu-Daudé 
> ---
>  util/vfio-helpers.c | 19 ---
>  util/trace-events   |  1 +
>  2 files changed, 5 insertions(+), 15 deletions(-)

Reviewed-by: Stefan Hajnoczi 




Re: [PATCH 12/16] util/vfio-helpers: Let qemu_vfio_verify_mappings() use error_report()

2020-10-22 Thread Stefan Hajnoczi
On Tue, Oct 20, 2020 at 07:24:24PM +0200, Philippe Mathieu-Daudé wrote:
> Instead of displaying the error on stderr, use error_report()
> which also reports to the monitor.
> 
> Reviewed-by: Fam Zheng 
> Signed-off-by: Philippe Mathieu-Daudé 
> ---
>  util/vfio-helpers.c | 4 ++--
>  1 file changed, 2 insertions(+), 2 deletions(-)

Reviewed-by: Stefan Hajnoczi 




Re: [PATCH 11/16] util/vfio-helpers: Let qemu_vfio_do_mapping() propagate Error

2020-10-22 Thread Stefan Hajnoczi
On Tue, Oct 20, 2020 at 07:24:23PM +0200, Philippe Mathieu-Daudé wrote:
> Pass qemu_vfio_do_mapping() an Error* argument so it can propagate
> any error to callers. Replace error_report(), which only reports
> to the monitor, with the more generic error_setg_errno().
> 
> Reviewed-by: Fam Zheng 
> Signed-off-by: Philippe Mathieu-Daudé 
> ---
>  util/vfio-helpers.c | 8 
>  1 file changed, 4 insertions(+), 4 deletions(-)

Reviewed-by: Stefan Hajnoczi 




Re: [PATCH 10/16] util/vfio-helpers: Let qemu_vfio_dma_map() propagate Error

2020-10-22 Thread Stefan Hajnoczi
On Tue, Oct 20, 2020 at 07:24:22PM +0200, Philippe Mathieu-Daudé wrote:
> Currently qemu_vfio_dma_map() displays errors on stderr.
> When using the management interface, this information is simply
> lost. Pass qemu_vfio_dma_map() an Error* argument so it can
> propagate the error to callers.
> 
> Reviewed-by: Fam Zheng 
> Signed-off-by: Philippe Mathieu-Daudé 
> ---
>  include/qemu/vfio-helpers.h |  2 +-
>  block/nvme.c| 14 +++---
>  util/vfio-helpers.c | 12 +++-
>  3 files changed, 15 insertions(+), 13 deletions(-)

Reviewed-by: Stefan Hajnoczi 




Re: [PATCH 08/16] util/vfio-helpers: Improve DMA trace events

2020-10-22 Thread Stefan Hajnoczi
On Tue, Oct 20, 2020 at 07:24:20PM +0200, Philippe Mathieu-Daudé wrote:
> For debugging purpose, trace where DMA regions are mapped.
> 
> Reviewed-by: Fam Zheng 
> Signed-off-by: Philippe Mathieu-Daudé 
> ---
>  util/vfio-helpers.c | 3 ++-
>  util/trace-events   | 5 +++--
>  2 files changed, 5 insertions(+), 3 deletions(-)

Reviewed-by: Stefan Hajnoczi 




Re: [PATCH] MAINTAINERS: Cover "block/nvme.h" file

2020-10-22 Thread Philippe Mathieu-Daudé

Cc'ing qemu-trivial@

On 7/1/20 4:06 PM, Philippe Mathieu-Daudé wrote:

The "block/nvme.h" header is shared by both the NVMe block
driver and the NVMe emulated device. Add the 'F:' entry on
both sections, so all maintainers/reviewers are notified
when it is changed.

Signed-off-by: Philippe Mathieu-Daudé 
---
Cc: Fam Zheng 
Cc: Keith Busch 
Cc: Kevin Wolf 
Cc: Max Reitz 
Cc: Niklas Cassel 
Cc: Klaus Jensen 
Cc: Klaus Jensen 
Cc: Maxim Levitsky 
Cc: Dmitry Fomichev 
Cc: Andrzej Jakowski 
---
  MAINTAINERS | 2 ++
  1 file changed, 2 insertions(+)

diff --git a/MAINTAINERS b/MAINTAINERS
index dec252f38b..9995cdc805 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -1795,6 +1795,7 @@ M: Keith Busch 
  L: qemu-block@nongnu.org
  S: Supported
  F: hw/block/nvme*
+F: include/block/nvme.h
  F: tests/qtest/nvme-test.c
  
  megasas

@@ -2804,6 +2805,7 @@ M: Fam Zheng 
  L: qemu-block@nongnu.org
  S: Supported
  F: block/nvme*
+F: include/block/nvme.h
  
  Bootdevice

  M: Gonglei 






Re: [PATCH 05/16] util/vfio-helpers: Trace PCI I/O config accesses

2020-10-22 Thread Stefan Hajnoczi
On Tue, Oct 20, 2020 at 07:24:17PM +0200, Philippe Mathieu-Daudé wrote:
> We sometimes get a kernel panic with some devices on Aarch64
> hosts. Alex Williamson suggests it might be a broken PCIe
> root complex. Add a trace event to record the latest I/O
> access before crashing. In any case, assert our accesses are
> aligned.
> 
> Reviewed-by: Fam Zheng 
> Signed-off-by: Philippe Mathieu-Daudé 
> ---
>  util/vfio-helpers.c | 8 
>  util/trace-events   | 2 ++
>  2 files changed, 10 insertions(+)

Reviewed-by: Stefan Hajnoczi 




Re: [PATCH 06/16] util/vfio-helpers: Trace PCI BAR region info

2020-10-22 Thread Stefan Hajnoczi
On Tue, Oct 20, 2020 at 07:24:18PM +0200, Philippe Mathieu-Daudé wrote:
> For debug purpose, trace BAR regions info.
> 
> Reviewed-by: Fam Zheng 
> Signed-off-by: Philippe Mathieu-Daudé 
> ---
>  util/vfio-helpers.c | 8 
>  util/trace-events   | 1 +
>  2 files changed, 9 insertions(+)

Reviewed-by: Stefan Hajnoczi 




Re: [PATCH 07/16] util/vfio-helpers: Trace where BARs are mapped

2020-10-22 Thread Stefan Hajnoczi
On Tue, Oct 20, 2020 at 07:24:19PM +0200, Philippe Mathieu-Daudé wrote:
> For debugging purpose, trace where a BAR is mapped.
> 
> Reviewed-by: Fam Zheng 
> Signed-off-by: Philippe Mathieu-Daudé 
> ---
>  util/vfio-helpers.c | 2 ++
>  util/trace-events   | 1 +
>  2 files changed, 3 insertions(+)

Reviewed-by: Stefan Hajnoczi 




Re: [PULL v2 00/28] Block patches

2020-10-22 Thread Peter Maydell
On Thu, 22 Oct 2020 at 12:27, Stefan Hajnoczi  wrote:
>
> The following changes since commit ac793156f650ae2d77834932d72224175ee69086:
>
>   Merge remote-tracking branch 
> 'remotes/pmaydell/tags/pull-target-arm-20201020-1' into staging (2020-10-20 
> 21:11:35 +0100)
>
> are available in the Git repository at:
>
>   https://gitlab.com/stefanha/qemu.git tags/block-pull-request
>
> for you to fetch changes up to 32a3fd65e7e3551337fd26bfc0e2f899d70c028c:
>
>   iotests: add commit top->base cases to 274 (2020-10-22 09:55:39 +0100)
>
> 
> Pull request
>
> v2:
>  * Fix format string issues on 32-bit hosts [Peter]
>  * Fix qemu-nbd.c CONFIG_POSIX ifdef issue [Eric]
>  * Fix missing eventfd.h header on macOS [Peter]
>  * Drop unreliable vhost-user-blk test (will send a new patch when ready) 
> [Peter]
>
> This pull request contains the vhost-user-blk server by Coiby Xu along with my
> additions, block/nvme.c alignment and hardware error statistics by Philippe
> Mathieu-Daudé, and bdrv_co_block_status_above() fixes by Vladimir
> Sementsov-Ogievskiy.

Fails to link on FreeBSD, NetBSD, OpenBSD, OSX:

ld: error: undefined symbol: blk_exp_vhost_user_blk
>>> referenced by export.c:58 (../src/block/export/export.c:58)
>>>   block_export_export.c.o:(blk_exp_add) in archive 
>>> libblockdev.fa
c++: error: linker command failed with exit code 1 (use -v to see invocation)

thanks
-- PMM



Re: [PATCH 03/16] util/vfio-helpers: Pass minimum page size to qemu_vfio_open_pci()

2020-10-22 Thread Stefan Hajnoczi
On Tue, Oct 20, 2020 at 07:24:15PM +0200, Philippe Mathieu-Daudé wrote:
> @@ -724,7 +725,7 @@ static int nvme_init(BlockDriverState *bs, const char *device, int namespace,
>  goto out;
>  }
>  
> -s->page_size = MAX(4096, 1u << (12 + NVME_CAP_MPSMIN(cap)));
> +s->page_size = MAX(min_page_size, 1u << (12 + NVME_CAP_MPSMIN(cap)));

Is there a guarantee that the NVMe drive supports our min_page_size?

Stefan




[PATCH v3 3/9] block-backend: add I/O hang timeout

2020-10-22 Thread Jiahui Cen
Not all errors can be recovered from, so it is better to add a rehandle
timeout for I/O hang.

Signed-off-by: Jiahui Cen 
Signed-off-by: Ying Fang 
---
 block/block-backend.c  | 99 +-
 include/sysemu/block-backend.h |  2 +
 2 files changed, 100 insertions(+), 1 deletion(-)

diff --git a/block/block-backend.c b/block/block-backend.c
index 8050669d23..90fcc678b5 100644
--- a/block/block-backend.c
+++ b/block/block-backend.c
@@ -38,6 +38,11 @@ static AioContext *blk_aiocb_get_aio_context(BlockAIOCB *acb);
 /* block backend rehandle timer interval 5s */
 #define BLOCK_BACKEND_REHANDLE_TIMER_INTERVAL   5000
 
+enum BlockIOHangStatus {
+BLOCK_IO_HANG_STATUS_NORMAL = 0,
+BLOCK_IO_HANG_STATUS_HANG,
+};
+
 typedef struct BlockBackendRehandleInfo {
 bool enable;
 QEMUTimer *ts;
@@ -109,6 +114,11 @@ struct BlockBackend {
 unsigned int in_flight;
 
 BlockBackendRehandleInfo reinfo;
+
+int64_t iohang_timeout; /* The I/O hang timeout value in sec. */
+int64_t iohang_time;/* The I/O hang start time */
+bool is_iohang_timeout;
+int iohang_status;
 };
 
 typedef struct BlockBackendAIOCB {
@@ -2481,20 +2491,107 @@ static void blk_rehandle_timer_cb(void *opaque)
 aio_context_release(blk_get_aio_context(blk));
 }
 
+static bool blk_iohang_handle(BlockBackend *blk, int new_status)
+{
+int64_t now;
+int old_status = blk->iohang_status;
+bool need_rehandle = false;
+
+switch (new_status) {
+case BLOCK_IO_HANG_STATUS_NORMAL:
+if (old_status == BLOCK_IO_HANG_STATUS_HANG) {
+/* Case when I/O Hang is recovered */
+blk->is_iohang_timeout = false;
+blk->iohang_time = 0;
+}
+break;
+case BLOCK_IO_HANG_STATUS_HANG:
+if (old_status != BLOCK_IO_HANG_STATUS_HANG) {
+/* Case when I/O hang is first triggered */
+blk->iohang_time = qemu_clock_get_ms(QEMU_CLOCK_REALTIME) / 1000;
+need_rehandle = true;
+} else {
+if (!blk->is_iohang_timeout) {
+now = qemu_clock_get_ms(QEMU_CLOCK_REALTIME) / 1000;
+if (now >= (blk->iohang_time + blk->iohang_timeout)) {
+/* Case when I/O hang is timeout */
+blk->is_iohang_timeout = true;
+} else {
+/* Case when I/O hang is continued */
+need_rehandle = true;
+}
+}
+}
+break;
+default:
+break;
+}
+
+blk->iohang_status = new_status;
+return need_rehandle;
+}
+
+static bool blk_rehandle_aio(BlkAioEmAIOCB *acb, bool *has_timeout)
+{
+bool need_rehandle = false;
+
+/* Rehandle aio which returns EIO before hang timeout */
+if (acb->rwco.ret == -EIO) {
+if (acb->rwco.blk->is_iohang_timeout) {
+/* I/O hang has timeout and not recovered */
+*has_timeout = true;
+} else {
+need_rehandle = blk_iohang_handle(acb->rwco.blk,
+  BLOCK_IO_HANG_STATUS_HANG);
+/* I/O hang timeout first trigger */
+if (acb->rwco.blk->is_iohang_timeout) {
+*has_timeout = true;
+}
+}
+}
+
+return need_rehandle;
+}
+
 static void blk_rehandle_aio_complete(BlkAioEmAIOCB *acb)
 {
+bool has_timeout = false;
+bool need_rehandle = false;
+
 if (acb->has_returned) {
 blk_dec_in_flight(acb->rwco.blk);
-if (acb->rwco.ret == -EIO) {
+need_rehandle = blk_rehandle_aio(acb, &has_timeout);
+if (need_rehandle) {
 blk_rehandle_insert_aiocb(acb->rwco.blk, acb);
 return;
 }
 
 acb->common.cb(acb->common.opaque, acb->rwco.ret);
+
+/* I/O hang return to normal status */
+if (!has_timeout) {
+blk_iohang_handle(acb->rwco.blk, BLOCK_IO_HANG_STATUS_NORMAL);
+}
+
 qemu_aio_unref(acb);
 }
 }
 
+void blk_iohang_init(BlockBackend *blk, int64_t iohang_timeout)
+{
+if (!blk) {
+return;
+}
+
+blk->is_iohang_timeout = false;
+blk->iohang_time = 0;
+blk->iohang_timeout = 0;
+blk->iohang_status = BLOCK_IO_HANG_STATUS_NORMAL;
+if (iohang_timeout > 0) {
+blk->iohang_timeout = iohang_timeout;
+}
+}
+
 void blk_register_buf(BlockBackend *blk, void *host, size_t size)
 {
 bdrv_register_buf(blk_bs(blk), host, size);
diff --git a/include/sysemu/block-backend.h b/include/sysemu/block-backend.h
index 8203d7f6f9..bfebe3a960 100644
--- a/include/sysemu/block-backend.h
+++ b/include/sysemu/block-backend.h
@@ -268,4 +268,6 @@ const BdrvChild *blk_root(BlockBackend *blk);
 
 int blk_make_empty(BlockBackend *blk, Error **errp);
 
+void blk_iohang_init(BlockBackend *blk, int64_t iohang_timeout);
+
 #endif
-- 
2.19.1




Re: [PATCH 02/16] util/vfio-helpers: Improve reporting unsupported IOMMU type

2020-10-22 Thread Stefan Hajnoczi
On Tue, Oct 20, 2020 at 07:24:14PM +0200, Philippe Mathieu-Daudé wrote:
> Change the confusing "VFIO IOMMU check failed" error message to
> the explicit "VFIO IOMMU Type1 is not supported" one.
> 
> Example on POWER:
> 
>  $ qemu-system-ppc64 -drive 
> if=none,id=nvme0,file=nvme://0001:01:00.0/1,format=raw
>  qemu-system-ppc64: -drive 
> if=none,id=nvme0,file=nvme://0001:01:00.0/1,format=raw: VFIO IOMMU Type1 is 
> not supported
> 
> Suggested-by: Alex Williamson 
> Reviewed-by: Fam Zheng 
> Signed-off-by: Philippe Mathieu-Daudé 
> ---
>  util/vfio-helpers.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)

Reviewed-by: Stefan Hajnoczi 




Re: [PATCH 01/16] block/nvme: Correct minimum device page size

2020-10-22 Thread Stefan Hajnoczi
On Tue, Oct 20, 2020 at 07:24:13PM +0200, Philippe Mathieu-Daudé wrote:
> While trying to simplify the code using a macro, we forgot
> the 12-bit shift... Correct that.
> 
> Fixes: fad1eb68862 ("block/nvme: Use register definitions from 
> 'block/nvme.h'")
> Reported-by: Eric Auger 
> Signed-off-by: Philippe Mathieu-Daudé 
> ---
>  block/nvme.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)

Reviewed-by: Stefan Hajnoczi 




[PATCH 2/2] nvme: fix queue identifier validation

2020-10-22 Thread Klaus Jensen
From: Gollu Appalanaidu 

The nvme_check_{sq,cq} functions check that the given queue identifier is
valid *and* that the queue exists. Thus, the function return value
cannot simply be inverted to check that the identifier is valid and that
the queue does *not* exist.

Replace the call with an OR'ed version of the checks.

Signed-off-by: Gollu Appalanaidu 
Signed-off-by: Klaus Jensen 
---
 hw/block/nvme.c | 6 --
 1 file changed, 4 insertions(+), 2 deletions(-)

diff --git a/hw/block/nvme.c b/hw/block/nvme.c
index 5dfef0204c2c..fa2cba744b57 100644
--- a/hw/block/nvme.c
+++ b/hw/block/nvme.c
@@ -1143,7 +1143,8 @@ static uint16_t nvme_create_sq(NvmeCtrl *n, NvmeRequest *req)
 trace_pci_nvme_err_invalid_create_sq_cqid(cqid);
 return NVME_INVALID_CQID | NVME_DNR;
 }
-if (unlikely(!sqid || !nvme_check_sqid(n, sqid))) {
+if (unlikely(!sqid || sqid > n->params.max_ioqpairs ||
+n->sq[sqid] != NULL)) {
 trace_pci_nvme_err_invalid_create_sq_sqid(sqid);
 return NVME_INVALID_QID | NVME_DNR;
 }
@@ -1398,7 +1399,8 @@ static uint16_t nvme_create_cq(NvmeCtrl *n, NvmeRequest *req)
 trace_pci_nvme_create_cq(prp1, cqid, vector, qsize, qflags,
  NVME_CQ_FLAGS_IEN(qflags) != 0);
 
-if (unlikely(!cqid || !nvme_check_cqid(n, cqid))) {
+if (unlikely(!cqid || cqid > n->params.max_ioqpairs ||
+n->cq[cqid] != NULL)) {
 trace_pci_nvme_err_invalid_create_cq_cqid(cqid);
 return NVME_INVALID_QID | NVME_DNR;
 }
-- 
2.28.0
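The pitfall the commit message describes can be reproduced in a few lines of Python (a sketch with made-up limits, mirroring the C truth-value convention in which nvme_check_sqid() returns 0 for "valid and exists"):

```python
MAX_IOQPAIRS = 64
sq = {}  # currently existing I/O submission queues: id -> queue

def check_sqid(sqid):
    """Mimic nvme_check_sqid(): 0 (C-falsy) if the id is in range AND the
    queue exists, -1 (C-truthy) otherwise."""
    return 0 if 0 < sqid <= MAX_IOQPAIRS and sqid in sq else -1

def rejected_old(sqid):
    # old check: `!sqid || !nvme_check_sqid(n, sqid)`
    return not sqid or not check_sqid(sqid)

def rejected_new(sqid):
    # fixed check: each condition spelled out explicitly
    return not sqid or sqid > MAX_IOQPAIRS or sqid in sq

sq[1] = object()
assert rejected_old(1) and rejected_new(1)          # id in use: both reject
assert not rejected_old(2) and not rejected_new(2)  # free id: both accept
assert not rejected_old(65)  # BUG: out-of-range id passes the old check
assert rejected_new(65)      # the fix rejects it before indexing n->sq[]
```

Inverting "valid AND exists" yields "invalid OR non-existent", which accepts out-of-range identifiers; spelling out each condition, as the patch does, avoids indexing past the queue array.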




[PATCH v3 4/9] block-backend: add I/O rehandle pause/unpause

2020-10-22 Thread Jiahui Cen
Sometimes there is no need to rehandle AIOs even though I/O hang is enabled.
For example, when deleting a block backend, we have to wait for AIOs to
complete by calling blk_drain(), but we do not care about the results. So a
pause interface for I/O hang is helpful to bypass the rehandle mechanism.

Signed-off-by: Jiahui Cen 
Signed-off-by: Ying Fang 
---
 block/block-backend.c  | 60 +++---
 include/sysemu/block-backend.h |  2 ++
 2 files changed, 58 insertions(+), 4 deletions(-)

diff --git a/block/block-backend.c b/block/block-backend.c
index 90fcc678b5..c16d95a2c9 100644
--- a/block/block-backend.c
+++ b/block/block-backend.c
@@ -37,6 +37,9 @@ static AioContext *blk_aiocb_get_aio_context(BlockAIOCB *acb);
 
 /* block backend rehandle timer interval 5s */
 #define BLOCK_BACKEND_REHANDLE_TIMER_INTERVAL   5000
+#define BLOCK_BACKEND_REHANDLE_NORMAL   1
+#define BLOCK_BACKEND_REHANDLE_DRAIN_REQUESTED  2
+#define BLOCK_BACKEND_REHANDLE_DRAINED  3
 
 enum BlockIOHangStatus {
 BLOCK_IO_HANG_STATUS_NORMAL = 0,
@@ -50,6 +53,8 @@ typedef struct BlockBackendRehandleInfo {
 
 unsigned int in_flight;
 QTAILQ_HEAD(, BlkAioEmAIOCB) re_aios;
+
+int status;
 } BlockBackendRehandleInfo;
 
 typedef struct BlockBackendAioNotifier {
@@ -2461,6 +2466,51 @@ static void blk_rehandle_remove_aiocb(BlockBackend *blk, BlkAioEmAIOCB *acb)
qatomic_dec(&blk->reinfo.in_flight);
 }
 
+static void blk_rehandle_drain(BlockBackend *blk)
+{
+if (blk_bs(blk)) {
+bdrv_drained_begin(blk_bs(blk));
+BDRV_POLL_WHILE(blk_bs(blk), qatomic_read(&blk->reinfo.in_flight) > 0);
+bdrv_drained_end(blk_bs(blk));
+}
+}
+
+static bool blk_rehandle_is_paused(BlockBackend *blk)
+{
+return blk->reinfo.status == BLOCK_BACKEND_REHANDLE_DRAIN_REQUESTED ||
+   blk->reinfo.status == BLOCK_BACKEND_REHANDLE_DRAINED;
+}
+
+void blk_rehandle_pause(BlockBackend *blk)
+{
+BlockBackendRehandleInfo *reinfo = &blk->reinfo;
+
+aio_context_acquire(blk_get_aio_context(blk));
+if (!reinfo->enable || reinfo->status == BLOCK_BACKEND_REHANDLE_DRAINED) {
+aio_context_release(blk_get_aio_context(blk));
+return;
+}
+
+reinfo->status = BLOCK_BACKEND_REHANDLE_DRAIN_REQUESTED;
+blk_rehandle_drain(blk);
+reinfo->status = BLOCK_BACKEND_REHANDLE_DRAINED;
+aio_context_release(blk_get_aio_context(blk));
+}
+
+void blk_rehandle_unpause(BlockBackend *blk)
+{
+BlockBackendRehandleInfo *reinfo = &blk->reinfo;
+
+aio_context_acquire(blk_get_aio_context(blk));
+if (!reinfo->enable || reinfo->status == BLOCK_BACKEND_REHANDLE_NORMAL) {
+aio_context_release(blk_get_aio_context(blk));
+return;
+}
+
+reinfo->status = BLOCK_BACKEND_REHANDLE_NORMAL;
+aio_context_release(blk_get_aio_context(blk));
+}
+
 static void blk_rehandle_timer_cb(void *opaque)
 {
 BlockBackend *blk = opaque;
@@ -2560,10 +2610,12 @@ static void blk_rehandle_aio_complete(BlkAioEmAIOCB *acb)
 
 if (acb->has_returned) {
 blk_dec_in_flight(acb->rwco.blk);
-need_rehandle = blk_rehandle_aio(acb, &has_timeout);
-if (need_rehandle) {
-blk_rehandle_insert_aiocb(acb->rwco.blk, acb);
-return;
+if (!blk_rehandle_is_paused(acb->rwco.blk)) {
+need_rehandle = blk_rehandle_aio(acb, &has_timeout);
+if (need_rehandle) {
+blk_rehandle_insert_aiocb(acb->rwco.blk, acb);
+return;
+}
 }
 
 acb->common.cb(acb->common.opaque, acb->rwco.ret);
diff --git a/include/sysemu/block-backend.h b/include/sysemu/block-backend.h
index bfebe3a960..391a047444 100644
--- a/include/sysemu/block-backend.h
+++ b/include/sysemu/block-backend.h
@@ -268,6 +268,8 @@ const BdrvChild *blk_root(BlockBackend *blk);
 
 int blk_make_empty(BlockBackend *blk, Error **errp);
 
+void blk_rehandle_pause(BlockBackend *blk);
+void blk_rehandle_unpause(BlockBackend *blk);
 void blk_iohang_init(BlockBackend *blk, int64_t iohang_timeout);
 
 #endif
-- 
2.19.1




[PATCH v3 0/9] block-backend: Introduce I/O hang

2020-10-22 Thread Jiahui Cen
A VM in the cloud environment may use a virtual disk as the backend storage,
and there are usually filesystems on the virtual block device. When the backend
storage is temporarily down, any I/O issued to the virtual block device will
cause an error. For example, an error in an ext4 filesystem would make the
filesystem read-only. However, cloud backend storage can often be recovered
soon; for example, an IP-SAN may be down due to network failure and come back
online soon after the network is recovered. The error in the filesystem may not
be recoverable without a device reattach or a system restart, so an I/O
rehandle mechanism is needed to implement self-healing.

This patch series proposes a feature called I/O hang. It can rehandle AIOs
that fail with EIO without sending the error back to the guest. From the
guest's point of view, it just looks as if an I/O is hanging rather than
returning. With this feature enabled, the guest can get back to running
smoothly once the I/O is recovered.

v2->v3:
* Add a doc to describe I/O hang.

v1->v2:
* Rebase to fix compile problems.
* Fix incorrect remove of rehandle list.
* Provide rehandle pause interface.

Jiahui Cen (9):
  block-backend: introduce I/O rehandle info
  block-backend: rehandle block aios when EIO
  block-backend: add I/O hang timeout
  block-backend: add I/O rehandle pause/unpause
  block-backend: enable I/O hang when timeout is set
  virtio-blk: pause I/O hang when resetting
  qemu-option: add I/O hang timeout option
  qapi: add I/O hang and I/O hang timeout qapi event
  docs: add a doc about I/O hang

 block/block-backend.c  | 300 +
 blockdev.c |  11 ++
 docs/io-hang.rst   |  45 +
 hw/block/virtio-blk.c  |   8 +
 include/sysemu/block-backend.h |   5 +
 qapi/block-core.json   |  26 +++
 6 files changed, 395 insertions(+)
 create mode 100644 docs/io-hang.rst

-- 
2.19.1




[PATCH v3 5/9] block-backend: enable I/O hang when timeout is set

2020-10-22 Thread Jiahui Cen
Setting a non-zero I/O hang timeout indicates that I/O hang is enabled for
the block backend. When the block backend is going to be deleted, we should
disable I/O hang.

Signed-off-by: Jiahui Cen 
Signed-off-by: Ying Fang 
---
 block/block-backend.c  | 40 ++
 include/sysemu/block-backend.h |  1 +
 2 files changed, 41 insertions(+)

diff --git a/block/block-backend.c b/block/block-backend.c
index c16d95a2c9..c812b3a9c7 100644
--- a/block/block-backend.c
+++ b/block/block-backend.c
@@ -34,6 +34,7 @@
#define NOT_DONE 0x7fff /* used while emulated sync operation in progress */
 
 static AioContext *blk_aiocb_get_aio_context(BlockAIOCB *acb);
+static void blk_rehandle_disable(BlockBackend *blk);
 
 /* block backend rehandle timer interval 5s */
 #define BLOCK_BACKEND_REHANDLE_TIMER_INTERVAL   5000
@@ -476,6 +477,8 @@ static void blk_delete(BlockBackend *blk)
 assert(!blk->refcnt);
 assert(!blk->name);
 assert(!blk->dev);
+assert(qatomic_read(&blk->reinfo.in_flight) == 0);
+blk_rehandle_disable(blk);
 if (blk->public.throttle_group_member.throttle_state) {
 blk_io_limits_disable(blk);
 }
@@ -2629,6 +2632,42 @@ static void blk_rehandle_aio_complete(BlkAioEmAIOCB *acb)
 }
 }
 
+static void blk_rehandle_enable(BlockBackend *blk)
+{
+BlockBackendRehandleInfo *reinfo = &blk->reinfo;
+
+aio_context_acquire(blk_get_aio_context(blk));
+if (reinfo->enable) {
+aio_context_release(blk_get_aio_context(blk));
+return;
+}
+
+reinfo->ts = aio_timer_new(blk_get_aio_context(blk), QEMU_CLOCK_REALTIME,
+   SCALE_MS, blk_rehandle_timer_cb, blk);
+reinfo->timer_interval_ms = BLOCK_BACKEND_REHANDLE_TIMER_INTERVAL;
+reinfo->status = BLOCK_BACKEND_REHANDLE_NORMAL;
+reinfo->enable = true;
+aio_context_release(blk_get_aio_context(blk));
+}
+
+static void blk_rehandle_disable(BlockBackend *blk)
+{
+if (!blk->reinfo.enable) {
+return;
+}
+
+blk_rehandle_pause(blk);
+timer_del(blk->reinfo.ts);
+timer_free(blk->reinfo.ts);
+blk->reinfo.ts = NULL;
+blk->reinfo.enable = false;
+}
+
+bool blk_iohang_is_enabled(BlockBackend *blk)
+{
+return blk->iohang_timeout != 0;
+}
+
 void blk_iohang_init(BlockBackend *blk, int64_t iohang_timeout)
 {
 if (!blk) {
@@ -2641,6 +2680,7 @@ void blk_iohang_init(BlockBackend *blk, int64_t iohang_timeout)
 blk->iohang_status = BLOCK_IO_HANG_STATUS_NORMAL;
 if (iohang_timeout > 0) {
 blk->iohang_timeout = iohang_timeout;
+blk_rehandle_enable(blk);
 }
 }
 
diff --git a/include/sysemu/block-backend.h b/include/sysemu/block-backend.h
index 391a047444..851b90b99b 100644
--- a/include/sysemu/block-backend.h
+++ b/include/sysemu/block-backend.h
@@ -270,6 +270,7 @@ int blk_make_empty(BlockBackend *blk, Error **errp);
 
 void blk_rehandle_pause(BlockBackend *blk);
 void blk_rehandle_unpause(BlockBackend *blk);
+bool blk_iohang_is_enabled(BlockBackend *blk);
 void blk_iohang_init(BlockBackend *blk, int64_t iohang_timeout);
 
 #endif
-- 
2.19.1




[PATCH v3 1/9] block-backend: introduce I/O rehandle info

2020-10-22 Thread Jiahui Cen
The I/O hang feature is realized based on a rehandle mechanism.
Each block backend will have a list to store hanging block AIOs,
and a timer to regularly resend these AIOs. In order to issue
the AIOs again, each block AIO also needs to store its coroutine entry.

Signed-off-by: Jiahui Cen 
Signed-off-by: Ying Fang 
---
 block/block-backend.c | 19 +++
 1 file changed, 19 insertions(+)

diff --git a/block/block-backend.c b/block/block-backend.c
index ce78d30794..b8367d82cc 100644
--- a/block/block-backend.c
+++ b/block/block-backend.c
@@ -35,6 +35,18 @@
 
 static AioContext *blk_aiocb_get_aio_context(BlockAIOCB *acb);
 
+/* block backend rehandle timer interval 5s */
+#define BLOCK_BACKEND_REHANDLE_TIMER_INTERVAL   5000
+
+typedef struct BlockBackendRehandleInfo {
+bool enable;
+QEMUTimer *ts;
+unsigned timer_interval_ms;
+
+unsigned int in_flight;
+QTAILQ_HEAD(, BlkAioEmAIOCB) re_aios;
+} BlockBackendRehandleInfo;
+
 typedef struct BlockBackendAioNotifier {
 void (*attached_aio_context)(AioContext *new_context, void *opaque);
 void (*detach_aio_context)(void *opaque);
@@ -95,6 +107,8 @@ struct BlockBackend {
  * Accessed with atomic ops.
  */
 unsigned int in_flight;
+
+BlockBackendRehandleInfo reinfo;
 };
 
 typedef struct BlockBackendAIOCB {
@@ -350,6 +364,7 @@ BlockBackend *blk_new(AioContext *ctx, uint64_t perm, uint64_t shared_perm)
qemu_co_queue_init(&blk->queued_requests);
notifier_list_init(&blk->remove_bs_notifiers);
notifier_list_init(&blk->insert_bs_notifiers);
+
QLIST_INIT(&blk->aio_notifiers);
 
QTAILQ_INSERT_TAIL(&block_backends, blk, link);
@@ -1392,6 +1407,10 @@ typedef struct BlkAioEmAIOCB {
 BlkRwCo rwco;
 int bytes;
 bool has_returned;
+
+/* for rehandle */
+CoroutineEntry *co_entry;
+QTAILQ_ENTRY(BlkAioEmAIOCB) list;
 } BlkAioEmAIOCB;
 
 static AioContext *blk_aio_em_aiocb_get_aio_context(BlockAIOCB *acb_)
-- 
2.19.1




[PATCH 1/2] nvme: fix create IO SQ/CQ status codes

2020-10-22 Thread Klaus Jensen
From: Gollu Appalanaidu 

Replace the Invalid Field in Command with the Invalid PRP Offset status
code in the nvme_create_{cq,sq} functions. Also, allow PRP1 to be
address 0x0.

Also replace the Completion Queue Invalid status code returned in
nvme_create_cq when the queue identifier is invalid with the Invalid
Queue Identifier status code. The Completion Queue Invalid status code is
exclusively for indicating that the completion queue identifier given
when creating a submission queue is invalid.

See NVM Express v1.3d, Section 5.3 ("Create I/O Completion Queue
command") and Section 5.4 ("Create I/O Submission Queue command").

Signed-off-by: Gollu Appalanaidu 
Signed-off-by: Klaus Jensen 
---
 hw/block/nvme.c | 10 +-
 1 file changed, 5 insertions(+), 5 deletions(-)

diff --git a/hw/block/nvme.c b/hw/block/nvme.c
index 2896bb49b9c0..5dfef0204c2c 100644
--- a/hw/block/nvme.c
+++ b/hw/block/nvme.c
@@ -1151,9 +1151,9 @@ static uint16_t nvme_create_sq(NvmeCtrl *n, NvmeRequest *req)
 trace_pci_nvme_err_invalid_create_sq_size(qsize);
 return NVME_MAX_QSIZE_EXCEEDED | NVME_DNR;
 }
-if (unlikely(!prp1 || prp1 & (n->page_size - 1))) {
+if (unlikely(prp1 & (n->page_size - 1))) {
 trace_pci_nvme_err_invalid_create_sq_addr(prp1);
-return NVME_INVALID_FIELD | NVME_DNR;
+return NVME_INVALID_PRP_OFFSET | NVME_DNR;
 }
if (unlikely(!(NVME_SQ_FLAGS_PC(qflags)))) {
 trace_pci_nvme_err_invalid_create_sq_qflags(NVME_SQ_FLAGS_PC(qflags));
@@ -1400,15 +1400,15 @@ static uint16_t nvme_create_cq(NvmeCtrl *n, NvmeRequest *req)
 
 if (unlikely(!cqid || !nvme_check_cqid(n, cqid))) {
 trace_pci_nvme_err_invalid_create_cq_cqid(cqid);
-return NVME_INVALID_CQID | NVME_DNR;
+return NVME_INVALID_QID | NVME_DNR;
 }
 if (unlikely(!qsize || qsize > NVME_CAP_MQES(n->bar.cap))) {
 trace_pci_nvme_err_invalid_create_cq_size(qsize);
 return NVME_MAX_QSIZE_EXCEEDED | NVME_DNR;
 }
-if (unlikely(!prp1)) {
+if (unlikely(prp1 & (n->page_size - 1))) {
 trace_pci_nvme_err_invalid_create_cq_addr(prp1);
-return NVME_INVALID_FIELD | NVME_DNR;
+return NVME_INVALID_PRP_OFFSET | NVME_DNR;
 }
if (unlikely(!msix_enabled(&n->parent_obj) && vector)) {
 trace_pci_nvme_err_invalid_create_cq_vector(vector);
-- 
2.28.0




[PATCH v3 8/9] qapi: add I/O hang and I/O hang timeout qapi event

2020-10-22 Thread Jiahui Cen
Sometimes hypervisor management tools like libvirt may need to monitor
I/O hang events. Let's report I/O hang and I/O hang timeout events via QAPI.

Signed-off-by: Jiahui Cen 
Signed-off-by: Ying Fang 
---
 block/block-backend.c |  3 +++
 qapi/block-core.json  | 26 ++
 2 files changed, 29 insertions(+)

diff --git a/block/block-backend.c b/block/block-backend.c
index c812b3a9c7..42337ceb04 100644
--- a/block/block-backend.c
+++ b/block/block-backend.c
@@ -2556,6 +2556,7 @@ static bool blk_iohang_handle(BlockBackend *blk, int new_status)
 /* Case when I/O Hang is recovered */
 blk->is_iohang_timeout = false;
 blk->iohang_time = 0;
+qapi_event_send_block_io_hang(false);
 }
 break;
 case BLOCK_IO_HANG_STATUS_HANG:
@@ -2563,12 +2564,14 @@ static bool blk_iohang_handle(BlockBackend *blk, int new_status)
 /* Case when I/O hang is first triggered */
 blk->iohang_time = qemu_clock_get_ms(QEMU_CLOCK_REALTIME) / 1000;
 need_rehandle = true;
+qapi_event_send_block_io_hang(true);
 } else {
 if (!blk->is_iohang_timeout) {
 now = qemu_clock_get_ms(QEMU_CLOCK_REALTIME) / 1000;
 if (now >= (blk->iohang_time + blk->iohang_timeout)) {
 /* Case when I/O hang is timeout */
 blk->is_iohang_timeout = true;
+qapi_event_send_block_io_hang_timeout(true);
 } else {
 /* Case when I/O hang is continued */
 need_rehandle = true;
diff --git a/qapi/block-core.json b/qapi/block-core.json
index ee5ebef7f2..709514498c 100644
--- a/qapi/block-core.json
+++ b/qapi/block-core.json
@@ -5379,3 +5379,29 @@
 { 'command': 'blockdev-snapshot-delete-internal-sync',
   'data': { 'device': 'str', '*id': 'str', '*name': 'str'},
   'returns': 'SnapshotInfo' }
+
+##
+# @BLOCK_IO_HANG:
+#
+# Emitted when a device I/O hang begins or ends
+#
+# @set: true if the I/O hang begins; false if it ends.
+#
+# Since: 5.2
+#
+##
+{ 'event': 'BLOCK_IO_HANG',
+  'data': { 'set': 'bool' }}
+
+##
+# @BLOCK_IO_HANG_TIMEOUT:
+#
+# Emitted when the device I/O hang timeout is set or cleared
+#
+# @set: true if set; false if cleared.
+#
+# Since: 5.2
+#
+##
+{ 'event': 'BLOCK_IO_HANG_TIMEOUT',
+  'data': { 'set': 'bool' }}
-- 
2.19.1




[PATCH v3 6/9] virtio-blk: pause I/O hang when resetting

2020-10-22 Thread Jiahui Cen
When resetting virtio-blk, we have to drain all AIOs but do not care about
their results. So it is necessary to pause I/O hang rehandling before
resetting virtio-blk, and unpause it after resetting.

Signed-off-by: Jiahui Cen 
Signed-off-by: Ying Fang 
---
 hw/block/virtio-blk.c | 8 
 1 file changed, 8 insertions(+)

diff --git a/hw/block/virtio-blk.c b/hw/block/virtio-blk.c
index bac2d6fa2b..f96e4ac274 100644
--- a/hw/block/virtio-blk.c
+++ b/hw/block/virtio-blk.c
@@ -899,6 +899,10 @@ static void virtio_blk_reset(VirtIODevice *vdev)
 AioContext *ctx;
 VirtIOBlockReq *req;
 
+if (blk_iohang_is_enabled(s->blk)) {
+blk_rehandle_pause(s->blk);
+}
+
 ctx = blk_get_aio_context(s->blk);
 aio_context_acquire(ctx);
 blk_drain(s->blk);
@@ -916,6 +920,10 @@ static void virtio_blk_reset(VirtIODevice *vdev)
 
 assert(!s->dataplane_started);
 blk_set_enable_write_cache(s->blk, s->original_wce);
+
+if (blk_iohang_is_enabled(s->blk)) {
+blk_rehandle_unpause(s->blk);
+}
 }
 
 /* coalesce internal state, copy to pci i/o region 0
-- 
2.19.1




[PATCH v3 9/9] docs: add a doc about I/O hang

2020-10-22 Thread Jiahui Cen
Give some details about the I/O hang feature and how to use it.

Signed-off-by: Jiahui Cen 
Signed-off-by: Ying Fang 
---
 docs/io-hang.rst | 45 +
 1 file changed, 45 insertions(+)
 create mode 100644 docs/io-hang.rst

diff --git a/docs/io-hang.rst b/docs/io-hang.rst
new file mode 100644
index 00..53356ceed2
--- /dev/null
+++ b/docs/io-hang.rst
@@ -0,0 +1,45 @@
+========
+I/O hang
+========
+
+Introduction
+
+
+I/O hang is a mechanism to automatically rehandle AIOs when an error occurs on
+the backend block device, in a way that is imperceptible to the guest. If the
+error on the backend device is temporary, e.g. NFS returning EIO due to
+network fluctuations, the AIOs will be resent automatically once the device
+recovers and will complete successfully. If the error on the device is
+unrecoverable, the failed AIOs will be returned to the guest after the
+rehandle timeout.
+
+This mechanism can provide self-recovery and high availability to the VM. From
+the guest's point of view, AIOs sent to the device hang for a time and then
+finish, and any other unrelated service in the guest can work as usual.
+
+Implementation
+==============
+
+Each BlockBackend will have a list to store AIOs which need to be rehandled
+and a timer to monitor the timeout.
+
+If I/O hang is enabled, all returned AIOs will be checked and the failed ones
+will be inserted into the BlockBackend's list. The timer will be started and
+the AIOs in the list will be resent to the backend device. Once an AIO on
+this backend device succeeds, it will be returned back to the guest.
+Otherwise, the AIOs will be rehandled periodically until the timeout expires.
+
+I/O hang provides interfaces to drain all the AIOs in the BlockBackend's
+list. There are some situations where the rehandling AIOs must be drained,
+e.g. when resetting virtio-blk.
+
+I/O hang also provides QAPI events, which can be used by management tools
+like libvirt for monitoring.
+
+Examples
+========
+
+Enable I/O hang when launching QEMU::
+
+  $ qemu-system-x86_64  \
+  -machine pc,accel=kvm -smp 1 -m 2048  \
+  -drive file=shared.img,format=raw,cache=none,aio=native,iohang-timeout=60
+
-- 
2.19.1




[PATCH v3 2/9] block-backend: rehandle block aios when EIO

2020-10-22 Thread Jiahui Cen
When a backend device temporarily does not respond, like a network disk down
due to some network fault, any I/O to the corresponding virtual block device
in the VM would return an I/O error. If the hypervisor returns the error to
the VM, the filesystem on this block device may not work as usual. And in many
situations, the returned error is often an EIO.

To avoid this unavailability, we can store the failed AIOs and resend them
later. If the error is temporary, the retries can succeed and the AIOs can
be successfully completed.

Signed-off-by: Jiahui Cen 
Signed-off-by: Ying Fang 
---
 block/block-backend.c | 89 +++
 1 file changed, 89 insertions(+)

diff --git a/block/block-backend.c b/block/block-backend.c
index b8367d82cc..8050669d23 100644
--- a/block/block-backend.c
+++ b/block/block-backend.c
@@ -365,6 +365,12 @@ BlockBackend *blk_new(AioContext *ctx, uint64_t perm, uint64_t shared_perm)
notifier_list_init(&blk->remove_bs_notifiers);
notifier_list_init(&blk->insert_bs_notifiers);
 
+/* for rehandle */
+blk->reinfo.enable = false;
+blk->reinfo.ts = NULL;
+qatomic_set(&blk->reinfo.in_flight, 0);
+QTAILQ_INIT(&blk->reinfo.re_aios);
+
QLIST_INIT(&blk->aio_notifiers);
 
QTAILQ_INSERT_TAIL(&block_backends, blk, link);
@@ -1425,8 +1431,16 @@ static const AIOCBInfo blk_aio_em_aiocb_info = {
 .get_aio_context= blk_aio_em_aiocb_get_aio_context,
 };
 
+static void blk_rehandle_timer_cb(void *opaque);
+static void blk_rehandle_aio_complete(BlkAioEmAIOCB *acb);
+
 static void blk_aio_complete(BlkAioEmAIOCB *acb)
 {
+if (acb->rwco.blk->reinfo.enable) {
+blk_rehandle_aio_complete(acb);
+return;
+}
+
 if (acb->has_returned) {
 acb->common.cb(acb->common.opaque, acb->rwco.ret);
 blk_dec_in_flight(acb->rwco.blk);
@@ -1459,6 +1473,7 @@ static BlockAIOCB *blk_aio_prwv(BlockBackend *blk, int64_t offset, int bytes,
 .ret= NOT_DONE,
 };
 acb->bytes = bytes;
+acb->co_entry = co_entry;
 acb->has_returned = false;
 
 co = qemu_coroutine_create(co_entry, acb);
@@ -2054,6 +2069,20 @@ static int blk_do_set_aio_context(BlockBackend *blk, AioContext *new_context,
 throttle_group_attach_aio_context(tgm, new_context);
 bdrv_drained_end(bs);
 }
+
+if (blk->reinfo.enable) {
+if (blk->reinfo.ts) {
+timer_del(blk->reinfo.ts);
+timer_free(blk->reinfo.ts);
+}
+blk->reinfo.ts = aio_timer_new(new_context, QEMU_CLOCK_REALTIME,
+   SCALE_MS, blk_rehandle_timer_cb,
+   blk);
+if (qatomic_read(&blk->reinfo.in_flight)) {
+timer_mod(blk->reinfo.ts,
+  qemu_clock_get_ms(QEMU_CLOCK_REALTIME));
+}
+}
 }
 
 blk->ctx = new_context;
@@ -2406,6 +2435,66 @@ static void blk_root_drained_end(BdrvChild *child, int *drained_end_counter)
 }
 }
 
+static void blk_rehandle_insert_aiocb(BlockBackend *blk, BlkAioEmAIOCB *acb)
+{
+assert(blk->reinfo.enable);
+
+qatomic_inc(&blk->reinfo.in_flight);
+QTAILQ_INSERT_TAIL(&blk->reinfo.re_aios, acb, list);
+timer_mod(blk->reinfo.ts, qemu_clock_get_ms(QEMU_CLOCK_REALTIME) +
+  blk->reinfo.timer_interval_ms);
+}
+
+static void blk_rehandle_remove_aiocb(BlockBackend *blk, BlkAioEmAIOCB *acb)
+{
+QTAILQ_REMOVE(&blk->reinfo.re_aios, acb, list);
+qatomic_dec(&blk->reinfo.in_flight);
+}
+
+static void blk_rehandle_timer_cb(void *opaque)
+{
+BlockBackend *blk = opaque;
+BlockBackendRehandleInfo *reinfo = &blk->reinfo;
+BlkAioEmAIOCB *acb, *tmp;
+Coroutine *co;
+
+aio_context_acquire(blk_get_aio_context(blk));
+QTAILQ_FOREACH_SAFE(acb, &reinfo->re_aios, list, tmp) {
+if (acb->rwco.ret == NOT_DONE) {
+continue;
+}
+
+blk_inc_in_flight(acb->rwco.blk);
+acb->rwco.ret = NOT_DONE;
+acb->has_returned = false;
+blk_rehandle_remove_aiocb(acb->rwco.blk, acb);
+
+co = qemu_coroutine_create(acb->co_entry, acb);
+qemu_coroutine_enter(co);
+
+acb->has_returned = true;
+if (acb->rwco.ret != NOT_DONE) {
+replay_bh_schedule_oneshot_event(blk_get_aio_context(blk),
+ blk_aio_complete_bh, acb);
+}
+}
+aio_context_release(blk_get_aio_context(blk));
+}
+
+static void blk_rehandle_aio_complete(BlkAioEmAIOCB *acb)
+{
+if (acb->has_returned) {
+blk_dec_in_flight(acb->rwco.blk);
+if (acb->rwco.ret == -EIO) {
+blk_rehandle_insert_aiocb(acb->rwco.blk, acb);
+return;
+}
+
+acb->common.cb(acb->common.opaque, acb->rwco.ret);
+qemu_aio_unref(acb);
+}
+}
+
 void blk_register_buf(BlockBackend *blk, void *host, size_t size)
 {
 bdrv_register_buf(blk_bs(blk), host, size);
-- 
2.19.1




[PATCH v3 7/9] qemu-option: add I/O hang timeout option

2020-10-22 Thread Jiahui Cen
The I/O hang timeout should differ under different situations. So it is
better to provide an option for the user to determine the I/O hang timeout
for each block device.

Signed-off-by: Jiahui Cen 
Signed-off-by: Ying Fang 
---
 blockdev.c | 11 +++
 1 file changed, 11 insertions(+)

diff --git a/blockdev.c b/blockdev.c
index fe6fb5dc1d..0a77eedfc4 100644
--- a/blockdev.c
+++ b/blockdev.c
@@ -501,6 +501,7 @@ static BlockBackend *blockdev_init(const char *file, QDict *bs_opts,
 BlockdevDetectZeroesOptions detect_zeroes =
 BLOCKDEV_DETECT_ZEROES_OPTIONS_OFF;
 const char *throttling_group = NULL;
+int64_t iohang_timeout = 0;
 
 /* Check common options by copying from bs_opts to opts, all other options
  * stay in bs_opts for processing by bdrv_open(). */
@@ -623,6 +624,12 @@ static BlockBackend *blockdev_init(const char *file, QDict *bs_opts,
 
 bs->detect_zeroes = detect_zeroes;
 
+/* init timeout value for I/O Hang */
+iohang_timeout = qemu_opt_get_number(opts, "iohang-timeout", 0);
+if (iohang_timeout > 0) {
+blk_iohang_init(blk, iohang_timeout);
+}
+
 block_acct_setup(blk_get_stats(blk), account_invalid, account_failed);
 
 if (!parse_stats_intervals(blk_get_stats(blk), interval_list, errp)) {
@@ -3796,6 +3803,10 @@ QemuOptsList qemu_common_drive_opts = {
 .type = QEMU_OPT_BOOL,
 .help = "whether to account for failed I/O operations "
 "in the statistics",
+},{
+.name = "iohang-timeout",
+.type = QEMU_OPT_NUMBER,
+.help = "timeout value for I/O Hang",
 },
 { /* end of list */ }
 },
-- 
2.19.1




[PATCH 0/2] hw/block/nvme: two fixes for create sq/cq

2020-10-22 Thread Klaus Jensen
From: Klaus Jensen 

The first patch is a follow up to "hw/block/nvme: fix prp mapping status
codes" and fixes some status codes in the nvme_create_{sq,cq} functions.

The second patch fixes a faulty check on the given queue identifier.

Gollu Appalanaidu (2):
  nvme: fix create IO SQ/CQ status codes
  nvme: fix queue identifier validation

 hw/block/nvme.c | 16 +---
 1 file changed, 9 insertions(+), 7 deletions(-)

-- 
2.28.0




Re: [PATCH v2] hw/block/nvme: fix prp mapping status codes

2020-10-22 Thread Klaus Jensen
On Oct 19 11:20, Keith Busch wrote:
> On Mon, Oct 19, 2020 at 07:35:38PM +0200, Klaus Jensen wrote:
> > From: Gollu Appalanaidu 
> > 
> > Address 0 is not an invalid address. Remove those invalid checks.
> > 
> > Unaligned PRP2 and PRP list entries should result in Invalid PRP Offset
> > status code and not Invalid Field. Fix that.
> > 
> > See NVM Express v1.3d, Section 4.3 ("Physical Region Page Entry and
> > List").
> 
> Looks good to me.
> 
> Reviewed-by: Keith Busch 
> 

Thanks, added to nvme-next.


signature.asc
Description: PGP signature


Re: [PULL v2 00/28] Block patches

2020-10-22 Thread no-reply
Patchew URL: https://patchew.org/QEMU/20201022112726.736757-1-stefa...@redhat.com/



Hi,

This series seems to have some coding style problems. See output below for
more information:

Type: series
Message-id: 20201022112726.736757-1-stefa...@redhat.com
Subject: [PULL v2 00/28] Block patches

=== TEST SCRIPT BEGIN ===
#!/bin/bash
git rev-parse base > /dev/null || exit 0
git config --local diff.renamelimit 0
git config --local diff.renames True
git config --local diff.algorithm histogram
./scripts/checkpatch.pl --mailback base..
=== TEST SCRIPT END ===

Updating 3c8cf5a9c21ff8782164d1def7f44bd888713384
From https://github.com/patchew-project/qemu
 * [new tag] patchew/20201022112726.736757-1-stefa...@redhat.com -> patchew/20201022112726.736757-1-stefa...@redhat.com
Switched to a new branch 'test'
070cb87 iotests: add commit top->base cases to 274
2381a9d block/io: fix bdrv_is_allocated_above
5314155 block/io: bdrv_common_block_status_above: support bs == base
824d6f6 block/io: bdrv_common_block_status_above: support include_base
8f13830 block/io: fix bdrv_co_block_status_above
384ba22 block/export: add vhost-user-blk multi-queue support
e10972d block/export: add iothread and fixed-iothread options
531d73b block: move block exports to libblockdev
13e8546 qemu-storage-daemon: avoid compiling blockdev_ss twice
9144c2b util/vhost-user-server: use static library in meson.build
9afaa76 util/vhost-user-server: move header to include/
1ded925 block/export: convert vhost-user-blk server to block export API
5bfa726 block/export: report flush errors
4565966 util/vhost-user-server: rework vu_client_trip() coroutine lifecycle
7d3e499 util/vhost-user-server: check EOF when reading payload
719ab6d util/vhost-user-server: fix memory leak in vu_message_read()
0afbb2c util/vhost-user-server: drop unused DevicePanicNotifier
9c3a1b30 block/export: consolidate request structs into VuBlockReq
d7f9068 util/vhost-user-server: drop unnecessary watch deletion
eb038288 util/vhost-user-server: drop unnecessary QOM cast
b6a8c41 util/vhost-user-server: s/fileds/fields/ typo fix
ef06bf4 MAINTAINERS: Add vhost-user block device backend server maintainer
3fed4d9 block/export: vhost-user block device backend server
5570b61 block: move logical block size check function to a common utility function
b03a5d2 util/vhost-user-server: generic vhost user server
a940b88 libvhost-user: remove watch for kick_fd when de-initialize vu-dev
b367b3a libvhost-user: Allow vu_message_read to be replaced
d61af88 block/nvme: Add driver statistics for access alignment and hw errors

=== OUTPUT BEGIN ===
1/28 Checking commit d61af8856e1d (block/nvme: Add driver statistics for access alignment and hw errors)
2/28 Checking commit b367b3a0c954 (libvhost-user: Allow vu_message_read to be replaced)
WARNING: Block comments use a leading /* on a separate line
#130: FILE: contrib/libvhost-user/libvhost-user.h:395:
+/* @read_msg: custom method to read vhost-user message

total: 0 errors, 1 warnings, 139 lines checked

Patch 2/28 has style problems, please review.  If any of these errors
are false positives report them to the maintainer, see
CHECKPATCH in MAINTAINERS.
3/28 Checking commit a940b885ebfc (libvhost-user: remove watch for kick_fd when de-initialize vu-dev)
4/28 Checking commit b03a5d2b5417 (util/vhost-user-server: generic vhost user server)
WARNING: added, moved or deleted file(s), does MAINTAINERS need updating?
#35: 
new file mode 100644

WARNING: line over 80 characters
#87: FILE: util/vhost-user-server.c:48:
+/* When this is set vu_client_trip will stop new processing vhost-user message */

total: 0 errors, 2 warnings, 500 lines checked

Patch 4/28 has style problems, please review.  If any of these errors
are false positives report them to the maintainer, see
CHECKPATCH in MAINTAINERS.
5/28 Checking commit 5570b616e366 (block: move logical block size check function to a common utility function)
WARNING: added, moved or deleted file(s), does MAINTAINERS need updating?
#85: 
new file mode 100644

total: 0 errors, 1 warnings, 129 lines checked

Patch 5/28 has style problems, please review.  If any of these errors
are false positives report them to the maintainer, see
CHECKPATCH in MAINTAINERS.
6/28 Checking commit 3fed4d90a1a6 (block/export: vhost-user block device backend server)
WARNING: added, moved or deleted file(s), does MAINTAINERS need updating?
#30: 
new file mode 100644

WARNING: line over 80 characters
#476: FILE: block/export/vhost-user-blk-server.c:442:
+blk_remove_aio_context_notifier(vu_block_device->backend, blk_aio_attached,

ERROR: g_free(NULL) is safe this check is probably not required
#536: FILE: block/export/vhost-user-blk-server.c:502:
+if (vus->node_name) {
+g_free(vus->node_name);

total: 1 errors, 2 warnings, 714 lines checked

Patch 6/28 has style problems, please review.  If any of these errors
are false positives report them to the maintainer, see
CHECKPATCH in MAINTAINERS.

7/28 

[PULL v2 22/28] block/export: add iothread and fixed-iothread options

2020-10-22 Thread Stefan Hajnoczi
Make it possible to specify the iothread where the export will run. By
default the block node can be moved to other AioContexts later and the
export will follow. The fixed-iothread option forces strict behavior
that prevents changing AioContext while the export is active. See the
QAPI docs for details.

Signed-off-by: Stefan Hajnoczi 
Message-id: 20200929125516.186715-5-stefa...@redhat.com
[Fix stray '#' character in block-export.json and add missing "(since:
5.2)" as suggested by Eric Blake.
--Stefan]
Signed-off-by: Stefan Hajnoczi 
---
 qapi/block-export.json   | 11 ++
 block/export/export.c| 31 +++-
 block/export/vhost-user-blk-server.c |  5 -
 nbd/server.c |  2 --
 4 files changed, 45 insertions(+), 4 deletions(-)

diff --git a/qapi/block-export.json b/qapi/block-export.json
index a793e34af9..8a4ced817f 100644
--- a/qapi/block-export.json
+++ b/qapi/block-export.json
@@ -219,11 +219,22 @@
 #export before completion is signalled. (since: 5.2;
 #default: false)
 #
+# @iothread: The name of the iothread object where the export will run. The
+#default is to use the thread currently associated with the
+#block node. (since: 5.2)
+#
+# @fixed-iothread: True prevents the block node from being moved to another
+#  thread while the export is active. If true and @iothread is
+#  given, export creation fails if the block node cannot be
+#  moved to the iothread. The default is false. (since: 5.2)
+#
 # Since: 4.2
 ##
 { 'union': 'BlockExportOptions',
   'base': { 'type': 'BlockExportType',
 'id': 'str',
+   '*fixed-iothread': 'bool',
+   '*iothread': 'str',
 'node-name': 'str',
 '*writable': 'bool',
 '*writethrough': 'bool' },
diff --git a/block/export/export.c b/block/export/export.c
index 550897e236..a5b6b02703 100644
--- a/block/export/export.c
+++ b/block/export/export.c
@@ -15,6 +15,7 @@
 
 #include "block/block.h"
 #include "sysemu/block-backend.h"
+#include "sysemu/iothread.h"
 #include "block/export.h"
 #include "block/nbd.h"
 #include "qapi/error.h"
@@ -63,10 +64,11 @@ static const BlockExportDriver *blk_exp_find_driver(BlockExportType type)
 
 BlockExport *blk_exp_add(BlockExportOptions *export, Error **errp)
 {
+bool fixed_iothread = export->has_fixed_iothread && export->fixed_iothread;
 const BlockExportDriver *drv;
 BlockExport *exp = NULL;
 BlockDriverState *bs;
-BlockBackend *blk;
+BlockBackend *blk = NULL;
 AioContext *ctx;
 uint64_t perm;
 int ret;
@@ -102,6 +104,28 @@ BlockExport *blk_exp_add(BlockExportOptions *export, Error **errp)
 ctx = bdrv_get_aio_context(bs);
 aio_context_acquire(ctx);
 
+if (export->has_iothread) {
+IOThread *iothread;
+AioContext *new_ctx;
+
+iothread = iothread_by_id(export->iothread);
+if (!iothread) {
+error_setg(errp, "iothread \"%s\" not found", export->iothread);
+goto fail;
+}
+
+new_ctx = iothread_get_aio_context(iothread);
+
+ret = bdrv_try_set_aio_context(bs, new_ctx, errp);
+if (ret == 0) {
+aio_context_release(ctx);
+aio_context_acquire(new_ctx);
+ctx = new_ctx;
+} else if (fixed_iothread) {
+goto fail;
+}
+}
+
 /*
  * Block exports are used for non-shared storage migration. Make sure
  * that BDRV_O_INACTIVE is cleared and the image is ready for write
@@ -116,6 +140,11 @@ BlockExport *blk_exp_add(BlockExportOptions *export, Error **errp)
 }
 
 blk = blk_new(ctx, perm, BLK_PERM_ALL);
+
+if (!fixed_iothread) {
+blk_set_allow_aio_context_change(blk, true);
+}
+
 ret = blk_insert_bs(blk, bs, errp);
 if (ret < 0) {
 goto fail;
diff --git a/block/export/vhost-user-blk-server.c b/block/export/vhost-user-blk-server.c
index f7021cbd7b..286eb5fb9a 100644
--- a/block/export/vhost-user-blk-server.c
+++ b/block/export/vhost-user-blk-server.c
@@ -323,13 +323,17 @@ static const VuDevIface vu_blk_iface = {
 static void blk_aio_attached(AioContext *ctx, void *opaque)
 {
 VuBlkExport *vexp = opaque;
+
+vexp->export.ctx = ctx;
vhost_user_server_attach_aio_context(&vexp->vu_server, ctx);
 }
 
 static void blk_aio_detach(void *opaque)
 {
 VuBlkExport *vexp = opaque;
+
+vhost_user_server_detach_aio_context(&vexp->vu_server);
+vexp->export.ctx = NULL;
 }
 
 static void
@@ -384,7 +388,6 @@ static int vu_blk_exp_create(BlockExport *exp, BlockExportOptions *opts,
vu_blk_initialize_config(blk_bs(exp->blk), &vexp->blkcfg,
logical_block_size);
 
-blk_set_allow_aio_context_change(exp->blk, true);
 blk_add_aio_context_notifier(exp->blk, blk_aio_attached, blk_aio_detach,
  vexp);
 
diff --git 

[PULL v2 21/28] block: move block exports to libblockdev

2020-10-22 Thread Stefan Hajnoczi
Block exports are used by softmmu, qemu-storage-daemon, and qemu-nbd.
They are not used by other programs and are not otherwise needed in
libblock.

Undo the recent move of blockdev-nbd.c from blockdev_ss into block_ss.
Since bdrv_close_all() (libblock) calls blk_exp_close_all()
(libblockdev), a stub function is required.

Make qemu-nbd.c use signal handling utility functions instead of
duplicating the code. This helps because os-posix.c is in libblockdev
and it depends on a qemu_system_killed() symbol that qemu-nbd.c lacks.
Once we use the signal handling utility functions we also end up
providing the necessary symbol.

Signed-off-by: Stefan Hajnoczi 
Reviewed-by: Paolo Bonzini 
Reviewed-by: Eric Blake 
Message-id: 20200929125516.186715-4-stefa...@redhat.com
[Fixed s/ndb/nbd/ typo in commit description as suggested by Eric Blake
--Stefan]
Signed-off-by: Stefan Hajnoczi 
---
 qemu-nbd.c| 21 -
 stubs/blk-exp-close-all.c |  7 +++
 block/export/meson.build  |  4 ++--
 meson.build   |  4 ++--
 nbd/meson.build   |  2 ++
 stubs/meson.build |  1 +
 6 files changed, 22 insertions(+), 17 deletions(-)
 create mode 100644 stubs/blk-exp-close-all.c

diff --git a/qemu-nbd.c b/qemu-nbd.c
index bc644a0670..a0701cdf36 100644
--- a/qemu-nbd.c
+++ b/qemu-nbd.c
@@ -25,6 +25,7 @@
 #include "qapi/error.h"
 #include "qemu/cutils.h"
 #include "sysemu/block-backend.h"
+#include "sysemu/runstate.h" /* for qemu_system_killed() prototype */
 #include "block/block_int.h"
 #include "block/nbd.h"
 #include "qemu/main-loop.h"
@@ -155,7 +156,11 @@ QEMU_COPYRIGHT "\n"
 }
 
 #ifdef CONFIG_POSIX
-static void termsig_handler(int signum)
+/*
+ * The client thread uses SIGTERM to interrupt the server.  A signal
+ * handler ensures that "qemu-nbd -v -c" exits with a nice status code.
+ */
+void qemu_system_killed(int signum, pid_t pid)
 {
 qatomic_cmpxchg(&state, RUNNING, TERMINATE);
 qemu_notify_event();
@@ -582,18 +587,8 @@ int main(int argc, char **argv)
 BlockExportOptions *export_opts;
 
 #ifdef CONFIG_POSIX
-/*
- * Exit gracefully on various signals, which includes SIGTERM used
- * by 'qemu-nbd -v -c'.
- */
-struct sigaction sa_sigterm;
-memset(&sa_sigterm, 0, sizeof(sa_sigterm));
-sa_sigterm.sa_handler = termsig_handler;
-sigaction(SIGTERM, &sa_sigterm, NULL);
-sigaction(SIGINT, &sa_sigterm, NULL);
-sigaction(SIGHUP, &sa_sigterm, NULL);
-
-signal(SIGPIPE, SIG_IGN);
+os_setup_early_signal_handling();
+os_setup_signal_handling();
 #endif
 
 socket_init();
diff --git a/stubs/blk-exp-close-all.c b/stubs/blk-exp-close-all.c
new file mode 100644
index 00..1c71316763
--- /dev/null
+++ b/stubs/blk-exp-close-all.c
@@ -0,0 +1,7 @@
+#include "qemu/osdep.h"
+#include "block/export.h"
+
+/* Only used in programs that support block exports (libblockdev.fa) */
+void blk_exp_close_all(void)
+{
+}
diff --git a/block/export/meson.build b/block/export/meson.build
index fffe6b09e8..9fb4fbf81d 100644
--- a/block/export/meson.build
+++ b/block/export/meson.build
@@ -1,2 +1,2 @@
-block_ss.add(files('export.c'))
-block_ss.add(when: ['CONFIG_LINUX', 'CONFIG_VHOST_USER'], if_true: 
files('vhost-user-blk-server.c'))
+blockdev_ss.add(files('export.c'))
+blockdev_ss.add(when: ['CONFIG_LINUX', 'CONFIG_VHOST_USER'], if_true: 
files('vhost-user-blk-server.c'))
diff --git a/meson.build b/meson.build
index 880683407f..b349c9bda8 100644
--- a/meson.build
+++ b/meson.build
@@ -1443,7 +1443,6 @@ subdir('dump')
 
 block_ss.add(files(
   'block.c',
-  'blockdev-nbd.c',
   'blockjob.c',
   'job.c',
   'qemu-io-cmds.c',
@@ -1456,6 +1455,7 @@ subdir('block')
 
 blockdev_ss.add(files(
   'blockdev.c',
+  'blockdev-nbd.c',
   'iothread.c',
   'job-qmp.c',
 ))
@@ -1832,7 +1832,7 @@ if have_tools
   qemu_io = executable('qemu-io', files('qemu-io.c'),
  dependencies: [block, qemuutil], install: true)
   qemu_nbd = executable('qemu-nbd', files('qemu-nbd.c'),
-   dependencies: [block, qemuutil], install: true)
+   dependencies: [blockdev, qemuutil], install: true)
 
   subdir('storage-daemon')
   subdir('contrib/rdmacm-mux')
diff --git a/nbd/meson.build b/nbd/meson.build
index 0c00a776d3..2baaa36948 100644
--- a/nbd/meson.build
+++ b/nbd/meson.build
@@ -1,5 +1,7 @@
 block_ss.add(files(
   'client.c',
   'common.c',
+))
+blockdev_ss.add(files(
   'server.c',
 ))
diff --git a/stubs/meson.build b/stubs/meson.build
index 67f2a8c069..7b733fadb7 100644
--- a/stubs/meson.build
+++ b/stubs/meson.build
@@ -1,6 +1,7 @@
 stub_ss.add(files('arch_type.c'))
 stub_ss.add(files('bdrv-next-monitor-owned.c'))
 stub_ss.add(files('blk-commit-all.c'))
+stub_ss.add(files('blk-exp-close-all.c'))
 stub_ss.add(files('blockdev-close-all-bdrv-states.c'))
 stub_ss.add(files('change-state-handler.c'))
 stub_ss.add(files('cmos.c'))
-- 
2.26.2



[PULL v2 25/28] block/io: bdrv_common_block_status_above: support include_base

2020-10-22 Thread Stefan Hajnoczi
From: Vladimir Sementsov-Ogievskiy 

In order to reuse bdrv_common_block_status_above in
bdrv_is_allocated_above, let's support include_base parameter.

Signed-off-by: Vladimir Sementsov-Ogievskiy 
Reviewed-by: Alberto Garcia 
Reviewed-by: Eric Blake 
Message-id: 20200924194003.22080-3-vsement...@virtuozzo.com
Signed-off-by: Stefan Hajnoczi 
---
 block/coroutines.h |  2 ++
 block/io.c | 21 ++---
 2 files changed, 16 insertions(+), 7 deletions(-)

diff --git a/block/coroutines.h b/block/coroutines.h
index f69179f5ef..1cb3128b94 100644
--- a/block/coroutines.h
+++ b/block/coroutines.h
@@ -41,6 +41,7 @@ bdrv_pwritev(BdrvChild *child, int64_t offset, unsigned int 
bytes,
 int coroutine_fn
 bdrv_co_common_block_status_above(BlockDriverState *bs,
   BlockDriverState *base,
+  bool include_base,
   bool want_zero,
   int64_t offset,
   int64_t bytes,
@@ -50,6 +51,7 @@ bdrv_co_common_block_status_above(BlockDriverState *bs,
 int generated_co_wrapper
 bdrv_common_block_status_above(BlockDriverState *bs,
BlockDriverState *base,
+   bool include_base,
bool want_zero,
int64_t offset,
int64_t bytes,
diff --git a/block/io.c b/block/io.c
index a718d50ca2..86f76d04bf 100644
--- a/block/io.c
+++ b/block/io.c
@@ -2343,6 +2343,7 @@ early_out:
 int coroutine_fn
 bdrv_co_common_block_status_above(BlockDriverState *bs,
   BlockDriverState *base,
+  bool include_base,
   bool want_zero,
   int64_t offset,
   int64_t bytes,
@@ -2354,10 +2355,11 @@ bdrv_co_common_block_status_above(BlockDriverState *bs,
 BlockDriverState *p;
 int64_t eof = 0;
 
-assert(bs != base);
+assert(include_base || bs != base);
+assert(!include_base || base); /* Can't include NULL base */
 
 ret = bdrv_co_block_status(bs, want_zero, offset, bytes, pnum, map, file);
-if (ret < 0 || *pnum == 0 || ret & BDRV_BLOCK_ALLOCATED) {
+if (ret < 0 || *pnum == 0 || ret & BDRV_BLOCK_ALLOCATED || bs == base) {
 return ret;
 }
 
@@ -2368,7 +2370,7 @@ bdrv_co_common_block_status_above(BlockDriverState *bs,
 assert(*pnum <= bytes);
 bytes = *pnum;
 
-for (p = bdrv_filter_or_cow_bs(bs); p != base;
+for (p = bdrv_filter_or_cow_bs(bs); include_base || p != base;
  p = bdrv_filter_or_cow_bs(p))
 {
 ret = bdrv_co_block_status(p, want_zero, offset, bytes, pnum, map,
@@ -2406,6 +2408,11 @@ bdrv_co_common_block_status_above(BlockDriverState *bs,
 break;
 }
 
+if (p == base) {
+assert(include_base);
+break;
+}
+
 /*
  * OK, [offset, offset + *pnum) region is unallocated on this layer,
  * let's continue the diving.
@@ -2425,7 +2432,7 @@ int bdrv_block_status_above(BlockDriverState *bs, 
BlockDriverState *base,
 int64_t offset, int64_t bytes, int64_t *pnum,
 int64_t *map, BlockDriverState **file)
 {
-return bdrv_common_block_status_above(bs, base, true, offset, bytes,
+return bdrv_common_block_status_above(bs, base, false, true, offset, bytes,
   pnum, map, file);
 }
 
@@ -2442,9 +2449,9 @@ int coroutine_fn bdrv_is_allocated(BlockDriverState *bs, 
int64_t offset,
 int ret;
 int64_t dummy;
 
-ret = bdrv_common_block_status_above(bs, bdrv_filter_or_cow_bs(bs), false,
- offset, bytes, pnum ? pnum : &dummy,
- NULL, NULL);
+ret = bdrv_common_block_status_above(bs, bs, true, false, offset,
+ bytes, pnum ? pnum : &dummy, NULL,
+ NULL);
 if (ret < 0) {
 return ret;
 }
-- 
2.26.2



[PULL v2 19/28] util/vhost-user-server: use static library in meson.build

2020-10-22 Thread Stefan Hajnoczi
Don't compile contrib/libvhost-user/libvhost-user.c again. Instead build
the static library once and then reuse it throughout QEMU.

Also switch from CONFIG_LINUX to CONFIG_VHOST_USER, which is what the
vhost-user tools (vhost-user-gpu, etc) do.

Signed-off-by: Stefan Hajnoczi 
Message-id: 20200924151549.913737-14-stefa...@redhat.com
[Added CONFIG_LINUX again because libvhost-user doesn't build on macOS.
--Stefan]
Signed-off-by: Stefan Hajnoczi 
---
 block/export/export.c | 8 
 block/export/meson.build  | 2 +-
 contrib/libvhost-user/meson.build | 1 +
 meson.build   | 6 +-
 util/meson.build  | 4 +++-
 5 files changed, 14 insertions(+), 7 deletions(-)

diff --git a/block/export/export.c b/block/export/export.c
index bd7cac241f..550897e236 100644
--- a/block/export/export.c
+++ b/block/export/export.c
@@ -17,17 +17,17 @@
 #include "sysemu/block-backend.h"
 #include "block/export.h"
 #include "block/nbd.h"
-#if CONFIG_LINUX
-#include "block/export/vhost-user-blk-server.h"
-#endif
 #include "qapi/error.h"
 #include "qapi/qapi-commands-block-export.h"
 #include "qapi/qapi-events-block-export.h"
 #include "qemu/id.h"
+#ifdef CONFIG_VHOST_USER
+#include "vhost-user-blk-server.h"
+#endif
 
 static const BlockExportDriver *blk_exp_drivers[] = {
 &blk_exp_nbd,
-#if CONFIG_LINUX
+#ifdef CONFIG_VHOST_USER
 &blk_exp_vhost_user_blk,
 #endif
 };
diff --git a/block/export/meson.build b/block/export/meson.build
index ef3a9576f7..fffe6b09e8 100644
--- a/block/export/meson.build
+++ b/block/export/meson.build
@@ -1,2 +1,2 @@
 block_ss.add(files('export.c'))
-block_ss.add(when: 'CONFIG_LINUX', if_true: files('vhost-user-blk-server.c', 
'../../contrib/libvhost-user/libvhost-user.c'))
+block_ss.add(when: ['CONFIG_LINUX', 'CONFIG_VHOST_USER'], if_true: 
files('vhost-user-blk-server.c'))
diff --git a/contrib/libvhost-user/meson.build 
b/contrib/libvhost-user/meson.build
index e68dd1a581..a261e7665f 100644
--- a/contrib/libvhost-user/meson.build
+++ b/contrib/libvhost-user/meson.build
@@ -1,3 +1,4 @@
 libvhost_user = static_library('vhost-user',
files('libvhost-user.c', 
'libvhost-user-glib.c'),
build_by_default: false)
+vhost_user = declare_dependency(link_with: libvhost_user)
diff --git a/meson.build b/meson.build
index 7627a0ae46..2ec4f114ce 100644
--- a/meson.build
+++ b/meson.build
@@ -1398,6 +1398,11 @@ trace_events_subdirs += [
   'util',
 ]
 
+vhost_user = not_found
+if 'CONFIG_VHOST_USER' in config_host
+  subdir('contrib/libvhost-user')
+endif
+
 subdir('qapi')
 subdir('qobject')
 subdir('stubs')
@@ -1830,7 +1835,6 @@ if have_tools
  install: true)
 
   if 'CONFIG_VHOST_USER' in config_host
-subdir('contrib/libvhost-user')
 subdir('contrib/vhost-user-blk')
 subdir('contrib/vhost-user-gpu')
 subdir('contrib/vhost-user-input')
diff --git a/util/meson.build b/util/meson.build
index 2296e81b34..c5159ad79d 100644
--- a/util/meson.build
+++ b/util/meson.build
@@ -66,7 +66,9 @@ if have_block
   util_ss.add(files('main-loop.c'))
   util_ss.add(files('nvdimm-utils.c'))
   util_ss.add(files('qemu-coroutine.c', 'qemu-coroutine-lock.c', 
'qemu-coroutine-io.c'))
-  util_ss.add(when: 'CONFIG_LINUX', if_true: files('vhost-user-server.c'))
+  util_ss.add(when: ['CONFIG_LINUX', 'CONFIG_VHOST_USER'], if_true: [
+files('vhost-user-server.c'), vhost_user
+  ])
   util_ss.add(files('block-helpers.c'))
   util_ss.add(files('qemu-coroutine-sleep.c'))
   util_ss.add(files('qemu-co-shared-resource.c'))
-- 
2.26.2



[PULL v2 15/28] util/vhost-user-server: rework vu_client_trip() coroutine lifecycle

2020-10-22 Thread Stefan Hajnoczi
The vu_client_trip() coroutine is leaked during AioContext switching. It
is also unsafe to destroy the vu_dev in panic_cb() since its callers
still access it in some cases.

Rework the lifecycle to solve these safety issues.

Signed-off-by: Stefan Hajnoczi 
Message-id: 20200924151549.913737-10-stefa...@redhat.com
Signed-off-by: Stefan Hajnoczi 
---
 util/vhost-user-server.h |  29 ++--
 block/export/vhost-user-blk-server.c |   9 +-
 util/vhost-user-server.c | 245 +++
 3 files changed, 155 insertions(+), 128 deletions(-)

diff --git a/util/vhost-user-server.h b/util/vhost-user-server.h
index 92177fc911..0da4c2cc4c 100644
--- a/util/vhost-user-server.h
+++ b/util/vhost-user-server.h
@@ -19,34 +19,36 @@
 #include "qapi/error.h"
 #include "standard-headers/linux/virtio_blk.h"
 
+/* A kick fd that we monitor on behalf of libvhost-user */
 typedef struct VuFdWatch {
 VuDev *vu_dev;
 int fd; /*kick fd*/
 void *pvt;
 vu_watch_cb cb;
-bool processing;
 QTAILQ_ENTRY(VuFdWatch) next;
 } VuFdWatch;
 
-typedef struct VuServer VuServer;
-
-struct VuServer {
+/**
+ * VuServer:
+ * A vhost-user server instance with user-defined VuDevIface callbacks.
+ * Vhost-user device backends can be implemented using VuServer. VuDevIface
+ * callbacks and virtqueue kicks run in the given AioContext.
+ */
+typedef struct {
 QIONetListener *listener;
+QEMUBH *restart_listener_bh;
 AioContext *ctx;
 int max_queues;
 const VuDevIface *vu_iface;
+
+/* Protected by ctx lock */
 VuDev vu_dev;
 QIOChannel *ioc; /* The I/O channel with the client */
 QIOChannelSocket *sioc; /* The underlying data channel with the client */
-/* IOChannel for fd provided via VHOST_USER_SET_SLAVE_REQ_FD */
-QIOChannel *ioc_slave;
-QIOChannelSocket *sioc_slave;
-Coroutine *co_trip; /* coroutine for processing VhostUserMsg */
 QTAILQ_HEAD(, VuFdWatch) vu_fd_watches;
-/* restart coroutine co_trip if AIOContext is changed */
-bool aio_context_changed;
-bool processing_msg;
-};
+
+Coroutine *co_trip; /* coroutine for processing VhostUserMsg */
+} VuServer;
 
 bool vhost_user_server_start(VuServer *server,
  SocketAddress *unix_socket,
@@ -57,6 +59,7 @@ bool vhost_user_server_start(VuServer *server,
 
 void vhost_user_server_stop(VuServer *server);
 
-void vhost_user_server_set_aio_context(VuServer *server, AioContext *ctx);
+void vhost_user_server_attach_aio_context(VuServer *server, AioContext *ctx);
+void vhost_user_server_detach_aio_context(VuServer *server);
 
 #endif /* VHOST_USER_SERVER_H */
diff --git a/block/export/vhost-user-blk-server.c 
b/block/export/vhost-user-blk-server.c
index c8fa4ecba9..4d35232bf3 100644
--- a/block/export/vhost-user-blk-server.c
+++ b/block/export/vhost-user-blk-server.c
@@ -313,18 +313,13 @@ static const VuDevIface vu_block_iface = {
 static void blk_aio_attached(AioContext *ctx, void *opaque)
 {
 VuBlockDev *vub_dev = opaque;
-aio_context_acquire(ctx);
-vhost_user_server_set_aio_context(&vub_dev->vu_server, ctx);
-aio_context_release(ctx);
+vhost_user_server_attach_aio_context(&vub_dev->vu_server, ctx);
 }
 
 static void blk_aio_detach(void *opaque)
 {
 VuBlockDev *vub_dev = opaque;
-AioContext *ctx = vub_dev->vu_server.ctx;
-aio_context_acquire(ctx);
-vhost_user_server_set_aio_context(&vub_dev->vu_server, NULL);
-aio_context_release(ctx);
+vhost_user_server_detach_aio_context(&vub_dev->vu_server);
 }
 
 static void
diff --git a/util/vhost-user-server.c b/util/vhost-user-server.c
index 981908fef0..c448800e58 100644
--- a/util/vhost-user-server.c
+++ b/util/vhost-user-server.c
@@ -9,8 +9,50 @@
  */
 #include "qemu/osdep.h"
 #include "qemu/main-loop.h"
+#include "block/aio-wait.h"
 #include "vhost-user-server.h"
 
+/*
+ * Theory of operation:
+ *
+ * VuServer is started and stopped by vhost_user_server_start() and
+ * vhost_user_server_stop() from the main loop thread. Starting the server
+ * opens a vhost-user UNIX domain socket and listens for incoming connections.
+ * Only one connection is allowed at a time.
+ *
+ * The connection is handled by the vu_client_trip() coroutine in the
+ * VuServer->ctx AioContext. The coroutine consists of a vu_dispatch() loop
+ * where libvhost-user calls vu_message_read() to receive the next vhost-user
+ * protocol messages over the UNIX domain socket.
+ *
+ * When virtqueues are set up libvhost-user calls set_watch() to monitor kick
+ * fds. These fds are also handled in the VuServer->ctx AioContext.
+ *
+ * Both vu_client_trip() and kick fd monitoring can be stopped by shutting down
+ * the socket connection. Shutting down the socket connection causes
+ * vu_message_read() to fail since no more data can be received from the 
socket.
+ * After vu_dispatch() fails, vu_client_trip() calls vu_deinit() to stop
+ * libvhost-user before terminating the coroutine. vu_deinit() calls
+ * remove_watch() to 

[PULL v2 13/28] util/vhost-user-server: fix memory leak in vu_message_read()

2020-10-22 Thread Stefan Hajnoczi
fds[] is leaked when qio_channel_readv_full() fails.

Use vmsg->fds[] instead of keeping a local fds[] array. Then we can
reuse the goto fail path to clean up the fds. vmsg->fd_num must be
zeroed before the loop to make this safe.

Signed-off-by: Stefan Hajnoczi 
Message-id: 20200924151549.913737-8-stefa...@redhat.com
Signed-off-by: Stefan Hajnoczi 
---
 util/vhost-user-server.c | 50 ++--
 1 file changed, 23 insertions(+), 27 deletions(-)

diff --git a/util/vhost-user-server.c b/util/vhost-user-server.c
index 73a1667b54..a7b7a9897f 100644
--- a/util/vhost-user-server.c
+++ b/util/vhost-user-server.c
@@ -100,21 +100,11 @@ vu_message_read(VuDev *vu_dev, int conn_fd, VhostUserMsg 
*vmsg)
 };
 int rc, read_bytes = 0;
 Error *local_err = NULL;
-/*
- * Store fds/nfds returned from qio_channel_readv_full into
- * temporary variables.
- *
- * VhostUserMsg is a packed structure, gcc will complain about passing
- * pointer to a packed structure member if we pass &VhostUserMsg.fd_num
- * and &VhostUserMsg.fds directly when calling qio_channel_readv_full,
- * thus two temporary variables nfds and fds are used here.
- */
-size_t nfds = 0, nfds_t = 0;
 const size_t max_fds = G_N_ELEMENTS(vmsg->fds);
-int *fds_t = NULL;
 VuServer *server = container_of(vu_dev, VuServer, vu_dev);
 QIOChannel *ioc = server->ioc;
 
+vmsg->fd_num = 0;
 if (!ioc) {
 error_report_err(local_err);
 goto fail;
@@ -122,41 +112,47 @@ vu_message_read(VuDev *vu_dev, int conn_fd, VhostUserMsg 
*vmsg)
 
 assert(qemu_in_coroutine());
 do {
+size_t nfds = 0;
+int *fds = NULL;
+
 /*
  * qio_channel_readv_full may have short reads, keeping calling it
  * until getting VHOST_USER_HDR_SIZE or 0 bytes in total
  */
-rc = qio_channel_readv_full(ioc, &iov, 1, &fds_t, &nfds_t, &local_err);
+rc = qio_channel_readv_full(ioc, &iov, 1, &fds, &nfds, &local_err);
 if (rc < 0) {
 if (rc == QIO_CHANNEL_ERR_BLOCK) {
+assert(local_err == NULL);
 qio_channel_yield(ioc, G_IO_IN);
 continue;
 } else {
 error_report_err(local_err);
-return false;
+goto fail;
 }
 }
-read_bytes += rc;
-if (nfds_t > 0) {
-if (nfds + nfds_t > max_fds) {
+
+if (nfds > 0) {
+if (vmsg->fd_num + nfds > max_fds) {
 error_report("A maximum of %zu fds are allowed, "
  "however got %zu fds now",
- max_fds, nfds + nfds_t);
+ max_fds, vmsg->fd_num + nfds);
+g_free(fds);
 goto fail;
 }
-memcpy(vmsg->fds + nfds, fds_t,
-   nfds_t *sizeof(vmsg->fds[0]));
-nfds += nfds_t;
-g_free(fds_t);
+memcpy(vmsg->fds + vmsg->fd_num, fds, nfds * sizeof(vmsg->fds[0]));
+vmsg->fd_num += nfds;
+g_free(fds);
 }
-if (read_bytes == VHOST_USER_HDR_SIZE || rc == 0) {
-break;
+
+if (rc == 0) { /* socket closed */
+goto fail;
 }
-iov.iov_base = (char *)vmsg + read_bytes;
-iov.iov_len = VHOST_USER_HDR_SIZE - read_bytes;
-} while (true);
 
-vmsg->fd_num = nfds;
+iov.iov_base += rc;
+iov.iov_len -= rc;
+read_bytes += rc;
+} while (read_bytes != VHOST_USER_HDR_SIZE);
+
 /* qio_channel_readv_full will make socket fds blocking, unblock them */
 vmsg_unblock_fds(vmsg);
 if (vmsg->size > sizeof(vmsg->payload)) {
-- 
2.26.2



[PULL v2 20/28] qemu-storage-daemon: avoid compiling blockdev_ss twice

2020-10-22 Thread Stefan Hajnoczi
Introduce libblkdev.fa to avoid recompiling blockdev_ss twice.

Suggested-by: Paolo Bonzini 
Reviewed-by: Paolo Bonzini 
Signed-off-by: Stefan Hajnoczi 
Message-id: 20200929125516.186715-3-stefa...@redhat.com
Signed-off-by: Stefan Hajnoczi 
---
 meson.build| 12 ++--
 storage-daemon/meson.build |  3 +--
 2 files changed, 11 insertions(+), 4 deletions(-)

diff --git a/meson.build b/meson.build
index 2ec4f114ce..880683407f 100644
--- a/meson.build
+++ b/meson.build
@@ -1464,7 +1464,6 @@ blockdev_ss.add(files(
 # os-win32.c does not
 blockdev_ss.add(when: 'CONFIG_POSIX', if_true: files('os-posix.c'))
 softmmu_ss.add(when: 'CONFIG_WIN32', if_true: [files('os-win32.c')])
-softmmu_ss.add_all(blockdev_ss)
 
 common_ss.add(files('cpus-common.c'))
 
@@ -1596,6 +1595,15 @@ block = declare_dependency(link_whole: [libblock],
link_args: '@block.syms',
dependencies: [crypto, io])
 
+blockdev_ss = blockdev_ss.apply(config_host, strict: false)
+libblockdev = static_library('blockdev', blockdev_ss.sources() + genh,
+ dependencies: blockdev_ss.dependencies(),
+ name_suffix: 'fa',
+ build_by_default: false)
+
+blockdev = declare_dependency(link_whole: [libblockdev],
+  dependencies: [block])
+
 qmp_ss = qmp_ss.apply(config_host, strict: false)
 libqmp = static_library('qmp', qmp_ss.sources() + genh,
 dependencies: qmp_ss.dependencies(),
@@ -1628,7 +1636,7 @@ foreach m : block_mods + softmmu_mods
 install_dir: config_host['qemu_moddir'])
 endforeach
 
-softmmu_ss.add(authz, block, chardev, crypto, io, qmp)
+softmmu_ss.add(authz, blockdev, chardev, crypto, io, qmp)
 common_ss.add(qom, qemuutil)
 
 common_ss.add_all(when: 'CONFIG_SOFTMMU', if_true: [softmmu_ss])
diff --git a/storage-daemon/meson.build b/storage-daemon/meson.build
index 0409acc3f5..c5adce81c3 100644
--- a/storage-daemon/meson.build
+++ b/storage-daemon/meson.build
@@ -1,7 +1,6 @@
 qsd_ss = ss.source_set()
 qsd_ss.add(files('qemu-storage-daemon.c'))
-qsd_ss.add(block, chardev, qmp, qom, qemuutil)
-qsd_ss.add_all(blockdev_ss)
+qsd_ss.add(blockdev, chardev, qmp, qom, qemuutil)
 
 subdir('qapi')
 
-- 
2.26.2



[PULL v2 11/28] block/export: consolidate request structs into VuBlockReq

2020-10-22 Thread Stefan Hajnoczi
Only one struct is needed per request. Drop req_data and the separate
VuBlockReq instance. Instead let vu_queue_pop() allocate everything at
once.

This fixes the req_data memory leak in vu_block_virtio_process_req().

Signed-off-by: Stefan Hajnoczi 
Message-id: 20200924151549.913737-6-stefa...@redhat.com
Signed-off-by: Stefan Hajnoczi 
---
 block/export/vhost-user-blk-server.c | 68 +---
 1 file changed, 21 insertions(+), 47 deletions(-)

diff --git a/block/export/vhost-user-blk-server.c 
b/block/export/vhost-user-blk-server.c
index 2ba1613cc9..d227b468d8 100644
--- a/block/export/vhost-user-blk-server.c
+++ b/block/export/vhost-user-blk-server.c
@@ -25,7 +25,7 @@ struct virtio_blk_inhdr {
 };
 
 typedef struct VuBlockReq {
-VuVirtqElement *elem;
+VuVirtqElement elem;
 int64_t sector_num;
 size_t size;
 struct virtio_blk_inhdr *in;
@@ -39,14 +39,10 @@ static void vu_block_req_complete(VuBlockReq *req)
 VuDev *vu_dev = &req->server->vu_dev;
 
 /* IO size with 1 extra status byte */
-vu_queue_push(vu_dev, req->vq, req->elem, req->size + 1);
+vu_queue_push(vu_dev, req->vq, &req->elem, req->size + 1);
 vu_queue_notify(vu_dev, req->vq);
 
-if (req->elem) {
-free(req->elem);
-}
-
-g_free(req);
+free(req);
 }
 
 static VuBlockDev *get_vu_block_device_by_server(VuServer *server)
@@ -89,20 +85,12 @@ static void coroutine_fn vu_block_flush(VuBlockReq *req)
 blk_co_flush(backend);
 }
 
-struct req_data {
-VuServer *server;
-VuVirtq *vq;
-VuVirtqElement *elem;
-};
-
 static void coroutine_fn vu_block_virtio_process_req(void *opaque)
 {
-struct req_data *data = opaque;
-VuServer *server = data->server;
-VuVirtq *vq = data->vq;
-VuVirtqElement *elem = data->elem;
+VuBlockReq *req = opaque;
+VuServer *server = req->server;
+VuVirtqElement *elem = &req->elem;
 uint32_t type;
-VuBlockReq *req;
 
 VuBlockDev *vdev_blk = get_vu_block_device_by_server(server);
 BlockBackend *backend = vdev_blk->backend;
@@ -111,18 +99,13 @@ static void coroutine_fn vu_block_virtio_process_req(void 
*opaque)
 struct iovec *out_iov = elem->out_sg;
 unsigned in_num = elem->in_num;
 unsigned out_num = elem->out_num;
+
 /* refer to hw/block/virtio_blk.c */
 if (elem->out_num < 1 || elem->in_num < 1) {
 error_report("virtio-blk request missing headers");
-free(elem);
-return;
+goto err;
 }
 
-req = g_new0(VuBlockReq, 1);
-req->server = server;
-req->vq = vq;
-req->elem = elem;
-
 if (unlikely(iov_to_buf(out_iov, out_num, 0, &req->out,
 sizeof(req->out)) != sizeof(req->out))) {
 error_report("virtio-blk request outhdr too short");
@@ -202,36 +185,27 @@ static void coroutine_fn vu_block_virtio_process_req(void 
*opaque)
 
 err:
 free(elem);
-g_free(req);
-return;
 }
 
 static void vu_block_process_vq(VuDev *vu_dev, int idx)
 {
-VuServer *server;
-VuVirtq *vq;
-struct req_data *req_data;
+VuServer *server = container_of(vu_dev, VuServer, vu_dev);
+VuVirtq *vq = vu_get_queue(vu_dev, idx);
 
-server = container_of(vu_dev, VuServer, vu_dev);
-assert(server);
-
-vq = vu_get_queue(vu_dev, idx);
-assert(vq);
-VuVirtqElement *elem;
 while (1) {
-elem = vu_queue_pop(vu_dev, vq, sizeof(VuVirtqElement) +
-sizeof(VuBlockReq));
-if (elem) {
-req_data = g_new0(struct req_data, 1);
-req_data->server = server;
-req_data->vq = vq;
-req_data->elem = elem;
-Coroutine *co = qemu_coroutine_create(vu_block_virtio_process_req,
-  req_data);
-aio_co_enter(server->ioc->ctx, co);
-} else {
+VuBlockReq *req;
+
+req = vu_queue_pop(vu_dev, vq, sizeof(VuBlockReq));
+if (!req) {
 break;
 }
+
+req->server = server;
+req->vq = vq;
+
+Coroutine *co =
+qemu_coroutine_create(vu_block_virtio_process_req, req);
+qemu_coroutine_enter(co);
 }
 }
 
-- 
2.26.2



[PULL v2 10/28] util/vhost-user-server: drop unnecessary watch deletion

2020-10-22 Thread Stefan Hajnoczi
Explicitly deleting watches is not necessary since libvhost-user calls
remove_watch() during vu_deinit(). Add an assertion to check this
though.

Signed-off-by: Stefan Hajnoczi 
Message-id: 20200924151549.913737-5-stefa...@redhat.com
Signed-off-by: Stefan Hajnoczi 
---
 util/vhost-user-server.c | 19 ---
 1 file changed, 4 insertions(+), 15 deletions(-)

diff --git a/util/vhost-user-server.c b/util/vhost-user-server.c
index 443ab7448c..6efe2279fd 100644
--- a/util/vhost-user-server.c
+++ b/util/vhost-user-server.c
@@ -48,21 +48,6 @@ static void close_client(VuServer *server)
 /* When this is set vu_client_trip will stop new processing vhost-user 
message */
 server->sioc = NULL;
 
-VuFdWatch *vu_fd_watch, *next;
-QTAILQ_FOREACH_SAFE(vu_fd_watch, &server->vu_fd_watches, next, next) {
-aio_set_fd_handler(server->ioc->ctx, vu_fd_watch->fd, true, NULL,
-   NULL, NULL, NULL);
-}
-
-while (!QTAILQ_EMPTY(&server->vu_fd_watches)) {
-QTAILQ_FOREACH_SAFE(vu_fd_watch, &server->vu_fd_watches, next, next) {
-if (!vu_fd_watch->processing) {
-QTAILQ_REMOVE(&server->vu_fd_watches, vu_fd_watch, next);
-g_free(vu_fd_watch);
-}
-}
-}
-
 while (server->processing_msg) {
 if (server->ioc->read_coroutine) {
 server->ioc->read_coroutine = NULL;
@@ -73,6 +58,10 @@ static void close_client(VuServer *server)
 }
 
 vu_deinit(&server->vu_dev);
+
+/* vu_deinit() should have called remove_watch() */
+assert(QTAILQ_EMPTY(&server->vu_fd_watches));
+
 object_unref(OBJECT(sioc));
 object_unref(OBJECT(server->ioc));
 }
-- 
2.26.2



[PULL v2 24/28] block/io: fix bdrv_co_block_status_above

2020-10-22 Thread Stefan Hajnoczi
From: Vladimir Sementsov-Ogievskiy 

bdrv_co_block_status_above has several design problems with handling
short backing files:

1. With want_zeros=true, it may return ret with BDRV_BLOCK_ZERO but
without BDRV_BLOCK_ALLOCATED flag, when actually short backing file
which produces these after-EOF zeros is inside requested backing
sequence.

2. With want_zero=false, it may return pnum=0 prior to the actual EOF,
because of the EOF of a short backing file.

Fix these things, making logic about short backing files clearer.

With fixed bdrv_block_status_above we also have to improve is_zero in
qcow2 code, otherwise iotest 154 will fail, because with this patch we
stop merging zeros of different types (those produced by regions that
are unallocated in the whole backing chain vs those produced by short
backing files).

Note also, that this patch leaves for another day the general problem
around block-status: misuse of BDRV_BLOCK_ALLOCATED as is-fs-allocated
vs go-to-backing.

Signed-off-by: Vladimir Sementsov-Ogievskiy 
Reviewed-by: Alberto Garcia 
Reviewed-by: Eric Blake 
Message-id: 20200924194003.22080-2-vsement...@virtuozzo.com
[Fix s/comes/come/ as suggested by Eric Blake
--Stefan]
Signed-off-by: Stefan Hajnoczi 
---
 block/io.c| 68 ---
 block/qcow2.c | 16 ++--
 2 files changed, 68 insertions(+), 16 deletions(-)

diff --git a/block/io.c b/block/io.c
index 54f0968aee..a718d50ca2 100644
--- a/block/io.c
+++ b/block/io.c
@@ -2350,34 +2350,74 @@ bdrv_co_common_block_status_above(BlockDriverState *bs,
   int64_t *map,
   BlockDriverState **file)
 {
+int ret;
 BlockDriverState *p;
-int ret = 0;
-bool first = true;
+int64_t eof = 0;
 
 assert(bs != base);
-for (p = bs; p != base; p = bdrv_filter_or_cow_bs(p)) {
+
+ret = bdrv_co_block_status(bs, want_zero, offset, bytes, pnum, map, file);
+if (ret < 0 || *pnum == 0 || ret & BDRV_BLOCK_ALLOCATED) {
+return ret;
+}
+
+if (ret & BDRV_BLOCK_EOF) {
+eof = offset + *pnum;
+}
+
+assert(*pnum <= bytes);
+bytes = *pnum;
+
+for (p = bdrv_filter_or_cow_bs(bs); p != base;
+ p = bdrv_filter_or_cow_bs(p))
+{
 ret = bdrv_co_block_status(p, want_zero, offset, bytes, pnum, map,
file);
 if (ret < 0) {
-break;
+return ret;
 }
-if (ret & BDRV_BLOCK_ZERO && ret & BDRV_BLOCK_EOF && !first) {
+if (*pnum == 0) {
 /*
- * Reading beyond the end of the file continues to read
- * zeroes, but we can only widen the result to the
- * unallocated length we learned from an earlier
- * iteration.
+ * The top layer deferred to this layer, and because this layer is
+ * short, any zeroes that we synthesize beyond EOF behave as if 
they
+ * were allocated at this layer.
+ *
+ * We don't include BDRV_BLOCK_EOF into ret, as upper layer may be
+ * larger. We'll add BDRV_BLOCK_EOF if needed at function end, see
+ * below.
  */
+assert(ret & BDRV_BLOCK_EOF);
 *pnum = bytes;
+if (file) {
+*file = p;
+}
+ret = BDRV_BLOCK_ZERO | BDRV_BLOCK_ALLOCATED;
+break;
 }
-if (ret & (BDRV_BLOCK_ZERO | BDRV_BLOCK_DATA)) {
+if (ret & BDRV_BLOCK_ALLOCATED) {
+/*
+ * We've found the node and the status, we must break.
+ *
+ * Drop BDRV_BLOCK_EOF, as it's not for upper layer, which may be
+ * larger. We'll add BDRV_BLOCK_EOF if needed at function end, see
+ * below.
+ */
+ret &= ~BDRV_BLOCK_EOF;
 break;
 }
-/* [offset, pnum] unallocated on this layer, which could be only
- * the first part of [offset, bytes].  */
-bytes = MIN(bytes, *pnum);
-first = false;
+
+/*
+ * OK, [offset, offset + *pnum) region is unallocated on this layer,
+ * let's continue the diving.
+ */
+assert(*pnum <= bytes);
+bytes = *pnum;
+}
+
+if (offset + *pnum == eof) {
+ret |= BDRV_BLOCK_EOF;
 }
+
 return ret;
 }
 
diff --git a/block/qcow2.c b/block/qcow2.c
index b05512718c..b6cb4db8bb 100644
--- a/block/qcow2.c
+++ b/block/qcow2.c
@@ -3860,8 +3860,20 @@ static bool is_zero(BlockDriverState *bs, int64_t 
offset, int64_t bytes)
 if (!bytes) {
 return true;
 }
-res = bdrv_block_status_above(bs, NULL, offset, bytes, &nr, NULL, NULL);
-return res >= 0 && (res & BDRV_BLOCK_ZERO) && nr == bytes;
+
+/*
+ * bdrv_block_status_above doesn't merge different types of zeros, for
+ * example, zeros which come from the region which is unallocated in
+ * the whole backing 

[PULL v2 28/28] iotests: add commit top->base cases to 274

2020-10-22 Thread Stefan Hajnoczi
From: Vladimir Sementsov-Ogievskiy 

These cases are fixed by previous patches around block_status and
is_allocated.

Signed-off-by: Vladimir Sementsov-Ogievskiy 
Reviewed-by: Eric Blake 
Reviewed-by: Alberto Garcia 
Message-id: 20200924194003.22080-6-vsement...@virtuozzo.com
Signed-off-by: Stefan Hajnoczi 
---
 tests/qemu-iotests/274 | 20 +++
 tests/qemu-iotests/274.out | 68 ++
 2 files changed, 88 insertions(+)

diff --git a/tests/qemu-iotests/274 b/tests/qemu-iotests/274
index d4571c5465..76b1ba6a52 100755
--- a/tests/qemu-iotests/274
+++ b/tests/qemu-iotests/274
@@ -115,6 +115,26 @@ with iotests.FilePath('base') as base, \
 iotests.qemu_io_log('-c', 'read -P 1 0 %d' % size_short, mid)
 iotests.qemu_io_log('-c', 'read -P 0 %d %d' % (size_short, size_diff), mid)
 
+    iotests.log('=== Testing qemu-img commit (top -> base) ===')
+
+    create_chain()
+    iotests.qemu_img_log('commit', '-b', base, top)
+    iotests.img_info_log(base)
+    iotests.qemu_io_log('-c', 'read -P 1 0 %d' % size_short, base)
+    iotests.qemu_io_log('-c', 'read -P 0 %d %d' % (size_short, size_diff), base)
+
+    iotests.log('=== Testing QMP active commit (top -> base) ===')
+
+    create_chain()
+    with create_vm() as vm:
+        vm.launch()
+        vm.qmp_log('block-commit', device='top', base_node='base',
+                   job_id='job0', auto_dismiss=False)
+        vm.run_job('job0', wait=5)
+
+    iotests.img_info_log(mid)
+    iotests.qemu_io_log('-c', 'read -P 1 0 %d' % size_short, base)
+    iotests.qemu_io_log('-c', 'read -P 0 %d %d' % (size_short, size_diff), base)
 
     iotests.log('== Resize tests ==')
 
diff --git a/tests/qemu-iotests/274.out b/tests/qemu-iotests/274.out
index bf5abd4c10..cfe17a8659 100644
--- a/tests/qemu-iotests/274.out
+++ b/tests/qemu-iotests/274.out
@@ -135,6 +135,74 @@ read 1048576/1048576 bytes at offset 0
 read 1048576/1048576 bytes at offset 1048576
 1 MiB, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
 
+=== Testing qemu-img commit (top -> base) ===
+Formatting 'TEST_DIR/PID-base', fmt=qcow2 cluster_size=65536 extended_l2=off compression_type=zlib size=2097152 lazy_refcounts=off refcount_bits=16
+
+Formatting 'TEST_DIR/PID-mid', fmt=qcow2 cluster_size=65536 extended_l2=off compression_type=zlib size=1048576 backing_file=TEST_DIR/PID-base backing_fmt=qcow2 lazy_refcounts=off refcount_bits=16
+
+Formatting 'TEST_DIR/PID-top', fmt=qcow2 cluster_size=65536 extended_l2=off compression_type=zlib size=2097152 backing_file=TEST_DIR/PID-mid backing_fmt=qcow2 lazy_refcounts=off refcount_bits=16
+
+wrote 2097152/2097152 bytes at offset 0
+2 MiB, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
+
+Image committed.
+
+image: TEST_IMG
+file format: IMGFMT
+virtual size: 2 MiB (2097152 bytes)
+cluster_size: 65536
+Format specific information:
+compat: 1.1
+compression type: zlib
+lazy refcounts: false
+refcount bits: 16
+corrupt: false
+extended l2: false
+
+read 1048576/1048576 bytes at offset 0
+1 MiB, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
+
+read 1048576/1048576 bytes at offset 1048576
+1 MiB, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
+
+=== Testing QMP active commit (top -> base) ===
+Formatting 'TEST_DIR/PID-base', fmt=qcow2 cluster_size=65536 extended_l2=off compression_type=zlib size=2097152 lazy_refcounts=off refcount_bits=16
+
+Formatting 'TEST_DIR/PID-mid', fmt=qcow2 cluster_size=65536 extended_l2=off compression_type=zlib size=1048576 backing_file=TEST_DIR/PID-base backing_fmt=qcow2 lazy_refcounts=off refcount_bits=16
+
+Formatting 'TEST_DIR/PID-top', fmt=qcow2 cluster_size=65536 extended_l2=off compression_type=zlib size=2097152 backing_file=TEST_DIR/PID-mid backing_fmt=qcow2 lazy_refcounts=off refcount_bits=16
+
+wrote 2097152/2097152 bytes at offset 0
+2 MiB, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
+
+{"execute": "block-commit", "arguments": {"auto-dismiss": false, "base-node": 
"base", "device": "top", "job-id": "job0"}}
+{"return": {}}
+{"execute": "job-complete", "arguments": {"id": "job0"}}
+{"return": {}}
+{"data": {"device": "job0", "len": 1048576, "offset": 1048576, "speed": 0, 
"type": "commit"}, "event": "BLOCK_JOB_READY", "timestamp": {"microseconds": 
"USECS", "seconds": "SECS"}}
+{"data": {"device": "job0", "len": 1048576, "offset": 1048576, "speed": 0, 
"type": "commit"}, "event": "BLOCK_JOB_COMPLETED", "timestamp": 
{"microseconds": "USECS", "seconds": "SECS"}}
+{"execute": "job-dismiss", "arguments": {"id": "job0"}}
+{"return": {}}
+image: TEST_IMG
+file format: IMGFMT
+virtual size: 1 MiB (1048576 bytes)
+cluster_size: 65536
+backing file: TEST_DIR/PID-base
+backing file format: IMGFMT
+Format specific information:
+compat: 1.1
+compression type: zlib
+lazy refcounts: false
+refcount bits: 16
+corrupt: false
+extended l2: false
+
+read 1048576/1048576 bytes at offset 0
+1 MiB, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
+
