Re: [Qemu-block] [Qemu-devel] [PATCH v6 00/15] nbd: efficient write zeroes

2016-10-13 Thread no-reply
Hi,

Your series seems to have some coding style problems. See output below for
more information:

Subject: [Qemu-devel] [PATCH v6 00/15] nbd: efficient write zeroes
Type: series
Message-id: 1476392335-9256-1-git-send-email-ebl...@redhat.com

=== TEST SCRIPT BEGIN ===
#!/bin/bash

BASE=base
n=1
total=$(git log --oneline $BASE.. | wc -l)
failed=0

# Useful git options
git config --local diff.renamelimit 0
git config --local diff.renames True

commits="$(git log --format=%H --reverse $BASE..)"
for c in $commits; do
echo "Checking PATCH $n/$total: $(git show --no-patch --format=%s $c)..."
if ! git show $c --format=email | ./scripts/checkpatch.pl --mailback -; then
failed=1
echo
fi
n=$((n+1))
done

exit $failed
=== TEST SCRIPT END ===

Updating 3c8cf5a9c21ff8782164d1def7f44bd888713384
Switched to a new branch 'test'
3e3cab6 nbd: Implement NBD_CMD_WRITE_ZEROES on client
6127973 nbd: Implement NBD_CMD_WRITE_ZEROES on server
dbe675e nbd: Improve server handling of shutdown requests
d86be2e nbd: Support shorter handshake
598c0b0 nbd: Less allocation during NBD_OPT_LIST
71d9fb4 nbd: Let client skip portions of server reply
4554cca nbd: Let server know when client gives up negotiation
c33b361 nbd: Share common option-sending code in client
9e216fe nbd: Send message along with server NBD_REP_ERR errors
761e798 nbd: Share common reply-sending code in server
dd5e949 nbd: Rename struct nbd_request and nbd_reply
45a155f nbd: Rename NbdClientSession to NBDClientSession
00f1816 nbd: Rename NBDRequest to NBDRequestData
bcd4778 nbd: Treat flags vs. command type as separate fields
fde7920 nbd: Add qemu-nbd -D for human-readable description

=== OUTPUT BEGIN ===
Checking PATCH 1/15: nbd: Add qemu-nbd -D for human-readable description...
Checking PATCH 2/15: nbd: Treat flags vs. command type as separate fields...
Checking PATCH 3/15: nbd: Rename NBDRequest to NBDRequestData...
Checking PATCH 4/15: nbd: Rename NbdClientSession to NBDClientSession...
Checking PATCH 5/15: nbd: Rename struct nbd_request and nbd_reply...
Checking PATCH 6/15: nbd: Share common reply-sending code in server...
Checking PATCH 7/15: nbd: Send message along with server NBD_REP_ERR errors...
Checking PATCH 8/15: nbd: Share common option-sending code in client...
Checking PATCH 9/15: nbd: Let server know when client gives up negotiation...
Checking PATCH 10/15: nbd: Let client skip portions of server reply...
Checking PATCH 11/15: nbd: Less allocation during NBD_OPT_LIST...
Checking PATCH 12/15: nbd: Support shorter handshake...
Checking PATCH 13/15: nbd: Improve server handling of shutdown requests...
ERROR: return of an errno should typically be -ve (return -ESHUTDOWN)
#63: FILE: nbd/client.c:38:
+return ESHUTDOWN;

total: 1 errors, 0 warnings, 95 lines checked

Your patch has style problems, please review.  If any of these errors
are false positives report them to the maintainer, see
CHECKPATCH in MAINTAINERS.

Checking PATCH 14/15: nbd: Implement NBD_CMD_WRITE_ZEROES on server...
Checking PATCH 15/15: nbd: Implement NBD_CMD_WRITE_ZEROES on client...
=== OUTPUT END ===

Test command exited with code: 1


---
Email generated automatically by Patchew [http://patchew.org/].
Please send your feedback to patchew-de...@freelists.org

Re: [Qemu-block] [Qemu-devel] [PATCH v6 00/15] nbd: efficient write zeroes

2016-10-13 Thread Eric Blake
On 10/13/2016 07:00 PM, no-re...@ec2-52-6-146-230.compute-1.amazonaws.com wrote:
> Hi,
> 
> Your series failed automatic build test. Please find the testing commands and
> their output below. If you have docker installed, you can probably reproduce
> it locally.
> 

> /tmp/qemu-test/src/nbd/client.c: In function 'nbd_errno_to_system_errno':
> /tmp/qemu-test/src/nbd/client.c:38:16: error: 'ESHUTDOWN' undeclared (first use in this function)
>  return ESHUTDOWN;
> ^

Oh fun, I get to work around a missing errno value on mingw.  I'll come
up with something to squash in to 13/15.

-- 
Eric Blake   eblake redhat com   +1-919-301-3266
Libvirt virtualization library http://libvirt.org





Re: [Qemu-block] [Qemu-devel] [PATCH v6 00/15] nbd: efficient write zeroes

2016-10-13 Thread no-reply
Hi,

Your series failed automatic build test. Please find the testing commands and
their output below. If you have docker installed, you can probably reproduce it
locally.

Message-id: 1476392335-9256-1-git-send-email-ebl...@redhat.com
Type: series
Subject: [Qemu-devel] [PATCH v6 00/15] nbd: efficient write zeroes

=== TEST SCRIPT BEGIN ===
#!/bin/bash
set -e
git submodule update --init dtc
# Let docker tests dump environment info
export SHOW_ENV=1
export J=16
make docker-test-quick@centos6
make docker-test-mingw@fedora
=== TEST SCRIPT END ===

Updating 3c8cf5a9c21ff8782164d1def7f44bd888713384
From https://github.com/patchew-project/qemu
 * [new tag] patchew/1476399422-8028-1-git-send-email-js...@redhat.com -> patchew/1476399422-8028-1-git-send-email-js...@redhat.com
Switched to a new branch 'test'
3e3cab6 nbd: Implement NBD_CMD_WRITE_ZEROES on client
6127973 nbd: Implement NBD_CMD_WRITE_ZEROES on server
dbe675e nbd: Improve server handling of shutdown requests
d86be2e nbd: Support shorter handshake
598c0b0 nbd: Less allocation during NBD_OPT_LIST
71d9fb4 nbd: Let client skip portions of server reply
4554cca nbd: Let server know when client gives up negotiation
c33b361 nbd: Share common option-sending code in client
9e216fe nbd: Send message along with server NBD_REP_ERR errors
761e798 nbd: Share common reply-sending code in server
dd5e949 nbd: Rename struct nbd_request and nbd_reply
45a155f nbd: Rename NbdClientSession to NBDClientSession
00f1816 nbd: Rename NBDRequest to NBDRequestData
bcd4778 nbd: Treat flags vs. command type as separate fields
fde7920 nbd: Add qemu-nbd -D for human-readable description

=== OUTPUT BEGIN ===
Submodule 'dtc' (git://git.qemu-project.org/dtc.git) registered for path 'dtc'
Cloning into 'dtc'...
Submodule path 'dtc': checked out '65cc4d2748a2c2e6f27f1cf39e07a5dbabd80ebf'
  BUILD   centos6
  ARCHIVE qemu.tgz
  ARCHIVE dtc.tgz
  COPYRUNNER
  RUN test-quick in centos6
Packages installed:
SDL-devel-1.2.14-7.el6_7.1.x86_64
ccache-3.1.6-2.el6.x86_64
epel-release-6-8.noarch
gcc-4.4.7-17.el6.x86_64
git-1.7.1-4.el6_7.1.x86_64
glib2-devel-2.28.8-5.el6.x86_64
libfdt-devel-1.4.0-1.el6.x86_64
make-3.81-23.el6.x86_64
package g++ is not installed
pixman-devel-0.32.8-1.el6.x86_64
tar-1.23-15.el6_8.x86_64
zlib-devel-1.2.3-29.el6.x86_64

Environment variables:
PACKAGES=libfdt-devel ccache tar git make gcc g++ zlib-devel glib2-devel SDL-devel pixman-devel epel-release
HOSTNAME=29f210180aca
TERM=xterm
MAKEFLAGS= -j16
HISTSIZE=1000
J=16
USER=root
CCACHE_DIR=/var/tmp/ccache
EXTRA_CONFIGURE_OPTS=
V=
SHOW_ENV=1
MAIL=/var/spool/mail/root
PATH=/usr/lib/ccache:/usr/lib64/ccache:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin
PWD=/
LANG=en_US.UTF-8
TARGET_LIST=
HISTCONTROL=ignoredups
SHLVL=1
HOME=/root
TEST_DIR=/tmp/qemu-test
LOGNAME=root
LESSOPEN=||/usr/bin/lesspipe.sh %s
FEATURES= dtc
DEBUG=
G_BROKEN_FILENAMES=1
CCACHE_HASHDIR=
_=/usr/bin/env

Configure options:
--enable-werror --target-list=x86_64-softmmu,aarch64-softmmu --prefix=/var/tmp/qemu-build/install
No C++ compiler available; disabling C++ specific optional code
Install prefix    /var/tmp/qemu-build/install
BIOS directory    /var/tmp/qemu-build/install/share/qemu
binary directory  /var/tmp/qemu-build/install/bin
library directory /var/tmp/qemu-build/install/lib
module directory  /var/tmp/qemu-build/install/lib/qemu
libexec directory /var/tmp/qemu-build/install/libexec
include directory /var/tmp/qemu-build/install/include
config directory  /var/tmp/qemu-build/install/etc
local state directory   /var/tmp/qemu-build/install/var
Manual directory  /var/tmp/qemu-build/install/share/man
ELF interp prefix /usr/gnemul/qemu-%M
Source path   /tmp/qemu-test/src
C compiler    cc
Host C compiler   cc
C++ compiler  
Objective-C compiler cc
ARFLAGS   rv
CFLAGS    -O2 -U_FORTIFY_SOURCE -D_FORTIFY_SOURCE=2 -g
QEMU_CFLAGS   -I/usr/include/pixman-1 -pthread -I/usr/include/glib-2.0
-I/usr/lib64/glib-2.0/include   -fPIE -DPIE -m64 -D_GNU_SOURCE 
-D_FILE_OFFSET_BITS=64 -D_LARGEFILE_SOURCE -Wstrict-prototypes 
-Wredundant-decls -Wall -Wundef -Wwrite-strings -Wmissing-prototypes 
-fno-strict-aliasing -fno-common -fwrapv  -Wendif-labels -Wmissing-include-dirs 
-Wempty-body -Wnested-externs -Wformat-security -Wformat-y2k -Winit-self 
-Wignored-qualifiers -Wold-style-declaration -Wold-style-definition 
-Wtype-limits -fstack-protector-all
LDFLAGS   -Wl,--warn-common -Wl,-z,relro -Wl,-z,now -pie -m64 -g 
make      make
install   install
python    python -B
smbd  /usr/sbin/smbd
module support    no
host CPU  x86_64
host big endian   no
target list   x86_64-softmmu aarch64-softmmu
tcg debug enabled no
gprof enabled no
sparse enabled    no
strip binaries    yes
profiler  no
static build  no
pixman    system
SDL support   yes (1.2.14)
GTK support   no
GTK GL support    no
VTE support   no
TLS priority  

[Qemu-block] [PATCH 3/7] Replication/Blockjobs: Create replication jobs as internal

2016-10-13 Thread John Snow
Bubble up the internal interface to commit and backup jobs, then switch
replication tasks over to using this methodology.

Signed-off-by: John Snow 
---
 block/backup.c|  3 ++-
 block/mirror.c| 21 ++---
 block/replication.c   | 14 +++---
 blockdev.c| 11 +++
 include/block/block_int.h |  9 +++--
 qemu-img.c|  5 +++--
 6 files changed, 36 insertions(+), 27 deletions(-)

diff --git a/block/backup.c b/block/backup.c
index 5acb5c4..6a60ca8 100644
--- a/block/backup.c
+++ b/block/backup.c
@@ -527,6 +527,7 @@ void backup_start(const char *job_id, BlockDriverState *bs,
   bool compress,
   BlockdevOnError on_source_error,
   BlockdevOnError on_target_error,
+  int creation_flags,
   BlockCompletionFunc *cb, void *opaque,
   BlockJobTxn *txn, Error **errp)
 {
@@ -596,7 +597,7 @@ void backup_start(const char *job_id, BlockDriverState *bs,
 }
 
 job = block_job_create(job_id, &backup_job_driver, bs, speed,
-   BLOCK_JOB_DEFAULT, cb, opaque, errp);
+   creation_flags, cb, opaque, errp);
 if (!job) {
 goto error;
 }
diff --git a/block/mirror.c b/block/mirror.c
index 74c03ae..15d2d10 100644
--- a/block/mirror.c
+++ b/block/mirror.c
@@ -906,9 +906,9 @@ static const BlockJobDriver commit_active_job_driver = {
 };
 
 static void mirror_start_job(const char *job_id, BlockDriverState *bs,
- BlockDriverState *target, const char *replaces,
- int64_t speed, uint32_t granularity,
- int64_t buf_size,
+ int creation_flags, BlockDriverState *target,
+ const char *replaces, int64_t speed,
+ uint32_t granularity, int64_t buf_size,
  BlockMirrorBackingMode backing_mode,
  BlockdevOnError on_source_error,
  BlockdevOnError on_target_error,
@@ -936,8 +936,8 @@ static void mirror_start_job(const char *job_id, BlockDriverState *bs,
 buf_size = DEFAULT_MIRROR_BUF_SIZE;
 }
 
-s = block_job_create(job_id, driver, bs, speed,
- BLOCK_JOB_DEFAULT, cb, opaque, errp);
+s = block_job_create(job_id, driver, bs, speed, creation_flags,
+ cb, opaque, errp);
 if (!s) {
 return;
 }
@@ -992,17 +992,16 @@ void mirror_start(const char *job_id, BlockDriverState *bs,
 }
 is_none_mode = mode == MIRROR_SYNC_MODE_NONE;
 base = mode == MIRROR_SYNC_MODE_TOP ? backing_bs(bs) : NULL;
-mirror_start_job(job_id, bs, target, replaces,
+mirror_start_job(job_id, bs, BLOCK_JOB_DEFAULT, target, replaces,
  speed, granularity, buf_size, backing_mode,
  on_source_error, on_target_error, unmap, cb, opaque, errp,
   &mirror_job_driver, is_none_mode, base, false);
 }
 
 void commit_active_start(const char *job_id, BlockDriverState *bs,
- BlockDriverState *base, int64_t speed,
- BlockdevOnError on_error,
- BlockCompletionFunc *cb,
- void *opaque, Error **errp,
+ BlockDriverState *base, int creation_flags,
+ int64_t speed, BlockdevOnError on_error,
+ BlockCompletionFunc *cb, void *opaque, Error **errp,
  bool auto_complete)
 {
 int64_t length, base_length;
@@ -1041,7 +1040,7 @@ void commit_active_start(const char *job_id, BlockDriverState *bs,
 }
 }
 
-mirror_start_job(job_id, bs, base, NULL, speed, 0, 0,
+mirror_start_job(job_id, bs, creation_flags, base, NULL, speed, 0, 0,
  MIRROR_LEAVE_BACKING_CHAIN,
   on_error, on_error, false, cb, opaque, &local_err,
   &commit_active_job_driver, false, base, auto_complete);
diff --git a/block/replication.c b/block/replication.c
index 3bd1cf1..d4f4a7b 100644
--- a/block/replication.c
+++ b/block/replication.c
@@ -496,10 +496,11 @@ static void replication_start(ReplicationState *rs, ReplicationMode mode,
 bdrv_op_block_all(top_bs, s->blocker);
 bdrv_op_unblock(top_bs, BLOCK_OP_TYPE_DATAPLANE, s->blocker);
 
-backup_start("replication-backup", s->secondary_disk->bs,
- s->hidden_disk->bs, 0, MIRROR_SYNC_MODE_NONE, NULL, false,
+backup_start(NULL, s->secondary_disk->bs, s->hidden_disk->bs, 0,
+ MIRROR_SYNC_MODE_NONE, NULL, false,
  BLOCKDEV_ON_ERROR_REPORT, BLOCKDEV_ON_ERROR_REPORT,
- backup_job_completed, s, NULL, &local_err);
+ BLOCK_JOB_INTERNAL, backup_job_completed, s,
+   

[Qemu-block] [PATCH 5/7] Blockjobs: Internalize user_pause logic

2016-10-13 Thread John Snow
BlockJobs will begin hiding their state in preparation for some
refactorings anyway, so let's internalize the user_pause mechanism
instead of leaving it to callers to correctly manage.

Signed-off-by: John Snow 
---
 blockdev.c   | 12 +---
 blockjob.c   | 22 --
 include/block/blockjob.h | 26 ++
 3 files changed, 51 insertions(+), 9 deletions(-)

diff --git a/blockdev.c b/blockdev.c
index 22a1280..1661d08 100644
--- a/blockdev.c
+++ b/blockdev.c
@@ -3579,7 +3579,7 @@ void qmp_block_job_cancel(const char *device,
 force = false;
 }
 
-if (job->user_paused && !force) {
+if (block_job_user_paused(job) && !force) {
 error_setg(errp, "The block job for device '%s' is currently paused",
device);
 goto out;
@@ -3596,13 +3596,12 @@ void qmp_block_job_pause(const char *device, Error **errp)
 AioContext *aio_context;
 BlockJob *job = find_block_job(device, &aio_context, errp);
 
-if (!job || job->user_paused) {
+if (!job || block_job_user_paused(job)) {
 return;
 }
 
-job->user_paused = true;
 trace_qmp_block_job_pause(job);
-block_job_pause(job);
+block_job_user_pause(job);
 aio_context_release(aio_context);
 }
 
@@ -3611,14 +3610,13 @@ void qmp_block_job_resume(const char *device, Error **errp)
 AioContext *aio_context;
 BlockJob *job = find_block_job(device, &aio_context, errp);
 
-if (!job || !job->user_paused) {
+if (!job || !block_job_user_paused(job)) {
 return;
 }
 
-job->user_paused = false;
 trace_qmp_block_job_resume(job);
 block_job_iostatus_reset(job);
-block_job_resume(job);
+block_job_user_resume(job);
 aio_context_release(aio_context);
 }
 
diff --git a/blockjob.c b/blockjob.c
index e32cb78..d118a1f 100644
--- a/blockjob.c
+++ b/blockjob.c
@@ -362,11 +362,22 @@ void block_job_pause(BlockJob *job)
 job->pause_count++;
 }
 
+void block_job_user_pause(BlockJob *job)
+{
+job->user_paused = true;
+block_job_pause(job);
+}
+
 static bool block_job_should_pause(BlockJob *job)
 {
 return job->pause_count > 0;
 }
 
+bool block_job_user_paused(BlockJob *job)
+{
+return job ? job->user_paused : 0;
+}
+
 void coroutine_fn block_job_pause_point(BlockJob *job)
 {
 if (!block_job_should_pause(job)) {
@@ -403,6 +414,14 @@ void block_job_resume(BlockJob *job)
 block_job_enter(job);
 }
 
+void block_job_user_resume(BlockJob *job)
+{
+if (job && job->user_paused && job->pause_count > 0) {
+job->user_paused = false;
+block_job_resume(job);
+}
+}
+
 void block_job_enter(BlockJob *job)
 {
 if (job->co && !job->busy) {
@@ -626,8 +645,7 @@ BlockErrorAction block_job_error_action(BlockJob *job, BlockdevOnError on_err,
 }
 if (action == BLOCK_ERROR_ACTION_STOP) {
 /* make the pause user visible, which will be resumed from QMP. */
-job->user_paused = true;
-block_job_pause(job);
+block_job_user_pause(job);
 block_job_iostatus_set_err(job, error);
 }
 return action;
diff --git a/include/block/blockjob.h b/include/block/blockjob.h
index 928f0b8..5b61140 100644
--- a/include/block/blockjob.h
+++ b/include/block/blockjob.h
@@ -358,6 +358,23 @@ void coroutine_fn block_job_pause_point(BlockJob *job);
 void block_job_pause(BlockJob *job);
 
 /**
+ * block_job_user_pause:
+ * @job: The job to be paused.
+ *
+ * Asynchronously pause the specified job.
+ * Do not allow a resume until a matching call to block_job_user_resume.
+ */
+void block_job_user_pause(BlockJob *job);
+
+/**
+ * block_job_user_paused:
+ * @job: The job to query.
+ *
+ * Returns true if the job is user-paused.
+ */
+bool block_job_user_paused(BlockJob *job);
+
+/**
  * block_job_resume:
  * @job: The job to be resumed.
  *
@@ -366,6 +383,15 @@ void block_job_pause(BlockJob *job);
 void block_job_resume(BlockJob *job);
 
 /**
+ * block_job_user_resume:
+ * @job: The job to be resumed.
+ *
+ * Resume the specified job.
+ * Must be paired with a preceding block_job_user_pause.
+ */
+void block_job_user_resume(BlockJob *job);
+
+/**
  * block_job_enter:
  * @job: The job to enter.
  *
-- 
2.7.4




[Qemu-block] [PATCH 4/7] blockjob: centralize QMP event emissions

2016-10-13 Thread John Snow
There's no reason to leave this to blockdev; we can do it in blockjobs
directly and get rid of an extra callback for most users.

All non-internal events, even those created outside of QMP, will
consistently emit events.

Signed-off-by: John Snow 
---
 block/commit.c|  8 
 block/mirror.c|  6 ++
 block/stream.c|  7 +++
 block/trace-events|  5 ++---
 blockdev.c| 42 --
 blockjob.c| 23 +++
 include/block/block_int.h | 17 -
 include/block/blockjob.h  | 17 -
 8 files changed, 42 insertions(+), 83 deletions(-)

diff --git a/block/commit.c b/block/commit.c
index f29e341..475a375 100644
--- a/block/commit.c
+++ b/block/commit.c
@@ -209,8 +209,8 @@ static const BlockJobDriver commit_job_driver = {
 
 void commit_start(const char *job_id, BlockDriverState *bs,
   BlockDriverState *base, BlockDriverState *top, int64_t speed,
-  BlockdevOnError on_error, BlockCompletionFunc *cb,
-  void *opaque, const char *backing_file_str, Error **errp)
+  BlockdevOnError on_error, const char *backing_file_str,
+  Error **errp)
 {
 CommitBlockJob *s;
 BlockReopenQueue *reopen_queue = NULL;
@@ -233,7 +233,7 @@ void commit_start(const char *job_id, BlockDriverState *bs,
 }
 
 s = block_job_create(job_id, &commit_job_driver, bs, speed,
- BLOCK_JOB_DEFAULT, cb, opaque, errp);
+ BLOCK_JOB_DEFAULT, NULL, NULL, errp);
 if (!s) {
 return;
 }
@@ -276,7 +276,7 @@ void commit_start(const char *job_id, BlockDriverState *bs,
 s->on_error = on_error;
 s->common.co = qemu_coroutine_create(commit_run, s);
 
-trace_commit_start(bs, base, top, s, s->common.co, opaque);
+trace_commit_start(bs, base, top, s, s->common.co);
 qemu_coroutine_enter(s->common.co);
 }
 
diff --git a/block/mirror.c b/block/mirror.c
index 15d2d10..4374fb4 100644
--- a/block/mirror.c
+++ b/block/mirror.c
@@ -979,9 +979,7 @@ void mirror_start(const char *job_id, BlockDriverState *bs,
   MirrorSyncMode mode, BlockMirrorBackingMode backing_mode,
   BlockdevOnError on_source_error,
   BlockdevOnError on_target_error,
-  bool unmap,
-  BlockCompletionFunc *cb,
-  void *opaque, Error **errp)
+  bool unmap, Error **errp)
 {
 bool is_none_mode;
 BlockDriverState *base;
@@ -994,7 +992,7 @@ void mirror_start(const char *job_id, BlockDriverState *bs,
 base = mode == MIRROR_SYNC_MODE_TOP ? backing_bs(bs) : NULL;
 mirror_start_job(job_id, bs, BLOCK_JOB_DEFAULT, target, replaces,
  speed, granularity, buf_size, backing_mode,
- on_source_error, on_target_error, unmap, cb, opaque, errp,
+ on_source_error, on_target_error, unmap, NULL, NULL, errp,
   &mirror_job_driver, is_none_mode, base, false);
 }
 
diff --git a/block/stream.c b/block/stream.c
index eeb6f52..7d6877d 100644
--- a/block/stream.c
+++ b/block/stream.c
@@ -216,13 +216,12 @@ static const BlockJobDriver stream_job_driver = {
 
 void stream_start(const char *job_id, BlockDriverState *bs,
   BlockDriverState *base, const char *backing_file_str,
-  int64_t speed, BlockdevOnError on_error,
-  BlockCompletionFunc *cb, void *opaque, Error **errp)
+  int64_t speed, BlockdevOnError on_error, Error **errp)
 {
 StreamBlockJob *s;
 
 s = block_job_create(job_id, &stream_job_driver, bs, speed,
- BLOCK_JOB_DEFAULT, cb, opaque, errp);
+ BLOCK_JOB_DEFAULT, NULL, NULL, errp);
 if (!s) {
 return;
 }
@@ -232,6 +231,6 @@ void stream_start(const char *job_id, BlockDriverState *bs,
 
 s->on_error = on_error;
 s->common.co = qemu_coroutine_create(stream_run, s);
-trace_stream_start(bs, base, s, s->common.co, opaque);
+trace_stream_start(bs, base, s, s->common.co);
 qemu_coroutine_enter(s->common.co);
 }
diff --git a/block/trace-events b/block/trace-events
index 05fa13c..c12f91b 100644
--- a/block/trace-events
+++ b/block/trace-events
@@ -20,11 +20,11 @@ bdrv_co_do_copy_on_readv(void *bs, int64_t offset, unsigned int bytes, int64_t c
 
 # block/stream.c
stream_one_iteration(void *s, int64_t sector_num, int nb_sectors, int is_allocated) "s %p sector_num %"PRId64" nb_sectors %d is_allocated %d"
-stream_start(void *bs, void *base, void *s, void *co, void *opaque) "bs %p base %p s %p co %p opaque %p"
+stream_start(void *bs, void *base, void *s, void *co) "bs %p base %p s %p co %p"

# block/commit.c
commit_one_iteration(void *s, int64_t sector_num, int nb_sectors, int is_allocated) "s %p sector_num %"PRId64" nb_sectors %d is_allocated %d"

[Qemu-block] [PATCH 6/7] blockjobs: split interface into public/private, Part 1

2016-10-13 Thread John Snow
To make it a little more obvious which functions are intended to be
public interface and which are intended to be for use only by jobs
themselves, split the interface into "public" and "private" files.

Convert blockjobs (e.g. block/backup) to using the private interface.
Leave blockdev and others on the public interface.

There are remaining uses of private state by qemu-img, and several
cases in blockdev.c and block/io.c where we grab job->blk for the
purposes of acquiring an AIOContext.

These will be corrected in future patches.

Signed-off-by: John Snow 
---
 block/backup.c   |   2 +-
 block/commit.c   |   2 +-
 block/mirror.c   |   2 +-
 block/stream.c   |   2 +-
 blockjob.c   |   2 +-
 include/block/block.h|   3 +-
 include/block/blockjob.h | 205 +-
 include/block/blockjob_int.h | 232 +++
 tests/test-blockjob-txn.c|   2 +-
 tests/test-blockjob.c|   2 +-
 10 files changed, 244 insertions(+), 210 deletions(-)
 create mode 100644 include/block/blockjob_int.h

diff --git a/block/backup.c b/block/backup.c
index 6a60ca8..6d12100 100644
--- a/block/backup.c
+++ b/block/backup.c
@@ -16,7 +16,7 @@
 #include "trace.h"
 #include "block/block.h"
 #include "block/block_int.h"
-#include "block/blockjob.h"
+#include "block/blockjob_int.h"
 #include "block/block_backup.h"
 #include "qapi/error.h"
 #include "qapi/qmp/qerror.h"
diff --git a/block/commit.c b/block/commit.c
index 475a375..d555600 100644
--- a/block/commit.c
+++ b/block/commit.c
@@ -15,7 +15,7 @@
 #include "qemu/osdep.h"
 #include "trace.h"
 #include "block/block_int.h"
-#include "block/blockjob.h"
+#include "block/blockjob_int.h"
 #include "qapi/error.h"
 #include "qapi/qmp/qerror.h"
 #include "qemu/ratelimit.h"
diff --git a/block/mirror.c b/block/mirror.c
index 4374fb4..c81b5e0 100644
--- a/block/mirror.c
+++ b/block/mirror.c
@@ -13,7 +13,7 @@
 
 #include "qemu/osdep.h"
 #include "trace.h"
-#include "block/blockjob.h"
+#include "block/blockjob_int.h"
 #include "block/block_int.h"
 #include "sysemu/block-backend.h"
 #include "qapi/error.h"
diff --git a/block/stream.c b/block/stream.c
index 7d6877d..906f7f3 100644
--- a/block/stream.c
+++ b/block/stream.c
@@ -14,7 +14,7 @@
 #include "qemu/osdep.h"
 #include "trace.h"
 #include "block/block_int.h"
-#include "block/blockjob.h"
+#include "block/blockjob_int.h"
 #include "qapi/error.h"
 #include "qapi/qmp/qerror.h"
 #include "qemu/ratelimit.h"
diff --git a/blockjob.c b/blockjob.c
index d118a1f..e6f0d97 100644
--- a/blockjob.c
+++ b/blockjob.c
@@ -27,7 +27,7 @@
 #include "qemu-common.h"
 #include "trace.h"
 #include "block/block.h"
-#include "block/blockjob.h"
+#include "block/blockjob_int.h"
 #include "block/block_int.h"
 #include "sysemu/block-backend.h"
 #include "qapi/qmp/qerror.h"
diff --git a/include/block/block.h b/include/block/block.h
index 107c603..89b5feb 100644
--- a/include/block/block.h
+++ b/include/block/block.h
@@ -7,16 +7,15 @@
 #include "qemu/coroutine.h"
 #include "block/accounting.h"
 #include "block/dirty-bitmap.h"
+#include "block/blockjob.h"
 #include "qapi/qmp/qobject.h"
 #include "qapi-types.h"
 #include "qemu/hbitmap.h"
 
 /* block.c */
 typedef struct BlockDriver BlockDriver;
-typedef struct BlockJob BlockJob;
 typedef struct BdrvChild BdrvChild;
 typedef struct BdrvChildRole BdrvChildRole;
-typedef struct BlockJobTxn BlockJobTxn;
 
 typedef struct BlockDriverInfo {
 /* in bytes, 0 if irrelevant */
diff --git a/include/block/blockjob.h b/include/block/blockjob.h
index 5b61140..bfc8233 100644
--- a/include/block/blockjob.h
+++ b/include/block/blockjob.h
@@ -28,78 +28,15 @@
 
 #include "block/block.h"
 
-/**
- * BlockJobDriver:
- *
- * A class type for block job driver.
- */
-typedef struct BlockJobDriver {
-/** Derived BlockJob struct size */
-size_t instance_size;
-
-/** String describing the operation, part of query-block-jobs QMP API */
-BlockJobType job_type;
-
-/** Optional callback for job types that support setting a speed limit */
-void (*set_speed)(BlockJob *job, int64_t speed, Error **errp);
-
-/** Optional callback for job types that need to forward I/O status reset */
-void (*iostatus_reset)(BlockJob *job);
-
-/**
- * Optional callback for job types whose completion must be triggered
- * manually.
- */
-void (*complete)(BlockJob *job, Error **errp);
-
-/**
- * If the callback is not NULL, it will be invoked when all the jobs
- * belonging to the same transaction complete; or upon this job's
- * completion if it is not in a transaction. Skipped if NULL.
- *
- * All jobs will complete with a call to either .commit() or .abort() but
- * never both.
- */
-void (*commit)(BlockJob *job);
-
-/**
- * If the callback is not NULL, it will be invoked when any job in the
- * same transaction fails; 

[Qemu-block] [PATCH 0/7] blockjobs: preliminary refactoring work, Pt 1

2016-10-13 Thread John Snow
This is a follow-up to patches 1-6 of:
[PATCH v2 00/11] blockjobs: Fix transactional race condition

That series started trying to refactor blockjobs with the goal of
internalizing BlockJob state as a side effect of having gone through
the effort of figuring out which commands were "safe" to call on
a Job that has no coroutine object.

I've split out the less contentious bits so I can move forward with my
original work of focusing on the transactional race condition in a
different series.

Functionally the biggest difference here is the presence of "internal"
block jobs, which do not emit QMP events or show up in block query
requests. This is done for the sake of replication jobs, which should
not be interfering with the public jobs namespace.



For convenience, this branch is available at:
https://github.com/jnsnow/qemu.git branch job-refactor-pt1
https://github.com/jnsnow/qemu/tree/job-refactor-pt1

This version is tagged job-refactor-pt1-v1:
https://github.com/jnsnow/qemu/releases/tag/job-refactor-pt1-v1

John Snow (7):
  blockjobs: hide internal jobs from management API
  blockjobs: Allow creating internal jobs
  Replication/Blockjobs: Create replication jobs as internal
  blockjob: centralize QMP event emissions
  Blockjobs: Internalize user_pause logic
  blockjobs: split interface into public/private, Part 1
  blockjobs: fix documentation

 block/backup.c   |   5 +-
 block/commit.c   |  10 +-
 block/mirror.c   |  28 +++--
 block/replication.c  |  14 +--
 block/stream.c   |   9 +-
 block/trace-events   |   5 +-
 blockdev.c   |  74 +
 blockjob.c   | 109 ++
 include/block/block.h|   3 +-
 include/block/block_int.h|  26 ++---
 include/block/blockjob.h | 257 +++
 include/block/blockjob_int.h | 232 ++
 qemu-img.c   |   5 +-
 tests/test-blockjob-txn.c|   5 +-
 tests/test-blockjob.c|   4 +-
 15 files changed, 443 insertions(+), 343 deletions(-)
 create mode 100644 include/block/blockjob_int.h

-- 
2.7.4




[Qemu-block] [PATCH 2/7] blockjobs: Allow creating internal jobs

2016-10-13 Thread John Snow
Add the ability to create jobs without an ID.

Signed-off-by: John Snow 
---
 block/backup.c|  2 +-
 block/commit.c|  2 +-
 block/mirror.c|  3 ++-
 block/stream.c|  2 +-
 blockjob.c| 25 -
 include/block/blockjob.h  |  7 ++-
 tests/test-blockjob-txn.c |  3 ++-
 tests/test-blockjob.c |  2 +-
 8 files changed, 30 insertions(+), 16 deletions(-)

diff --git a/block/backup.c b/block/backup.c
index 582bd0f..5acb5c4 100644
--- a/block/backup.c
+++ b/block/backup.c
@@ -596,7 +596,7 @@ void backup_start(const char *job_id, BlockDriverState *bs,
 }
 
 job = block_job_create(job_id, &backup_job_driver, bs, speed,
-   cb, opaque, errp);
+   BLOCK_JOB_DEFAULT, cb, opaque, errp);
 if (!job) {
 goto error;
 }
diff --git a/block/commit.c b/block/commit.c
index 9f67a8b..f29e341 100644
--- a/block/commit.c
+++ b/block/commit.c
@@ -233,7 +233,7 @@ void commit_start(const char *job_id, BlockDriverState *bs,
 }
 
 s = block_job_create(job_id, &commit_job_driver, bs, speed,
- cb, opaque, errp);
+ BLOCK_JOB_DEFAULT, cb, opaque, errp);
 if (!s) {
 return;
 }
diff --git a/block/mirror.c b/block/mirror.c
index f9d1fec..74c03ae 100644
--- a/block/mirror.c
+++ b/block/mirror.c
@@ -936,7 +936,8 @@ static void mirror_start_job(const char *job_id, BlockDriverState *bs,
 buf_size = DEFAULT_MIRROR_BUF_SIZE;
 }
 
-s = block_job_create(job_id, driver, bs, speed, cb, opaque, errp);
+s = block_job_create(job_id, driver, bs, speed,
+ BLOCK_JOB_DEFAULT, cb, opaque, errp);
 if (!s) {
 return;
 }
diff --git a/block/stream.c b/block/stream.c
index 3187481..eeb6f52 100644
--- a/block/stream.c
+++ b/block/stream.c
@@ -222,7 +222,7 @@ void stream_start(const char *job_id, BlockDriverState *bs,
 StreamBlockJob *s;
 
 s = block_job_create(job_id, &stream_job_driver, bs, speed,
- cb, opaque, errp);
+ BLOCK_JOB_DEFAULT, cb, opaque, errp);
 if (!s) {
 return;
 }
diff --git a/blockjob.c b/blockjob.c
index e78ad94..017905a 100644
--- a/blockjob.c
+++ b/blockjob.c
@@ -118,7 +118,7 @@ static void block_job_detach_aio_context(void *opaque)
 }
 
 void *block_job_create(const char *job_id, const BlockJobDriver *driver,
-   BlockDriverState *bs, int64_t speed,
+   BlockDriverState *bs, int64_t speed, int flags,
BlockCompletionFunc *cb, void *opaque, Error **errp)
 {
 BlockBackend *blk;
@@ -130,7 +130,7 @@ void *block_job_create(const char *job_id, const BlockJobDriver *driver,
 return NULL;
 }
 
-if (job_id == NULL) {
+if (job_id == NULL && !(flags & BLOCK_JOB_INTERNAL)) {
 job_id = bdrv_get_device_name(bs);
 if (!*job_id) {
 error_setg(errp, "An explicit job ID is required for this node");
@@ -138,14 +138,21 @@ void *block_job_create(const char *job_id, const BlockJobDriver *driver,
 }
 }
 
-if (!id_wellformed(job_id)) {
-error_setg(errp, "Invalid job ID '%s'", job_id);
-return NULL;
-}
+if (job_id) {
+if (flags & BLOCK_JOB_INTERNAL) {
+error_setg(errp, "Cannot specify job ID for internal block job");
+return NULL;
+}
 
-if (block_job_get(job_id)) {
-error_setg(errp, "Job ID '%s' already in use", job_id);
-return NULL;
+if (!id_wellformed(job_id)) {
+error_setg(errp, "Invalid job ID '%s'", job_id);
+return NULL;
+}
+
+if (block_job_get(job_id)) {
+error_setg(errp, "Job ID '%s' already in use", job_id);
+return NULL;
+}
 }
 
 blk = blk_new();
diff --git a/include/block/blockjob.h b/include/block/blockjob.h
index 6ecfa2e..fdb31e0 100644
--- a/include/block/blockjob.h
+++ b/include/block/blockjob.h
@@ -200,6 +200,11 @@ struct BlockJob {
 QLIST_ENTRY(BlockJob) txn_list;
 };
 
+typedef enum BlockJobCreateFlags {
+BLOCK_JOB_DEFAULT = 0x00,
+BLOCK_JOB_INTERNAL = 0x01,
+} BlockJobCreateFlags;
+
 /**
  * block_job_next:
  * @job: A block job, or %NULL.
@@ -242,7 +247,7 @@ BlockJob *block_job_get(const char *id);
  * called from a wrapper that is specific to the job type.
  */
 void *block_job_create(const char *job_id, const BlockJobDriver *driver,
-   BlockDriverState *bs, int64_t speed,
+   BlockDriverState *bs, int64_t speed, int flags,
BlockCompletionFunc *cb, void *opaque, Error **errp);
 
 /**
diff --git a/tests/test-blockjob-txn.c b/tests/test-blockjob-txn.c
index d049cba..b79e0c6 100644
--- a/tests/test-blockjob-txn.c
+++ b/tests/test-blockjob-txn.c
@@ -98,7 +98,8 @@ static BlockJob *test_block_job_start(unsigned int 

[Qemu-block] [PATCH 7/7] blockjobs: fix documentation

2016-10-13 Thread John Snow
(Trivial)

Fix wrong function names in documentation.

Signed-off-by: John Snow 
---
 include/block/blockjob_int.h | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/include/block/blockjob_int.h b/include/block/blockjob_int.h
index 8eced19..10ebb38 100644
--- a/include/block/blockjob_int.h
+++ b/include/block/blockjob_int.h
@@ -191,8 +191,8 @@ void coroutine_fn block_job_pause_point(BlockJob *job);
 void block_job_enter(BlockJob *job);
 
 /**
- * block_job_ready:
- * @job: The job which is now ready to complete.
+ * block_job_event_ready:
+ * @job: The job which is now ready to be completed.
  *
  * Send a BLOCK_JOB_READY event for the specified job.
  */
-- 
2.7.4




[Qemu-block] [PATCH 1/7] blockjobs: hide internal jobs from management API

2016-10-13 Thread John Snow
If jobs are not created directly by the user, do not allow them to be
seen by the user/management utility. At the moment, 'internal' jobs are
those that do not have an ID. As of this patch it is impossible to
create such jobs.

Signed-off-by: John Snow 
---
 blockdev.c   | 17 +
 blockjob.c   | 37 +++--
 include/block/blockjob.h | 12 ++--
 3 files changed, 54 insertions(+), 12 deletions(-)

diff --git a/blockdev.c b/blockdev.c
index 07ec733..5904edb 100644
--- a/blockdev.c
+++ b/blockdev.c
@@ -3915,13 +3915,22 @@ BlockJobInfoList *qmp_query_block_jobs(Error **errp)
 BlockJob *job;
 
 for (job = block_job_next(NULL); job; job = block_job_next(job)) {
-BlockJobInfoList *elem = g_new0(BlockJobInfoList, 1);
-AioContext *aio_context = blk_get_aio_context(job->blk);
+BlockJobInfoList *elem;
+AioContext *aio_context;
 
+if (block_job_is_internal(job)) {
+continue;
+}
+elem = g_new0(BlockJobInfoList, 1);
+aio_context = blk_get_aio_context(job->blk);
 aio_context_acquire(aio_context);
-elem->value = block_job_query(job);
+elem->value = block_job_query(job, errp);
 aio_context_release(aio_context);
-
+if (!elem->value) {
+g_free(elem);
+qapi_free_BlockJobInfoList(head);
+return NULL;
+}
 *p_next = elem;
 p_next = &elem->next;
 }
diff --git a/blockjob.c b/blockjob.c
index 43fecbe..e78ad94 100644
--- a/blockjob.c
+++ b/blockjob.c
@@ -185,6 +185,11 @@ void *block_job_create(const char *job_id, const BlockJobDriver *driver,
 return job;
 }
 
+bool block_job_is_internal(BlockJob *job)
+{
+return (job->id == NULL);
+}
+
 void block_job_ref(BlockJob *job)
 {
 ++job->refcnt;
@@ -494,9 +499,15 @@ void block_job_yield(BlockJob *job)
 block_job_pause_point(job);
 }
 
-BlockJobInfo *block_job_query(BlockJob *job)
+BlockJobInfo *block_job_query(BlockJob *job, Error **errp)
 {
-BlockJobInfo *info = g_new0(BlockJobInfo, 1);
+BlockJobInfo *info;
+
+if (block_job_is_internal(job)) {
+error_setg(errp, "Cannot query QEMU internal Jobs");
+return NULL;
+}
+info = g_new0(BlockJobInfo, 1);
 info->type  = g_strdup(BlockJobType_lookup[job->driver->job_type]);
 info->device= g_strdup(job->id);
 info->len   = job->len;
@@ -519,6 +530,10 @@ static void block_job_iostatus_set_err(BlockJob *job, int error)
 
 void block_job_event_cancelled(BlockJob *job)
 {
+if (block_job_is_internal(job)) {
+return;
+}
+
 qapi_event_send_block_job_cancelled(job->driver->job_type,
 job->id,
 job->len,
@@ -529,6 +544,10 @@ void block_job_event_cancelled(BlockJob *job)
 
 void block_job_event_completed(BlockJob *job, const char *msg)
 {
+if (block_job_is_internal(job)) {
+return;
+}
+
 qapi_event_send_block_job_completed(job->driver->job_type,
 job->id,
 job->len,
@@ -543,6 +562,10 @@ void block_job_event_ready(BlockJob *job)
 {
 job->ready = true;
 
+if (block_job_is_internal(job)) {
+return;
+}
+
 qapi_event_send_block_job_ready(job->driver->job_type,
 job->id,
 job->len,
@@ -573,10 +596,12 @@ BlockErrorAction block_job_error_action(BlockJob *job, BlockdevOnError on_err,
 default:
 abort();
 }
-qapi_event_send_block_job_error(job->id,
-is_read ? IO_OPERATION_TYPE_READ :
-IO_OPERATION_TYPE_WRITE,
-action, &error_abort);
+if (!block_job_is_internal(job)) {
+qapi_event_send_block_job_error(job->id,
+is_read ? IO_OPERATION_TYPE_READ :
+IO_OPERATION_TYPE_WRITE,
+action, &error_abort);
+}
 if (action == BLOCK_ERROR_ACTION_STOP) {
 /* make the pause user visible, which will be resumed from QMP. */
 job->user_paused = true;
diff --git a/include/block/blockjob.h b/include/block/blockjob.h
index 4ddb4ae..6ecfa2e 100644
--- a/include/block/blockjob.h
+++ b/include/block/blockjob.h
@@ -107,7 +107,7 @@ struct BlockJob {
 BlockBackend *blk;
 
 /**
- * The ID of the block job.
+ * The ID of the block job. May be NULL for internal jobs.
  */
 char *id;
 
@@ -333,7 +333,7 @@ bool block_job_is_cancelled(BlockJob *job);
  *
  * Return information about a job.
  */
-BlockJobInfo *block_job_query(BlockJob *job);
+BlockJobInfo *block_job_query(BlockJob *job, Error **errp);
 
 /**
  * block_job_pause_point:
@@ -504,4 +504,12 @@ void 

[Qemu-block] [PATCH v10 09/10] tests: Add test code for hbitmap serialization

2016-10-13 Thread John Snow
From: Fam Zheng 

Signed-off-by: Fam Zheng 
[Fixed minor constant issue. --js]
Signed-off-by: John Snow 

Signed-off-by: John Snow 
---
 tests/test-hbitmap.c | 156 +++
 1 file changed, 156 insertions(+)

diff --git a/tests/test-hbitmap.c b/tests/test-hbitmap.c
index e3abde1..9b7495c 100644
--- a/tests/test-hbitmap.c
+++ b/tests/test-hbitmap.c
@@ -11,6 +11,7 @@
 
 #include "qemu/osdep.h"
 #include "qemu/hbitmap.h"
+#include "qemu/bitmap.h"
 #include "block/block.h"
 
 #define LOG_BITS_PER_LONG  (BITS_PER_LONG == 32 ? 5 : 6)
@@ -737,6 +738,16 @@ static void test_hbitmap_meta_one(TestHBitmapData *data, const void *unused)
 }
 }
 
+static void test_hbitmap_serialize_granularity(TestHBitmapData *data,
+   const void *unused)
+{
+int r;
+
+hbitmap_test_init(data, L3 * 2, 3);
+r = hbitmap_serialization_granularity(data->hb);
+g_assert_cmpint(r, ==, 64 << 3);
+}
+
 static void test_hbitmap_meta_zero(TestHBitmapData *data, const void *unused)
 {
 hbitmap_test_init_meta(data, 0, 0, 1);
@@ -744,6 +755,142 @@ static void test_hbitmap_meta_zero(TestHBitmapData *data, const void *unused)
 hbitmap_check_meta(data, 0, 0);
 }
 
+static void hbitmap_test_serialize_range(TestHBitmapData *data,
+ uint8_t *buf, size_t buf_size,
+ uint64_t pos, uint64_t count)
+{
+size_t i;
+unsigned long *el = (unsigned long *)buf;
+
+assert(hbitmap_granularity(data->hb) == 0);
+hbitmap_reset_all(data->hb);
+memset(buf, 0, buf_size);
+if (count) {
+hbitmap_set(data->hb, pos, count);
+}
+hbitmap_serialize_part(data->hb, buf, 0, data->size);
+
+/* Serialized buffer is inherently LE, convert it back manually to test */
+for (i = 0; i < buf_size / sizeof(unsigned long); i++) {
+el[i] = (BITS_PER_LONG == 32 ? le32_to_cpu(el[i]) : le64_to_cpu(el[i]));
+}
+
+for (i = 0; i < data->size; i++) {
+int is_set = test_bit(i, (unsigned long *)buf);
+if (i >= pos && i < pos + count) {
+g_assert(is_set);
+} else {
+g_assert(!is_set);
+}
+}
+
+/* Re-serialize for deserialization testing */
+memset(buf, 0, buf_size);
+hbitmap_serialize_part(data->hb, buf, 0, data->size);
+hbitmap_reset_all(data->hb);
+hbitmap_deserialize_part(data->hb, buf, 0, data->size, true);
+
+for (i = 0; i < data->size; i++) {
+int is_set = hbitmap_get(data->hb, i);
+if (i >= pos && i < pos + count) {
+g_assert(is_set);
+} else {
+g_assert(!is_set);
+}
+}
+}
+
+static void test_hbitmap_serialize_basic(TestHBitmapData *data,
+ const void *unused)
+{
+int i, j;
+size_t buf_size;
+uint8_t *buf;
+uint64_t positions[] = { 0, 1, L1 - 1, L1, L2 - 1, L2, L2 + 1, L3 - 1 };
+int num_positions = sizeof(positions) / sizeof(positions[0]);
+
+hbitmap_test_init(data, L3, 0);
+buf_size = hbitmap_serialization_size(data->hb, 0, data->size);
+buf = g_malloc0(buf_size);
+
+for (i = 0; i < num_positions; i++) {
+for (j = 0; j < num_positions; j++) {
+hbitmap_test_serialize_range(data, buf, buf_size,
+ positions[i],
+ MIN(positions[j], L3 - positions[i]));
+}
+}
+
+g_free(buf);
+}
+
+static void test_hbitmap_serialize_part(TestHBitmapData *data,
+const void *unused)
+{
+int i, j, k;
+size_t buf_size;
+uint8_t *buf;
+uint64_t positions[] = { 0, 1, L1 - 1, L1, L2 - 1, L2, L2 + 1, L3 - 1 };
+int num_positions = sizeof(positions) / sizeof(positions[0]);
+
+hbitmap_test_init(data, L3, 0);
+buf_size = L2;
+buf = g_malloc0(buf_size);
+
+for (i = 0; i < num_positions; i++) {
+hbitmap_set(data->hb, positions[i], 1);
+}
+
+for (i = 0; i < data->size; i += buf_size) {
+unsigned long *el = (unsigned long *)buf;
+hbitmap_serialize_part(data->hb, buf, i, buf_size);
+for (j = 0; j < buf_size / sizeof(unsigned long); j++) {
+el[j] = (BITS_PER_LONG == 32 ? le32_to_cpu(el[j]) : le64_to_cpu(el[j]));
+}
+
+for (j = 0; j < buf_size; j++) {
+bool should_set = false;
+for (k = 0; k < num_positions; k++) {
+if (positions[k] == j + i) {
+should_set = true;
+break;
+}
+}
+g_assert_cmpint(should_set, ==, test_bit(j, (unsigned long *)buf));
+}
+}
+
+g_free(buf);
+}
+
+static void test_hbitmap_serialize_zeroes(TestHBitmapData *data,
+  const void *unused)

[Qemu-block] [PATCH v10 10/10] block: More operations for meta dirty bitmap

2016-10-13 Thread John Snow
From: Fam Zheng 

Callers can create an iterator of the meta bitmap with
bdrv_dirty_meta_iter_new(), then use the bdrv_dirty_iter_* operations on
it. Meta iterators are also counted by bitmap->active_iterators.

Also add a couple of functions to retrieve granularity and count.

Signed-off-by: Fam Zheng 
Reviewed-by: Max Reitz 
Signed-off-by: John Snow 
---
 block/dirty-bitmap.c | 19 +++
 include/block/dirty-bitmap.h |  3 +++
 2 files changed, 22 insertions(+)

diff --git a/block/dirty-bitmap.c b/block/dirty-bitmap.c
index 384146b..519737c 100644
--- a/block/dirty-bitmap.c
+++ b/block/dirty-bitmap.c
@@ -393,6 +393,11 @@ uint32_t bdrv_dirty_bitmap_granularity(BdrvDirtyBitmap *bitmap)
 return BDRV_SECTOR_SIZE << hbitmap_granularity(bitmap->bitmap);
 }
 
+uint32_t bdrv_dirty_bitmap_meta_granularity(BdrvDirtyBitmap *bitmap)
+{
+return BDRV_SECTOR_SIZE << hbitmap_granularity(bitmap->meta);
+}
+
 BdrvDirtyBitmapIter *bdrv_dirty_iter_new(BdrvDirtyBitmap *bitmap,
  uint64_t first_sector)
 {
@@ -403,6 +408,15 @@ BdrvDirtyBitmapIter *bdrv_dirty_iter_new(BdrvDirtyBitmap *bitmap,
 return iter;
 }
 
+BdrvDirtyBitmapIter *bdrv_dirty_meta_iter_new(BdrvDirtyBitmap *bitmap)
+{
+BdrvDirtyBitmapIter *iter = g_new(BdrvDirtyBitmapIter, 1);
+hbitmap_iter_init(&iter->hbi, bitmap->meta, 0);
+iter->bitmap = bitmap;
+bitmap->active_iterators++;
+return iter;
+}
+
 void bdrv_dirty_iter_free(BdrvDirtyBitmapIter *iter)
 {
 if (!iter) {
@@ -514,3 +528,8 @@ int64_t bdrv_get_dirty_count(BdrvDirtyBitmap *bitmap)
 {
 return hbitmap_count(bitmap->bitmap);
 }
+
+int64_t bdrv_get_meta_dirty_count(BdrvDirtyBitmap *bitmap)
+{
+return hbitmap_count(bitmap->meta);
+}
diff --git a/include/block/dirty-bitmap.h b/include/block/dirty-bitmap.h
index efc2965..9dea14b 100644
--- a/include/block/dirty-bitmap.h
+++ b/include/block/dirty-bitmap.h
@@ -30,6 +30,7 @@ void bdrv_enable_dirty_bitmap(BdrvDirtyBitmap *bitmap);
 BlockDirtyInfoList *bdrv_query_dirty_bitmaps(BlockDriverState *bs);
 uint32_t bdrv_get_default_bitmap_granularity(BlockDriverState *bs);
 uint32_t bdrv_dirty_bitmap_granularity(BdrvDirtyBitmap *bitmap);
+uint32_t bdrv_dirty_bitmap_meta_granularity(BdrvDirtyBitmap *bitmap);
 bool bdrv_dirty_bitmap_enabled(BdrvDirtyBitmap *bitmap);
 bool bdrv_dirty_bitmap_frozen(BdrvDirtyBitmap *bitmap);
 const char *bdrv_dirty_bitmap_name(const BdrvDirtyBitmap *bitmap);
@@ -47,12 +48,14 @@ int bdrv_dirty_bitmap_get_meta(BlockDriverState *bs,
 void bdrv_dirty_bitmap_reset_meta(BlockDriverState *bs,
   BdrvDirtyBitmap *bitmap, int64_t sector,
   int nb_sectors);
+BdrvDirtyBitmapIter *bdrv_dirty_meta_iter_new(BdrvDirtyBitmap *bitmap);
 BdrvDirtyBitmapIter *bdrv_dirty_iter_new(BdrvDirtyBitmap *bitmap,
  uint64_t first_sector);
 void bdrv_dirty_iter_free(BdrvDirtyBitmapIter *iter);
 int64_t bdrv_dirty_iter_next(BdrvDirtyBitmapIter *iter);
 void bdrv_set_dirty_iter(BdrvDirtyBitmapIter *hbi, int64_t sector_num);
 int64_t bdrv_get_dirty_count(BdrvDirtyBitmap *bitmap);
+int64_t bdrv_get_meta_dirty_count(BdrvDirtyBitmap *bitmap);
 void bdrv_dirty_bitmap_truncate(BlockDriverState *bs);
 
 uint64_t bdrv_dirty_bitmap_serialization_size(const BdrvDirtyBitmap *bitmap,
-- 
2.7.4




[Qemu-block] [PATCH v10 03/10] tests: Add test code for meta bitmap

2016-10-13 Thread John Snow
From: Fam Zheng 

Signed-off-by: Fam Zheng 
Reviewed-by: John Snow 
Reviewed-by: Max Reitz 
Signed-off-by: John Snow 
---
 tests/test-hbitmap.c | 116 +++
 1 file changed, 116 insertions(+)

diff --git a/tests/test-hbitmap.c b/tests/test-hbitmap.c
index c0e9895..e3abde1 100644
--- a/tests/test-hbitmap.c
+++ b/tests/test-hbitmap.c
@@ -11,6 +11,7 @@
 
 #include "qemu/osdep.h"
 #include "qemu/hbitmap.h"
+#include "block/block.h"
 
 #define LOG_BITS_PER_LONG  (BITS_PER_LONG == 32 ? 5 : 6)
 
@@ -20,6 +21,7 @@
 
 typedef struct TestHBitmapData {
 HBitmap   *hb;
+HBitmap   *meta;
 unsigned long *bits;
 size_t size;
 size_t old_size;
@@ -91,6 +93,14 @@ static void hbitmap_test_init(TestHBitmapData *data,
 }
 }
 
+static void hbitmap_test_init_meta(TestHBitmapData *data,
+   uint64_t size, int granularity,
+   int meta_chunk)
+{
+hbitmap_test_init(data, size, granularity);
+data->meta = hbitmap_create_meta(data->hb, meta_chunk);
+}
+
 static inline size_t hbitmap_test_array_size(size_t bits)
 {
 size_t n = DIV_ROUND_UP(bits, BITS_PER_LONG);
@@ -133,6 +143,9 @@ static void hbitmap_test_teardown(TestHBitmapData *data,
   const void *unused)
 {
 if (data->hb) {
+if (data->meta) {
+hbitmap_free_meta(data->hb);
+}
 hbitmap_free(data->hb);
 data->hb = NULL;
 }
@@ -634,6 +647,103 @@ static void test_hbitmap_truncate_shrink_large(TestHBitmapData *data,
 hbitmap_test_truncate(data, size, -diff, 0);
 }
 
+static void hbitmap_check_meta(TestHBitmapData *data,
+   int64_t start, int count)
+{
+int64_t i;
+
+for (i = 0; i < data->size; i++) {
+if (i >= start && i < start + count) {
+g_assert(hbitmap_get(data->meta, i));
+} else {
+g_assert(!hbitmap_get(data->meta, i));
+}
+}
+}
+
+static void hbitmap_test_meta(TestHBitmapData *data,
+  int64_t start, int count,
+  int64_t check_start, int check_count)
+{
+hbitmap_reset_all(data->hb);
+hbitmap_reset_all(data->meta);
+
+/* Test "unset" -> "unset" will not update meta. */
+hbitmap_reset(data->hb, start, count);
+hbitmap_check_meta(data, 0, 0);
+
+/* Test "unset" -> "set" will update meta */
+hbitmap_set(data->hb, start, count);
+hbitmap_check_meta(data, check_start, check_count);
+
+/* Test "set" -> "set" will not update meta */
+hbitmap_reset_all(data->meta);
+hbitmap_set(data->hb, start, count);
+hbitmap_check_meta(data, 0, 0);
+
+/* Test "set" -> "unset" will update meta */
+hbitmap_reset_all(data->meta);
+hbitmap_reset(data->hb, start, count);
+hbitmap_check_meta(data, check_start, check_count);
+}
+
+static void hbitmap_test_meta_do(TestHBitmapData *data, int chunk_size)
+{
+uint64_t size = chunk_size * 100;
+hbitmap_test_init_meta(data, size, 0, chunk_size);
+
+hbitmap_test_meta(data, 0, 1, 0, chunk_size);
+hbitmap_test_meta(data, 0, chunk_size, 0, chunk_size);
+hbitmap_test_meta(data, chunk_size - 1, 1, 0, chunk_size);
+hbitmap_test_meta(data, chunk_size - 1, 2, 0, chunk_size * 2);
+hbitmap_test_meta(data, chunk_size - 1, chunk_size + 1, 0, chunk_size * 2);
+hbitmap_test_meta(data, chunk_size - 1, chunk_size + 2, 0, chunk_size * 3);
+hbitmap_test_meta(data, 7 * chunk_size - 1, chunk_size + 2,
+  6 * chunk_size, chunk_size * 3);
+hbitmap_test_meta(data, size - 1, 1, size - chunk_size, chunk_size);
+hbitmap_test_meta(data, 0, size, 0, size);
+}
+
+static void test_hbitmap_meta_byte(TestHBitmapData *data, const void *unused)
+{
+hbitmap_test_meta_do(data, BITS_PER_BYTE);
+}
+
+static void test_hbitmap_meta_word(TestHBitmapData *data, const void *unused)
+{
+hbitmap_test_meta_do(data, BITS_PER_LONG);
+}
+
+static void test_hbitmap_meta_sector(TestHBitmapData *data, const void *unused)
+{
+hbitmap_test_meta_do(data, BDRV_SECTOR_SIZE * BITS_PER_BYTE);
+}
+
+/**
+ * Create an HBitmap and test set/unset.
+ */
+static void test_hbitmap_meta_one(TestHBitmapData *data, const void *unused)
+{
+int i;
+int64_t offsets[] = {
+0, 1, L1 - 1, L1, L1 + 1, L2 - 1, L2, L2 + 1, L3 - 1, L3, L3 + 1
+};
+
+hbitmap_test_init_meta(data, L3 * 2, 0, 1);
+for (i = 0; i < ARRAY_SIZE(offsets); i++) {
+hbitmap_test_meta(data, offsets[i], 1, offsets[i], 1);
+hbitmap_test_meta(data, offsets[i], L1, offsets[i], L1);
+hbitmap_test_meta(data, offsets[i], L2, offsets[i], L2);
+}
+}
+
+static void test_hbitmap_meta_zero(TestHBitmapData *data, const void *unused)
+{
+hbitmap_test_init_meta(data, 0, 0, 1);
+
+

[Qemu-block] [PATCH v10 08/10] block: BdrvDirtyBitmap serialization interface

2016-10-13 Thread John Snow
From: Vladimir Sementsov-Ogievskiy 

Several functions to provide necessary access to BdrvDirtyBitmap for
block-migration.c

Signed-off-by: Vladimir Sementsov-Ogievskiy 
[Add the "finish" parameters. - Fam]
Signed-off-by: Fam Zheng 
Reviewed-by: John Snow 
Reviewed-by: Max Reitz 

Signed-off-by: John Snow 
---
 block/dirty-bitmap.c | 37 +
 include/block/dirty-bitmap.h | 14 ++
 2 files changed, 51 insertions(+)

diff --git a/block/dirty-bitmap.c b/block/dirty-bitmap.c
index 31d5296..384146b 100644
--- a/block/dirty-bitmap.c
+++ b/block/dirty-bitmap.c
@@ -453,6 +453,43 @@ void bdrv_undo_clear_dirty_bitmap(BdrvDirtyBitmap *bitmap, HBitmap *in)
 hbitmap_free(tmp);
 }
 
+uint64_t bdrv_dirty_bitmap_serialization_size(const BdrvDirtyBitmap *bitmap,
+  uint64_t start, uint64_t count)
+{
+return hbitmap_serialization_size(bitmap->bitmap, start, count);
+}
+
+uint64_t bdrv_dirty_bitmap_serialization_align(const BdrvDirtyBitmap *bitmap)
+{
+return hbitmap_serialization_granularity(bitmap->bitmap);
+}
+
+void bdrv_dirty_bitmap_serialize_part(const BdrvDirtyBitmap *bitmap,
+  uint8_t *buf, uint64_t start,
+  uint64_t count)
+{
+hbitmap_serialize_part(bitmap->bitmap, buf, start, count);
+}
+
+void bdrv_dirty_bitmap_deserialize_part(BdrvDirtyBitmap *bitmap,
+uint8_t *buf, uint64_t start,
+uint64_t count, bool finish)
+{
+hbitmap_deserialize_part(bitmap->bitmap, buf, start, count, finish);
+}
+
+void bdrv_dirty_bitmap_deserialize_zeroes(BdrvDirtyBitmap *bitmap,
+  uint64_t start, uint64_t count,
+  bool finish)
+{
+hbitmap_deserialize_zeroes(bitmap->bitmap, start, count, finish);
+}
+
+void bdrv_dirty_bitmap_deserialize_finish(BdrvDirtyBitmap *bitmap)
+{
+hbitmap_deserialize_finish(bitmap->bitmap);
+}
+
 void bdrv_set_dirty(BlockDriverState *bs, int64_t cur_sector,
 int64_t nr_sectors)
 {
diff --git a/include/block/dirty-bitmap.h b/include/block/dirty-bitmap.h
index c4e7858..efc2965 100644
--- a/include/block/dirty-bitmap.h
+++ b/include/block/dirty-bitmap.h
@@ -55,4 +55,18 @@ void bdrv_set_dirty_iter(BdrvDirtyBitmapIter *hbi, int64_t sector_num);
 int64_t bdrv_get_dirty_count(BdrvDirtyBitmap *bitmap);
 void bdrv_dirty_bitmap_truncate(BlockDriverState *bs);
 
+uint64_t bdrv_dirty_bitmap_serialization_size(const BdrvDirtyBitmap *bitmap,
+  uint64_t start, uint64_t count);
+uint64_t bdrv_dirty_bitmap_serialization_align(const BdrvDirtyBitmap *bitmap);
+void bdrv_dirty_bitmap_serialize_part(const BdrvDirtyBitmap *bitmap,
+  uint8_t *buf, uint64_t start,
+  uint64_t count);
+void bdrv_dirty_bitmap_deserialize_part(BdrvDirtyBitmap *bitmap,
+uint8_t *buf, uint64_t start,
+uint64_t count, bool finish);
+void bdrv_dirty_bitmap_deserialize_zeroes(BdrvDirtyBitmap *bitmap,
+  uint64_t start, uint64_t count,
+  bool finish);
+void bdrv_dirty_bitmap_deserialize_finish(BdrvDirtyBitmap *bitmap);
+
 #endif
-- 
2.7.4




[Qemu-block] [PATCH v10 06/10] block: Assert that bdrv_release_dirty_bitmap succeeded

2016-10-13 Thread John Snow
From: Fam Zheng 

We use a loop over bs->dirty_bitmaps to make sure the caller is
only releasing a bitmap owned by bs. Let's also assert that in this case
the caller is releasing a bitmap that does exist.

Signed-off-by: Fam Zheng 
Reviewed-by: Max Reitz 
Signed-off-by: John Snow 
---
 block/dirty-bitmap.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/block/dirty-bitmap.c b/block/dirty-bitmap.c
index 860acc9..31d5296 100644
--- a/block/dirty-bitmap.c
+++ b/block/dirty-bitmap.c
@@ -305,6 +305,9 @@ static void bdrv_do_release_matching_dirty_bitmap(BlockDriverState *bs,
 }
 }
 }
+if (bitmap) {
+abort();
+}
 }
 
 void bdrv_release_dirty_bitmap(BlockDriverState *bs, BdrvDirtyBitmap *bitmap)
-- 
2.7.4




[Qemu-block] [PATCH v10 00/10] Dirty bitmap changes for migration/persistence work

2016-10-13 Thread John Snow
Key:
[----] : patches are identical
[####] : number of functional differences between upstream/downstream patch
[down] : patch is downstream-only
The flags [FC] indicate (F)unctional and (C)ontextual differences, respectively

001/10:[----] [--] 'block: Hide HBitmap in block dirty bitmap interface'
002/10:[----] [--] 'HBitmap: Introduce "meta" bitmap to track bit changes'
003/10:[----] [--] 'tests: Add test code for meta bitmap'
004/10:[----] [--] 'block: Support meta dirty bitmap'
005/10:[----] [--] 'block: Add two dirty bitmap getters'
006/10:[----] [--] 'block: Assert that bdrv_release_dirty_bitmap succeeded'
007/10:[----] [--] 'hbitmap: serialization'
008/10:[----] [--] 'block: BdrvDirtyBitmap serialization interface'
009/10:[0005] [FC] 'tests: Add test code for hbitmap serialization'
010/10:[----] [--] 'block: More operations for meta dirty bitmap'

===
v10: Now with less bits
===

09: Fixed tests to work correctly on 32bit machines, Thanks Max.

===
v9: Wow, it's been a while.
===

07: Replaced size_t by uint64_t (Fixes 32bit build)
09: Fixed serialization tests for BE machines

===
v8: Hello, is it v8 you're looking for?
===

01: Rebase conflict over int64_t header for bdrv_reset_dirty_bitmap.
04: Revised sector math to make literally any sense.
08: Rebase conflict over int64_t header for bdrv_reset_dirty_bitmap.

===
v7:
===

02: Fix rebase mishap.
04: Slight loop adjustment.
09: Fix constant on 32bit machines.

===
v6: Rebase. Stole series from Fam.
===

02: Added documentation changes as suggested by Max.

===
v5: Rebase: first 5 patches from last revision are already merged.
===

Addressed Max's comments:

01: - "block.c" -> "block/dirty-bitmap.c" in commit message.
- "an BdrvDirtyBitmapIter" -> "a BdrvDirtyBitmapIter" in code comment.
- hbitmap_next => next_dirty as variable name.
- bdrv_dirty_iter_free()/bdrv_dirty_iter_new() pairs =>
  bdrv_set_dirty_iter.

02: Move the assert fix into 01.

04: Truncate the meta bitmap (done by hbitmap_truncate).

06: Add Max's r-b.

07: I left the memcpy vs cpu_to_le32/64w as is to pick up Max's r-b. That
could be improved on top if wanted.

10: Add Max's r-b.



For convenience, this branch is available at:
https://github.com/jnsnow/qemu.git branch meta-bitmap
https://github.com/jnsnow/qemu/tree/meta-bitmap

This version is tagged meta-bitmap-v10:
https://github.com/jnsnow/qemu/releases/tag/meta-bitmap-v10

Fam Zheng (8):
  block: Hide HBitmap in block dirty bitmap interface
  HBitmap: Introduce "meta" bitmap to track bit changes
  tests: Add test code for meta bitmap
  block: Support meta dirty bitmap
  block: Add two dirty bitmap getters
  block: Assert that bdrv_release_dirty_bitmap succeeded
  tests: Add test code for hbitmap serialization
  block: More operations for meta dirty bitmap

Vladimir Sementsov-Ogievskiy (2):
  hbitmap: serialization
  block: BdrvDirtyBitmap serialization interface

 block/backup.c   |  14 ++-
 block/dirty-bitmap.c | 160 -
 block/mirror.c   |  24 ++--
 include/block/dirty-bitmap.h |  35 +-
 include/qemu/hbitmap.h   | 100 
 include/qemu/typedefs.h  |   1 +
 tests/test-hbitmap.c | 272 +++
 util/hbitmap.c   | 206 +---
 8 files changed, 772 insertions(+), 40 deletions(-)

-- 
2.7.4




[Qemu-block] [PATCH v10 07/10] hbitmap: serialization

2016-10-13 Thread John Snow
From: Vladimir Sementsov-Ogievskiy 

Functions to serialize / deserialize(restore) HBitmap. HBitmap should be
saved to linear sequence of bits independently of endianness and bitmap
array element (unsigned long) size. Therefore Little Endian is chosen.

These functions are appropriate for dirty bitmap migration: the bitmap can
be restored in several steps. To save performance, every step writes only
the last level of the bitmap. All other levels are restored by
hbitmap_deserialize_finish() as the last step of restoring, so the HBitmap
is inconsistent while being restored.

Signed-off-by: Vladimir Sementsov-Ogievskiy 
[Fix left shift operand to 1UL; add "finish" parameter. - Fam]
Signed-off-by: Fam Zheng 

Signed-off-by: John Snow 
---
 include/qemu/hbitmap.h |  79 
 util/hbitmap.c | 137 +
 2 files changed, 216 insertions(+)

diff --git a/include/qemu/hbitmap.h b/include/qemu/hbitmap.h
index 1725919..eb46475 100644
--- a/include/qemu/hbitmap.h
+++ b/include/qemu/hbitmap.h
@@ -146,6 +146,85 @@ void hbitmap_reset_all(HBitmap *hb);
 bool hbitmap_get(const HBitmap *hb, uint64_t item);
 
 /**
+ * hbitmap_serialization_granularity:
+ * @hb: HBitmap to operate on.
+ *
+ * Granularity of serialization chunks, used by other serialization functions.
+ * For every chunk:
+ * 1. Chunk start should be aligned to this granularity.
+ * 2. Chunk size should be aligned too, except for last chunk (for which
+ *  start + count == hb->size)
+ */
+uint64_t hbitmap_serialization_granularity(const HBitmap *hb);
+
+/**
+ * hbitmap_serialization_size:
+ * @hb: HBitmap to operate on.
+ * @start: Starting bit
+ * @count: Number of bits
+ *
+ * Return number of bytes hbitmap_(de)serialize_part needs
+ */
+uint64_t hbitmap_serialization_size(const HBitmap *hb,
+uint64_t start, uint64_t count);
+
+/**
+ * hbitmap_serialize_part
+ * @hb: HBitmap to operate on.
+ * @buf: Buffer to store serialized bitmap.
+ * @start: First bit to store.
+ * @count: Number of bits to store.
+ *
+ * Stores HBitmap data corresponding to given region. The format of saved data
+ * is linear sequence of bits, so it can be used by hbitmap_deserialize_part
+ * independently of endianness and size of HBitmap level array elements
+ */
+void hbitmap_serialize_part(const HBitmap *hb, uint8_t *buf,
+uint64_t start, uint64_t count);
+
+/**
+ * hbitmap_deserialize_part
+ * @hb: HBitmap to operate on.
+ * @buf: Buffer to restore bitmap data from.
+ * @start: First bit to restore.
+ * @count: Number of bits to restore.
+ * @finish: Whether to call hbitmap_deserialize_finish automatically.
+ *
+ * Restores HBitmap data corresponding to given region. The format is the same
+ * as for hbitmap_serialize_part.
+ *
+ * If @finish is false, caller must call hbitmap_deserialize_finish before
+ * using the bitmap.
+ */
+void hbitmap_deserialize_part(HBitmap *hb, uint8_t *buf,
+  uint64_t start, uint64_t count,
+  bool finish);
+
+/**
+ * hbitmap_deserialize_zeroes
+ * @hb: HBitmap to operate on.
+ * @start: First bit to restore.
+ * @count: Number of bits to restore.
+ * @finish: Whether to call hbitmap_deserialize_finish automatically.
+ *
+ * Fills the bitmap with zeroes.
+ *
+ * If @finish is false, caller must call hbitmap_deserialize_finish before
+ * using the bitmap.
+ */
+void hbitmap_deserialize_zeroes(HBitmap *hb, uint64_t start, uint64_t count,
+bool finish);
+
+/**
+ * hbitmap_deserialize_finish
+ * @hb: HBitmap to operate on.
+ *
+ * Repair HBitmap after calling hbitmap_deserialize_data. Actually, all HBitmap
+ * layers are restored here.
+ */
+void hbitmap_deserialize_finish(HBitmap *hb);
+
+/**
  * hbitmap_free:
  * @hb: HBitmap to operate on.
  *
diff --git a/util/hbitmap.c b/util/hbitmap.c
index f303975..5d1a21c 100644
--- a/util/hbitmap.c
+++ b/util/hbitmap.c
@@ -397,6 +397,143 @@ bool hbitmap_get(const HBitmap *hb, uint64_t item)
 return (hb->levels[HBITMAP_LEVELS - 1][pos >> BITS_PER_LEVEL] & bit) != 0;
 }
 
+uint64_t hbitmap_serialization_granularity(const HBitmap *hb)
+{
+/* Require at least 64 bit granularity to be safe on both 64 bit and 32 bit
+ * hosts. */
+return 64 << hb->granularity;
+}
+
+/* Start should be aligned to serialization granularity, chunk size should be
+ * aligned to serialization granularity too, except for last chunk.
+ */
+static void serialization_chunk(const HBitmap *hb,
+uint64_t start, uint64_t count,
+unsigned long **first_el, uint64_t *el_count)
+{
+uint64_t last = start + count - 1;
+uint64_t gran = hbitmap_serialization_granularity(hb);
+
+assert((start & (gran - 1)) == 0);
+assert((last >> hb->granularity) < 

[Qemu-block] [PATCH v10 05/10] block: Add two dirty bitmap getters

2016-10-13 Thread John Snow
From: Fam Zheng 

For dirty bitmap users to get the size and the name of a
BdrvDirtyBitmap.

Signed-off-by: Fam Zheng 
Reviewed-by: John Snow 
Reviewed-by: Max Reitz 
Signed-off-by: John Snow 
---
 block/dirty-bitmap.c | 10 ++
 include/block/dirty-bitmap.h |  2 ++
 2 files changed, 12 insertions(+)

diff --git a/block/dirty-bitmap.c b/block/dirty-bitmap.c
index 9c6febb..860acc9 100644
--- a/block/dirty-bitmap.c
+++ b/block/dirty-bitmap.c
@@ -154,6 +154,16 @@ void bdrv_dirty_bitmap_reset_meta(BlockDriverState *bs,
 hbitmap_reset(bitmap->meta, sector, nb_sectors);
 }
 
+int64_t bdrv_dirty_bitmap_size(const BdrvDirtyBitmap *bitmap)
+{
+return bitmap->size;
+}
+
+const char *bdrv_dirty_bitmap_name(const BdrvDirtyBitmap *bitmap)
+{
+return bitmap->name;
+}
+
 bool bdrv_dirty_bitmap_frozen(BdrvDirtyBitmap *bitmap)
 {
 return bitmap->successor;
diff --git a/include/block/dirty-bitmap.h b/include/block/dirty-bitmap.h
index 69c500b..c4e7858 100644
--- a/include/block/dirty-bitmap.h
+++ b/include/block/dirty-bitmap.h
@@ -32,6 +32,8 @@ uint32_t bdrv_get_default_bitmap_granularity(BlockDriverState *bs);
 uint32_t bdrv_dirty_bitmap_granularity(BdrvDirtyBitmap *bitmap);
 bool bdrv_dirty_bitmap_enabled(BdrvDirtyBitmap *bitmap);
 bool bdrv_dirty_bitmap_frozen(BdrvDirtyBitmap *bitmap);
+const char *bdrv_dirty_bitmap_name(const BdrvDirtyBitmap *bitmap);
+int64_t bdrv_dirty_bitmap_size(const BdrvDirtyBitmap *bitmap);
 DirtyBitmapStatus bdrv_dirty_bitmap_status(BdrvDirtyBitmap *bitmap);
 int bdrv_get_dirty(BlockDriverState *bs, BdrvDirtyBitmap *bitmap,
int64_t sector);
-- 
2.7.4




[Qemu-block] [PATCH v10 04/10] block: Support meta dirty bitmap

2016-10-13 Thread John Snow
From: Fam Zheng 

The added group of operations enables tracking of the changed bits in
the dirty bitmap.

Signed-off-by: Fam Zheng 
Reviewed-by: Max Reitz 
Signed-off-by: John Snow 
---
 block/dirty-bitmap.c | 52 
 include/block/dirty-bitmap.h |  9 
 2 files changed, 61 insertions(+)

diff --git a/block/dirty-bitmap.c b/block/dirty-bitmap.c
index c572dfa..9c6febb 100644
--- a/block/dirty-bitmap.c
+++ b/block/dirty-bitmap.c
@@ -38,6 +38,7 @@
  */
 struct BdrvDirtyBitmap {
 HBitmap *bitmap;/* Dirty sector bitmap implementation */
+HBitmap *meta;  /* Meta dirty bitmap */
 BdrvDirtyBitmap *successor; /* Anonymous child; implies frozen status */
 char *name; /* Optional non-empty unique ID */
 int64_t size;   /* Size of the bitmap (Number of sectors) */
@@ -103,6 +104,56 @@ BdrvDirtyBitmap *bdrv_create_dirty_bitmap(BlockDriverState *bs,
 return bitmap;
 }
 
+/* bdrv_create_meta_dirty_bitmap
+ *
+ * Create a meta dirty bitmap that tracks the changes of bits in @bitmap. I.e.
+ * when a dirty status bit in @bitmap is changed (either from reset to set or
+ * the other way around), its respective meta dirty bitmap bit will be marked
+ * dirty as well.
+ *
+ * @bitmap: the block dirty bitmap for which to create a meta dirty bitmap.
+ * @chunk_size: how many bytes of bitmap data each bit in the meta bitmap
+ * tracks.
+ */
+void bdrv_create_meta_dirty_bitmap(BdrvDirtyBitmap *bitmap,
+   int chunk_size)
+{
+assert(!bitmap->meta);
+bitmap->meta = hbitmap_create_meta(bitmap->bitmap,
+   chunk_size * BITS_PER_BYTE);
+}
+
+void bdrv_release_meta_dirty_bitmap(BdrvDirtyBitmap *bitmap)
+{
+assert(bitmap->meta);
+hbitmap_free_meta(bitmap->bitmap);
+bitmap->meta = NULL;
+}
+
+int bdrv_dirty_bitmap_get_meta(BlockDriverState *bs,
+   BdrvDirtyBitmap *bitmap, int64_t sector,
+   int nb_sectors)
+{
+uint64_t i;
+int sectors_per_bit = 1 << hbitmap_granularity(bitmap->meta);
+
+/* To optimize: we could teach hbitmap to check the range internally at a
+ * coarse level, or at least do it word by word. */
+for (i = sector; i < sector + nb_sectors; i += sectors_per_bit) {
+if (hbitmap_get(bitmap->meta, i)) {
+return true;
+}
+}
+return false;
+}
+
+void bdrv_dirty_bitmap_reset_meta(BlockDriverState *bs,
+  BdrvDirtyBitmap *bitmap, int64_t sector,
+  int nb_sectors)
+{
+hbitmap_reset(bitmap->meta, sector, nb_sectors);
+}
+
 bool bdrv_dirty_bitmap_frozen(BdrvDirtyBitmap *bitmap)
 {
 return bitmap->successor;
@@ -233,6 +284,7 @@ static void bdrv_do_release_matching_dirty_bitmap(BlockDriverState *bs,
 if ((!bitmap || bm == bitmap) && (!only_named || bm->name)) {
 assert(!bm->active_iterators);
 assert(!bdrv_dirty_bitmap_frozen(bm));
+assert(!bm->meta);
 QLIST_REMOVE(bm, list);
 hbitmap_free(bm->bitmap);
 g_free(bm->name);
diff --git a/include/block/dirty-bitmap.h b/include/block/dirty-bitmap.h
index 0ef927d..69c500b 100644
--- a/include/block/dirty-bitmap.h
+++ b/include/block/dirty-bitmap.h
@@ -8,6 +8,9 @@ BdrvDirtyBitmap *bdrv_create_dirty_bitmap(BlockDriverState *bs,
   uint32_t granularity,
   const char *name,
   Error **errp);
+void bdrv_create_meta_dirty_bitmap(BdrvDirtyBitmap *bitmap,
+   int chunk_size);
+void bdrv_release_meta_dirty_bitmap(BdrvDirtyBitmap *bitmap);
 int bdrv_dirty_bitmap_create_successor(BlockDriverState *bs,
BdrvDirtyBitmap *bitmap,
Error **errp);
@@ -36,6 +39,12 @@ void bdrv_set_dirty_bitmap(BdrvDirtyBitmap *bitmap,
int64_t cur_sector, int64_t nr_sectors);
 void bdrv_reset_dirty_bitmap(BdrvDirtyBitmap *bitmap,
  int64_t cur_sector, int64_t nr_sectors);
+int bdrv_dirty_bitmap_get_meta(BlockDriverState *bs,
+   BdrvDirtyBitmap *bitmap, int64_t sector,
+   int nb_sectors);
+void bdrv_dirty_bitmap_reset_meta(BlockDriverState *bs,
+  BdrvDirtyBitmap *bitmap, int64_t sector,
+  int nb_sectors);
 BdrvDirtyBitmapIter *bdrv_dirty_iter_new(BdrvDirtyBitmap *bitmap,
  uint64_t first_sector);
 void bdrv_dirty_iter_free(BdrvDirtyBitmapIter *iter);
-- 
2.7.4




[Qemu-block] [PATCH v10 02/10] HBitmap: Introduce "meta" bitmap to track bit changes

2016-10-13 Thread John Snow
From: Fam Zheng 

Upon each bit toggle, the corresponding bit in the meta bitmap will be
set.

Signed-off-by: Fam Zheng 
[Amended text inline. --js]
Reviewed-by: Max Reitz 

Signed-off-by: John Snow 
---
 include/qemu/hbitmap.h | 21 +++
 util/hbitmap.c | 69 +++---
 2 files changed, 75 insertions(+), 15 deletions(-)

diff --git a/include/qemu/hbitmap.h b/include/qemu/hbitmap.h
index 8ab721e..1725919 100644
--- a/include/qemu/hbitmap.h
+++ b/include/qemu/hbitmap.h
@@ -178,6 +178,27 @@ void hbitmap_iter_init(HBitmapIter *hbi, const HBitmap *hb, uint64_t first);
  */
 unsigned long hbitmap_iter_skip_words(HBitmapIter *hbi);
 
+/* hbitmap_create_meta:
+ * Create a "meta" hbitmap to track dirtiness of the bits in this HBitmap.
+ * The caller owns the created bitmap and must call hbitmap_free_meta(hb) to
+ * free it.
+ *
+ * Currently, we only guarantee that if a bit in the hbitmap is changed it
+ * will be reflected in the meta bitmap, but we do not yet guarantee the
+ * opposite.
+ *
+ * @hb: The HBitmap to operate on.
+ * @chunk_size: How many bits in @hb one bit in the meta bitmap tracks.
+ */
+HBitmap *hbitmap_create_meta(HBitmap *hb, int chunk_size);
+
+/* hbitmap_free_meta:
+ * Free the meta bitmap of @hb.
+ *
+ * @hb: The HBitmap whose meta bitmap should be freed.
+ */
+void hbitmap_free_meta(HBitmap *hb);
+
 /**
  * hbitmap_iter_next:
  * @hbi: HBitmapIter to operate on.
diff --git a/util/hbitmap.c b/util/hbitmap.c
index 99fd2ba..f303975 100644
--- a/util/hbitmap.c
+++ b/util/hbitmap.c
@@ -78,6 +78,9 @@ struct HBitmap {
  */
 int granularity;
 
+/* A meta dirty bitmap to track the dirtiness of bits in this HBitmap. */
+HBitmap *meta;
+
 /* A number of progressively less coarse bitmaps (i.e. level 0 is the
  * coarsest).  Each bit in level N represents a word in level N+1 that
  * has a set bit, except the last level where each bit represents the
@@ -209,25 +212,27 @@ static uint64_t hb_count_between(HBitmap *hb, uint64_t start, uint64_t last)
 }
 
 /* Setting starts at the last layer and propagates up if an element
- * changes from zero to non-zero.
+ * changes.
  */
 static inline bool hb_set_elem(unsigned long *elem, uint64_t start, uint64_t last)
 {
 unsigned long mask;
-bool changed;
+unsigned long old;
 
 assert((last >> BITS_PER_LEVEL) == (start >> BITS_PER_LEVEL));
 assert(start <= last);
 
 mask = 2UL << (last & (BITS_PER_LONG - 1));
 mask -= 1UL << (start & (BITS_PER_LONG - 1));
-changed = (*elem == 0);
+old = *elem;
 *elem |= mask;
-return changed;
+return old != *elem;
 }
 
-/* The recursive workhorse (the depth is limited to HBITMAP_LEVELS)... */
-static void hb_set_between(HBitmap *hb, int level, uint64_t start, uint64_t last)
+/* The recursive workhorse (the depth is limited to HBITMAP_LEVELS)...
+ * Returns true if at least one bit is changed. */
+static bool hb_set_between(HBitmap *hb, int level, uint64_t start,
+   uint64_t last)
 {
 size_t pos = start >> BITS_PER_LEVEL;
 size_t lastpos = last >> BITS_PER_LEVEL;
@@ -256,23 +261,28 @@ static void hb_set_between(HBitmap *hb, int level, uint64_t start, uint64_t last
 if (level > 0 && changed) {
 hb_set_between(hb, level - 1, pos, lastpos);
 }
+return changed;
 }
 
 void hbitmap_set(HBitmap *hb, uint64_t start, uint64_t count)
 {
 /* Compute range in the last layer.  */
+uint64_t first, n;
 uint64_t last = start + count - 1;
 
 trace_hbitmap_set(hb, start, count,
   start >> hb->granularity, last >> hb->granularity);
 
-start >>= hb->granularity;
+first = start >> hb->granularity;
 last >>= hb->granularity;
-count = last - start + 1;
 assert(last < hb->size);
+n = last - first + 1;
 
-hb->count += count - hb_count_between(hb, start, last);
-hb_set_between(hb, HBITMAP_LEVELS - 1, start, last);
+hb->count += n - hb_count_between(hb, first, last);
+if (hb_set_between(hb, HBITMAP_LEVELS - 1, first, last) &&
+hb->meta) {
+hbitmap_set(hb->meta, start, count);
+}
 }
 
 /* Resetting works the other way round: propagate up if the new
@@ -293,8 +303,10 @@ static inline bool hb_reset_elem(unsigned long *elem, uint64_t start, uint64_t l
 return blanked;
 }
 
-/* The recursive workhorse (the depth is limited to HBITMAP_LEVELS)... */
-static void hb_reset_between(HBitmap *hb, int level, uint64_t start, uint64_t last)
+/* The recursive workhorse (the depth is limited to HBITMAP_LEVELS)...
+ * Returns true if at least one bit is changed. */
+static bool hb_reset_between(HBitmap *hb, int level, uint64_t start,
+ uint64_t last)
 {
 size_t pos = start >> BITS_PER_LEVEL;
 size_t lastpos = last >> BITS_PER_LEVEL;
@@ -337,22 +349,29 @@ static void 

[Qemu-block] [PATCH v6 11/15] nbd: Less allocation during NBD_OPT_LIST

2016-10-13 Thread Eric Blake
Since we know that the maximum name we are willing to accept
is small enough to stack-allocate, rework the iteration over
NBD_OPT_LIST responses to reuse a stack buffer rather than
allocating every time.  Furthermore, we don't even have to
allocate if we know the server's length doesn't match what
we are searching for.

Signed-off-by: Eric Blake 

---
v6: rebase
v5: alter signature of nbd_receive_list for simpler logic
v4: rebase
v3: tweak commit message
---
 nbd/client.c | 145 +--
 1 file changed, 70 insertions(+), 75 deletions(-)

diff --git a/nbd/client.c b/nbd/client.c
index df7eb9c..f6468db 100644
--- a/nbd/client.c
+++ b/nbd/client.c
@@ -254,19 +254,28 @@ static int nbd_handle_reply_err(QIOChannel *ioc, nbd_opt_reply *reply,
 return result;
 }

-static int nbd_receive_list(QIOChannel *ioc, char **name, Error **errp)
+/* Process another portion of the NBD_OPT_LIST reply.  Set *@match if
+ * the current reply matches @want or if the server does not support
+ * NBD_OPT_LIST, otherwise leave @match alone.  Return 0 if iteration
+ * is complete, positive if more replies are expected, or negative
+ * with @errp set if an unrecoverable error occurred. */
+static int nbd_receive_list(QIOChannel *ioc, const char *want, bool *match,
+Error **errp)
 {
 nbd_opt_reply reply;
 uint32_t len;
 uint32_t namelen;
+char name[NBD_MAX_NAME_SIZE + 1];
 int error;

-*name = NULL;
 if (nbd_receive_option_reply(ioc, NBD_OPT_LIST, &reply, errp) < 0) {
 return -1;
 }
 error = nbd_handle_reply_err(ioc, &reply, errp);
 if (error <= 0) {
+/* The server did not support NBD_OPT_LIST, so set *match on
+ * the assumption that any name will be accepted.  */
+*match = true;
 return error;
 }
 len = reply.length;
@@ -277,105 +286,91 @@ static int nbd_receive_list(QIOChannel *ioc, char **name, Error **errp)
 nbd_send_opt_abort(ioc);
 return -1;
 }
-} else if (reply.type == NBD_REP_SERVER) {
-if (len < sizeof(namelen) || len > NBD_MAX_BUFFER_SIZE) {
-error_setg(errp, "incorrect option length %" PRIu32, len);
-nbd_send_opt_abort(ioc);
-return -1;
-}
-if (read_sync(ioc, &namelen, sizeof(namelen)) != sizeof(namelen)) {
-error_setg(errp, "failed to read option name length");
-nbd_send_opt_abort(ioc);
-return -1;
-}
-namelen = be32_to_cpu(namelen);
-len -= sizeof(namelen);
-if (len < namelen) {
-error_setg(errp, "incorrect option name length");
-nbd_send_opt_abort(ioc);
-return -1;
-}
-if (namelen > NBD_MAX_NAME_SIZE) {
-error_setg(errp, "export name length too long %" PRIu32, namelen);
-nbd_send_opt_abort(ioc);
-return -1;
-}
-
-*name = g_new0(char, namelen + 1);
-if (read_sync(ioc, *name, namelen) != namelen) {
-error_setg(errp, "failed to read export name");
-g_free(*name);
-*name = NULL;
-nbd_send_opt_abort(ioc);
-return -1;
-}
-(*name)[namelen] = '\0';
-len -= namelen;
-if (drop_sync(ioc, len) != len) {
-error_setg(errp, "failed to read export description");
-g_free(*name);
-*name = NULL;
-nbd_send_opt_abort(ioc);
-return -1;
-}
-} else {
+return 0;
+} else if (reply.type != NBD_REP_SERVER) {
 error_setg(errp, "Unexpected reply type %" PRIx32 " expected %x",
reply.type, NBD_REP_SERVER);
 nbd_send_opt_abort(ioc);
 return -1;
 }
+
+if (len < sizeof(namelen) || len > NBD_MAX_BUFFER_SIZE) {
+error_setg(errp, "incorrect option length %" PRIu32, len);
+nbd_send_opt_abort(ioc);
+return -1;
+}
+if (read_sync(ioc, &namelen, sizeof(namelen)) != sizeof(namelen)) {
+error_setg(errp, "failed to read option name length");
+nbd_send_opt_abort(ioc);
+return -1;
+}
+namelen = be32_to_cpu(namelen);
+len -= sizeof(namelen);
+if (len < namelen) {
+error_setg(errp, "incorrect option name length");
+nbd_send_opt_abort(ioc);
+return -1;
+}
+if (namelen != strlen(want)) {
+if (drop_sync(ioc, len) != len) {
+error_setg(errp, "failed to skip export name with wrong length");
+nbd_send_opt_abort(ioc);
+return -1;
+}
+return 1;
+}
+
+assert(namelen < sizeof(name));
+if (read_sync(ioc, name, namelen) != namelen) {
+error_setg(errp, "failed to read export name");
+nbd_send_opt_abort(ioc);
+return -1;
+}
+name[namelen] = '\0';
+len -= namelen;
+if (drop_sync(ioc, len) != len) {
+

[Qemu-block] [PATCH v6 12/15] nbd: Support shorter handshake

2016-10-13 Thread Eric Blake
The NBD Protocol allows the server and client to mutually agree
on a shorter handshake (omit the 124 bytes of reserved 0), via
the server advertising NBD_FLAG_NO_ZEROES and the client
acknowledging with NBD_FLAG_C_NO_ZEROES (only possible in
newstyle, whether or not it is fixed newstyle).  It doesn't
shave much off the wire, but we might as well implement it.

Signed-off-by: Eric Blake 
Reviewed-by: Alex Bligh 

---
v6: rebase
v5: no change
v4: rebase
v3: rebase
---
 include/block/nbd.h |  6 --
 nbd/client.c|  8 +++-
 nbd/server.c| 15 +++
 3 files changed, 22 insertions(+), 7 deletions(-)

diff --git a/include/block/nbd.h b/include/block/nbd.h
index b69bf1d..d326308 100644
--- a/include/block/nbd.h
+++ b/include/block/nbd.h
@@ -74,11 +74,13 @@ typedef struct NBDReply NBDReply;

 /* New-style handshake (global) flags, sent from server to client, and
control what will happen during handshake phase. */
-#define NBD_FLAG_FIXED_NEWSTYLE (1 << 0)/* Fixed newstyle protocol. */
+#define NBD_FLAG_FIXED_NEWSTYLE   (1 << 0) /* Fixed newstyle protocol. */
+#define NBD_FLAG_NO_ZEROES(1 << 1) /* End handshake without zeroes. */

 /* New-style client flags, sent from client to server to control what happens
during handshake phase. */
-#define NBD_FLAG_C_FIXED_NEWSTYLE   (1 << 0)/* Fixed newstyle protocol. */
+#define NBD_FLAG_C_FIXED_NEWSTYLE (1 << 0) /* Fixed newstyle protocol. */
+#define NBD_FLAG_C_NO_ZEROES  (1 << 1) /* End handshake without zeroes. */

 /* Reply types. */
 #define NBD_REP_ACK (1) /* Data sending finished. */
diff --git a/nbd/client.c b/nbd/client.c
index f6468db..f5e4c74 100644
--- a/nbd/client.c
+++ b/nbd/client.c
@@ -439,6 +439,7 @@ int nbd_receive_negotiate(QIOChannel *ioc, const char *name, uint16_t *flags,
 char buf[256];
 uint64_t magic, s;
 int rc;
+bool zeroes = true;

 TRACE("Receiving negotiation tlscreds=%p hostname=%s.",
   tlscreds, hostname ? hostname : "");
@@ -503,6 +504,11 @@ int nbd_receive_negotiate(QIOChannel *ioc, const char *name, uint16_t *flags,
 TRACE("Server supports fixed new style");
 clientflags |= NBD_FLAG_C_FIXED_NEWSTYLE;
 }
+if (globalflags & NBD_FLAG_NO_ZEROES) {
+zeroes = false;
+TRACE("Server supports no zeroes");
+clientflags |= NBD_FLAG_C_NO_ZEROES;
+}
 /* client requested flags */
 clientflags = cpu_to_be32(clientflags);
 if (write_sync(ioc, &clientflags, sizeof(clientflags)) !=
@@ -590,7 +596,7 @@ int nbd_receive_negotiate(QIOChannel *ioc, const char *name, uint16_t *flags,
 }

 TRACE("Size is %" PRIu64 ", export flags %" PRIx16, *size, *flags);
-if (drop_sync(ioc, 124) != 124) {
+if (zeroes && drop_sync(ioc, 124) != 124) {
 error_setg(errp, "Failed to read reserved block");
 goto fail;
 }
diff --git a/nbd/server.c b/nbd/server.c
index 3d39292..20f1086 100644
--- a/nbd/server.c
+++ b/nbd/server.c
@@ -81,6 +81,7 @@ struct NBDClient {
 int refcount;
 void (*close)(NBDClient *client);

+bool no_zeroes;
 NBDExport *exp;
 QCryptoTLSCreds *tlscreds;
 char *tlsaclname;
@@ -449,6 +450,11 @@ static int nbd_negotiate_options(NBDClient *client)
 fixedNewstyle = true;
 flags &= ~NBD_FLAG_C_FIXED_NEWSTYLE;
 }
+if (flags & NBD_FLAG_C_NO_ZEROES) {
+TRACE("Client supports no zeroes at handshake end");
+client->no_zeroes = true;
+flags &= ~NBD_FLAG_C_NO_ZEROES;
+}
 if (flags != 0) {
 TRACE("Unknown client flags 0x%" PRIx32 " received", flags);
 return -EIO;
@@ -601,6 +607,7 @@ static coroutine_fn int nbd_negotiate(NBDClientNewData *data)
 const uint16_t myflags = (NBD_FLAG_HAS_FLAGS | NBD_FLAG_SEND_TRIM |
   NBD_FLAG_SEND_FLUSH | NBD_FLAG_SEND_FUA);
 bool oldStyle;
+size_t len;

 /* Old style negotiation header without options
 [ 0 ..   7]   passwd   ("NBDMAGIC")
@@ -617,7 +624,7 @@ static coroutine_fn int nbd_negotiate(NBDClientNewData *data)
 options sent
 [18 ..  25]   size
 [26 ..  27]   export flags
-[28 .. 151]   reserved (0)
+[28 .. 151]   reserved (0, omit if no_zeroes)
  */

 qio_channel_set_blocking(client->ioc, false, NULL);
@@ -636,7 +643,7 @@ static coroutine_fn int nbd_negotiate(NBDClientNewData *data)
 stw_be_p(buf + 26, client->exp->nbdflags | myflags);
 } else {
 stq_be_p(buf + 8, NBD_OPTS_MAGIC);
-stw_be_p(buf + 16, NBD_FLAG_FIXED_NEWSTYLE);
+stw_be_p(buf + 16, NBD_FLAG_FIXED_NEWSTYLE | NBD_FLAG_NO_ZEROES);
 }

 if (oldStyle) {
@@ -663,8 +670,8 @@ static coroutine_fn int nbd_negotiate(NBDClientNewData *data)
   client->exp->size, client->exp->nbdflags | myflags);
 stq_be_p(buf + 18, 

[Qemu-block] [PATCH v6 13/15] nbd: Improve server handling of shutdown requests

2016-10-13 Thread Eric Blake
NBD commit 6d34500b clarified how clients and servers are supposed
to behave before closing a connection. It added NBD_REP_ERR_SHUTDOWN
(for the server to announce it is about to go away during option
haggling, so the client should quit sending NBD_OPT_* other than
NBD_OPT_ABORT) and ESHUTDOWN (for the server to announce it is about
to go away during transmission, so the client should quit sending
NBD_CMD_* other than NBD_CMD_DISC).  It also clarified that
NBD_OPT_ABORT gets a reply, while NBD_CMD_DISC does not.

This patch merely adds the missing reply to NBD_OPT_ABORT and teaches
the client to recognize server errors.  Actually teaching the server
to send NBD_REP_ERR_SHUTDOWN or ESHUTDOWN would require knowing that
the server has been requested to shut down soon (maybe we could do
that by installing a SIGINT handler in qemu-nbd, which transitions
from RUNNING to a new state that waits for the client to react,
rather than just out-right quitting - but that's a bigger task for
another day).

Signed-off-by: Eric Blake 

---
v6: rebase
v5: no change
v4: new patch
---
 include/block/nbd.h | 13 +
 nbd/nbd-internal.h  |  1 +
 nbd/client.c| 16 
 nbd/server.c| 10 ++
 4 files changed, 36 insertions(+), 4 deletions(-)

diff --git a/include/block/nbd.h b/include/block/nbd.h
index d326308..eea7ef0 100644
--- a/include/block/nbd.h
+++ b/include/block/nbd.h
@@ -83,12 +83,17 @@ typedef struct NBDReply NBDReply;
 #define NBD_FLAG_C_NO_ZEROES  (1 << 1) /* End handshake without zeroes. */

 /* Reply types. */
+#define NBD_REP_ERR(value) ((UINT32_C(1) << 31) | (value))
+
 #define NBD_REP_ACK (1) /* Data sending finished. */
 #define NBD_REP_SERVER  (2) /* Export description. */
-#define NBD_REP_ERR_UNSUP   ((UINT32_C(1) << 31) | 1) /* Unknown option. */
-#define NBD_REP_ERR_POLICY  ((UINT32_C(1) << 31) | 2) /* Server denied */
-#define NBD_REP_ERR_INVALID ((UINT32_C(1) << 31) | 3) /* Invalid length. */
-#define NBD_REP_ERR_TLS_REQD((UINT32_C(1) << 31) | 5) /* TLS required */
+
+#define NBD_REP_ERR_UNSUP   NBD_REP_ERR(1)  /* Unknown option */
+#define NBD_REP_ERR_POLICY  NBD_REP_ERR(2)  /* Server denied */
+#define NBD_REP_ERR_INVALID NBD_REP_ERR(3)  /* Invalid length */
+#define NBD_REP_ERR_PLATFORMNBD_REP_ERR(4)  /* Not compiled in */
+#define NBD_REP_ERR_TLS_REQDNBD_REP_ERR(5)  /* TLS required */
+#define NBD_REP_ERR_SHUTDOWNNBD_REP_ERR(7)  /* Server shutting down */

 /* Request flags, sent from client to server during transmission phase */
 #define NBD_CMD_FLAG_FUA(1 << 0)
diff --git a/nbd/nbd-internal.h b/nbd/nbd-internal.h
index dd57e18..eee20ab 100644
--- a/nbd/nbd-internal.h
+++ b/nbd/nbd-internal.h
@@ -92,6 +92,7 @@
 #define NBD_ENOMEM 12
 #define NBD_EINVAL 22
 #define NBD_ENOSPC 28
+#define NBD_ESHUTDOWN  108

 static inline ssize_t read_sync(QIOChannel *ioc, void *buffer, size_t size)
 {
diff --git a/nbd/client.c b/nbd/client.c
index f5e4c74..b33ae46 100644
--- a/nbd/client.c
+++ b/nbd/client.c
@@ -34,6 +34,8 @@ static int nbd_errno_to_system_errno(int err)
 return ENOMEM;
 case NBD_ENOSPC:
 return ENOSPC;
+case NBD_ESHUTDOWN:
+return ESHUTDOWN;
 default:
 TRACE("Squashing unexpected error %d to EINVAL", err);
 /* fallthrough */
@@ -231,11 +233,21 @@ static int nbd_handle_reply_err(QIOChannel *ioc, nbd_opt_reply *reply,
reply->option);
 break;

+case NBD_REP_ERR_PLATFORM:
+error_setg(errp, "Server lacks support for option %" PRIx32,
+   reply->option);
+break;
+
 case NBD_REP_ERR_TLS_REQD:
 error_setg(errp, "TLS negotiation required before option %" PRIx32,
reply->option);
 break;

+case NBD_REP_ERR_SHUTDOWN:
+error_setg(errp, "Server shutting down before option %" PRIx32,
+   reply->option);
+break;
+
 default:
 error_setg(errp, "Unknown error code when asking for option %" PRIx32,
reply->option);
@@ -784,6 +796,10 @@ ssize_t nbd_receive_reply(QIOChannel *ioc, NBDReply *reply)
 LOG("invalid magic (got 0x%" PRIx32 ")", magic);
 return -EINVAL;
 }
+if (reply->error == ESHUTDOWN) {
+LOG("server shutting down");
+return -EINVAL;
+}
 return 0;
 }

diff --git a/nbd/server.c b/nbd/server.c
index 20f1086..a7aa2ba 100644
--- a/nbd/server.c
+++ b/nbd/server.c
@@ -39,6 +39,8 @@ static int system_errno_to_nbd_errno(int err)
 case EFBIG:
 case ENOSPC:
 return NBD_ENOSPC;
+case ESHUTDOWN:
+return NBD_ESHUTDOWN;
 case EINVAL:
 default:
 return NBD_EINVAL;
@@ -526,6 +528,10 @@ static int nbd_negotiate_options(NBDClient *client)
 if (ret < 0) {
 return ret;
 }
+/* Let the 

[Qemu-block] [PATCH v6 15/15] nbd: Implement NBD_CMD_WRITE_ZEROES on client

2016-10-13 Thread Eric Blake
Upstream NBD protocol recently added the ability to efficiently
write zeroes without having to send the zeroes over the wire,
along with a flag to control whether the client wants a hole.

The generic block code takes care of falling back to the obvious
write of lots of zeroes if we return -ENOTSUP because the server
does not have WRITE_ZEROES.

Ideally, since NBD_CMD_WRITE_ZEROES does not involve any data
over the wire, we want to support transactions that are much
larger than the normal 32M limit imposed on NBD_CMD_WRITE.  But
the server may still have a limit smaller than UINT_MAX, so
until experimental NBD protocol additions for advertising various
command sizes are finalized (see [1], [2]), for now we just stick to
the same limits as normal writes.

[1] https://github.com/yoe/nbd/blob/extension-info/doc/proto.md
[2] https://sourceforge.net/p/nbd/mailman/message/35081223/

Signed-off-by: Eric Blake 

---
v6: rebase
v5: enhance commit message
v4: rebase to byte-based limits
v3: rebase, tell block layer about our support
---
 block/nbd-client.h |  2 ++
 block/nbd-client.c | 35 +++
 block/nbd.c|  4 
 3 files changed, 41 insertions(+)

diff --git a/block/nbd-client.h b/block/nbd-client.h
index 78e8e57..e51df22 100644
--- a/block/nbd-client.h
+++ b/block/nbd-client.h
@@ -48,6 +48,8 @@ int nbd_client_co_pdiscard(BlockDriverState *bs, int64_t offset, int count);
 int nbd_client_co_flush(BlockDriverState *bs);
 int nbd_client_co_pwritev(BlockDriverState *bs, uint64_t offset,
   uint64_t bytes, QEMUIOVector *qiov, int flags);
+int nbd_client_co_pwrite_zeroes(BlockDriverState *bs, int64_t offset,
+int count, BdrvRequestFlags flags);
 int nbd_client_co_preadv(BlockDriverState *bs, uint64_t offset,
  uint64_t bytes, QEMUIOVector *qiov, int flags);

diff --git a/block/nbd-client.c b/block/nbd-client.c
index 8e89add..31db557 100644
--- a/block/nbd-client.c
+++ b/block/nbd-client.c
@@ -275,6 +275,41 @@ int nbd_client_co_pwritev(BlockDriverState *bs, uint64_t offset,
 return -reply.error;
 }

+int nbd_client_co_pwrite_zeroes(BlockDriverState *bs, int64_t offset,
+int count, BdrvRequestFlags flags)
+{
+ssize_t ret;
+NBDClientSession *client = nbd_get_client_session(bs);
+NBDRequest request = {
+.type = NBD_CMD_WRITE_ZEROES,
+.from = offset,
+.len = count,
+};
+NBDReply reply;
+
+if (!(client->nbdflags & NBD_FLAG_SEND_WRITE_ZEROES)) {
+return -ENOTSUP;
+}
+
+if (flags & BDRV_REQ_FUA) {
+assert(client->nbdflags & NBD_FLAG_SEND_FUA);
+request.flags |= NBD_CMD_FLAG_FUA;
+}
+if (!(flags & BDRV_REQ_MAY_UNMAP)) {
+request.flags |= NBD_CMD_FLAG_NO_HOLE;
+}
+
+nbd_coroutine_start(client, &request);
+ret = nbd_co_send_request(bs, &request, NULL);
+if (ret < 0) {
+reply.error = -ret;
+} else {
+nbd_co_receive_reply(client, &request, &reply, NULL);
+}
+nbd_coroutine_end(client, &request);
+return -reply.error;
+}
+
 int nbd_client_co_flush(BlockDriverState *bs)
 {
 NBDClientSession *client = nbd_get_client_session(bs);
diff --git a/block/nbd.c b/block/nbd.c
index e227490..6c7bbc8 100644
--- a/block/nbd.c
+++ b/block/nbd.c
@@ -403,6 +403,7 @@ static int nbd_co_flush(BlockDriverState *bs)
 static void nbd_refresh_limits(BlockDriverState *bs, Error **errp)
 {
 bs->bl.max_pdiscard = NBD_MAX_BUFFER_SIZE;
+bs->bl.max_pwrite_zeroes = NBD_MAX_BUFFER_SIZE;
 bs->bl.max_transfer = NBD_MAX_BUFFER_SIZE;
 }

@@ -491,6 +492,7 @@ static BlockDriver bdrv_nbd = {
 .bdrv_file_open = nbd_open,
 .bdrv_co_preadv = nbd_client_co_preadv,
 .bdrv_co_pwritev= nbd_client_co_pwritev,
+.bdrv_co_pwrite_zeroes  = nbd_client_co_pwrite_zeroes,
 .bdrv_close = nbd_close,
 .bdrv_co_flush_to_os= nbd_co_flush,
 .bdrv_co_pdiscard   = nbd_client_co_pdiscard,
@@ -509,6 +511,7 @@ static BlockDriver bdrv_nbd_tcp = {
 .bdrv_file_open = nbd_open,
 .bdrv_co_preadv = nbd_client_co_preadv,
 .bdrv_co_pwritev= nbd_client_co_pwritev,
+.bdrv_co_pwrite_zeroes  = nbd_client_co_pwrite_zeroes,
 .bdrv_close = nbd_close,
 .bdrv_co_flush_to_os= nbd_co_flush,
 .bdrv_co_pdiscard   = nbd_client_co_pdiscard,
@@ -527,6 +530,7 @@ static BlockDriver bdrv_nbd_unix = {
 .bdrv_file_open = nbd_open,
 .bdrv_co_preadv = nbd_client_co_preadv,
 .bdrv_co_pwritev= nbd_client_co_pwritev,
+.bdrv_co_pwrite_zeroes  = nbd_client_co_pwrite_zeroes,
 .bdrv_close = nbd_close,
 .bdrv_co_flush_to_os= nbd_co_flush,
 .bdrv_co_pdiscard   = nbd_client_co_pdiscard,
-- 
2.7.4




[Qemu-block] [PATCH v6 10/15] nbd: Let client skip portions of server reply

2016-10-13 Thread Eric Blake
The server has a nice helper function nbd_negotiate_drop_sync()
which lets it easily ignore fluff from the client (such as the
payload to an unknown option request).  We can't quite make it
common, since it depends on nbd_negotiate_read() which handles
coroutine magic, but we can copy the idea into the client where
we have places where we want to ignore data (such as the
description tacked on the end of NBD_REP_SERVER).

Signed-off-by: Eric Blake 

---
v6: rebase
v5: no change
v4: rebase
v3: rebase
---
 nbd/client.c | 47 +--
 1 file changed, 33 insertions(+), 14 deletions(-)

diff --git a/nbd/client.c b/nbd/client.c
index a3e1e7a..df7eb9c 100644
--- a/nbd/client.c
+++ b/nbd/client.c
@@ -75,6 +75,32 @@ static QTAILQ_HEAD(, NBDExport) exports =
QTAILQ_HEAD_INITIALIZER(exports);

 */

+/* Discard length bytes from channel.  Return -errno on failure, or
+ * the number of bytes consumed. */
+static ssize_t drop_sync(QIOChannel *ioc, size_t size)
+{
+ssize_t ret, dropped = size;
+char small[1024];
+char *buffer;
+
+buffer = sizeof(small) >= size ? small : g_malloc(MIN(65536, size));
+while (size > 0) {
+ret = read_sync(ioc, buffer, MIN(65536, size));
+if (ret < 0) {
+goto cleanup;
+}
+assert(ret <= size);
+size -= ret;
+}
+ret = dropped;
+
+ cleanup:
+if (buffer != small) {
+g_free(buffer);
+}
+return ret;
+}
+
 /* Send an option request.
  *
  * The request is for option @opt, with @data containing @len bytes of
@@ -285,19 +311,12 @@ static int nbd_receive_list(QIOChannel *ioc, char **name, Error **errp)
 }
 (*name)[namelen] = '\0';
 len -= namelen;
-if (len) {
-char *buf = g_malloc(len + 1);
-if (read_sync(ioc, buf, len) != len) {
-error_setg(errp, "failed to read export description");
-g_free(*name);
-g_free(buf);
-*name = NULL;
-nbd_send_opt_abort(ioc);
-return -1;
-}
-buf[len] = '\0';
-TRACE("Ignoring export description: %s", buf);
-g_free(buf);
+if (drop_sync(ioc, len) != len) {
+error_setg(errp, "failed to read export description");
+g_free(*name);
+*name = NULL;
+nbd_send_opt_abort(ioc);
+return -1;
 }
 } else {
 error_setg(errp, "Unexpected reply type %" PRIx32 " expected %x",
@@ -576,7 +595,7 @@ int nbd_receive_negotiate(QIOChannel *ioc, const char *name, uint16_t *flags,
 }

 TRACE("Size is %" PRIu64 ", export flags %" PRIx16, *size, *flags);
-if (read_sync(ioc, &buf, 124) != 124) {
+if (drop_sync(ioc, 124) != 124) {
 error_setg(errp, "Failed to read reserved block");
 goto fail;
 }
-- 
2.7.4




[Qemu-block] [PATCH v6 14/15] nbd: Implement NBD_CMD_WRITE_ZEROES on server

2016-10-13 Thread Eric Blake
Upstream NBD protocol recently added the ability to efficiently
write zeroes without having to send the zeroes over the wire,
along with a flag to control whether the client wants to allow
a hole.

Note that when it comes to requiring full allocation, vs.
permitting optimizations, the NBD spec intentionally picked a
different sense for the flag; the rules in qemu are:
MAY_UNMAP == 0: must write zeroes
MAY_UNMAP == 1: may use holes if reads will see zeroes

while in NBD, the rules are:
FLAG_NO_HOLE == 1: must write zeroes
FLAG_NO_HOLE == 0: may use holes if reads will see zeroes

In all cases, the 'may use holes' scenario is optional (the
server need not use a hole, and must not use a hole if
subsequent reads would not see zeroes).

Signed-off-by: Eric Blake 

---
v6: rebase, improve commit message
v5: no change
v4: rebase, fix value for constant
v3: abandon NBD_CMD_CLOSE extension, rebase to use blk_pwrite_zeroes
---
 include/block/nbd.h |  8 ++--
 nbd/server.c| 42 --
 2 files changed, 46 insertions(+), 4 deletions(-)

diff --git a/include/block/nbd.h b/include/block/nbd.h
index eea7ef0..3e373f0 100644
--- a/include/block/nbd.h
+++ b/include/block/nbd.h
@@ -71,6 +71,7 @@ typedef struct NBDReply NBDReply;
 #define NBD_FLAG_SEND_FUA   (1 << 3)/* Send FUA (Force Unit Access) */
 #define NBD_FLAG_ROTATIONAL (1 << 4)/* Use elevator algorithm - rotational media */
 #define NBD_FLAG_SEND_TRIM  (1 << 5)/* Send TRIM (discard) */
+#define NBD_FLAG_SEND_WRITE_ZEROES (1 << 6) /* Send WRITE_ZEROES */

 /* New-style handshake (global) flags, sent from server to client, and
control what will happen during handshake phase. */
@@ -96,7 +97,8 @@ typedef struct NBDReply NBDReply;
 #define NBD_REP_ERR_SHUTDOWNNBD_REP_ERR(7)  /* Server shutting down */

 /* Request flags, sent from client to server during transmission phase */
-#define NBD_CMD_FLAG_FUA(1 << 0)
+#define NBD_CMD_FLAG_FUA(1 << 0) /* 'force unit access' during write */
+#define NBD_CMD_FLAG_NO_HOLE(1 << 1) /* don't punch hole on zero run */

 /* Supported request types */
 enum {
@@ -104,7 +106,9 @@ enum {
 NBD_CMD_WRITE = 1,
 NBD_CMD_DISC = 2,
 NBD_CMD_FLUSH = 3,
-NBD_CMD_TRIM = 4
+NBD_CMD_TRIM = 4,
+/* 5 reserved for failed experiment NBD_CMD_CACHE */
+NBD_CMD_WRITE_ZEROES = 6,
 };

 #define NBD_DEFAULT_PORT   10809
diff --git a/nbd/server.c b/nbd/server.c
index a7aa2ba..b189619 100644
--- a/nbd/server.c
+++ b/nbd/server.c
@@ -615,7 +615,8 @@ static coroutine_fn int nbd_negotiate(NBDClientNewData *data)
 char buf[8 + 8 + 8 + 128];
 int rc;
 const uint16_t myflags = (NBD_FLAG_HAS_FLAGS | NBD_FLAG_SEND_TRIM |
-  NBD_FLAG_SEND_FLUSH | NBD_FLAG_SEND_FUA);
+  NBD_FLAG_SEND_FLUSH | NBD_FLAG_SEND_FUA |
+  NBD_FLAG_SEND_WRITE_ZEROES);
 bool oldStyle;
 size_t len;

@@ -1145,11 +1146,17 @@ static ssize_t nbd_co_receive_request(NBDRequestData *req,
 rc = request->type == NBD_CMD_WRITE ? -ENOSPC : -EINVAL;
 goto out;
 }
-if (request->flags & ~NBD_CMD_FLAG_FUA) {
+if (request->flags & ~(NBD_CMD_FLAG_FUA | NBD_CMD_FLAG_NO_HOLE)) {
 LOG("unsupported flags (got 0x%x)", request->flags);
 rc = -EINVAL;
 goto out;
 }
+if (request->type != NBD_CMD_WRITE_ZEROES &&
+(request->flags & NBD_CMD_FLAG_NO_HOLE)) {
+LOG("unexpected flags (got 0x%x)", request->flags);
+rc = -EINVAL;
+goto out;
+}

 rc = 0;

@@ -1254,6 +1261,37 @@ static void nbd_trip(void *opaque)
 }
 break;

+case NBD_CMD_WRITE_ZEROES:
+TRACE("Request type is WRITE_ZEROES");
+
+if (exp->nbdflags & NBD_FLAG_READ_ONLY) {
+TRACE("Server is read-only, return error");
+reply.error = EROFS;
+goto error_reply;
+}
+
+TRACE("Writing to device");
+
+flags = 0;
+if (request.flags & NBD_CMD_FLAG_FUA) {
+flags |= BDRV_REQ_FUA;
+}
+if (!(request.flags & NBD_CMD_FLAG_NO_HOLE)) {
+flags |= BDRV_REQ_MAY_UNMAP;
+}
+ret = blk_pwrite_zeroes(exp->blk, request.from + exp->dev_offset,
+request.len, flags);
+if (ret < 0) {
+LOG("writing to file failed");
+reply.error = -ret;
+goto error_reply;
+}
+
+if (nbd_co_send_reply(req, , 0) < 0) {
+goto out;
+}
+break;
+
 case NBD_CMD_DISC:
 /* unreachable, thanks to special case in nbd_co_receive_request() */
 abort();
-- 
2.7.4




[Qemu-block] [PATCH v6 00/15] nbd: efficient write zeroes

2016-10-13 Thread Eric Blake
v5 was here, but missed 2.7 freeze:
https://lists.gnu.org/archive/html/qemu-devel/2016-07/msg04053.html

Since then, I've rebased the series; the bulk of the changes adopt
consistent NBDFoo CamelCase naming and improve the commit messages
to address questions raised on v5.

Also available as a tag at:
git fetch git://repo.or.cz/qemu/ericb.git nbd-zero-v6

001/15:[] [-C] 'nbd: Add qemu-nbd -D for human-readable description'
002/15:[] [--] 'nbd: Treat flags vs. command type as separate fields'
003/15:[down] 'nbd: Rename NBDRequest to NBDRequestData'
004/15:[down] 'nbd: Rename NbdClientSession to NBDClientSession'
005/15:[down] 'nbd: Rename struct nbd_request and nbd_reply'
006/15:[0012] [FC] 'nbd: Share common reply-sending code in server'
007/15:[0006] [FC] 'nbd: Send message along with server NBD_REP_ERR errors'
008/15:[0015] [FC] 'nbd: Share common option-sending code in client'
009/15:[] [-C] 'nbd: Let server know when client gives up negotiation'
010/15:[] [-C] 'nbd: Let client skip portions of server reply'
011/15:[0004] [FC] 'nbd: Less allocation during NBD_OPT_LIST'
012/15:[] [-C] 'nbd: Support shorter handshake'
013/15:[] [-C] 'nbd: Improve server handling of shutdown requests'
014/15:[] [-C] 'nbd: Implement NBD_CMD_WRITE_ZEROES on server'
015/15:[0006] [FC] 'nbd: Implement NBD_CMD_WRITE_ZEROES on client'

Eric Blake (15):
  nbd: Add qemu-nbd -D for human-readable description
  nbd: Treat flags vs. command type as separate fields
  nbd: Rename NBDRequest to NBDRequestData
  nbd: Rename NbdClientSession to NBDClientSession
  nbd: Rename struct nbd_request and nbd_reply
  nbd: Share common reply-sending code in server
  nbd: Send message along with server NBD_REP_ERR errors
  nbd: Share common option-sending code in client
  nbd: Let server know when client gives up negotiation
  nbd: Let client skip portions of server reply
  nbd: Less allocation during NBD_OPT_LIST
  nbd: Support shorter handshake
  nbd: Improve server handling of shutdown requests
  nbd: Implement NBD_CMD_WRITE_ZEROES on server
  nbd: Implement NBD_CMD_WRITE_ZEROES on client

 block/nbd-client.h  |  10 +-
 include/block/nbd.h |  73 ++--
 nbd/nbd-internal.h  |  12 +-
 block/nbd-client.c  |  96 +++
 block/nbd.c |   8 +-
 nbd/client.c| 482 
 nbd/server.c| 294 ++--
 qemu-nbd.c  |  12 +-
 qemu-nbd.texi   |   5 +-
 9 files changed, 614 insertions(+), 378 deletions(-)

-- 
2.7.4




[Qemu-block] [PATCH v6 08/15] nbd: Share common option-sending code in client

2016-10-13 Thread Eric Blake
Rather than open-coding each option request, it's easier to
have common helper functions do the work.  That in turn requires
having convenient packed types for handling option requests
and replies.
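For reference, the on-the-wire framing those packed types describe can be
sketched as follows; the helper name and the manual byte-twiddling are
illustrative stand-ins for QEMU's stq_be_p()/stl_be_p() machinery, not the
series' actual API:

```c
#include <stdint.h>

#define NBD_OPTS_MAGIC 0x49484156454F5054ULL  /* "IHAVEOPT" */

/* Serialize the 16-byte option-request header (magic, option, length),
 * all fields big-endian as the NBD protocol requires.  The payload, if
 * any, follows these 16 bytes on the wire. */
void nbd_format_option_header(uint8_t buf[16], uint32_t opt, uint32_t len)
{
    for (int i = 0; i < 8; i++) {
        buf[i] = NBD_OPTS_MAGIC >> (56 - 8 * i);
    }
    for (int i = 0; i < 4; i++) {
        buf[8 + i] = opt >> (24 - 8 * i);
        buf[12 + i] = len >> (24 - 8 * i);
    }
}
```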

Signed-off-by: Eric Blake 

---
v6: comment and formatting tweaks
v5: no change
v4: rebase
v3: rebase, tweak a debug message
---
 include/block/nbd.h |  25 +-
 nbd/nbd-internal.h  |   2 +-
 nbd/client.c| 255 ++--
 3 files changed, 131 insertions(+), 151 deletions(-)

diff --git a/include/block/nbd.h b/include/block/nbd.h
index a33581b..b69bf1d 100644
--- a/include/block/nbd.h
+++ b/include/block/nbd.h
@@ -26,15 +26,34 @@
 #include "io/channel-socket.h"
 #include "crypto/tlscreds.h"

-/* Note: these are _NOT_ the same as the network representation of an NBD
+/* Handshake phase structs - this struct is passed on the wire */
+
+struct nbd_option {
+uint64_t magic; /* NBD_OPTS_MAGIC */
+uint32_t option; /* NBD_OPT_* */
+uint32_t length;
+} QEMU_PACKED;
+typedef struct nbd_option nbd_option;
+
+struct nbd_opt_reply {
+uint64_t magic; /* NBD_REP_MAGIC */
+uint32_t option; /* NBD_OPT_* */
+uint32_t type; /* NBD_REP_* */
+uint32_t length;
+} QEMU_PACKED;
+typedef struct nbd_opt_reply nbd_opt_reply;
+
+/* Transmission phase structs
+ *
+ * Note: these are _NOT_ the same as the network representation of an NBD
  * request and reply!
  */
 struct NBDRequest {
 uint64_t handle;
 uint64_t from;
 uint32_t len;
-uint16_t flags;
-uint16_t type;
+uint16_t flags; /* NBD_CMD_FLAG_* */
+uint16_t type; /* NBD_CMD_* */
 };
 typedef struct NBDRequest NBDRequest;

diff --git a/nbd/nbd-internal.h b/nbd/nbd-internal.h
index 99e5157..dd57e18 100644
--- a/nbd/nbd-internal.h
+++ b/nbd/nbd-internal.h
@@ -62,7 +62,7 @@
 #define NBD_REPLY_MAGIC 0x67446698
 #define NBD_OPTS_MAGIC  0x49484156454F5054LL
 #define NBD_CLIENT_MAGIC0x420281861253LL
-#define NBD_REP_MAGIC   0x3e889045565a9LL
+#define NBD_REP_MAGIC   0x0003e889045565a9LL

 #define NBD_SET_SOCK_IO(0xab, 0)
 #define NBD_SET_BLKSIZE _IO(0xab, 1)
diff --git a/nbd/client.c b/nbd/client.c
index 86e29dc..f7a2d6e 100644
--- a/nbd/client.c
+++ b/nbd/client.c
@@ -75,64 +75,128 @@ static QTAILQ_HEAD(, NBDExport) exports = 
QTAILQ_HEAD_INITIALIZER(exports);

 */

+/* Send an option request.
+ *
+ * The request is for option @opt, with @data containing @len bytes of
+ * additional payload for the request (@len may be -1 to treat @data as
+ * a C string; and @data may be NULL if @len is 0).
+ * Return 0 if successful, -1 with errp set if it is impossible to
+ * continue. */
+static int nbd_send_option_request(QIOChannel *ioc, uint32_t opt,
+   uint32_t len, const char *data,
+   Error **errp)
+{
+nbd_option req;
+QEMU_BUILD_BUG_ON(sizeof(req) != 16);

-/* If type represents success, return 1 without further action.
- * If type represents an error reply, consume the rest of the packet on ioc.
- * Then return 0 for unsupported (so the client can fall back to
- * other approaches), or -1 with errp set for other errors.
+if (len == -1) {
+req.length = len = strlen(data);
+}
+TRACE("Sending option request %" PRIu32", len %" PRIu32, opt, len);
+
+stq_be_p(&req.magic, NBD_OPTS_MAGIC);
+stl_be_p(&req.option, opt);
+stl_be_p(&req.length, len);
+
+if (write_sync(ioc, &req, sizeof(req)) != sizeof(req)) {
+error_setg(errp, "Failed to send option request header");
+return -1;
+}
+
+if (len && write_sync(ioc, (char *) data, len) != len) {
+error_setg(errp, "Failed to send option request data");
+return -1;
+}
+
+return 0;
+}
+
+/* Receive the header of an option reply, which should match the given
+ * opt.  Read through the length field, but NOT the length bytes of
+ * payload. Return 0 if successful, -1 with errp set if it is
+ * impossible to continue. */
+static int nbd_receive_option_reply(QIOChannel *ioc, uint32_t opt,
+nbd_opt_reply *reply, Error **errp)
+{
+QEMU_BUILD_BUG_ON(sizeof(*reply) != 20);
+if (read_sync(ioc, reply, sizeof(*reply)) != sizeof(*reply)) {
+error_setg(errp, "failed to read option reply");
+return -1;
+}
+be64_to_cpus(&reply->magic);
+be32_to_cpus(&reply->option);
+be32_to_cpus(&reply->type);
+be32_to_cpus(&reply->length);
+
+TRACE("Received option reply %" PRIx32", type %" PRIx32", len %" PRIu32,
+  reply->option, reply->type, reply->length);
+
+if (reply->magic != NBD_REP_MAGIC) {
+error_setg(errp, "Unexpected option reply magic");
+return -1;
+}
+if (reply->option != opt) {
+error_setg(errp, "Unexpected option type %x expected %x",
+   reply->option, opt);
+return -1;
+}
+return 0;
+}
+
+/* If reply represents success, return 1 without 

[Qemu-block] [PATCH v6 09/15] nbd: Let server know when client gives up negotiation

2016-10-13 Thread Eric Blake
The NBD spec says that a client should send NBD_OPT_ABORT
rather than just dropping the connection, if the client doesn't
like something the server sent during option negotiation.  This
is a best-effort attempt only, and can only be done in places
where we know the server is still in sync with what we've sent,
whether or not we've read everything the server has sent.
Technically, the server then has to reply with NBD_REP_ACK, but
it's not worth complicating the client to wait around for that
reply.
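As a sketch (with a made-up helper name, not this series' code): the abort
is just a zero-length option request that the client writes best-effort
before disconnecting, without waiting for the server's NBD_REP_ACK.

```c
#include <stdint.h>
#include <string.h>

/* Build the fire-and-forget NBD_OPT_ABORT request: the standard 16-byte
 * option header with option 2 (NBD_OPT_ABORT) and a zero-length payload.
 * Returns the byte count a caller would write, ignoring any write error,
 * before closing the connection. */
size_t nbd_format_opt_abort(uint8_t buf[16])
{
    static const uint8_t magic[8] = { /* NBD_OPTS_MAGIC, "IHAVEOPT" */
        0x49, 0x48, 0x41, 0x56, 0x45, 0x4F, 0x50, 0x54
    };
    memcpy(buf, magic, 8);
    memset(buf + 8, 0, 8);  /* option and length fields, big-endian zeroes */
    buf[11] = 2;            /* NBD_OPT_ABORT */
    return 16;
}
```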

Signed-off-by: Eric Blake 

---
v6: rebase
v5: no change
v4: new patch
---
 nbd/client.c | 30 ++
 1 file changed, 30 insertions(+)

diff --git a/nbd/client.c b/nbd/client.c
index f7a2d6e..a3e1e7a 100644
--- a/nbd/client.c
+++ b/nbd/client.c
@@ -111,6 +111,19 @@ static int nbd_send_option_request(QIOChannel *ioc, 
uint32_t opt,
 return 0;
 }

+/* Send NBD_OPT_ABORT as a courtesy to let the server know that we are
+ * not going to attempt further negotiation. */
+static void nbd_send_opt_abort(QIOChannel *ioc)
+{
+/* Technically, a compliant server is supposed to reply to us; but
+ * older servers disconnected instead. At any rate, we're allowed
+ * to disconnect without waiting for the server reply, so we don't
+ * even care if the request makes it to the server, let alone
+ * waiting around for whether the server replies. */
+nbd_send_option_request(ioc, NBD_OPT_ABORT, 0, NULL, NULL);
+}
+
+
 /* Receive the header of an option reply, which should match the given
  * opt.  Read through the length field, but NOT the length bytes of
  * payload. Return 0 if successful, -1 with errp set if it is
@@ -121,6 +134,7 @@ static int nbd_receive_option_reply(QIOChannel *ioc, 
uint32_t opt,
 QEMU_BUILD_BUG_ON(sizeof(*reply) != 20);
 if (read_sync(ioc, reply, sizeof(*reply)) != sizeof(*reply)) {
 error_setg(errp, "failed to read option reply");
+nbd_send_opt_abort(ioc);
 return -1;
 }
be64_to_cpus(&reply->magic);
@@ -133,11 +147,13 @@ static int nbd_receive_option_reply(QIOChannel *ioc, 
uint32_t opt,

 if (reply->magic != NBD_REP_MAGIC) {
 error_setg(errp, "Unexpected option reply magic");
+nbd_send_opt_abort(ioc);
 return -1;
 }
 if (reply->option != opt) {
 error_setg(errp, "Unexpected option type %x expected %x",
reply->option, opt);
+nbd_send_opt_abort(ioc);
 return -1;
 }
 return 0;
@@ -206,6 +222,9 @@ static int nbd_handle_reply_err(QIOChannel *ioc, 
nbd_opt_reply *reply,

  cleanup:
 g_free(msg);
+if (result < 0) {
+nbd_send_opt_abort(ioc);
+}
 return result;
 }

@@ -229,25 +248,30 @@ static int nbd_receive_list(QIOChannel *ioc, char **name, 
Error **errp)
 if (reply.type == NBD_REP_ACK) {
 if (len != 0) {
 error_setg(errp, "length too long for option end");
+nbd_send_opt_abort(ioc);
 return -1;
 }
 } else if (reply.type == NBD_REP_SERVER) {
 if (len < sizeof(namelen) || len > NBD_MAX_BUFFER_SIZE) {
 error_setg(errp, "incorrect option length %" PRIu32, len);
+nbd_send_opt_abort(ioc);
 return -1;
 }
if (read_sync(ioc, &namelen, sizeof(namelen)) != sizeof(namelen)) {
 error_setg(errp, "failed to read option name length");
+nbd_send_opt_abort(ioc);
 return -1;
 }
 namelen = be32_to_cpu(namelen);
 len -= sizeof(namelen);
 if (len < namelen) {
 error_setg(errp, "incorrect option name length");
+nbd_send_opt_abort(ioc);
 return -1;
 }
 if (namelen > NBD_MAX_NAME_SIZE) {
 error_setg(errp, "export name length too long %" PRIu32, namelen);
+nbd_send_opt_abort(ioc);
 return -1;
 }

@@ -256,6 +280,7 @@ static int nbd_receive_list(QIOChannel *ioc, char **name, 
Error **errp)
 error_setg(errp, "failed to read export name");
 g_free(*name);
 *name = NULL;
+nbd_send_opt_abort(ioc);
 return -1;
 }
 (*name)[namelen] = '\0';
@@ -267,6 +292,7 @@ static int nbd_receive_list(QIOChannel *ioc, char **name, 
Error **errp)
 g_free(*name);
 g_free(buf);
 *name = NULL;
+nbd_send_opt_abort(ioc);
 return -1;
 }
 buf[len] = '\0';
@@ -276,6 +302,7 @@ static int nbd_receive_list(QIOChannel *ioc, char **name, 
Error **errp)
 } else {
 error_setg(errp, "Unexpected reply type %" PRIx32 " expected %x",
reply.type, NBD_REP_SERVER);
+nbd_send_opt_abort(ioc);
 return -1;
 }
 return 1;
@@ -325,6 +352,7 @@ static int nbd_receive_query_exports(QIOChannel *ioc,

 if (!foundExport) {
 error_setg(errp, "No export with name '%s' 

[Qemu-block] [PATCH v6 05/15] nbd: Rename struct nbd_request and nbd_reply

2016-10-13 Thread Eric Blake
Our coding convention prefers CamelCase names, and we already
have other existing structs with NBDFoo naming.  Let's be
consistent, before later patches add even more structs.

Signed-off-by: Eric Blake 
---
v6: new patch
---
 block/nbd-client.h  |  2 +-
 include/block/nbd.h | 10 ++
 block/nbd-client.c  | 28 ++--
 nbd/client.c|  4 ++--
 nbd/server.c| 12 ++--
 5 files changed, 29 insertions(+), 27 deletions(-)

diff --git a/block/nbd-client.h b/block/nbd-client.h
index a84a478..78e8e57 100644
--- a/block/nbd-client.h
+++ b/block/nbd-client.h
@@ -29,7 +29,7 @@ typedef struct NBDClientSession {
 int in_flight;

 Coroutine *recv_coroutine[MAX_NBD_REQUESTS];
-struct nbd_reply reply;
+NBDReply reply;

 bool is_unix;
 } NBDClientSession;
diff --git a/include/block/nbd.h b/include/block/nbd.h
index 5fe2670..a33581b 100644
--- a/include/block/nbd.h
+++ b/include/block/nbd.h
@@ -29,18 +29,20 @@
 /* Note: these are _NOT_ the same as the network representation of an NBD
  * request and reply!
  */
-struct nbd_request {
+struct NBDRequest {
 uint64_t handle;
 uint64_t from;
 uint32_t len;
 uint16_t flags;
 uint16_t type;
 };
+typedef struct NBDRequest NBDRequest;

-struct nbd_reply {
+struct NBDReply {
 uint64_t handle;
 uint32_t error;
 };
+typedef struct NBDReply NBDReply;

 /* Transmission (export) flags: sent from server to client during handshake,
but describe what will happen during transmission */
@@ -101,8 +103,8 @@ int nbd_receive_negotiate(QIOChannel *ioc, const char 
*name, uint16_t *flags,
   QIOChannel **outioc,
   off_t *size, Error **errp);
 int nbd_init(int fd, QIOChannelSocket *sioc, uint16_t flags, off_t size);
-ssize_t nbd_send_request(QIOChannel *ioc, struct nbd_request *request);
-ssize_t nbd_receive_reply(QIOChannel *ioc, struct nbd_reply *reply);
+ssize_t nbd_send_request(QIOChannel *ioc, NBDRequest *request);
+ssize_t nbd_receive_reply(QIOChannel *ioc, NBDReply *reply);
 int nbd_client(int fd);
 int nbd_disconnect(int fd);

diff --git a/block/nbd-client.c b/block/nbd-client.c
index c94608a..8e89add 100644
--- a/block/nbd-client.c
+++ b/block/nbd-client.c
@@ -116,7 +116,7 @@ static void nbd_restart_write(void *opaque)
 }

 static int nbd_co_send_request(BlockDriverState *bs,
-   struct nbd_request *request,
+   NBDRequest *request,
QEMUIOVector *qiov)
 {
 NBDClientSession *s = nbd_get_client_session(bs);
@@ -168,8 +168,8 @@ static int nbd_co_send_request(BlockDriverState *bs,
 }

 static void nbd_co_receive_reply(NBDClientSession *s,
- struct nbd_request *request,
- struct nbd_reply *reply,
+ NBDRequest *request,
+ NBDReply *reply,
  QEMUIOVector *qiov)
 {
 int ret;
@@ -196,7 +196,7 @@ static void nbd_co_receive_reply(NBDClientSession *s,
 }

 static void nbd_coroutine_start(NBDClientSession *s,
-   struct nbd_request *request)
+NBDRequest *request)
 {
 /* Poor man semaphore.  The free_sema is locked when no other request
  * can be accepted, and unlocked after receiving one reply.  */
@@ -210,7 +210,7 @@ static void nbd_coroutine_start(NBDClientSession *s,
 }

 static void nbd_coroutine_end(NBDClientSession *s,
-struct nbd_request *request)
+  NBDRequest *request)
 {
 int i = HANDLE_TO_INDEX(s, request->handle);
 s->recv_coroutine[i] = NULL;
@@ -223,12 +223,12 @@ int nbd_client_co_preadv(BlockDriverState *bs, uint64_t 
offset,
  uint64_t bytes, QEMUIOVector *qiov, int flags)
 {
 NBDClientSession *client = nbd_get_client_session(bs);
-struct nbd_request request = {
+NBDRequest request = {
 .type = NBD_CMD_READ,
 .from = offset,
 .len = bytes,
 };
-struct nbd_reply reply;
+NBDReply reply;
 ssize_t ret;

 assert(bytes <= NBD_MAX_BUFFER_SIZE);
@@ -249,12 +249,12 @@ int nbd_client_co_pwritev(BlockDriverState *bs, uint64_t 
offset,
   uint64_t bytes, QEMUIOVector *qiov, int flags)
 {
 NBDClientSession *client = nbd_get_client_session(bs);
-struct nbd_request request = {
+NBDRequest request = {
 .type = NBD_CMD_WRITE,
 .from = offset,
 .len = bytes,
 };
-struct nbd_reply reply;
+NBDReply reply;
 ssize_t ret;

 if (flags & BDRV_REQ_FUA) {
@@ -278,8 +278,8 @@ int nbd_client_co_pwritev(BlockDriverState *bs, uint64_t 
offset,
 int nbd_client_co_flush(BlockDriverState *bs)
 {
 NBDClientSession *client = nbd_get_client_session(bs);
-struct nbd_request request = { .type = NBD_CMD_FLUSH };
-struct nbd_reply reply;
+NBDRequest request = 

[Qemu-block] [PATCH v6 07/15] nbd: Send message along with server NBD_REP_ERR errors

2016-10-13 Thread Eric Blake
The NBD Protocol allows us to send human-readable messages
along with any NBD_REP_ERR error during option negotiation;
make use of this fact for clients that know what to do with
our message.
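The framing involved can be sketched as below: a 20-byte reply header whose
length field covers the UTF-8 message that follows.  The constants come from
the NBD spec; the helper name is illustrative, not QEMU's.

```c
#include <stdint.h>
#include <string.h>

#define NBD_REP_MAGIC       0x0003e889045565a9ULL
#define NBD_REP_ERR_INVALID ((UINT32_C(1) << 31) | 3)

/* Serialize an option error reply (magic, option, type, length) followed
 * by the human-readable message bytes; all header fields big-endian.
 * Returns the total number of bytes written into buf. */
size_t nbd_format_rep_err(uint8_t *buf, uint32_t opt, uint32_t type,
                          const char *msg)
{
    uint32_t len = (uint32_t)strlen(msg);

    for (int i = 0; i < 8; i++) {
        buf[i] = NBD_REP_MAGIC >> (56 - 8 * i);
    }
    for (int i = 0; i < 4; i++) {
        buf[8 + i]  = opt  >> (24 - 8 * i);
        buf[12 + i] = type >> (24 - 8 * i);
        buf[16 + i] = len  >> (24 - 8 * i);
    }
    memcpy(buf + 20, msg, len);
    return 20 + len;
}
```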

Signed-off-by: Eric Blake 

---
v6: tweak comments, fix indentation
v5: don't leak 'msg'
v4: new patch
---
 nbd/server.c | 78 +---
 1 file changed, 59 insertions(+), 19 deletions(-)

diff --git a/nbd/server.c b/nbd/server.c
index 14a5bd5..3d39292 100644
--- a/nbd/server.c
+++ b/nbd/server.c
@@ -236,6 +236,38 @@ static int nbd_negotiate_send_rep(QIOChannel *ioc, 
uint32_t type, uint32_t opt)
 return nbd_negotiate_send_rep_len(ioc, type, opt, 0);
 }

+/* Send an error reply.
+ * Return -errno on error, 0 on success. */
+static int GCC_FMT_ATTR(4, 5)
+nbd_negotiate_send_rep_err(QIOChannel *ioc, uint32_t type,
+   uint32_t opt, const char *fmt, ...)
+{
+va_list va;
+char *msg;
+int ret;
+size_t len;
+
+va_start(va, fmt);
+msg = g_strdup_vprintf(fmt, va);
+va_end(va);
+len = strlen(msg);
+assert(len < 4096);
+TRACE("sending error message \"%s\"", msg);
+ret = nbd_negotiate_send_rep_len(ioc, type, opt, len);
+if (ret < 0) {
+goto out;
+}
+if (nbd_negotiate_write(ioc, msg, len) != len) {
+LOG("write failed (error message)");
+ret = -EIO;
+} else {
+ret = 0;
+}
+out:
+g_free(msg);
+return ret;
+}
+
 /* Send a single NBD_REP_SERVER reply to NBD_OPT_LIST, including payload.
  * Return -errno on error, 0 on success. */
 static int nbd_negotiate_send_rep_list(QIOChannel *ioc, NBDExport *exp)
@@ -281,8 +313,9 @@ static int nbd_negotiate_handle_list(NBDClient *client, 
uint32_t length)
 if (nbd_negotiate_drop_sync(client->ioc, length) != length) {
 return -EIO;
 }
-return nbd_negotiate_send_rep(client->ioc,
-  NBD_REP_ERR_INVALID, NBD_OPT_LIST);
+return nbd_negotiate_send_rep_err(client->ioc,
+  NBD_REP_ERR_INVALID, NBD_OPT_LIST,
+  "OPT_LIST should not have length");
 }

 /* For each export, send a NBD_REP_SERVER reply. */
@@ -329,7 +362,8 @@ fail:
 return rc;
 }

-
+/* Handle NBD_OPT_STARTTLS. Return NULL to drop connection, or else the
+ * new channel for all further (now-encrypted) communication. */
 static QIOChannel *nbd_negotiate_handle_starttls(NBDClient *client,
  uint32_t length)
 {
@@ -343,7 +377,8 @@ static QIOChannel *nbd_negotiate_handle_starttls(NBDClient 
*client,
 if (nbd_negotiate_drop_sync(ioc, length) != length) {
 return NULL;
 }
-nbd_negotiate_send_rep(ioc, NBD_REP_ERR_INVALID, NBD_OPT_STARTTLS);
+nbd_negotiate_send_rep_err(ioc, NBD_REP_ERR_INVALID, NBD_OPT_STARTTLS,
+   "OPT_STARTTLS should not have length");
 return NULL;
 }

@@ -473,13 +508,15 @@ static int nbd_negotiate_options(NBDClient *client)
 return -EINVAL;

 default:
-TRACE("Option 0x%" PRIx32 " not permitted before TLS",
-  clientflags);
 if (nbd_negotiate_drop_sync(client->ioc, length) != length) {
 return -EIO;
 }
-ret = nbd_negotiate_send_rep(client->ioc, NBD_REP_ERR_TLS_REQD,
- clientflags);
+ret = nbd_negotiate_send_rep_err(client->ioc,
+ NBD_REP_ERR_TLS_REQD,
+ clientflags,
+ "Option 0x%" PRIx32
+ "not permitted before TLS",
+ clientflags);
 if (ret < 0) {
 return ret;
 }
@@ -505,27 +542,30 @@ static int nbd_negotiate_options(NBDClient *client)
 return -EIO;
 }
 if (client->tlscreds) {
-TRACE("TLS already enabled");
-ret = nbd_negotiate_send_rep(client->ioc,
- NBD_REP_ERR_INVALID,
- clientflags);
+ret = nbd_negotiate_send_rep_err(client->ioc,
+ NBD_REP_ERR_INVALID,
+ clientflags,
+ "TLS already enabled");
 } else {
-TRACE("TLS not configured");
-ret = nbd_negotiate_send_rep(client->ioc,
- 

[Qemu-block] [PATCH v6 02/15] nbd: Treat flags vs. command type as separate fields

2016-10-13 Thread Eric Blake
Current upstream NBD documents that requests have a 16-bit flags field
followed by a 16-bit type field, although older versions described only
a single 32-bit field that had to be masked to extract the flags.  Since
the protocol is in network order (big-endian over the wire), the ABI is
unchanged; but treating the flags as a separate field rather than masking
will make it easier to add support for upcoming NBD extensions that
increase the number of both flags and commands.

Improve some comments in nbd.h based on the current upstream
NBD protocol (https://github.com/yoe/nbd/blob/master/doc/proto.md),
and touch some nearby code to keep checkpatch.pl happy.
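A quick sketch of why the ABI is unchanged: serializing a 32-bit
"flags | type" value big-endian produces exactly the same bytes as a
16-bit flags field followed by a 16-bit type field (the helpers below
are illustrative, not QEMU's).

```c
#include <stdint.h>

/* Big-endian serializers, as the NBD wire format requires. */
void put_be32(uint8_t *p, uint32_t v)
{
    p[0] = v >> 24; p[1] = v >> 16; p[2] = v >> 8; p[3] = v;
}

void put_be16(uint8_t *p, uint16_t v)
{
    p[0] = v >> 8; p[1] = v;
}
```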

Signed-off-by: Eric Blake 

---
v6: no change
v5: no change
v4: rebase to earlier changes
v3: rebase to other changes earlier in series
---
 include/block/nbd.h | 18 --
 nbd/nbd-internal.h  |  4 ++--
 block/nbd-client.c  |  9 +++--
 nbd/client.c|  9 ++---
 nbd/server.c| 39 +++
 5 files changed, 42 insertions(+), 37 deletions(-)

diff --git a/include/block/nbd.h b/include/block/nbd.h
index fd58390..5fe2670 100644
--- a/include/block/nbd.h
+++ b/include/block/nbd.h
@@ -1,4 +1,5 @@
 /*
+ *  Copyright (C) 2016 Red Hat, Inc.
  *  Copyright (C) 2005  Anthony Liguori 
  *
  *  Network Block Device
@@ -32,7 +33,8 @@ struct nbd_request {
 uint64_t handle;
 uint64_t from;
 uint32_t len;
-uint32_t type;
+uint16_t flags;
+uint16_t type;
 };

 struct nbd_reply {
@@ -40,6 +42,8 @@ struct nbd_reply {
 uint32_t error;
 };

+/* Transmission (export) flags: sent from server to client during handshake,
+   but describe what will happen during transmission */
 #define NBD_FLAG_HAS_FLAGS  (1 << 0)/* Flags are there */
 #define NBD_FLAG_READ_ONLY  (1 << 1)/* Device is read-only */
 #define NBD_FLAG_SEND_FLUSH (1 << 2)/* Send FLUSH */
@@ -47,10 +51,12 @@ struct nbd_reply {
 #define NBD_FLAG_ROTATIONAL (1 << 4)/* Use elevator algorithm - 
rotational media */
 #define NBD_FLAG_SEND_TRIM  (1 << 5)/* Send TRIM (discard) */

-/* New-style global flags. */
+/* New-style handshake (global) flags, sent from server to client, and
+   control what will happen during handshake phase. */
 #define NBD_FLAG_FIXED_NEWSTYLE (1 << 0)/* Fixed newstyle protocol. */

-/* New-style client flags. */
+/* New-style client flags, sent from client to server to control what happens
+   during handshake phase. */
 #define NBD_FLAG_C_FIXED_NEWSTYLE   (1 << 0)/* Fixed newstyle protocol. */

 /* Reply types. */
@@ -61,10 +67,10 @@ struct nbd_reply {
 #define NBD_REP_ERR_INVALID ((UINT32_C(1) << 31) | 3) /* Invalid length. */
 #define NBD_REP_ERR_TLS_REQD((UINT32_C(1) << 31) | 5) /* TLS required */

+/* Request flags, sent from client to server during transmission phase */
+#define NBD_CMD_FLAG_FUA(1 << 0)

-#define NBD_CMD_MASK_COMMAND   0x
-#define NBD_CMD_FLAG_FUA   (1 << 16)
-
+/* Supported request types */
 enum {
 NBD_CMD_READ = 0,
 NBD_CMD_WRITE = 1,
diff --git a/nbd/nbd-internal.h b/nbd/nbd-internal.h
index 7e78064..99e5157 100644
--- a/nbd/nbd-internal.h
+++ b/nbd/nbd-internal.h
@@ -53,10 +53,10 @@
 /* This is all part of the "official" NBD API.
  *
  * The most up-to-date documentation is available at:
- * https://github.com/yoe/nbd/blob/master/doc/proto.txt
+ * https://github.com/yoe/nbd/blob/master/doc/proto.md
  */

-#define NBD_REQUEST_SIZE(4 + 4 + 8 + 8 + 4)
+#define NBD_REQUEST_SIZE(4 + 2 + 2 + 8 + 8 + 4)
 #define NBD_REPLY_SIZE  (4 + 4 + 8)
 #define NBD_REQUEST_MAGIC   0x25609513
 #define NBD_REPLY_MAGIC 0x67446698
diff --git a/block/nbd-client.c b/block/nbd-client.c
index 2cf3237..7e9c3ec 100644
--- a/block/nbd-client.c
+++ b/block/nbd-client.c
@@ -1,6 +1,7 @@
 /*
  * QEMU Block driver for  NBD
  *
+ * Copyright (C) 2016 Red Hat, Inc.
  * Copyright (C) 2008 Bull S.A.S.
  * Author: Laurent Vivier 
  *
@@ -258,7 +259,7 @@ int nbd_client_co_pwritev(BlockDriverState *bs, uint64_t 
offset,

 if (flags & BDRV_REQ_FUA) {
 assert(client->nbdflags & NBD_FLAG_SEND_FUA);
-request.type |= NBD_CMD_FLAG_FUA;
+request.flags |= NBD_CMD_FLAG_FUA;
 }

 assert(bytes <= NBD_MAX_BUFFER_SIZE);
@@ -343,11 +344,7 @@ void nbd_client_attach_aio_context(BlockDriverState *bs,
 void nbd_client_close(BlockDriverState *bs)
 {
 NbdClientSession *client = nbd_get_client_session(bs);
-struct nbd_request request = {
-.type = NBD_CMD_DISC,
-.from = 0,
-.len = 0
-};
+struct nbd_request request = { .type = NBD_CMD_DISC };

 if (client->ioc == NULL) {
 return;
diff --git a/nbd/client.c b/nbd/client.c
index a92f1e2..7c172ed 100644
--- a/nbd/client.c
+++ b/nbd/client.c
@@ -1,4 +1,5 @@
 /*
+ *  Copyright (C) 2016 Red Hat, Inc.
  *  Copyright (C) 2005  

[Qemu-block] [PATCH v6 04/15] nbd: Rename NbdClientSession to NBDClientSession

2016-10-13 Thread Eric Blake
It's better to use consistent capitalization of the namespace
used for NBD functions; we have more instances of NBD* than
Nbd*.

Signed-off-by: Eric Blake 

---
v6: new patch
---
 block/nbd-client.h |  6 +++---
 block/nbd-client.c | 26 +-
 block/nbd.c|  4 ++--
 3 files changed, 18 insertions(+), 18 deletions(-)

diff --git a/block/nbd-client.h b/block/nbd-client.h
index 044aca4..a84a478 100644
--- a/block/nbd-client.h
+++ b/block/nbd-client.h
@@ -17,7 +17,7 @@

 #define MAX_NBD_REQUESTS16

-typedef struct NbdClientSession {
+typedef struct NBDClientSession {
 QIOChannelSocket *sioc; /* The master data channel */
 QIOChannel *ioc; /* The current I/O channel which may differ (eg TLS) */
 uint16_t nbdflags;
@@ -32,9 +32,9 @@ typedef struct NbdClientSession {
 struct nbd_reply reply;

 bool is_unix;
-} NbdClientSession;
+} NBDClientSession;

-NbdClientSession *nbd_get_client_session(BlockDriverState *bs);
+NBDClientSession *nbd_get_client_session(BlockDriverState *bs);

 int nbd_client_init(BlockDriverState *bs,
 QIOChannelSocket *sock,
diff --git a/block/nbd-client.c b/block/nbd-client.c
index 7e9c3ec..c94608a 100644
--- a/block/nbd-client.c
+++ b/block/nbd-client.c
@@ -33,7 +33,7 @@
 #define HANDLE_TO_INDEX(bs, handle) ((handle) ^ ((uint64_t)(intptr_t)bs))
 #define INDEX_TO_HANDLE(bs, index)  ((index)  ^ ((uint64_t)(intptr_t)bs))

-static void nbd_recv_coroutines_enter_all(NbdClientSession *s)
+static void nbd_recv_coroutines_enter_all(NBDClientSession *s)
 {
 int i;

@@ -46,7 +46,7 @@ static void nbd_recv_coroutines_enter_all(NbdClientSession *s)

 static void nbd_teardown_connection(BlockDriverState *bs)
 {
-NbdClientSession *client = nbd_get_client_session(bs);
+NBDClientSession *client = nbd_get_client_session(bs);

 if (!client->ioc) { /* Already closed */
 return;
@@ -68,7 +68,7 @@ static void nbd_teardown_connection(BlockDriverState *bs)
 static void nbd_reply_ready(void *opaque)
 {
 BlockDriverState *bs = opaque;
-NbdClientSession *s = nbd_get_client_session(bs);
+NBDClientSession *s = nbd_get_client_session(bs);
 uint64_t i;
 int ret;

@@ -119,7 +119,7 @@ static int nbd_co_send_request(BlockDriverState *bs,
struct nbd_request *request,
QEMUIOVector *qiov)
 {
-NbdClientSession *s = nbd_get_client_session(bs);
+NBDClientSession *s = nbd_get_client_session(bs);
 AioContext *aio_context;
 int rc, ret, i;

@@ -167,7 +167,7 @@ static int nbd_co_send_request(BlockDriverState *bs,
 return rc;
 }

-static void nbd_co_receive_reply(NbdClientSession *s,
+static void nbd_co_receive_reply(NBDClientSession *s,
  struct nbd_request *request,
  struct nbd_reply *reply,
  QEMUIOVector *qiov)
@@ -195,7 +195,7 @@ static void nbd_co_receive_reply(NbdClientSession *s,
 }
 }

-static void nbd_coroutine_start(NbdClientSession *s,
+static void nbd_coroutine_start(NBDClientSession *s,
struct nbd_request *request)
 {
 /* Poor man semaphore.  The free_sema is locked when no other request
@@ -209,7 +209,7 @@ static void nbd_coroutine_start(NbdClientSession *s,
 /* s->recv_coroutine[i] is set as soon as we get the send_lock.  */
 }

-static void nbd_coroutine_end(NbdClientSession *s,
+static void nbd_coroutine_end(NBDClientSession *s,
 struct nbd_request *request)
 {
 int i = HANDLE_TO_INDEX(s, request->handle);
@@ -222,7 +222,7 @@ static void nbd_coroutine_end(NbdClientSession *s,
 int nbd_client_co_preadv(BlockDriverState *bs, uint64_t offset,
  uint64_t bytes, QEMUIOVector *qiov, int flags)
 {
-NbdClientSession *client = nbd_get_client_session(bs);
+NBDClientSession *client = nbd_get_client_session(bs);
 struct nbd_request request = {
 .type = NBD_CMD_READ,
 .from = offset,
@@ -248,7 +248,7 @@ int nbd_client_co_preadv(BlockDriverState *bs, uint64_t 
offset,
 int nbd_client_co_pwritev(BlockDriverState *bs, uint64_t offset,
   uint64_t bytes, QEMUIOVector *qiov, int flags)
 {
-NbdClientSession *client = nbd_get_client_session(bs);
+NBDClientSession *client = nbd_get_client_session(bs);
 struct nbd_request request = {
 .type = NBD_CMD_WRITE,
 .from = offset,
@@ -277,7 +277,7 @@ int nbd_client_co_pwritev(BlockDriverState *bs, uint64_t 
offset,

 int nbd_client_co_flush(BlockDriverState *bs)
 {
-NbdClientSession *client = nbd_get_client_session(bs);
+NBDClientSession *client = nbd_get_client_session(bs);
 struct nbd_request request = { .type = NBD_CMD_FLUSH };
 struct nbd_reply reply;
 ssize_t ret;
@@ -302,7 +302,7 @@ int nbd_client_co_flush(BlockDriverState *bs)

 int nbd_client_co_pdiscard(BlockDriverState *bs, int64_t offset, int count)
 {
-NbdClientSession 

[Qemu-block] [PATCH v6 03/15] nbd: Rename NBDRequest to NBDRequestData

2016-10-13 Thread Eric Blake
We have both 'struct NBDRequest' and 'struct nbd_request', which makes
it confusing to tell which does what.  Furthermore, we want to rename
nbd_request to align with our normal CamelCase naming conventions.
So, rename the struct that is used to associate the data received
during request callbacks, while leaving the shorter name for the
description of the request sent over the wire in the NBD protocol.

Signed-off-by: Eric Blake 

---
v6: new patch
---
 nbd/server.c | 20 ++--
 1 file changed, 10 insertions(+), 10 deletions(-)

diff --git a/nbd/server.c b/nbd/server.c
index 2e84d51..78c0419 100644
--- a/nbd/server.c
+++ b/nbd/server.c
@@ -47,10 +47,10 @@ static int system_errno_to_nbd_errno(int err)

 /* Definitions for opaque data types */

-typedef struct NBDRequest NBDRequest;
+typedef struct NBDRequestData NBDRequestData;

-struct NBDRequest {
-QSIMPLEQ_ENTRY(NBDRequest) entry;
+struct NBDRequestData {
+QSIMPLEQ_ENTRY(NBDRequestData) entry;
 NBDClient *client;
 uint8_t *data;
 bool complete;
@@ -759,21 +759,21 @@ static void client_close(NBDClient *client)
 }
 }

-static NBDRequest *nbd_request_get(NBDClient *client)
+static NBDRequestData *nbd_request_get(NBDClient *client)
 {
-NBDRequest *req;
+NBDRequestData *req;

 assert(client->nb_requests <= MAX_NBD_REQUESTS - 1);
 client->nb_requests++;
 nbd_update_can_read(client);

-req = g_new0(NBDRequest, 1);
+req = g_new0(NBDRequestData, 1);
 nbd_client_get(client);
 req->client = client;
 return req;
 }

-static void nbd_request_put(NBDRequest *req)
+static void nbd_request_put(NBDRequestData *req)
 {
 NBDClient *client = req->client;

@@ -975,7 +975,7 @@ void nbd_export_close_all(void)
 }
 }

-static ssize_t nbd_co_send_reply(NBDRequest *req, struct nbd_reply *reply,
+static ssize_t nbd_co_send_reply(NBDRequestData *req, struct nbd_reply *reply,
  int len)
 {
 NBDClient *client = req->client;
@@ -1011,7 +1011,7 @@ static ssize_t nbd_co_send_reply(NBDRequest *req, struct 
nbd_reply *reply,
  * and any other negative value to report an error to the client
  * (although the caller may still need to disconnect after reporting
  * the error).  */
-static ssize_t nbd_co_receive_request(NBDRequest *req,
+static ssize_t nbd_co_receive_request(NBDRequestData *req,
   struct nbd_request *request)
 {
 NBDClient *client = req->client;
@@ -1105,7 +1105,7 @@ static void nbd_trip(void *opaque)
 {
 NBDClient *client = opaque;
 NBDExport *exp = client->exp;
-NBDRequest *req;
+NBDRequestData *req;
 struct nbd_request request;
 struct nbd_reply reply;
 ssize_t ret;
-- 
2.7.4




[Qemu-block] [PATCH v6 01/15] nbd: Add qemu-nbd -D for human-readable description

2016-10-13 Thread Eric Blake
The NBD protocol allows servers to advertise a human-readable
description alongside an export name during NBD_OPT_LIST.  Add
an option to pass through the user's string to the NBD client.

Doing this also makes it easier to test commit 200650d4, which
is the client counterpart of receiving the description.

Signed-off-by: Eric Blake 

---
v6: rebase to latest
v5: rebase to latest
v4: rebase to latest
---
 include/block/nbd.h |  1 +
 nbd/nbd-internal.h  |  5 +++--
 nbd/server.c| 34 ++
 qemu-nbd.c  | 12 +++-
 qemu-nbd.texi   |  5 -
 5 files changed, 45 insertions(+), 12 deletions(-)

diff --git a/include/block/nbd.h b/include/block/nbd.h
index 80610ff..fd58390 100644
--- a/include/block/nbd.h
+++ b/include/block/nbd.h
@@ -115,6 +115,7 @@ BlockBackend *nbd_export_get_blockdev(NBDExport *exp);

 NBDExport *nbd_export_find(const char *name);
 void nbd_export_set_name(NBDExport *exp, const char *name);
+void nbd_export_set_description(NBDExport *exp, const char *description);
 void nbd_export_close_all(void);

 void nbd_client_new(NBDExport *exp,
diff --git a/nbd/nbd-internal.h b/nbd/nbd-internal.h
index 93a6ca8..7e78064 100644
--- a/nbd/nbd-internal.h
+++ b/nbd/nbd-internal.h
@@ -104,9 +104,10 @@ static inline ssize_t read_sync(QIOChannel *ioc, void *buffer, size_t size)
 return nbd_wr_syncv(ioc, &iov, 1, size, true);
 }

-static inline ssize_t write_sync(QIOChannel *ioc, void *buffer, size_t size)
+static inline ssize_t write_sync(QIOChannel *ioc, const void *buffer,
+ size_t size)
 {
-struct iovec iov = { .iov_base = buffer, .iov_len = size };
+struct iovec iov = { .iov_base = (void *) buffer, .iov_len = size };

 return nbd_wr_syncv(ioc, &iov, 1, size, false);
 }
diff --git a/nbd/server.c b/nbd/server.c
index 472f584..319827b 100644
--- a/nbd/server.c
+++ b/nbd/server.c
@@ -61,6 +61,7 @@ struct NBDExport {

 BlockBackend *blk;
 char *name;
+char *description;
 off_t dev_offset;
 off_t size;
 uint16_t nbdflags;
@@ -129,7 +130,8 @@ static ssize_t nbd_negotiate_read(QIOChannel *ioc, void *buffer, size_t size)

 }

-static ssize_t nbd_negotiate_write(QIOChannel *ioc, void *buffer, size_t size)
+static ssize_t nbd_negotiate_write(QIOChannel *ioc, const void *buffer,
+   size_t size)
 {
 ssize_t ret;
 guint watch;
@@ -225,11 +227,15 @@ static int nbd_negotiate_send_rep(QIOChannel *ioc, uint32_t type, uint32_t opt)

 static int nbd_negotiate_send_rep_list(QIOChannel *ioc, NBDExport *exp)
 {
-uint64_t magic, name_len;
+uint64_t magic;
+size_t name_len, desc_len;
 uint32_t opt, type, len;
+const char *name = exp->name ? exp->name : "";
+const char *desc = exp->description ? exp->description : "";

-TRACE("Advertising export name '%s'", exp->name ? exp->name : "");
-name_len = strlen(exp->name);
+TRACE("Advertising export name '%s' description '%s'", name, desc);
+name_len = strlen(name);
+desc_len = strlen(desc);
 magic = cpu_to_be64(NBD_REP_MAGIC);
 if (nbd_negotiate_write(ioc, &magic, sizeof(magic)) != sizeof(magic)) {
 LOG("write failed (magic)");
@@ -245,18 +251,22 @@ static int nbd_negotiate_send_rep_list(QIOChannel *ioc, NBDExport *exp)
 LOG("write failed (reply type)");
 return -EINVAL;
 }
-len = cpu_to_be32(name_len + sizeof(len));
+len = cpu_to_be32(name_len + desc_len + sizeof(len));
 if (nbd_negotiate_write(ioc, &len, sizeof(len)) != sizeof(len)) {
 LOG("write failed (length)");
 return -EINVAL;
 }
 len = cpu_to_be32(name_len);
 if (nbd_negotiate_write(ioc, &len, sizeof(len)) != sizeof(len)) {
-LOG("write failed (length)");
+LOG("write failed (name length)");
 return -EINVAL;
 }
-if (nbd_negotiate_write(ioc, exp->name, name_len) != name_len) {
-LOG("write failed (buffer)");
+if (nbd_negotiate_write(ioc, name, name_len) != name_len) {
+LOG("write failed (name buffer)");
+return -EINVAL;
+}
+if (nbd_negotiate_write(ioc, desc, desc_len) != desc_len) {
+LOG("write failed (description buffer)");
 return -EINVAL;
 }
 return 0;
@@ -893,6 +903,12 @@ void nbd_export_set_name(NBDExport *exp, const char *name)
 nbd_export_put(exp);
 }

+void nbd_export_set_description(NBDExport *exp, const char *description)
+{
+g_free(exp->description);
+exp->description = g_strdup(description);
+}
+
 void nbd_export_close(NBDExport *exp)
 {
 NBDClient *client, *next;
@@ -902,6 +918,7 @@ void nbd_export_close(NBDExport *exp)
 client_close(client);
 }
 nbd_export_set_name(exp, NULL);
+nbd_export_set_description(exp, NULL);
 nbd_export_put(exp);
 }

@@ -920,6 +937,7 @@ void nbd_export_put(NBDExport *exp)

 if (--exp->refcount == 0) {
 assert(exp->name == NULL);
+assert(exp->description == 

Re: [Qemu-block] [PATCH 2/4] block/ssh: Add InetSocketAddress and accept it

2016-10-13 Thread Ashijeet Acharya
On Tue, Oct 11, 2016 at 1:07 PM, Ashijeet Acharya
 wrote:
> Add InetSocketAddress compatibility to SSH driver.
>
> Add a new option "server" to the SSH block driver which then accepts
> a InetSocketAddress.
>
> "host" and "port" are supported as legacy options and are mapped to
> their InetSocketAddress representation.
>
> Signed-off-by: Ashijeet Acharya 
> ---
>  block/ssh.c | 87 ++---
>  1 file changed, 78 insertions(+), 9 deletions(-)
>
> diff --git a/block/ssh.c b/block/ssh.c
> index 75cb7bc..702871a 100644
> --- a/block/ssh.c
> +++ b/block/ssh.c
> @@ -32,8 +32,11 @@
>  #include "qemu/error-report.h"
>  #include "qemu/sockets.h"
>  #include "qemu/uri.h"
> +#include "qapi-visit.h"
>  #include "qapi/qmp/qint.h"
>  #include "qapi/qmp/qstring.h"
> +#include "qapi/qmp-input-visitor.h"
> +#include "qapi/qmp-output-visitor.h"
>
>  /* DEBUG_SSH=1 enables the DPRINTF (debugging printf) statements in
>   * this block driver code.
> @@ -74,6 +77,8 @@ typedef struct BDRVSSHState {
>   */
>  LIBSSH2_SFTP_ATTRIBUTES attrs;
>
> +InetSocketAddress *inet;
> +
>  /* Used to warn if 'flush' is not supported. */
>  char *hostport;
>  bool unsafe_flush_warning;
> @@ -263,7 +268,9 @@ static bool ssh_has_filename_options_conflict(QDict *options, Error **errp)
>  !strcmp(qe->key, "port") ||
>  !strcmp(qe->key, "path") ||
>  !strcmp(qe->key, "user") ||
> -!strcmp(qe->key, "host_key_check"))
> +!strcmp(qe->key, "host_key_check") ||
> +!strcmp(qe->key, "server") ||
> +!strncmp(qe->key, "server.", 7))
>  {
>  error_setg(errp, "Option '%s' cannot be used with a file name",
> qe->key);
> @@ -555,13 +562,71 @@ static QemuOptsList ssh_runtime_opts = {
>  },
>  };
>
> +static bool ssh_process_legacy_socket_options(QDict *output_opts,
> +  QemuOpts *legacy_opts,
> +  Error **errp)
> +{
> +const char *host = qemu_opt_get(legacy_opts, "host");
> +const char *port = qemu_opt_get(legacy_opts, "port");
> +
> +if (!host && port) {
> +error_setg(errp, "port may not be used without host");
> +return false;
> +}
> +
> +if (!host) {
> +error_setg(errp, "No hostname was specified");
> +return false;

I was debugging this part with gdb while making the changes for v2,
and I hit something very strange.
The code always fails with the error "No hostname was specified". To be
more clear: when I revert back to the original driver state, the problem
does not appear for the same qemu command line, and I can't find the bug.

Command I used:

$ ./bin/qemu-system-x86_64 -m 512 -drive
file=ssh://ashijeet@127.0.0.1/home/ashijeet/qemu_build/test.qcow2,
if=virtio

Is there something wrong with the command line?

Ashijeet



Re: [Qemu-block] [Qemu-devel] [PATCH v7 00/22] qcow2: persistent dirty bitmaps

2016-10-13 Thread John Snow



On 10/01/2016 09:37 AM, Max Reitz wrote:

On 30.09.2016 12:53, Vladimir Sementsov-Ogievskiy wrote:

v7:
https://src.openvz.org/users/vsementsov/repos/qemu/browse?at=refs%2Ftags%2Fqcow2-bitmap-v7
based on block-next (https://github.com/XanClic/qemu/commits/block-next)


It should be noted that (at least my) block-next is only valid during
the freeze; after that it becomes stale. I assume the patches you require
from my block-next branch are the ones from the "Dirty bitmap changes
for migration/persistence work" series, which I had to drop from my pull
requests due to some issues on Big Endian machines.

The only reason they are still in my block-next branch is, as I said,
that that branch is just stale. I won't overwrite it for the time being,
though, so this series can still be applied on top.

Max



Lemme fix up the 32bit problems and resend that out right now.
(At least it's not lock correctness.)

--js



[Qemu-block] [PATCH 12/18] iothread: detach all block devices before stopping them

2016-10-13 Thread Paolo Bonzini
Soon bdrv_drain will not call aio_poll itself on iothreads.  If block
devices are left hanging off the iothread's AioContext, there will be no
one to do I/O for those poor devices.

Signed-off-by: Paolo Bonzini 
---
 iothread.c | 10 ++
 1 file changed, 10 insertions(+)

diff --git a/iothread.c b/iothread.c
index 62c8796..8153e21 100644
--- a/iothread.c
+++ b/iothread.c
@@ -16,6 +16,7 @@
 #include "qom/object_interfaces.h"
 #include "qemu/module.h"
 #include "block/aio.h"
+#include "block/block.h"
 #include "sysemu/iothread.h"
 #include "qmp-commands.h"
 #include "qemu/error-report.h"
@@ -199,6 +200,15 @@ IOThreadInfoList *qmp_query_iothreads(Error **errp)
 void iothread_stop_all(void)
 {
 Object *container = object_get_objects_root();
+BlockDriverState *bs;
+BdrvNextIterator it;
+
+for (bs = bdrv_first(&it); bs; bs = bdrv_next(&it)) {
+AioContext *ctx = bdrv_get_aio_context(bs);
+aio_context_acquire(ctx);
+bdrv_set_aio_context(bs, qemu_get_aio_context());
+aio_context_release(ctx);
+}
 
 object_child_foreach(container, iothread_stop, NULL);
 }
-- 
2.7.4





[Qemu-block] [PATCH 13/18] replication: pass BlockDriverState to reopen_backing_file

2016-10-13 Thread Paolo Bonzini
This will be needed in the next patch to retrieve the AioContext.

Signed-off-by: Paolo Bonzini 
---
 block/replication.c | 21 -
 1 file changed, 12 insertions(+), 9 deletions(-)

diff --git a/block/replication.c b/block/replication.c
index 5231a00..af47479 100644
--- a/block/replication.c
+++ b/block/replication.c
@@ -317,9 +317,10 @@ static void secondary_do_checkpoint(BDRVReplicationState *s, Error **errp)
 }
 }
 
-static void reopen_backing_file(BDRVReplicationState *s, bool writable,
+static void reopen_backing_file(BlockDriverState *bs, bool writable,
 Error **errp)
 {
+BDRVReplicationState *s = bs->opaque;
 BlockReopenQueue *reopen_queue = NULL;
 int orig_hidden_flags, orig_secondary_flags;
 int new_hidden_flags, new_secondary_flags;
@@ -359,8 +360,9 @@ static void reopen_backing_file(BDRVReplicationState *s, bool writable,
 }
 }
 
-static void backup_job_cleanup(BDRVReplicationState *s)
+static void backup_job_cleanup(BlockDriverState *bs)
 {
+BDRVReplicationState *s = bs->opaque;
 BlockDriverState *top_bs;
 
 top_bs = bdrv_lookup_bs(s->top_id, s->top_id, NULL);
@@ -369,19 +371,20 @@ static void backup_job_cleanup(BDRVReplicationState *s)
 }
 bdrv_op_unblock_all(top_bs, s->blocker);
 error_free(s->blocker);
-reopen_backing_file(s, false, NULL);
+reopen_backing_file(bs, false, NULL);
 }
 
 static void backup_job_completed(void *opaque, int ret)
 {
-BDRVReplicationState *s = opaque;
+BlockDriverState *bs = opaque;
+BDRVReplicationState *s = bs->opaque;
 
 if (s->replication_state != BLOCK_REPLICATION_FAILOVER) {
 /* The backup job is cancelled unexpectedly */
 s->error = -EIO;
 }
 
-backup_job_cleanup(s);
+backup_job_cleanup(bs);
 }
 
 static bool check_top_bs(BlockDriverState *top_bs, BlockDriverState *bs)
@@ -477,7 +480,7 @@ static void replication_start(ReplicationState *rs, ReplicationMode mode,
 }
 
 /* reopen the backing file in r/w mode */
-reopen_backing_file(s, true, &local_err);
+reopen_backing_file(bs, true, &local_err);
 if (local_err) {
 error_propagate(errp, local_err);
 aio_context_release(aio_context);
@@ -492,7 +495,7 @@ static void replication_start(ReplicationState *rs, ReplicationMode mode,
 if (!top_bs || !bdrv_is_root_node(top_bs) ||
 !check_top_bs(top_bs, bs)) {
 error_setg(errp, "No top_bs or it is invalid");
-reopen_backing_file(s, false, NULL);
+reopen_backing_file(bs, false, NULL);
 aio_context_release(aio_context);
 return;
 }
@@ -502,10 +505,10 @@ static void replication_start(ReplicationState *rs, ReplicationMode mode,
 backup_start("replication-backup", s->secondary_disk->bs,
  s->hidden_disk->bs, 0, MIRROR_SYNC_MODE_NONE, NULL, false,
  BLOCKDEV_ON_ERROR_REPORT, BLOCKDEV_ON_ERROR_REPORT,
- backup_job_completed, s, NULL, &local_err);
+ backup_job_completed, bs, NULL, &local_err);
 if (local_err) {
 error_propagate(errp, local_err);
-backup_job_cleanup(s);
+backup_job_cleanup(bs);
 aio_context_release(aio_context);
 return;
 }
-- 
2.7.4





[Qemu-block] [PATCH 11/18] aio: introduce qemu_get_current_aio_context

2016-10-13 Thread Paolo Bonzini
This will be used by bdrv_poll_while (and thus by bdrv_drain)
to choose how to wait for I/O completion.

Reviewed-by: Fam Zheng 
Signed-off-by: Paolo Bonzini 
---
 include/block/aio.h | 15 +++
 iothread.c  |  9 +
 stubs/Makefile.objs |  1 +
 stubs/iothread.c|  8 
 4 files changed, 33 insertions(+)
 create mode 100644 stubs/iothread.c

diff --git a/include/block/aio.h b/include/block/aio.h
index b9fe2cb..60a4f21 100644
--- a/include/block/aio.h
+++ b/include/block/aio.h
@@ -453,6 +453,21 @@ static inline bool aio_node_check(AioContext *ctx, bool is_external)
 }
 
 /**
+ * Return the AioContext whose event loop runs in the current I/O thread.
+ */
+AioContext *qemu_get_current_aio_context(void);
+
+/**
+ * @ctx: the aio context
+ *
+ * Return whether we are running in the I/O thread that manages @ctx.
+ */
+static inline bool aio_context_in_iothread(AioContext *ctx)
+{
+return ctx == qemu_get_current_aio_context();
+}
+
+/**
  * aio_context_setup:
  * @ctx: the aio context
  *
diff --git a/iothread.c b/iothread.c
index fbeb8de..62c8796 100644
--- a/iothread.c
+++ b/iothread.c
@@ -20,6 +20,7 @@
 #include "qmp-commands.h"
 #include "qemu/error-report.h"
 #include "qemu/rcu.h"
+#include "qemu/main-loop.h"
 
 typedef ObjectClass IOThreadClass;
 
@@ -28,6 +29,13 @@ typedef ObjectClass IOThreadClass;
 #define IOTHREAD_CLASS(klass) \
OBJECT_CLASS_CHECK(IOThreadClass, klass, TYPE_IOTHREAD)
 
+static __thread IOThread *my_iothread;
+
+AioContext *qemu_get_current_aio_context(void)
+{
+return my_iothread ? my_iothread->ctx : qemu_get_aio_context();
+}
+
 static void *iothread_run(void *opaque)
 {
 IOThread *iothread = opaque;
@@ -35,6 +43,7 @@ static void *iothread_run(void *opaque)
 
 rcu_register_thread();
 
+my_iothread = iothread;
 qemu_mutex_lock(&iothread->init_done_lock);
 iothread->thread_id = qemu_get_thread_id();
 qemu_cond_signal(&iothread->init_done_cond);
diff --git a/stubs/Makefile.objs b/stubs/Makefile.objs
index c5850e8..84b9d9e 100644
--- a/stubs/Makefile.objs
+++ b/stubs/Makefile.objs
@@ -17,6 +17,7 @@ stub-obj-y += gdbstub.o
 stub-obj-y += get-fd.o
 stub-obj-y += get-next-serial.o
 stub-obj-y += get-vm-name.o
+stub-obj-y += iothread.o
 stub-obj-y += iothread-lock.o
 stub-obj-y += is-daemonized.o
 stub-obj-y += machine-init-done.o
diff --git a/stubs/iothread.c b/stubs/iothread.c
new file mode 100644
index 000..8cc9e28
--- /dev/null
+++ b/stubs/iothread.c
@@ -0,0 +1,8 @@
+#include "qemu/osdep.h"
+#include "block/aio.h"
+#include "qemu/main-loop.h"
+
+AioContext *qemu_get_current_aio_context(void)
+{
+return qemu_get_aio_context();
+}
-- 
2.7.4





[Qemu-block] [PATCH 10/18] sheepdog: use bdrv_poll_while and bdrv_wakeup

2016-10-13 Thread Paolo Bonzini
These ensure that bdrv_poll_while will exit for a BlockDriverState
that is not in the main AioContext.

Signed-off-by: Paolo Bonzini 
---
 block/sheepdog.c | 67 
 1 file changed, 38 insertions(+), 29 deletions(-)

diff --git a/block/sheepdog.c b/block/sheepdog.c
index ccbf7e1..93365d0 100644
--- a/block/sheepdog.c
+++ b/block/sheepdog.c
@@ -641,6 +641,7 @@ static void restart_co_req(void *opaque)
 
 typedef struct SheepdogReqCo {
 int sockfd;
+BlockDriverState *bs;
 AioContext *aio_context;
 SheepdogReq *hdr;
 void *data;
@@ -701,6 +702,9 @@ out:
 
 srco->ret = ret;
 srco->finished = true;
+if (srco->bs) {
+bdrv_wakeup(srco->bs);
+}
 }
 
 /*
@@ -708,13 +712,14 @@ out:
  *
  * Return 0 on success, -errno in case of error.
  */
-static int do_req(int sockfd, AioContext *aio_context, SheepdogReq *hdr,
+static int do_req(int sockfd, BlockDriverState *bs, SheepdogReq *hdr,
   void *data, unsigned int *wlen, unsigned int *rlen)
 {
 Coroutine *co;
 SheepdogReqCo srco = {
 .sockfd = sockfd,
-.aio_context = aio_context,
+.aio_context = bs ? bdrv_get_aio_context(bs) : qemu_get_aio_context(),
+.bs = bs,
 .hdr = hdr,
 .data = data,
 .wlen = wlen,
@@ -727,9 +732,14 @@ static int do_req(int sockfd, AioContext *aio_context, SheepdogReq *hdr,
 do_co_req(&srco);
 } else {
 co = qemu_coroutine_create(do_co_req, &srco);
-qemu_coroutine_enter(co);
-while (!srco.finished) {
-aio_poll(aio_context, true);
+if (bs) {
+qemu_coroutine_enter(co);
+bdrv_poll_while(bs, !srco.finished);
+} else {
+qemu_coroutine_enter(co);
+while (!srco.finished) {
+aio_poll(qemu_get_aio_context(), true);
+}
 }
 }
 
@@ -1125,7 +1135,7 @@ static int find_vdi_name(BDRVSheepdogState *s, const char *filename,
 hdr.snapid = snapid;
 hdr.flags = SD_FLAG_CMD_WRITE;
 
-ret = do_req(fd, s->aio_context, (SheepdogReq *)&hdr, buf, &wlen, &rlen);
+ret = do_req(fd, s->bs, (SheepdogReq *)&hdr, buf, &wlen, &rlen);
 if (ret) {
 error_setg_errno(errp, -ret, "cannot get vdi info");
 goto out;
@@ -1240,7 +1250,7 @@ out:
 qemu_co_mutex_unlock(>lock);
 }
 
-static int read_write_object(int fd, AioContext *aio_context, char *buf,
+static int read_write_object(int fd, BlockDriverState *bs, char *buf,
  uint64_t oid, uint8_t copies,
  unsigned int datalen, uint64_t offset,
  bool write, bool create, uint32_t cache_flags)
@@ -1274,7 +1284,7 @@ static int read_write_object(int fd, AioContext *aio_context, char *buf,
 hdr.offset = offset;
 hdr.copies = copies;
 
-ret = do_req(fd, aio_context, (SheepdogReq *)&hdr, buf, &wlen, &rlen);
+ret = do_req(fd, bs, (SheepdogReq *)&hdr, buf, &wlen, &rlen);
 if (ret) {
 error_report("failed to send a request to the sheep");
 return ret;
@@ -1289,22 +1299,22 @@ static int read_write_object(int fd, AioContext *aio_context, char *buf,
 }
 }
 
-static int read_object(int fd, AioContext *aio_context, char *buf,
+static int read_object(int fd, BlockDriverState *bs, char *buf,
uint64_t oid, uint8_t copies,
unsigned int datalen, uint64_t offset,
uint32_t cache_flags)
 {
-return read_write_object(fd, aio_context, buf, oid, copies,
+return read_write_object(fd, bs, buf, oid, copies,
  datalen, offset, false,
  false, cache_flags);
 }
 
-static int write_object(int fd, AioContext *aio_context, char *buf,
+static int write_object(int fd, BlockDriverState *bs, char *buf,
 uint64_t oid, uint8_t copies,
 unsigned int datalen, uint64_t offset, bool create,
 uint32_t cache_flags)
 {
-return read_write_object(fd, aio_context, buf, oid, copies,
+return read_write_object(fd, bs, buf, oid, copies,
  datalen, offset, true,
  create, cache_flags);
 }
@@ -1331,7 +1341,7 @@ static int reload_inode(BDRVSheepdogState *s, uint32_t snapid, const char *tag)
 goto out;
 }
 
-ret = read_object(fd, s->aio_context, (char *)inode, vid_to_vdi_oid(vid),
+ret = read_object(fd, s->bs, (char *)inode, vid_to_vdi_oid(vid),
   s->inode.nr_copies, SD_INODE_HEADER_SIZE, 0,
   s->cache_flags);
 if (ret < 0) {
@@ -1489,7 +1499,7 @@ static int sd_open(BlockDriverState *bs, QDict *options, int flags,
 }
 
 buf = g_malloc(SD_INODE_SIZE);
-ret = read_object(fd, s->aio_context, buf, vid_to_vdi_oid(vid),
+ret = read_object(fd, s->bs, buf, vid_to_vdi_oid(vid),
   0, SD_INODE_SIZE, 

[Qemu-block] [PATCH 17/18] qemu-thread: introduce QemuRecMutex

2016-10-13 Thread Paolo Bonzini
GRecMutex is new in glib 2.32, so we cannot use it.  Introduce
a recursive mutex in qemu-thread instead, which will be used
instead of RFifoLock.

Reviewed-by: Fam Zheng 
Signed-off-by: Paolo Bonzini 
---
 include/qemu/thread-posix.h |  6 ++
 include/qemu/thread-win32.h | 10 ++
 include/qemu/thread.h   |  3 +++
 util/qemu-thread-posix.c| 14 ++
 util/qemu-thread-win32.c| 25 +
 5 files changed, 58 insertions(+)

diff --git a/include/qemu/thread-posix.h b/include/qemu/thread-posix.h
index aa03567..09d1e15 100644
--- a/include/qemu/thread-posix.h
+++ b/include/qemu/thread-posix.h
@@ -4,6 +4,12 @@
 #include 
 #include 
 
+typedef QemuMutex QemuRecMutex;
+#define qemu_rec_mutex_destroy qemu_mutex_destroy
+#define qemu_rec_mutex_lock qemu_mutex_lock
+#define qemu_rec_mutex_try_lock qemu_mutex_try_lock
+#define qemu_rec_mutex_unlock qemu_mutex_unlock
+
 struct QemuMutex {
 pthread_mutex_t lock;
 };
diff --git a/include/qemu/thread-win32.h b/include/qemu/thread-win32.h
index c7ce8dc..5fb6541 100644
--- a/include/qemu/thread-win32.h
+++ b/include/qemu/thread-win32.h
@@ -8,6 +8,16 @@ struct QemuMutex {
 LONG owner;
 };
 
+typedef struct QemuRecMutex QemuRecMutex;
+struct QemuRecMutex {
+CRITICAL_SECTION lock;
+};
+
+void qemu_rec_mutex_destroy(QemuRecMutex *mutex);
+void qemu_rec_mutex_lock(QemuRecMutex *mutex);
+int qemu_rec_mutex_trylock(QemuRecMutex *mutex);
+void qemu_rec_mutex_unlock(QemuRecMutex *mutex);
+
 struct QemuCond {
 LONG waiters, target;
 HANDLE sema;
diff --git a/include/qemu/thread.h b/include/qemu/thread.h
index 31237e9..e8e665f 100644
--- a/include/qemu/thread.h
+++ b/include/qemu/thread.h
@@ -25,6 +25,9 @@ void qemu_mutex_lock(QemuMutex *mutex);
 int qemu_mutex_trylock(QemuMutex *mutex);
 void qemu_mutex_unlock(QemuMutex *mutex);
 
+/* Prototypes for other functions are in thread-posix.h/thread-win32.h.  */
+void qemu_rec_mutex_init(QemuRecMutex *mutex);
+
 void qemu_cond_init(QemuCond *cond);
 void qemu_cond_destroy(QemuCond *cond);
 
diff --git a/util/qemu-thread-posix.c b/util/qemu-thread-posix.c
index 74a3023..1bb00a8 100644
--- a/util/qemu-thread-posix.c
+++ b/util/qemu-thread-posix.c
@@ -80,6 +80,20 @@ void qemu_mutex_unlock(QemuMutex *mutex)
 error_exit(err, __func__);
 }
 
+void qemu_rec_mutex_init(QemuRecMutex *mutex)
+{
+int err;
+pthread_mutexattr_t attr;
+
+pthread_mutexattr_init(&attr);
+pthread_mutexattr_settype(&attr, PTHREAD_MUTEX_RECURSIVE);
+err = pthread_mutex_init(&mutex->lock, &attr);
+pthread_mutexattr_destroy(&attr);
+if (err) {
+error_exit(err, __func__);
+}
+}
+
 void qemu_cond_init(QemuCond *cond)
 {
 int err;
diff --git a/util/qemu-thread-win32.c b/util/qemu-thread-win32.c
index 98a5ddf..171d83c 100644
--- a/util/qemu-thread-win32.c
+++ b/util/qemu-thread-win32.c
@@ -79,6 +79,31 @@ void qemu_mutex_unlock(QemuMutex *mutex)
 LeaveCriticalSection(&mutex->lock);
 }
 
+void qemu_rec_mutex_init(QemuRecMutex *mutex)
+{
+InitializeCriticalSection(&mutex->lock);
+}
+
+void qemu_rec_mutex_destroy(QemuRecMutex *mutex)
+{
+DeleteCriticalSection(&mutex->lock);
+}
+
+void qemu_rec_mutex_lock(QemuRecMutex *mutex)
+{
+EnterCriticalSection(&mutex->lock);
+}
+
+int qemu_rec_mutex_trylock(QemuRecMutex *mutex)
+{
+return !TryEnterCriticalSection(&mutex->lock);
+}
+
+void qemu_rec_mutex_unlock(QemuRecMutex *mutex)
+{
+LeaveCriticalSection(&mutex->lock);
+}
+
 void qemu_cond_init(QemuCond *cond)
 {
 memset(cond, 0, sizeof(*cond));
-- 
2.7.4





[Qemu-block] [PATCH 16/18] iothread: release AioContext around aio_poll

2016-10-13 Thread Paolo Bonzini
This is the first step towards having fine-grained critical sections in
dataplane threads, which will resolve lock ordering problems between
address_space_* functions (which need the BQL when doing MMIO, even
after we complete RCU-based dispatch) and the AioContext.

Because AioContext does not use contention callbacks anymore, the
unit test has to be changed.

Previously applied as a0710f7995f914e3044e5899bd8ff6c43c62f916 and
then reverted.

Reviewed-by: Fam Zheng 
Signed-off-by: Paolo Bonzini 
---
 async.c | 22 +++---
 docs/multiple-iothreads.txt | 40 +++-
 include/block/aio.h |  3 ---
 iothread.c  | 11 ++-
 tests/test-aio.c| 22 ++
 5 files changed, 42 insertions(+), 56 deletions(-)

diff --git a/async.c b/async.c
index fb37b03..27db772 100644
--- a/async.c
+++ b/async.c
@@ -107,8 +107,8 @@ int aio_bh_poll(AioContext *ctx)
  * aio_notify again if necessary.
  */
 if (atomic_xchg(&bh->scheduled, 0)) {
-/* Idle BHs and the notify BH don't count as progress */
-if (!bh->idle && bh != ctx->notify_dummy_bh) {
+/* Idle BHs don't count as progress */
+if (!bh->idle) {
 ret = 1;
 }
 bh->idle = 0;
@@ -260,7 +260,6 @@ aio_ctx_finalize(GSource *source)
 {
 AioContext *ctx = (AioContext *) source;
 
-qemu_bh_delete(ctx->notify_dummy_bh);
 thread_pool_free(ctx->thread_pool);
 
 #ifdef CONFIG_LINUX_AIO
@@ -346,19 +345,6 @@ static void aio_timerlist_notify(void *opaque)
 aio_notify(opaque);
 }
 
-static void aio_rfifolock_cb(void *opaque)
-{
-AioContext *ctx = opaque;
-
-/* Kick owner thread in case they are blocked in aio_poll() */
-qemu_bh_schedule(ctx->notify_dummy_bh);
-}
-
-static void notify_dummy_bh(void *opaque)
-{
-/* Do nothing, we were invoked just to force the event loop to iterate */
-}
-
 static void event_notifier_dummy_cb(EventNotifier *e)
 {
 }
@@ -386,11 +372,9 @@ AioContext *aio_context_new(Error **errp)
 #endif
 ctx->thread_pool = NULL;
 qemu_mutex_init(&ctx->bh_lock);
-rfifolock_init(&ctx->lock, aio_rfifolock_cb, ctx);
+rfifolock_init(&ctx->lock, NULL, NULL);
 timerlistgroup_init(&ctx->tlg, aio_timerlist_notify, ctx);
 
-ctx->notify_dummy_bh = aio_bh_new(ctx, notify_dummy_bh, NULL);
-
 return ctx;
 fail:
 g_source_destroy(&ctx->source);
diff --git a/docs/multiple-iothreads.txt b/docs/multiple-iothreads.txt
index 40b8419..0e7cdb2 100644
--- a/docs/multiple-iothreads.txt
+++ b/docs/multiple-iothreads.txt
@@ -105,13 +105,10 @@ a BH in the target AioContext beforehand and then call qemu_bh_schedule().  No
 acquire/release or locking is needed for the qemu_bh_schedule() call.  But be
 sure to acquire the AioContext for aio_bh_new() if necessary.
 
-The relationship between AioContext and the block layer

-The AioContext originates from the QEMU block layer because it provides a
-scoped way of running event loop iterations until all work is done.  This
-feature is used to complete all in-flight block I/O requests (see
-bdrv_drain_all()).  Nowadays AioContext is a generic event loop that can be
-used by any QEMU subsystem.
+AioContext and the block layer
+--
+The AioContext originates from the QEMU block layer, even though nowadays
+AioContext is a generic event loop that can be used by any QEMU subsystem.
 
 The block layer has support for AioContext integrated.  Each BlockDriverState
 is associated with an AioContext using bdrv_set_aio_context() and
@@ -122,13 +119,22 @@ Block layer code must therefore expect to run in an IOThread and avoid using
 old APIs that implicitly use the main loop.  See the "How to program for
 IOThreads" above for information on how to do that.
 
-If main loop code such as a QMP function wishes to access a BlockDriverState it
-must first call aio_context_acquire(bdrv_get_aio_context(bs)) to ensure the
-IOThread does not run in parallel.
-
-Long-running jobs (usually in the form of coroutines) are best scheduled in the
-BlockDriverState's AioContext to avoid the need to acquire/release around each
-bdrv_*() call.  Be aware that there is currently no mechanism to get notified
-when bdrv_set_aio_context() moves this BlockDriverState to a different
-AioContext (see bdrv_detach_aio_context()/bdrv_attach_aio_context()), so you
-may need to add this if you want to support long-running jobs.
+If main loop code such as a QMP function wishes to access a BlockDriverState
+it must first call aio_context_acquire(bdrv_get_aio_context(bs)) to ensure
+that callbacks in the IOThread do not run in parallel.
+
+Code running in the monitor typically needs to ensure that past
+requests from the guest are completed.  When a block device is running
+in an IOThread, the IOThread can also process 

[Qemu-block] [PATCH 18/18] aio: convert from RFifoLock to QemuRecMutex

2016-10-13 Thread Paolo Bonzini
It is simpler and a bit faster, and QEMU does not need the contention
callbacks (and thus the fairness) anymore.

Reviewed-by: Fam Zheng 
Signed-off-by: Paolo Bonzini 
---
 async.c  |  8 ++---
 include/block/aio.h  |  3 +-
 include/qemu/rfifolock.h | 54 
 tests/.gitignore |  1 -
 tests/Makefile.include   |  2 --
 tests/test-rfifolock.c   | 91 
 util/Makefile.objs   |  1 -
 util/rfifolock.c | 78 -
 8 files changed, 5 insertions(+), 233 deletions(-)
 delete mode 100644 include/qemu/rfifolock.h
 delete mode 100644 tests/test-rfifolock.c
 delete mode 100644 util/rfifolock.c

diff --git a/async.c b/async.c
index 27db772..b2de360 100644
--- a/async.c
+++ b/async.c
@@ -284,7 +284,7 @@ aio_ctx_finalize(GSource *source)
 
 aio_set_event_notifier(ctx, &ctx->notifier, false, NULL);
 event_notifier_cleanup(&ctx->notifier);
-rfifolock_destroy(&ctx->lock);
+qemu_rec_mutex_destroy(&ctx->lock);
 qemu_mutex_destroy(&ctx->bh_lock);
 timerlistgroup_deinit(&ctx->tlg);
 }
@@ -372,7 +372,7 @@ AioContext *aio_context_new(Error **errp)
 #endif
 ctx->thread_pool = NULL;
 qemu_mutex_init(&ctx->bh_lock);
-rfifolock_init(&ctx->lock, NULL, NULL);
+qemu_rec_mutex_init(&ctx->lock);
 timerlistgroup_init(&ctx->tlg, aio_timerlist_notify, ctx);
 
 return ctx;
@@ -393,10 +393,10 @@ void aio_context_unref(AioContext *ctx)
 
 void aio_context_acquire(AioContext *ctx)
 {
-rfifolock_lock(&ctx->lock);
+qemu_rec_mutex_lock(&ctx->lock);
 }
 
 void aio_context_release(AioContext *ctx)
 {
-rfifolock_unlock(&ctx->lock);
+qemu_rec_mutex_unlock(&ctx->lock);
 }
diff --git a/include/block/aio.h b/include/block/aio.h
index 5714aba..758406a 100644
--- a/include/block/aio.h
+++ b/include/block/aio.h
@@ -18,7 +18,6 @@
 #include "qemu/queue.h"
 #include "qemu/event_notifier.h"
 #include "qemu/thread.h"
-#include "qemu/rfifolock.h"
 #include "qemu/timer.h"
 
 typedef struct BlockAIOCB BlockAIOCB;
@@ -54,7 +53,7 @@ struct AioContext {
 GSource source;
 
 /* Protects all fields from multi-threaded access */
-RFifoLock lock;
+QemuRecMutex lock;
 
 /* The list of registered AIO handlers */
 QLIST_HEAD(, AioHandler) aio_handlers;
diff --git a/include/qemu/rfifolock.h b/include/qemu/rfifolock.h
deleted file mode 100644
index b23ab53..000
--- a/include/qemu/rfifolock.h
+++ /dev/null
@@ -1,54 +0,0 @@
-/*
- * Recursive FIFO lock
- *
- * Copyright Red Hat, Inc. 2013
- *
- * Authors:
- *  Stefan Hajnoczi   
- *
- * This work is licensed under the terms of the GNU GPL, version 2 or later.
- * See the COPYING file in the top-level directory.
- *
- */
-
-#ifndef QEMU_RFIFOLOCK_H
-#define QEMU_RFIFOLOCK_H
-
-#include "qemu/thread.h"
-
-/* Recursive FIFO lock
- *
- * This lock provides more features than a plain mutex:
- *
- * 1. Fairness - enforces FIFO order.
- * 2. Nesting - can be taken recursively.
- * 3. Contention callback - optional, called when thread must wait.
- *
- * The recursive FIFO lock is heavyweight so prefer other synchronization
- * primitives if you do not need its features.
- */
-typedef struct {
-QemuMutex lock; /* protects all fields */
-
-/* FIFO order */
-unsigned int head;  /* active ticket number */
-unsigned int tail;  /* waiting ticket number */
-QemuCond cond;  /* used to wait for our ticket number */
-
-/* Nesting */
-QemuThread owner_thread;/* thread that currently has ownership */
-unsigned int nesting;   /* amount of nesting levels */
-
-/* Contention callback */
-void (*cb)(void *); /* called when thread must wait, with ->lock
- * held so it may not recursively lock/unlock
- */
-void *cb_opaque;
-} RFifoLock;
-
-void rfifolock_init(RFifoLock *r, void (*cb)(void *), void *opaque);
-void rfifolock_destroy(RFifoLock *r);
-void rfifolock_lock(RFifoLock *r);
-void rfifolock_unlock(RFifoLock *r);
-
-#endif /* QEMU_RFIFOLOCK_H */
diff --git a/tests/.gitignore b/tests/.gitignore
index 9f3d2ee..2cd897e 100644
--- a/tests/.gitignore
+++ b/tests/.gitignore
@@ -66,7 +66,6 @@ test-qmp-marshal.c
 test-qmp-output-visitor
 test-rcu-list
 test-replication
-test-rfifolock
 test-string-input-visitor
 test-string-output-visitor
 test-thread-pool
diff --git a/tests/Makefile.include b/tests/Makefile.include
index a7c..26e7e90 100644
--- a/tests/Makefile.include
+++ b/tests/Makefile.include
@@ -43,7 +43,6 @@ check-unit-y += tests/test-visitor-serialization$(EXESUF)
 check-unit-y += tests/test-iov$(EXESUF)
 gcov-files-test-iov-y = util/iov.c
 check-unit-y += tests/test-aio$(EXESUF)
-check-unit-$(CONFIG_POSIX) += tests/test-rfifolock$(EXESUF)
 check-unit-y += tests/test-throttle$(EXESUF)
 gcov-files-test-aio-$(CONFIG_WIN32) = aio-win32.c
 

[Qemu-block] [PATCH 00/18] dataplane: remove RFifoLock (including almost all previously sent patches)

2016-10-13 Thread Paolo Bonzini
This patch reorganizes aio_poll callers to establish new rules for
dataplane locking.  The idea is that I/O operations on a dataplane
BDS (i.e. one where the AioContext is not the main one) do not call
aio_poll anymore.  Instead, they wait for the operation to end in the
other I/O thread, at which point the other I/O thread calls bdrv_wakeup
to wake up the main thread.

With this change, only one thread runs aio_poll for an AioContext.
While aio_context_acquire/release is still needed to protect the BDSes,
it need not interrupt the other thread's event loop anymore, and therefore
it does not need contention callbacks anymore.  Thus the patch can remove
RFifoLock.  This fixes possible hangs in bdrv_drain_all, reproducible (for
example) by unplugging a virtio-scsi-dataplane device while there is I/O
going on for a virtio-blk-dataplane on the same I/O thread.

Patch 1 is a bugfix that I already posted.

Patch 2 makes blockjobs independent of aio_poll, the reason for which
should be apparent from the explanation above.

Patch 3 is an independent mirror bugfix, that I wanted to submit separately
but happens to fix a hang in COLO replication.  Like patch 1 I believe
it's pre-existing and merely exposed by these patches.

Patches 4 to 10 introduce the infrastructure to wake up the main thread
while bdrv_drain or other synchronous operations are running.  Patches 11
to 14 do other changes to prepare for this.  Notably bdrv_drain_all
needs to be called without holding any AioContext lock.

Patch 15 then does the big change, after which there are just some
cleanups left to do.

Paolo

Fam Zheng (1):
  qed: Implement .bdrv_drain

Paolo Bonzini (17):
  replication: interrupt failover if the main device is closed
  blockjob: introduce .drain callback for jobs
  mirror: use bdrv_drained_begin/bdrv_drained_end
  block: add BDS field to count in-flight requests
  block: change drain to look only at one child at a time
  block: introduce bdrv_poll_while and bdrv_wakeup
  nfs: move nfs_set_events out of the while loops
  nfs: use bdrv_poll_while and bdrv_wakeup
  sheepdog: use bdrv_poll_while and bdrv_wakeup
  aio: introduce qemu_get_current_aio_context
  iothread: detach all block devices before stopping them
  replication: pass BlockDriverState to reopen_backing_file
  block: prepare bdrv_reopen_multiple to release AioContext
  block: only call aio_poll on the current thread's AioContext
  iothread: release AioContext around aio_poll
  qemu-thread: introduce QemuRecMutex
  aio: convert from RFifoLock to QemuRecMutex

 async.c |  29 +++--
 block.c |   6 +-
 block/backup.c  |  16 +
 block/block-backend.c   |  30 ++---
 block/commit.c  |   2 +-
 block/io.c  | 148 +---
 block/mirror.c  |  69 +++--
 block/nfs.c |  55 +---
 block/qed-table.c   |  16 ++---
 block/qed.c |  20 +-
 block/replication.c |  27 +---
 block/sheepdog.c|  67 +++-
 blockjob.c  |  37 ++-
 docs/multiple-iothreads.txt |  40 +++-
 include/block/aio.h |  21 +--
 include/block/block.h   |  29 -
 include/block/block_int.h   |  12 ++--
 include/block/blockjob.h|   7 +++
 include/qemu/rfifolock.h|  54 
 include/qemu/thread-posix.h |   6 ++
 include/qemu/thread-win32.h |  10 +++
 include/qemu/thread.h   |   3 +
 iothread.c  |  30 ++---
 qemu-io-cmds.c  |   2 +-
 stubs/Makefile.objs |   1 +
 stubs/iothread.c|   8 +++
 tests/.gitignore|   1 -
 tests/Makefile.include  |   2 -
 tests/test-aio.c|  22 ---
 tests/test-rfifolock.c  |  91 ---
 util/Makefile.objs  |   1 -
 util/qemu-thread-posix.c|  14 +
 util/qemu-thread-win32.c|  25 
 util/rfifolock.c|  78 ---
 34 files changed, 495 insertions(+), 484 deletions(-)
 delete mode 100644 include/qemu/rfifolock.h
 create mode 100644 stubs/iothread.c
 delete mode 100644 tests/test-rfifolock.c
 delete mode 100644 util/rfifolock.c

-- 
2.7.4




[Qemu-block] [PATCH 08/18] nfs: move nfs_set_events out of the while loops

2016-10-13 Thread Paolo Bonzini
nfs_set_events only needs to be called once before entering the
while loop; afterwards, nfs_process_read and nfs_process_write
take care of it.

Signed-off-by: Paolo Bonzini 
---
 block/nfs.c | 8 
 1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/block/nfs.c b/block/nfs.c
index c3db2ec..c8df8d8 100644
--- a/block/nfs.c
+++ b/block/nfs.c
@@ -149,8 +149,8 @@ static int coroutine_fn nfs_co_readv(BlockDriverState *bs,
 return -ENOMEM;
 }
 
+nfs_set_events(client);
 while (!task.complete) {
-nfs_set_events(client);
 qemu_coroutine_yield();
 }
 
@@ -191,8 +191,8 @@ static int coroutine_fn nfs_co_writev(BlockDriverState *bs,
 return -ENOMEM;
 }
 
+nfs_set_events(client);
 while (!task.complete) {
-nfs_set_events(client);
 qemu_coroutine_yield();
 }
 
@@ -217,8 +217,8 @@ static int coroutine_fn nfs_co_flush(BlockDriverState *bs)
 return -ENOMEM;
 }
 
+nfs_set_events(client);
 while (!task.complete) {
-nfs_set_events(client);
 qemu_coroutine_yield();
 }
 
@@ -513,8 +513,8 @@ static int64_t nfs_get_allocated_file_size(BlockDriverState *bs)
 return -ENOMEM;
 }
 
+nfs_set_events(client);
 while (!task.complete) {
-nfs_set_events(client);
 aio_poll(client->aio_context, true);
 }
 
-- 
2.7.4





[Qemu-block] [PATCH 07/18] block: introduce bdrv_poll_while and bdrv_wakeup

2016-10-13 Thread Paolo Bonzini
We want the BDS event loop to run exclusively in the iothread that
owns the BDS's AioContext.  This function and macro provide the
synchronization between the two event loops.

Signed-off-by: Paolo Bonzini 
---
 block/block-backend.c |  7 +--
 block/io.c| 47 +++
 block/qed-table.c | 16 
 block/qed.c   |  4 +++-
 include/block/block.h |  9 +
 include/block/block_int.h |  1 +
 6 files changed, 29 insertions(+), 55 deletions(-)

diff --git a/block/block-backend.c b/block/block-backend.c
index 234df1e..c5c2597 100644
--- a/block/block-backend.c
+++ b/block/block-backend.c
@@ -878,7 +878,6 @@ static int blk_prw(BlockBackend *blk, int64_t offset, uint8_t *buf,
int64_t bytes, CoroutineEntry co_entry,
BdrvRequestFlags flags)
 {
-AioContext *aio_context;
 QEMUIOVector qiov;
 struct iovec iov;
 Coroutine *co;
@@ -900,11 +899,7 @@ static int blk_prw(BlockBackend *blk, int64_t offset, uint8_t *buf,
 
co = qemu_coroutine_create(co_entry, &rwco);
 qemu_coroutine_enter(co);
-
-aio_context = blk_get_aio_context(blk);
-while (rwco.ret == NOT_DONE) {
-aio_poll(aio_context, true);
-}
+bdrv_poll_while(blk_bs(blk), rwco.ret == NOT_DONE);
 
 return rwco.ret;
 }
diff --git a/block/io.c b/block/io.c
index afec968..7d3dcfc 100644
--- a/block/io.c
+++ b/block/io.c
@@ -156,23 +156,12 @@ bool bdrv_requests_pending(BlockDriverState *bs)
 return false;
 }
 
-static bool bdrv_drain_poll(BlockDriverState *bs)
-{
-bool waited = false;
-
-while (atomic_read(&bs->in_flight) > 0) {
-aio_poll(bdrv_get_aio_context(bs), true);
-waited = true;
-}
-return waited;
-}
-
 static bool bdrv_drain_recurse(BlockDriverState *bs)
 {
 BdrvChild *child;
 bool waited;
 
-waited = bdrv_drain_poll(bs);
+waited = bdrv_poll_while(bs, atomic_read(&bs->in_flight) > 0);
 
 if (bs->drv && bs->drv->bdrv_drain) {
 bs->drv->bdrv_drain(bs);
@@ -485,9 +474,14 @@ void bdrv_inc_in_flight(BlockDriverState *bs)
atomic_inc(&bs->in_flight);
 }
 
+void bdrv_wakeup(BlockDriverState *bs)
+{
+}
+
 void bdrv_dec_in_flight(BlockDriverState *bs)
 {
atomic_dec(&bs->in_flight);
+bdrv_wakeup(bs);
 }
 
 static bool coroutine_fn wait_serialising_requests(BdrvTrackedRequest *self)
@@ -597,13 +591,9 @@ static int bdrv_prwv_co(BdrvChild *child, int64_t offset,
 /* Fast-path if already in coroutine context */
bdrv_rw_co_entry(&rwco);
 } else {
-AioContext *aio_context = bdrv_get_aio_context(child->bs);
-
co = qemu_coroutine_create(bdrv_rw_co_entry, &rwco);
 qemu_coroutine_enter(co);
-while (rwco.ret == NOT_DONE) {
-aio_poll(aio_context, true);
-}
+bdrv_poll_while(child->bs, rwco.ret == NOT_DONE);
 }
 return rwco.ret;
 }
@@ -1845,14 +1835,10 @@ int64_t bdrv_get_block_status_above(BlockDriverState *bs,
 /* Fast-path if already in coroutine context */
bdrv_get_block_status_above_co_entry(&data);
 } else {
-AioContext *aio_context = bdrv_get_aio_context(bs);
-
 co = qemu_coroutine_create(bdrv_get_block_status_above_co_entry,
&data);
 qemu_coroutine_enter(co);
-while (!data.done) {
-aio_poll(aio_context, true);
-}
+bdrv_poll_while(bs, !data.done);
 }
 return data.ret;
 }
@@ -2411,13 +2397,9 @@ int bdrv_flush(BlockDriverState *bs)
 /* Fast-path if already in coroutine context */
bdrv_flush_co_entry(&flush_co);
 } else {
-AioContext *aio_context = bdrv_get_aio_context(bs);
-
co = qemu_coroutine_create(bdrv_flush_co_entry, &flush_co);
 qemu_coroutine_enter(co);
-while (flush_co.ret == NOT_DONE) {
-aio_poll(aio_context, true);
-}
+bdrv_poll_while(bs, flush_co.ret == NOT_DONE);
 }
 
 return flush_co.ret;
@@ -2543,13 +2525,9 @@ int bdrv_pdiscard(BlockDriverState *bs, int64_t offset, int count)
 /* Fast-path if already in coroutine context */
bdrv_pdiscard_co_entry(&rwco);
 } else {
-AioContext *aio_context = bdrv_get_aio_context(bs);
-
co = qemu_coroutine_create(bdrv_pdiscard_co_entry, &rwco);
 qemu_coroutine_enter(co);
-while (rwco.ret == NOT_DONE) {
-aio_poll(aio_context, true);
-}
+bdrv_poll_while(bs, rwco.ret == NOT_DONE);
 }
 
 return rwco.ret;
@@ -2608,11 +2586,8 @@ int bdrv_ioctl(BlockDriverState *bs, unsigned long int req, void *buf)
bdrv_co_ioctl_entry(&data);
 } else {
Coroutine *co = qemu_coroutine_create(bdrv_co_ioctl_entry, &data);
-
 qemu_coroutine_enter(co);
-while (data.ret == -EINPROGRESS) {
-aio_poll(bdrv_get_aio_context(bs), true);
-}
+bdrv_poll_while(bs, data.ret == -EINPROGRESS);
 }
 

[Qemu-block] [PATCH 05/18] block: change drain to look only at one child at a time

2016-10-13 Thread Paolo Bonzini
bdrv_requests_pending is checking children to also wait until internal
requests (such as metadata writes) have completed.  However, checking
children is in general overkill.  Children requests can be of two kinds:

- requests caused by an operation on bs, e.g. a bdrv_aio_write to bs
causing a write to bs->file->bs.  In this case, the parent's in_flight
count will always be incremented by at least one for every request in
the child.

- asynchronous metadata writes or flushes.  Such writes can be started
even if bs's in_flight count is zero, but not after the .bdrv_drain
callback has been invoked.

This patch therefore changes bdrv_drain to finish I/O in the parent
(after which the parent's in_flight will be locked to zero), invoke
the driver's .bdrv_drain callback (after which the parent will not
generate I/O on the child anymore), and then wait for internal I/O in
the children to complete.

Signed-off-by: Paolo Bonzini 
---
 block/io.c | 47 ++-
 1 file changed, 26 insertions(+), 21 deletions(-)

diff --git a/block/io.c b/block/io.c
index 8d46d8b..afec968 100644
--- a/block/io.c
+++ b/block/io.c
@@ -156,16 +156,33 @@ bool bdrv_requests_pending(BlockDriverState *bs)
 return false;
 }
 
-static void bdrv_drain_recurse(BlockDriverState *bs)
+static bool bdrv_drain_poll(BlockDriverState *bs)
+{
+bool waited = false;
+
+while (atomic_read(&bs->in_flight) > 0) {
+aio_poll(bdrv_get_aio_context(bs), true);
+waited = true;
+}
+return waited;
+}
+
+static bool bdrv_drain_recurse(BlockDriverState *bs)
 {
 BdrvChild *child;
+bool waited;
+
+waited = bdrv_drain_poll(bs);
 
 if (bs->drv && bs->drv->bdrv_drain) {
 bs->drv->bdrv_drain(bs);
 }
+
QLIST_FOREACH(child, &bs->children, next) {
-bdrv_drain_recurse(child->bs);
+waited |= bdrv_drain_recurse(child->bs);
 }
+
+return waited;
 }
 
 typedef struct {
@@ -174,14 +191,6 @@ typedef struct {
 bool done;
 } BdrvCoDrainData;
 
-static void bdrv_drain_poll(BlockDriverState *bs)
-{
-while (bdrv_requests_pending(bs)) {
-/* Keep iterating */
-aio_poll(bdrv_get_aio_context(bs), true);
-}
-}
-
 static void bdrv_co_drain_bh_cb(void *opaque)
 {
 BdrvCoDrainData *data = opaque;
@@ -189,7 +198,7 @@ static void bdrv_co_drain_bh_cb(void *opaque)
 BlockDriverState *bs = data->bs;
 
 bdrv_dec_in_flight(bs);
-bdrv_drain_poll(bs);
+bdrv_drained_begin(bs);
 data->done = true;
 qemu_coroutine_enter(co);
 }
@@ -220,6 +229,11 @@ static void coroutine_fn bdrv_co_yield_to_drain(BlockDriverState *bs)
 
 void bdrv_drained_begin(BlockDriverState *bs)
 {
+if (qemu_in_coroutine()) {
+bdrv_co_yield_to_drain(bs);
+return;
+}
+
 if (!bs->quiesce_counter++) {
 aio_disable_external(bdrv_get_aio_context(bs));
 bdrv_parent_drained_begin(bs);
@@ -227,11 +241,6 @@ void bdrv_drained_begin(BlockDriverState *bs)
 
 bdrv_io_unplugged_begin(bs);
 bdrv_drain_recurse(bs);
-if (qemu_in_coroutine()) {
-bdrv_co_yield_to_drain(bs);
-} else {
-bdrv_drain_poll(bs);
-}
 bdrv_io_unplugged_end(bs);
 }
 
@@ -299,7 +308,6 @@ void bdrv_drain_all(void)
 aio_context_acquire(aio_context);
 bdrv_parent_drained_begin(bs);
 bdrv_io_unplugged_begin(bs);
-bdrv_drain_recurse(bs);
 aio_context_release(aio_context);
 
 if (!g_slist_find(aio_ctxs, aio_context)) {
@@ -322,10 +330,7 @@ void bdrv_drain_all(void)
 aio_context_acquire(aio_context);
 for (bs = bdrv_first(); bs; bs = bdrv_next()) {
 if (aio_context == bdrv_get_aio_context(bs)) {
-if (bdrv_requests_pending(bs)) {
-aio_poll(aio_context, true);
-waited = true;
-}
+waited |= bdrv_drain_recurse(bs);
 }
 }
 aio_context_release(aio_context);
-- 
2.7.4





[Qemu-block] [PATCH 15/18] block: only call aio_poll on the current thread's AioContext

2016-10-13 Thread Paolo Bonzini
aio_poll is not thread safe; for example bdrv_drain can hang if
the last in-flight I/O operation is completed in the I/O thread after
the main thread has checked bs->in_flight.

The bug remains latent as long as all aio_poll calls are made within
aio_context_acquire/aio_context_release, but this will change soon.

To fix this, if bdrv_drain is called from outside the I/O thread,
signal the main AioContext through a dummy bottom half.  The event
loop then only runs in the I/O thread.

Reviewed-by: Fam Zheng 
Signed-off-by: Paolo Bonzini 
---
 async.c   |  1 +
 block.c   |  2 ++
 block/io.c|  7 +++
 include/block/block.h | 24 +---
 include/block/block_int.h |  1 +
 5 files changed, 32 insertions(+), 3 deletions(-)

diff --git a/async.c b/async.c
index f30d011..fb37b03 100644
--- a/async.c
+++ b/async.c
@@ -61,6 +61,7 @@ void aio_bh_schedule_oneshot(AioContext *ctx, QEMUBHFunc *cb, 
void *opaque)
 smp_wmb();
 ctx->first_bh = bh;
qemu_mutex_unlock(&ctx->bh_lock);
+aio_notify(ctx);
 }
 
 QEMUBH *aio_bh_new(AioContext *ctx, QEMUBHFunc *cb, void *opaque)
diff --git a/block.c b/block.c
index fbe485c..a17baab 100644
--- a/block.c
+++ b/block.c
@@ -2090,7 +2090,9 @@ int bdrv_reopen_multiple(AioContext *ctx, BlockReopenQueue *bs_queue, Error **errp)
 
 assert(bs_queue != NULL);
 
+aio_context_release(ctx);
 bdrv_drain_all();
+aio_context_acquire(ctx);
 
 QSIMPLEQ_FOREACH(bs_entry, bs_queue, entry) {
if (bdrv_reopen_prepare(&bs_entry->state, bs_queue, &local_err)) {
diff --git a/block/io.c b/block/io.c
index 7d3dcfc..cd28909 100644
--- a/block/io.c
+++ b/block/io.c
@@ -474,8 +474,15 @@ void bdrv_inc_in_flight(BlockDriverState *bs)
atomic_inc(&bs->in_flight);
 }
 
+static void dummy_bh_cb(void *opaque)
+{
+}
+
 void bdrv_wakeup(BlockDriverState *bs)
 {
+if (bs->wakeup) {
+aio_bh_schedule_oneshot(qemu_get_aio_context(), dummy_bh_cb, NULL);
+}
 }
 
 void bdrv_dec_in_flight(BlockDriverState *bs)
diff --git a/include/block/block.h b/include/block/block.h
index ba4318b..72d5d8e 100644
--- a/include/block/block.h
+++ b/include/block/block.h
@@ -343,9 +343,27 @@ void bdrv_drain_all(void);
 #define bdrv_poll_while(bs, cond) ({   \
 bool waited_ = false;  \
 BlockDriverState *bs_ = (bs);  \
-while ((cond)) {   \
-aio_poll(bdrv_get_aio_context(bs_), true); \
-waited_ = true;\
+AioContext *ctx_ = bdrv_get_aio_context(bs_);  \
+if (aio_context_in_iothread(ctx_)) {   \
+while ((cond)) {   \
+aio_poll(ctx_, true);  \
+waited_ = true;\
+}  \
+} else {   \
+assert(qemu_get_current_aio_context() ==   \
+   qemu_get_aio_context());\
+/* Ask bdrv_dec_in_flight to wake up the main  \
+ * QEMU AioContext.\
+ */\
+assert(!bs_->wakeup);  \
+bs_->wakeup = true;\
+while ((cond)) {   \
+aio_context_release(ctx_); \
+aio_poll(qemu_get_aio_context(), true);\
+aio_context_acquire(ctx_); \
+waited_ = true;\
+}  \
+bs_->wakeup = false;   \
 }  \
 waited_; })
 
diff --git a/include/block/block_int.h b/include/block/block_int.h
index 11f877b..0516f62 100644
--- a/include/block/block_int.h
+++ b/include/block/block_int.h
@@ -470,6 +470,7 @@ struct BlockDriverState {
 NotifierWithReturnList before_write_notifiers;
 
 /* number of in-flight requests; overall and serialising */
+bool wakeup;
 unsigned int in_flight;
 unsigned int serialising_in_flight;
 
-- 
2.7.4





[Qemu-block] [PATCH 04/18] block: add BDS field to count in-flight requests

2016-10-13 Thread Paolo Bonzini
Unlike tracked_requests, this field also counts throttled requests,
and remains non-zero if an AIO operation needs a BH to be "really"
completed.

With this change, it is no longer necessary to have a dummy
BdrvTrackedRequest for requests that are never serialising, and
it is no longer necessary to poll the AioContext once after
bdrv_requests_pending(bs) returns false.

Signed-off-by: Paolo Bonzini 
---
 block/block-backend.c | 23 +++---
 block/io.c| 81 ---
 include/block/block_int.h | 10 +++---
 3 files changed, 81 insertions(+), 33 deletions(-)

diff --git a/block/block-backend.c b/block/block-backend.c
index 1a724a8..234df1e 100644
--- a/block/block-backend.c
+++ b/block/block-backend.c
@@ -799,20 +799,25 @@ int coroutine_fn blk_co_preadv(BlockBackend *blk, int64_t offset,
BdrvRequestFlags flags)
 {
 int ret;
+BlockDriverState *bs = blk_bs(blk);
 
-trace_blk_co_preadv(blk, blk_bs(blk), offset, bytes, flags);
+trace_blk_co_preadv(blk, bs, offset, bytes, flags);
 
 ret = blk_check_byte_request(blk, offset, bytes);
 if (ret < 0) {
 return ret;
 }
 
+bdrv_inc_in_flight(bs);
+
 /* throttling disk I/O */
 if (blk->public.throttle_state) {
 throttle_group_co_io_limits_intercept(blk, bytes, false);
 }
 
-return bdrv_co_preadv(blk->root, offset, bytes, qiov, flags);
+ret = bdrv_co_preadv(blk->root, offset, bytes, qiov, flags);
+bdrv_dec_in_flight(bs);
+return ret;
 }
 
 int coroutine_fn blk_co_pwritev(BlockBackend *blk, int64_t offset,
@@ -820,14 +825,17 @@ int coroutine_fn blk_co_pwritev(BlockBackend *blk, int64_t offset,
 BdrvRequestFlags flags)
 {
 int ret;
+BlockDriverState *bs = blk_bs(blk);
 
-trace_blk_co_pwritev(blk, blk_bs(blk), offset, bytes, flags);
+trace_blk_co_pwritev(blk, bs, offset, bytes, flags);
 
 ret = blk_check_byte_request(blk, offset, bytes);
 if (ret < 0) {
 return ret;
 }
 
+bdrv_inc_in_flight(bs);
+
 /* throttling disk I/O */
 if (blk->public.throttle_state) {
 throttle_group_co_io_limits_intercept(blk, bytes, true);
@@ -837,7 +845,9 @@ int coroutine_fn blk_co_pwritev(BlockBackend *blk, int64_t offset,
 flags |= BDRV_REQ_FUA;
 }
 
-return bdrv_co_pwritev(blk->root, offset, bytes, qiov, flags);
+ret = bdrv_co_pwritev(blk->root, offset, bytes, qiov, flags);
+bdrv_dec_in_flight(bs);
+return ret;
 }
 
 typedef struct BlkRwCo {
@@ -930,6 +940,8 @@ int blk_make_zero(BlockBackend *blk, BdrvRequestFlags flags)
 static void error_callback_bh(void *opaque)
 {
 struct BlockBackendAIOCB *acb = opaque;
+
+bdrv_dec_in_flight(acb->common.bs);
 acb->common.cb(acb->common.opaque, acb->ret);
 qemu_aio_unref(acb);
 }
@@ -940,6 +952,7 @@ BlockAIOCB *blk_abort_aio_request(BlockBackend *blk,
 {
 struct BlockBackendAIOCB *acb;
 
+bdrv_inc_in_flight(blk_bs(blk));
acb = blk_aio_get(&block_backend_aiocb_info, blk, cb, opaque);
 acb->blk = blk;
 acb->ret = ret;
@@ -962,6 +975,7 @@ static const AIOCBInfo blk_aio_em_aiocb_info = {
 static void blk_aio_complete(BlkAioEmAIOCB *acb)
 {
 if (acb->has_returned) {
+bdrv_dec_in_flight(acb->common.bs);
 acb->common.cb(acb->common.opaque, acb->rwco.ret);
 qemu_aio_unref(acb);
 }
@@ -983,6 +997,7 @@ static BlockAIOCB *blk_aio_prwv(BlockBackend *blk, int64_t 
offset, int bytes,
 BlkAioEmAIOCB *acb;
 Coroutine *co;
 
+bdrv_inc_in_flight(blk_bs(blk));
acb = blk_aio_get(&blk_aio_em_aiocb_info, blk, cb, opaque);
 acb->rwco = (BlkRwCo) {
 .blk= blk,
diff --git a/block/io.c b/block/io.c
index b136c89..8d46d8b 100644
--- a/block/io.c
+++ b/block/io.c
@@ -143,7 +143,7 @@ bool bdrv_requests_pending(BlockDriverState *bs)
 {
 BdrvChild *child;
 
-if (!QLIST_EMPTY(&bs->tracked_requests)) {
+if (atomic_read(&bs->in_flight)) {
 return true;
 }
 
@@ -176,12 +176,9 @@ typedef struct {
 
 static void bdrv_drain_poll(BlockDriverState *bs)
 {
-bool busy = true;
-
-while (busy) {
+while (bdrv_requests_pending(bs)) {
 /* Keep iterating */
-busy = bdrv_requests_pending(bs);
-busy |= aio_poll(bdrv_get_aio_context(bs), busy);
+aio_poll(bdrv_get_aio_context(bs), true);
 }
 }
 
@@ -189,8 +186,10 @@ static void bdrv_co_drain_bh_cb(void *opaque)
 {
 BdrvCoDrainData *data = opaque;
 Coroutine *co = data->co;
+BlockDriverState *bs = data->bs;
 
-bdrv_drain_poll(data->bs);
+bdrv_dec_in_flight(bs);
+bdrv_drain_poll(bs);
 data->done = true;
 qemu_coroutine_enter(co);
 }
@@ -209,6 +208,7 @@ static void coroutine_fn bdrv_co_yield_to_drain(BlockDriverState *bs)
 .bs = bs,
 .done = false,
 };
+bdrv_inc_in_flight(bs);
 aio_bh_schedule_oneshot(bdrv_get_aio_context(bs),
  

[Qemu-block] [PATCH 09/18] nfs: use bdrv_poll_while and bdrv_wakeup

2016-10-13 Thread Paolo Bonzini
These will make it possible to use nfs_get_allocated_file_size on
a file that is not in the main AioContext.

Signed-off-by: Paolo Bonzini 
---
 block/nfs.c | 47 +--
 1 file changed, 29 insertions(+), 18 deletions(-)

diff --git a/block/nfs.c b/block/nfs.c
index c8df8d8..e5c7e1a 100644
--- a/block/nfs.c
+++ b/block/nfs.c
@@ -52,6 +52,7 @@ typedef struct NFSClient {
 } NFSClient;
 
 typedef struct NFSRPC {
+BlockDriverState *bs;
 int ret;
 int complete;
 QEMUIOVector *iov;
@@ -90,11 +91,12 @@ static void nfs_process_write(void *arg)
 nfs_set_events(client);
 }
 
-static void nfs_co_init_task(NFSClient *client, NFSRPC *task)
+static void nfs_co_init_task(BlockDriverState *bs, NFSRPC *task)
 {
 *task = (NFSRPC) {
 .co = qemu_coroutine_self(),
-.client = client,
+.bs = bs,
+.client = bs->opaque,
 };
 }
 
@@ -111,6 +113,7 @@ nfs_co_generic_cb(int ret, struct nfs_context *nfs, void *data,
 {
 NFSRPC *task = private_data;
 task->ret = ret;
+assert(!task->st);
 if (task->ret > 0 && task->iov) {
 if (task->ret <= task->iov->size) {
 qemu_iovec_from_buf(task->iov, 0, data, task->ret);
@@ -118,18 +121,11 @@ nfs_co_generic_cb(int ret, struct nfs_context *nfs, void *data,
 task->ret = -EIO;
 }
 }
-if (task->ret == 0 && task->st) {
-memcpy(task->st, data, sizeof(struct stat));
-}
 if (task->ret < 0) {
 error_report("NFS Error: %s", nfs_get_error(nfs));
 }
-if (task->co) {
-aio_bh_schedule_oneshot(task->client->aio_context,
-nfs_co_generic_bh_cb, task);
-} else {
-task->complete = 1;
-}
+aio_bh_schedule_oneshot(task->client->aio_context,
+nfs_co_generic_bh_cb, task);
 }
 
 static int coroutine_fn nfs_co_readv(BlockDriverState *bs,
@@ -139,7 +135,7 @@ static int coroutine_fn nfs_co_readv(BlockDriverState *bs,
 NFSClient *client = bs->opaque;
 NFSRPC task;
 
-nfs_co_init_task(client, &task);
+nfs_co_init_task(bs, &task);
 task.iov = iov;
 
 if (nfs_pread_async(client->context, client->fh,
@@ -174,7 +170,7 @@ static int coroutine_fn nfs_co_writev(BlockDriverState *bs,
 NFSRPC task;
 char *buf = NULL;
 
-nfs_co_init_task(client, &task);
+nfs_co_init_task(bs, &task);
 
 buf = g_try_malloc(nb_sectors * BDRV_SECTOR_SIZE);
 if (nb_sectors && buf == NULL) {
@@ -210,7 +206,7 @@ static int coroutine_fn nfs_co_flush(BlockDriverState *bs)
 NFSClient *client = bs->opaque;
 NFSRPC task;
 
-nfs_co_init_task(client, &task);
+nfs_co_init_task(bs, &task);
 
 if (nfs_fsync_async(client->context, client->fh, nfs_co_generic_cb,
&task) != 0) {
@@ -496,6 +492,22 @@ static int nfs_has_zero_init(BlockDriverState *bs)
 return client->has_zero_init;
 }
 
+static void
+nfs_get_allocated_file_size_cb(int ret, struct nfs_context *nfs, void *data,
+   void *private_data)
+{
+NFSRPC *task = private_data;
+task->ret = ret;
+if (task->ret == 0) {
+memcpy(task->st, data, sizeof(struct stat));
+}
+if (task->ret < 0) {
+error_report("NFS Error: %s", nfs_get_error(nfs));
+}
+task->complete = 1;
+bdrv_wakeup(task->bs);
+}
+
 static int64_t nfs_get_allocated_file_size(BlockDriverState *bs)
 {
 NFSClient *client = bs->opaque;
@@ -507,16 +519,15 @@ static int64_t nfs_get_allocated_file_size(BlockDriverState *bs)
 return client->st_blocks * 512;
 }
 
+task.bs = bs;
task.st = &st;
-if (nfs_fstat_async(client->context, client->fh, nfs_co_generic_cb,
+if (nfs_fstat_async(client->context, client->fh, nfs_get_allocated_file_size_cb,
&task) != 0) {
 return -ENOMEM;
 }
 
 nfs_set_events(client);
-while (!task.complete) {
-aio_poll(client->aio_context, true);
-}
+bdrv_poll_while(bs, !task.complete);
 
 return (task.ret < 0 ? task.ret : st.st_blocks * 512);
 }
-- 
2.7.4





[Qemu-block] [PATCH 03/18] mirror: use bdrv_drained_begin/bdrv_drained_end

2016-10-13 Thread Paolo Bonzini
Ensure that there are no changes between the last check to
bdrv_get_dirty_count and the switch to the target.

There is already a bdrv_drained_end call, we only need to ensure
that bdrv_drained_begin is not called twice.

Cc: qemu-sta...@nongnu.org
Signed-off-by: Paolo Bonzini 
---
 block/mirror.c | 35 +++
 1 file changed, 23 insertions(+), 12 deletions(-)

diff --git a/block/mirror.c b/block/mirror.c
index bd1963d..8cd69aa 100644
--- a/block/mirror.c
+++ b/block/mirror.c
@@ -622,6 +622,7 @@ static void coroutine_fn mirror_run(void *opaque)
 MirrorExitData *data;
 BlockDriverState *bs = blk_bs(s->common.blk);
 BlockDriverState *target_bs = blk_bs(s->target);
+bool need_drain = true;
 int64_t length;
 BlockDriverInfo bdi;
 char backing_filename[2]; /* we only need 2 characters because we are only
@@ -756,11 +757,26 @@ static void coroutine_fn mirror_run(void *opaque)
  * source has dirty data to copy!
  *
  * Note that I/O can be submitted by the guest while
- * mirror_populate runs.
+ * mirror_populate runs, so pause it now.  Before deciding
+ * whether to switch to target check one last time if I/O has
+ * come in the meanwhile, and if not flush the data to disk.
  */
 trace_mirror_before_drain(s, cnt);
-bdrv_co_drain(bs);
+
+bdrv_drained_begin(bs);
 cnt = bdrv_get_dirty_count(s->dirty_bitmap);
+if (cnt > 0) {
+bdrv_drained_end(bs);
+continue;
+}
+
+/* The two disks are in sync.  Exit and report successful
+ * completion.
+ */
+assert(QLIST_EMPTY(&bs->tracked_requests));
+s->common.cancelled = false;
+need_drain = false;
+break;
 }
 
 ret = 0;
@@ -773,13 +789,6 @@ static void coroutine_fn mirror_run(void *opaque)
 } else if (!should_complete) {
 delay_ns = (s->in_flight == 0 && cnt == 0 ? SLICE_TIME : 0);
block_job_sleep_ns(&s->common, QEMU_CLOCK_REALTIME, delay_ns);
-} else if (cnt == 0) {
-/* The two disks are in sync.  Exit and report successful
- * completion.
- */
-assert(QLIST_EMPTY(&bs->tracked_requests));
-s->common.cancelled = false;
-break;
 }
 s->last_pause_ns = qemu_clock_get_ns(QEMU_CLOCK_REALTIME);
 }
@@ -791,6 +800,7 @@ immediate_exit:
  * the target is a copy of the source.
  */
assert(ret < 0 || (!s->synced && block_job_is_cancelled(&s->common)));
+assert(need_drain);
 mirror_wait_for_all_io(s);
 }
 
@@ -802,9 +812,10 @@ immediate_exit:
 
 data = g_malloc(sizeof(*data));
 data->ret = ret;
-/* Before we switch to target in mirror_exit, make sure data doesn't
- * change. */
-bdrv_drained_begin(bs);
+
+if (need_drain) {
+bdrv_drained_begin(bs);
+}
block_job_defer_to_main_loop(&s->common, mirror_exit, data);
 }
 
-- 
2.7.4





[Qemu-block] [PATCH 02/18] blockjob: introduce .drain callback for jobs

2016-10-13 Thread Paolo Bonzini
This is required to decouple block jobs from running in an
AioContext.  With multiqueue block devices, a BlockDriverState
does not really belong to a single AioContext.

The solution is to first wait until all I/O operations are
complete; then loop in the main thread for the block job to
complete entirely.

Signed-off-by: Paolo Bonzini 
---
 block/backup.c   | 16 
 block/mirror.c   | 34 ++
 blockjob.c   | 37 -
 include/block/blockjob.h |  7 +++
 4 files changed, 69 insertions(+), 25 deletions(-)

diff --git a/block/backup.c b/block/backup.c
index 582bd0f..0350cfc 100644
--- a/block/backup.c
+++ b/block/backup.c
@@ -300,6 +300,20 @@ void backup_cow_request_end(CowRequest *req)
 cow_request_end(req);
 }
 
+static void backup_drain(BlockJob *job)
+{
+BackupBlockJob *s = container_of(job, BackupBlockJob, common);
+
+/* Need to keep a reference in case blk_drain triggers execution
+ * of backup_complete...
+ */
+if (s->target) {
+blk_ref(s->target);
+blk_drain(s->target);
+blk_unref(s->target);
+}
+}
+
 static const BlockJobDriver backup_job_driver = {
 .instance_size  = sizeof(BackupBlockJob),
 .job_type   = BLOCK_JOB_TYPE_BACKUP,
@@ -307,6 +321,7 @@ static const BlockJobDriver backup_job_driver = {
 .commit = backup_commit,
 .abort  = backup_abort,
 .attached_aio_context   = backup_attached_aio_context,
+.drain  = backup_drain,
 };
 
 static BlockErrorAction backup_error_action(BackupBlockJob *job,
@@ -331,6 +346,7 @@ static void backup_complete(BlockJob *job, void *opaque)
 BackupCompleteData *data = opaque;
 
 blk_unref(s->target);
+s->target = NULL;
 
 block_job_completed(job, data->ret);
 g_free(data);
diff --git a/block/mirror.c b/block/mirror.c
index f9d1fec..bd1963d 100644
--- a/block/mirror.c
+++ b/block/mirror.c
@@ -469,7 +469,11 @@ static void mirror_free_init(MirrorBlockJob *s)
 }
 }
 
-static void mirror_drain(MirrorBlockJob *s)
+/* This is also used for the .pause callback. There is no matching
+ * mirror_resume() because mirror_run() will begin iterating again
+ * when the job is resumed.
+ */
+static void mirror_wait_for_all_io(MirrorBlockJob *s)
 {
 while (s->in_flight > 0) {
 mirror_wait_for_io(s);
@@ -528,6 +532,7 @@ static void mirror_exit(BlockJob *job, void *opaque)
 g_free(s->replaces);
 bdrv_op_unblock_all(target_bs, s->common.blocker);
 blk_unref(s->target);
+s->target = NULL;
block_job_completed(&s->common, data->ret);
 g_free(data);
 bdrv_drained_end(src);
@@ -582,7 +587,7 @@ static int coroutine_fn mirror_dirty_init(MirrorBlockJob *s)
 sector_num += nb_sectors;
 }
 
-mirror_drain(s);
+mirror_wait_for_all_io(s);
 }
 
 /* First part, loop on the sectors and initialize the dirty bitmap.  */
@@ -786,7 +791,7 @@ immediate_exit:
  * the target is a copy of the source.
  */
-assert(ret < 0 || (!s->synced && block_job_is_cancelled(&s->common)));
-mirror_drain(s);
+mirror_wait_for_all_io(s);
 }
 
 assert(s->in_flight == 0);
@@ -870,14 +875,11 @@ static void mirror_complete(BlockJob *job, Error **errp)
block_job_enter(&s->common);
 }
 
-/* There is no matching mirror_resume() because mirror_run() will begin
- * iterating again when the job is resumed.
- */
-static void coroutine_fn mirror_pause(BlockJob *job)
+static void mirror_pause(BlockJob *job)
 {
 MirrorBlockJob *s = container_of(job, MirrorBlockJob, common);
 
-mirror_drain(s);
+mirror_wait_for_all_io(s);
 }
 
 static void mirror_attached_aio_context(BlockJob *job, AioContext *new_context)
@@ -887,6 +889,20 @@ static void mirror_attached_aio_context(BlockJob *job, AioContext *new_context)
 blk_set_aio_context(s->target, new_context);
 }
 
+static void mirror_drain(BlockJob *job)
+{
+MirrorBlockJob *s = container_of(job, MirrorBlockJob, common);
+
+/* Need to keep a reference in case blk_drain triggers execution
+ * of mirror_complete...
+ */
+if (s->target) {
+blk_ref(s->target);
+blk_drain(s->target);
+blk_unref(s->target);
+}
+}
+
 static const BlockJobDriver mirror_job_driver = {
 .instance_size  = sizeof(MirrorBlockJob),
 .job_type   = BLOCK_JOB_TYPE_MIRROR,
@@ -894,6 +910,7 @@ static const BlockJobDriver mirror_job_driver = {
 .complete   = mirror_complete,
 .pause  = mirror_pause,
 .attached_aio_context   = mirror_attached_aio_context,
+.drain  = mirror_drain,
 };
 
 static const BlockJobDriver commit_active_job_driver = {
@@ -903,6 +920,7 @@ static const BlockJobDriver commit_active_job_driver = {
 .complete   = mirror_complete,
 

[Qemu-block] [PATCH 01/18] replication: interrupt failover if the main device is closed

2016-10-13 Thread Paolo Bonzini
Without this change, there is a race condition in tests/test-replication.
Depending on how fast the failover job (active commit) runs, there is a
chance of two bad things happening:

1) replication_done can be called after the secondary has been closed
and hence when the BDRVReplicationState is not valid anymore.

2) two copies of the active disk are present during the
/replication/secondary/stop test (that test runs immediately after
/replication/secondary/start, which tests failover).  This causes the
corruption detector to fire.

Reviewed-by: Wen Congyang 
Reviewed-by: Changlong Xie 
Signed-off-by: Paolo Bonzini 
---
 block/replication.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/block/replication.c b/block/replication.c
index 3bd1cf1..5231a00 100644
--- a/block/replication.c
+++ b/block/replication.c
@@ -133,6 +133,9 @@ static void replication_close(BlockDriverState *bs)
 if (s->replication_state == BLOCK_REPLICATION_RUNNING) {
 replication_stop(s->rs, false, NULL);
 }
+if (s->replication_state == BLOCK_REPLICATION_FAILOVER) {
+block_job_cancel_sync(s->active_disk->bs->job);
+}
 
 if (s->mode == REPLICATION_MODE_SECONDARY) {
 g_free(s->top_id);
-- 
2.7.4





Re: [Qemu-block] [PATCH 11/22] qcow2-bitmap: add qcow2_store_persistent_bitmaps()

2016-10-13 Thread Vladimir Sementsov-Ogievskiy

On 07.10.2016 22:24, Max Reitz wrote:

On 30.09.2016 12:53, Vladimir Sementsov-Ogievskiy wrote:

Realize block bitmap storing interface, to allow qcow2 images store


[snip]


+uint64_t end = MIN(bm_size, sector + dsc);
+uint64_t write_size =
+bdrv_dirty_bitmap_serialization_size(bitmap, sector, end - sector);
+
+int64_t off = qcow2_alloc_clusters(bs, cl_size);
+if (off < 0) {
+ret = off;
+goto finish;
+}
+bitmap_table[cluster] = off;
+
+bdrv_dirty_bitmap_serialize_part(bitmap, buf, sector, end);

s/end/end - sector/?


o_0 terrible mistake, thank you.




+if (write_size < cl_size) {
+memset(buf + write_size, 0, cl_size - write_size);
+}
+

I guess there should be a metadata overlap check here.


What is the general rule for checking it? Should I check it before all my 
extension-related writes?



--
Best regards,
Vladimir




Re: [Qemu-block] [Qemu-devel] [PATCH v14 13/21] qdict: allow qdict_crumple to accept compound types as values

2016-10-13 Thread Kevin Wolf
Am 13.10.2016 um 14:35 hat Markus Armbruster geschrieben:
> Cc: Kevin for discussion of QemuOpts dotted key convention
> 
> "Daniel P. Berrange"  writes:
> 
> > Currently qdict_crumple requires a totally flat QDict as its
> > input. i.e. all values in the QDict must be scalar types.
> >
> > In order to have backwards compatibility with the OptsVisitor,
> > qemu_opt_to_qdict() has a new mode where it may return a QList
> > for values in the QDict, if there was a repeated key. We thus
> > need to allow compound types to appear as values in the input
> > dict given to qdict_crumple().
> >
> > To avoid confusion, we sanity check that the user has not mixed
> > the old and new syntax at the same time. e.g. these are allowed
> >
> >foo=hello,foo=world,foo=wibble
> >foo.0=hello,foo.1=world,foo.2=wibble
> >
> > but this is forbidden
> >
> >foo=hello,foo=world,foo.2=wibble
> 
> I understand the need for foo.bar=val.  It makes it possible to specify
> nested dictionaries with QemuOpts.
> 
> The case for foo.0=val is less clear.  QemuOpts already supports lists,
> by repeating keys.  Why do we need a second, wordier way to specify
> them?

Probably primarily because someone didn't realise this when introducing
the dotted syntax. Also because flat QDicts can't represent this.

But even if I had realised that QemuOpts supports this syntax, I think we
would still have to use the dotted syntax, because it's explicit about
the index, and we need that because the list can contain dicts.

Compare this:

driver=quorum,
child.0.driver=file,child.0.filename=disk1.img,
child.1.driver=host_device,child.1.filename=/dev/sdb,
child.2.driver=nbd,child.2.host=localhost

And this:

driver=quorum,
child.driver=file,child.filename=disk1.img,
child.driver=host_device,child.filename=/dev/sdb,
child.driver=nbd,child.host=localhost

For starters, can we really trust the order in QemuOpts so that the
right driver and filename are associated with each other?

We would also need code to associate the third child.driver with the
first child.host (because file and host_device don't have a host
property). And this isn't even touching optional arguments yet. Would
you really want to implement or review this?
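Kevin's point can be made concrete with a short Python sketch. The keys mirror the quorum example above; the grouping logic is purely illustrative and is not QEMU's actual option parser:

```python
# With explicit indices, grouping child properties is unambiguous:
indexed = {
    'child.0.driver': 'file',        'child.0.filename': 'disk1.img',
    'child.1.driver': 'host_device', 'child.1.filename': '/dev/sdb',
    'child.2.driver': 'nbd',         'child.2.host': 'localhost',
}

children = {}
for key, value in indexed.items():
    _, idx, prop = key.split('.')   # e.g. 'child', '2', 'host'
    children.setdefault(int(idx), {})[prop] = value
```

With the repeated-key form, child.driver occurs three times, child.filename twice and child.host once, so pairing them back up would depend entirely on preserving and trusting the original option order.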

> Note that this second way creates entirely new failure modes and
> restrictions.  Let me show using an example derived from one in
> qdict_crumple()'s contract:
> 
> foo.0.bar=bla,foo.eek.bar=blubb
> 
> Without the dotted key convention, this is perfectly fine: key
> "foo.0.bar" has the single value "bla", and key "foo.eek.bar" has
> the single value "blubb".  Equivalent JSON would be
> 
>   { "foo.0.bar": "bla", "foo.eek.bar": "blubb" }
> 
> With just the struct convention, it's still fine: it obviously means
> the same as JSON
> 
>   { "foo": { "0": { "bar": "bla" }, "eek": { "bar": "blubb" } } }
> 
> Adding the list convention makes it invalid.  It also outlaws a
> bunch of keys that would be just fine in JSON, namely any that get
> recognized as list index.  Raise your hand if you're willing to bet
> real money on your predictions of what will be recognized as list
> index, without looking at the code.  I'm not.

If you're adding dict keys to the schema that could be reasonably parsed
as a list index, I think you're doing something wrong. I would agree
that this is a problem if we were talking about user-supplied values,
but it's just keys that are defined by qemu.

> I'm afraid I have growing doubts regarding the QemuOpts dotted key
> convention in general.
> 
> The convention makes '.' a special character in keys, but only
> sometimes.  If the key gets consumed by something that uses dotted key
> convention, '.' is special, and to get a non-special '.', you need to
> escape it by doubling.  Else, it's not.

Do we really implement escaping by doubling dots? This would be news to
me.

Do we even have a reason to escape dots, i.e. are there any options in
the schema whose keys contain a dot? I think we took care to avoid such
things.

> Since the same key can be used differently by different code, the same
> '.' could in theory be both special and non-special.  In practice, this
> would be madness.
> 
> Adopting the dotted key convention for an existing QemuOpts option, say
> -object [PATCH 15], *breaks* existing command line usage of keys
> containing '.', because you now have to escape the '.'.  Dan, I'm afraid
> you need to show that no such keys exist, or if they exist they don't
> matter.
> 
> I know we have keys containing '.' elsewhere, e.g. device "macio-ide"
> property "ide.0".  Our chronic inability to consistently restrict names
> in ABI to something sane is beyond foolish.

I wanted to have a look at this example, but I can only find the string
"ide.0" used as a bus name in the sources, that is, a value rather than
a key.

Do you have a pointer to the property definition that you mean?

> It's probably too late to 

Re: [Qemu-block] [PATCH v4 12/12] iotests: Add test for NBD's blockdev-add interface

2016-10-13 Thread Kevin Wolf
Am 28.09.2016 um 22:56 hat Max Reitz geschrieben:
> Signed-off-by: Max Reitz 
> ---
>  tests/qemu-iotests/147 | 201 
> +
>  tests/qemu-iotests/147.out |   5 ++
>  tests/qemu-iotests/group   |   1 +
>  3 files changed, 207 insertions(+)
>  create mode 100755 tests/qemu-iotests/147
>  create mode 100644 tests/qemu-iotests/147.out
> 
> diff --git a/tests/qemu-iotests/147 b/tests/qemu-iotests/147
> new file mode 100755
> index 000..61e1e6f
> --- /dev/null
> +++ b/tests/qemu-iotests/147
> @@ -0,0 +1,201 @@
> +#!/usr/bin/env python
> +#
> +# Test case for NBD's blockdev-add interface
> +#
> +# Copyright (C) 2016 Red Hat, Inc.
> +#
> +# This program is free software; you can redistribute it and/or modify
> +# it under the terms of the GNU General Public License as published by
> +# the Free Software Foundation; either version 2 of the License, or
> +# (at your option) any later version.
> +#
> +# This program is distributed in the hope that it will be useful,
> +# but WITHOUT ANY WARRANTY; without even the implied warranty of
> +# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
> +# GNU General Public License for more details.
> +#
> +# You should have received a copy of the GNU General Public License
> +# along with this program.  If not, see .
> +#
> +
> +import os
> +import socket
> +import stat
> +import time
> +import iotests
> +from iotests import cachemode, imgfmt, qemu_img, qemu_nbd
> +
> +NBD_PORT = 10811
> +
> +test_img = os.path.join(iotests.test_dir, 'test.img')
> +
> +class NBDBlockdevAddBase(iotests.QMPTestCase):
> +def blockdev_add_options(self, address, export=None):
> +options = { 'node-name': 'nbd-blockdev',
> +'driver': 'raw',
> +'file': {
> +'driver': 'nbd',
> +'address': address
> +} }
> +if export is not None:
> +options['file']['export'] = export
> +return options
> +
> +def client_test(self, filename, address, export=None):
> +bao = self.blockdev_add_options(address, export)
> +result = self.vm.qmp('blockdev-add', options=bao)

This needs a rebase (**bao instead of options=bao).

> +self.assert_qmp(result, 'return', {})
> +
> +result = self.vm.qmp('query-named-block-nodes')
> +for node in result['return']:
> +if node['node-name'] == 'nbd-blockdev':
> +self.assert_qmp(node, 'image/filename', filename)
> +break
> +
> +result = self.vm.qmp('x-blockdev-del', node_name='nbd-blockdev')
> +self.assert_qmp(result, 'return', {})
> +
> +
> +class QemuNBD(NBDBlockdevAddBase):
> +def setUp(self):
> +qemu_img('create', '-f', iotests.imgfmt, test_img, '64k')
> +self.vm = iotests.VM()
> +self.vm.launch()
> +
> +def tearDown(self):
> +self.vm.shutdown()
> +os.remove(test_img)
> +
> +def _server_up(self, *args):
> +self.assertEqual(qemu_nbd('-f', imgfmt, test_img, *args), 0)
> +
> +def test_inet(self):
> +self._server_up('-p', str(NBD_PORT))
> +address = { 'type': 'inet',
> +'data': {
> +'host': 'localhost',
> +'port': str(NBD_PORT)
> +} }
> +self.client_test('nbd://localhost:%i' % NBD_PORT, address)
> +
> +def test_unix(self):
> +socket = os.path.join(iotests.test_dir, 'qemu-nbd.socket')
> +self._server_up('-k', socket)
> +address = { 'type': 'unix',
> +'data': { 'path': socket } }
> +self.client_test('nbd+unix://?socket=' + socket, address)
> +try:
> +os.remove(socket)
> +except OSError:
> +pass

If the test case fails, the socket file is leaked. We should probably
either create and remove it in setUp()/tearDown(), or use a try/finally
block around the test_unix code.

Kevin
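Kevin's try/finally suggestion could look like the following standalone sketch; run_unix_test and test_body are stand-ins for the iotest's actual helpers, not its real API:

```python
import os

def run_unix_test(test_body, socket_path):
    """Remove the socket file even if the test body raises, so a
    failing test case does not leak the file."""
    try:
        test_body(socket_path)
    finally:
        try:
            os.remove(socket_path)
        except OSError:
            pass  # server may have already removed it, or never created it
```

The same guarantee could be had by creating and removing the socket in setUp()/tearDown() instead, which is the other option Kevin mentions.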



Re: [Qemu-block] [PATCH v4 11/12] socket_scm_helper: Accept fd directly

2016-10-13 Thread Kevin Wolf
Am 28.09.2016 um 22:56 hat Max Reitz geschrieben:
> This gives us more freedom about the fd that is passed to qemu, allowing
> us to e.g. pass sockets.
> 
> Signed-off-by: Max Reitz 

Reviewed-by: Kevin Wolf 



Re: [Qemu-block] [PATCH v4 08/12] qapi: Allow blockdev-add for NBD

2016-10-13 Thread Kevin Wolf
Am 28.09.2016 um 22:55 hat Max Reitz geschrieben:
> Signed-off-by: Max Reitz 

Reviewed-by: Kevin Wolf 



Re: [Qemu-block] [PATCH v4 07/12] block/nbd: Use SocketAddress options

2016-10-13 Thread Kevin Wolf
Am 28.09.2016 um 22:55 hat Max Reitz geschrieben:
> Drop the use of legacy options in favor of the SocketAddress
> representation, even for internal use (i.e. for storing the result of
> the filename parsing).
> 
> Signed-off-by: Max Reitz 

Reviewed-by: Kevin Wolf 



Re: [Qemu-block] [Qemu-devel] [PATCH v14 13/21] qdict: allow qdict_crumple to accept compound types as values

2016-10-13 Thread Markus Armbruster
Cc: Kevin for discussion of QemuOpts dotted key convention

"Daniel P. Berrange"  writes:

> Currently qdict_crumple requires a totally flat QDict as its
> input. i.e. all values in the QDict must be scalar types.
>
> In order to have backwards compatibility with the OptsVisitor,
> qemu_opt_to_qdict() has a new mode where it may return a QList
> for values in the QDict, if there was a repeated key. We thus
> need to allow compound types to appear as values in the input
> dict given to qdict_crumple().
>
> To avoid confusion, we sanity check that the user has not mixed
> the old and new syntax at the same time. e.g. these are allowed
>
>foo=hello,foo=world,foo=wibble
>foo.0=hello,foo.1=world,foo.2=wibble
>
> but this is forbidden
>
>foo=hello,foo=world,foo.2=wibble

I understand the need for foo.bar=val.  It makes it possible to specify
nested dictionaries with QemuOpts.

The case for foo.0=val is less clear.  QemuOpts already supports lists,
by repeating keys.  Why do we need a second, wordier way to specify
them?

Note that this second way creates entirely new failure modes and
restrictions.  Let me show using an example derived from one in
qdict_crumple()'s contract:

foo.0.bar=bla,foo.eek.bar=blubb

Without the dotted key convention, this is perfectly fine: key
"foo.0.bar" has the single value "bla", and key "foo.eek.bar" has
the single value "blubb".  Equivalent JSON would be

  { "foo.0.bar": "bla", "foo.eek.bar": "blubb" }

With just the struct convention, it's still fine: it obviously means
the same as JSON

  { "foo": { "0": { "bar": "bla" }, "eek": { "bar": "blubb" } } }

Adding the list convention makes it invalid.  It also outlaws a
bunch of keys that would be just fine in JSON, namely any that get
recognized as list index.  Raise your hand if you're willing to bet
real money on your predictions of what will be recognized as list
index, without looking at the code.  I'm not.
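A minimal Python model of the key splitting makes the conflict concrete. This shows the struct convention only; the real qdict_crumple() lives in qobject/qdict.c and this sketch ignores escaping and error handling:

```python
def crumple(flat):
    """Split dotted keys into nested dicts: the 'struct convention'."""
    out = {}
    for key, value in flat.items():
        *path, last = key.split('.')
        d = out
        for p in path:
            d = d.setdefault(p, {})
        d[last] = value
    return out

nested = crumple({'foo.0.bar': 'bla', 'foo.eek.bar': 'blubb'})
# Under the struct convention this is fine:
#   {'foo': {'0': {'bar': 'bla'}, 'eek': {'bar': 'blubb'}}}
# The list convention additionally treats an all-integer key set as a
# list, so 'foo' mixing the index '0' with the key 'eek' becomes invalid:
mixed = (any(k.isdigit() for k in nested['foo'])
         and not all(k.isdigit() for k in nested['foo']))
```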

I'm afraid I have growing doubts regarding the QemuOpts dotted key
convention in general.

The convention makes '.' a special character in keys, but only
sometimes.  If the key gets consumed by something that uses dotted key
convention, '.' is special, and to get a non-special '.', you need to
escape it by doubling.  Else, it's not.
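If escaping by doubling were implemented, the split would have to look something like this. This is purely hypothetical, modelling the convention as described here; whether QEMU actually implements this escaping is itself questioned in the thread:

```python
def split_dotted_key(key):
    """Split a dotted key, treating a doubled '.' as a literal dot.
    Hypothetical sketch of the escaping-by-doubling convention."""
    parts, cur, i = [], [], 0
    while i < len(key):
        if key[i] == '.':
            if i + 1 < len(key) and key[i + 1] == '.':
                cur.append('.')          # '..' -> literal '.'
                i += 2
                continue
            parts.append(''.join(cur))   # single '.' -> separator
            cur = []
            i += 1
        else:
            cur.append(key[i])
            i += 1
    parts.append(''.join(cur))
    return parts
```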

Since the same key can be used differently by different code, the same
'.' could in theory be both special and non-special.  In practice, this
would be madness.

Adopting the dotted key convention for an existing QemuOpts option, say
-object [PATCH 15], *breaks* existing command line usage of keys
containing '.', because you now have to escape the '.'.  Dan, I'm afraid
you need to show that no such keys exist, or if they exist they don't
matter.

I know we have keys containing '.' elsewhere, e.g. device "macio-ide"
property "ide.0".  Our chronic inability to consistently restrict names
in ABI to something sane is beyond foolish.

It's probably too late to back out the dotted key convention completely.
Kevin?

Can we still back out the list part of the convention, and use repeated
keys instead?

If we're stuck with some form of the dotted key convention, can we at
least make it a more integral part of QemuOpts rather than something
bolted on as an afterthought?  Here's my thinking on how that might be
done:

* Have a QemuOptsList flag @flat.

* If @flat, QemuOpts behaves as it always has: the special characters
  are ',' and '=', and parsing a key=value,... string produces a list
  where each element represents one key=value from the string, in the
  same order.

* If not @flat, '.' becomes an additional special character, and parsing
  a key=value,... string produces a dictionary, similar to the one you
  get now by converting with qemu_opts_to_qdict() and filtering through
  qdict_crumple().

The difference to now is that you either always crumple, or not at all,
and the meaning of '.' is unambiguous.

I wish we had refrained from saddling QemuOpts with even more magic.
Compared to this swamp, use of JSON on the command line looks rather
appealing to me.



Re: [Qemu-block] [PATCH 06/22] qcow2: add dirty bitmaps extension

2016-10-13 Thread Vladimir Sementsov-Ogievskiy

On 12.10.2016 21:21, Max Reitz wrote:

On 11.10.2016 14:09, Vladimir Sementsov-Ogievskiy wrote:

On 01.10.2016 17:46, Max Reitz wrote:

On 30.09.2016 12:53, Vladimir Sementsov-Ogievskiy wrote:

Add dirty bitmap extension as specified in docs/specs/qcow2.txt.
For now, just mirror extension header into Qcow2 state and check
constraints.

For now, disable image resize if it has bitmaps. It will be fixed later.

Signed-off-by: Vladimir Sementsov-Ogievskiy 
---
   block/qcow2.c | 83
+++
   block/qcow2.h |  4 +++
   2 files changed, 87 insertions(+)

diff --git a/block/qcow2.c b/block/qcow2.c
index c079aa8..08c4ef9 100644
--- a/block/qcow2.c
+++ b/block/qcow2.c

[...]


@@ -162,6 +164,62 @@ static int
qcow2_read_extensions(BlockDriverState *bs, uint64_t start_offset,
   }
   break;
   +case QCOW2_EXT_MAGIC_DIRTY_BITMAPS:
+ret = bdrv_pread(bs->file, offset, &bitmaps_ext, ext.len);

Overflows with ext.len > sizeof(bitmaps_ext).

(ext.len < sizeof(bitmaps_ext) is also wrong, but less dramatically so.)
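The fix Max asks for is to validate ext.len against the structure size before reading, rather than copying ext.len bytes into a fixed-size struct. A Python model of the check; the field layout and names here are guesses for illustration, not the actual qcow2 header:

```python
import struct

BITMAPS_EXT_FMT = '>IIQQ'                        # illustrative layout only
BITMAPS_EXT_LEN = struct.calcsize(BITMAPS_EXT_FMT)

def read_bitmaps_ext(data, offset, ext_len):
    """Reject an ext.len that does not match the expected header size;
    blindly reading ext_len bytes would overflow (ext_len too large)
    or underfill (too small) the destination structure."""
    if ext_len != BITMAPS_EXT_LEN:
        raise ValueError('bitmaps_ext: bad extension length %d' % ext_len)
    if offset + ext_len > len(data):
        raise ValueError('bitmaps_ext: extension runs past end of file')
    return struct.unpack_from(BITMAPS_EXT_FMT, data, offset)
```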


+if (ret < 0) {
+error_setg_errno(errp, -ret, "ERROR: bitmaps_ext: "
+ "Could not read ext header");
+return ret;
+}
+
+if (bitmaps_ext.reserved32 != 0) {
+error_setg_errno(errp, -ret, "ERROR: bitmaps_ext: "
+ "Reserved field is not zero.");

Please drop the full stop at the end.

What do you mean? goto fail here? Or not stop at all and just print the error?

The "." at the end of the message. :-)

(https://en.wikipedia.org/wiki/Full_stop)

Max


aha, cool. didn't know this. )

--
Best regards,
Vladimir




Re: [Qemu-block] [PATCH v4 05/12] block/nbd: Add nbd_has_filename_options_conflict()

2016-10-13 Thread Kevin Wolf
Am 28.09.2016 um 22:55 hat Max Reitz geschrieben:
> Right now, we have four possible options that conflict with specifying
> an NBD filename, and a future patch will add another one ("address").
> This future option is a nested QDict that is flattened at this point,
> requiring us to test each option whether its key has an "address."
> prefix. Therefore, we will then need to iterate through all options.
> 
> Adding this iteration logic now will simplify adding the new option
> later. A nice side effect is that the user will not receive a long list
> of five options which are not supposed to be specified with a filename,
> but we can actually print the problematic option.
> 
> Signed-off-by: Max Reitz 

Reviewed-by: Kevin Wolf 



Re: [Qemu-block] [PATCH v4 06/12] block/nbd: Accept SocketAddress

2016-10-13 Thread Kevin Wolf
Am 28.09.2016 um 22:55 hat Max Reitz geschrieben:
> Add a new option "address" to the NBD block driver which accepts a
> SocketAddress.
> 
> "path", "host" and "port" are still supported as legacy options and are
> mapped to their corresponding SocketAddress representation.
> 
> Signed-off-by: Max Reitz 

Not opposed in principle to your change, but we should try to keep the
naming consistent between NBD and the other block drivers, notably the
SSH work that is currently going on.

This patch uses 'address' for the SockAddress, the proposed SSH patch
uses 'server'. I don't mind too much which one we choose, though I like
'server' a bit better. Anyway, we should choose one and stick to it in
all drivers.

>  block/nbd.c   | 166 
> ++
>  tests/qemu-iotests/051.out|   4 +-
>  tests/qemu-iotests/051.pc.out |   4 +-
>  3 files changed, 106 insertions(+), 68 deletions(-)
> 
> diff --git a/block/nbd.c b/block/nbd.c
> index cdab20f..449f94e 100644
> --- a/block/nbd.c
> +++ b/block/nbd.c
> @@ -32,6 +32,9 @@
>  #include "qemu/uri.h"
>  #include "block/block_int.h"
>  #include "qemu/module.h"
> +#include "qapi-visit.h"
> +#include "qapi/qmp-input-visitor.h"
> +#include "qapi/qmp-output-visitor.h"
>  #include "qapi/qmp/qdict.h"
>  #include "qapi/qmp/qjson.h"
>  #include "qapi/qmp/qint.h"
> @@ -44,7 +47,8 @@ typedef struct BDRVNBDState {
>  NbdClientSession client;
>  
>  /* For nbd_refresh_filename() */
> -char *path, *host, *port, *export, *tlscredsid;
> +SocketAddress *saddr;
> +char *export, *tlscredsid;
>  } BDRVNBDState;
>  
>  static int nbd_parse_uri(const char *filename, QDict *options)
> @@ -131,7 +135,9 @@ static bool nbd_has_filename_options_conflict(QDict 
> *options, Error **errp)
>  if (!strcmp(e->key, "host") ||
>  !strcmp(e->key, "port") ||
>  !strcmp(e->key, "path") ||
> -!strcmp(e->key, "export"))
> +!strcmp(e->key, "export") ||
> +!strcmp(e->key, "address") ||
> +!strncmp(e->key, "address.", 8))

strstart()?

>  {
>  error_setg(errp, "Option '%s' cannot be used with a file name",
> e->key);
> @@ -205,50 +211,67 @@ out:
>  g_free(file);
>  }
>  
> -static SocketAddress *nbd_config(BDRVNBDState *s, QemuOpts *opts, Error 
> **errp)
> +static bool nbd_process_legacy_socket_options(QDict *output_options,
> +  QemuOpts *legacy_opts,
> +  Error **errp)
>  {
> -SocketAddress *saddr;
> -
> -s->path = g_strdup(qemu_opt_get(opts, "path"));
> -s->host = g_strdup(qemu_opt_get(opts, "host"));
> -s->port = g_strdup(qemu_opt_get(opts, "port"));
> -
> -if (!s->path == !s->host) {
> -if (s->path) {
> -error_setg(errp, "path and host may not be used at the same 
> time");
> -} else {
> -error_setg(errp, "one of path and host must be specified");
> +const char *path = qemu_opt_get(legacy_opts, "path");
> +const char *host = qemu_opt_get(legacy_opts, "host");
> +const char *port = qemu_opt_get(legacy_opts, "port");
> +
> +if (path && host) {
> +error_setg(errp, "path and host may not be used at the same time");
> +return false;
> +} else if (path) {
> +if (port) {
> +error_setg(errp, "port may not be used without host");
> +return false;
>  }
> -return NULL;
> +
> +qdict_put(output_options, "address.type", qstring_from_str("unix"));
> +qdict_put(output_options, "address.data.path", 
> qstring_from_str(path));
> +} else if (host) {
> +qdict_put(output_options, "address.type", qstring_from_str("inet"));
> +qdict_put(output_options, "address.data.host", 
> qstring_from_str(host));
> +qdict_put(output_options, "address.data.port",
> +  qstring_from_str(port ?: stringify(NBD_DEFAULT_PORT)));
>  }
> -if (s->port && !s->host) {
> -error_setg(errp, "port may not be used without host");
> -return NULL;
> +
> +return true;
> +}

If both the legacy option and the new one are given, the legacy one
takes precedence. Intentional?

Kevin



Re: [Qemu-block] [PATCH v4 03/12] block/nbd: Default port in nbd_refresh_filename()

2016-10-13 Thread Kevin Wolf
Am 28.09.2016 um 22:55 hat Max Reitz geschrieben:
> Instead of not emitting the port in nbd_refresh_filename(), just set it
> to the default if the user did not specify it. This makes the logic a
> bit simpler.
> 
> Signed-off-by: Max Reitz 

Reviewed-by: Kevin Wolf 



Re: [Qemu-block] [PATCH v4 04/12] block/nbd: Use qdict_put()

2016-10-13 Thread Kevin Wolf
Am 28.09.2016 um 22:55 hat Max Reitz geschrieben:
> Instead of inlining this nice macro (i.e. resorting to
> qdict_put_obj(..., QOBJECT(...))), use it.
> 
> Signed-off-by: Max Reitz 

Reviewed-by: Kevin Wolf 



Re: [Qemu-block] [PATCH v4 02/12] block/nbd: Reject port parameter without host

2016-10-13 Thread Kevin Wolf
Am 28.09.2016 um 22:55 hat Max Reitz geschrieben:
> Currently, a port that is passed along with a UNIX socket path is
> silently ignored. That is not exactly ideal, it should be an error
> instead.
> 
> Signed-off-by: Max Reitz 

Reviewed-by: Kevin Wolf 



Re: [Qemu-block] [PATCH v4 01/12] block/nbd: Drop trailing "." in error messages

2016-10-13 Thread Kevin Wolf
Am 28.09.2016 um 22:55 hat Max Reitz geschrieben:
> Signed-off-by: Max Reitz 

Reviewed-by: Kevin Wolf 



Re: [Qemu-block] [Qemu-devel] [PATCH v14 12/21] option: allow qemu_opts_to_qdict to merge repeated options

2016-10-13 Thread Markus Armbruster
Markus Armbruster  writes:

> "Daniel P. Berrange"  writes:
>
>> If given an option string such as
>>
>>   size=1024,nodes=10,nodes=4-5,nodes=1-2,policy=bind
>>
>> the qemu_opts_to_qdict() method will currently overwrite
>> the values for repeated option keys, so only the last
>> value is in the returned dict:
>>
>> size=QString("1024")
>> nodes=QString("1-2")
>> policy=QString("bind")
>>
>> With this change the caller can optionally ask for all
>> the repeated values to be stored in a QList. In the
>> above example that would result in 'nodes' being a
>> QList, so the returned dict would contain
>>
>> size=QString("1024")
>> nodes=QList([QString("10"),
>>  QString("4-5"),
>>  QString("1-2")])
>> policy=QString("bind")
>>
>> Note that the conversion has no way of knowing whether
>> any given key is expected to be a list upfront - it can
>> only figure that out when seeing the first duplicated
>> key. Thus the caller has to be prepared to deal with the
>> fact that if a key 'foo' is a list, then the returned
>> qdict may contain either a QString or a QList for the
>> key 'foo'.
>>
>> In a third mode, it is possible to ask for repeated
>> options to be reported as an error, rather than silently
>> dropping all but the last one.
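A compact Python model of the three behaviours the commit message describes; the function and parameter names are illustrative, not qemu_opts_to_qdict()'s actual API:

```python
def opts_to_dict(option_string, dup='last'):
    """dup='last': last value wins (current behaviour).
    dup='list': repeated keys accumulate into a list (new mode).
    dup='error': repeated keys are rejected (third mode)."""
    out = {}
    for item in option_string.split(','):
        key, _, value = item.partition('=')
        if key in out:
            if dup == 'error':
                raise ValueError('duplicate key: ' + key)
            if dup == 'list':
                if not isinstance(out[key], list):
                    out[key] = [out[key]]   # first repeat: promote to list
                out[key].append(value)
                continue
        out[key] = value
    return out
```

Note how a caller in 'list' mode still gets a plain string for any key that happens to appear only once, which is exactly the "either a QString or a QList" caveat above.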
>
> To serve as a replacement for the options visitor, this needs to be able
> to behave exactly the same together with a suitably hacked up QObject
> input visitor.  Before I dive into the actual patch, let me summarize
> QemuOpts and options visitor behavior.
>
> Warning, this is going to get ugly.
>
> QemuOpts faithfully represents a key=value,... string as a list of
> QemuOpt.  Each QemuOpt represents one key=value.  They are in the same
> order.  If key occurs multiple times in the string, it occurs just the
> same in the list.
>
> *Except* key "id" is special: it's stored outside the list, and all but
> the first one are silently ignored.
>
> Most users only ever get the last value of a key.  Any non-last
> key=value are silently ignored.
>
> We actually exploit this behavior to do defaults, by *prepending* them
> to the list.  See the use of qemu_opts_set_defaults() in main().

This prepending of defaults assumes all users ignore values other than
the last.  It breaks if any user gets non-last values.

> A few users get all values of keys (other than key "id"):
>
> * -device, in qdev_device_add() with callback set_property().
>
>   We first get "driver" and "bus" normally (silently ignoring non-last
>   values, as usual).  All other keys are device properties.  To set
>   them, we get all (key, value), ignore keys "driver" and "bus", and set
>   the rest.  If a key occurs multiple times, it gets set multiple times.
>   This effectively ignores all but the last one, silently.
>
> * -semihosting-config, in main() with callback add_semihosting_arg().
>
>   We first get a bunch of keys normally.  Key "arg" is special: it may
>   be repeated to build a list.  To implement that, we get all (key,
>   value), ignore keys other than "arg", and accumulate the values.
>
> * -machine & friends, in main() with callback machine_set_property()
>
>   Similar to -device, only for machines, with "type" instead of "driver"
>   and "bus".
>
> * -spice, in qemu_spice_init() with callback add_channel()
>
>   Keys "tls-channel" and "plaintext-channel" may be used repeated to
>   specify multiple channels.  To implement that, we get all (key,
>   value), ignore keys other than "tls-channel" and "plaintext-channel",
>   and set up a channel for each of the others.
>
> * -writeconfig, in config_write_opts() with callback config_write_opt()
>
>   We write out all keys in order.
>
> * The options visitor, in opts_start_struct()
>
>   We convert the list of (key, value) to a hash table of (key, list of
>   values).  Most of the time, the list of values has exactly one
>   element.
>
>   When the visitor's user asks for a scalar, we return the last element
>   of the list of values, in lookup_scalar().
>
>   When the user asks for list elements, we return the elements of the
>   list of values in order, in opts_next_list(), or if there are none,
>   the empty list in opts_start_list().

Note that the only way to get non-last values is to iterate over all
(key, value).  The combination of "getting a specific key's value gets
the last one" and "iterating over all keys gets all values" is poor
interface design.  The latter feature got pressed into service to do
list-valued keys.  When qemu_opts_set_defaults() got added (commit
4f6dd9a), the bad interaction with the list-valued keys hack wasn't
considered, probably because the whole thing had become too byzantine to
fully understand.

> Unlike the options visitor, this patch (judging from your description)
> makes a list only when keys are repeated.  The QObject visitor will have
> to cope with finding both scalars and lists.  When it finds a scalar,
> but needs a 

Re: [Qemu-block] [Qemu-devel] [PATCH v2 04/11] blockjobs: Always use block_job_get_aio_context

2016-10-13 Thread Paolo Bonzini


On 13/10/2016 02:49, John Snow wrote:
> As context to everyone else as to why I'm going down the rabbit hole of
> trying to remove external references to AioContext at all, see
> https://lists.gnu.org/archive/html/qemu-devel/2016-10/msg00795.html
> 
> On 10/07/2016 03:49 AM, Paolo Bonzini wrote:
>>
>>
>> On 06/10/2016 22:22, John Snow wrote:
>>> Calls .complete(), for which the only implementation is
>>> mirror_complete. Uh, this actually seems messy. Looks like there's
>>> nothing to prevent us from calling this after we've already told it to
>>> complete once.
>>
>> Yeah, it should have an
>>
>> if (s->should_complete) {
>> return;
>> }
>>
>> at the beginning.  I have other mirror.c patches in my queue so I can
>> take care of that.
>>
> 
> Or something up the stack at block_job_complete so it's not up to job
> implementations? What if the next implementer "forgets."

Yeah, that would require moving s->should_complete up to BlockJob.

>>> block_job_cancel and block_job_complete are different.
>>>
>>> block_job_cancel is called in many places, but we can just add a similar
>>> block_job_user_cancel if we wanted a version which takes care to acquire
>>> context and one that does not. (Or we could just acquire the context
>>> regardless, but Paolo warned me ominously that recursive locks are EVIL.
>>> He sounded serious.)
>>
>> Not that many places:
>>
>> - block_job_finish_sync calls it, and you can just release/reacquire
>> around the call to "finish(job, _err)".
> 
> This makes me a little nervous because we went through the trouble of
> creating this callback, but we're going to assume we know that it's a
> public interface that will take the lock for itself (or otherwise does
> not require a lock.)
> 
> In practice it works, but it seems needlessly mystifying in terms of
> proving correctness.

It's _much_ easier to assume that all callbacks take the lock
themselves.  It's counterintuitive (just like the idea that recursive
locks are bad :)), but the point is that if you second guess the
callbacks and invoke them you might get the locking order wrong.  This
ends up with ugly AB-BA deadlocks.

>> - replication_stop is not acquiring s->secondary_disk->bs's AioContext.
> 
> Seems like a bug on their part. Would be fixed by having cancel acquire
> context for itself.

Yep.

> Yeah, I should be reffing it anyway.
> 
> The rest of this... What I think you mean is acquiring and releasing the
> context as needed for EACH of prepare, commit, abort, and clean as
> necessary, right?
> 
> And then in this case, it simply wouldn't be necessary for abort, as the
> sync cancel would do it for us.

Right.

> Alright.
> 
> Say I *do* push the acquisitions down into blockjob.c. What benefit does
> that provide? Won't I still need the block_job_get_aio_context()
> function (At least internally) to know which context to acquire? This
> would preclude you from deleting it.
> 
> Plus... we remove some fairly simple locking mechanisms and then inflate
> it tenfold. I'm not convinced this is an improvement.

The improvement is that you can now replace the huge lock with a smaller
one (you don't have to do it now, but you could).  Furthermore, the
small lock is a) non-recursive b) a leaf lock.  This means that you've
removed a lot of opportunities for deadlock, and generally made things
easier to reason about.

Furthermore, with large locks you never know what they actually protect;
cue the confusion between aio_context_acquire/release and
bdrv_drained_begin/end.  And it's easier to forget them if you force the
caller to do it; and we have an example with replication.

Again, it can be counterintuitive that it's better, but it is. :)

> (1) QMP interface for job management
> (2) bdrv_drain_all, in block/io.c
> 
> (1) AFAICT, the QMP interface is concerned with assuring itself it has
> unique access to the BlockJob structure itself, and it doesn't really
> authentically care about the AIOContext itself -- just race-free access
> to the Job.

Yes.  The protection you need here is mostly against concurrent
completion of the job (and concurrent manipulation of fields such as
job->busy, job->paused, etc.).

> This is not necessarily buggy today because, even though we grab the
> BlockBackend's context unconditionally, we already know the main/monitor
> thread is not accessing the blockjob. It's still silly, though.
> 
> (2) bdrv_drain_all appears to be worried about the same thing; we just
> need to safely deliver pause/resume messages.
> 
> I'm less sure about where this can run from, and suspect that if the job
> has deferred to main that this could be buggy. If bdrv_drain_all is
> called from context A and the job is running on context M having
> deferred from B, we may lock against context B (from A) and have racey
> access from between A/M. (Maybe?)

bdrv_drain_all is BQL-only (because it acquires more than one AioContext).

> Would you be terribly offended if I left this patch as-is for now and we
> can work 

Re: [Qemu-block] [Qemu-devel] [PATCH v14 12/21] option: allow qemu_opts_to_qdict to merge repeated options

2016-10-13 Thread Markus Armbruster
"Daniel P. Berrange"  writes:

> If given an option string such as
>
>   size=1024,nodes=10,nodes=4-5,nodes=1-2,policy=bind
>
> the qemu_opts_to_qdict() method will currently overwrite
> the values for repeated option keys, so only the last
> value is in the returned dict:
>
> size=QString("1024")
> nodes=QString("1-2")
> policy=QString("bind")
>
> With this change the caller can optionally ask for all
> the repeated values to be stored in a QList. In the
> above example that would result in 'nodes' being a
> QList, so the returned dict would contain
>
> size=QString("1024")
> nodes=QList([QString("10"),
>  QString("4-5"),
>  QString("1-2")])
> policy=QString("bind")
>
> Note that the conversion has no way of knowing whether
> any given key is expected to be a list upfront - it can
> only figure that out when seeing the first duplicated
> key. Thus the caller has to be prepared to deal with the
> fact that if a key 'foo' is a list, then the returned
> qdict may contain either a QString or a QList for the
> key 'foo'.

Actually three cases, not two:

0. qdict does not contain the key means empty list.

1. qdict contains the key with a QString value means list of one
element.

2. qdict contains the key with a QList value means list of more than one
element.

I consider this weird.  However, it's usefully weird with at least your
QObject input visitor.

> In a third mode, it is possible to ask for repeated
> options to be reported as an error, rather than silently
> dropping all but the last one.

Got users for this policy in the pipeline?

> All existing callers are converted to explicitly
> request the historical behaviour of only reporting the
> last key. Later patches will make use of the new modes.
>
> Signed-off-by: Daniel P. Berrange 
> ---
>  blockdev.c   |   7 ++-
>  include/qemu/option.h|  13 -
>  monitor.c|   3 +-
>  qapi/qobject-input-visitor.c |   4 +-
>  qemu-img.c   |   4 +-
>  qemu-io-cmds.c   |   3 +-
>  qemu-io.c|   6 +-
>  qemu-nbd.c   |   3 +-
>  qom/object_interfaces.c  |   3 +-
>  tests/test-qemu-opts.c   | 132 +++
>  tests/test-replication.c |   9 ++-
>  util/qemu-option.c   |  64 ++---
>  12 files changed, 229 insertions(+), 22 deletions(-)
>
> diff --git a/blockdev.c b/blockdev.c
> index 814d49f..a999d46 100644
> --- a/blockdev.c
> +++ b/blockdev.c
> @@ -914,7 +914,8 @@ DriveInfo *drive_new(QemuOpts *all_opts, BlockInterfaceType block_default_type)
>  
>      /* Get a QDict for processing the options */
>      bs_opts = qdict_new();
> -    qemu_opts_to_qdict(all_opts, bs_opts);
> +    qemu_opts_to_qdict(all_opts, bs_opts, QEMU_OPTS_REPEAT_POLICY_LAST,
> +                       &error_abort);
>  
>      legacy_opts = qemu_opts_create(&qemu_legacy_drive_opts, NULL, 0,
>                                     &error_abort);
> @@ -3804,8 +3805,8 @@ void hmp_drive_add_node(Monitor *mon, const char *optstr)
>          return;
>      }
>  
> -    qdict = qemu_opts_to_qdict(opts, NULL);
> -
> +    qdict = qemu_opts_to_qdict(opts, NULL, QEMU_OPTS_REPEAT_POLICY_LAST,
> +                               &error_abort);
>      if (!qdict_get_try_str(qdict, "node-name")) {
>          QDECREF(qdict);
>          error_report("'node-name' needs to be specified");
> diff --git a/include/qemu/option.h b/include/qemu/option.h
> index 29f3f18..ffd841d 100644
> --- a/include/qemu/option.h
> +++ b/include/qemu/option.h
> @@ -125,7 +125,18 @@ void qemu_opts_set_defaults(QemuOptsList *list, const char *params,
>                              int permit_abbrev);
>  QemuOpts *qemu_opts_from_qdict(QemuOptsList *list, const QDict *qdict,
>                                 Error **errp);
> -QDict *qemu_opts_to_qdict(const QemuOpts *opts, QDict *qdict);

Blank line here, please.

> +typedef enum {
> +    /* Last occurrence of a duplicate option silently replaces earlier */
> +    QEMU_OPTS_REPEAT_POLICY_LAST,
> +    /* Each occurrence of a duplicate option converts value to a QList */
> +    QEMU_OPTS_REPEAT_POLICY_ALL,
> +    /* First occurrence of a duplicate option causes an error */
> +    QEMU_OPTS_REPEAT_POLICY_ERROR,
> +} QemuOptsRepeatPolicy;
> +
> +QDict *qemu_opts_to_qdict(const QemuOpts *opts, QDict *qdict,
> +                          QemuOptsRepeatPolicy repeatPolicy,
> +                          Error **errp);
>  void qemu_opts_absorb_qdict(QemuOpts *opts, QDict *qdict, Error **errp);
>  
>  typedef int (*qemu_opts_loopfunc)(void *opaque, QemuOpts *opts, Error **errp);
> diff --git a/monitor.c b/monitor.c
> index 14089cc..84e79d4 100644
> --- a/monitor.c
> +++ b/monitor.c
> @@ -2642,7 +2642,8 @@ static QDict *monitor_parse_arguments(Monitor *mon,
>      if (!opts) {
>          goto fail;
>  

Re: [Qemu-block] [Qemu-devel] [PATCH v14 08/21] qapi: allow QObjectInputVisitor to be created with QemuOpts

2016-10-13 Thread Markus Armbruster
Markus Armbruster  writes:

> "Daniel P. Berrange"  writes:
>
>> Instead of requiring all callers to go through the mutli-step
>
> multi-step
>
>> process of turning QemuOpts into a suitable QObject for visiting,
>> add a new constructor that encapsulates this logic. This will
>> allow QObjectInputVisitor to be a drop-in replacement for the
>> existing OptsVisitor with minimal code changes for callers.
>>
>> NB, at this point it is only supporting opts syntax which
>> explicitly matches the QAPI schema structure, so is not yet
>> a true drop-in replacement for OptsVisitor. The patches that
>> follow will add the special cases required for full backwards
>> compatibility with OptsVisitor.
>>
>> Signed-off-by: Daniel P. Berrange 
[...]
>> diff --git a/qapi/qobject-input-visitor.c b/qapi/qobject-input-visitor.c
>> index cf41df6..d9269c9 100644
>> --- a/qapi/qobject-input-visitor.c
>> +++ b/qapi/qobject-input-visitor.c
>> @@ -545,3 +545,28 @@ Visitor *qobject_input_visitor_new_autocast(QObject *obj)
>>  
>>      return &v->visitor;
>>  }
>> +
>> +
>> +Visitor *qobject_input_visitor_new_opts(const QemuOpts *opts,
>> +                                        Error **errp)
>> +{
>> +    QDict *pdict;
>> +    QObject *pobj = NULL;
>
> @pdict and @pobj are unusual names.  Let's stick to the more common
> @dict and @obj.
>
>> +    Visitor *v = NULL;
>> +
>> +    pdict = qemu_opts_to_qdict(opts, NULL);
>> +    if (!pdict) {
>> +        goto cleanup;

Returns null without setting an error, which is wrong.  Fortunately,
qemu_opts_to_qdict() cannot fail.  Please drop the broken error
handling.

>> +    }
>> +
>> +    pobj = qdict_crumple(pdict, true, errp);
>> +    if (!pobj) {
>> +        goto cleanup;
>> +    }
>> +
>> +    v = qobject_input_visitor_new_autocast(pobj);
>> + cleanup:
>> +    qobject_decref(pobj);
>> +    QDECREF(pdict);
>> +    return v;
>> +}
[...]