Re: [PATCH v3 06/10] virtiofsd: Let lo_inode_open() return a TempFd

2021-08-09 Thread Max Reitz

On 06.08.21 21:55, Vivek Goyal wrote:

On Fri, Jul 30, 2021 at 05:01:30PM +0200, Max Reitz wrote:

Strictly speaking, this is not necessary, because lo_inode_open() will
always return a new FD owned by the caller, so TempFd.owned will always
be true.

However, auto-cleanup is nice, and in some cases this plays nicely with
an lo_inode_fd() call in another conditional branch (see lo_setattr()).

Signed-off-by: Max Reitz 
---
  tools/virtiofsd/passthrough_ll.c | 138 +--
  1 file changed, 59 insertions(+), 79 deletions(-)

diff --git a/tools/virtiofsd/passthrough_ll.c b/tools/virtiofsd/passthrough_ll.c
index 9e1bc37af8..292b7f7e27 100644
--- a/tools/virtiofsd/passthrough_ll.c
+++ b/tools/virtiofsd/passthrough_ll.c
@@ -291,10 +291,8 @@ static void temp_fd_clear(TempFd *temp_fd)
  /**
   * Return an owned fd from *temp_fd that will not be closed when
   * *temp_fd goes out of scope.
- *
- * (TODO: Remove __attribute__ once this is used.)
   */
-static __attribute__((unused)) int temp_fd_steal(TempFd *temp_fd)
+static int temp_fd_steal(TempFd *temp_fd)
  {
  if (temp_fd->owned) {
  temp_fd->owned = false;
@@ -673,9 +671,12 @@ static int lo_fd(fuse_req_t req, fuse_ino_t ino, TempFd *tfd)
   * when a malicious client opens special files such as block device nodes.
   * Symlink inodes are also rejected since symlinks must already have been
   * traversed on the client side.
+ *
+ * The fd is returned in tfd->fd.  The return value is 0 on success and -errno
+ * otherwise.
   */
-static int lo_inode_open(struct lo_data *lo, struct lo_inode *inode,
- int open_flags)
+static int lo_inode_open(const struct lo_data *lo, const struct lo_inode *inode,
+ int open_flags, TempFd *tfd)
  {
  g_autofree char *fd_str = g_strdup_printf("%d", inode->fd);
  int fd;
@@ -694,7 +695,13 @@ static int lo_inode_open(struct lo_data *lo, struct lo_inode *inode,
  if (fd < 0) {
  return -errno;
  }
-return fd;
+
+*tfd = (TempFd) {
+.fd = fd,
+.owned = true,
+};
+
+return 0;
  }
  
  static void lo_init(void *userdata, struct fuse_conn_info *conn)

@@ -852,7 +859,12 @@ static void lo_setattr(fuse_req_t req, fuse_ino_t ino, struct stat *attr,
  return;
  }
  
-res = lo_inode_fd(inode, &inode_fd);
+if (!fi && (valid & FUSE_SET_ATTR_SIZE)) {
+/* We need an O_RDWR FD for ftruncate() */
+res = lo_inode_open(lo, inode, O_RDWR, &inode_fd);
+} else {
+res = lo_inode_fd(inode, &inode_fd);
+}

A minor nit.

So inode_fd could hold either an O_PATH fd returned by lo_inode_fd()
or an O_RDWR fd returned by lo_inode_open().

Previous code held these fds in two different variables, inode_fd and
truncfd respectively. I kind of found that easier to read because,
looking at the variable name, I knew whether I was dealing with an
O_PATH fd or an O_RDWR fd I had just opened.

So a minor nit. We could continue to have two variables, say
inode_fd and trunc_fd. Just that the type of trunc_fd will now be TempFd.

Also, I found the previous style easier to read, where I always got
hold of the O_PATH fd first and later opened an O_RDWR fd if the
operation was FUSE_SET_ATTR_SIZE. That way, the "valid & FUSE_SET_ATTR_SIZE"
check was not in two places.


Oh, yes.  The problem with that approach is that we unconditionally need 
to get an O_PATH fd, which is trivial when we already have one, but with 
file handles this means an open_by_handle_at() operation – and then 
another one to get the O_RDWR fd.  So there’s a superfluous 
open_by_handle_at() operation there.


I understand this makes the code a bit more complicated, but I felt 
there was sufficient reason for it.


That also means that I don’t really want to differentiate the fd into 
two distinct fd variables.  Nothing in this function needs an O_PATH fd, 
it’s just that that’s the easier one to open, so those places can work 
with any fd.


What we could do is have an rw_fd variable and a path_fd variable. The 
former will only be valid if the conditions are right (!fi && (valid & 
FUSE_SET_ATTR_SIZE)), the latter will always be valid and will be the 
same fd as rw_fd if the latter is valid.


However, both need to be TempFds, because both lo_inode_open() and 
lo_inode_fd() return TempFds.  So copying from rw_fd to path_fd would 
require a new function temp_fd_copy() or something, so the code would 
look like:


if (!fi && (valid & FUSE_SET_ATTR_SIZE)) {
    res = lo_inode_open(..., &rw_fd);
    if (res >= 0) {
        temp_fd_copy(&rw_fd, &path_fd);
    }
} else {
    res = lo_inode_fd(..., &path_fd);
}

with

void temp_fd_copy(const TempFd *from, TempFd *to)
{
    *to = (TempFd) {
        .fd = from->fd,
        .owned = false,
    };
}

And then we use path_fd wherever an O_PATH fd would suffice, and rw_fd 
elsewhere (perhaps with a preceding assert(rw_fd.fd >= 0)).  Would that 
be kind of in accordance with what you had in mind?

Re: [PATCH v3 02/10] virtiofsd: Add TempFd structure

2021-08-09 Thread Max Reitz

On 06.08.21 16:41, Vivek Goyal wrote:

On Fri, Jul 30, 2021 at 05:01:26PM +0200, Max Reitz wrote:

We are planning to add file handles to lo_inode objects as an
alternative to lo_inode.fd.  That means that everywhere where we
currently reference lo_inode.fd, we will have to open a temporary file
descriptor that needs to be closed after use.

So instead of directly accessing lo_inode.fd, there will be a helper
function (lo_inode_fd()) that either returns lo_inode.fd, or opens a new
file descriptor with open_by_handle_at().  It encapsulates this result
in a TempFd structure to let the caller know whether the FD needs to be
closed after use (opened from the handle) or not (copied from
lo_inode.fd).

I am wondering why we have this notion of "owned". Why not have the
requirement of always closing "fd"? If we copied it from lo_inode.fd,
then we will need to dup() it. Otherwise, we opened it from the file
handle and will need to close it anyway.

I guess you are trying to avoid having to call dup(), and that's why
you have this notion of an "owned" fd.


Yes, I don’t want to dup() it.  One reason is that I’d rather just not.  
It’s something that we can avoid, and dup-ing every time wouldn’t make 
the code that much simpler (I think, without having tried).


Another reason is that this would affect the current behavior (with 
O_PATH FDs), which I don’t want to alter.


Well, and finally, as a pragmatic reason, virtiofsd-rs uses the same 
structure and I don’t really want C virtiofsd and virtiofsd-rs to differ 
too much.



By using g_auto(TempFd) to store this result, callers will not even have
to care about closing a temporary FD after use.  It will be done
automatically once the object goes out of scope.

Signed-off-by: Max Reitz 
Reviewed-by: Connor Kuehl 
---
  tools/virtiofsd/passthrough_ll.c | 49 
  1 file changed, 49 insertions(+)

diff --git a/tools/virtiofsd/passthrough_ll.c b/tools/virtiofsd/passthrough_ll.c
index 1f27eeabc5..fb5e073e6a 100644
--- a/tools/virtiofsd/passthrough_ll.c
+++ b/tools/virtiofsd/passthrough_ll.c
@@ -178,6 +178,28 @@ struct lo_data {
  int user_posix_acl, posix_acl;
  };
  
+/**
+ * Represents a file descriptor that may either be owned by this
+ * TempFd, or only referenced (i.e. the ownership belongs to some
+ * other object, and the value has just been copied into this TempFd).
+ *
+ * The purpose of this encapsulation is to be used as g_auto(TempFd)
+ * to automatically clean up owned file descriptors when this object
+ * goes out of scope.
+ *
+ * Use temp_fd_steal() to get an owned file descriptor that will not
+ * be closed when the TempFd goes out of scope.
+ */
+typedef struct {
+int fd;
+bool owned; /* fd owned by this object? */
+} TempFd;
+
+#define TEMP_FD_INIT ((TempFd) { .fd = -1, .owned = false })
+
+static void temp_fd_clear(TempFd *temp_fd);
+G_DEFINE_AUTO_CLEANUP_CLEAR_FUNC(TempFd, temp_fd_clear);
+
  static const struct fuse_opt lo_opts[] = {
  { "sandbox=namespace",
offsetof(struct lo_data, sandbox),
@@ -255,6 +277,33 @@ static struct lo_data *lo_data(fuse_req_t req)
  return (struct lo_data *)fuse_req_userdata(req);
  }
  
+/**
+ * Clean-up function for TempFds
+ */
+static void temp_fd_clear(TempFd *temp_fd)
+{
+if (temp_fd->owned) {
+close(temp_fd->fd);
+*temp_fd = TEMP_FD_INIT;
+}
+}
+
+/**
+ * Return an owned fd from *temp_fd that will not be closed when
+ * *temp_fd goes out of scope.
+ *
+ * (TODO: Remove __attribute__ once this is used.)
+ */
+static __attribute__((unused)) int temp_fd_steal(TempFd *temp_fd)
+{
+if (temp_fd->owned) {
+temp_fd->owned = false;
+return temp_fd->fd;
+} else {
+return dup(temp_fd->fd);
+}
+}

This would also be simpler if we always called dup() and every caller
closed the fd.

I think the only downside is having to call dup()/close(). Not sure
whether that is an expensive operation or not.

Vivek
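
For reference, a minimal sketch of how a caller is meant to use these
helpers, assuming the lo_inode_fd() helper added later in this series
(the function and variable names here are illustrative only):

static int example_op(struct lo_data *lo, struct lo_inode *inode)
{
    g_auto(TempFd) inode_fd = TEMP_FD_INIT;
    int res;

    res = lo_inode_fd(inode, &inode_fd);  /* may or may not own the fd */
    if (res < 0) {
        return res;  /* inode_fd is still TEMP_FD_INIT, nothing to close */
    }

    /* ... use inode_fd.fd with *at() syscalls ... */

    return 0;
    /* temp_fd_clear() runs here and closes inode_fd.fd only if owned */
}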






Re: [PATCH v3 04/10] virtiofsd: Add lo_inode_fd() helper

2021-08-09 Thread Max Reitz

On 06.08.21 20:25, Vivek Goyal wrote:

On Fri, Jul 30, 2021 at 05:01:28PM +0200, Max Reitz wrote:

[..]

@@ -1335,12 +1359,18 @@ static void lo_mknod_symlink(fuse_req_t req, fuse_ino_t parent,
  return;
  }
  
+res = lo_inode_fd(dir, &dir_fd);
+if (res < 0) {
+saverr = -res;
+goto out;
+}
+
  saverr = lo_change_cred(req, &old, lo->change_umask && !S_ISLNK(mode));
  if (saverr) {
  goto out;
  }
  
-res = mknod_wrapper(dir->fd, name, link, mode, rdev);
+res = mknod_wrapper(dir_fd.fd, name, link, mode, rdev);
  
  saverr = errno;
  
@@ -1388,6 +1418,8 @@ static void lo_symlink(fuse_req_t req, const char *link, fuse_ino_t parent,

  static void lo_link(fuse_req_t req, fuse_ino_t ino, fuse_ino_t parent,
  const char *name)
  {
+g_auto(TempFd) inode_fd = TEMP_FD_INIT;
+g_auto(TempFd) parent_fd = TEMP_FD_INIT;
  int res;
  struct lo_data *lo = lo_data(req);
  struct lo_inode *parent_inode;
@@ -1413,18 +1445,31 @@ static void lo_link(fuse_req_t req, fuse_ino_t ino, fuse_ino_t parent,
  goto out_err;
  }
  
+res = lo_inode_fd(inode, &inode_fd);
+if (res < 0) {
+errno = -res;

In previous function, we saved error to "saverr" and jumped to "out"
label, instead of overwriting to errno.

I would think that it will be good to use a single pattern. Either
save error in saverr or overwrite errno. I personally prefer saving
error into "saverr".


Absolutely, will do.


+goto out_err;
+}
+
+res = lo_inode_fd(parent_inode, &parent_fd);
+if (res < 0) {
+errno = -res;
+goto out_err;
+}
+
  memset(&e, 0, sizeof(struct fuse_entry_param));
  e.attr_timeout = lo->timeout;
  e.entry_timeout = lo->timeout;
  
-sprintf(procname, "%i", inode->fd);
-res = linkat(lo->proc_self_fd, procname, parent_inode->fd, name,
+sprintf(procname, "%i", inode_fd.fd);
+res = linkat(lo->proc_self_fd, procname, parent_fd.fd, name,
   AT_SYMLINK_FOLLOW);
  if (res == -1) {
  goto out_err;
  }
  
-res = fstatat(inode->fd, "", &e.attr, AT_EMPTY_PATH | AT_SYMLINK_NOFOLLOW);
+res = fstatat(inode_fd.fd, "", &e.attr,
+  AT_EMPTY_PATH | AT_SYMLINK_NOFOLLOW);
  if (res == -1) {
  goto out_err;
  }
@@ -1453,23 +1498,33 @@ out_err:
  static struct lo_inode *lookup_name(fuse_req_t req, fuse_ino_t parent,
  const char *name)
  {
+g_auto(TempFd) dir_fd = TEMP_FD_INIT;
  int res;
  uint64_t mnt_id;
  struct stat attr;
  struct lo_data *lo = lo_data(req);
  struct lo_inode *dir = lo_inode(req, parent);
+struct lo_inode *inode = NULL;
  
  if (!dir) {
-return NULL;
+goto out;

Should we continue to just call "return NULL". dir is NULL. That means
lo_inode() failed. That means we never got the reference. So we don't
have to put the reference. If we do "goto out", it will call
lo_inode_put() which is not needed.


Yes, but lo_inode_put() will handle this gracefully, so it isn’t wrong. 
My personal preference is that if there is a clean-up path, it should 
be used everywhere instead of having pure returns at the beginning of a 
function (where not many resources have been initialized yet), so that 
no clean-up will be forgotten.  Like, if we were to add some resource 
acquisition in the declarations above (and clean-up code in the clean-up 
path), we would need to change the return to a goto here.  Or maybe we’d 
forget that, and then we’d leak something.


So I prefer having clean-up sections be generic enough that they can be 
used from anywhere within the function, and then also using them from 
anywhere within the function, even if they end up being no-ops.
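
As a sketch of the pattern I mean (a self-contained illustration; the
resource type is stand-in only, like lo_inode_put() free() tolerates NULL):

#include <stdlib.h>

static int some_op(void)
{
    char *a = NULL, *b = NULL;
    int ret = -1;

    a = malloc(16);
    if (!a) {
        goto out;  /* the clean-up below tolerates a == NULL */
    }

    b = malloc(16);
    if (!b) {
        goto out;
    }

    /* ... do the actual work ... */
    ret = 0;

out:
    /* Both clean-up calls handle NULL gracefully, so any early
     * failure can simply jump here */
    free(b);
    free(a);
    return ret;
}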



  }
  
-res = do_statx(lo, dir->fd, name, &attr, AT_SYMLINK_NOFOLLOW, &mnt_id);
-lo_inode_put(lo, &dir);
+res = lo_inode_fd(dir, &dir_fd);
+if (res < 0) {
+goto out;
+}
+
+res = do_statx(lo, dir_fd.fd, name, &attr, AT_SYMLINK_NOFOLLOW, &mnt_id);
  if (res == -1) {
-return NULL;
+goto out;
  }
  
-return lo_find(lo, &attr, mnt_id);
+inode = lo_find(lo, &attr, mnt_id);
+
+out:
+lo_inode_put(lo, &dir);
+return inode;
  }


Thanks
Vivek






Re: [PATCH v3 01/10] virtiofsd: Limit setxattr()'s creds-dropped region

2021-08-09 Thread Max Reitz

On 06.08.21 16:16, Vivek Goyal wrote:

On Fri, Jul 30, 2021 at 05:01:25PM +0200, Max Reitz wrote:

We only need to drop/switch our credentials for the (f)setxattr() call
alone, not for the openat() or fchdir() around it.

(Right now, this may not be that big of a problem, but with inodes being
identified by file handles instead of an O_PATH fd, we will need
open_by_handle_at() calls here, which is really fickle when it comes to
credentials being dropped.)

Signed-off-by: Max Reitz 
---
  tools/virtiofsd/passthrough_ll.c | 34 +++-
  1 file changed, 25 insertions(+), 9 deletions(-)

diff --git a/tools/virtiofsd/passthrough_ll.c b/tools/virtiofsd/passthrough_ll.c
index 38b2af8599..1f27eeabc5 100644
--- a/tools/virtiofsd/passthrough_ll.c
+++ b/tools/virtiofsd/passthrough_ll.c
@@ -3121,6 +3121,7 @@ static void lo_setxattr(fuse_req_t req, fuse_ino_t ino, const char *in_name,
  bool switched_creds = false;
  bool cap_fsetid_dropped = false;
  struct lo_cred old = {};
+bool open_inode;
  
  if (block_xattr(lo, in_name)) {

  fuse_reply_err(req, EOPNOTSUPP);
@@ -3155,7 +3156,24 @@ static void lo_setxattr(fuse_req_t req, fuse_ino_t ino, const char *in_name,
  fuse_log(FUSE_LOG_DEBUG, "lo_setxattr(ino=%" PRIu64
   ", name=%s value=%s size=%zd)\n", ino, name, value, size);
  
+/*
+ * We can only open regular files or directories.  If the inode is
+ * something else, we have to enter /proc/self/fd and use
+ * setxattr() on the link's filename there.
+ */
+open_inode = S_ISREG(inode->filetype) || S_ISDIR(inode->filetype);
  sprintf(procname, "%i", inode->fd);
+if (open_inode) {
+fd = openat(lo->proc_self_fd, procname, O_RDONLY);
+if (fd < 0) {
+saverr = errno;
+goto out;
+}
+} else {
+/* fchdir should not fail here */
+FCHDIR_NOFAIL(lo->proc_self_fd);
+}
+
  /*
   * If we are setting posix access acl and if SGID needs to be
   * cleared, then switch to caller's gid and drop CAP_FSETID
@@ -3176,20 +3194,13 @@ static void lo_setxattr(fuse_req_t req, fuse_ino_t ino, const char *in_name,
  }
  switched_creds = true;
  }
-if (S_ISREG(inode->filetype) || S_ISDIR(inode->filetype)) {
-fd = openat(lo->proc_self_fd, procname, O_RDONLY);
-if (fd < 0) {
-saverr = errno;
-goto out;
-}
+if (open_inode) {
+assert(fd >= 0);
  ret = fsetxattr(fd, name, value, size, flags);
  saverr = ret == -1 ? errno : 0;
  } else {
-/* fchdir should not fail here */
-FCHDIR_NOFAIL(lo->proc_self_fd);
  ret = setxattr(procname, name, value, size, flags);
  saverr = ret == -1 ? errno : 0;
-FCHDIR_NOFAIL(lo->root.fd);
  }
  if (switched_creds) {
  if (cap_fsetid_dropped)
@@ -3198,6 +3209,11 @@ static void lo_setxattr(fuse_req_t req, fuse_ino_t ino, const char *in_name,
  lo_restore_cred(&old, false);
  }
  
+if (!open_inode) {
+/* Change CWD back, fchdir should not fail here */
+FCHDIR_NOFAIL(lo->root.fd);
+}
+

This FCHDIR_NOFAIL() will also need to be called if lo_drop_cap_change_cred()
fails.

 ret = lo_drop_cap_change_cred(req, &old, false, "FSETID",
   &cap_fsetid_dropped);
 if (ret) {
 saverr = ret;
 goto out;
 }


Oh, right, thanks!

Max
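
For reference, one shape the fix could take (a sketch only; the label
name is made up here) is to route the failure through a path that
restores the CWD before the common exit:

    ret = lo_drop_cap_change_cred(req, &old, false, "FSETID",
                                  &cap_fsetid_dropped);
    if (ret) {
        saverr = ret;
        goto out_restore_cwd;
    }
    ...
out_restore_cwd:
    if (!open_inode) {
        /* Change CWD back, fchdir should not fail here */
        FCHDIR_NOFAIL(lo->root.fd);
    }
out:
    ...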




Re: [PATCH for-6.2 v3 05/12] job: @force parameter for job_cancel_sync{,_all}()

2021-08-09 Thread Max Reitz

On 06.08.21 21:39, Eric Blake wrote:

On Fri, Aug 06, 2021 at 11:38:52AM +0200, Max Reitz wrote:

Callers should be able to specify whether they want job_cancel_sync() to
force-cancel the job or not.

In fact, almost all invocations do not care about consistency of the
result and just want the job to terminate as soon as possible, so they
should pass force=true.  The replication block driver is the exception.

This changes some iotest outputs, because quitting qemu while a mirror
job is active will now lead to it being cancelled instead of completed,
which is what we want.  (Cancelling a READY mirror job with force=false
may take an indefinite amount of time, which we do not want when
quitting.  If users want consistent results, they must have all jobs be
done before they quit qemu.)

Feels somewhat like a bug fix, but I also understand why you'd prefer
to delay this to 6.2 (it is not a fresh regression, but a longstanding
issue).


It is, hence the “Buglink” tag below.  However, only all of this series 
together really fixes that bug (or at least patches 5+7+9 together), 
just taking one wouldn’t help much.  And together, it’s just too much 
for 6.1 at this point.



Buglink: https://gitlab.com/qemu-project/qemu/-/issues/462
Signed-off-by: Max Reitz 
---
+++ b/job.c
@@ -982,12 +982,24 @@ static void job_cancel_err(Job *job, Error **errp)
  job_cancel(job, false);
  }
  
-int job_cancel_sync(Job *job)
+/**
+ * Same as job_cancel_err(), but force-cancel.
+ */
+static void job_force_cancel_err(Job *job, Error **errp)
  {
-return job_finish_sync(job, &job_cancel_err, NULL);
+job_cancel(job, true);
+}

In isolation, it looks odd that errp is passed but not used.  But
looking further, it's because this is a callback that must have a
given signature, so it's okay.

Reviewed-by: Eric Blake 






Re: [PATCH for-6.2 v3 01/12] job: Context changes in job_completed_txn_abort()

2021-08-09 Thread Max Reitz

On 06.08.21 21:16, Eric Blake wrote:

On Fri, Aug 06, 2021 at 11:38:48AM +0200, Max Reitz wrote:

Finalizing the job may cause its AioContext to change.  This is noted by
job_exit(), which points at job_txn_apply() to take this fact into
account.

However, job_completed() does not necessarily invoke job_txn_apply()
(through job_completed_txn_success()), but potentially also
job_completed_txn_abort().  The latter stores the context in a local
variable, and so always acquires the same context at its end that it has
released in the beginning -- which may be a different context from the
one that job_exit() releases at its end.  If it is different, qemu
aborts ("qemu_mutex_unlock_impl: Operation not permitted").

Is this a bug fix that needs to make it into 6.1?


Well, I only encountered it as part of this series (which I really don’t 
think is 6.1 material at this point), and so I don’t know.


Can’t hurt, I suppose, but if we wanted this to be in 6.1, we’d better 
have a specific test for it, I think.



Drop the local @outer_ctx variable from job_completed_txn_abort(), and
instead re-acquire the actual job's context at the end of the function,
so job_exit() will release the same.

Signed-off-by: Max Reitz 
---
  job.c | 23 ++-
  1 file changed, 18 insertions(+), 5 deletions(-)

The commit message makes sense, and does a good job at explaining the
change.  I'm still a bit fuzzy on how jobs are supposed to play nice
with contexts,


I can relate :)


but since your patch matches the commit message, I'm
happy to give:

Reviewed-by: Eric Blake 


Thanks!




[PATCH for-6.2 v3 12/12] iotests: Add mirror-ready-cancel-error test

2021-08-06 Thread Max Reitz
Test what happens when there is an I/O error after a mirror job in the
READY phase has been cancelled.

Signed-off-by: Max Reitz 
Reviewed-by: Vladimir Sementsov-Ogievskiy 
Tested-by: Vladimir Sementsov-Ogievskiy 
---
 .../tests/mirror-ready-cancel-error   | 143 ++
 .../tests/mirror-ready-cancel-error.out   |   5 +
 2 files changed, 148 insertions(+)
 create mode 100755 tests/qemu-iotests/tests/mirror-ready-cancel-error
 create mode 100644 tests/qemu-iotests/tests/mirror-ready-cancel-error.out

diff --git a/tests/qemu-iotests/tests/mirror-ready-cancel-error b/tests/qemu-iotests/tests/mirror-ready-cancel-error
new file mode 100755
index 00..f2dc1f
--- /dev/null
+++ b/tests/qemu-iotests/tests/mirror-ready-cancel-error
@@ -0,0 +1,143 @@
+#!/usr/bin/env python3
+# group: rw quick
+#
+# Test what happens when errors occur to a mirror job after it has
+# been cancelled in the READY phase
+#
+# Copyright (C) 2021 Red Hat, Inc.
+#
+# This program is free software; you can redistribute it and/or modify
+# it under the terms of the GNU General Public License as published by
+# the Free Software Foundation; either version 2 of the License, or
+# (at your option) any later version.
+#
+# This program is distributed in the hope that it will be useful,
+# but WITHOUT ANY WARRANTY; without even the implied warranty of
+# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+# GNU General Public License for more details.
+#
+# You should have received a copy of the GNU General Public License
+# along with this program.  If not, see <http://www.gnu.org/licenses/>.
+#
+
+import os
+import iotests
+
+
+image_size = 1 * 1024 * 1024
+source = os.path.join(iotests.test_dir, 'source.img')
+target = os.path.join(iotests.test_dir, 'target.img')
+
+
+class TestMirrorReadyCancelError(iotests.QMPTestCase):
+def setUp(self) -> None:
+assert iotests.qemu_img_create('-f', iotests.imgfmt, source,
+   str(image_size)) == 0
+assert iotests.qemu_img_create('-f', iotests.imgfmt, target,
+   str(image_size)) == 0
+
+self.vm = iotests.VM()
+self.vm.launch()
+
+def tearDown(self) -> None:
+self.vm.shutdown()
+os.remove(source)
+os.remove(target)
+
+def add_blockdevs(self, once: bool) -> None:
+res = self.vm.qmp('blockdev-add',
+  **{'node-name': 'source',
+ 'driver': iotests.imgfmt,
+ 'file': {
+ 'driver': 'file',
+ 'filename': source
+ }})
+self.assert_qmp(res, 'return', {})
+
+# blkdebug notes:
+# Enter state 2 on the first flush, which happens before the
+# job enters the READY state.  The second flush will happen
+# when the job is about to complete, and we want that one to
+# fail.
+res = self.vm.qmp('blockdev-add',
+  **{'node-name': 'target',
+ 'driver': iotests.imgfmt,
+ 'file': {
+ 'driver': 'blkdebug',
+ 'image': {
+ 'driver': 'file',
+ 'filename': target
+ },
+ 'set-state': [{
+ 'event': 'flush_to_disk',
+ 'state': 1,
+ 'new_state': 2
+ }],
+ 'inject-error': [{
+ 'event': 'flush_to_disk',
+ 'once': once,
+ 'immediately': True,
+ 'state': 2
+ }]}})
+self.assert_qmp(res, 'return', {})
+
+def start_mirror(self) -> None:
+res = self.vm.qmp('blockdev-mirror',
+  job_id='mirror',
+  device='source',
+  target='target',
+  filter_node_name='mirror-top',
+  sync='full',
+  on_target_error='stop')
+self.assert_qmp(res, 'return', {})
+
+def cancel_mirror_with_error(self) -> None:
+self.vm.event_wait('BLOCK_JOB_READY')
+
+# Write something so we will not leave the job immediately, but
+# flush first (which will fail, thanks to blkdebug)
+res = self.vm.qmp('human-monitor-command',
+  command_line='qemu-io mirror-top "write 0 64k"')
+self.assert_qmp(res, 'return', '')
+
+# Drain status change events
+while self.vm.event_wait('

[PATCH for-6.2 v3 11/12] mirror: Do not clear .cancelled

2021-08-06 Thread Max Reitz
Clearing .cancelled before leaving the main loop when the job has been
soft-cancelled is no longer necessary since job_is_cancelled() only
returns true for jobs that have been force-cancelled.

Therefore, this only makes a difference in places that call
job_cancel_requested().  In block/mirror.c, this is done only before
.cancelled was cleared.

In job.c, there are two callers:
- job_completed_txn_abort() asserts that .cancelled is true, so keeping
  it true will not affect this place.

- job_complete() refuses to let a job complete that has .cancelled set.
  It is correct to refuse to let the user invoke job-complete on mirror
  jobs that have already been soft-cancelled.

With this change, there are no places that reset .cancelled to false and
so we can be sure that .force_cancel can only be true if .cancelled is
true as well.  Assert this in job_is_cancelled().

Signed-off-by: Max Reitz 
---
 block/mirror.c | 2 --
 job.c  | 4 +++-
 2 files changed, 3 insertions(+), 3 deletions(-)

diff --git a/block/mirror.c b/block/mirror.c
index af89c1716a..f94aa52fae 100644
--- a/block/mirror.c
+++ b/block/mirror.c
@@ -939,7 +939,6 @@ static int coroutine_fn mirror_run(Job *job, Error **errp)
 while (!job_cancel_requested(&s->common.job) && !s->should_complete) {
 job_yield(&s->common.job);
 }
-s->common.job.cancelled = false;
 goto immediate_exit;
 }
 
@@ -1078,7 +1077,6 @@ static int coroutine_fn mirror_run(Job *job, Error **errp)
  * completion.
  */
 assert(QLIST_EMPTY(&bs->tracked_requests));
-s->common.job.cancelled = false;
 need_drain = false;
 break;
 }
diff --git a/job.c b/job.c
index 2bd3c946a7..2ce6865ab2 100644
--- a/job.c
+++ b/job.c
@@ -217,7 +217,9 @@ const char *job_type_str(const Job *job)
 
 bool job_is_cancelled(Job *job)
 {
-return job->cancelled && job->force_cancel;
+/* force_cancel may be true only if cancelled is true, too */
+assert(job->cancelled || !job->force_cancel);
+return job->force_cancel;
 }
 
 bool job_cancel_requested(Job *job)
-- 
2.31.1




[PATCH for-6.2 v3 10/12] mirror: Stop active mirroring after force-cancel

2021-08-06 Thread Max Reitz
Once the mirror job is force-cancelled (job_is_cancelled() is true), we
should not generate new I/O requests.  This applies to active mirroring,
too, so stop it once the job is cancelled.

(We must still forward all I/O requests to the source, though, of
course, but those are not really I/O requests generated by the job, so
this is fine.)

Signed-off-by: Max Reitz 
---
 block/mirror.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/block/mirror.c b/block/mirror.c
index bf1d50ff1c..af89c1716a 100644
--- a/block/mirror.c
+++ b/block/mirror.c
@@ -1418,6 +1418,7 @@ static int coroutine_fn bdrv_mirror_top_do_write(BlockDriverState *bs,
 bool copy_to_target;
 
 copy_to_target = s->job->ret >= 0 &&
+ !job_is_cancelled(&s->job->common.job) &&
  s->job->copy_mode == MIRROR_COPY_MODE_WRITE_BLOCKING;
 
 if (copy_to_target) {
@@ -1466,6 +1467,7 @@ static int coroutine_fn bdrv_mirror_top_pwritev(BlockDriverState *bs,
 bool copy_to_target;
 
 copy_to_target = s->job->ret >= 0 &&
+ !job_is_cancelled(&s->job->common.job) &&
  s->job->copy_mode == MIRROR_COPY_MODE_WRITE_BLOCKING;
 
 if (copy_to_target) {
-- 
2.31.1




[PATCH for-6.2 v3 08/12] mirror: Use job_is_cancelled()

2021-08-06 Thread Max Reitz
mirror_drained_poll() returns true whenever the job is cancelled,
because "we [can] be sure that it won't issue more requests".  However,
this is only true for force-cancelled jobs, so use job_is_cancelled().

Signed-off-by: Max Reitz 
---
 block/mirror.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/block/mirror.c b/block/mirror.c
index 72e02fa34e..024fa2dcea 100644
--- a/block/mirror.c
+++ b/block/mirror.c
@@ -1177,7 +1177,7 @@ static bool mirror_drained_poll(BlockJob *job)
  * from one of our own drain sections, to avoid a deadlock waiting for
  * ourselves.
  */
-if (!s->common.job.paused && !s->common.job.cancelled && !s->in_drain) {
+if (!s->common.job.paused && !job_is_cancelled(&job->job) && !s->in_drain) {
 return true;
 }
 
-- 
2.31.1




[PATCH for-6.2 v3 09/12] mirror: Check job_is_cancelled() earlier

2021-08-06 Thread Max Reitz
We must check whether the job is force-cancelled early in our main loop,
most importantly before any `continue` statement.  For example, we used
to have `continue`s before our current checking location that are
triggered by `mirror_flush()` failing.  So, if `mirror_flush()` kept
failing, force-cancelling the job would not terminate it.

Jobs can be cancelled while they yield, and once they are
(force-cancelled), they should not generate new I/O requests.
Therefore, we should put the check after the last yield before
mirror_iteration() is invoked.

Buglink: https://gitlab.com/qemu-project/qemu/-/issues/462
Signed-off-by: Max Reitz 
---
 block/mirror.c | 10 +-
 1 file changed, 5 insertions(+), 5 deletions(-)

diff --git a/block/mirror.c b/block/mirror.c
index 024fa2dcea..bf1d50ff1c 100644
--- a/block/mirror.c
+++ b/block/mirror.c
@@ -1000,6 +1000,11 @@ static int coroutine_fn mirror_run(Job *job, Error **errp)
 
 job_pause_point(&s->common.job);
 
+if (job_is_cancelled(&s->common.job)) {
+ret = 0;
+goto immediate_exit;
+}
+
 cnt = bdrv_get_dirty_count(s->dirty_bitmap);
 /* cnt is the number of dirty bytes remaining and s->bytes_in_flight is
  * the number of bytes currently being processed; together those are
@@ -1078,8 +1083,6 @@ static int coroutine_fn mirror_run(Job *job, Error **errp)
 break;
 }
 
-ret = 0;
-
 if (job_is_ready(&s->common.job) && !should_complete) {
 delay_ns = (s->in_flight == 0 &&
 cnt == 0 ? BLOCK_JOB_SLICE_TIME : 0);
@@ -1087,9 +1090,6 @@ static int coroutine_fn mirror_run(Job *job, Error **errp)
 trace_mirror_before_sleep(s, cnt, job_is_ready(&s->common.job),
   delay_ns);
 job_sleep_ns(&s->common.job, delay_ns);
-if (job_is_cancelled(&s->common.job)) {
-break;
-}
 s->last_pause_ns = qemu_clock_get_ns(QEMU_CLOCK_REALTIME);
 }
 
-- 
2.31.1




[PATCH for-6.2 v3 04/12] job: Force-cancel jobs in a failed transaction

2021-08-06 Thread Max Reitz
When a transaction is aborted, no result matters, and so all jobs within
should be force-cancelled.

Signed-off-by: Max Reitz 
---
 job.c | 7 ++-
 1 file changed, 6 insertions(+), 1 deletion(-)

diff --git a/job.c b/job.c
index 3fe23bb77e..24e7c4fcb7 100644
--- a/job.c
+++ b/job.c
@@ -766,7 +766,12 @@ static void job_completed_txn_abort(Job *job)
 if (other_job != job) {
 ctx = other_job->aio_context;
 aio_context_acquire(ctx);
-job_cancel_async(other_job, false);
+/*
+ * This is a transaction: If one job failed, no result will matter.
+ * Therefore, pass force=true to terminate all other jobs as quickly
+ * as possible.
+ */
+job_cancel_async(other_job, true);
 aio_context_release(ctx);
 }
 }
-- 
2.31.1




[PATCH for-6.2 v3 05/12] job: @force parameter for job_cancel_sync{,_all}()

2021-08-06 Thread Max Reitz
Callers should be able to specify whether they want job_cancel_sync() to
force-cancel the job or not.

In fact, almost all invocations do not care about consistency of the
result and just want the job to terminate as soon as possible, so they
should pass force=true.  The replication block driver is the exception.

This changes some iotest outputs, because quitting qemu while a mirror
job is active will now lead to it being cancelled instead of completed,
which is what we want.  (Cancelling a READY mirror job with force=false
may take an indefinite amount of time, which we do not want when
quitting.  If users want consistent results, they must have all jobs be
done before they quit qemu.)

Buglink: https://gitlab.com/qemu-project/qemu/-/issues/462
Signed-off-by: Max Reitz 
---
 include/qemu/job.h| 10 ++---
 block/replication.c   |  4 +-
 blockdev.c|  4 +-
 job.c | 20 +++--
 qemu-nbd.c|  2 +-
 softmmu/runstate.c|  2 +-
 storage-daemon/qemu-storage-daemon.c  |  2 +-
 tests/unit/test-block-iothread.c  |  2 +-
 tests/unit/test-blockjob.c|  2 +-
 tests/qemu-iotests/109.out| 60 +++
 tests/qemu-iotests/tests/qsd-jobs.out |  2 +-
 11 files changed, 55 insertions(+), 55 deletions(-)

diff --git a/include/qemu/job.h b/include/qemu/job.h
index 41162ed494..5e8edbc2c8 100644
--- a/include/qemu/job.h
+++ b/include/qemu/job.h
@@ -506,19 +506,19 @@ void job_user_cancel(Job *job, bool force, Error **errp);
 
 /**
  * Synchronously cancel the @job.  The completion callback is called
- * before the function returns.  The job may actually complete
- * instead of canceling itself; the circumstances under which this
- * happens depend on the kind of job that is active.
+ * before the function returns.  If @force is false, the job may
+ * actually complete instead of canceling itself; the circumstances
+ * under which this happens depend on the kind of job that is active.
  *
  * Returns the return value from the job if the job actually completed
  * during the call, or -ECANCELED if it was canceled.
  *
  * Callers must hold the AioContext lock of job->aio_context.
  */
-int job_cancel_sync(Job *job);
+int job_cancel_sync(Job *job, bool force);
 
 /** Synchronously cancels all jobs using job_cancel_sync(). */
-void job_cancel_sync_all(void);
+void job_cancel_sync_all(bool force);
 
 /**
  * @job: The job to be completed.
diff --git a/block/replication.c b/block/replication.c
index 32444b9a8f..e7a9327b12 100644
--- a/block/replication.c
+++ b/block/replication.c
@@ -149,7 +149,7 @@ static void replication_close(BlockDriverState *bs)
 if (s->stage == BLOCK_REPLICATION_FAILOVER) {
 commit_job = &s->commit_job->job;
 assert(commit_job->aio_context == qemu_get_current_aio_context());
-job_cancel_sync(commit_job);
+job_cancel_sync(commit_job, false);
 }
 
 if (s->mode == REPLICATION_MODE_SECONDARY) {
@@ -726,7 +726,7 @@ static void replication_stop(ReplicationState *rs, bool failover, Error **errp)
  * disk, secondary disk in backup_job_completed().
  */
 if (s->backup_job) {
-job_cancel_sync(&s->backup_job->job);
+job_cancel_sync(&s->backup_job->job, false);
 }
 
 if (!failover) {
diff --git a/blockdev.c b/blockdev.c
index 3d8ac368a1..aa95918c02 100644
--- a/blockdev.c
+++ b/blockdev.c
@@ -1848,7 +1848,7 @@ static void drive_backup_abort(BlkActionState *common)
 aio_context = bdrv_get_aio_context(state->bs);
 aio_context_acquire(aio_context);
 
-job_cancel_sync(&state->job->job);
+job_cancel_sync(&state->job->job, true);
 
 aio_context_release(aio_context);
 }
@@ -1949,7 +1949,7 @@ static void blockdev_backup_abort(BlkActionState *common)
 aio_context = bdrv_get_aio_context(state->bs);
 aio_context_acquire(aio_context);
 
-job_cancel_sync(&state->job->job);
+job_cancel_sync(&state->job->job, true);
 
 aio_context_release(aio_context);
 }
diff --git a/job.c b/job.c
index 24e7c4fcb7..1b68a7a983 100644
--- a/job.c
+++ b/job.c
@@ -982,12 +982,24 @@ static void job_cancel_err(Job *job, Error **errp)
 job_cancel(job, false);
 }
 
-int job_cancel_sync(Job *job)
+/**
+ * Same as job_cancel_err(), but force-cancel.
+ */
+static void job_force_cancel_err(Job *job, Error **errp)
 {
-return job_finish_sync(job, &job_cancel_err, NULL);
+job_cancel(job, true);
+}
+
+int job_cancel_sync(Job *job, bool force)
+{
+if (force) {
+return job_finish_sync(job, &job_force_cancel_err, NULL);
+} else {
+return job_finish_sync(job, &job_cancel_err, NULL);
+}
 }
 
-void job_cancel_sync_all(void)
+void job_cancel_sync_all(bool force)
 {
 Job *job;
 AioContext *aio_context;
@@ -995,7 +1007,7 @@ 
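
The quoted hunk breaks off here; it presumably just updates the call in
job_cancel_sync_all()'s loop to pass @force through, so the resulting
function would look roughly like this (a sketch, reconstructed from the
signature change above):

void job_cancel_sync_all(bool force)
{
    Job *job;
    AioContext *aio_context;

    while ((job = job_next(NULL))) {
        aio_context = job->aio_context;
        aio_context_acquire(aio_context);
        /* Pass @force through to each job */
        job_cancel_sync(job, force);
        aio_context_release(aio_context);
    }
}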

[PATCH for-6.2 v3 07/12] job: Add job_cancel_requested()

2021-08-06 Thread Max Reitz
Most callers of job_is_cancelled() actually want to know whether the job
is on its way to immediate termination.  For example, we refuse to pause
jobs that are cancelled; but this only makes sense for jobs that are
really actually cancelled.

A mirror job that is cancelled during READY with force=false should
absolutely be allowed to pause.  This "cancellation" (which is actually
a kind of completion) may take an indefinite amount of time, and so
should behave like any job during normal operation.  For example, with
on-target-error=stop, the job should stop on write errors.  (In
contrast, force-cancelled jobs should not get write errors, as they
should just terminate and not do further I/O.)

Therefore, redefine job_is_cancelled() to only return true for jobs that
are force-cancelled (which as of HEAD^ means any job that interprets the
cancellation request as a request for immediate termination), and add
job_cancel_requested() as the general variant, which returns true for
any jobs which have been requested to be cancelled, whether it be
immediately or after an arbitrarily long completion phase.

Finally, here is a justification for how different job_is_cancelled()
invocations are treated by this patch:

- block/mirror.c (mirror_run()):
  - The first invocation is a while loop that should loop until the job
has been cancelled or scheduled for completion.  What kind of cancel
does not matter, only the fact that the job is supposed to end.

  - The second invocation wants to know whether the job has been
soft-cancelled.  Calling job_cancel_requested() is a bit too broad,
but if the job were force-cancelled, we should leave the main loop
as soon as possible anyway, so this should not matter here.

  - The last two invocations already check force_cancel, so they should
continue to use job_is_cancelled().

- block/backup.c, block/commit.c, block/stream.c, anything in tests/:
  These jobs know only force-cancel, so there is no difference between
  job_is_cancelled() and job_cancel_requested().  We can continue using
  job_is_cancelled().

- job.c:
  - job_pause_point(), job_yield(), job_sleep_ns(): Only force-cancelled
jobs should be prevented from being paused.  Continue using 
job_is_cancelled().

  - job_update_rc(), job_finalize_single(), job_finish_sync(): These
functions are all called after the job has left its main loop.  The
mirror job (the only job that can be soft-cancelled) will clear
.cancelled before leaving the main loop if it has been
soft-cancelled.  Therefore, these functions will observe .cancelled
to be true only if the job has been force-cancelled.  We can
continue to use job_is_cancelled().
(Furthermore, conceptually, a soft-cancelled mirror job should not
report to have been cancelled.  It should report completion (see
also the block-job-cancel QAPI documentation).  Therefore, it makes
sense for these functions not to distinguish between a
soft-cancelled mirror job and a job that has completed as normal.)

  - job_completed_txn_abort(): All jobs other than @job have been
force-cancelled.  job_is_cancelled() must be true for them.
Regarding @job itself: job_completed_txn_abort() is mostly called
when the job's return value is not 0.  A soft-cancelled mirror has a
return value of 0, and so will not end up here then.
However, job_cancel() invokes job_completed_txn_abort() if the job
has been deferred to the main loop, which is mostly the case for
completed jobs (which skip the assertion), but not for sure.
To be safe, use job_cancel_requested() in this assertion.

  - job_complete(): This is the function eventually invoked by the user
(through qmp_block_job_complete() or qmp_job_complete(), or
job_complete_sync(), which comes from qemu-img).  The intention here
is to prevent a user from invoking job-complete after the job has
been cancelled.  This should also apply to soft cancelling: After a
mirror job has been soft-cancelled, the user should not be able to
decide otherwise and have it complete as normal (i.e. pivoting to
the target).

Buglink: https://gitlab.com/qemu-project/qemu/-/issues/462
Signed-off-by: Max Reitz 
---
 include/qemu/job.h |  8 +++-
 block/mirror.c | 10 --
 job.c  |  9 +++--
 3 files changed, 18 insertions(+), 9 deletions(-)

diff --git a/include/qemu/job.h b/include/qemu/job.h
index 8aa90f7395..032edf3c5f 100644
--- a/include/qemu/job.h
+++ b/include/qemu/job.h
@@ -436,9 +436,15 @@ const char *job_type_str(const Job *job);
 /** Returns true if the job should not be visible to the management layer. */
 bool job_is_internal(Job *job);
 
-/** Returns whether the job is scheduled for cancellation. */
+/** Returns whether the job is being cancelled. */
 bool job_is_cancelled(Job *job);
 
+/**
+ * Returns whether the job is scheduled for cancellation (at an
+ * indefinite point).
+ */
+bool job_cancel_requested(Job *job);
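
The message breaks off here; for context, the job.c side presumably
amounts to the following (a sketch, consistent with the pre-patch-11
job_is_cancelled() visible in the "Do not clear .cancelled" diff above):

bool job_is_cancelled(Job *job)
{
    /* Only jobs on their way to immediate termination */
    return job->cancelled && job->force_cancel;
}

bool job_cancel_requested(Job *job)
{
    /* Any cancel request, including mirror's soft-cancel completion mode */
    return job->cancelled;
}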

[PATCH for-6.2 v3 03/12] mirror: Drop s->synced

2021-08-06 Thread Max Reitz
As of HEAD^, there is no meaning to s->synced other than whether the job
is READY or not.  job_is_ready() gives us that information, too.

Suggested-by: Vladimir Sementsov-Ogievskiy 
Signed-off-by: Max Reitz 
Reviewed-by: Eric Blake 
Reviewed-by: Vladimir Sementsov-Ogievskiy 
Reviewed-by: Kevin Wolf 
---
 block/mirror.c | 19 +--
 1 file changed, 9 insertions(+), 10 deletions(-)

diff --git a/block/mirror.c b/block/mirror.c
index d73b704473..fcb7b65f93 100644
--- a/block/mirror.c
+++ b/block/mirror.c
@@ -56,7 +56,6 @@ typedef struct MirrorBlockJob {
 bool zero_target;
 MirrorCopyMode copy_mode;
 BlockdevOnError on_source_error, on_target_error;
-bool synced;
 /* Set when the target is synced (dirty bitmap is clean, nothing
  * in flight) and the job is running in active mode */
 bool actively_synced;
@@ -936,7 +935,6 @@ static int coroutine_fn mirror_run(Job *job, Error **errp)
 if (s->bdev_length == 0) {
 /* Transition to the READY state and wait for complete. */
 job_transition_to_ready(&s->common.job);
-s->synced = true;
 s->actively_synced = true;
 while (!job_is_cancelled(&s->common.job) && !s->should_complete) {
 job_yield(&s->common.job);
@@ -1028,7 +1026,7 @@ static int coroutine_fn mirror_run(Job *job, Error **errp)
 should_complete = false;
 if (s->in_flight == 0 && cnt == 0) {
 trace_mirror_before_flush(s);
-if (!s->synced) {
+if (!job_is_ready(&s->common.job)) {
 if (mirror_flush(s) < 0) {
 /* Go check s->ret.  */
 continue;
@@ -1039,7 +1037,6 @@ static int coroutine_fn mirror_run(Job *job, Error **errp)
  * the target in a consistent state.
  */
 job_transition_to_ready(&s->common.job);
-s->synced = true;
 if (s->copy_mode != MIRROR_COPY_MODE_BACKGROUND) {
 s->actively_synced = true;
 }
@@ -1083,14 +1080,15 @@ static int coroutine_fn mirror_run(Job *job, Error **errp)
 
 ret = 0;
 
-if (s->synced && !should_complete) {
+if (job_is_ready(&s->common.job) && !should_complete) {
 delay_ns = (s->in_flight == 0 &&
 cnt == 0 ? BLOCK_JOB_SLICE_TIME : 0);
 }
-trace_mirror_before_sleep(s, cnt, s->synced, delay_ns);
+trace_mirror_before_sleep(s, cnt, job_is_ready(&s->common.job),
+  delay_ns);
 job_sleep_ns(&s->common.job, delay_ns);
 if (job_is_cancelled(&s->common.job) &&
-(!s->synced || s->common.job.force_cancel))
+(!job_is_ready(&s->common.job) || s->common.job.force_cancel))
 {
 break;
 }
@@ -1103,8 +1101,9 @@ immediate_exit:
  * or it was cancelled prematurely so that we do not guarantee that
  * the target is a copy of the source.
  */
-assert(ret < 0 || ((s->common.job.force_cancel || !s->synced) &&
-   job_is_cancelled(&s->common.job)));
+assert(ret < 0 ||
+   ((s->common.job.force_cancel || !job_is_ready(&s->common.job)) &&
+job_is_cancelled(&s->common.job)));
 assert(need_drain);
 mirror_wait_for_all_io(s);
 }
@@ -1127,7 +1126,7 @@ static void mirror_complete(Job *job, Error **errp)
 {
 MirrorBlockJob *s = container_of(job, MirrorBlockJob, common.job);
 
-if (!s->synced) {
+if (!job_is_ready(job)) {
 error_setg(errp, "The active block job '%s' cannot be completed",
job->id);
 return;
-- 
2.31.1




[PATCH for-6.2 v3 02/12] mirror: Keep s->synced on error

2021-08-06 Thread Max Reitz
An error does not take us out of the READY phase, which is what
s->synced signifies.  It does of course mean that source and target are
no longer in sync, but that is what s->actively_synced is for -- s->synced
never meant that source and target are in sync, only that they were at
some point (and at that point we transitioned into the READY phase).

The tangible problem is that we transition to READY once we are in sync
and s->synced is false.  By resetting s->synced here, we will transition
from READY to READY once the error is resolved (if the job keeps
running), and that transition is not allowed.

Signed-off-by: Max Reitz 
Reviewed-by: Vladimir Sementsov-Ogievskiy 
Reviewed-by: Kevin Wolf 
---
 block/mirror.c | 1 -
 1 file changed, 1 deletion(-)

diff --git a/block/mirror.c b/block/mirror.c
index 98fc66eabf..d73b704473 100644
--- a/block/mirror.c
+++ b/block/mirror.c
@@ -121,7 +121,6 @@ typedef enum MirrorMethod {
 static BlockErrorAction mirror_error_action(MirrorBlockJob *s, bool read,
 int error)
 {
-s->synced = false;
 s->actively_synced = false;
 if (read) {
 return block_job_error_action(&s->common, s->on_source_error,
-- 
2.31.1




[PATCH for-6.2 v3 06/12] jobs: Give Job.force_cancel more meaning

2021-08-06 Thread Max Reitz
We largely have two cancel modes for jobs:

First, there is actual cancelling.  The job is terminated as soon as
possible, without trying to reach a consistent result.

Second, we have mirror in the READY state.  Technically, the job is not
really cancelled, but it just is a different completion mode.  The job
can still run for an indefinite amount of time while it tries to reach a
consistent result.

We want to be able to clearly distinguish which cancel mode a job is in
(when it has been cancelled).  We can use Job.force_cancel for this, but
right now it only reflects cancel requests from the user with
force=true, but clearly, jobs that do not even distinguish between
force=false and force=true are effectively always force-cancelled.

So this patch has Job.force_cancel signify whether the job will
terminate as soon as possible (force_cancel=true) or whether it will
effectively remain running despite being "cancelled"
(force_cancel=false).

To this end, we let jobs that provide JobDriver.cancel() tell the
generic job code whether they will terminate as soon as possible or not,
and for jobs that do not provide that method we assume they will.

Signed-off-by: Max Reitz 
Reviewed-by: Eric Blake 
Reviewed-by: Vladimir Sementsov-Ogievskiy 
Reviewed-by: Kevin Wolf 
---
 include/qemu/job.h | 11 ++-
 block/backup.c |  3 ++-
 block/mirror.c | 24 ++--
 job.c  |  6 +-
 4 files changed, 35 insertions(+), 9 deletions(-)

diff --git a/include/qemu/job.h b/include/qemu/job.h
index 5e8edbc2c8..8aa90f7395 100644
--- a/include/qemu/job.h
+++ b/include/qemu/job.h
@@ -253,8 +253,17 @@ struct JobDriver {
 
 /**
  * If the callback is not NULL, it will be invoked in job_cancel_async
+ *
+ * This function must return true if the job will be cancelled
+ * immediately without any further I/O (mandatory if @force is
+ * true), and false otherwise.  This lets the generic job layer
+ * know whether a job has been truly (force-)cancelled, or whether
+ * it is just in a special completion mode (like mirror after
+ * READY).
+ * (If the callback is NULL, the job is assumed to terminate
+ * without I/O.)
  */
-void (*cancel)(Job *job, bool force);
+bool (*cancel)(Job *job, bool force);
 
 
 /** Called when the job is freed */
diff --git a/block/backup.c b/block/backup.c
index bd3614ce70..513e1c8a0b 100644
--- a/block/backup.c
+++ b/block/backup.c
@@ -331,11 +331,12 @@ static void coroutine_fn backup_set_speed(BlockJob *job, int64_t speed)
 }
 }
 
-static void backup_cancel(Job *job, bool force)
+static bool backup_cancel(Job *job, bool force)
 {
 BackupBlockJob *s = container_of(job, BackupBlockJob, common.job);
 
 bdrv_cancel_in_flight(s->target_bs);
+return true;
 }
 
 static const BlockJobDriver backup_job_driver = {
diff --git a/block/mirror.c b/block/mirror.c
index fcb7b65f93..e93631a9f6 100644
--- a/block/mirror.c
+++ b/block/mirror.c
@@ -1087,9 +1087,7 @@ static int coroutine_fn mirror_run(Job *job, Error **errp)
 trace_mirror_before_sleep(s, cnt, job_is_ready(&s->common.job),
   delay_ns);
 job_sleep_ns(&s->common.job, delay_ns);
-if (job_is_cancelled(&s->common.job) &&
-(!job_is_ready(&s->common.job) || s->common.job.force_cancel))
-{
+if (job_is_cancelled(&s->common.job) && s->common.job.force_cancel) {
 break;
 }
 s->last_pause_ns = qemu_clock_get_ns(QEMU_CLOCK_REALTIME);
@@ -1102,7 +1100,7 @@ immediate_exit:
  * the target is a copy of the source.
  */
 assert(ret < 0 ||
-   ((s->common.job.force_cancel || !job_is_ready(&s->common.job)) &&
+   (s->common.job.force_cancel &&
 job_is_cancelled(&s->common.job)));
 assert(need_drain);
 mirror_wait_for_all_io(s);
@@ -1188,14 +1186,27 @@ static bool mirror_drained_poll(BlockJob *job)
 return !!s->in_flight;
 }
 
-static void mirror_cancel(Job *job, bool force)
+static bool mirror_cancel(Job *job, bool force)
 {
 MirrorBlockJob *s = container_of(job, MirrorBlockJob, common.job);
 BlockDriverState *target = blk_bs(s->target);
 
-if (force || !job_is_ready(job)) {
+/*
+ * Before the job is READY, we treat any cancellation like a
+ * force-cancellation.
+ */
+force = force || !job_is_ready(job);
+
+if (force) {
 bdrv_cancel_in_flight(target);
 }
+return force;
+}
+
+static bool commit_active_cancel(Job *job, bool force)
+{
+/* Same as above in mirror_cancel() */
+return force || !job_is_ready(job);
 }
 
 static const BlockJobDriver mirror_job_driver = {
@@ -1225,6 +1236,7 @@ static const BlockJobDriver commit_active_job_driver = {
 .abort  = mirror_abort,
 .pause
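
The message is cut off above.  The job.c side of this patch is not
quoted, but per the commit message it presumably has job_cancel_async()
consume the driver's return value, roughly like this (a sketch; the
surrounding code is assumed):

static void job_cancel_async(Job *job, bool force)
{
    if (job->driver->cancel) {
        force = job->driver->cancel(job, force);
    } else {
        /* No .cancel() means the job will terminate without extra I/O */
        force = true;
    }
    ...
    job->cancelled = true;
    /* Prevent 'force == false' from overriding a previous 'force == true' */
    job->force_cancel |= force;
}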

[PATCH for-6.2 v3 00/12] mirror: Handle errors after READY cancel

2021-08-06 Thread Max Reitz
Hi,

v1 cover letter:
https://lists.nongnu.org/archive/html/qemu-block/2021-07/msg00705.html

v2 cover letter:
https://lists.nongnu.org/archive/html/qemu-block/2021-07/msg00747.html

Changes in v3:
- Patch 1: After adding patch 11, I got a failed assertion in
  tests/unit/test-block-iothread (failing qemu_mutex_unlock_impl()).
  That is because before patch 11, for zero-length source devices,
  mirror clears .cancelled unconditionally before exiting.  So even
  force-cancelled jobs are considered to be completed normally, which
  doesn’t seem quite right.
  Anyway, test-block-iothread does some iothread switching, and
  cancelling jobs is not really prepared for that.  This patch fixes
  that (I hope...).

- Patch 4: Split off from patch 5

- Patch 7:
  - Added a long section in the commit message detailing every choice
for every job_is_cancelled() invocation
  - Use job_cancel_requested() in the assertion in
job_completed_txn_abort(), because it is not quite clear whether
soft-cancelled mirror jobs can end up in this path (it seems like a
bug if that happens, but I think that’s something to fix in some
other series)

- Patch 8: Added: This is kind of preparation for patch 9, but also just
  a bug fix in itself, I believe

- Patch 9: Moved the job_is_cancelled() check after the last yield point
  before the mirror_iteration() call

- Patch 10: Added: If force-cancelled jobs should not generate new I/O
  requests at all (except for forwarding something to the source
  device), then we need to stop doing active mirroring once the mirror
  job is force-cancelled

- Patch 11: Added: Clearing .cancelled seemed like a hack, so getting
  rid of it seems like a good thing to do
  (And only with this patch, I can assert that .force_cancel can only be
  true when .cancelled is true also; if we tried it before this patch,
  tests/unit/test-block-iothread would fail.)


The discussion around v2 has shown that there are probably more bugs in
the job code, but I think this series is becoming long enough that we
should tackle those in a different series.


git-backport-diff against v1:

Key:
[----] : patches are identical
[dddd] : number of functional differences between upstream/downstream patch
[down] : patch is downstream-only
The flags [FC] indicate (F)unctional and (C)ontextual differences, respectively

001/12:[down] 'job: Context changes in job_completed_txn_abort()'
002/12:[----] [--] 'mirror: Keep s->synced on error'
003/12:[----] [--] 'mirror: Drop s->synced'
004/12:[down] 'job: Force-cancel jobs in a failed transaction'
005/12:[0007] [FC] 'job: @force parameter for job_cancel_sync{,_all}()'
006/12:[----] [--] 'jobs: Give Job.force_cancel more meaning'
007/12:[0002] [FC] 'job: Add job_cancel_requested()'
008/12:[down] 'mirror: Use job_is_cancelled()'
009/12:[0007] [FC] 'mirror: Check job_is_cancelled() earlier'
010/12:[down] 'mirror: Stop active mirroring after force-cancel'
011/12:[down] 'mirror: Do not clear .cancelled'
012/12:[----] [--] 'iotests: Add mirror-ready-cancel-error test'


Max Reitz (12):
  job: Context changes in job_completed_txn_abort()
  mirror: Keep s->synced on error
  mirror: Drop s->synced
  job: Force-cancel jobs in a failed transaction
  job: @force parameter for job_cancel_sync{,_all}()
  jobs: Give Job.force_cancel more meaning
  job: Add job_cancel_requested()
  mirror: Use job_is_cancelled()
  mirror: Check job_is_cancelled() earlier
  mirror: Stop active mirroring after force-cancel
  mirror: Do not clear .cancelled
  iotests: Add mirror-ready-cancel-error test

 include/qemu/job.h|  29 +++-
 block/backup.c|   3 +-
 block/mirror.c|  56 ---
 block/replication.c   |   4 +-
 blockdev.c|   4 +-
 job.c |  67 ++--
 qemu-nbd.c|   2 +-
 softmmu/runstate.c|   2 +-
 storage-daemon/qemu-storage-daemon.c  |   2 +-
 tests/unit/test-block-iothread.c  |   2 +-
 tests/unit/test-blockjob.c|   2 +-
 tests/qemu-iotests/109.out|  60 +++-
 .../tests/mirror-ready-cancel-error   | 143 ++
 .../tests/mirror-ready-cancel-error.out   |   5 +
 tests/qemu-iotests/tests/qsd-jobs.out |   2 +-
 15 files changed, 292 insertions(+), 91 deletions(-)
 create mode 100755 tests/qemu-iotests/tests/mirror-ready-cancel-error
 create mode 100644 tests/qemu-iotests/tests/mirror-ready-cancel-error.out

-- 
2.31.1




[PATCH for-6.2 v3 01/12] job: Context changes in job_completed_txn_abort()

2021-08-06 Thread Max Reitz
Finalizing the job may cause its AioContext to change.  This is noted by
job_exit(), which points at job_txn_apply() to take this fact into
account.

However, job_completed() does not necessarily invoke job_txn_apply()
(through job_completed_txn_success()), but potentially also
job_completed_txn_abort().  The latter stores the context in a local
variable, and so always acquires the same context at its end that it has
released in the beginning -- which may be a different context from the
one that job_exit() releases at its end.  If it is different, qemu
aborts ("qemu_mutex_unlock_impl: Operation not permitted").

Drop the local @outer_ctx variable from job_completed_txn_abort(), and
instead re-acquire the actual job's context at the end of the function,
so job_exit() will release the same.

Signed-off-by: Max Reitz 
---
 job.c | 23 ++-
 1 file changed, 18 insertions(+), 5 deletions(-)

diff --git a/job.c b/job.c
index e7a5d28854..3fe23bb77e 100644
--- a/job.c
+++ b/job.c
@@ -737,7 +737,6 @@ static void job_cancel_async(Job *job, bool force)
 
 static void job_completed_txn_abort(Job *job)
 {
-AioContext *outer_ctx = job->aio_context;
 AioContext *ctx;
 JobTxn *txn = job->txn;
 Job *other_job;
@@ -751,10 +750,14 @@ static void job_completed_txn_abort(Job *job)
 txn->aborting = true;
 job_txn_ref(txn);
 
-/* We can only hold the single job's AioContext lock while calling
+/*
+ * We can only hold the single job's AioContext lock while calling
  * job_finalize_single() because the finalization callbacks can involve
- * calls of AIO_WAIT_WHILE(), which could deadlock otherwise. */
-aio_context_release(outer_ctx);
+ * calls of AIO_WAIT_WHILE(), which could deadlock otherwise.
+ * Note that the job's AioContext may change when it is finalized.
+ */
+job_ref(job);
+aio_context_release(job->aio_context);
 
 /* Other jobs are effectively cancelled by us, set the status for
  * them; this job, however, may or may not be cancelled, depending
@@ -769,6 +772,10 @@ static void job_completed_txn_abort(Job *job)
 }
 while (!QLIST_EMPTY(&txn->jobs)) {
 other_job = QLIST_FIRST(&txn->jobs);
+/*
+ * The job's AioContext may change, so store it in @ctx so we
+ * release the same context that we have acquired before.
+ */
 ctx = other_job->aio_context;
 aio_context_acquire(ctx);
 if (!job_is_completed(other_job)) {
@@ -779,7 +786,13 @@ static void job_completed_txn_abort(Job *job)
 aio_context_release(ctx);
 }
 
-aio_context_acquire(outer_ctx);
+/*
+ * Use job_ref()/job_unref() so we can read the AioContext here
+ * even if the job went away during job_finalize_single().
+ */
+ctx = job->aio_context;
+job_unref(job);
+aio_context_acquire(ctx);
 
 job_txn_unref(txn);
 }
-- 
2.31.1




[PATCH] gluster: Align block-status tail

2021-08-05 Thread Max Reitz
gluster's block-status implementation is basically a copy of that in
block/file-posix.c, there is only one thing missing, and that is
aligning trailing data extents to the request alignment (as added by
commit 9c3db310ff0).

Note that 9c3db310ff0 mentions that "there seems to be no other block
driver that sets request_alignment and [...]", but while block/gluster.c
does indeed not set request_alignment, block/io.c's
bdrv_refresh_limits() will still default to an alignment of 512 because
block/gluster.c does not provide a byte-aligned read function.
Therefore, unaligned tails can conceivably occur, and so we should apply
the change from 9c3db310ff0 to gluster's block-status implementation.

Reported-by: Vladimir Sementsov-Ogievskiy 
Signed-off-by: Max Reitz 
---
 block/gluster.c | 16 
 1 file changed, 16 insertions(+)

diff --git a/block/gluster.c b/block/gluster.c
index e8ee14c8e9..48a04417cf 100644
--- a/block/gluster.c
+++ b/block/gluster.c
@@ -1477,6 +1477,8 @@ static int coroutine_fn qemu_gluster_co_block_status(BlockDriverState *bs,
 off_t data = 0, hole = 0;
 int ret = -EINVAL;
 
+assert(QEMU_IS_ALIGNED(offset | bytes, bs->bl.request_alignment));
+
 if (!s->fd) {
 return ret;
 }
@@ -1501,6 +1503,20 @@ static int coroutine_fn qemu_gluster_co_block_status(BlockDriverState *bs,
 /* On a data extent, compute bytes to the end of the extent,
  * possibly including a partial sector at EOF. */
 *pnum = MIN(bytes, hole - offset);
+
+/*
+ * We are not allowed to return partial sectors, though, so
+ * round up if necessary.
+ */
+if (!QEMU_IS_ALIGNED(*pnum, bs->bl.request_alignment)) {
+int64_t file_length = qemu_gluster_getlength(bs);
+if (file_length > 0) {
+/* Ignore errors, this is just a safeguard */
+assert(hole == file_length);
+}
+*pnum = ROUND_UP(*pnum, bs->bl.request_alignment);
+}
+
 ret = BDRV_BLOCK_DATA;
 } else {
 /* On a hole, compute bytes to the beginning of the next extent.  */
-- 
2.31.1




Re: [PATCH for-6.1? v2 5/7] job: Add job_cancel_requested()

2021-08-04 Thread Max Reitz

On 04.08.21 12:34, Kevin Wolf wrote:

[ Peter, the question for you is at the end. ]

Am 04.08.2021 um 10:07 hat Max Reitz geschrieben:

On 03.08.21 16:25, Kevin Wolf wrote:

Am 26.07.2021 um 16:46 hat Max Reitz geschrieben:

Most callers of job_is_cancelled() actually want to know whether the job
is on its way to immediate termination.  For example, we refuse to pause
jobs that are cancelled; but this only makes sense for jobs that are
really actually cancelled.

A mirror job that is cancelled during READY with force=false should
absolutely be allowed to pause.  This "cancellation" (which is actually
a kind of completion) may take an indefinite amount of time, and so
should behave like any job during normal operation.  For example, with
on-target-error=stop, the job should stop on write errors.  (In
contrast, force-cancelled jobs should not get write errors, as they
should just terminate and not do further I/O.)

Therefore, redefine job_is_cancelled() to only return true for jobs that
are force-cancelled (which as of HEAD^ means any job that interprets the
cancellation request as a request for immediate termination), and add
job_cancel_request() as the general variant, which returns true for any
jobs which have been requested to be cancelled, whether it be
immediately or after an arbitrarily long completion phase.

Buglink: https://gitlab.com/qemu-project/qemu/-/issues/462
Signed-off-by: Max Reitz 
---
   include/qemu/job.h |  8 +++-
   block/mirror.c | 10 --
   job.c  |  7 ++-
   3 files changed, 17 insertions(+), 8 deletions(-)

diff --git a/include/qemu/job.h b/include/qemu/job.h
index 8aa90f7395..032edf3c5f 100644
--- a/include/qemu/job.h
+++ b/include/qemu/job.h
@@ -436,9 +436,15 @@ const char *job_type_str(const Job *job);
   /** Returns true if the job should not be visible to the management layer. */
   bool job_is_internal(Job *job);
-/** Returns whether the job is scheduled for cancellation. */
+/** Returns whether the job is being cancelled. */
   bool job_is_cancelled(Job *job);
+/**
+ * Returns whether the job is scheduled for cancellation (at an
+ * indefinite point).
+ */
+bool job_cancel_requested(Job *job);

I don't think non-force blockdev-cancel for mirror should actually be
considered cancellation, so what is the question that this function
answers?

"Is this a cancelled job, or a mirror block job that is supposed to
complete soon, but only if it doesn't switch over the users to the
target on completion"?

Well, technically yes, but it was more intended as “Has the user ever
invoked (block-)job-cancel on this job?”.

I understand this, but is this much more useful to know than "Has the
user ever called HMP 'change'?", if you know what I mean?


Hm.  Not really.  It’s still a crutch that shouldn’t be there ideally.

But I like this crutch for this series so I can get this batch done, and 
then worry about all the other bugs that keep popping up (and where 
job_cancel_requested() is a nice sign that something’s off).



Is this ever a reasonable question to ask, except maybe inside the
mirror implementation itself?

I asked myself the same for v3, but found two places in job.c where I
would like to keep it:

First, there’s an assertion in job_completed_txn_abort().  All jobs
other than @job have been force-cancelled, and so job_is_cancelled()
would be true for them.  As for @job itself, the function is mostly
called when the job’s return value is not 0, but a soft-cancelled
mirror does have a return value of 0 and so would not end up in that
function.
But job_cancel() invokes job_completed_txn_abort() if the job has been
deferred to the main loop, which mostly correlates with the job having
been completed (in which case the assertion is skipped), but not 100 %
(there’s a small window between setting deferred_to_main_loop and the
job changing to a completed state).
So I’d prefer to keep the assertion as-is functionally, i.e. to only
check job->cancelled.

Well, you don't. It's still job_is_cancelled() after this patch.


No: I didn’t. O:)

For v3, I had absolutely planned to use job_cancel_requested(), and I 
wanted to put the above explanation into the commit message.



So the scenario you're concerned about is a job that has just finished
successfully (job->ret = 0) and then gets a cancel request?


Yes.


With force=false, I'm pretty sure the code is wrong anyway because
calling job_completed_txn_abort() is not the right response.


Absolutely possible, I just didn’t want to deal with this, too… :/


It should
return an error because you're trying to complete twice, possibly with
conflicting completion modes. Second best is just ignoring the cancel
request because we obviously already fulfilled the request of completing
the job (the completion mode might be different, though).

With force=true, arguably still letting the job fail is correct.
However, letting it fail involves more than just letting the tra

Re: [PATCH for-6.1? v2 6/7] mirror: Check job_is_cancelled() earlier

2021-08-04 Thread Max Reitz

On 04.08.21 11:48, Kevin Wolf wrote:

Am 04.08.2021 um 10:25 hat Max Reitz geschrieben:

On 03.08.21 16:34, Kevin Wolf wrote:

Am 26.07.2021 um 16:46 hat Max Reitz geschrieben:

We must check whether the job is force-cancelled early in our main loop,
most importantly before any `continue` statement.  For example, we used
to have `continue`s before our current checking location that are
triggered by `mirror_flush()` failing.  So, if `mirror_flush()` kept
failing, force-cancelling the job would not terminate it.

A job being force-cancelled should be treated the same as the job having
failed, so put the check in the same place where we check `s->ret < 0`.

Buglink: https://gitlab.com/qemu-project/qemu/-/issues/462
Signed-off-by: Max Reitz 
---
   block/mirror.c | 7 +--
   1 file changed, 1 insertion(+), 6 deletions(-)

diff --git a/block/mirror.c b/block/mirror.c
index 72e02fa34e..46d1a1e5a2 100644
--- a/block/mirror.c
+++ b/block/mirror.c
@@ -993,7 +993,7 @@ static int coroutine_fn mirror_run(Job *job, Error **errp)
   mirror_wait_for_any_operation(s, true);
   }
-if (s->ret < 0) {
+if (s->ret < 0 || job_is_cancelled(&s->common.job)) {
   ret = s->ret;
   goto immediate_exit;
   }
@@ -1078,8 +1078,6 @@ static int coroutine_fn mirror_run(Job *job, Error **errp)
   break;
   }
-ret = 0;
-
   if (job_is_ready(&s->common.job) && !should_complete) {
   delay_ns = (s->in_flight == 0 &&
   cnt == 0 ? BLOCK_JOB_SLICE_TIME : 0);
   }
   trace_mirror_before_sleep(s, cnt, job_is_ready(&s->common.job),
 delay_ns);
   job_sleep_ns(&s->common.job, delay_ns);
-if (job_is_cancelled(&s->common.job)) {
-break;
-}

I think it was intentional that the check is here because it means
skipping the job_sleep_ns() and instead cancelling immediately, and we
probably still want that. Between your check above and here, the
coroutine can yield, so cancellation could have been newly requested.

I’m afraid I don’t quite understand.

Hm, I don't either. Somehow I thought job_sleep_ns() was after the
check, while quoting the exact hunk that shows that it comes before
it...

I'm still not sure if sleeping before exiting is really useful, but it
seems we never cared about that.


Jobs that are (force-)cancelled cannot yield or sleep anyway 
(job_sleep_ns(), job_yield(), and job_pause_point() will all return 
immediately when called on a cancelled job).
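
(A minimal sketch of the early-return behavior I mean; the actual job.c
code differs in details such as pause handling:)

    void coroutine_fn job_sleep_ns(Job *job, int64_t ns)
    {
        assert(job->busy);

        /*
         * A cancelled job must not sleep: return immediately so it can
         * reach its main loop's cancellation check and terminate.
         */
        if (job_is_cancelled(job)) {
            return;
        }

        job_do_yield(job, qemu_clock_get_ns(QEMU_CLOCK_REALTIME) + ns);
        job_pause_point(job);
    }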


So I thought you meant that a job can only be cancelled while it is 
yielding, so we should prefer to put the is_cancelled check after a 
yield point (like job_pause_point()) rather than before it.


But I mean, if you’re happy, I’ll be happy, too. :)

Max




Re: [PATCH for-6.1? v2 6/7] mirror: Check job_is_cancelled() earlier

2021-08-04 Thread Max Reitz

On 03.08.21 16:34, Kevin Wolf wrote:

Am 26.07.2021 um 16:46 hat Max Reitz geschrieben:

We must check whether the job is force-cancelled early in our main loop,
most importantly before any `continue` statement.  For example, we used
to have `continue`s before our current checking location that are
triggered by `mirror_flush()` failing.  So, if `mirror_flush()` kept
failing, force-cancelling the job would not terminate it.

A job being force-cancelled should be treated the same as the job having
failed, so put the check in the same place where we check `s->ret < 0`.

Buglink: https://gitlab.com/qemu-project/qemu/-/issues/462
Signed-off-by: Max Reitz 
---
  block/mirror.c | 7 +--
  1 file changed, 1 insertion(+), 6 deletions(-)

diff --git a/block/mirror.c b/block/mirror.c
index 72e02fa34e..46d1a1e5a2 100644
--- a/block/mirror.c
+++ b/block/mirror.c
@@ -993,7 +993,7 @@ static int coroutine_fn mirror_run(Job *job, Error **errp)
  mirror_wait_for_any_operation(s, true);
  }
  
-if (s->ret < 0) {

+if (s->ret < 0 || job_is_cancelled(&s->common.job)) {
  ret = s->ret;
  goto immediate_exit;
  }
@@ -1078,8 +1078,6 @@ static int coroutine_fn mirror_run(Job *job, Error **errp)
  break;
  }
  
-ret = 0;

-
  if (job_is_ready(&s->common.job) && !should_complete) {
  delay_ns = (s->in_flight == 0 &&
  cnt == 0 ? BLOCK_JOB_SLICE_TIME : 0);
@@ -1087,9 +1085,6 @@ static int coroutine_fn mirror_run(Job *job, Error **errp)
  trace_mirror_before_sleep(s, cnt, job_is_ready(&s->common.job),
delay_ns);
  job_sleep_ns(&s->common.job, delay_ns);
-if (job_is_cancelled(&s->common.job)) {
-break;
-}

I think it was intentional that the check is here because it means
skipping the job_sleep_ns() and instead cancelling immediately, and we
probably still want that. Between your check above and here, the
coroutine can yield, so cancellation could have been newly requested.


I’m afraid I don’t quite understand.  If cancel is requested in 
job_sleep_ns(), then we will go back to the top of the loop, wait for 
in-flight active requests and then break.  Waiting for the in-flight 
requests seems unnecessary, but does it really make a difference in 
practice?  We don’t start new requests, so it should be legal to wait 
for existing ones to settle, and also I believe someone will have to 
wait for those in-flight requests anyway (when the mirror top node is 
removed).  (The only thing we could do is to cancel the in-flight 
requests, but that is what mirror_cancel() does.)


Looking more at the whole loop, there are a couple of places that can 
yield.  Of course we can check whether the job has been cancelled after 
every single one of them, but that would be a bit strange.  We only 
really need to check before we initiate new requests or want to change 
the state.  I believe the right place to do the check would be after the 
job_pause_point().
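
(Roughly like this; a sketch of the loop shape, not the literal patch:)

    for (;;) {
        job_pause_point(&s->common.job);

        /*
         * Check right after the pause point: we may have yielded in it,
         * and a force-cancel may have been requested in the meantime.
         * Only below this point do we initiate new requests or change
         * state.
         */
        if (s->ret < 0 || job_is_cancelled(&s->common.job)) {
            ret = s->ret;
            goto immediate_exit;
        }

        /* ... rest of the iteration ... */
    }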


And perhaps the active write functions (bdrv_mirror_top_do_write() and 
bdrv_mirror_top_pwritev()) should stop copying to the target if the job 
has been cancelled.


Max


So have the check in both places, I guess? And a comment to explain why
neither is redundant.


  s->last_pause_ns = qemu_clock_get_ns(QEMU_CLOCK_REALTIME);
  }

Kevin






Re: [PATCH for-6.1? v2 5/7] job: Add job_cancel_requested()

2021-08-04 Thread Max Reitz

On 03.08.21 16:25, Kevin Wolf wrote:

Am 26.07.2021 um 16:46 hat Max Reitz geschrieben:

Most callers of job_is_cancelled() actually want to know whether the job
is on its way to immediate termination.  For example, we refuse to pause
jobs that are cancelled; but this only makes sense for jobs that are
really actually cancelled.

A mirror job that is cancelled during READY with force=false should
absolutely be allowed to pause.  This "cancellation" (which is actually
a kind of completion) may take an indefinite amount of time, and so
should behave like any job during normal operation.  For example, with
on-target-error=stop, the job should stop on write errors.  (In
contrast, force-cancelled jobs should not get write errors, as they
should just terminate and not do further I/O.)

Therefore, redefine job_is_cancelled() to only return true for jobs that
are force-cancelled (which as of HEAD^ means any job that interprets the
cancellation request as a request for immediate termination), and add
job_cancel_request() as the general variant, which returns true for any
jobs which have been requested to be cancelled, whether it be
immediately or after an arbitrarily long completion phase.

Buglink: https://gitlab.com/qemu-project/qemu/-/issues/462
Signed-off-by: Max Reitz 
---
  include/qemu/job.h |  8 +++-
  block/mirror.c | 10 --
  job.c  |  7 ++-
  3 files changed, 17 insertions(+), 8 deletions(-)

diff --git a/include/qemu/job.h b/include/qemu/job.h
index 8aa90f7395..032edf3c5f 100644
--- a/include/qemu/job.h
+++ b/include/qemu/job.h
@@ -436,9 +436,15 @@ const char *job_type_str(const Job *job);
  /** Returns true if the job should not be visible to the management layer. */
  bool job_is_internal(Job *job);
  
-/** Returns whether the job is scheduled for cancellation. */

+/** Returns whether the job is being cancelled. */
  bool job_is_cancelled(Job *job);
  
+/**

+ * Returns whether the job is scheduled for cancellation (at an
+ * indefinite point).
+ */
+bool job_cancel_requested(Job *job);

I don't think non-force blockdev-cancel for mirror should actually be
considered cancellation, so what is the question that this function
answers?

"Is this a cancelled job, or a mirror block job that is supposed to
complete soon, but only if it doesn't switch over the users to the
target on completion"?


Well, technically yes, but it was more intended as “Has the user ever 
invoked (block-)job-cancel on this job?”.



Is this ever a reasonable question to ask, except maybe inside the
mirror implementation itself?


I asked myself the same for v3, but found two places in job.c where I 
would like to keep it:


First, there’s an assertion in job_completed_txn_abort().  All jobs 
other than @job have been force-cancelled, and so job_is_cancelled() 
would be true for them.  As for @job itself, the function is mostly 
called when the job’s return value is not 0, but a soft-cancelled mirror 
does have a return value of 0 and so would not end up in that function.
But job_cancel() invokes job_completed_txn_abort() if the job has been 
deferred to the main loop, which mostly correlates with the job having 
been completed (in which case the assertion is skipped), but not 100 % 
(there’s a small window between setting deferred_to_main_loop and the 
job changing to a completed state).
So I’d prefer to keep the assertion as-is functionally, i.e. to only 
check job->cancelled.
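
(With the renaming from this series, the assertion I mean would keep the
following shape; a sketch, not the exact code:)

    if (!job_is_completed(other_job)) {
        /* Reads job->cancelled only, i.e. not job->force_cancel */
        assert(job_cancel_requested(other_job));
        job_finalize_single(other_job);
    }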


Second, job_complete() refuses to let a job complete that has been 
cancelled.  This function is basically only invoked by the user (through 
qmp_block_job_complete()/qmp_job_complete(), or job_complete_sync(), 
which comes from qemu-img), so I believe that it should correspond to 
the external interface we have right now; i.e., if the user has invoked 
(block-)job-cancel at one point, job_complete() should generally return 
an error.
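
(For reference, a sketch of what that check amounts to after this
series; helper names as introduced above:)

    void job_complete(Job *job, Error **errp)
    {
        /* Should not be reachable via external interface for internal jobs */
        assert(job->id);
        if (job_apply_verb(job, JOB_VERB_COMPLETE, errp)) {
            return;
        }
        if (job_cancel_requested(job) || !job->driver->complete) {
            error_setg(errp, "The active block job '%s' cannot be completed",
                       job->id);
            return;
        }

        job->driver->complete(job, errp);
    }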



job_complete() is the only function outside of mirror that seems to use
it. But even there, it feels wrong to make a difference. Either we
accept redundant completion requests, or we don't. It doesn't really
matter how the job reconfigures the graph on completion. (Also, I feel
this should really have been part of the state machine, but I'm not sure
if we want to touch it now...)


Well, yes, I don’t think it makes a difference because I don’t think 
anyone will first tell the job via block-job-cancel to complete without 
pivoting, and then change their mind and call block-job-complete after 
all.  (Not least because that’s an error pre-series.)


Also, I’m not even sure whether completing after a soft cancel request 
works.  I don’t think any of our code accounts for such a case, so I’d 
rather avoid allowing it if there’s no need to allow it anyway.


Max




Re: [PATCH for-6.1? v2 5/7] job: Add job_cancel_requested()

2021-08-02 Thread Max Reitz

On 27.07.21 17:47, Vladimir Sementsov-Ogievskiy wrote:

27.07.2021 18:39, Max Reitz wrote:

On 27.07.21 15:04, Vladimir Sementsov-Ogievskiy wrote:

26.07.2021 17:46, Max Reitz wrote:
Most callers of job_is_cancelled() actually want to know whether 
the job
is on its way to immediate termination.  For example, we refuse to 
pause

jobs that are cancelled; but this only makes sense for jobs that are
really actually cancelled.

A mirror job that is cancelled during READY with force=false should
absolutely be allowed to pause.  This "cancellation" (which is 
actually

a kind of completion) may take an indefinite amount of time, and so
should behave like any job during normal operation.  For example, with
on-target-error=stop, the job should stop on write errors. (In
contrast, force-cancelled jobs should not get write errors, as they
should just terminate and not do further I/O.)

Therefore, redefine job_is_cancelled() to only return true for jobs 
that
are force-cancelled (which as of HEAD^ means any job that 
interprets the

cancellation request as a request for immediate termination), and add
job_cancel_request() as the general variant, which returns true for 
any


job_cancel_requested()


jobs which have been requested to be cancelled, whether it be
immediately or after an arbitrarily long completion phase.

Buglink: https://gitlab.com/qemu-project/qemu/-/issues/462
Signed-off-by: Max Reitz 
---
  include/qemu/job.h |  8 +++-
  block/mirror.c | 10 --
  job.c  |  7 ++-
  3 files changed, 17 insertions(+), 8 deletions(-)

diff --git a/include/qemu/job.h b/include/qemu/job.h
index 8aa90f7395..032edf3c5f 100644
--- a/include/qemu/job.h
+++ b/include/qemu/job.h
@@ -436,9 +436,15 @@ const char *job_type_str(const Job *job);
  /** Returns true if the job should not be visible to the 
management layer. */

  bool job_is_internal(Job *job);
  -/** Returns whether the job is scheduled for cancellation. */
+/** Returns whether the job is being cancelled. */
  bool job_is_cancelled(Job *job);
  +/**
+ * Returns whether the job is scheduled for cancellation (at an
+ * indefinite point).
+ */
+bool job_cancel_requested(Job *job);
+
  /** Returns whether the job is in a completed state. */
  bool job_is_completed(Job *job);
  diff --git a/block/mirror.c b/block/mirror.c
index e93631a9f6..72e02fa34e 100644
--- a/block/mirror.c
+++ b/block/mirror.c
@@ -936,7 +936,7 @@ static int coroutine_fn mirror_run(Job *job, 
Error **errp)

  /* Transition to the READY state and wait for complete. */
  job_transition_to_ready(&s->common.job);
  s->actively_synced = true;
-    while (!job_is_cancelled(&s->common.job) && 
!s->should_complete) {
+    while (!job_cancel_requested(&s->common.job) && 
!s->should_complete) {

  job_yield(&s->common.job);
  }
  s->common.job.cancelled = false;
@@ -1043,7 +1043,7 @@ static int coroutine_fn mirror_run(Job *job, 
Error **errp)

  }
    should_complete = s->should_complete ||
-    job_is_cancelled(&s->common.job);
+ job_cancel_requested(&s->common.job);
  cnt = bdrv_get_dirty_count(s->dirty_bitmap);
  }
  @@ -1087,7 +1087,7 @@ static int coroutine_fn mirror_run(Job 
*job, Error **errp)
  trace_mirror_before_sleep(s, cnt, 
job_is_ready(&s->common.job),

    delay_ns);
  job_sleep_ns(&s->common.job, delay_ns);
-    if (job_is_cancelled(&s->common.job) && 
s->common.job.force_cancel) {

+    if (job_is_cancelled(&s->common.job)) {
  break;
  }
  s->last_pause_ns = qemu_clock_get_ns(QEMU_CLOCK_REALTIME);
@@ -1099,9 +1099,7 @@ immediate_exit:
   * or it was cancelled prematurely so that we do not 
guarantee that

   * the target is a copy of the source.
   */
-    assert(ret < 0 ||
-   (s->common.job.force_cancel &&
-    job_is_cancelled(&s->common.job)));
+    assert(ret < 0 || job_is_cancelled(&s->common.job));


(As a note, I hope this does the job regarding your suggestions for 
patch 4. :))



  assert(need_drain);
  mirror_wait_for_all_io(s);
  }
diff --git a/job.c b/job.c
index e78d893a9c..dba17a680f 100644
--- a/job.c
+++ b/job.c
@@ -216,6 +216,11 @@ const char *job_type_str(const Job *job)
  }
    bool job_is_cancelled(Job *job)
+{
+    return job->cancelled && job->force_cancel;


Can job->cancelled be false when job->force_cancel is true?  I think
not, and that is worth an assertion here.  Something like


if (job->force_cancel) {
    assert(job->cancelled);
    return true;
}

return false;


Sounds good, why not.




+}
+
+bool job_cancel_requested(Job *job)
  {
  return job->cancelled;
  }
@@ -1015,7 +1020,7 @@ void job_complete(Job *job, Error **errp)
  if (job_apply_verb(job, JOB_VER

Re: [PATCH RFC 0/3] mirror: rework soft-cancelling READY mirror

2021-07-30 Thread Max Reitz

On 29.07.21 18:29, Vladimir Sementsov-Ogievskiy wrote:

29.07.2021 16:47, Max Reitz wrote:

On 29.07.21 13:35, Vladimir Sementsov-Ogievskiy wrote:

29.07.2021 13:38, Max Reitz wrote:

On 29.07.21 12:02, Vladimir Sementsov-Ogievskiy wrote:

28.07.2021 10:00, Max Reitz wrote:

On 27.07.21 18:47, Vladimir Sementsov-Ogievskiy wrote:

Hi all!

That's an alternative to (part of) Max's
"[PATCH for-6.1? v2 0/7] mirror: Handle errors after READY cancel"
and shows' my idea of handling soft-cancelling READY mirror case
directly in qmp_block_job_cancel. And cleanup all other job 
cancelling

functions.

That's untested draft, don't take it to heart :)


Well, I would have preferred it if you’d rebased this on top of 
that series, precisely because it’s an alternative to only part 
of it. And if it’s just an untested draft, that would have been 
even better, because it would’ve given a better idea on what the 
cleanup looks like.


There are also things like this series making cancel internally 
always a force-cancel, where I’m not sure whether we want that in 
the replication driver or not[1].  With my series, we add an 
explicit parameter, so we’re forced to think about it, and then 
in this series on top we can just drop the parameter for all 
force-cancel invocations again, and for all non-force-cancel 
invocations we would have to think a bit more.


I'm now not sure that patch 5 of your series is correct (see my
last answer to it); that's why I decided not to base on it.


Well, we can always take patch 5 from v1.  (Where I changed any 
job_is_cancelled() to job_cancel_requested() when it influenced the 
external interface.)


My series has the benefit of handling the soft-mirror-cancel case the 
other way, and it handles mirror finalization properly in the 
soft-cancel case.




Specifically as for this series, I don’t like job_complete_ex() 
very much, I think the parameter should be part of job_complete() 
itself.


That was my idea. But job_complete is passed as a function pointer, 
so changing its prototype would be more work. But I think it's 
possible.


  If we think that’s too specific of a mirror parameter to 
include in normal job_complete(), well, then there shouldn’t be a 
job_complete_ex() either, and do_graph_change should be a 
property of the mirror job (perhaps as pivot_on_completion) 
that’s cleared by qmp_block_job_cancel() before invoking 
job_complete().


This way, users would lose the ability to make that decision while the 
job is running.


On the contrary, it would be a precursor to letting the user change 
this property explicitly with a new QMP command.


But they probably don't actually need that. Moving the option to a 
mirror job parameter seems like a good option to me.




Max

[1] Although looking at it again now, it probably wants 
force-cancel.





What do you think of my idea to keep the old bugs as they are, just 
deprecate block-job-cancel, and add a new interface for the 
"no-graph-change mirror" case?


I don’t see a reason for that.  The fix isn’t that complicated.

Also, honestly, I don’t see a good reason for deprecating anything.



The current interface leads to a mess in the code, and that's bad. A 
cancellation mode that is actually a kind of completion (with comments 
about that in many places) shows me that the interface is not good. 
It's a question of terminology, of what to call "cancel". Also, this is 
not the first time this question has arisen. Remember my recent 
cancel-in-flight-requests series, when I thought that "cancel is 
cancel" and didn't consider the soft-cancel of mirror; and reviewers 
didn't catch it. I don't think that interface is good, it will 
always confuse new developers and users. But that's just my opinion, 
I don't impose it )


If we don't deprecate it, i.e. if we consider the old interface to be 
good, then there is no reason for this series of mine, nor for 
introducing a new interface :)


I’m not against a better interface, I’m against using this current 
bug as an excuse to improve the interface.  We’ve known we want to 
improve the interface for quite a long time now, we don’t need an 
excuse for that.


If we use this bug as an excuse, I’m afraid of becoming hung up on 
interface discussions instead of just getting the bug fixed. And we 
must get the bug fixed, it’s real, it’s kind of bad, and saying “it 
won’t appear with the new interface, let’s not worry about the old 
one” is not something I like.


OTOH, if we use this bug as an excuse, I’m also afraid of trying to 
rush the design instead of actually implementing the interface that 
we’ve always desired, i.e. where the user gets to choose the 
completion mode via yet-to-be-implemented some job property setter 
function.


As a final note (but this is precisely the interface discussion that 
I want to avoid for now), I said I don’t see a good reason for 
deprecating anything, because `job-cancel force=false` can just 
internally do `set-job-property .pivot_on_completion=false; 
job-complete`.  From an implementation 

[PATCH v3 06/10] virtiofsd: Let lo_inode_open() return a TempFd

2021-07-30 Thread Max Reitz
Strictly speaking, this is not necessary, because lo_inode_open() will
always return a new FD owned by the caller, so TempFd.owned will always
be true.

However, auto-cleanup is nice, and in some cases this plays nicely with
an lo_inode_fd() call in another conditional branch (see lo_setattr()).

Signed-off-by: Max Reitz 
---
 tools/virtiofsd/passthrough_ll.c | 138 +--
 1 file changed, 59 insertions(+), 79 deletions(-)

diff --git a/tools/virtiofsd/passthrough_ll.c b/tools/virtiofsd/passthrough_ll.c
index 9e1bc37af8..292b7f7e27 100644
--- a/tools/virtiofsd/passthrough_ll.c
+++ b/tools/virtiofsd/passthrough_ll.c
@@ -291,10 +291,8 @@ static void temp_fd_clear(TempFd *temp_fd)
 /**
  * Return an owned fd from *temp_fd that will not be closed when
  * *temp_fd goes out of scope.
- *
- * (TODO: Remove __attribute__ once this is used.)
  */
-static __attribute__((unused)) int temp_fd_steal(TempFd *temp_fd)
+static int temp_fd_steal(TempFd *temp_fd)
 {
 if (temp_fd->owned) {
 temp_fd->owned = false;
@@ -673,9 +671,12 @@ static int lo_fd(fuse_req_t req, fuse_ino_t ino, TempFd 
*tfd)
  * when a malicious client opens special files such as block device nodes.
  * Symlink inodes are also rejected since symlinks must already have been
  * traversed on the client side.
+ *
+ * The fd is returned in tfd->fd.  The return value is 0 on success and -errno
+ * otherwise.
  */
-static int lo_inode_open(struct lo_data *lo, struct lo_inode *inode,
- int open_flags)
+static int lo_inode_open(const struct lo_data *lo, const struct lo_inode 
*inode,
+ int open_flags, TempFd *tfd)
 {
 g_autofree char *fd_str = g_strdup_printf("%d", inode->fd);
 int fd;
@@ -694,7 +695,13 @@ static int lo_inode_open(struct lo_data *lo, struct 
lo_inode *inode,
 if (fd < 0) {
 return -errno;
 }
-return fd;
+
+*tfd = (TempFd) {
+.fd = fd,
+.owned = true,
+};
+
+return 0;
 }
 
 static void lo_init(void *userdata, struct fuse_conn_info *conn)
@@ -852,7 +859,12 @@ static void lo_setattr(fuse_req_t req, fuse_ino_t ino, 
struct stat *attr,
 return;
 }
 
-res = lo_inode_fd(inode, &inode_fd);
+if (!fi && (valid & FUSE_SET_ATTR_SIZE)) {
+/* We need an O_RDWR FD for ftruncate() */
+res = lo_inode_open(lo, inode, O_RDWR, &inode_fd);
+} else {
+res = lo_inode_fd(inode, &inode_fd);
+}
 if (res < 0) {
 saverr = -res;
 goto out_err;
@@ -900,18 +912,11 @@ static void lo_setattr(fuse_req_t req, fuse_ino_t ino, 
struct stat *attr,
 if (fi) {
 truncfd = fd;
 } else {
-truncfd = lo_inode_open(lo, inode, O_RDWR);
-if (truncfd < 0) {
-saverr = -truncfd;
-goto out_err;
-}
+truncfd = inode_fd.fd;
 }
 
 saverr = drop_security_capability(lo, truncfd);
 if (saverr) {
-if (!fi) {
-close(truncfd);
-}
 goto out_err;
 }
 
@@ -919,9 +924,6 @@ static void lo_setattr(fuse_req_t req, fuse_ino_t ino, 
struct stat *attr,
 res = drop_effective_cap("FSETID", &cap_fsetid_dropped);
 if (res != 0) {
 saverr = res;
-if (!fi) {
-close(truncfd);
-}
 goto out_err;
 }
 }
@@ -934,9 +936,6 @@ static void lo_setattr(fuse_req_t req, fuse_ino_t ino, 
struct stat *attr,
 fuse_log(FUSE_LOG_ERR, "Failed to gain CAP_FSETID\n");
 }
 }
-if (!fi) {
-close(truncfd);
-}
 if (res == -1) {
 goto out_err;
 }
@@ -1822,11 +1821,12 @@ static struct lo_dirp *lo_dirp(fuse_req_t req, struct 
fuse_file_info *fi)
 static void lo_opendir(fuse_req_t req, fuse_ino_t ino,
struct fuse_file_info *fi)
 {
+g_auto(TempFd) inode_fd = TEMP_FD_INIT;
 int error = ENOMEM;
 struct lo_data *lo = lo_data(req);
 struct lo_inode *inode;
 struct lo_dirp *d = NULL;
-int fd;
+int res;
 ssize_t fh;
 
 inode = lo_inode(req, ino);
@@ -1840,13 +1840,13 @@ static void lo_opendir(fuse_req_t req, fuse_ino_t ino,
 goto out_err;
 }
 
-fd = lo_inode_open(lo, inode, O_RDONLY);
-if (fd < 0) {
-error = -fd;
+res = lo_inode_open(lo, inode, O_RDONLY, &inode_fd);
+if (res < 0) {
+error = -res;
 goto out_err;
 }
 
-d->dp = fdopendir(fd);
+d->dp = fdopendir(temp_fd_steal(&inode_fd));
 if (d->dp == NULL) {
 goto out_errno;
 }
@@ -1876,8 +1876,6 @@ out_err:
 if (d) {
 if (d->dp) {
 closedir(d->dp);
-} else if (fd != -1) {
-close(fd);
 }
 free(d);
 }
@@ -2077,6 +2075,7 @@ static void update_open_

[PATCH v3 10/10] virtiofsd: Add lazy lo_do_find()

2021-07-30 Thread Max Reitz
lo_find() right now takes two lookup keys for two maps, namely the file
handle for inodes_by_handle and the statx information for inodes_by_ids.
However, we only need the statx information if looking up the inode by
the file handle failed.

There are two callers of lo_find(): The first one, lo_do_lookup(), has
both keys anyway, so passing them does not incur any additional cost.
The second one, lookup_name(), though, needs to explicitly invoke
name_to_handle_at() (through get_file_handle()) and statx() (through
do_statx()).  We need to try to get a file handle as the primary key, so
we cannot get rid of get_file_handle(), but we only need the statx
information if looking up an inode by handle failed; so we can defer
that until the lookup has indeed failed.

To this end, replace lo_find()'s st/mnt_id parameters by a get_ids()
closure that is invoked to fill the lo_key struct if necessary.

Also, lo_find() is renamed to lo_do_find(), so we can add a new
lo_find() wrapper whose closure just initializes the lo_key from the
st/mnt_id parameters, just like the old lo_find() did.

lookup_name() directly calls lo_do_find() now and passes its own
closure, which performs the do_statx() call.

Signed-off-by: Max Reitz 
---
 tools/virtiofsd/passthrough_ll.c | 93 ++--
 1 file changed, 76 insertions(+), 17 deletions(-)

diff --git a/tools/virtiofsd/passthrough_ll.c b/tools/virtiofsd/passthrough_ll.c
index ac95961d12..41e9f53878 100644
--- a/tools/virtiofsd/passthrough_ll.c
+++ b/tools/virtiofsd/passthrough_ll.c
@@ -1200,22 +1200,23 @@ out_err:
 fuse_reply_err(req, saverr);
 }
 
-static struct lo_inode *lo_find(struct lo_data *lo,
-const struct lo_fhandle *fhandle,
-struct stat *st, uint64_t mnt_id)
+/*
+ * get_ids() will be called to get the key for lo->inodes_by_ids if
+ * the lookup by file handle has failed.
+ */
+static struct lo_inode *lo_do_find(struct lo_data *lo,
+const struct lo_fhandle *fhandle,
+int (*get_ids)(struct lo_key *, const void *),
+const void *get_ids_opaque)
 {
 struct lo_inode *p = NULL;
-struct lo_key ids_key = {
-.ino = st->st_ino,
-.dev = st->st_dev,
-.mnt_id = mnt_id,
-};
+struct lo_key ids_key;
 
 pthread_mutex_lock(&lo->mutex);
 if (fhandle) {
 p = g_hash_table_lookup(lo->inodes_by_handle, fhandle);
 }
-if (!p) {
+if (!p && get_ids(&ids_key, get_ids_opaque) == 0) {
 p = g_hash_table_lookup(lo->inodes_by_ids, &ids_key);
 /*
  * When we had to fall back to looking up an inode by its
@@ -1244,6 +1245,36 @@ static struct lo_inode *lo_find(struct lo_data *lo,
 return p;
 }
 
+struct lo_find_get_ids_key_opaque {
+const struct stat *st;
+uint64_t mnt_id;
+};
+
+static int lo_find_get_ids_key(struct lo_key *ids_key, const void *opaque)
+{
+const struct lo_find_get_ids_key_opaque *stat_info = opaque;
+
+*ids_key = (struct lo_key){
+.ino = stat_info->st->st_ino,
+.dev = stat_info->st->st_dev,
+.mnt_id = stat_info->mnt_id,
+};
+
+return 0;
+}
+
+static struct lo_inode *lo_find(struct lo_data *lo,
+const struct lo_fhandle *fhandle,
+struct stat *st, uint64_t mnt_id)
+{
+const struct lo_find_get_ids_key_opaque stat_info = {
+.st = st,
+.mnt_id = mnt_id,
+};
+
+return lo_do_find(lo, fhandle, lo_find_get_ids_key, &stat_info);
+}
+
 /* value_destroy_func for posix_locks GHashTable */
 static void posix_locks_value_destroy(gpointer data)
 {
@@ -1769,14 +1800,41 @@ out_err:
 fuse_reply_err(req, saverr);
 }
 
+struct lookup_name_get_ids_key_opaque {
+struct lo_data *lo;
+int parent_fd;
+const char *name;
+};
+
+static int lookup_name_get_ids_key(struct lo_key *ids_key, const void *opaque)
+{
+const struct lookup_name_get_ids_key_opaque *stat_params = opaque;
+uint64_t mnt_id;
+struct stat attr;
+int res;
+
+res = do_statx(stat_params->lo, stat_params->parent_fd, stat_params->name,
+   &attr, AT_SYMLINK_NOFOLLOW, &mnt_id);
+if (res < 0) {
+return -errno;
+}
+
+*ids_key = (struct lo_key){
+.ino = attr.st_ino,
+.dev = attr.st_dev,
+.mnt_id = mnt_id,
+};
+
+return 0;
+}
+
 /* Increments nlookup and caller must release refcount using lo_inode_put() */
 static struct lo_inode *lookup_name(fuse_req_t req, fuse_ino_t parent,
 const char *name)
 {
 g_auto(TempFd) dir_fd = TEMP_FD_INIT;
 int res;
-uint64_t mnt_id;
-struct stat attr;
+struct lookup_name_get_ids_key_opaque stat_params;
 struct lo_fhandle *fh;
 struct lo_data *lo = lo_data(req);
 struct lo_inode *dir = lo_inode(req, parent);
@@ -1794,12 +1852,13 @@ static struct lo_inode *lookup_name(fuse_req_t req, 
fuse_ino_t 

[PATCH v3 09/10] virtiofsd: Optionally fill lo_inode.fhandle

2021-07-30 Thread Max Reitz
When the inode_file_handles option is set, try to generate a file handle
for new inodes instead of opening an O_PATH FD.

Being able to open these again will require CAP_DAC_READ_SEARCH, so the
description text tells the user they will also need to specify
-o modcaps=+dac_read_search.

Generating a file handle returns the mount ID it is valid for.  Opening
it will require an FD instead.  We have mount_fds to map an ID to an FD.
get_file_handle() fills the hash map by opening the file we have
generated a handle for.  To verify that the resulting FD indeed
represents the handle's mount ID, we use statx().  Therefore, using file
handles requires statx() support.

Signed-off-by: Max Reitz 
---
 tools/virtiofsd/helper.c  |   3 +
 tools/virtiofsd/passthrough_ll.c  | 194 --
 tools/virtiofsd/passthrough_seccomp.c |   1 +
 3 files changed, 190 insertions(+), 8 deletions(-)

diff --git a/tools/virtiofsd/helper.c b/tools/virtiofsd/helper.c
index a8295d975a..aa63a21d43 100644
--- a/tools/virtiofsd/helper.c
+++ b/tools/virtiofsd/helper.c
@@ -187,6 +187,9 @@ void fuse_cmdline_help(void)
"   default: no_allow_direct_io\n"
"-o announce_submounts  Announce sub-mount points to the 
guest\n"
"-o posix_acl/no_posix_acl  Enable/Disable posix_acl. (default: 
disabled)\n"
+   "-o inode_file_handles  Use file handles to reference 
inodes\n"
+   "   instead of O_PATH file 
descriptors\n"
+   "   (requires -o 
modcaps=+dac_read_search)\n"
);
 }
 
diff --git a/tools/virtiofsd/passthrough_ll.c b/tools/virtiofsd/passthrough_ll.c
index f9d8b2f134..ac95961d12 100644
--- a/tools/virtiofsd/passthrough_ll.c
+++ b/tools/virtiofsd/passthrough_ll.c
@@ -194,6 +194,7 @@ struct lo_data {
 /* If set, virtiofsd is responsible for setting umask during creation */
 bool change_umask;
 int user_posix_acl, posix_acl;
+int inode_file_handles;
 };
 
 /**
@@ -250,6 +251,10 @@ static const struct fuse_opt lo_opts[] = {
 { "no_killpriv_v2", offsetof(struct lo_data, user_killpriv_v2), 0 },
 { "posix_acl", offsetof(struct lo_data, user_posix_acl), 1 },
 { "no_posix_acl", offsetof(struct lo_data, user_posix_acl), 0 },
+{ "inode_file_handles", offsetof(struct lo_data, inode_file_handles), 1 },
+{ "no_inode_file_handles",
+  offsetof(struct lo_data, inode_file_handles),
+  0 },
 FUSE_OPT_END
 };
 static bool use_syslog = false;
@@ -321,6 +326,135 @@ static int temp_fd_steal(TempFd *temp_fd)
 }
 }
 
+/**
+ * Generate a file handle for the given dirfd/name combination.
+ *
+ * If mount_fds does not yet contain an entry for the handle's mount
+ * ID, (re)open dirfd/name in O_RDONLY mode and add it to mount_fds
+ * as the FD for that mount ID.  (That is the file that we have
+ * generated a handle for, so it should be representative for the
+ * mount ID.  However, to be sure (and to rule out races), we use
+ * statx() to verify that our assumption is correct.)
+ */
+static struct lo_fhandle *get_file_handle(struct lo_data *lo,
+  int dirfd, const char *name)
+{
+/* We need statx() to verify the mount ID */
+#if defined(CONFIG_STATX) && defined(STATX_MNT_ID)
+struct lo_fhandle *fh;
+int ret;
+
+if (!lo->use_statx || !lo->inode_file_handles) {
+return NULL;
+}
+
+fh = g_new0(struct lo_fhandle, 1);
+
+fh->handle.handle_bytes = sizeof(fh->padding) - sizeof(fh->handle);
+ret = name_to_handle_at(dirfd, name, &fh->handle, &fh->mount_id,
+AT_EMPTY_PATH);
+if (ret < 0) {
+goto fail;
+}
+
+if (pthread_rwlock_rdlock(&mount_fds_lock)) {
+goto fail;
+}
+if (!g_hash_table_contains(mount_fds, GINT_TO_POINTER(fh->mount_id))) {
+g_auto(TempFd) path_fd = TEMP_FD_INIT;
+struct statx stx;
+char procname[64];
+int fd;
+
+pthread_rwlock_unlock(&mount_fds_lock);
+
+/*
+ * Before opening an O_RDONLY fd, check whether dirfd/name is a regular
+ * file or directory, because we must not open anything else with
+ * anything but O_PATH.
+ * (And we use that occasion to verify that the file has the mount ID 
we
+ * need.)
+ */
+if (name[0]) {
+path_fd.fd = openat(dirfd, name, O_PATH);
+if (path_fd.fd < 0) {
+goto fail;
+}
+path_fd.owned = true;
+} else {
+path_fd.fd = dirfd;
+path_fd.owned = false;
+}
+
+ret = statx(path_fd.fd, "", AT_EMPTY_PATH | AT_SYMLINK_NOFOLLOW,
+STATX_TYPE | STATX_MNT_ID, &stx);
+if (ret < 0) 

[PATCH v3 02/10] virtiofsd: Add TempFd structure

2021-07-30 Thread Max Reitz
We are planning to add file handles to lo_inode objects as an
alternative to lo_inode.fd.  That means that everywhere where we
currently reference lo_inode.fd, we will have to open a temporary file
descriptor that needs to be closed after use.

So instead of directly accessing lo_inode.fd, there will be a helper
function (lo_inode_fd()) that either returns lo_inode.fd, or opens a new
file descriptor with open_by_handle_at().  It encapsulates this result
in a TempFd structure to let the caller know whether the FD needs to be
closed after use (opened from the handle) or not (copied from
lo_inode.fd).

By using g_auto(TempFd) to store this result, callers will not even have
to care about closing a temporary FD after use.  It will be done
automatically once the object goes out of scope.
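
A typical caller then looks like this (hypothetical example function;
lo_inode_fd() itself is added in a later patch of this series):

    static int example_op(struct lo_data *lo, struct lo_inode *inode)
    {
        g_auto(TempFd) inode_fd = TEMP_FD_INIT;
        int res;

        res = lo_inode_fd(inode, &inode_fd);
        if (res < 0) {
            return res; /* -errno */
        }

        /* ... use inode_fd.fd here ... */

        return 0;
        /* temp_fd_clear() runs here and closes inode_fd.fd iff .owned */
    }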

Signed-off-by: Max Reitz 
Reviewed-by: Connor Kuehl 
---
 tools/virtiofsd/passthrough_ll.c | 49 
 1 file changed, 49 insertions(+)

diff --git a/tools/virtiofsd/passthrough_ll.c b/tools/virtiofsd/passthrough_ll.c
index 1f27eeabc5..fb5e073e6a 100644
--- a/tools/virtiofsd/passthrough_ll.c
+++ b/tools/virtiofsd/passthrough_ll.c
@@ -178,6 +178,28 @@ struct lo_data {
 int user_posix_acl, posix_acl;
 };
 
+/**
+ * Represents a file descriptor that may either be owned by this
+ * TempFd, or only referenced (i.e. the ownership belongs to some
+ * other object, and the value has just been copied into this TempFd).
+ *
+ * The purpose of this encapsulation is to be used as g_auto(TempFd)
+ * to automatically clean up owned file descriptors when this object
+ * goes out of scope.
+ *
+ * Use temp_fd_steal() to get an owned file descriptor that will not
+ * be closed when the TempFd goes out of scope.
+ */
+typedef struct {
+int fd;
+bool owned; /* fd owned by this object? */
+} TempFd;
+
+#define TEMP_FD_INIT ((TempFd) { .fd = -1, .owned = false })
+
+static void temp_fd_clear(TempFd *temp_fd);
+G_DEFINE_AUTO_CLEANUP_CLEAR_FUNC(TempFd, temp_fd_clear);
+
 static const struct fuse_opt lo_opts[] = {
 { "sandbox=namespace",
   offsetof(struct lo_data, sandbox),
@@ -255,6 +277,33 @@ static struct lo_data *lo_data(fuse_req_t req)
 return (struct lo_data *)fuse_req_userdata(req);
 }
 
+/**
+ * Clean-up function for TempFds
+ */
+static void temp_fd_clear(TempFd *temp_fd)
+{
+if (temp_fd->owned) {
+close(temp_fd->fd);
+*temp_fd = TEMP_FD_INIT;
+}
+}
+
+/**
+ * Return an owned fd from *temp_fd that will not be closed when
+ * *temp_fd goes out of scope.
+ *
+ * (TODO: Remove __attribute__ once this is used.)
+ */
+static __attribute__((unused)) int temp_fd_steal(TempFd *temp_fd)
+{
+if (temp_fd->owned) {
+temp_fd->owned = false;
+return temp_fd->fd;
+} else {
+return dup(temp_fd->fd);
+}
+}
+
 /*
  * Load capng's state from our saved state if the current thread
  * hadn't previously been loaded.
-- 
2.31.1




[PATCH v3 03/10] virtiofsd: Use lo_inode_open() instead of openat()

2021-07-30 Thread Max Reitz
The xattr functions want a non-O_PATH FD, so they reopen the lo_inode.fd
with the flags they need through /proc/self/fd.

Similarly, lo_opendir() needs an O_RDONLY FD.  Instead of the
/proc/self/fd trick, it just uses openat(fd, "."), because the FD is
guaranteed to be a directory, so this works.

All cases have one problem in common, though: In the future, when we may
have a file handle in the lo_inode instead of an FD, querying an
lo_inode FD may incur an open_by_handle_at() call.  It does not make
sense to then reopen that FD with custom flags, those should have been
passed to open_by_handle_at() instead.

Use lo_inode_open() instead of openat().  As part of the file handle
change, lo_inode_open() will be made to invoke openat() only if
lo_inode.fd is valid.  Otherwise, it will invoke open_by_handle_at()
with the right flags from the start.

Consequently, after this patch, lo_inode_open() is the only place to
invoke openat() to reopen an existing FD with different flags.
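
(For orientation, the shape lo_inode_open() is headed for, ignoring the
TempFd conversion made later in this series; a sketch under the
assumption of the open_file_handle() helper from the file handle
patches:)

    static int lo_inode_open(const struct lo_data *lo,
                             const struct lo_inode *inode, int open_flags)
    {
        if (inode->fd >= 0) {
            /* Reopen the O_PATH FD via /proc/self/fd with the new flags */
            g_autofree char *fd_str = g_strdup_printf("%d", inode->fd);
            int fd = openat(lo->proc_self_fd, fd_str,
                            open_flags & ~O_NOFOLLOW);
            return fd < 0 ? -errno : fd;
        }

        /* No O_PATH FD: open the file handle with the right flags directly */
        return open_file_handle(inode->fhandle, open_flags);
    }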

Signed-off-by: Max Reitz 
---
 tools/virtiofsd/passthrough_ll.c | 43 
 1 file changed, 27 insertions(+), 16 deletions(-)

diff --git a/tools/virtiofsd/passthrough_ll.c b/tools/virtiofsd/passthrough_ll.c
index fb5e073e6a..a444c3a7e2 100644
--- a/tools/virtiofsd/passthrough_ll.c
+++ b/tools/virtiofsd/passthrough_ll.c
@@ -1729,18 +1729,26 @@ static void lo_opendir(fuse_req_t req, fuse_ino_t ino,
 {
 int error = ENOMEM;
 struct lo_data *lo = lo_data(req);
-struct lo_dirp *d;
+struct lo_inode *inode;
+struct lo_dirp *d = NULL;
 int fd;
 ssize_t fh;
 
+inode = lo_inode(req, ino);
+if (!inode) {
+error = EBADF;
+goto out_err;
+}
+
 d = calloc(1, sizeof(struct lo_dirp));
 if (d == NULL) {
 goto out_err;
 }
 
-fd = openat(lo_fd(req, ino), ".", O_RDONLY);
-if (fd == -1) {
-goto out_errno;
+fd = lo_inode_open(lo, inode, O_RDONLY);
+if (fd < 0) {
+error = -fd;
+goto out_err;
 }
 
 d->dp = fdopendir(fd);
@@ -1769,6 +1777,7 @@ static void lo_opendir(fuse_req_t req, fuse_ino_t ino,
 out_errno:
 error = errno;
 out_err:
+lo_inode_put(lo, &inode);
 if (d) {
 if (d->dp) {
 closedir(d->dp);
@@ -2973,7 +2982,6 @@ static void lo_getxattr(fuse_req_t req, fuse_ino_t ino, 
const char *in_name,
 }
 }
 
-sprintf(procname, "%i", inode->fd);
 /*
  * It is not safe to open() non-regular/non-dir files in file server
  * unless O_PATH is used, so use that method for regular files/dir
@@ -2981,13 +2989,15 @@ static void lo_getxattr(fuse_req_t req, fuse_ino_t ino, 
const char *in_name,
  * Otherwise, call fchdir() to avoid open().
  */
 if (S_ISREG(inode->filetype) || S_ISDIR(inode->filetype)) {
-fd = openat(lo->proc_self_fd, procname, O_RDONLY);
+fd = lo_inode_open(lo, inode, O_RDONLY);
 if (fd < 0) {
-goto out_err;
+saverr = -fd;
+goto out;
 }
 ret = fgetxattr(fd, name, value, size);
 saverr = ret == -1 ? errno : 0;
 } else {
+sprintf(procname, "%i", inode->fd);
 /* fchdir should not fail here */
 FCHDIR_NOFAIL(lo->proc_self_fd);
 ret = getxattr(procname, name, value, size);
@@ -3054,15 +3064,16 @@ static void lo_listxattr(fuse_req_t req, fuse_ino_t 
ino, size_t size)
 }
 }
 
-sprintf(procname, "%i", inode->fd);
 if (S_ISREG(inode->filetype) || S_ISDIR(inode->filetype)) {
-fd = openat(lo->proc_self_fd, procname, O_RDONLY);
+fd = lo_inode_open(lo, inode, O_RDONLY);
 if (fd < 0) {
-goto out_err;
+saverr = -fd;
+goto out;
 }
 ret = flistxattr(fd, value, size);
 saverr = ret == -1 ? errno : 0;
 } else {
+sprintf(procname, "%i", inode->fd);
 /* fchdir should not fail here */
 FCHDIR_NOFAIL(lo->proc_self_fd);
 ret = listxattr(procname, value, size);
@@ -3211,14 +3222,14 @@ static void lo_setxattr(fuse_req_t req, fuse_ino_t ino, 
const char *in_name,
  * setxattr() on the link's filename there.
  */
 open_inode = S_ISREG(inode->filetype) || S_ISDIR(inode->filetype);
-sprintf(procname, "%i", inode->fd);
 if (open_inode) {
-fd = openat(lo->proc_self_fd, procname, O_RDONLY);
+fd = lo_inode_open(lo, inode, O_RDONLY);
 if (fd < 0) {
-saverr = errno;
+saverr = -fd;
 goto out;
 }
 } else {
+sprintf(procname, "%i", inode->fd);
 /* fchdir should not fail here */
 FCHDIR_NOFAIL(lo->proc_self_fd);
 }
@@ -3317,16 +3328,16 @@ static void lo_removexattr(fuse_req_t req, fuse_ino_t 
ino, const char *in_name)
 fuse_log(FUSE_LOG_DEBUG, "

[PATCH v3 07/10] virtiofsd: Add lo_inode.fhandle

2021-07-30 Thread Max Reitz
This new field is an alternative to lo_inode.fd: Either of the two must
be set.  In case an O_PATH FD is needed for some lo_inode, it is either
taken from lo_inode.fd, if valid, or a temporary FD is opened with
open_by_handle_at().

Using a file handle instead of an FD has the advantage of keeping the
number of open file descriptors low.

Because open_by_handle_at() requires a mount FD (i.e. a non-O_PATH FD
opened on the filesystem to which the file handle refers), but every
lo_fhandle only has a mount ID (as returned by name_to_handle_at()), we
keep a hash map of such FDs in mount_fds (mapping ID to FD).
get_file_handle(), which is added by a later patch, will ensure that
every mount ID for which we have generated a handle has a corresponding
entry in mount_fds.

Signed-off-by: Max Reitz 
Reviewed-by: Connor Kuehl 
---
 tools/virtiofsd/passthrough_ll.c  | 116 ++
 tools/virtiofsd/passthrough_seccomp.c |   1 +
 2 files changed, 102 insertions(+), 15 deletions(-)

diff --git a/tools/virtiofsd/passthrough_ll.c b/tools/virtiofsd/passthrough_ll.c
index 292b7f7e27..487448d666 100644
--- a/tools/virtiofsd/passthrough_ll.c
+++ b/tools/virtiofsd/passthrough_ll.c
@@ -88,8 +88,25 @@ struct lo_key {
 uint64_t mnt_id;
 };
 
+struct lo_fhandle {
+union {
+struct file_handle handle;
+char padding[sizeof(struct file_handle) + MAX_HANDLE_SZ];
+};
+int mount_id;
+};
+
+/* Maps mount IDs to an FD that we can pass to open_by_handle_at() */
+static GHashTable *mount_fds;
+pthread_rwlock_t mount_fds_lock = PTHREAD_RWLOCK_INITIALIZER;
+
 struct lo_inode {
+/*
+ * Either of fd or fhandle must be set (i.e. >= 0 or non-NULL,
+ * respectively).
+ */
 int fd;
+struct lo_fhandle *fhandle;
 
 /*
  * Atomic reference count for this object.  The nlookup field holds a
@@ -302,6 +319,44 @@ static int temp_fd_steal(TempFd *temp_fd)
 }
 }
 
+/**
+ * Open the given file handle with the given flags.
+ *
+ * The mount FD to pass to open_by_handle_at() is taken from the
+ * mount_fds hash map.
+ *
+ * On error, return -errno.
+ */
+static int open_file_handle(const struct lo_fhandle *fh, int flags)
+{
+gpointer mount_fd_ptr;
+int mount_fd;
+bool found;
+int ret;
+
+ret = pthread_rwlock_rdlock(&mount_fds_lock);
+if (ret) {
+return -ret;
+}
+
+/* mount_fd == 0 is valid, so we need lookup_extended */
+found = g_hash_table_lookup_extended(mount_fds,
+ GINT_TO_POINTER(fh->mount_id),
+ NULL, &mount_fd_ptr);
+pthread_rwlock_unlock(&mount_fds_lock);
+if (!found) {
+return -EINVAL;
+}
+mount_fd = GPOINTER_TO_INT(mount_fd_ptr);
+
+ret = open_by_handle_at(mount_fd, (struct file_handle *)&fh->handle, 
flags);
+if (ret < 0) {
+return -errno;
+}
+
+return ret;
+}
+
 /*
  * Load capng's state from our saved state if the current thread
  * hadn't previously been loaded.
@@ -608,7 +663,11 @@ static void lo_inode_put(struct lo_data *lo, struct 
lo_inode **inodep)
 *inodep = NULL;
 
 if (g_atomic_int_dec_and_test(&inode->refcount)) {
-close(inode->fd);
+if (inode->fd >= 0) {
+close(inode->fd);
+} else {
+g_free(inode->fhandle);
+}
 free(inode);
 }
 }
@@ -635,10 +694,25 @@ static struct lo_inode *lo_inode(fuse_req_t req, 
fuse_ino_t ino)
 
 static int lo_inode_fd(const struct lo_inode *inode, TempFd *tfd)
 {
-*tfd = (TempFd) {
-.fd = inode->fd,
-.owned = false,
-};
+if (inode->fd >= 0) {
+*tfd = (TempFd) {
+.fd = inode->fd,
+.owned = false,
+};
+} else {
+int fd;
+
+assert(inode->fhandle != NULL);
+fd = open_file_handle(inode->fhandle, O_PATH);
+if (fd < 0) {
+return -errno;
+}
+
+*tfd = (TempFd) {
+.fd = fd,
+.owned = true,
+};
+}
 
 return 0;
 }
@@ -678,22 +752,32 @@ static int lo_fd(fuse_req_t req, fuse_ino_t ino, TempFd 
*tfd)
 static int lo_inode_open(const struct lo_data *lo, const struct lo_inode 
*inode,
  int open_flags, TempFd *tfd)
 {
-g_autofree char *fd_str = g_strdup_printf("%d", inode->fd);
+g_autofree char *fd_str = NULL;
 int fd;
 
 if (!S_ISREG(inode->filetype) && !S_ISDIR(inode->filetype)) {
 return -EBADF;
 }
 
-/*
- * The file is a symlink so O_NOFOLLOW must be ignored. We checked earlier
- * that the inode is not a special file but if an external process races
- * with us then symlinks are traversed here. It is not possible to escape
- * the shared directory since it is mounted as "/" though.
- */
-fd = openat(lo->proc_self_fd, fd_str, open_flags & ~O_NOFOLLOW);
-if (fd < 0) {
-

[PATCH v3 00/10] virtiofsd: Allow using file handles instead of O_PATH FDs

2021-07-30 Thread Max Reitz
Hi,

v1 cover letter for an overview:
https://listman.redhat.com/archives/virtio-fs/2021-June/msg00033.html

v2 cover letter:
https://listman.redhat.com/archives/virtio-fs/2021-June/msg00074.html

For v3, at first I attempted to have errors related to file handle
generation (name_to_handle_at()) be returned to the guest unless they
are cases where file name generation is simply not supported, and only
then do a fallback to an O_PATH FD, as Vivek has suggested.

However, I found that to be rather complicated.  (Always falling back is
just simpler.)  Furthermore, because we believe that name_to_handle_at()
can rarely fail except for EOPNOTSUPP, there should be little difference
in practice.

Therefore, in v3, I kept the v2 model of always falling back to an
O_PATH FD when an error occurred during handle generation.

What did change in v3 is the following:
- I added patch 1, because f1aa1774dfb happened in the meantime, and
  this is basically what we did for virtiofsd-rs in the form of
  31e7ac63944 (virtiofsd-rs commit hash)

- Patch 4: In lookup_name(), I noticed that I failed to invoke
  lo_inode_put() to match the lo_inode() from the beginning of the
  function in all error paths.  Fixed by adding a common error path.

- Patch 6: Mostly contextual rebase conflicts (partly because of patch
  1), but also one functional change: I Dropped the `assert(fd >= 0)`
  under `if (open_inode)` in lo_setxattr(), because `fd` is dropped by
  this patch (and `inode_fd` is used regardless of the value of
  `open_inode` we can’t assert anything similar on it).

- Patch 8:
  - Fixed the condition to reject results found by st_ino lookup.
- st_ino on its own is only a valid identifier/key if we have an
  O_PATH fd for its respective lo_inode, because otherwise the inode
  may be unlinked and its st_ino might be reused by some new inode
- It does not matter whether lo_find()’s caller has supplied a file
  handle for a prior lookup by handle or not, so drop that part of
  the condition
- Semantically, it does not matter whether the lo_inode has a file
  handle or not – what matters is whether it has an O_PATH fd or
  not.  (The two are linked by a `handle <=> !fd` condition, so that
  part wasn’t technically wrong, just semantically.)
- In accordance with the last point, I rewrote the comment
  explaining why we have to reject such results.
  - Rebase conflict in lookup_name() because of the fix in patch 4

- Patch 9:
  - Non-functional change in lo_do_lookup() to separate the
get_file_handle()/openat() part from the do_statx() calls (and have
the do_statx() calls be side by side) – as a side effect, this makes
the diff to master slightly smaller.
  - Rebase conflict in lookup_name() because of the fix in patch 4

- Patch 10:
  - Rebase conflict in lookup_name() because of the fix in patch 4


Max Reitz (10):
  virtiofsd: Limit setxattr()'s creds-dropped region
  virtiofsd: Add TempFd structure
  virtiofsd: Use lo_inode_open() instead of openat()
  virtiofsd: Add lo_inode_fd() helper
  virtiofsd: Let lo_fd() return a TempFd
  virtiofsd: Let lo_inode_open() return a TempFd
  virtiofsd: Add lo_inode.fhandle
  virtiofsd: Add inodes_by_handle hash table
  virtiofsd: Optionally fill lo_inode.fhandle
  virtiofsd: Add lazy lo_do_find()

 tools/virtiofsd/helper.c  |   3 +
 tools/virtiofsd/passthrough_ll.c  | 869 +-
 tools/virtiofsd/passthrough_seccomp.c |   2 +
 3 files changed, 720 insertions(+), 154 deletions(-)

-- 
2.31.1




[PATCH v3 08/10] virtiofsd: Add inodes_by_handle hash table

2021-07-30 Thread Max Reitz
Currently, lo_inode.fhandle is always NULL and so always keep an O_PATH
FD in lo_inode.fd.  Therefore, when the respective inode is unlinked,
its inode ID will remain in use until we drop our lo_inode (and
lo_inode_put() thus closes the FD).  Therefore, lo_find() can safely use
the inode ID as an lo_inode key, because any inode with an inode ID we
find in lo_data.inodes (on the same filesystem) must be the exact same
file.

This will change when we start setting lo_inode.fhandle so we do not
have to keep an O_PATH FD open.  Then, unlinking such an inode will
immediately remove it, so its ID can then be reused by newly created
files, even while the lo_inode object is still there[1].

So creating a new file can then reuse the old file's inode ID, and
looking up the new file would lead to us finding the old file's
lo_inode, which is not ideal.

Luckily, just as file handles cause this problem, they also solve it:  A
file handle contains a generation ID, which changes when an inode ID is
reused, so the new file can be distinguished from the old one.  So all
we need to do is to add a second map besides lo_data.inodes that maps
file handles to lo_inodes, namely lo_data.inodes_by_handle.  For
clarity, lo_data.inodes is renamed to lo_data.inodes_by_ids.

Unfortunately, we cannot rely on being able to generate file handles
every time.  Therefore, we still enter every lo_inode object into
inodes_by_ids, but having an entry in inodes_by_handle is optional.  A
potential inodes_by_handle entry then has precedence, the inodes_by_ids
entry is just a fallback.

Note that we do not generate lo_fhandle objects yet, and so we also do
not enter anything into the inodes_by_handle map yet.  Also, all lookups
skip that map.  We could manually create file handles with some code
that is immediately removed by the next patch again, but that would
break the assumption in lo_find() that every lo_inode with a non-NULL
.fhandle must have an entry in inodes_by_handle and vice versa.  So we
leave actually using the inodes_by_handle map for the next patch.

[1] If some application in the guest still has the file open, there is
going to be a corresponding FD mapping in lo_data.fd_map.  In such a
case, the inode will only go away once every application in the guest
has closed it.  The problem described only applies to cases where the
guest does not have the file open, and it is just in the dentry cache,
basically.

Signed-off-by: Max Reitz 
---
 tools/virtiofsd/passthrough_ll.c | 81 +---
 1 file changed, 65 insertions(+), 16 deletions(-)

diff --git a/tools/virtiofsd/passthrough_ll.c b/tools/virtiofsd/passthrough_ll.c
index 487448d666..f9d8b2f134 100644
--- a/tools/virtiofsd/passthrough_ll.c
+++ b/tools/virtiofsd/passthrough_ll.c
@@ -180,7 +180,8 @@ struct lo_data {
 int announce_submounts;
 bool use_statx;
 struct lo_inode root;
-GHashTable *inodes; /* protected by lo->mutex */
+GHashTable *inodes_by_ids; /* protected by lo->mutex */
+GHashTable *inodes_by_handle; /* protected by lo->mutex */
 struct lo_map ino_map; /* protected by lo->mutex */
 struct lo_map dirp_map; /* protected by lo->mutex */
 struct lo_map fd_map; /* protected by lo->mutex */
@@ -263,8 +264,9 @@ static struct {
 /* That we loaded cap-ng in the current thread from the saved */
 static __thread bool cap_loaded = 0;
 
-static struct lo_inode *lo_find(struct lo_data *lo, struct stat *st,
-uint64_t mnt_id);
+static struct lo_inode *lo_find(struct lo_data *lo,
+const struct lo_fhandle *fhandle,
+struct stat *st, uint64_t mnt_id);
 static int xattr_map_client(const struct lo_data *lo, const char *client_name,
 char **out_name);
 
@@ -1064,18 +1066,40 @@ out_err:
 fuse_reply_err(req, saverr);
 }
 
-static struct lo_inode *lo_find(struct lo_data *lo, struct stat *st,
-uint64_t mnt_id)
+static struct lo_inode *lo_find(struct lo_data *lo,
+const struct lo_fhandle *fhandle,
+struct stat *st, uint64_t mnt_id)
 {
-struct lo_inode *p;
-struct lo_key key = {
+struct lo_inode *p = NULL;
+struct lo_key ids_key = {
 .ino = st->st_ino,
 .dev = st->st_dev,
 .mnt_id = mnt_id,
 };
 
 pthread_mutex_lock(&lo->mutex);
-p = g_hash_table_lookup(lo->inodes, &key);
+if (fhandle) {
+p = g_hash_table_lookup(lo->inodes_by_handle, fhandle);
+}
+if (!p) {
+p = g_hash_table_lookup(lo->inodes_by_ids, &ids_key);
+/*
+ * When we had to fall back to looking up an inode by its
+ * inode ID, ensure that we hit an entry that has a valid file
+ * descriptor.  Having an FD open means that the inode cannot
+ * really be deleted until the FD is closed, so that the inode
+ *

[PATCH v3 05/10] virtiofsd: Let lo_fd() return a TempFd

2021-07-30 Thread Max Reitz
Accessing lo_inode.fd must generally happen through lo_inode_fd(), and
lo_fd() is no exception; and then it must pass on the TempFd it has
received from lo_inode_fd().

(Note that all lo_fd() calls now use proper error handling, whereas all
of them were in-line before; i.e. they were used in place of the fd
argument of some function call.  This only worked because the only error
that could occur was that lo_inode() failed to find the inode ID: Then
-1 would be passed as the fd, which would result in an EBADF error,
which is precisely what we would want to return to the guest for an
invalid inode ID.
Now, though, lo_inode_fd() might potentially invoke open_by_handle_at(),
which can return many different errors, and they should be properly
handled and returned to the guest.  So we can no longer allow lo_fd() to
be used in-line, and instead need to do proper error handling for it.)
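
In code, the change amounts to the following (sketch, condensed from the
lo_getattr() hunk below):

    /* before: in-line use; any lookup error is collapsed into EBADF */
    res = fstatat(lo_fd(req, ino), "", &buf,
                  AT_EMPTY_PATH | AT_SYMLINK_NOFOLLOW);

    /* after: explicit error handling; the TempFd is released
     * automatically when it goes out of scope */
    g_auto(TempFd) ino_fd = TEMP_FD_INIT;

    res = lo_fd(req, ino, &ino_fd);
    if (res < 0) {
        return (void)fuse_reply_err(req, -res);
    }
    res = fstatat(ino_fd.fd, "", &buf,
                  AT_EMPTY_PATH | AT_SYMLINK_NOFOLLOW);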

Signed-off-by: Max Reitz 
Reviewed-by: Connor Kuehl 
---
 tools/virtiofsd/passthrough_ll.c | 55 +---
 1 file changed, 44 insertions(+), 11 deletions(-)

diff --git a/tools/virtiofsd/passthrough_ll.c b/tools/virtiofsd/passthrough_ll.c
index 86b901cf19..9e1bc37af8 100644
--- a/tools/virtiofsd/passthrough_ll.c
+++ b/tools/virtiofsd/passthrough_ll.c
@@ -650,18 +650,19 @@ static int lo_inode_fd(const struct lo_inode *inode, 
TempFd *tfd)
  * they are done with the fd.  This will be done in a later patch to make
  * review easier.
  */
-static int lo_fd(fuse_req_t req, fuse_ino_t ino)
+static int lo_fd(fuse_req_t req, fuse_ino_t ino, TempFd *tfd)
 {
 struct lo_inode *inode = lo_inode(req, ino);
-int fd;
+int res;
 
 if (!inode) {
-return -1;
+return -EBADF;
 }
 
-fd = inode->fd;
+res = lo_inode_fd(inode, tfd);
+
 lo_inode_put(lo_data(req), &inode);
-return fd;
+return res;
 }
 
 /*
@@ -798,14 +799,19 @@ static void lo_init(void *userdata, struct fuse_conn_info 
*conn)
 static void lo_getattr(fuse_req_t req, fuse_ino_t ino,
struct fuse_file_info *fi)
 {
+g_auto(TempFd) ino_fd = TEMP_FD_INIT;
 int res;
 struct stat buf;
 struct lo_data *lo = lo_data(req);
 
 (void)fi;
 
-res =
-fstatat(lo_fd(req, ino), "", &buf, AT_EMPTY_PATH | AT_SYMLINK_NOFOLLOW);
+res = lo_fd(req, ino, &ino_fd);
+if (res < 0) {
+return (void)fuse_reply_err(req, -res);
+}
+
+res = fstatat(ino_fd.fd, "", &buf, AT_EMPTY_PATH | AT_SYMLINK_NOFOLLOW);
 if (res == -1) {
 return (void)fuse_reply_err(req, errno);
 }
@@ -1529,6 +1535,7 @@ out:
 
 static void lo_rmdir(fuse_req_t req, fuse_ino_t parent, const char *name)
 {
+g_auto(TempFd) parent_fd = TEMP_FD_INIT;
 int res;
 struct lo_inode *inode;
 struct lo_data *lo = lo_data(req);
@@ -1543,13 +1550,19 @@ static void lo_rmdir(fuse_req_t req, fuse_ino_t parent, 
const char *name)
 return;
 }
 
+res = lo_fd(req, parent, &parent_fd);
+if (res < 0) {
+fuse_reply_err(req, -res);
+return;
+}
+
 inode = lookup_name(req, parent, name);
 if (!inode) {
 fuse_reply_err(req, EIO);
 return;
 }
 
-res = unlinkat(lo_fd(req, parent), name, AT_REMOVEDIR);
+res = unlinkat(parent_fd.fd, name, AT_REMOVEDIR);
 
 fuse_reply_err(req, res == -1 ? errno : 0);
 unref_inode_lolocked(lo, inode, 1);
@@ -1635,6 +1648,7 @@ out:
 
 static void lo_unlink(fuse_req_t req, fuse_ino_t parent, const char *name)
 {
+g_auto(TempFd) parent_fd = TEMP_FD_INIT;
 int res;
 struct lo_inode *inode;
 struct lo_data *lo = lo_data(req);
@@ -1649,13 +1663,19 @@ static void lo_unlink(fuse_req_t req, fuse_ino_t 
parent, const char *name)
 return;
 }
 
+res = lo_fd(req, parent, &parent_fd);
+if (res < 0) {
+fuse_reply_err(req, -res);
+return;
+}
+
 inode = lookup_name(req, parent, name);
 if (!inode) {
 fuse_reply_err(req, EIO);
 return;
 }
 
-res = unlinkat(lo_fd(req, parent), name, 0);
+res = unlinkat(parent_fd.fd, name, 0);
 
 fuse_reply_err(req, res == -1 ? errno : 0);
 unref_inode_lolocked(lo, inode, 1);
@@ -1735,10 +1755,16 @@ static void lo_forget_multi(fuse_req_t req, size_t 
count,
 
 static void lo_readlink(fuse_req_t req, fuse_ino_t ino)
 {
+g_auto(TempFd) ino_fd = TEMP_FD_INIT;
 char buf[PATH_MAX + 1];
 int res;
 
-res = readlinkat(lo_fd(req, ino), "", buf, sizeof(buf));
+res = lo_fd(req, ino, &ino_fd);
+if (res < 0) {
+return (void)fuse_reply_err(req, -res);
+}
+
+res = readlinkat(ino_fd.fd, "", buf, sizeof(buf));
 if (res == -1) {
 return (void)fuse_reply_err(req, errno);
 }
@@ -2535,10 +2561,17 @@ static void lo_write_buf(fuse_req_t req, fuse_ino_t ino,
 
 static void lo_statfs(fuse_req_t req, fuse_ino_t ino)
 {
+g_auto(TempFd) ino_fd = TEMP_FD_INIT;
 int res;
 struct statvfs stbuf;
 
-res = fsta

[PATCH v3 01/10] virtiofsd: Limit setxattr()'s creds-dropped region

2021-07-30 Thread Max Reitz
We only need to drop/switch our credentials for the (f)setxattr() call
alone, not for the openat() or fchdir() around it.

(Right now, this may not be that big of a problem, but with inodes being
identified by file handles instead of an O_PATH fd, we will need
open_by_handle_at() calls here, which is really fickle when it comes to
credentials being dropped.)
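
Schematically, the region under dropped credentials shrinks to just the
xattr call itself (sketch, simplified from the diff below):

    open_inode = S_ISREG(inode->filetype) || S_ISDIR(inode->filetype);
    if (open_inode) {
        fd = openat(lo->proc_self_fd, procname, O_RDONLY); /* own creds */
    } else {
        FCHDIR_NOFAIL(lo->proc_self_fd);                   /* own creds */
    }

    lo_change_cred(req, &old, false /* simplified */);
    /* creds dropped only from here ... */
    if (open_inode) {
        ret = fsetxattr(fd, name, value, size, flags);
    } else {
        ret = setxattr(procname, name, value, size, flags);
    }
    lo_restore_cred(&old, false);
    /* ... to here */

    if (!open_inode) {
        FCHDIR_NOFAIL(lo->root.fd);  /* change CWD back with own creds */
    }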

Signed-off-by: Max Reitz 
---
 tools/virtiofsd/passthrough_ll.c | 34 +++-
 1 file changed, 25 insertions(+), 9 deletions(-)

diff --git a/tools/virtiofsd/passthrough_ll.c b/tools/virtiofsd/passthrough_ll.c
index 38b2af8599..1f27eeabc5 100644
--- a/tools/virtiofsd/passthrough_ll.c
+++ b/tools/virtiofsd/passthrough_ll.c
@@ -3121,6 +3121,7 @@ static void lo_setxattr(fuse_req_t req, fuse_ino_t ino, 
const char *in_name,
 bool switched_creds = false;
 bool cap_fsetid_dropped = false;
 struct lo_cred old = {};
+bool open_inode;
 
 if (block_xattr(lo, in_name)) {
 fuse_reply_err(req, EOPNOTSUPP);
@@ -3155,7 +3156,24 @@ static void lo_setxattr(fuse_req_t req, fuse_ino_t ino, 
const char *in_name,
 fuse_log(FUSE_LOG_DEBUG, "lo_setxattr(ino=%" PRIu64
  ", name=%s value=%s size=%zd)\n", ino, name, value, size);
 
+/*
+ * We can only open regular files or directories.  If the inode is
+ * something else, we have to enter /proc/self/fd and use
+ * setxattr() on the link's filename there.
+ */
+open_inode = S_ISREG(inode->filetype) || S_ISDIR(inode->filetype);
 sprintf(procname, "%i", inode->fd);
+if (open_inode) {
+fd = openat(lo->proc_self_fd, procname, O_RDONLY);
+if (fd < 0) {
+saverr = errno;
+goto out;
+}
+} else {
+/* fchdir should not fail here */
+FCHDIR_NOFAIL(lo->proc_self_fd);
+}
+
 /*
  * If we are setting posix access acl and if SGID needs to be
  * cleared, then switch to caller's gid and drop CAP_FSETID
@@ -3176,20 +3194,13 @@ static void lo_setxattr(fuse_req_t req, fuse_ino_t ino, 
const char *in_name,
 }
 switched_creds = true;
 }
-if (S_ISREG(inode->filetype) || S_ISDIR(inode->filetype)) {
-fd = openat(lo->proc_self_fd, procname, O_RDONLY);
-if (fd < 0) {
-saverr = errno;
-goto out;
-}
+if (open_inode) {
+assert(fd >= 0);
 ret = fsetxattr(fd, name, value, size, flags);
 saverr = ret == -1 ? errno : 0;
 } else {
-/* fchdir should not fail here */
-FCHDIR_NOFAIL(lo->proc_self_fd);
 ret = setxattr(procname, name, value, size, flags);
 saverr = ret == -1 ? errno : 0;
-FCHDIR_NOFAIL(lo->root.fd);
 }
 if (switched_creds) {
 if (cap_fsetid_dropped)
@@ -3198,6 +3209,11 @@ static void lo_setxattr(fuse_req_t req, fuse_ino_t ino, 
const char *in_name,
 lo_restore_cred(&old, false);
 }
 
+if (!open_inode) {
+/* Change CWD back, fchdir should not fail here */
+FCHDIR_NOFAIL(lo->root.fd);
+}
+
 out:
 if (fd >= 0) {
 close(fd);
-- 
2.31.1




[PATCH v3 04/10] virtiofsd: Add lo_inode_fd() helper

2021-07-30 Thread Max Reitz
Once we let lo_inode.fd be optional, we will need its users to open the
file handle stored in lo_inode instead.  This function will do that.

For now, it just returns lo_inode.fd, though.
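
The typical caller pattern then becomes (sketch, taken from the
lo_setattr() hunk below):

    g_auto(TempFd) inode_fd = TEMP_FD_INIT;

    res = lo_inode_fd(inode, &inode_fd);
    if (res < 0) {
        saverr = -res;
        goto out_err;
    }
    /* use inode_fd.fd instead of inode->fd from here on */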

Signed-off-by: Max Reitz 
---
 tools/virtiofsd/passthrough_ll.c | 150 +--
 1 file changed, 125 insertions(+), 25 deletions(-)

diff --git a/tools/virtiofsd/passthrough_ll.c b/tools/virtiofsd/passthrough_ll.c
index a444c3a7e2..86b901cf19 100644
--- a/tools/virtiofsd/passthrough_ll.c
+++ b/tools/virtiofsd/passthrough_ll.c
@@ -635,6 +635,16 @@ static struct lo_inode *lo_inode(fuse_req_t req, 
fuse_ino_t ino)
 return elem->inode;
 }
 
+static int lo_inode_fd(const struct lo_inode *inode, TempFd *tfd)
+{
+*tfd = (TempFd) {
+.fd = inode->fd,
+.owned = false,
+};
+
+return 0;
+}
+
 /*
  * TODO Remove this helper and force callers to hold an inode refcount until
  * they are done with the fd.  This will be done in a later patch to make
@@ -822,11 +832,11 @@ static int lo_fi_fd(fuse_req_t req, struct fuse_file_info 
*fi)
 static void lo_setattr(fuse_req_t req, fuse_ino_t ino, struct stat *attr,
int valid, struct fuse_file_info *fi)
 {
+g_auto(TempFd) inode_fd = TEMP_FD_INIT;
 int saverr;
 char procname[64];
 struct lo_data *lo = lo_data(req);
 struct lo_inode *inode;
-int ifd;
 int res;
 int fd = -1;
 
@@ -836,7 +846,11 @@ static void lo_setattr(fuse_req_t req, fuse_ino_t ino, 
struct stat *attr,
 return;
 }
 
-ifd = inode->fd;
+res = lo_inode_fd(inode, &inode_fd);
+if (res < 0) {
+saverr = -res;
+goto out_err;
+}
 
 /* If fi->fh is invalid we'll report EBADF later */
 if (fi) {
@@ -847,7 +861,7 @@ static void lo_setattr(fuse_req_t req, fuse_ino_t ino, 
struct stat *attr,
 if (fi) {
 res = fchmod(fd, attr->st_mode);
 } else {
-sprintf(procname, "%i", ifd);
+sprintf(procname, "%i", inode_fd.fd);
 res = fchmodat(lo->proc_self_fd, procname, attr->st_mode, 0);
 }
 if (res == -1) {
@@ -859,12 +873,13 @@ static void lo_setattr(fuse_req_t req, fuse_ino_t ino, 
struct stat *attr,
 uid_t uid = (valid & FUSE_SET_ATTR_UID) ? attr->st_uid : (uid_t)-1;
 gid_t gid = (valid & FUSE_SET_ATTR_GID) ? attr->st_gid : (gid_t)-1;
 
-saverr = drop_security_capability(lo, ifd);
+saverr = drop_security_capability(lo, inode_fd.fd);
 if (saverr) {
 goto out_err;
 }
 
-res = fchownat(ifd, "", uid, gid, AT_EMPTY_PATH | AT_SYMLINK_NOFOLLOW);
+res = fchownat(inode_fd.fd, "", uid, gid,
+   AT_EMPTY_PATH | AT_SYMLINK_NOFOLLOW);
 if (res == -1) {
 saverr = errno;
 goto out_err;
@@ -943,7 +958,7 @@ static void lo_setattr(fuse_req_t req, fuse_ino_t ino, 
struct stat *attr,
 if (fi) {
 res = futimens(fd, tv);
 } else {
-sprintf(procname, "%i", inode->fd);
+sprintf(procname, "%i", inode_fd.fd);
 res = utimensat(lo->proc_self_fd, procname, tv, 0);
 }
 if (res == -1) {
@@ -1058,7 +1073,8 @@ static int lo_do_lookup(fuse_req_t req, fuse_ino_t 
parent, const char *name,
 struct fuse_entry_param *e,
 struct lo_inode **inodep)
 {
-int newfd;
+g_auto(TempFd) dir_fd = TEMP_FD_INIT;
+int newfd = -1;
 int res;
 int saverr;
 uint64_t mnt_id;
@@ -1088,7 +1104,13 @@ static int lo_do_lookup(fuse_req_t req, fuse_ino_t 
parent, const char *name,
 name = ".";
 }
 
-newfd = openat(dir->fd, name, O_PATH | O_NOFOLLOW);
+res = lo_inode_fd(dir, &dir_fd);
+if (res < 0) {
+saverr = -res;
+goto out;
+}
+
+newfd = openat(dir_fd.fd, name, O_PATH | O_NOFOLLOW);
 if (newfd == -1) {
 goto out_err;
 }
@@ -1155,6 +1177,7 @@ static int lo_do_lookup(fuse_req_t req, fuse_ino_t 
parent, const char *name,
 
 out_err:
 saverr = errno;
+out:
 if (newfd != -1) {
 close(newfd);
 }
@@ -1312,6 +1335,7 @@ static void lo_mknod_symlink(fuse_req_t req, fuse_ino_t 
parent,
  const char *name, mode_t mode, dev_t rdev,
  const char *link)
 {
+g_auto(TempFd) dir_fd = TEMP_FD_INIT;
 int res;
 int saverr;
 struct lo_data *lo = lo_data(req);
@@ -1335,12 +1359,18 @@ static void lo_mknod_symlink(fuse_req_t req, fuse_ino_t 
parent,
 return;
 }
 
+res = lo_inode_fd(dir, &dir_fd);
+if (res < 0) {
+saverr = -res;
+goto out;
+}
+
 saverr = lo_change_cred(req, &old, lo->change_umask && !S_ISLNK(mode));
 if (saverr) {
 goto out;
 }
 
-res = mknod_wrapper(dir->fd, name, l

Re: [PATCH RFC 0/3] mirror: rework soft-cancelling READY mirror

2021-07-29 Thread Max Reitz

On 29.07.21 13:35, Vladimir Sementsov-Ogievskiy wrote:

29.07.2021 13:38, Max Reitz wrote:

On 29.07.21 12:02, Vladimir Sementsov-Ogievskiy wrote:

28.07.2021 10:00, Max Reitz wrote:

On 27.07.21 18:47, Vladimir Sementsov-Ogievskiy wrote:

Hi all!

That's an alternative to (part of) Max's
"[PATCH for-6.1? v2 0/7] mirror: Handle errors after READY cancel"
and shows my idea of handling the soft-cancelling READY mirror case
directly in qmp_block_job_cancel. And cleans up all other job cancelling
functions.

That's an untested draft, don't take it to heart :)


Well, I would have preferred it if you’d rebased this on top of 
that series, precisely because it’s an alternative to only part of 
it. And if it’s just an untested draft, that would have been even 
better, because it would’ve given a better idea on what the cleanup 
looks like.


There are also things like this series making cancel internally 
always a force-cancel, where I’m not sure whether we want that in 
the replication driver or not[1].  With my series, we add an 
explicit parameter, so we’re forced to think about it, and then in 
this series on top we can just drop the parameter for all 
force-cancel invocations again, and for all non-force-cancel 
invocations we would have to think a bit more.


I'm now not sure that patch 5 of your series is correct (see my last
answer to it); that's why I decided not to base on it.


Well, we can always take patch 5 from v1.  (Where I changed any 
job_is_cancelled() to job_cancel_requested() when it influenced the 
external interface.)


My series has the benefit of handling the soft-mirror-cancel case the
other way, and it handles mirror finalization in the soft-cancel case
properly.




Specifically as for this series, I don’t like job_complete_ex() 
very much, I think the parameter should be part of job_complete() 
itself.


That was my idea. But job_complete is passed as a function pointer, so
changing its prototype would be more work. But I think it's possible.


  If we think that’s too specific of a mirror parameter to include 
in normal job_complete(), well, then there shouldn’t be a 
job_complete_ex() either, and do_graph_change should be a property 
of the mirror job (perhaps as pivot_on_completion) that’s cleared 
by qmp_block_job_cancel() before invoking job_complete().


This way users will lose the ability to make a decision while the job is running..


On the contrary, it would be a precursor to letting the user change 
this property explicitly with a new QMP command.


But probably they don't actually need that. Moving the option to a
mirror job parameter seems a good option to me.




Max

[1] Although looking at it again now, it probably wants force-cancel.




What do you think of my idea to keep old bugs as is and just 
deprecate block-job-cancel and add a new interface for 
"no-graph-change mirror" case?


I don’t see a reason for that.  The fix isn’t that complicated.

Also, honestly, I don’t see a good reason for deprecating anything.



The current interface leads to a mess in the code, and that's bad. A
cancellation mode that is actually a kind of completion (with comments in
many places about that) shows me that the interface is not
good. It's a question of terminology, what to call "cancel". Also,
that's not the first time this question has arisen. Remember my recent
cancel-in-flight-requests series, when I thought that "cancel is
cancel" and didn't consider the soft-cancel of mirror. And reviewers
didn't catch it. I don't think that interface is good, it will always
confuse new developers and users. But that's just my opinion, I don't
impose it )


If we don't deprecate it, i.e. if we consider the old interface to be
good, then there's no reason for this series of mine or for introducing a new interface :)


I’m not against a better interface, I’m against using this current bug 
as an excuse to improve the interface.  We’ve known we want to improve 
the interface for quite a long time now, we don’t need an excuse for that.


If we use this bug as an excuse, I’m afraid of becoming hung up on 
interface discussions instead of just getting the bug fixed.  And we 
must get the bug fixed, it’s real, it’s kind of bad, and saying “it 
won’t appear with the new interface, let’s not worry about the old one” 
is not something I like.


OTOH, if we use this bug as an excuse, I’m also afraid of trying to rush 
the design instead of actually implementing the interface that we’ve 
always desired, i.e. where the user gets to choose the completion mode 
via yet-to-be-implemented some job property setter function.


As a final note (but this is precisely the interface discussion that I 
want to avoid for now), I said I don’t see a good reason for deprecating 
anything, because `job-cancel force=false` can just internally do 
`set-job-property .pivot_on_completion=false; job-complete`.  From an 
implementation perspective, that should be simple.
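
A toy model of that flow, with hypothetical names throughout
(pivot_on_completion and its setter do not exist yet), just to make the
control flow being discussed concrete:

    static void block_job_cancel(Job *job, bool force, Error **errp)
    {
        if (!force && job_is_ready(job)) {
            /* Soft cancel of a READY mirror is really a completion */
            set_job_property_pivot(job, false);  /* hypothetical setter */
            job_complete(job, errp);
        } else {
            job_user_cancel(job, force, errp);
        }
    }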


I understand that for users the existence of the `force` flag

Re: [PATCH RFC 0/3] mirror: rework soft-cancelling READY mirror

2021-07-29 Thread Max Reitz

On 29.07.21 12:02, Vladimir Sementsov-Ogievskiy wrote:

28.07.2021 10:00, Max Reitz wrote:

On 27.07.21 18:47, Vladimir Sementsov-Ogievskiy wrote:

Hi all!

That's an alternative to (part of) Max's
"[PATCH for-6.1? v2 0/7] mirror: Handle errors after READY cancel"
and shows' my idea of handling soft-cancelling READY mirror case
directly in qmp_block_job_cancel. And cleanup all other job cancelling
functions.

That's an untested draft, don't take it to heart :)


Well, I would have preferred it if you’d rebased this on top of that 
series, precisely because it’s an alternative to only part of it. And 
if it’s just an untested draft, that would have been even better, 
because it would’ve given a better idea on what the cleanup looks like.


There are also things like this series making cancel internally 
always a force-cancel, where I’m not sure whether we want that in the 
replication driver or not[1].  With my series, we add an explicit 
parameter, so we’re forced to think about it, and then in this series 
on top we can just drop the parameter for all force-cancel 
invocations again, and for all non-force-cancel invocations we would 
have to think a bit more.


I'm now not sure that patch 5 of your series is correct (see my last
answer to it); that's why I decided not to base on it.


Well, we can always take patch 5 from v1.  (Where I changed any 
job_is_cancelled() to job_cancel_requested() when it influenced the 
external interface.)


My series has the benefit of handling the soft-mirror-cancel case the
other way, and it handles mirror finalization in the soft-cancel case
properly.




Specifically as for this series, I don’t like job_complete_ex() very 
much, I think the parameter should be part of job_complete() itself.


That was my idea. But job_complete is passed as a function pointer, so
changing its prototype would be more work. But I think it's possible.


  If we think that’s too specific of a mirror parameter to include in 
normal job_complete(), well, then there shouldn’t be a 
job_complete_ex() either, and do_graph_change should be a property of 
the mirror job (perhaps as pivot_on_completion) that’s cleared by 
qmp_block_job_cancel() before invoking job_complete().


This way users will lose the ability to make a decision while the job is running..


On the contrary, it would be a precursor to letting the user change this 
property explicitly with a new QMP command.


But probably they don't actually need that. Moving the option to a
mirror job parameter seems a good option to me.




Max

[1] Although looking at it again now, it probably wants force-cancel.




What do you think of my idea to keep old bugs as is and just deprecate 
block-job-cancel and add a new interface for "no-graph-change mirror" 
case?


I don’t see a reason for that.  The fix isn’t that complicated.

Also, honestly, I don’t see a good reason for deprecating anything.

Max




Re: [PATCH RFC 0/3] mirror: rework soft-cancelling READY mirror

2021-07-28 Thread Max Reitz

On 27.07.21 18:47, Vladimir Sementsov-Ogievskiy wrote:

Hi all!

That's an alternative to (part of) Max's
"[PATCH for-6.1? v2 0/7] mirror: Handle errors after READY cancel"
and shows my idea of handling the soft-cancelling READY mirror case
directly in qmp_block_job_cancel. And cleans up all other job cancelling
functions.

That's an untested draft, don't take it to heart :)


Well, I would have preferred it if you’d rebased this on top of that 
series, precisely because it’s an alternative to only part of it. And if 
it’s just an untested draft, that would have been even better, because 
it would’ve given a better idea on what the cleanup looks like.


There are also things like this series making cancel internally always a 
force-cancel, where I’m not sure whether we want that in the replication 
driver or not[1].  With my series, we add an explicit parameter, so 
we’re forced to think about it, and then in this series on top we can 
just drop the parameter for all force-cancel invocations again, and for 
all non-force-cancel invocations we would have to think a bit more.


Specifically as for this series, I don’t like job_complete_ex() very 
much, I think the parameter should be part of job_complete() itself.  If 
we think that’s too specific of a mirror parameter to include in normal 
job_complete(), well, then there shouldn’t be a job_complete_ex() 
either, and do_graph_change should be a property of the mirror job 
(perhaps as pivot_on_completion) that’s cleared by 
qmp_block_job_cancel() before invoking job_complete().


Max

[1] Although looking at it again now, it probably wants force-cancel.




Re: [PATCH for-6.1? v2 6/7] mirror: Check job_is_cancelled() earlier

2021-07-27 Thread Max Reitz

On 27.07.21 15:13, Vladimir Sementsov-Ogievskiy wrote:

26.07.2021 17:46, Max Reitz wrote:

We must check whether the job is force-cancelled early in our main loop,
most importantly before any `continue` statement.  For example, we used
to have `continue`s before our current checking location that are
triggered by `mirror_flush()` failing.  So, if `mirror_flush()` kept
failing, force-cancelling the job would not terminate it.

A job being force-cancelled should be treated the same as the job having
failed, so put the check in the same place where we check `s->ret < 0`.

Buglink: https://gitlab.com/qemu-project/qemu/-/issues/462
Signed-off-by: Max Reitz 
---
  block/mirror.c | 7 +--
  1 file changed, 1 insertion(+), 6 deletions(-)

diff --git a/block/mirror.c b/block/mirror.c
index 72e02fa34e..46d1a1e5a2 100644
--- a/block/mirror.c
+++ b/block/mirror.c
@@ -993,7 +993,7 @@ static int coroutine_fn mirror_run(Job *job, 
Error **errp)

  mirror_wait_for_any_operation(s, true);
  }
  -    if (s->ret < 0) {
+    if (s->ret < 0 || job_is_cancelled(&s->common.job)) {
  ret = s->ret;
  goto immediate_exit;
  }
@@ -1078,8 +1078,6 @@ static int coroutine_fn mirror_run(Job *job, 
Error **errp)

  break;
  }
  -    ret = 0;
-


That's just a cleanup, that statement is useless pre-patch, yes?


I think it was intended for the case where we left this loop via the
job_is_cancelled() condition below.  Since it's removed, this statement
seems meaningless, so I removed it along with the `break`.


Max




  if (job_is_ready(&s->common.job) && !should_complete) {
  delay_ns = (s->in_flight == 0 &&
  cnt == 0 ? BLOCK_JOB_SLICE_TIME : 0);
@@ -1087,9 +1085,6 @@ static int coroutine_fn mirror_run(Job *job, 
Error **errp)
  trace_mirror_before_sleep(s, cnt, job_is_ready(&s->common.job),
    delay_ns);
  job_sleep_ns(&s->common.job, delay_ns);
-    if (job_is_cancelled(&s->common.job)) {
-    break;
-    }
  s->last_pause_ns = qemu_clock_get_ns(QEMU_CLOCK_REALTIME);
  }



Reviewed-by: Vladimir Sementsov-Ogievskiy 






Re: [PATCH for-6.1? v2 5/7] job: Add job_cancel_requested()

2021-07-27 Thread Max Reitz

On 27.07.21 15:04, Vladimir Sementsov-Ogievskiy wrote:

26.07.2021 17:46, Max Reitz wrote:

Most callers of job_is_cancelled() actually want to know whether the job
is on its way to immediate termination.  For example, we refuse to pause
jobs that are cancelled; but this only makes sense for jobs that are
really actually cancelled.

A mirror job that is cancelled during READY with force=false should
absolutely be allowed to pause.  This "cancellation" (which is actually
a kind of completion) may take an indefinite amount of time, and so
should behave like any job during normal operation.  For example, with
on-target-error=stop, the job should stop on write errors.  (In
contrast, force-cancelled jobs should not get write errors, as they
should just terminate and not do further I/O.)

Therefore, redefine job_is_cancelled() to only return true for jobs that
are force-cancelled (which as of HEAD^ means any job that interprets the
cancellation request as a request for immediate termination), and add
job_cancel_request() as the general variant, which returns true for any


job_cancel_requested()


jobs which have been requested to be cancelled, whether it be
immediately or after an arbitrarily long completion phase.

Buglink: https://gitlab.com/qemu-project/qemu/-/issues/462
Signed-off-by: Max Reitz 
---
  include/qemu/job.h |  8 +++-
  block/mirror.c | 10 --
  job.c  |  7 ++-
  3 files changed, 17 insertions(+), 8 deletions(-)

diff --git a/include/qemu/job.h b/include/qemu/job.h
index 8aa90f7395..032edf3c5f 100644
--- a/include/qemu/job.h
+++ b/include/qemu/job.h
@@ -436,9 +436,15 @@ const char *job_type_str(const Job *job);
  /** Returns true if the job should not be visible to the management 
layer. */

  bool job_is_internal(Job *job);
  -/** Returns whether the job is scheduled for cancellation. */
+/** Returns whether the job is being cancelled. */
  bool job_is_cancelled(Job *job);
  +/**
+ * Returns whether the job is scheduled for cancellation (at an
+ * indefinite point).
+ */
+bool job_cancel_requested(Job *job);
+
  /** Returns whether the job is in a completed state. */
  bool job_is_completed(Job *job);
  diff --git a/block/mirror.c b/block/mirror.c
index e93631a9f6..72e02fa34e 100644
--- a/block/mirror.c
+++ b/block/mirror.c
@@ -936,7 +936,7 @@ static int coroutine_fn mirror_run(Job *job, 
Error **errp)

  /* Transition to the READY state and wait for complete. */
  job_transition_to_ready(&s->common.job);
  s->actively_synced = true;
-    while (!job_is_cancelled(&s->common.job) && !s->should_complete) {
+    while (!job_cancel_requested(&s->common.job) && !s->should_complete) {
  job_yield(&s->common.job);
  }
  s->common.job.cancelled = false;
@@ -1043,7 +1043,7 @@ static int coroutine_fn mirror_run(Job *job, 
Error **errp)

  }
    should_complete = s->should_complete ||
-    job_is_cancelled(&s->common.job);
+    job_cancel_requested(&s->common.job);
  cnt = bdrv_get_dirty_count(s->dirty_bitmap);
  }
  @@ -1087,7 +1087,7 @@ static int coroutine_fn mirror_run(Job *job, 
Error **errp)
  trace_mirror_before_sleep(s, cnt, job_is_ready(&s->common.job),
    delay_ns);
  job_sleep_ns(&s->common.job, delay_ns);
-    if (job_is_cancelled(&s->common.job) && s->common.job.force_cancel) {
+    if (job_is_cancelled(&s->common.job)) {
  break;
  }
  s->last_pause_ns = qemu_clock_get_ns(QEMU_CLOCK_REALTIME);
@@ -1099,9 +1099,7 @@ immediate_exit:
   * or it was cancelled prematurely so that we do not 
guarantee that

   * the target is a copy of the source.
   */
-    assert(ret < 0 ||
-   (s->common.job.force_cancel &&
-    job_is_cancelled(&s->common.job)));
+    assert(ret < 0 || job_is_cancelled(&s->common.job));


(As a note, I hope this does the job regarding your suggestions for 
patch 4. :))



  assert(need_drain);
  mirror_wait_for_all_io(s);
  }
diff --git a/job.c b/job.c
index e78d893a9c..dba17a680f 100644
--- a/job.c
+++ b/job.c
@@ -216,6 +216,11 @@ const char *job_type_str(const Job *job)
  }
    bool job_is_cancelled(Job *job)
+{
+    return job->cancelled && job->force_cancel;


Can job->cancelled be false when job->force_cancel is true?  I think
not, and it's worth an assertion here. Something like


if (job->force_cancel) {
   assert(job->cancelled);
   return true;
}

return false;


Sounds good, why not.




+}
+
+bool job_cancel_requested(Job *job)
  {
  return job->cancelled;
  }
@@ -1015,7 +1020,7 @@ void job_complete(Job *job, Error **errp)
  if (job_apply_verb(job, JOB_VERB_COMPLETE, errp)) {
  return;
  }
-    if (job_is_cancelled(job) || !j

Re: [PATCH for-6.1? 4/6] job: Add job_cancel_requested()

2021-07-27 Thread Max Reitz

On 27.07.21 16:47, Vladimir Sementsov-Ogievskiy wrote:

26.07.2021 10:09, Max Reitz wrote:



  job->ret = -ECANCELED;
  }
  if (job->ret) {
@@ -704,7 +709,7 @@ static int job_finalize_single(Job *job)
    /* Emit events only if we actually started */
  if (job_started(job)) {
-    if (job_is_cancelled(job)) {
+    if (job_cancel_requested(job)) {
  job_event_cancelled(job);


Same question here.. Shouldn't mirror report a COMPLETED event in the
case of a non-force cancel in the READY state?


Same here, I thought this is user-visible, nothing internal, so I 
should leave it as-is.


Now I see that cancelling mirror post-READY indeed should result in a 
COMPLETED event.  So I’m actually not exactly sure how mirror does 
that, despite this code here



Hmm. Now looking at mirror code, I see that it does 
"s->common.job.cancelled = false"


*lol*, that’s nice.

So since we’ve missed the rc1 boat now, I think this is 6.2 material.  
I’ll look into whether we can drop that then, that would be nice.


Max




[PATCH for-6.1? v2 5/7] job: Add job_cancel_requested()

2021-07-26 Thread Max Reitz
Most callers of job_is_cancelled() actually want to know whether the job
is on its way to immediate termination.  For example, we refuse to pause
jobs that are cancelled; but this only makes sense for jobs that are
really actually cancelled.

A mirror job that is cancelled during READY with force=false should
absolutely be allowed to pause.  This "cancellation" (which is actually
a kind of completion) may take an indefinite amount of time, and so
should behave like any job during normal operation.  For example, with
on-target-error=stop, the job should stop on write errors.  (In
contrast, force-cancelled jobs should not get write errors, as they
should just terminate and not do further I/O.)

Therefore, redefine job_is_cancelled() to only return true for jobs that
are force-cancelled (which as of HEAD^ means any job that interprets the
cancellation request as a request for immediate termination), and add
job_cancel_request() as the general variant, which returns true for any
jobs which have been requested to be cancelled, whether it be
immediately or after an arbitrarily long completion phase.

Buglink: https://gitlab.com/qemu-project/qemu/-/issues/462
Signed-off-by: Max Reitz 
---
 include/qemu/job.h |  8 +++-
 block/mirror.c | 10 --
 job.c  |  7 ++-
 3 files changed, 17 insertions(+), 8 deletions(-)

diff --git a/include/qemu/job.h b/include/qemu/job.h
index 8aa90f7395..032edf3c5f 100644
--- a/include/qemu/job.h
+++ b/include/qemu/job.h
@@ -436,9 +436,15 @@ const char *job_type_str(const Job *job);
 /** Returns true if the job should not be visible to the management layer. */
 bool job_is_internal(Job *job);
 
-/** Returns whether the job is scheduled for cancellation. */
+/** Returns whether the job is being cancelled. */
 bool job_is_cancelled(Job *job);
 
+/**
+ * Returns whether the job is scheduled for cancellation (at an
+ * indefinite point).
+ */
+bool job_cancel_requested(Job *job);
+
 /** Returns whether the job is in a completed state. */
 bool job_is_completed(Job *job);
 
diff --git a/block/mirror.c b/block/mirror.c
index e93631a9f6..72e02fa34e 100644
--- a/block/mirror.c
+++ b/block/mirror.c
@@ -936,7 +936,7 @@ static int coroutine_fn mirror_run(Job *job, Error **errp)
 /* Transition to the READY state and wait for complete. */
 job_transition_to_ready(&s->common.job);
 s->actively_synced = true;
-while (!job_is_cancelled(&s->common.job) && !s->should_complete) {
+while (!job_cancel_requested(&s->common.job) && !s->should_complete) {
 job_yield(&s->common.job);
 }
 s->common.job.cancelled = false;
@@ -1043,7 +1043,7 @@ static int coroutine_fn mirror_run(Job *job, Error **errp)
 }
 
 should_complete = s->should_complete ||
-job_is_cancelled(&s->common.job);
+job_cancel_requested(&s->common.job);
 cnt = bdrv_get_dirty_count(s->dirty_bitmap);
 }
 
@@ -1087,7 +1087,7 @@ static int coroutine_fn mirror_run(Job *job, Error **errp)
 trace_mirror_before_sleep(s, cnt, job_is_ready(&s->common.job),
   delay_ns);
 job_sleep_ns(&s->common.job, delay_ns);
-if (job_is_cancelled(&s->common.job) && s->common.job.force_cancel) {
+if (job_is_cancelled(&s->common.job)) {
 break;
 }
 s->last_pause_ns = qemu_clock_get_ns(QEMU_CLOCK_REALTIME);
@@ -1099,9 +1099,7 @@ immediate_exit:
  * or it was cancelled prematurely so that we do not guarantee that
  * the target is a copy of the source.
  */
-assert(ret < 0 ||
-   (s->common.job.force_cancel &&
-job_is_cancelled(&s->common.job)));
+assert(ret < 0 || job_is_cancelled(&s->common.job));
 assert(need_drain);
 mirror_wait_for_all_io(s);
 }
diff --git a/job.c b/job.c
index e78d893a9c..dba17a680f 100644
--- a/job.c
+++ b/job.c
@@ -216,6 +216,11 @@ const char *job_type_str(const Job *job)
 }
 
 bool job_is_cancelled(Job *job)
+{
+return job->cancelled && job->force_cancel;
+}
+
+bool job_cancel_requested(Job *job)
 {
 return job->cancelled;
 }
@@ -1015,7 +1020,7 @@ void job_complete(Job *job, Error **errp)
 if (job_apply_verb(job, JOB_VERB_COMPLETE, errp)) {
 return;
 }
-if (job_is_cancelled(job) || !job->driver->complete) {
+if (job_cancel_requested(job) || !job->driver->complete) {
 error_setg(errp, "The active block job '%s' cannot be completed",
job->id);
 return;
-- 
2.31.1




[PATCH for-6.1? v2 6/7] mirror: Check job_is_cancelled() earlier

2021-07-26 Thread Max Reitz
We must check whether the job is force-cancelled early in our main loop,
most importantly before any `continue` statement.  For example, we used
to have `continue`s before our current checking location that are
triggered by `mirror_flush()` failing.  So, if `mirror_flush()` kept
failing, force-cancelling the job would not terminate it.

A job being force-cancelled should be treated the same as the job having
failed, so put the check in the same place where we check `s->ret < 0`.

Buglink: https://gitlab.com/qemu-project/qemu/-/issues/462
Signed-off-by: Max Reitz 
---
 block/mirror.c | 7 +--
 1 file changed, 1 insertion(+), 6 deletions(-)

diff --git a/block/mirror.c b/block/mirror.c
index 72e02fa34e..46d1a1e5a2 100644
--- a/block/mirror.c
+++ b/block/mirror.c
@@ -993,7 +993,7 @@ static int coroutine_fn mirror_run(Job *job, Error **errp)
 mirror_wait_for_any_operation(s, true);
 }
 
-if (s->ret < 0) {
+if (s->ret < 0 || job_is_cancelled(&s->common.job)) {
 ret = s->ret;
 goto immediate_exit;
 }
@@ -1078,8 +1078,6 @@ static int coroutine_fn mirror_run(Job *job, Error **errp)
 break;
 }
 
-ret = 0;
-
 if (job_is_ready(&s->common.job) && !should_complete) {
 delay_ns = (s->in_flight == 0 &&
 cnt == 0 ? BLOCK_JOB_SLICE_TIME : 0);
@@ -1087,9 +1085,6 @@ static int coroutine_fn mirror_run(Job *job, Error **errp)
 trace_mirror_before_sleep(s, cnt, job_is_ready(&s->common.job),
   delay_ns);
 job_sleep_ns(&s->common.job, delay_ns);
-if (job_is_cancelled(&s->common.job)) {
-break;
-}
 s->last_pause_ns = qemu_clock_get_ns(QEMU_CLOCK_REALTIME);
 }
 
-- 
2.31.1




[PATCH for-6.1? v2 4/7] jobs: Give Job.force_cancel more meaning

2021-07-26 Thread Max Reitz
We largely have two cancel modes for jobs:

First, there is actual cancelling.  The job is terminated as soon as
possible, without trying to reach a consistent result.

Second, we have mirror in the READY state.  Technically, the job is not
really cancelled, but it just is a different completion mode.  The job
can still run for an indefinite amount of time while it tries to reach a
consistent result.

We want to be able to clearly distinguish which cancel mode a job is in
(when it has been cancelled).  We can use Job.force_cancel for this, but
right now it only reflects cancel requests from the user with
force=true, but clearly, jobs that do not even distinguish between
force=false and force=true are effectively always force-cancelled.

So this patch has Job.force_cancel signify whether the job will
terminate as soon as possible (force_cancel=true) or whether it will
effectively remain running despite being "cancelled"
(force_cancel=false).

To this end, we let jobs that provide JobDriver.cancel() tell the
generic job code whether they will terminate as soon as possible or not,
and for jobs that do not provide that method we assume they will.
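
A driver-side sketch of the new contract (hypothetical driver; compare
mirror_cancel() in the diff below):

    static bool mydriver_cancel(Job *job, bool force)
    {
        /* Treat any pre-READY cancellation as a force-cancellation */
        force = force || !job_is_ready(job);
        if (force) {
            /* stop in-flight I/O immediately */
        }
        /* Tell the generic layer whether this job will now terminate
         * as soon as possible (true) or keep running to completion
         * (false) */
        return force;
    }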

Signed-off-by: Max Reitz 
---
 include/qemu/job.h | 11 ++-
 block/backup.c |  3 ++-
 block/mirror.c | 24 ++--
 job.c  |  6 +-
 4 files changed, 35 insertions(+), 9 deletions(-)

diff --git a/include/qemu/job.h b/include/qemu/job.h
index 5e8edbc2c8..8aa90f7395 100644
--- a/include/qemu/job.h
+++ b/include/qemu/job.h
@@ -253,8 +253,17 @@ struct JobDriver {
 
 /**
  * If the callback is not NULL, it will be invoked in job_cancel_async
+ *
+ * This function must return true if the job will be cancelled
+ * immediately without any further I/O (mandatory if @force is
+ * true), and false otherwise.  This lets the generic job layer
+ * know whether a job has been truly (force-)cancelled, or whether
+ * it is just in a special completion mode (like mirror after
+ * READY).
+ * (If the callback is NULL, the job is assumed to terminate
+ * without I/O.)
  */
-void (*cancel)(Job *job, bool force);
+bool (*cancel)(Job *job, bool force);
 
 
 /** Called when the job is freed */
diff --git a/block/backup.c b/block/backup.c
index bd3614ce70..513e1c8a0b 100644
--- a/block/backup.c
+++ b/block/backup.c
@@ -331,11 +331,12 @@ static void coroutine_fn backup_set_speed(BlockJob *job, 
int64_t speed)
 }
 }
 
-static void backup_cancel(Job *job, bool force)
+static bool backup_cancel(Job *job, bool force)
 {
 BackupBlockJob *s = container_of(job, BackupBlockJob, common.job);
 
 bdrv_cancel_in_flight(s->target_bs);
+return true;
 }
 
 static const BlockJobDriver backup_job_driver = {
diff --git a/block/mirror.c b/block/mirror.c
index fcb7b65f93..e93631a9f6 100644
--- a/block/mirror.c
+++ b/block/mirror.c
@@ -1087,9 +1087,7 @@ static int coroutine_fn mirror_run(Job *job, Error **errp)
 trace_mirror_before_sleep(s, cnt, job_is_ready(&s->common.job),
   delay_ns);
 job_sleep_ns(&s->common.job, delay_ns);
-if (job_is_cancelled(&s->common.job) &&
-(!job_is_ready(&s->common.job) || s->common.job.force_cancel))
-{
+if (job_is_cancelled(&s->common.job) && s->common.job.force_cancel) {
 break;
 }
 s->last_pause_ns = qemu_clock_get_ns(QEMU_CLOCK_REALTIME);
@@ -1102,7 +1100,7 @@ immediate_exit:
  * the target is a copy of the source.
  */
 assert(ret < 0 ||
-   ((s->common.job.force_cancel || !job_is_ready(&s->common.job)) &&
+   (s->common.job.force_cancel &&
 job_is_cancelled(&s->common.job)));
 assert(need_drain);
 mirror_wait_for_all_io(s);
@@ -1188,14 +1186,27 @@ static bool mirror_drained_poll(BlockJob *job)
 return !!s->in_flight;
 }
 
-static void mirror_cancel(Job *job, bool force)
+static bool mirror_cancel(Job *job, bool force)
 {
 MirrorBlockJob *s = container_of(job, MirrorBlockJob, common.job);
 BlockDriverState *target = blk_bs(s->target);
 
-if (force || !job_is_ready(job)) {
+/*
+ * Before the job is READY, we treat any cancellation like a
+ * force-cancellation.
+ */
+force = force || !job_is_ready(job);
+
+if (force) {
 bdrv_cancel_in_flight(target);
 }
+return force;
+}
+
+static bool commit_active_cancel(Job *job, bool force)
+{
+/* Same as above in mirror_cancel() */
+return force || !job_is_ready(job);
 }
 
 static const BlockJobDriver mirror_job_driver = {
@@ -1225,6 +1236,7 @@ static const BlockJobDriver commit_active_job_driver = {
 .abort  = mirror_abort,
 .pause  = mirror_pause,
 .complete   = mirror_complete,
+.cancel = commit_active_ca

[PATCH for-6.1? v2 7/7] iotests: Add mirror-ready-cancel-error test

2021-07-26 Thread Max Reitz
Test what happens when there is an I/O error after a mirror job in the
READY phase has been cancelled.

Signed-off-by: Max Reitz 
---
 .../tests/mirror-ready-cancel-error   | 143 ++
 .../tests/mirror-ready-cancel-error.out   |   5 +
 2 files changed, 148 insertions(+)
 create mode 100755 tests/qemu-iotests/tests/mirror-ready-cancel-error
 create mode 100644 tests/qemu-iotests/tests/mirror-ready-cancel-error.out

diff --git a/tests/qemu-iotests/tests/mirror-ready-cancel-error 
b/tests/qemu-iotests/tests/mirror-ready-cancel-error
new file mode 100755
index 00..f2dc1f
--- /dev/null
+++ b/tests/qemu-iotests/tests/mirror-ready-cancel-error
@@ -0,0 +1,143 @@
+#!/usr/bin/env python3
+# group: rw quick
+#
+# Test what happens when errors occur to a mirror job after it has
+# been cancelled in the READY phase
+#
+# Copyright (C) 2021 Red Hat, Inc.
+#
+# This program is free software; you can redistribute it and/or modify
+# it under the terms of the GNU General Public License as published by
+# the Free Software Foundation; either version 2 of the License, or
+# (at your option) any later version.
+#
+# This program is distributed in the hope that it will be useful,
+# but WITHOUT ANY WARRANTY; without even the implied warranty of
+# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+# GNU General Public License for more details.
+#
+# You should have received a copy of the GNU General Public License
+# along with this program.  If not, see <http://www.gnu.org/licenses/>.
+#
+
+import os
+import iotests
+
+
+image_size = 1 * 1024 * 1024
+source = os.path.join(iotests.test_dir, 'source.img')
+target = os.path.join(iotests.test_dir, 'target.img')
+
+
+class TestMirrorReadyCancelError(iotests.QMPTestCase):
+def setUp(self) -> None:
+assert iotests.qemu_img_create('-f', iotests.imgfmt, source,
+   str(image_size)) == 0
+assert iotests.qemu_img_create('-f', iotests.imgfmt, target,
+   str(image_size)) == 0
+
+self.vm = iotests.VM()
+self.vm.launch()
+
+def tearDown(self) -> None:
+self.vm.shutdown()
+os.remove(source)
+os.remove(target)
+
+def add_blockdevs(self, once: bool) -> None:
+res = self.vm.qmp('blockdev-add',
+  **{'node-name': 'source',
+ 'driver': iotests.imgfmt,
+ 'file': {
+ 'driver': 'file',
+ 'filename': source
+ }})
+self.assert_qmp(res, 'return', {})
+
+# blkdebug notes:
+# Enter state 2 on the first flush, which happens before the
+# job enters the READY state.  The second flush will happen
+# when the job is about to complete, and we want that one to
+# fail.
+res = self.vm.qmp('blockdev-add',
+  **{'node-name': 'target',
+ 'driver': iotests.imgfmt,
+ 'file': {
+ 'driver': 'blkdebug',
+ 'image': {
+ 'driver': 'file',
+ 'filename': target
+ },
+ 'set-state': [{
+ 'event': 'flush_to_disk',
+ 'state': 1,
+ 'new_state': 2
+ }],
+ 'inject-error': [{
+ 'event': 'flush_to_disk',
+ 'once': once,
+ 'immediately': True,
+ 'state': 2
+ }]}})
+self.assert_qmp(res, 'return', {})
+
+def start_mirror(self) -> None:
+res = self.vm.qmp('blockdev-mirror',
+  job_id='mirror',
+  device='source',
+  target='target',
+  filter_node_name='mirror-top',
+  sync='full',
+  on_target_error='stop')
+self.assert_qmp(res, 'return', {})
+
+def cancel_mirror_with_error(self) -> None:
+self.vm.event_wait('BLOCK_JOB_READY')
+
+# Write something so the job will not leave immediately, but
+# flush first (which will fail, thanks to blkdebug)
+res = self.vm.qmp('human-monitor-command',
+  command_line='qemu-io mirror-top "write 0 64k"')
+self.assert_qmp(res, 'return', '')
+
+# Drain status change events
+while self.vm.event_wait('JOB_STATUS_CHANGE', timeout=0.0) is not None:
+pass
+
+res = self.vm

[PATCH for-6.1? v2 3/7] job: @force parameter for job_cancel_sync{, _all}()

2021-07-26 Thread Max Reitz
Callers should be able to specify whether they want job_cancel_sync() to
force-cancel the job or not.

In fact, almost all invocations do not care about consistency of the
result and just want the job to terminate as soon as possible, so they
should pass force=true.  The replication block driver is the exception.

This changes some iotest outputs, because quitting qemu while a mirror
job is active will now lead to it being cancelled instead of completed,
which is what we want.  (Cancelling a READY mirror job with force=false
may take an indefinite amount of time, which we do not want when
quitting.  If users want consistent results, they must have all jobs be
done before they quit qemu.)

Buglink: https://gitlab.com/qemu-project/qemu/-/issues/462
Signed-off-by: Max Reitz 
Reviewed-by: Vladimir Sementsov-Ogievskiy 
---
 include/qemu/job.h| 10 ++---
 block/replication.c   |  4 +-
 blockdev.c|  4 +-
 job.c | 27 +---
 qemu-nbd.c|  2 +-
 softmmu/runstate.c|  2 +-
 storage-daemon/qemu-storage-daemon.c  |  2 +-
 tests/unit/test-block-iothread.c  |  2 +-
 tests/unit/test-blockjob.c|  2 +-
 tests/qemu-iotests/109.out| 60 +++
 tests/qemu-iotests/tests/qsd-jobs.out |  2 +-
 11 files changed, 61 insertions(+), 56 deletions(-)

diff --git a/include/qemu/job.h b/include/qemu/job.h
index 41162ed494..5e8edbc2c8 100644
--- a/include/qemu/job.h
+++ b/include/qemu/job.h
@@ -506,19 +506,19 @@ void job_user_cancel(Job *job, bool force, Error **errp);
 
 /**
  * Synchronously cancel the @job.  The completion callback is called
- * before the function returns.  The job may actually complete
- * instead of canceling itself; the circumstances under which this
- * happens depend on the kind of job that is active.
+ * before the function returns.  If @force is false, the job may
+ * actually complete instead of canceling itself; the circumstances
+ * under which this happens depend on the kind of job that is active.
  *
  * Returns the return value from the job if the job actually completed
  * during the call, or -ECANCELED if it was canceled.
  *
  * Callers must hold the AioContext lock of job->aio_context.
  */
-int job_cancel_sync(Job *job);
+int job_cancel_sync(Job *job, bool force);
 
 /** Synchronously cancels all jobs using job_cancel_sync(). */
-void job_cancel_sync_all(void);
+void job_cancel_sync_all(bool force);
 
 /**
  * @job: The job to be completed.
diff --git a/block/replication.c b/block/replication.c
index 32444b9a8f..e7a9327b12 100644
--- a/block/replication.c
+++ b/block/replication.c
@@ -149,7 +149,7 @@ static void replication_close(BlockDriverState *bs)
 if (s->stage == BLOCK_REPLICATION_FAILOVER) {
 commit_job = &s->commit_job->job;
 assert(commit_job->aio_context == qemu_get_current_aio_context());
-job_cancel_sync(commit_job);
+job_cancel_sync(commit_job, false);
 }
 
 if (s->mode == REPLICATION_MODE_SECONDARY) {
@@ -726,7 +726,7 @@ static void replication_stop(ReplicationState *rs, bool 
failover, Error **errp)
  * disk, secondary disk in backup_job_completed().
  */
 if (s->backup_job) {
-job_cancel_sync(&s->backup_job->job);
+job_cancel_sync(&s->backup_job->job, false);
 }
 
 if (!failover) {
diff --git a/blockdev.c b/blockdev.c
index 3d8ac368a1..aa95918c02 100644
--- a/blockdev.c
+++ b/blockdev.c
@@ -1848,7 +1848,7 @@ static void drive_backup_abort(BlkActionState *common)
 aio_context = bdrv_get_aio_context(state->bs);
 aio_context_acquire(aio_context);
 
-job_cancel_sync(&state->job->job);
+job_cancel_sync(&state->job->job, true);
 
 aio_context_release(aio_context);
 }
@@ -1949,7 +1949,7 @@ static void blockdev_backup_abort(BlkActionState *common)
 aio_context = bdrv_get_aio_context(state->bs);
 aio_context_acquire(aio_context);
 
-job_cancel_sync(&state->job->job);
+job_cancel_sync(&state->job->job, true);
 
 aio_context_release(aio_context);
 }
diff --git a/job.c b/job.c
index e7a5d28854..9e971d64cf 100644
--- a/job.c
+++ b/job.c
@@ -763,7 +763,12 @@ static void job_completed_txn_abort(Job *job)
 if (other_job != job) {
 ctx = other_job->aio_context;
 aio_context_acquire(ctx);
-job_cancel_async(other_job, false);
+/*
+ * This is a transaction: If one job failed, no result will matter.
+ * Therefore, pass force=true to terminate all other jobs as 
quickly
+ * as possible.
+ */
+job_cancel_async(other_job, true);
 aio_context_release(ctx);
 }
 }
@@ -964,12 +969,24 @@ static void job_cancel_err(Job *job, Error **errp)
 job_can

[PATCH for-6.1? v2 2/7] mirror: Drop s->synced

2021-07-26 Thread Max Reitz
As of HEAD^, there is no meaning to s->synced other than whether the job
is READY or not.  job_is_ready() gives us that information, too.

Suggested-by: Vladimir Sementsov-Ogievskiy 
Signed-off-by: Max Reitz 
---
 block/mirror.c | 19 +--
 1 file changed, 9 insertions(+), 10 deletions(-)

diff --git a/block/mirror.c b/block/mirror.c
index d73b704473..fcb7b65f93 100644
--- a/block/mirror.c
+++ b/block/mirror.c
@@ -56,7 +56,6 @@ typedef struct MirrorBlockJob {
 bool zero_target;
 MirrorCopyMode copy_mode;
 BlockdevOnError on_source_error, on_target_error;
-bool synced;
 /* Set when the target is synced (dirty bitmap is clean, nothing
  * in flight) and the job is running in active mode */
 bool actively_synced;
@@ -936,7 +935,6 @@ static int coroutine_fn mirror_run(Job *job, Error **errp)
 if (s->bdev_length == 0) {
 /* Transition to the READY state and wait for complete. */
 job_transition_to_ready(&s->common.job);
-s->synced = true;
 s->actively_synced = true;
 while (!job_is_cancelled(&s->common.job) && !s->should_complete) {
 job_yield(&s->common.job);
@@ -1028,7 +1026,7 @@ static int coroutine_fn mirror_run(Job *job, Error **errp)
 should_complete = false;
 if (s->in_flight == 0 && cnt == 0) {
 trace_mirror_before_flush(s);
-if (!s->synced) {
+if (!job_is_ready(&s->common.job)) {
 if (mirror_flush(s) < 0) {
 /* Go check s->ret.  */
 continue;
@@ -1039,7 +1037,6 @@ static int coroutine_fn mirror_run(Job *job, Error **errp)
  * the target in a consistent state.
  */
 job_transition_to_ready(&s->common.job);
-s->synced = true;
 if (s->copy_mode != MIRROR_COPY_MODE_BACKGROUND) {
 s->actively_synced = true;
 }
@@ -1083,14 +1080,15 @@ static int coroutine_fn mirror_run(Job *job, Error 
**errp)
 
 ret = 0;
 
-if (s->synced && !should_complete) {
+if (job_is_ready(&s->common.job) && !should_complete) {
 delay_ns = (s->in_flight == 0 &&
 cnt == 0 ? BLOCK_JOB_SLICE_TIME : 0);
 }
-trace_mirror_before_sleep(s, cnt, s->synced, delay_ns);
+trace_mirror_before_sleep(s, cnt, job_is_ready(&s->common.job),
+  delay_ns);
 job_sleep_ns(&s->common.job, delay_ns);
 if (job_is_cancelled(&s->common.job) &&
-(!s->synced || s->common.job.force_cancel))
+(!job_is_ready(&s->common.job) || s->common.job.force_cancel))
 {
 break;
 }
@@ -1103,8 +1101,9 @@ immediate_exit:
  * or it was cancelled prematurely so that we do not guarantee that
  * the target is a copy of the source.
  */
-assert(ret < 0 || ((s->common.job.force_cancel || !s->synced) &&
-   job_is_cancelled(&s->common.job)));
+assert(ret < 0 ||
+   ((s->common.job.force_cancel || !job_is_ready(&s->common.job)) &&
+job_is_cancelled(&s->common.job)));
 assert(need_drain);
 mirror_wait_for_all_io(s);
 }
@@ -1127,7 +1126,7 @@ static void mirror_complete(Job *job, Error **errp)
 {
 MirrorBlockJob *s = container_of(job, MirrorBlockJob, common.job);
 
-if (!s->synced) {
+if (!job_is_ready(job)) {
 error_setg(errp, "The active block job '%s' cannot be completed",
job->id);
 return;
-- 
2.31.1




[PATCH for-6.1? v2 0/7] mirror: Handle errors after READY cancel

2021-07-26 Thread Max Reitz
Hi,

v1 cover letter:
https://lists.nongnu.org/archive/html/qemu-block/2021-07/msg00705.html

Changes in v2:
- Added patch 2 (as suggested by Vladimir)
- Patch 4 (ex. 3): Rebase conflicts because of patch 2
- Patch 5 (ex. 4):
  - Rebase conflicts because of patch 2
  - Do not use job_cancel_requested() to determine how a soft-cancelled
job should be completed: A soft-cancelled job should end with
COMPLETED, not CANCELLED, and so job_is_cancelled() is the
appropriate condition there.


git-backport-diff against v1:

Key:
[----] : patches are identical
[####] : number of functional differences between upstream/downstream patch
[down] : patch is downstream-only
The flags [FC] indicate (F)unctional and (C)ontextual differences, respectively

001/7:[----] [--] 'mirror: Keep s->synced on error'
002/7:[down] 'mirror: Drop s->synced'
003/7:[----] [--] 'job: @force parameter for job_cancel_sync{,_all}()'
004/7:[0006] [FC] 'jobs: Give Job.force_cancel more meaning'
005/7:[0011] [FC] 'job: Add job_cancel_requested()'
006/7:[----] [-C] 'mirror: Check job_is_cancelled() earlier'
007/7:[----] [--] 'iotests: Add mirror-ready-cancel-error test'


Max Reitz (7):
  mirror: Keep s->synced on error
  mirror: Drop s->synced
  job: @force parameter for job_cancel_sync{,_all}()
  jobs: Give Job.force_cancel more meaning
  job: Add job_cancel_requested()
  mirror: Check job_is_cancelled() earlier
  iotests: Add mirror-ready-cancel-error test

 include/qemu/job.h|  29 +++-
 block/backup.c|   3 +-
 block/mirror.c|  47 +++---
 block/replication.c   |   4 +-
 blockdev.c|   4 +-
 job.c |  40 -
 qemu-nbd.c|   2 +-
 softmmu/runstate.c|   2 +-
 storage-daemon/qemu-storage-daemon.c  |   2 +-
 tests/unit/test-block-iothread.c  |   2 +-
 tests/unit/test-blockjob.c|   2 +-
 tests/qemu-iotests/109.out|  60 +++-
 .../tests/mirror-ready-cancel-error   | 143 ++
 .../tests/mirror-ready-cancel-error.out   |   5 +
 tests/qemu-iotests/tests/qsd-jobs.out |   2 +-
 15 files changed, 264 insertions(+), 83 deletions(-)
 create mode 100755 tests/qemu-iotests/tests/mirror-ready-cancel-error
 create mode 100644 tests/qemu-iotests/tests/mirror-ready-cancel-error.out

-- 
2.31.1




[PATCH for-6.1? v2 1/7] mirror: Keep s->synced on error

2021-07-26 Thread Max Reitz
An error does not take us out of the READY phase, which is what
s->synced signifies.  It does of course mean that source and target are
no longer in sync, but that is what s->actively_synced is for -- s->synced
never meant that source and target are in sync, only that they were at
some point (and at that point we transitioned into the READY phase).

The tangible problem is that we transition to READY once we are in sync
and s->synced is false.  By resetting s->synced here, we will transition
from READY to READY once the error is resolved (if the job keeps
running), and that transition is not allowed.

Signed-off-by: Max Reitz 
Reviewed-by: Vladimir Sementsov-Ogievskiy 
---
 block/mirror.c | 1 -
 1 file changed, 1 deletion(-)

diff --git a/block/mirror.c b/block/mirror.c
index 98fc66eabf..d73b704473 100644
--- a/block/mirror.c
+++ b/block/mirror.c
@@ -121,7 +121,6 @@ typedef enum MirrorMethod {
 static BlockErrorAction mirror_error_action(MirrorBlockJob *s, bool read,
 int error)
 {
-s->synced = false;
 s->actively_synced = false;
 if (read) {
 return block_job_error_action(&s->common, s->on_source_error,
-- 
2.31.1




Re: [PATCH for-6.1? 1/6] mirror: Keep s->synced on error

2021-07-26 Thread Max Reitz

On 22.07.21 18:25, Vladimir Sementsov-Ogievskiy wrote:

22.07.2021 15:26, Max Reitz wrote:

An error does not take us out of the READY phase, which is what
s->synced signifies.  It does of course mean that source and target are
no longer in sync, but that is what s->actively_synced is for -- s->synced
never meant that source and target are in sync, only that they were at
some point (and at that point we transitioned into the READY phase).

The tangible problem is that we transition to READY once we are in sync
and s->synced is false.  By resetting s->synced here, we will transition
from READY to READY once the error is resolved (if the job keeps
running), and that transition is not allowed.

Signed-off-by: Max Reitz 
---
  block/mirror.c | 1 -
  1 file changed, 1 deletion(-)

diff --git a/block/mirror.c b/block/mirror.c
index 98fc66eabf..d73b704473 100644
--- a/block/mirror.c
+++ b/block/mirror.c
@@ -121,7 +121,6 @@ typedef enum MirrorMethod {
  static BlockErrorAction mirror_error_action(MirrorBlockJob *s, bool read,
  int error)
  {
-    s->synced = false;
  s->actively_synced = false;
  if (read) {
  return block_job_error_action(&s->common, s->on_source_error,



Looked through.. Yes, seems s->synced is used as "is ready". Isn't it 
better to drop s->synced entirely and use job_is_ready() instead?


Sounds good, though I think for the change to be clear, I’d like to keep 
this patch and then drop s->synced on top.


Max

Hmm, s->actively_synced is used only for an assertion in 
active_write_settle().. That's not wrong, just interesting.





Re: [PATCH for-6.1? 4/6] job: Add job_cancel_requested()

2021-07-26 Thread Max Reitz

On 22.07.21 19:58, Vladimir Sementsov-Ogievskiy wrote:

22.07.2021 15:26, Max Reitz wrote:

Most callers of job_is_cancelled() actually want to know whether the job
is on its way to immediate termination.  For example, we refuse to pause
jobs that are cancelled; but this only makes sense for jobs that are
really actually cancelled.

A mirror job that is cancelled during READY with force=false should
absolutely be allowed to pause.  This "cancellation" (which is actually
a kind of completion) 


You have to keep repeating that this "cancel" is not a real "cancel".

So, the whole problem is this feature of mirror: a cancel in the READY 
state does not cancel, but does a specific kind of completion.


You are trying to make this thing handled correctly in the generic layer..

Did you consider instead just dropping the feature from the generic layer, 
so that all *cancel* functions always do force-cancel? Then the internal 
implementation becomes a lot clearer.


Yes, I considered that, and I’ve decided against it (for now), because 
such a change would obviously be an incompatible change.  It would 
require a deprecation period, and so we would need to fix this bug now 
anyway.


But we have to support the qmp block-job-cancel of READY mirror (and 
commit) with force=false.


We could do it as an exception in qmp_block_job_cancel, something like:

if (job is mirror or commit AND it's ready AND force = false)
   mirror_soft_cancel(...);
else
   job_cancel(...);
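
Spelled out, that exception might look roughly like this (an illustrative 
sketch only -- mirror_soft_cancel() is hypothetical, and 
find_block_job()/job_type() are assumed to behave as in current blockdev.c):

/* Illustrative sketch of the discussed exception -- not a proposal. */
void qmp_block_job_cancel(const char *device,
                          bool has_force, bool force, Error **errp)
{
    AioContext *aio_context;
    BlockJob *job = find_block_job(device, &aio_context, errp);

    if (!job) {
        return;
    }

    force = has_force && force;

    if (!force && job_is_ready(&job->job) &&
        (job_type(&job->job) == JOB_TYPE_MIRROR ||
         job_type(&job->job) == JOB_TYPE_COMMIT)) {
        mirror_soft_cancel(&job->job, errp);   /* hypothetical helper */
    } else {
        job_user_cancel(&job->job, force, errp);
    }

    aio_context_release(aio_context);
}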


I didn’t consider such a hack, though.  I don’t like it.  If we think 
that we should change our approach because mirror’s soft cancel is 
actually a completion mode, and the current situation is too confusing, 
such a change should be user-visible, too.  (I think there was this idea 
of having job-specific flags or parameters you could change at runtime, 
and so you’d just change the “pivot” parameter between true or false.)


Also, I don’t know whether this would really make anything “a lot” 
easier.  After this series job_is_cancelled() already tells the true 
story, so all we could really change is to drop force_cancel and unify 
the “s->should_complete || job_cancel_requested()” conditions in 
block/mirror.c into one variable.  So when I considered making cancel 
exclusively force-cancel jobs, I thought it wouldn’t actually be worth 
it in practice.



may take an indefinite amount of time, and so
should behave like any job during normal operation.  For example, with
on-target-error=stop, the job should stop on write errors.  (In
contrast, force-cancelled jobs should not get write errors, as they
should just terminate and not do further I/O.)

Therefore, redefine job_is_cancelled() to only return true for jobs that
are force-cancelled (which as of HEAD^ means any job that interprets the
cancellation request as a request for immediate termination), and add
job_cancel_requested() as the general variant, which returns true for any
jobs which have been requested to be cancelled, whether it be
immediately or after an arbitrarily long completion phase.

Buglink: https://gitlab.com/qemu-project/qemu/-/issues/462
Signed-off-by: Max Reitz 
---


[..]


--- a/job.c
+++ b/job.c
@@ -216,6 +216,11 @@ const char *job_type_str(const Job *job)
  }
    bool job_is_cancelled(Job *job)
+{
+    return job->cancelled && job->force_cancel;
+}
+
+bool job_cancel_requested(Job *job)
  {
  return job->cancelled;
  }
@@ -650,7 +655,7 @@ static void job_conclude(Job *job)
    static void job_update_rc(Job *job)
  {
-    if (!job->ret && job_is_cancelled(job)) {
+    if (!job->ret && job_cancel_requested(job)) {


Why not job_is_cancelled() here?

So in the case of mirror's other kind of completion, we set ret to -ECANCELED?


I thought the return value is a user-visible thing, so I left it as-is.

Seems I was wrong, more below.


  job->ret = -ECANCELED;
  }
  if (job->ret) {
@@ -704,7 +709,7 @@ static int job_finalize_single(Job *job)
    /* Emit events only if we actually started */
  if (job_started(job)) {
-    if (job_is_cancelled(job)) {
+    if (job_cancel_requested(job)) {
  job_event_cancelled(job);


Same question here.. Shouldn't mirror report a COMPLETED event when it is 
non-force-cancelled in the READY state?


Same here, I thought this is user-visible, nothing internal, so I should 
leave it as-is.


Now I see that cancelling mirror post-READY indeed should result in a 
COMPLETED event.  So I’m actually not exactly sure how mirror does that, 
despite this code here (which functionally isn’t changed by this patch), 
but it’s absolutely true that job_is_cancelled() would be more 
appropriate here.


(No iotest failed, so I thought this change was right.  Well.)


  } else {
  job_event_completed(job);
@@ -1015,7 +1020,7 @@ void job_complete(Job *job, Error **errp)
  if (job_apply_verb(job, JOB_VERB_COMPLETE, errp)) {
  return;
  }
-    if (job_is_cancelled(job) || !job->driver->complete) {

[PATCH for-6.1? 6/6] iotests: Add mirror-ready-cancel-error test

2021-07-22 Thread Max Reitz
Test what happens when there is an I/O error after a mirror job in the
READY phase has been cancelled.

Signed-off-by: Max Reitz 
---
 .../tests/mirror-ready-cancel-error   | 143 ++
 .../tests/mirror-ready-cancel-error.out   |   5 +
 2 files changed, 148 insertions(+)
 create mode 100755 tests/qemu-iotests/tests/mirror-ready-cancel-error
 create mode 100644 tests/qemu-iotests/tests/mirror-ready-cancel-error.out

diff --git a/tests/qemu-iotests/tests/mirror-ready-cancel-error b/tests/qemu-iotests/tests/mirror-ready-cancel-error
new file mode 100755
index 00..f2dc1f
--- /dev/null
+++ b/tests/qemu-iotests/tests/mirror-ready-cancel-error
@@ -0,0 +1,143 @@
+#!/usr/bin/env python3
+# group: rw quick
+#
+# Test what happens when errors occur to a mirror job after it has
+# been cancelled in the READY phase
+#
+# Copyright (C) 2021 Red Hat, Inc.
+#
+# This program is free software; you can redistribute it and/or modify
+# it under the terms of the GNU General Public License as published by
+# the Free Software Foundation; either version 2 of the License, or
+# (at your option) any later version.
+#
+# This program is distributed in the hope that it will be useful,
+# but WITHOUT ANY WARRANTY; without even the implied warranty of
+# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+# GNU General Public License for more details.
+#
+# You should have received a copy of the GNU General Public License
+# along with this program.  If not, see <http://www.gnu.org/licenses/>.
+#
+
+import os
+import iotests
+
+
+image_size = 1 * 1024 * 1024
+source = os.path.join(iotests.test_dir, 'source.img')
+target = os.path.join(iotests.test_dir, 'target.img')
+
+
+class TestMirrorReadyCancelError(iotests.QMPTestCase):
+    def setUp(self) -> None:
+        assert iotests.qemu_img_create('-f', iotests.imgfmt, source,
+                                       str(image_size)) == 0
+        assert iotests.qemu_img_create('-f', iotests.imgfmt, target,
+                                       str(image_size)) == 0
+
+        self.vm = iotests.VM()
+        self.vm.launch()
+
+    def tearDown(self) -> None:
+        self.vm.shutdown()
+        os.remove(source)
+        os.remove(target)
+
+    def add_blockdevs(self, once: bool) -> None:
+        res = self.vm.qmp('blockdev-add',
+                          **{'node-name': 'source',
+                             'driver': iotests.imgfmt,
+                             'file': {
+                                 'driver': 'file',
+                                 'filename': source
+                             }})
+        self.assert_qmp(res, 'return', {})
+
+        # blkdebug notes:
+        # Enter state 2 on the first flush, which happens before the
+        # job enters the READY state.  The second flush will happen
+        # when the job is about to complete, and we want that one to
+        # fail.
+        res = self.vm.qmp('blockdev-add',
+                          **{'node-name': 'target',
+                             'driver': iotests.imgfmt,
+                             'file': {
+                                 'driver': 'blkdebug',
+                                 'image': {
+                                     'driver': 'file',
+                                     'filename': target
+                                 },
+                                 'set-state': [{
+                                     'event': 'flush_to_disk',
+                                     'state': 1,
+                                     'new_state': 2
+                                 }],
+                                 'inject-error': [{
+                                     'event': 'flush_to_disk',
+                                     'once': once,
+                                     'immediately': True,
+                                     'state': 2
+                                 }]}})
+        self.assert_qmp(res, 'return', {})
+
+    def start_mirror(self) -> None:
+        res = self.vm.qmp('blockdev-mirror',
+                          job_id='mirror',
+                          device='source',
+                          target='target',
+                          filter_node_name='mirror-top',
+                          sync='full',
+                          on_target_error='stop')
+        self.assert_qmp(res, 'return', {})
+
+    def cancel_mirror_with_error(self) -> None:
+        self.vm.event_wait('BLOCK_JOB_READY')
+
+        # Write something so the job will not leave immediately, but
+        # flush first (which will fail, thanks to blkdebug)
+        res = self.vm.qmp('human-monitor-command',
+                          command_line='qemu-io mirror-top "write 0 64k"')
+        self.assert_qmp(res, 'return', '')
+
+        # Drain status change events
+        while self.vm.event_wait('JOB_STATUS_CHANGE', timeout=0.0) is not None:
+            pass
+
+        res = self.vm

[PATCH for-6.1? 4/6] job: Add job_cancel_requested()

2021-07-22 Thread Max Reitz
Most callers of job_is_cancelled() actually want to know whether the job
is on its way to immediate termination.  For example, we refuse to pause
jobs that are cancelled; but this only makes sense for jobs that are
really actually cancelled.

A mirror job that is cancelled during READY with force=false should
absolutely be allowed to pause.  This "cancellation" (which is actually
a kind of completion) may take an indefinite amount of time, and so
should behave like any job during normal operation.  For example, with
on-target-error=stop, the job should stop on write errors.  (In
contrast, force-cancelled jobs should not get write errors, as they
should just terminate and not do further I/O.)

Therefore, redefine job_is_cancelled() to only return true for jobs that
are force-cancelled (which as of HEAD^ means any job that interprets the
cancellation request as a request for immediate termination), and add
job_cancel_requested() as the general variant, which returns true for any
jobs which have been requested to be cancelled, whether it be
immediately or after an arbitrarily long completion phase.

Buglink: https://gitlab.com/qemu-project/qemu/-/issues/462
Signed-off-by: Max Reitz 
---
 include/qemu/job.h |  8 +++-
 block/mirror.c |  9 -
 job.c  | 13 +
 3 files changed, 20 insertions(+), 10 deletions(-)

diff --git a/include/qemu/job.h b/include/qemu/job.h
index 8aa90f7395..032edf3c5f 100644
--- a/include/qemu/job.h
+++ b/include/qemu/job.h
@@ -436,9 +436,15 @@ const char *job_type_str(const Job *job);
 /** Returns true if the job should not be visible to the management layer. */
 bool job_is_internal(Job *job);
 
-/** Returns whether the job is scheduled for cancellation. */
+/** Returns whether the job is being cancelled. */
 bool job_is_cancelled(Job *job);
 
+/**
+ * Returns whether the job is scheduled for cancellation (at an
+ * indefinite point).
+ */
+bool job_cancel_requested(Job *job);
+
 /** Returns whether the job is in a completed state. */
 bool job_is_completed(Job *job);
 
diff --git a/block/mirror.c b/block/mirror.c
index c3514f4196..291d2ed040 100644
--- a/block/mirror.c
+++ b/block/mirror.c
@@ -938,7 +938,7 @@ static int coroutine_fn mirror_run(Job *job, Error **errp)
  job_transition_to_ready(&s->common.job);
  s->synced = true;
  s->actively_synced = true;
-while (!job_is_cancelled(&s->common.job) && !s->should_complete) {
+while (!job_cancel_requested(&s->common.job) && !s->should_complete) {
  job_yield(&s->common.job);
  }
  s->common.job.cancelled = false;
@@ -1046,7 +1046,7 @@ static int coroutine_fn mirror_run(Job *job, Error **errp)
  }

  should_complete = s->should_complete ||
-job_is_cancelled(&s->common.job);
+job_cancel_requested(&s->common.job);
  cnt = bdrv_get_dirty_count(s->dirty_bitmap);
  }

@@ -1089,7 +1089,7 @@ static int coroutine_fn mirror_run(Job *job, Error **errp)
  }
  trace_mirror_before_sleep(s, cnt, s->synced, delay_ns);
  job_sleep_ns(&s->common.job, delay_ns);
-if (job_is_cancelled(&s->common.job) && s->common.job.force_cancel) {
+if (job_is_cancelled(&s->common.job)) {
  break;
  }
  s->last_pause_ns = qemu_clock_get_ns(QEMU_CLOCK_REALTIME);
@@ -1101,8 +1101,7 @@ immediate_exit:
   * or it was cancelled prematurely so that we do not guarantee that
   * the target is a copy of the source.
   */
-assert(ret < 0 || (s->common.job.force_cancel &&
-   job_is_cancelled(&s->common.job)));
+assert(ret < 0 || job_is_cancelled(&s->common.job));
 assert(need_drain);
 mirror_wait_for_all_io(s);
 }
diff --git a/job.c b/job.c
index e78d893a9c..c51c8077cb 100644
--- a/job.c
+++ b/job.c
@@ -216,6 +216,11 @@ const char *job_type_str(const Job *job)
 }
 
 bool job_is_cancelled(Job *job)
+{
+return job->cancelled && job->force_cancel;
+}
+
+bool job_cancel_requested(Job *job)
 {
 return job->cancelled;
 }
@@ -650,7 +655,7 @@ static void job_conclude(Job *job)
 
 static void job_update_rc(Job *job)
 {
-if (!job->ret && job_is_cancelled(job)) {
+if (!job->ret && job_cancel_requested(job)) {
 job->ret = -ECANCELED;
 }
 if (job->ret) {
@@ -704,7 +709,7 @@ static int job_finalize_single(Job *job)
 
 /* Emit events only if we actually started */
 if (job_started(job)) {
-if (job_is_cancelled(job)) {
+if (job_cancel_requested(job)) {
 job_event_cancelled(job);
 } else {
 job_event_completed(job);
@@ -1015,7 +1020,7 @@ void job_complete(Job *job, Error **errp)
 if (job_apply_verb(job, JOB_VERB_COMPLETE, errp)) {
 return;
 }
-if (job_is_cancelled(job) || !job->driver->complete) {

[PATCH for-6.1? 3/6] jobs: Give Job.force_cancel more meaning

2021-07-22 Thread Max Reitz
We largely have two cancel modes for jobs:

First, there is actual cancelling.  The job is terminated as soon as
possible, without trying to reach a consistent result.

Second, we have mirror in the READY state.  Technically, the job is not
really cancelled, but it just is a different completion mode.  The job
can still run for an indefinite amount of time while it tries to reach a
consistent result.

We want to be able to clearly distinguish which cancel mode a job is in
(when it has been cancelled).  We can use Job.force_cancel for this, but
right now it only reflects cancel requests from the user with
force=true, but clearly, jobs that do not even distinguish between
force=false and force=true are effectively always force-cancelled.

So this patch has Job.force_cancel signify whether the job will
terminate as soon as possible (force_cancel=true) or whether it will
effectively remain running despite being "cancelled"
(force_cancel=false).

To this end, we let jobs that provide JobDriver.cancel() tell the
generic job code whether they will terminate as soon as possible or not,
and for jobs that do not provide that method we assume they will.

Signed-off-by: Max Reitz 
---
 include/qemu/job.h | 11 ++-
 block/backup.c |  3 ++-
 block/mirror.c | 24 ++--
 job.c  |  6 +-
 4 files changed, 35 insertions(+), 9 deletions(-)

diff --git a/include/qemu/job.h b/include/qemu/job.h
index 5e8edbc2c8..8aa90f7395 100644
--- a/include/qemu/job.h
+++ b/include/qemu/job.h
@@ -253,8 +253,17 @@ struct JobDriver {
 
 /**
  * If the callback is not NULL, it will be invoked in job_cancel_async
+ *
+ * This function must return true if the job will be cancelled
+ * immediately without any further I/O (mandatory if @force is
+ * true), and false otherwise.  This lets the generic job layer
+ * know whether a job has been truly (force-)cancelled, or whether
+ * it is just in a special completion mode (like mirror after
+ * READY).
+ * (If the callback is NULL, the job is assumed to terminate
+ * without I/O.)
  */
-void (*cancel)(Job *job, bool force);
+bool (*cancel)(Job *job, bool force);
 
 
 /** Called when the job is freed */
diff --git a/block/backup.c b/block/backup.c
index bd3614ce70..513e1c8a0b 100644
--- a/block/backup.c
+++ b/block/backup.c
@@ -331,11 +331,12 @@ static void coroutine_fn backup_set_speed(BlockJob *job, int64_t speed)
 }
 }
 
-static void backup_cancel(Job *job, bool force)
+static bool backup_cancel(Job *job, bool force)
 {
 BackupBlockJob *s = container_of(job, BackupBlockJob, common.job);
 
 bdrv_cancel_in_flight(s->target_bs);
+return true;
 }
 
 static const BlockJobDriver backup_job_driver = {
diff --git a/block/mirror.c b/block/mirror.c
index d73b704473..c3514f4196 100644
--- a/block/mirror.c
+++ b/block/mirror.c
@@ -1089,9 +1089,7 @@ static int coroutine_fn mirror_run(Job *job, Error **errp)
 }
 trace_mirror_before_sleep(s, cnt, s->synced, delay_ns);
  job_sleep_ns(&s->common.job, delay_ns);
-if (job_is_cancelled(&s->common.job) &&
-(!s->synced || s->common.job.force_cancel))
-{
+if (job_is_cancelled(&s->common.job) && s->common.job.force_cancel) {
 break;
 }
 s->last_pause_ns = qemu_clock_get_ns(QEMU_CLOCK_REALTIME);
@@ -1103,7 +1101,7 @@ immediate_exit:
  * or it was cancelled prematurely so that we do not guarantee that
  * the target is a copy of the source.
  */
-assert(ret < 0 || ((s->common.job.force_cancel || !s->synced) &&
-   job_is_cancelled(&s->common.job)));
+assert(ret < 0 || (s->common.job.force_cancel &&
+   job_is_cancelled(&s->common.job)));
 assert(need_drain);
 mirror_wait_for_all_io(s);
@@ -1189,14 +1187,27 @@ static bool mirror_drained_poll(BlockJob *job)
 return !!s->in_flight;
 }
 
-static void mirror_cancel(Job *job, bool force)
+static bool mirror_cancel(Job *job, bool force)
 {
 MirrorBlockJob *s = container_of(job, MirrorBlockJob, common.job);
 BlockDriverState *target = blk_bs(s->target);
 
-if (force || !job_is_ready(job)) {
+/*
+ * Before the job is READY, we treat any cancellation like a
+ * force-cancellation.
+ */
+force = force || !job_is_ready(job);
+
+if (force) {
 bdrv_cancel_in_flight(target);
 }
+return force;
+}
+
+static bool commit_active_cancel(Job *job, bool force)
+{
+/* Same as above in mirror_cancel() */
+return force || !job_is_ready(job);
 }
 
 static const BlockJobDriver mirror_job_driver = {
@@ -1226,6 +1237,7 @@ static const BlockJobDriver commit_active_job_driver = {
 .abort  = mirror_abort,
 .pause  = mirror_pause,
 .complete   = mirror_complete,
+.cancel = commit_active_cancel,

[PATCH for-6.1? 5/6] mirror: Check job_is_cancelled() earlier

2021-07-22 Thread Max Reitz
We must check whether the job is force-cancelled early in our main loop,
most importantly before any `continue` statement.  For example, we used
to have `continue`s before our current checking location that are
triggered by `mirror_flush()` failing.  So, if `mirror_flush()` kept
failing, force-cancelling the job would not terminate it.

A job being force-cancelled should be treated the same as the job having
failed, so put the check in the same place where we check `s->ret < 0`.

Buglink: https://gitlab.com/qemu-project/qemu/-/issues/462
Signed-off-by: Max Reitz 
---
 block/mirror.c | 7 +--
 1 file changed, 1 insertion(+), 6 deletions(-)

diff --git a/block/mirror.c b/block/mirror.c
index 291d2ed040..a993ed37d0 100644
--- a/block/mirror.c
+++ b/block/mirror.c
@@ -995,7 +995,7 @@ static int coroutine_fn mirror_run(Job *job, Error **errp)
 mirror_wait_for_any_operation(s, true);
 }
 
-if (s->ret < 0) {
+if (s->ret < 0 || job_is_cancelled(&s->common.job)) {
 ret = s->ret;
 goto immediate_exit;
 }
@@ -1081,17 +1081,12 @@ static int coroutine_fn mirror_run(Job *job, Error **errp)
 break;
 }
 
-ret = 0;
-
 if (s->synced && !should_complete) {
 delay_ns = (s->in_flight == 0 &&
 cnt == 0 ? BLOCK_JOB_SLICE_TIME : 0);
 }
 trace_mirror_before_sleep(s, cnt, s->synced, delay_ns);
  job_sleep_ns(&s->common.job, delay_ns);
-if (job_is_cancelled(&s->common.job)) {
-break;
-}
 s->last_pause_ns = qemu_clock_get_ns(QEMU_CLOCK_REALTIME);
 }
 
-- 
2.31.1




[PATCH for-6.1? 2/6] job: @force parameter for job_cancel_sync{,_all}()

2021-07-22 Thread Max Reitz
Callers should be able to specify whether they want job_cancel_sync() to
force-cancel the job or not.

In fact, almost all invocations do not care about consistency of the
result and just want the job to terminate as soon as possible, so they
should pass force=true.  The replication block driver is the exception.

This changes some iotest outputs, because quitting qemu while a mirror
job is active will now lead to it being cancelled instead of completed,
which is what we want.  (Cancelling a READY mirror job with force=false
may take an indefinite amount of time, which we do not want when
quitting.  If users want consistent results, they must have all jobs be
done before they quit qemu.)

Buglink: https://gitlab.com/qemu-project/qemu/-/issues/462
Signed-off-by: Max Reitz 
---
 include/qemu/job.h| 10 ++---
 block/replication.c   |  4 +-
 blockdev.c|  4 +-
 job.c | 27 +---
 qemu-nbd.c|  2 +-
 softmmu/runstate.c|  2 +-
 storage-daemon/qemu-storage-daemon.c  |  2 +-
 tests/unit/test-block-iothread.c  |  2 +-
 tests/unit/test-blockjob.c|  2 +-
 tests/qemu-iotests/109.out| 60 +++
 tests/qemu-iotests/tests/qsd-jobs.out |  2 +-
 11 files changed, 61 insertions(+), 56 deletions(-)

diff --git a/include/qemu/job.h b/include/qemu/job.h
index 41162ed494..5e8edbc2c8 100644
--- a/include/qemu/job.h
+++ b/include/qemu/job.h
@@ -506,19 +506,19 @@ void job_user_cancel(Job *job, bool force, Error **errp);
 
 /**
  * Synchronously cancel the @job.  The completion callback is called
- * before the function returns.  The job may actually complete
- * instead of canceling itself; the circumstances under which this
- * happens depend on the kind of job that is active.
+ * before the function returns.  If @force is false, the job may
+ * actually complete instead of canceling itself; the circumstances
+ * under which this happens depend on the kind of job that is active.
  *
  * Returns the return value from the job if the job actually completed
  * during the call, or -ECANCELED if it was canceled.
  *
  * Callers must hold the AioContext lock of job->aio_context.
  */
-int job_cancel_sync(Job *job);
+int job_cancel_sync(Job *job, bool force);
 
 /** Synchronously cancels all jobs using job_cancel_sync(). */
-void job_cancel_sync_all(void);
+void job_cancel_sync_all(bool force);
 
 /**
  * @job: The job to be completed.
diff --git a/block/replication.c b/block/replication.c
index 32444b9a8f..e7a9327b12 100644
--- a/block/replication.c
+++ b/block/replication.c
@@ -149,7 +149,7 @@ static void replication_close(BlockDriverState *bs)
 if (s->stage == BLOCK_REPLICATION_FAILOVER) {
  commit_job = &s->commit_job->job;
 assert(commit_job->aio_context == qemu_get_current_aio_context());
-job_cancel_sync(commit_job);
+job_cancel_sync(commit_job, false);
 }
 
 if (s->mode == REPLICATION_MODE_SECONDARY) {
@@ -726,7 +726,7 @@ static void replication_stop(ReplicationState *rs, bool failover, Error **errp)
  * disk, secondary disk in backup_job_completed().
  */
 if (s->backup_job) {
-job_cancel_sync(&s->backup_job->job);
+job_cancel_sync(&s->backup_job->job, false);
 }
 
 if (!failover) {
diff --git a/blockdev.c b/blockdev.c
index 3d8ac368a1..aa95918c02 100644
--- a/blockdev.c
+++ b/blockdev.c
@@ -1848,7 +1848,7 @@ static void drive_backup_abort(BlkActionState *common)
 aio_context = bdrv_get_aio_context(state->bs);
 aio_context_acquire(aio_context);
 
-job_cancel_sync(&state->job->job);
+job_cancel_sync(&state->job->job, true);
 
 aio_context_release(aio_context);
 }
@@ -1949,7 +1949,7 @@ static void blockdev_backup_abort(BlkActionState *common)
 aio_context = bdrv_get_aio_context(state->bs);
 aio_context_acquire(aio_context);
 
-job_cancel_sync(&state->job->job);
+job_cancel_sync(&state->job->job, true);
 
 aio_context_release(aio_context);
 }
diff --git a/job.c b/job.c
index e7a5d28854..9e971d64cf 100644
--- a/job.c
+++ b/job.c
@@ -763,7 +763,12 @@ static void job_completed_txn_abort(Job *job)
 if (other_job != job) {
 ctx = other_job->aio_context;
 aio_context_acquire(ctx);
-job_cancel_async(other_job, false);
+/*
+ * This is a transaction: If one job failed, no result will matter.
+ * Therefore, pass force=true to terminate all other jobs as quickly
+ * as possible.
+ */
+job_cancel_async(other_job, true);
 aio_context_release(ctx);
 }
 }
@@ -964,12 +969,24 @@ static void job_cancel_err(Job *job, Error **errp)
 job_cancel(job, false);
 }
 
-int job_cancel_sync(Job *job)

[PATCH for-6.1? 1/6] mirror: Keep s->synced on error

2021-07-22 Thread Max Reitz
An error does not take us out of the READY phase, which is what
s->synced signifies.  It does of course mean that source and target are
no longer in sync, but that is what s->actively_synced is for -- s->synced
never meant that source and target are in sync, only that they were at
some point (and at that point we transitioned into the READY phase).

The tangible problem is that we transition to READY once we are in sync
and s->synced is false.  By resetting s->synced here, we will transition
from READY to READY once the error is resolved (if the job keeps
running), and that transition is not allowed.

Signed-off-by: Max Reitz 
---
 block/mirror.c | 1 -
 1 file changed, 1 deletion(-)

diff --git a/block/mirror.c b/block/mirror.c
index 98fc66eabf..d73b704473 100644
--- a/block/mirror.c
+++ b/block/mirror.c
@@ -121,7 +121,6 @@ typedef enum MirrorMethod {
 static BlockErrorAction mirror_error_action(MirrorBlockJob *s, bool read,
 int error)
 {
-s->synced = false;
 s->actively_synced = false;
 if (read) {
  return block_job_error_action(&s->common, s->on_source_error,
-- 
2.31.1




[PATCH for-6.1? 0/6] mirror: Handle errors after READY cancel

2021-07-22 Thread Max Reitz
Hi,

This is a rather complex series with changes that aren’t exactly local
to the mirror job, so maybe it’s too complex for 6.1.

However, it is a bug fix, and not an insignificant one, though probably
not a regression of any kind.

Bug report:
https://gitlab.com/qemu-project/qemu/-/issues/462

(I didn’t put any “Fixes:” or “Resolves:” into the commit messages,
because there is no single patch here that fixes the bug.)

The root of the problem is that if you cancel a mirror job during its
READY phase, any kind of I/O error (with the error action 'stop') is
likely to not be handled gracefully, which means that perhaps the job
will just loop forever without pausing, doing nothing but emitting
errors.  There is no way to stop the job, or cancel it with force=true,
and so you also cannot quit qemu normally, because, well, cancelling the
job doesn’t do anything.  So you have to kill qemu to stop the mess.

If you’re lucky, the error is transient.  Then qemu will just kill
itself with a failed assertion, because it’ll try a READY -> READY
transition, which isn’t allowed.

There are a couple of problems contributing to it all:

(1) The READY -> READY transition comes from the fact that we will enter
the READY state whenever source and target are synced, and whenever
s->synced is false.  I/O errors reset s->synced.  I believe they
shouldn’t.
(Patch 1)

(2) Quitting qemu doesn’t force-cancel jobs.  I don’t understand why.
If for all jobs but mirror we want them to be cancelled and not
properly completed, why do we want mirror to get a consistent
result?  (Which is what cancel with force=false gives you.)
I believe we actually don’t care, and so on many occasions where we
invoke job_cancel_sync() and job_cancel_sync_all(), we want to
force-cancel the job(s) in question.
(Patch 2)

(3) Cancelling mirror post-READY with force=false is actually not really
cancelling the job.  It’s a different completion mode.  The job
should run like any normal job, it shouldn’t be special-cased.
However, we have a couple of places that special-case cancelled job
because they believe that such jobs are on their way to definite
termination.  For example, we don’t allow pausing cancelled jobs.
We definitely do want to allow pausing a mirror post-READY job that
is being non-force-cancelled.  The job may still take an arbitrary
amount of time, so it absolutely should be pausable.
(Patches 3, 4)

(4) Mirror only checks whether it’s been force-cancelled at the bottom
of its main loop, after several `continue`s.  Therefore, if flushing
fails (and it then `continue`s), that check will be skipped.  If
flushing fails continuously, the job cannot be force-cancelled.
(Patch 5)


Max Reitz (6):
  mirror: Keep s->synced on error
  job: @force parameter for job_cancel_sync{,_all}()
  jobs: Give Job.force_cancel more meaning
  job: Add job_cancel_requested()
  mirror: Check job_is_cancelled() earlier
  iotests: Add mirror-ready-cancel-error test

 include/qemu/job.h|  29 +++-
 block/backup.c|   3 +-
 block/mirror.c|  35 +++--
 block/replication.c   |   4 +-
 blockdev.c|   4 +-
 job.c |  46 --
 qemu-nbd.c|   2 +-
 softmmu/runstate.c|   2 +-
 storage-daemon/qemu-storage-daemon.c  |   2 +-
 tests/unit/test-block-iothread.c  |   2 +-
 tests/unit/test-blockjob.c|   2 +-
 tests/qemu-iotests/109.out|  60 +++-
 .../tests/mirror-ready-cancel-error   | 143 ++
 .../tests/mirror-ready-cancel-error.out   |   5 +
 tests/qemu-iotests/tests/qsd-jobs.out |   2 +-
 15 files changed, 262 insertions(+), 79 deletions(-)
 create mode 100755 tests/qemu-iotests/tests/mirror-ready-cancel-error
 create mode 100644 tests/qemu-iotests/tests/mirror-ready-cancel-error.out

-- 
2.31.1




Re: [Virtio-fs] [PATCH v2 7/9] virtiofsd: Add inodes_by_handle hash table

2021-07-21 Thread Max Reitz

On 20.07.21 16:50, Vivek Goyal wrote:

On Tue, Jul 13, 2021 at 05:07:31PM +0200, Max Reitz wrote:

[..]

The next question is, how do we detect temporary failure, because if we
look up some new inode, name_to_handle_at() fails, we ignore it, and
then it starts to work and we fail all further lookups, that’s not
good.  We should have the first lookup fail.  I suppose ENOTSUPP means
“OK to ignore”, and for everything else we should let lookup fail?  (And
that pretty much answers my "what if name_to_handle_at() works the first
time, but then fails" question.  If we let anything but ENOTSUPP let the
lookup fail, then we should do so every time.)

I don’t think this will work as cleanly as I’d hoped.

The problem I’m facing is that get_file_handle() doesn’t only call
name_to_handle_at(), but also contains a lot of code managing mount_fds.
There are a lot of places that can fail, too, and I think we should have
them fall back to using an O_PATH FD:

Say mount_fds doesn’t contain an FD for the new handle’s mount ID yet, so we
want to add one.  However, it turns out that the file is not a regular file
or directory, so we cannot open it as a regular FD and add it to mount_fds;

Hi Max,

So an fd opened using O_PATH can't be used as "mount_fd" in
open_by_handle_at()? (I see that you are already first opening O_PATH
fd and then verifying if this is a regular file/dir or not).


Yep, unfortunately we need a non-O_PATH fd.
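
For reference, here is the round trip being discussed as a standalone 
sketch (not the virtiofsd code; open_by_handle_at() additionally requires 
CAP_DAC_READ_SEARCH):

#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>

/*
 * Sketch: argv[1] is the file to generate a handle for, argv[2] a path
 * on the same mount whose O_RDONLY fd serves as the mount fd.
 */
int main(int argc, char **argv)
{
    struct file_handle *fh = malloc(sizeof(*fh) + MAX_HANDLE_SZ);
    int mount_id, mount_fd, fd;

    fh->handle_bytes = MAX_HANDLE_SZ;
    if (name_to_handle_at(AT_FDCWD, argv[1], fh, &mount_id, 0) < 0) {
        perror("name_to_handle_at"); /* EOPNOTSUPP: no handle support */
        return 1;
    }

    /* The mount fd must be a regular O_RDONLY fd, not an O_PATH one. */
    mount_fd = open(argv[2], O_RDONLY);
    fd = open_by_handle_at(mount_fd, fh, O_PATH);
    if (fd < 0) {
        perror("open_by_handle_at");
        return 1;
    }

    printf("reopened %s via handle (mount id %d)\n", argv[1], mount_id);
    return 0;
}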


or that it is a regular file, but without permission to open it O_RDONLY.
So we cannot return a file handle, because it will not be usable until a
mount FD is added.

I think in such a case we should fall back to an O_PATH FD, because this is
not some unexpected error, but just an unfortunate (but reproducible and
valid) circumstance where using `-o inode_file_handles` fails to do
something that works without it.

Now, however, this means that the next time we try to generate a handle for
this file (to look it up), it will absolutely work if some other FD was
added to mount_fds for this mount ID in the meantime.


We could get around this by not trying to open the file for which we are to
generate a handle to add its FD to mount_fds, but instead doing what the
open_by_handle_at() man page suggests:


The mount_id argument returns an identifier for the filesystem mount
that corresponds to pathname. This corresponds to the first field in one
of the records in /proc/self/mountinfo. Opening the pathname in the
fifth field of that record yields a file descriptor for the mount point;
that file descriptor can be used in a subsequent call to
open_by_handle_at().
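
A minimal sketch of that suggestion (hypothetical helper, not the 
virtiofsd code; mountinfo fields per proc(5), and escaped characters in 
the mount path, e.g. "\040" for a space, are not decoded here):

#include <stdio.h>
#include <fcntl.h>

static int open_mount_fd_for_id(int mount_id)
{
    FILE *f = fopen("/proc/self/mountinfo", "r");
    char line[4096];
    int fd = -1;

    if (!f) {
        return -1;
    }

    while (fgets(line, sizeof(line), f)) {
        int id;
        char mnt_point[4096];

        /* fields: mount ID, parent ID, major:minor, root, mount point */
        if (sscanf(line, "%d %*d %*s %*s %4095s", &id, mnt_point) == 2 &&
            id == mount_id) {
            /* Note: as discussed below, this need not be a directory. */
            fd = open(mnt_point, O_RDONLY);
            break;
        }
    }

    fclose(f);
    return fd;
}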

However, I’d rather avoid parsing mountinfo.

Hmm.., not sure what's wrong with parsing mountinfo.


Well, it’s just that it’s some additional complexity that I didn’t 
consider necessary.


(Because I was content with falling back in the rare case that the 
looked up file is not a regular file or directory.)



Example code does
not look too bad. Also it mentions that libmount provides helpers
(if we don't want to write our own function to parse mountinfo).

I would think parsing mountinfo is a good idea because it solves
your problem of not wanting to open device nodes for mount_fds. And
in turn not relying on a fallback to O_PATH fds.


Well.  Strictly speaking, it isn’t really my problem, because I didn’t 
really consider a rare fallback to be a problem.


Furthermore, I don’t even know whether it really solves the problem.  
Just as a mount point need not be a directory, it need not even be a 
regular file.  You absolutely can mount a filesystem on a device file, 
and have the root node be a device file, too:


(I did this by modifying the qemu FUSE block export code (to pass “dev” 
as a mount option, to drop the check whether the mount point is a 
regular file, and to report a device file instead of a regular file).  
It doesn’t work perfectly well because FUSE drops the rdev attribute, 
and so you can only create 0:0 device files, but, well...)


$ cat /proc/self/mountinfo
436 40 0:45 / /tmp/blub rw,nosuid,relatime shared:238 - fuse /dev/fuse 
rw,user_id=0,group_id=0,default_permissions,allow_other,max_read=67108864

$ stat /tmp/blub
File: /tmp/blub
Size: 1073741824 Blocks: 2097152 IO Block: 1 character special file
Device: 2dh/45d Inode: 1 Links: 1 Device type: 0,0
[...]

I know this is something that nobody will normally ever do, but I still 
don’t think we can absolutely safely assume a mount point to always be a 
regular file or directory.



Few thoughts overall.

- We are primarily disagreeing on whether we should fall back to O_PATH
   fds or not if something goes wrong w.r.t handle generation.

   My preference is that at least in the initial patches we should not try
   to fall back. EOPNOTSUPP is the only case we need to take care of.
   Otherwise if there is any temporary error (ENOMEM, running out of
   fds or something else), we return it to the caller. That's what
   rest of the code is doing. If some

[PULL 6/6] blkdebug: protect rules and suspended_reqs with a lock

2021-07-19 Thread Max Reitz
From: Emanuele Giuseppe Esposito 

First, categorize the structure fields to identify what needs
to be protected and what doesn't.

We essentially need to protect only .state, and the 3 lists in
BDRVBlkdebugState.

Then, add the lock and mark the functions accordingly.
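
In miniature, the discipline looks like this (shape only, not the 
literal code below):

/*
 * Fields written only in blkdebug_open() need no lock; s->state and
 * the three lists take s->lock, which is dropped again before anything
 * that can yield or reenter a coroutine.
 */
qemu_mutex_lock(&s->lock);
/* ... inspect s->active_rules, read or update s->state ... */
qemu_mutex_unlock(&s->lock);
/* only now is it safe to yield */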

Co-developed-by: Paolo Bonzini 
Signed-off-by: Emanuele Giuseppe Esposito 
Message-Id: <20210614082931.24925-7-eespo...@redhat.com>
Reviewed-by: Vladimir Sementsov-Ogievskiy 
Signed-off-by: Max Reitz 
---
 block/blkdebug.c | 49 ++--
 1 file changed, 39 insertions(+), 10 deletions(-)

diff --git a/block/blkdebug.c b/block/blkdebug.c
index b47c3fd97c..8b67554bec 100644
--- a/block/blkdebug.c
+++ b/block/blkdebug.c
@@ -38,24 +38,27 @@
 #include "qapi/qobject-input-visitor.h"
 #include "sysemu/qtest.h"
 
+/* All APIs are thread-safe */
+
 typedef struct BDRVBlkdebugState {
-int state;
+/* IN: initialized in blkdebug_open() and never changed */
 uint64_t align;
 uint64_t max_transfer;
 uint64_t opt_write_zero;
 uint64_t max_write_zero;
 uint64_t opt_discard;
 uint64_t max_discard;
-
+char *config_file; /* For blkdebug_refresh_filename() */
+/* initialized in blkdebug_parse_perms() */
 uint64_t take_child_perms;
 uint64_t unshare_child_perms;
 
-/* For blkdebug_refresh_filename() */
-char *config_file;
-
+/* State. Protected by lock */
+int state;
 QLIST_HEAD(, BlkdebugRule) rules[BLKDBG__MAX];
 QSIMPLEQ_HEAD(, BlkdebugRule) active_rules;
 QLIST_HEAD(, BlkdebugSuspendedReq) suspended_reqs;
+QemuMutex lock;
 } BDRVBlkdebugState;
 
 typedef struct BlkdebugAIOCB {
@@ -64,8 +67,11 @@ typedef struct BlkdebugAIOCB {
 } BlkdebugAIOCB;
 
 typedef struct BlkdebugSuspendedReq {
+/* IN: initialized in suspend_request() */
 Coroutine *co;
 char *tag;
+
+/* List entry protected by BDRVBlkdebugState's lock */
 QLIST_ENTRY(BlkdebugSuspendedReq) next;
 } BlkdebugSuspendedReq;
 
@@ -77,6 +83,7 @@ enum {
 };
 
 typedef struct BlkdebugRule {
+/* IN: initialized in add_rule() or blkdebug_debug_breakpoint() */
 BlkdebugEvent event;
 int action;
 int state;
@@ -95,6 +102,8 @@ typedef struct BlkdebugRule {
 char *tag;
 } suspend;
 } options;
+
+/* List entries protected by BDRVBlkdebugState's lock */
 QLIST_ENTRY(BlkdebugRule) next;
 QSIMPLEQ_ENTRY(BlkdebugRule) active_next;
 } BlkdebugRule;
@@ -244,11 +253,14 @@ static int add_rule(void *opaque, QemuOpts *opts, Error **errp)
  };

  /* Add the rule */
+qemu_mutex_lock(&s->lock);
  QLIST_INSERT_HEAD(&s->rules[event], rule, next);
+qemu_mutex_unlock(&s->lock);
 
 return 0;
 }
 
+/* Called with lock held or from .bdrv_close */
 static void remove_rule(BlkdebugRule *rule)
 {
 switch (rule->action) {
@@ -467,6 +479,7 @@ static int blkdebug_open(BlockDriverState *bs, QDict *options, int flags,
  int ret;
  uint64_t align;

+qemu_mutex_init(&s->lock);
  opts = qemu_opts_create(&runtime_opts, NULL, 0, &error_abort);
 if (!qemu_opts_absorb_qdict(opts, options, errp)) {
 ret = -EINVAL;
@@ -567,6 +580,7 @@ static int blkdebug_open(BlockDriverState *bs, QDict *options, int flags,
  ret = 0;
  out:
  if (ret < 0) {
+qemu_mutex_destroy(&s->lock);
 g_free(s->config_file);
 }
 qemu_opts_del(opts);
@@ -581,6 +595,7 @@ static int rule_check(BlockDriverState *bs, uint64_t offset, uint64_t bytes,
  int error;
  bool immediately;

+qemu_mutex_lock(&s->lock);
  QSIMPLEQ_FOREACH(rule, &s->active_rules, active_next) {
 uint64_t inject_offset = rule->options.inject.offset;
 
@@ -594,6 +609,7 @@ static int rule_check(BlockDriverState *bs, uint64_t offset, uint64_t bytes,
  }

  if (!rule || !rule->options.inject.error) {
+qemu_mutex_unlock(&s->lock);
 return 0;
 }
 
@@ -605,6 +621,7 @@ static int rule_check(BlockDriverState *bs, uint64_t offset, uint64_t bytes,
  remove_rule(rule);
  }

+qemu_mutex_unlock(&s->lock);
 if (!immediately) {
 aio_co_schedule(qemu_get_current_aio_context(), qemu_coroutine_self());
 qemu_coroutine_yield();
@@ -770,8 +787,10 @@ static void blkdebug_close(BlockDriverState *bs)
 }
 
 g_free(s->config_file);
+qemu_mutex_destroy(&s->lock);
 }
 
+/* Called with lock held.  */
 static void suspend_request(BlockDriverState *bs, BlkdebugRule *rule)
 {
 BDRVBlkdebugState *s = bs->opaque;
@@ -790,6 +809,7 @@ static void suspend_request(BlockDriverState *bs, BlkdebugRule *rule)
 }
 }
 
+/* Called with lock held.  */
 static void process_rule(BlockDriverState *bs, struct BlkdebugRule *rule,
  int *action_count, int *new_state)
 {
@@ -829,11 +849,13 @@ static void blkdebug_debug_event(BlockDriverState *bs, BlkdebugEvent event)

  assert((int)event >= 0 && event < BLKDBG__MAX);

[PULL 4/6] blkdebug: do not suspend in the middle of QLIST_FOREACH_SAFE

2021-07-19 Thread Max Reitz
From: Emanuele Giuseppe Esposito 

That would be unsafe in case a rule other than the current one
is removed while the coroutine has yielded.
Keep FOREACH_SAFE because suspend_request deletes the current rule.

After this patch, *all* matching rules are deleted before suspending
the coroutine, rather than just one.
This doesn't affect the existing testcases.

Use actions_count to see how many yields to issue.
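
Consolidated from the hunks below, the resulting event handler is 
(shape only):

QLIST_FOREACH_SAFE(rule, &s->rules[event], next, next) {
    process_rule(bs, rule, actions_count);   /* may remove 'rule' itself */
}

while (actions_count[ACTION_SUSPEND] > 0) {
    qemu_coroutine_yield();                  /* list walk is already over */
    actions_count[ACTION_SUSPEND]--;
}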

Co-developed-by: Paolo Bonzini 
Signed-off-by: Emanuele Giuseppe Esposito 
Reviewed-by: Vladimir Sementsov-Ogievskiy 
Message-Id: <20210614082931.24925-5-eespo...@redhat.com>
Signed-off-by: Max Reitz 
---
 block/blkdebug.c | 7 ++-
 1 file changed, 6 insertions(+), 1 deletion(-)

diff --git a/block/blkdebug.c b/block/blkdebug.c
index 6bdeb2c7b3..dd82131d1e 100644
--- a/block/blkdebug.c
+++ b/block/blkdebug.c
@@ -789,7 +789,6 @@ static void suspend_request(BlockDriverState *bs, BlkdebugRule *rule)
 if (!qtest_enabled()) {
 printf("blkdebug: Suspended request '%s'\n", r->tag);
 }
-qemu_coroutine_yield();
 }
 
 static void process_rule(BlockDriverState *bs, struct BlkdebugRule *rule,
@@ -834,6 +833,12 @@ static void blkdebug_debug_event(BlockDriverState *bs, BlkdebugEvent event)
  QLIST_FOREACH_SAFE(rule, &s->rules[event], next, next) {
 process_rule(bs, rule, actions_count);
 }
+
+while (actions_count[ACTION_SUSPEND] > 0) {
+qemu_coroutine_yield();
+actions_count[ACTION_SUSPEND]--;
+}
+
 s->state = s->new_state;
 }
 
-- 
2.31.1




[PULL 2/6] blkdebug: move post-resume handling to resume_req_by_tag

2021-07-19 Thread Max Reitz
From: Emanuele Giuseppe Esposito 

We want to move qemu_coroutine_yield() after the loop on rules,
because QLIST_FOREACH_SAFE is wrong if the rule list is modified
while the coroutine has yielded.  Therefore move the suspended
request to the heap and clean it up from the remove side.
All that is left is for blkdebug_debug_event to handle the
yielding.

Co-developed-by: Paolo Bonzini 
Signed-off-by: Emanuele Giuseppe Esposito 
Reviewed-by: Vladimir Sementsov-Ogievskiy 
Message-Id: <20210614082931.24925-3-eespo...@redhat.com>
Signed-off-by: Max Reitz 
---
 block/blkdebug.c | 31 ++-
 1 file changed, 18 insertions(+), 13 deletions(-)

diff --git a/block/blkdebug.c b/block/blkdebug.c
index 5ccbfcab42..e8fdf7b056 100644
--- a/block/blkdebug.c
+++ b/block/blkdebug.c
@@ -775,25 +775,20 @@ static void blkdebug_close(BlockDriverState *bs)
 static void suspend_request(BlockDriverState *bs, BlkdebugRule *rule)
 {
 BDRVBlkdebugState *s = bs->opaque;
-BlkdebugSuspendedReq r;
+BlkdebugSuspendedReq *r;
 
-r = (BlkdebugSuspendedReq) {
-.co = qemu_coroutine_self(),
-.tag = g_strdup(rule->options.suspend.tag),
-};
+r = g_new(BlkdebugSuspendedReq, 1);
+
+r->co = qemu_coroutine_self();
+r->tag = g_strdup(rule->options.suspend.tag);

  remove_rule(rule);
-QLIST_INSERT_HEAD(&s->suspended_reqs, &r, next);
+QLIST_INSERT_HEAD(&s->suspended_reqs, r, next);
 
 if (!qtest_enabled()) {
-printf("blkdebug: Suspended request '%s'\n", r.tag);
+printf("blkdebug: Suspended request '%s'\n", r->tag);
 }
 qemu_coroutine_yield();
-if (!qtest_enabled()) {
-printf("blkdebug: Resuming request '%s'\n", r.tag);
-}
-
-g_free(r.tag);
 }
 
 static bool process_rule(BlockDriverState *bs, struct BlkdebugRule *rule,
@@ -880,8 +875,18 @@ retry:
  */
  QLIST_FOREACH(r, &s->suspended_reqs, next) {
 if (!strcmp(r->tag, tag)) {
+Coroutine *co = r->co;
+
+if (!qtest_enabled()) {
+printf("blkdebug: Resuming request '%s'\n", r->tag);
+}
+
 QLIST_REMOVE(r, next);
-qemu_coroutine_enter(r->co);
+g_free(r->tag);
+g_free(r);
+
+qemu_coroutine_enter(co);
+
 if (all) {
 goto retry;
 }
-- 
2.31.1




[PULL 3/6] blkdebug: track all actions

2021-07-19 Thread Max Reitz
From: Emanuele Giuseppe Esposito 

Add a counter for each action that a rule can trigger.
This is mainly used to keep track of how many coroutine_yield()
we need to perform after processing all rules in the list.

Co-developed-by: Paolo Bonzini 
Signed-off-by: Emanuele Giuseppe Esposito 
Reviewed-by: Vladimir Sementsov-Ogievskiy 
Message-Id: <20210614082931.24925-4-eespo...@redhat.com>
Signed-off-by: Max Reitz 
---
 block/blkdebug.c | 17 -
 1 file changed, 8 insertions(+), 9 deletions(-)

diff --git a/block/blkdebug.c b/block/blkdebug.c
index e8fdf7b056..6bdeb2c7b3 100644
--- a/block/blkdebug.c
+++ b/block/blkdebug.c
@@ -74,6 +74,7 @@ enum {
 ACTION_INJECT_ERROR,
 ACTION_SET_STATE,
 ACTION_SUSPEND,
+ACTION__MAX,
 };
 
 typedef struct BlkdebugRule {
@@ -791,22 +792,22 @@ static void suspend_request(BlockDriverState *bs, BlkdebugRule *rule)
 qemu_coroutine_yield();
 }
 
-static bool process_rule(BlockDriverState *bs, struct BlkdebugRule *rule,
-bool injected)
+static void process_rule(BlockDriverState *bs, struct BlkdebugRule *rule,
+ int *action_count)
 {
 BDRVBlkdebugState *s = bs->opaque;
 
 /* Only process rules for the current state */
 if (rule->state && rule->state != s->state) {
-return injected;
+return;
 }
 
 /* Take the action */
+action_count[rule->action]++;
 switch (rule->action) {
 case ACTION_INJECT_ERROR:
-if (!injected) {
+if (action_count[ACTION_INJECT_ERROR] == 1) {
  QSIMPLEQ_INIT(&s->active_rules);
-injected = true;
  }
  QSIMPLEQ_INSERT_HEAD(&s->active_rules, rule, active_next);
 break;
@@ -819,21 +820,19 @@ static bool process_rule(BlockDriverState *bs, struct BlkdebugRule *rule,
 suspend_request(bs, rule);
 break;
 }
-return injected;
 }
 
 static void blkdebug_debug_event(BlockDriverState *bs, BlkdebugEvent event)
 {
 BDRVBlkdebugState *s = bs->opaque;
 struct BlkdebugRule *rule, *next;
-bool injected;
+int actions_count[ACTION__MAX] = { 0 };
 
 assert((int)event >= 0 && event < BLKDBG__MAX);
 
-injected = false;
 s->new_state = s->state;
  QLIST_FOREACH_SAFE(rule, &s->rules[event], next, next) {
-injected = process_rule(bs, rule, injected);
+process_rule(bs, rule, actions_count);
 }
 s->state = s->new_state;
 }
-- 
2.31.1




[PULL 0/6] Block patches for 6.1-rc0

2021-07-19 Thread Max Reitz
The following changes since commit 7457b407edd6e8555e4b46488aab2f13959fccf8:

  Merge remote-tracking branch 'remotes/thuth-gitlab/tags/pull-request-2021-07-19' into staging (2021-07-19 11:34:08 +0100)

are available in the Git repository at:

  https://github.com/XanClic/qemu.git tags/pull-block-2021-07-19

for you to fetch changes up to 36109bff171ba0811fa4c723cecdf6c3561fa318:

  blkdebug: protect rules and suspended_reqs with a lock (2021-07-19 17:38:38 +0200)


Block patches for 6.1-rc0:
- Make blkdebug's suspend/resume handling robust (and thread-safe)


Emanuele Giuseppe Esposito (6):
  blkdebug: refactor removal of a suspended request
  blkdebug: move post-resume handling to resume_req_by_tag
  blkdebug: track all actions
  blkdebug: do not suspend in the middle of QLIST_FOREACH_SAFE
  block/blkdebug: remove new_state field and instead use a local
variable
  blkdebug: protect rules and suspended_reqs with a lock

 block/blkdebug.c | 136 ---
 1 file changed, 92 insertions(+), 44 deletions(-)

-- 
2.31.1




[PULL 5/6] block/blkdebug: remove new_state field and instead use a local variable

2021-07-19 Thread Max Reitz
From: Emanuele Giuseppe Esposito 

There seems to be no benefit in using a field. Replace it with a local
variable, and move the state update before the yields.

The state update has to be done before the yields because now, using
a local variable, the new updated state is not visible to
the other yields.

Signed-off-by: Emanuele Giuseppe Esposito 
Message-Id: <20210614082931.24925-6-eespo...@redhat.com>
Reviewed-by: Vladimir Sementsov-Ogievskiy 
Signed-off-by: Max Reitz 
---
 block/blkdebug.c | 13 ++---
 1 file changed, 6 insertions(+), 7 deletions(-)

diff --git a/block/blkdebug.c b/block/blkdebug.c
index dd82131d1e..b47c3fd97c 100644
--- a/block/blkdebug.c
+++ b/block/blkdebug.c
@@ -40,7 +40,6 @@
 
 typedef struct BDRVBlkdebugState {
 int state;
-int new_state;
 uint64_t align;
 uint64_t max_transfer;
 uint64_t opt_write_zero;
@@ -792,7 +791,7 @@ static void suspend_request(BlockDriverState *bs, BlkdebugRule *rule)
 }
 
 static void process_rule(BlockDriverState *bs, struct BlkdebugRule *rule,
- int *action_count)
+ int *action_count, int *new_state)
 {
 BDRVBlkdebugState *s = bs->opaque;
 
@@ -812,7 +811,7 @@ static void process_rule(BlockDriverState *bs, struct BlkdebugRule *rule,
 break;
 
 case ACTION_SET_STATE:
-s->new_state = rule->options.set_state.new_state;
+*new_state = rule->options.set_state.new_state;
 break;
 
 case ACTION_SUSPEND:
@@ -825,21 +824,21 @@ static void blkdebug_debug_event(BlockDriverState *bs, BlkdebugEvent event)
 {
 BDRVBlkdebugState *s = bs->opaque;
 struct BlkdebugRule *rule, *next;
+int new_state;
 int actions_count[ACTION__MAX] = { 0 };
 
 assert((int)event >= 0 && event < BLKDBG__MAX);
 
-s->new_state = s->state;
+new_state = s->state;
  QLIST_FOREACH_SAFE(rule, &s->rules[event], next, next) {
-process_rule(bs, rule, actions_count);
+process_rule(bs, rule, actions_count, &new_state);
 }
+s->state = new_state;
 
 while (actions_count[ACTION_SUSPEND] > 0) {
 qemu_coroutine_yield();
 actions_count[ACTION_SUSPEND]--;
 }
-
-s->state = s->new_state;
 }
 
 static int blkdebug_debug_breakpoint(BlockDriverState *bs, const char *event,
-- 
2.31.1




[PULL 1/6] blkdebug: refactor removal of a suspended request

2021-07-19 Thread Max Reitz
From: Emanuele Giuseppe Esposito 

Extract to a separate function.  Do not rely on FOREACH_SAFE, which is
only "safe" if the *current* node is removed---not if another node is
removed.  Instead, just walk the entire list from the beginning when
asked to resume all suspended requests with a given tag.
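
A standalone illustration of the pitfall (hand-rolled list, not QEMU code):

#include <stdlib.h>

struct node {
    int val;
    struct node *next;
};

int main(void)
{
    struct node *head = NULL, *n, *cached;

    for (int i = 2; i >= 0; i--) {
        n = malloc(sizeof(*n));
        n->val = i;
        n->next = head;
        head = n;
    }

    /* FOREACH_SAFE in miniature: the successor is cached up front */
    for (n = head; n; n = cached) {
        cached = n->next;
        /*
         * Freeing 'n' here would be fine: 'cached' still points at a
         * live node.  But if the body freed some *other* node -- the
         * one 'cached' happens to point at -- the next step would use
         * freed memory.  Hence the patch: walk from the head again
         * after waking a suspended request.
         */
    }

    /* cleanup */
    for (n = head; n; n = cached) {
        cached = n->next;
        free(n);
    }
    return 0;
}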

Co-developed-by: Paolo Bonzini 
Signed-off-by: Emanuele Giuseppe Esposito 
Reviewed-by: Vladimir Sementsov-Ogievskiy 
Reviewed-by: Eric Blake 
Message-Id: <20210614082931.24925-2-eespo...@redhat.com>
Signed-off-by: Max Reitz 
---
 block/blkdebug.c | 33 ++---
 1 file changed, 22 insertions(+), 11 deletions(-)

diff --git a/block/blkdebug.c b/block/blkdebug.c
index 2c0b9b0ee8..5ccbfcab42 100644
--- a/block/blkdebug.c
+++ b/block/blkdebug.c
@@ -793,7 +793,6 @@ static void suspend_request(BlockDriverState *bs, BlkdebugRule *rule)
 printf("blkdebug: Resuming request '%s'\n", r.tag);
 }
 
-QLIST_REMOVE(&r, next);
 g_free(r.tag);
 }
 
@@ -869,25 +868,40 @@ static int blkdebug_debug_breakpoint(BlockDriverState *bs, const char *event,
 return 0;
 }
 
-static int blkdebug_debug_resume(BlockDriverState *bs, const char *tag)
+static int resume_req_by_tag(BDRVBlkdebugState *s, const char *tag, bool all)
 {
-BDRVBlkdebugState *s = bs->opaque;
-BlkdebugSuspendedReq *r, *next;
+BlkdebugSuspendedReq *r;
 
-QLIST_FOREACH_SAFE(r, &s->suspended_reqs, next, next) {
+retry:
+/*
+ * No need for _SAFE, since a different coroutine can remove another node
+ * (not the current one) in this list, and when the current one is removed
+ * the iteration starts back from beginning anyways.
+ */
+QLIST_FOREACH(r, &s->suspended_reqs, next) {
 if (!strcmp(r->tag, tag)) {
+QLIST_REMOVE(r, next);
 qemu_coroutine_enter(r->co);
+if (all) {
+goto retry;
+}
 return 0;
 }
 }
 return -ENOENT;
 }
 
+static int blkdebug_debug_resume(BlockDriverState *bs, const char *tag)
+{
+BDRVBlkdebugState *s = bs->opaque;
+
+return resume_req_by_tag(s, tag, false);
+}
+
 static int blkdebug_debug_remove_breakpoint(BlockDriverState *bs,
 const char *tag)
 {
 BDRVBlkdebugState *s = bs->opaque;
-BlkdebugSuspendedReq *r, *r_next;
 BlkdebugRule *rule, *next;
 int i, ret = -ENOENT;
 
@@ -900,11 +914,8 @@ static int blkdebug_debug_remove_breakpoint(BlockDriverState *bs,
 }
 }
 }
-QLIST_FOREACH_SAFE(r, &s->suspended_reqs, next, r_next) {
-if (!strcmp(r->tag, tag)) {
-qemu_coroutine_enter(r->co);
-ret = 0;
-}
+if (resume_req_by_tag(s, tag, true) == 0) {
+ret = 0;
 }
 return ret;
 }
-- 
2.31.1




Re: [PATCH 08/14] iotests/common.rc: _make_test_img(): smarter compressiont_type handling

2021-07-16 Thread Max Reitz

On 16.07.21 16:24, Vladimir Sementsov-Ogievskiy wrote:

16.07.2021 15:38, Max Reitz wrote:

Subject: s/compressiont_type/compression_type/

On 05.07.21 11:15, Vladimir Sementsov-Ogievskiy wrote:

Like it is done in iotests.py in qemu_img_create_prepare_args(), let's
not follow compression_type=zstd of IMGOPTS if test creates image in
old format.

Signed-off-by: Vladimir Sementsov-Ogievskiy 
---
  tests/qemu-iotests/common.rc | 8 
  1 file changed, 8 insertions(+)

diff --git a/tests/qemu-iotests/common.rc 
b/tests/qemu-iotests/common.rc

index cbbf6d7c7f..4cae5b2d70 100644
--- a/tests/qemu-iotests/common.rc
+++ b/tests/qemu-iotests/common.rc
@@ -438,6 +438,14 @@ _make_test_img()
  backing_file=$param
  continue
  elif $opts_param; then
+    if [[ "$param" == *"compat=0"* ]]; then


Like in patch 2, probably should be 0.10, and account for “v2”.

+    # If user specified zstd compression type in IMGOPTS, this will
+    # just not work. So, let's imply forcing zlib compression when
+    # test creates image in old version of the format.
+    # Similarly works qemu_img_create_prepare_args() in iotests.py
+    optstr=$(echo "$optstr" | $SED -e 's/compression_type=\w\+//')


What about the surrounding comma, if compression_type is just one 
option among others?  Do we need something like


$SED -e 's/,compression_type=\w\+//' -e 's/compression_type=\w\+,\?//'

?


Agree



+    optstr=$(_optstr_add "$optstr" "compression_type=zlib")


As the comment says, this is for compression_type in $IMGOPTS and 
then compat=0.10 in the parameters.  It won’t work if you have e.g. 
“_make_test_img -o compat=0.10,compression_type=zstd”, because then 
this generates the optstr 
“$IMGOPTS,compression_type=zlib,compat=0.10,compression_type=zstd”. 
Not sure if we want to care about this case, but, well...


Then there’s the case where I have compat=0.10 in $IMGOPTS, and the 
test wants to use compression_type=zstd.  I think it’s correct not to 
replace compression_type=zstd then, because the test should be 
skipped for compat=0.10 in $IMGOPTS.  But that’s not what happens in 
the iotest.py version (qemu_img_create_prepare_args()), so I wonder 
whether the latter should be made to match this behavior here, if in 
any way possible.


Now that I think about it more, I begin to wonder more...

So this code doesn’t explicitly handle compression_type only in 
$IMGOPTS.  If you have


_make_test_img -o compression_type=zstd,compat=0.10

It’ll still keep the compression_type=zstd.  However, for

_make_test_img -o compression_type=zstd -o compat=0.10

It’ll replace it by zlib.

So perhaps we should explicitly scan for compression_type only in 
$IMGOPTS and then drop it from the optstr if compat=0.10 is in the 
_make_test_img's -o options.


But thinking further, this is not how $IMGOPTS work.  So far they 
aren’t advisory, they are mandatory.  If a test cannot work with 
something in $IMGOPTS, it has to be skipped.  Like, when you have 
compat=0.10 in $IMGOPTS, I don’t want to run tests that use v3 
features and have them just create v3 images for those tests.


So my impression is that you’re giving compression_type special 
treatment here, and I don’t know why exactly.  Tests that create v2 
images should just have compression_type be an unsupported_imgopt.


Max



Hmm.. I have a better idea: deprecate v2 and drop all iotest support for 
it now :)) What do you think?


I haven’t yet understood the appeal of deprecating v2, because basically 
all code is shared between v2 and v3. So I don’t really see the appeal 
in dropping iotest support for it either.


At least, this doesn’t appear like a better idea than to add 
_unsupported_imgopts where needed (in fact, _unsupported_imgopts should 
already be there for other v3-only options like lazy_refcounts).


If not, then instead of this patch we should just skip all tests that 
don't support compression_type=zstd due to using the old version.. This 
means that we will skip some test cases that could work with zstd, just 
because we can't skip separate test cases in bash tests.


The standard procedure for this is to have a quick look whether we 
actually lose (relevant) coverage for this imgopt if we skip the test 
(usually not), and if so, split that part out into a new file. But 
again, usually nothing of value is lost, so nothing is split off.



(ohh, I'd deprecate bash tests too.. But that's a matter of taste)


So far I don’t think there is a pressing reason why bash tests would be 
harder to support than Python tests, and so the effort to port all bash 
tests to Python seems much more difficult to me than having to duplicate 
meta-work like this.


(And in fact, as an example, I found it much easier to have bash tests 
support -o data_file than the Python tests, not least because the bash 
tests at least kind of all w

Re: [PATCH 12/14] iotests 60: more accurate set dirty bit in qcow2 header

2021-07-16 Thread Max Reitz

On 05.07.21 11:15, Vladimir Sementsov-Ogievskiy wrote:

Don't touch other incompatible bits, like compression-type. This makes
the test pass with IMGOPTS='compression_type=zstd'.

Signed-off-by: Vladimir Sementsov-Ogievskiy 
---
  tests/qemu-iotests/060 | 2 +-
  1 file changed, 1 insertion(+), 1 deletion(-)


Reviewed-by: Max Reitz 




Re: [PATCH 14/14] iotest 214: explicit compression type

2021-07-16 Thread Max Reitz

On 05.07.21 11:15, Vladimir Sementsov-Ogievskiy wrote:

The test-case "Corrupted size field in compressed cluster descriptor"
heavily depends on zlib compression type. So, make it explicit. This
way test passes with IMGOPTS='compression_type=zstd'.

Signed-off-by: Vladimir Sementsov-Ogievskiy 
---
  tests/qemu-iotests/214 | 2 +-
  1 file changed, 1 insertion(+), 1 deletion(-)


Reviewed-by: Max Reitz 




Re: [PATCH 13/14] iotest 39: use _qcow2_dump_header

2021-07-16 Thread Max Reitz

On 05.07.21 11:15, Vladimir Sementsov-Ogievskiy wrote:

_qcow2_dump_header has a filter for the compression type, so this change
makes the test pass with IMGOPTS='compression_type=zstd'.

Signed-off-by: Vladimir Sementsov-Ogievskiy 
---
  tests/qemu-iotests/039 | 2 +-
  1 file changed, 1 insertion(+), 1 deletion(-)


Reviewed-by: Max Reitz 

But as I said, I’d prefer this to come right after patch 10.

Max




Re: [PATCH 11/14] iotests: bash tests: filter compression type

2021-07-16 Thread Max Reitz

On 05.07.21 11:15, Vladimir Sementsov-Ogievskiy wrote:

We want iotests to pass with both the default zlib compression and with
IMGOPTS='compression_type=zstd'.

Actually, the only test that is interested in the real compression type
in the test output is 287 (the test for the qcow2 compression type), so
implement a specific option for it.

Signed-off-by: Vladimir Sementsov-Ogievskiy 
---
  tests/qemu-iotests/060.out   |  2 +-
  tests/qemu-iotests/061.out   | 12 ++--
  tests/qemu-iotests/082.out   | 14 +++---
  tests/qemu-iotests/198.out   |  4 ++--
  tests/qemu-iotests/287   |  8 
  tests/qemu-iotests/common.filter |  7 +++
  tests/qemu-iotests/common.rc | 14 +-
  7 files changed, 40 insertions(+), 21 deletions(-)


[...]


diff --git a/tests/qemu-iotests/common.filter b/tests/qemu-iotests/common.filter
index 268b749e2f..78efe3e4dd 100644
--- a/tests/qemu-iotests/common.filter
+++ b/tests/qemu-iotests/common.filter
@@ -247,6 +247,7 @@ _filter_img_info()
  -e "/block_state_zero: \\(on\\|off\\)/d" \
  -e "/log_size: [0-9]\\+/d" \
  -e "s/iters: [0-9]\\+/iters: 1024/" \
+-e 's/\(compression type: \)\(zlib\|zstd\)/\1COMPRESSION_TYPE/' \
  -e "s/uuid: [-a-f0-9]\\+/uuid: ----/" 
| \
  while IFS='' read -r line; do
  if [[ $format_specific == 1 ]]; then
@@ -332,5 +333,11 @@ for fname in fnames:
  sys.stdout.write(result)'
  }
  
+_filter_qcow2_compression_type_bit()

+{
+$SED -e 's/\(incompatible_features\s\+\)\[3\(, \)\?/\1[/' \
+ -e 's/\(incompatible_features.*\), 3\]/\1]/'


What about “incompatible_features   [2, 3, 4]”?

I’d like to propose adding some form of filtering parameter to qcow2.py 
which allows filtering a specific bit from the qcow2_format.Flags64 
representation, but that seems rather difficult, actually...
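For completeness, the middle-of-list case could probably be covered on
the sed side with one more expression; an untested sketch:

    $SED -e 's/\(incompatible_features\s\+\)\[3\(, \)\?/\1[/' \
         -e 's/\(incompatible_features.*\), 3, /\1, /' \
         -e 's/\(incompatible_features.*\), 3\]/\1]/'

That would turn “[2, 3, 4]” into “[2, 4]”, while the two existing
expressions keep handling a 3 at the start or at the end of the list.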



+}
+
  # make sure this script returns success
  true
diff --git a/tests/qemu-iotests/common.rc b/tests/qemu-iotests/common.rc
index ee4b9d795e..813b51ee03 100644
--- a/tests/qemu-iotests/common.rc
+++ b/tests/qemu-iotests/common.rc
@@ -697,6 +697,7 @@ _img_info()
  -e "s#$TEST_DIR#TEST_DIR#g" \
  -e "s#$SOCK_DIR/fuse-#TEST_DIR/#g" \
  -e "s#$IMGFMT#IMGFMT#g" \
+-e 's/\(compression type: \)\(zlib\|zstd\)/\1COMPRESSION_TYPE/' \
  -e "/^disk size:/ D" \
  -e "/actual-size/ D" | \
  while IFS='' read -r line; do
@@ -996,12 +997,23 @@ _require_one_device_of()
  
  _qcow2_dump_header()

  {
+if [[ "$1" == "--no-filter-compression" ]]; then
+local filter_compression=0
+shift
+else
+local filter_compression=1
+fi
+
  img="$1"
  if [ -z "$img" ]; then
  img="$TEST_IMG"
  fi
  
-$PYTHON qcow2.py "$img" dump-header

+if [[ $filter_compression == 0 ]]; then
+$PYTHON qcow2.py "$img" dump-header
+else
+$PYTHON qcow2.py "$img" dump-header | 
_filter_qcow2_compression_type_bit
+fi
  }
  
  # make sure this script returns success


Could have been done more extensibly for the future (i.e. a loop over 
the parameters, and a variable to invoke all applicable filters), but, 
well.  Not much reason to think about a future that we’re not sure will 
ever happen.
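For the record, what I had in mind is something like this (untested
sketch):

    _qcow2_dump_header()
    {
        # Collect the applicable filters in an array, so future
        # --no-filter-* options only need to edit this array
        local filters=(_filter_qcow2_compression_type_bit)

        while true; do
            case "$1" in
            --no-filter-compression)
                filters=()
                shift
                ;;
            *)
                break
                ;;
            esac
        done

        local img="${1:-$TEST_IMG}"
        local out f
        out=$($PYTHON qcow2.py "$img" dump-header)
        for f in "${filters[@]}"; do
            out=$(printf '%s\n' "$out" | "$f")
        done
        printf '%s\n' "$out"
    }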


Max




Re: [PATCH 10/14] iotests: massive use _qcow2_dump_header

2021-07-16 Thread Max Reitz

On 05.07.21 11:15, Vladimir Sementsov-Ogievskiy wrote:

We are going to add filtering in _qcow2_dump_header and want all tests
to use it.

The patch is generated by commands:
   cd tests/qemu-iotests
   sed -ie 's/$PYTHON qcow2.py "$TEST_IMG" dump-header\($\| \)/_qcow2_dump_header\1/' ??? tests/*

(the difficulty is to avoid converting dump-header-exts)

Signed-off-by: Vladimir Sementsov-Ogievskiy 
---
  tests/qemu-iotests/031 |  6 +++---
  tests/qemu-iotests/036 |  6 +++---
  tests/qemu-iotests/039 | 20 ++--
  tests/qemu-iotests/060 | 20 ++--
  tests/qemu-iotests/061 | 36 ++--
  tests/qemu-iotests/137 |  2 +-
  tests/qemu-iotests/287 |  8 
  7 files changed, 49 insertions(+), 49 deletions(-)


Reviewed-by: Max Reitz 

I think I’d have merged patch 13 into this one, but if you want to keep 
it separate (so that this remains a purely auto-generated patch), then I 
think it should at least come right after this one.


Max




Re: [PATCH 09/14] iotests/common.rc: introduce _qcow2_dump_header helper

2021-07-16 Thread Max Reitz

On 05.07.21 11:15, Vladimir Sementsov-Ogievskiy wrote:

We'll use it in tests instead of invoking qcow2.py explicitly. Then we
are going to add some filtering in _qcow2_dump_header.

Signed-off-by: Vladimir Sementsov-Ogievskiy 
---
  tests/qemu-iotests/common.rc | 10 ++
  1 file changed, 10 insertions(+)


Reviewed-by: Max Reitz 




Re: [PATCH 08/14] iotests/common.rc: _make_test_img(): smarter compressiont_type handling

2021-07-16 Thread Max Reitz

Subject: s/compressiont_type/compression_type/

On 05.07.21 11:15, Vladimir Sementsov-Ogievskiy wrote:

Like it is done in qemu_img_create_prepare_args() in iotests.py, let's
not follow compression_type=zstd from IMGOPTS if the test creates an
image in the old format.

Signed-off-by: Vladimir Sementsov-Ogievskiy 
---
  tests/qemu-iotests/common.rc | 8 
  1 file changed, 8 insertions(+)

diff --git a/tests/qemu-iotests/common.rc b/tests/qemu-iotests/common.rc
index cbbf6d7c7f..4cae5b2d70 100644
--- a/tests/qemu-iotests/common.rc
+++ b/tests/qemu-iotests/common.rc
@@ -438,6 +438,14 @@ _make_test_img()
  backing_file=$param
  continue
  elif $opts_param; then
+if [[ "$param" == *"compat=0"* ]]; then


Like in patch 2, probably should be 0.10, and account for “v2”.


+# If user specified zstd compression type in IMGOPTS, this will
+# just not work. So, let's imply forcing zlib compression when
+# test creates image in old version of the format.
+# Similarly works qemu_img_create_prepare_args() in iotests.py
+optstr=$(echo "$optstr" | $SED -e 's/compression_type=\w\+//')


What about the surrounding comma, if compression_type is just one option 
among others?  Do we need something like


$SED -e 's/,compression_type=\w\+//' -e 's/compression_type=\w\+,\?//'

?
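(To illustrate the stray comma, with only the bare expression:

    $ echo 'compat=1.1,compression_type=zstd,lazy_refcounts=on' \
        | sed -e 's/compression_type=\w\+//'
    compat=1.1,,lazy_refcounts=on

The option string here is just an example, of course.)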


+optstr=$(_optstr_add "$optstr" "compression_type=zlib")


As the comment says, this is for compression_type in $IMGOPTS and then 
compat=0.10 in the parameters.  It won’t work if you have e.g. 
“_make_test_img -o compat=0.10,compression_type=zstd”, because then this 
generates the optstr 
“$IMGOPTS,compression_type=zlib,compat=0.10,compression_type=zstd”. Not 
sure if we want to care about this case, but, well...


Then there’s the case where I have compat=0.10 in $IMGOPTS, and the test 
wants to use compression_type=zstd.  I think it’s correct not to replace 
compression_type=zstd then, because the test should be skipped for 
compat=0.10 in $IMGOPTS.  But that’s not what happens in the iotests.py 
version (qemu_img_create_prepare_args()), so I wonder whether the latter 
should be made to match this behavior here, if in any way possible.


Now that I think about it more, I begin to wonder more...

So this code doesn’t explicitly handle compression_type only in 
$IMGOPTS.  If you have


_make_test_img -o compression_type=zstd,compat=0.10

It’ll still keep the compression_type=zstd.  However, for

_make_test_img -o compression_type=zstd -o compat=0.10

It’ll replace it with zlib.

So perhaps we should explicitly scan for compression_type only in 
$IMGOPTS and then drop it from the optstr if compat=0.10 is in the 
_make_test_img's -o options.
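Something like the following, perhaps (untested; user_optstr is a
hypothetical variable holding only the accumulated -o arguments, which
the function would have to collect separately from $IMGOPTS):

    if [[ "$IMGOPTS" == *compression_type=* &&
          "$user_optstr" == *compat=0.10* ]]; then
        optstr=$(echo "$optstr" | $SED -e 's/,compression_type=\w\+//' \
                                       -e 's/compression_type=\w\+,\?//')
    fi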


But thinking further, this is not how $IMGOPTS works.  So far it isn’t 
advisory; it is mandatory.  If a test cannot work with something in 
$IMGOPTS, it has to be skipped.  Like, when you have compat=0.10 in 
$IMGOPTS, I don’t want to run tests that use v3 features and have them 
just create v3 images anyway.


So my impression is that you’re giving compression_type special 
treatment here, and I don’t know why exactly.  Tests that create v2 
images should just have compression_type be an unsupported_imgopt.


Max


+fi
  optstr=$(_optstr_add "$optstr" "$param")
  opts_param=false
  continue





Re: [PATCH 07/14] qcow2: simple case support for downgrading of qcow2 images with zstd

2021-07-16 Thread Max Reitz

On 05.07.21 11:15, Vladimir Sementsov-Ogievskiy wrote:

If the image doesn't have any compressed clusters, we can easily switch
to zlib compression, which may allow downgrading the image.

That's mostly needed to support IMGOPTS='compression_type=zstd' in some
iotests which do a qcow2 downgrade.

While at it, also fix a checkpatch complaint about '#' in printf
formatting.

Signed-off-by: Vladimir Sementsov-Ogievskiy 
---
  block/qcow2.c | 58 +--
  1 file changed, 56 insertions(+), 2 deletions(-)

diff --git a/block/qcow2.c b/block/qcow2.c
index ee4530cdbd..bed3354474 100644
--- a/block/qcow2.c
+++ b/block/qcow2.c
@@ -5221,6 +5221,38 @@ static int qcow2_load_vmstate(BlockDriverState *bs, QEMUIOVector *qiov,
  qiov->size, qiov, 0, 0);
  }
  
+static int qcow2_has_compressed_clusters(BlockDriverState *bs)

+{
+int64_t offset = 0;
+int64_t bytes = bdrv_getlength(bs);
+
+if (bytes < 0) {
+return bytes;
+}
+
+while (bytes != 0) {
+int ret;
+QCow2SubclusterType type;
+unsigned int cur_bytes = MIN(INT_MAX, bytes);
+uint64_t host_offset;
+
+ret = qcow2_get_host_offset(bs, offset, _bytes, _offset,
+);
+if (ret < 0) {
+return ret;
+}
+
+if (type == QCOW2_SUBCLUSTER_COMPRESSED) {
+return 1;
+}
+
+offset += cur_bytes;
+bytes -= cur_bytes;
+}
+
+return 0;
+}
+
  /*
   * Downgrades an image's version. To achieve this, any incompatible features
   * have to be removed.
@@ -5278,9 +5310,10 @@ static int qcow2_downgrade(BlockDriverState *bs, int target_version,
   * the first place; if that happens nonetheless, returning -ENOTSUP is the
   * best thing to do anyway */
  
-if (s->incompatible_features) {

+if (s->incompatible_features & ~QCOW2_INCOMPAT_COMPRESSION) {
  error_setg(errp, "Cannot downgrade an image with incompatible features 
"
-   "%#" PRIx64 " set", s->incompatible_features);
+   "0x%" PRIx64 " set",
+   s->incompatible_features & ~QCOW2_INCOMPAT_COMPRESSION);
  return -ENOTSUP;
  }
  
@@ -5298,6 +5331,27 @@ static int qcow2_downgrade(BlockDriverState *bs, int target_version,

  return ret;
  }
  
+if (s->incompatible_features & QCOW2_INCOMPAT_COMPRESSION) {

+ret = qcow2_has_compressed_clusters(bs);
+if (ret < 0) {
+error_setg(errp, "Failed to check block status");
+return -EINVAL;
+}
+if (ret) {
+error_setg(errp, "Cannot downgrade an image with zstd compression "


Perhaps s/zstd/non-zlib/?

Like, really “perhaps”.  Right now I think this is the better error 
message; it’s just that “non-zlib” is more technically correct and 
theoretically future-proof.



+   "type and existing compressed clusters");
+return -ENOTSUP;
+}
+/*
+ * No compressed clusters for now, so just choose the default zlib
+ * compression.
+ */
+s->incompatible_features = 0;


Not wrong, though I’d prefer

s->incompatible_features &= ~QCOW2_INCOMPAT_COMPRESSION;

Anyway:

Reviewed-by: Max Reitz 


+s->compression_type = QCOW2_COMPRESSION_TYPE_ZLIB;
+}
+
+assert(s->incompatible_features == 0);
+
  s->qcow_version = target_version;
  ret = qcow2_update_header(bs);
  if (ret < 0) {





Re: [PATCH 06/14] iotest 302: use img_info_log() helper

2021-07-16 Thread Max Reitz

On 05.07.21 11:15, Vladimir Sementsov-Ogievskiy wrote:

Instead of qemu_img_log("info", ..) use generic helper img_info_log().

img_info_log() has smarter logic. For example it use filter_img_info()
to filter output, which in turns filter a compression type. So it will
help us in future when we implement a possibility to use zstd
compression by default (with help of some runtime config file or maybe
build option). For now to test you should recompile qemu with a small
patch:

 --- a/block/qcow2.c
 +++ b/block/qcow2.c
 @@ -3540,6 +3540,11 @@ qcow2_co_create(BlockdevCreateOptions *create_options, Error **errp)
  }
  }

 +if (!qcow2_opts->has_compression_type && version >= 3) {
 +qcow2_opts->has_compression_type = true;
 +qcow2_opts->compression_type = QCOW2_COMPRESSION_TYPE_ZSTD;
 +}
 +
  if (qcow2_opts->has_compression_type &&
  qcow2_opts->compression_type != QCOW2_COMPRESSION_TYPE_ZLIB) {

Signed-off-by: Vladimir Sementsov-Ogievskiy 
---
  tests/qemu-iotests/302 | 3 ++-
  tests/qemu-iotests/302.out | 7 +++
  2 files changed, 5 insertions(+), 5 deletions(-)

diff --git a/tests/qemu-iotests/302 b/tests/qemu-iotests/302
index 5695af4914..2180dbc896 100755
--- a/tests/qemu-iotests/302
+++ b/tests/qemu-iotests/302
@@ -34,6 +34,7 @@ from iotests import (
  qemu_img_measure,
  qemu_io,
  qemu_nbd_popen,
+img_info_log,
  )
  
  iotests.script_initialize(supported_fmts=["qcow2"])

@@ -99,7 +100,7 @@ with tarfile.open(tar_file, "w") as tar:
  nbd_uri)
  
  iotests.log("=== Converted image info ===")

-qemu_img_log("info", nbd_uri)
+img_info_log(nbd_uri)


There’s another `qemu_img_log("info", nbd_uri)` call above this place.  
We can’t use `img_info_log()` there, because in that case, the image is 
not in qcow2 format (which is the test’s image format), but 
`img_info_log()` enforces “-f {imgfmt}”.  It would have been nice to 
have a comment on that somewhere, though.
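Something like this, maybe (just a suggestion):

    # Cannot use img_info_log() here: it enforces “-f {imgfmt}”, and
    # this image is not in {imgfmt} format
    qemu_img_log("info", nbd_uri)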


But, well.

Reviewed-by: Max Reitz 

(And speaking in principle, I don’t think I like the broad 
img_info_log() very much anyway, because I feel like tests should rather 
only have the actually relevant bits in their reference outputs…)


  
  iotests.log("=== Converted image check ===")

  qemu_img_log("check", nbd_uri)
diff --git a/tests/qemu-iotests/302.out b/tests/qemu-iotests/302.out
index e2f6077e83..3e7c281b91 100644
--- a/tests/qemu-iotests/302.out
+++ b/tests/qemu-iotests/302.out
@@ -6,14 +6,13 @@ virtual size: 448 KiB (458752 bytes)
  disk size: unavailable
  
  === Converted image info ===

-image: nbd+unix:///exp?socket=SOCK_DIR/PID-nbd-sock
-file format: qcow2
+image: TEST_IMG
+file format: IMGFMT
  virtual size: 1 GiB (1073741824 bytes)
-disk size: unavailable
  cluster_size: 65536
  Format specific information:
  compat: 1.1
-compression type: zlib
+compression type: COMPRESSION_TYPE
  lazy refcounts: false
  refcount bits: 16
  corrupt: false





Re: [PATCH 05/14] iotests.py: filter compression type out

2021-07-16 Thread Max Reitz

On 05.07.21 11:15, Vladimir Sementsov-Ogievskiy wrote:

We want iotests to pass with both the default zlib compression and with
IMGOPTS='compression_type=zstd'.

Actually, the only test that is interested in the real compression type
in the test output is 287 (the test for the qcow2 compression type), and
it's in bash.  So for now we can safely filter out the compression type
in all qcow2 tests.

Signed-off-by: Vladimir Sementsov-Ogievskiy 
---
  tests/qemu-iotests/206.out| 10 +++---
  tests/qemu-iotests/242.out| 10 +++---
  tests/qemu-iotests/255.out|  8 ++---
  tests/qemu-iotests/274.out| 68 +--
  tests/qemu-iotests/280.out|  2 +-
  tests/qemu-iotests/iotests.py | 13 ++-
  6 files changed, 61 insertions(+), 50 deletions(-)


Looks OK, though I wonder if it weren’t better to have a filter that 
only prints some options and explicitly filters out everything else.  
(Well, actually, I’d prefer not to have the “Formatting…” line in the 
reference output at all, because I don’t see the point, but I suppose 
that can be considered a different problem.)


[...]


diff --git a/tests/qemu-iotests/iotests.py b/tests/qemu-iotests/iotests.py
index 80f0cb4f42..6a8cc1bad7 100644
--- a/tests/qemu-iotests/iotests.py
+++ b/tests/qemu-iotests/iotests.py
@@ -224,9 +224,18 @@ def qemu_img_verbose(*args):
   % (-exitcode, ' '.join(qemu_img_args + list(args
  return exitcode
  
+def filter_img_create(text: str) -> str:

+return re.sub('(compression_type=)(zlib|zstd)', r'\1COMPRESSION_TYPE',
+  text)
+
  def qemu_img_pipe(*args: str) -> str:
  '''Run qemu-img and return its output'''
-return qemu_img_pipe_and_status(*args)[0]
+output =  qemu_img_pipe_and_status(*args)[0]


There’s a superfluous space after '='.


+
+if args[0] == 'create':
+return filter_img_create(output)
+
+return output


Wouldn’t it make more sense to have this filter be in 
qemu_img_pipe_and_status()?


Max


  def qemu_img_log(*args):
  result = qemu_img_pipe(*args)
@@ -479,6 +488,8 @@ def filter_img_info(output, filename):
  'uuid: XXXXXXXX-XXXX-XXXX-XXXX-XXXXXXXXXXXX',
line)
  line = re.sub('cid: [0-9]+', 'cid: XX', line)
+line = re.sub('(compression type: )(zlib|zstd)', r'\1COMPRESSION_TYPE',
+  line)
  lines.append(line)
  return '\n'.join(lines)
  





Re: [PATCH 04/14] iotest 065: explicit compression type

2021-07-16 Thread Max Reitz

On 05.07.21 11:15, Vladimir Sementsov-Ogievskiy wrote:

The test checks different options. It of course fails if
IMGOPTS='compression_type=zstd' is set. So, let's be explicit about what
compression type we want and be independent of IMGOPTS. Test both
existing compression types.

Signed-off-by: Vladimir Sementsov-Ogievskiy 
---
  tests/qemu-iotests/065 | 14 +++---
  1 file changed, 7 insertions(+), 7 deletions(-)


Reviewed-by: Max Reitz 




Re: [PATCH 03/14] iotest 303: explicit compression type

2021-07-16 Thread Max Reitz

On 05.07.21 11:15, Vladimir Sementsov-Ogievskiy wrote:

The test prints qcow2 header fields which depend on the chosen
compression type. So, let's be explicit about what compression type we
want and be independent of IMGOPTS. Test both existing compression types.

Signed-off-by: Vladimir Sementsov-Ogievskiy 
---
  tests/qemu-iotests/303 | 25 -
  tests/qemu-iotests/303.out | 30 +-
  2 files changed, 45 insertions(+), 10 deletions(-)


Reviewed-by: Max Reitz 




Re: [PATCH 02/14] iotests.py: qemu_img*("create"): support IMGOPTS='compression_type=zstd'

2021-07-16 Thread Max Reitz

On 05.07.21 11:15, Vladimir Sementsov-Ogievskiy wrote:

Adding support for IMGOPTS (like in the bash tests) allows the user to
pass a lot of different options. Still, some may require additional logic.

Now we want the compression_type option, so add some smart logic around it:
ignore compression_type=zstd in IMGOPTS if the test wants qcow2 in
compatibility mode. Likewise, ignore compression_type for non-qcow2
formats.

Note that we may instead add support only to qemu_img_create(), but
that works badly:

1. We'd have to update a lot of tests to use qemu_img_create instead
of qemu_img('create'). (Still, we may want to do that anyway, but there
is no reason to create a dependency between the task of supporting
IMGOPTS and updating a lot of tests.)

2. Some tests use qemu_img_pipe('create', ..), which would mean even
more updating work.


I feel compelled to again say that we had a series that did exactly 
that.  But of course, now that so much time has passed, overhauling it 
would require quite a bit of work.



3. Even if we update all tests to go through qemu_img_create, we'll
need a way to prevent new tests from using qemu_img*('create'), e.g. by
adding assertions. That doesn't seem good.


That almost sounds like you remember my series, because:

https://lists.nongnu.org/archive/html/qemu-block/2019-10/msg00135.html

;)


So, let's add support for IMGOPTS to the most generic helper,
qemu_img_pipe_and_status().

Signed-off-by: Vladimir Sementsov-Ogievskiy
---
  tests/qemu-iotests/iotests.py | 48 ++-
  1 file changed, 47 insertions(+), 1 deletion(-)

diff --git a/tests/qemu-iotests/iotests.py b/tests/qemu-iotests/iotests.py
index 0d99dd841f..80f0cb4f42 100644
--- a/tests/qemu-iotests/iotests.py
+++ b/tests/qemu-iotests/iotests.py
@@ -16,6 +16,7 @@
  # along with this program.  If not, see <http://www.gnu.org/licenses/>.
  #
  
+import argparse

  import atexit
  import bz2
  from collections import OrderedDict
@@ -41,6 +42,19 @@
  from qemu.machine import qtest
  from qemu.qmp import QMPMessage
  
+

+def optstr2dict(opts: str) -> Dict[str, str]:
+if not opts:
+return {}
+
+return {arr[0]: arr[1] for arr in
+(opt.split('=', 1) for opt in opts.strip().split(','))}
+
+
+def dict2optstr(opts: Dict[str, str]) -> str:
+return ','.join(f'{k}={v}' for k, v in opts.items())
+
+
  # Use this logger for logging messages directly from the iotests module
  logger = logging.getLogger('qemu.iotests')
  logger.addHandler(logging.NullHandler())
@@ -57,6 +71,8 @@
  if os.environ.get('QEMU_IMG_OPTIONS'):
  qemu_img_args += os.environ['QEMU_IMG_OPTIONS'].strip().split(' ')
  
+imgopts = optstr2dict(os.environ.get('IMGOPTS', ''))

+
  qemu_io_args = [os.environ.get('QEMU_IO_PROG', 'qemu-io')]
  if os.environ.get('QEMU_IO_OPTIONS'):
  qemu_io_args += os.environ['QEMU_IO_OPTIONS'].strip().split(' ')
@@ -121,11 +137,41 @@ def qemu_tool_pipe_and_status(tool: str, args: Sequence[str],
 {-subp.returncode}: {cmd}\n')
  return (output, subp.returncode)
  
+def qemu_img_create_prepare_args(args: List[str]) -> List[str]:

+if not args or args[0] != 'create':
+return list(args)
+args = args[1:]
+
+p = argparse.ArgumentParser(allow_abbrev=False)
+# -o option may be specified several times
+p.add_argument('-o', action='append', default=[])
+p.add_argument('-f')
+parsed, remaining = p.parse_known_args(args)
+
+opts = optstr2dict(','.join(parsed.o))
+
+compat = 'compat' in opts and opts['compat'][0] == '0'


I suppose `opts['compat'][0] == '0'` is supposed to check for compat=0.10?

If so, then why not just check `opts['compat'] == '0.10'` to be 
clearer?  I don’t think we allow any other compat=0* values, and I have 
no reason to believe we ever will.


Also, I think compat=v2 is valid, too.  (And I think calling this 
variable “v2” would also make more sense than “compat”.)



+for k, v in imgopts.items():
+if k in opts:
+continue
+if k == 'compression_type' and (compat or parsed.f != 'qcow2'):
+continue
+opts[k] = v


Could also be done with something like

imgopts = os.environ.get('IMGOPTS')
opts = optstr2dict(','.join(([imgopts] if imgopts else []) + parsed.o))

if parsed.f != 'qcow2' or (opts.get('compat') in ['v2', '0.10']):
    opts.pop('compression_type', None)

(Never tested, of course)

Because optstr2dict() prioritizes later options over earlier ones. 
(Which is good, because that’s also qemu-img’s behavior.)


*shrug*


+
+result = ['create']
+if parsed.f is not None:
+result += ['-f', parsed.f]


Can this even be None?  I hope none of our tests do this.


+if opts:
+result += ['-o', dict2optstr(opts)]
+result += remaining
+
+return result
+
  def qemu_img_pipe_and_status(*args: str) -> Tuple[str, int]:
  """
  Run qemu-img and return both its output and its exit code
  """
-full_args = qemu_img_args + list(args)
+  

Re: [PATCH 01/14] iotests.py: img_info_log(): rename imgopts argument

2021-07-16 Thread Max Reitz

On 05.07.21 11:15, Vladimir Sementsov-Ogievskiy wrote:

We are going to support the IMGOPTS environment variable like in the
bash tests. The corresponding global variable in iotests.py should be
called imgopts, so, to avoid a clash with the like-named function
argument, rename that argument in advance.

Signed-off-by: Vladimir Sementsov-Ogievskiy
---
  tests/qemu-iotests/210| 8 
  tests/qemu-iotests/iotests.py | 5 +++--
  2 files changed, 7 insertions(+), 6 deletions(-)


Reviewed-by: Max Reitz 

Reminds me how I sent a huge series for having Python tests support 
$IMGOPTS two years ago 
(https://lists.nongnu.org/archive/html/qemu-block/2019-10/msg00071.html). 
I guess the reason it was so big while this series is comparatively small 
is that I mostly concerned myself with `-o data_file` (and so all the 
operations that currently assume that every image is a single file needed 
to be amended).


Max




Re: [PATCH v8 00/16] qemu_iotests: improve debugging options

2021-07-15 Thread Max Reitz

On 05.07.21 08:56, Emanuele Giuseppe Esposito wrote:

This series adds the option to attach gdbserver and valgrind
to the QEMU binary running in qemu_iotests.
It also allows redirecting the output of the QEMU binaries in the Python
tests to stdout instead of a log file.

Patches 1-9 introduce the -gdb option to both Python and bash tests,
10-14 extend the already existing -valgrind flag to work also on
Python tests, and patches 15-16 introduce -p to enable logging to stdout.

In particular, patches 1,6,8,11 focus on extending the QMP socket timers
when using gdb/valgrind; otherwise the Python tests will fail due to
delays in the QMP responses.

Signed-off-by: Emanuele Giuseppe Esposito 
---
v7:
* Adjust documentation and error message when -gdb and -valgrind are set
   at the same time [Eric]
* Add missing Acked-by [John]


All patches I didn’t comment on:

Reviewed-by: Max Reitz 

Which really only leaves the quotes around $GDB_OPTIONS in patch 8.

Max




Re: [PATCH v8 16/16] docs/devel/testing: add -p option to the debug section of QEMU iotests

2021-07-15 Thread Max Reitz

On 05.07.21 08:57, Emanuele Giuseppe Esposito wrote:

Signed-off-by: Emanuele Giuseppe Esposito 
Reviewed-by: Vladimir Sementsov-Ogievskiy 
---
  docs/devel/testing.rst | 4 
  1 file changed, 4 insertions(+)

diff --git a/docs/devel/testing.rst b/docs/devel/testing.rst
index 719accdb1e..e5311cb167 100644
--- a/docs/devel/testing.rst
+++ b/docs/devel/testing.rst
@@ -249,6 +249,10 @@ a failing test:
  * ``-d`` (debug) just increases the logging verbosity, showing
for example the QMP commands and answers.
  
+* ``-p`` (print) redirect QEMU’s stdout and stderr to the test output,


Sorry, my bad: s/redirect/redirects/

With that fixed:

Reviewed-by: Max Reitz 

Max


+  instead of saving it into a log file in
+  ``$TEST_DIR/qemu-machine-``.
+
  Test case groups
  
  





Re: [PATCH v8 09/16] docs/devel/testing: add -gdb option to the debugging section of QEMU iotests

2021-07-15 Thread Max Reitz

On 05.07.21 08:57, Emanuele Giuseppe Esposito wrote:

Signed-off-by: Emanuele Giuseppe Esposito 
Reviewed-by: Vladimir Sementsov-Ogievskiy 
---
  docs/devel/testing.rst | 11 +++
  1 file changed, 11 insertions(+)

diff --git a/docs/devel/testing.rst b/docs/devel/testing.rst
index 9d6a8f8636..8b24e6fb47 100644
--- a/docs/devel/testing.rst
+++ b/docs/devel/testing.rst
@@ -229,6 +229,17 @@ Debugging a test case
  The following options to the ``check`` script can be useful when debugging
  a failing test:
  
+* ``-gdb`` wraps every QEMU invocation in a ``gdbserver``, which waits for a

+  connection from a gdb client.  The options given to ``gdbserver`` (e.g. the
+  address on which to listen for connections) are taken from the 
``$GDB_OPTIONS``
+  environment variable.  By default (if ``$GDB_OPTIONS`` is empty), it listens 
on
+  ``localhost:12345``.
+  It is possible to connect to it for example with
+  ``gdb -iex "target remote $addr"``, where ``$addr`` is the address
+  ``gdbserver`` listens on.
+  If the ``-gdb`` option is not used, ``$GDB_OPTIONS`` is ignored,
+  regardless on whether it is set or not.


s/on/of/

With that: Reviewed-by: Max Reitz 




Re: [PATCH v8 08/16] qemu-iotests: add gdbserver option to script tests too

2021-07-15 Thread Max Reitz

On 05.07.21 08:57, Emanuele Giuseppe Esposito wrote:

Remove the read timeout in the test script when GDB_OPTIONS are set,
so that the bash tests won't time out while running gdb.

The only limitation here is that running a script with gdbserver
will make the test output mismatch the expected results, making the
test fail.

Signed-off-by: Emanuele Giuseppe Esposito 
---
  tests/qemu-iotests/common.qemu | 7 ++-
  tests/qemu-iotests/common.rc   | 8 +++-
  2 files changed, 13 insertions(+), 2 deletions(-)

diff --git a/tests/qemu-iotests/common.qemu b/tests/qemu-iotests/common.qemu
index 0fc52d20d7..cbca757b49 100644
--- a/tests/qemu-iotests/common.qemu
+++ b/tests/qemu-iotests/common.qemu
@@ -85,7 +85,12 @@ _timed_wait_for()
  timeout=yes
  
  QEMU_STATUS[$h]=0

-while IFS= read -t ${QEMU_COMM_TIMEOUT} resp <&${QEMU_OUT[$h]}
+read_timeout="-t ${QEMU_COMM_TIMEOUT}"
+if [ ! -z ${GDB_OPTIONS} ]; then


Shouldn’t we quote "${GDB_OPTIONS}" so that `test` won’t interpret it as 
its own parameters (if something in there starts with `--`, which I 
don’t think is the intended usage for $GDB_OPTIONS, but, well...)?


(Also, `! -z` is the same as `-n`, but I suppose choosing between the 
two can be a matter of style.)
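I.e., both points together (untested):

    read_timeout="-t ${QEMU_COMM_TIMEOUT}"
    if [ -n "${GDB_OPTIONS}" ]; then
        read_timeout=
    fi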



+read_timeout=
+fi
+
+while IFS= read ${read_timeout} resp <&${QEMU_OUT[$h]}
  do
  if [ -n "$capture_events" ]; then
  capture=0
diff --git a/tests/qemu-iotests/common.rc b/tests/qemu-iotests/common.rc
index cbbf6d7c7f..a1ef2b5c2f 100644
--- a/tests/qemu-iotests/common.rc
+++ b/tests/qemu-iotests/common.rc
@@ -166,8 +166,14 @@ _qemu_wrapper()
  if [ -n "${QEMU_NEED_PID}" ]; then
  echo $BASHPID > "${QEMU_TEST_DIR}/qemu-${_QEMU_HANDLE}.pid"
  fi
+
+GDB=""
+if [ ! -z ${GDB_OPTIONS} ]; then


Here, too.  (Sorry for not noticing in v3 already...)

Max


+GDB="gdbserver ${GDB_OPTIONS}"
+fi
+
  VALGRIND_QEMU="${VALGRIND_QEMU_VM}" _qemu_proc_exec 
"${VALGRIND_LOGFILE}" \
-"$QEMU_PROG" $QEMU_OPTIONS "$@"
+$GDB "$QEMU_PROG" $QEMU_OPTIONS "$@"
  )
  RETVAL=$?
  _qemu_proc_valgrind_log "${VALGRIND_LOGFILE}" $RETVAL





Re: [PATCH v5 0/6] blkdebug: fix racing condition when iterating on

2021-07-15 Thread Max Reitz

On 14.06.21 10:29, Emanuele Giuseppe Esposito wrote:

When qemu_coroutine_enter is executed in a loop
(even QLIST_FOREACH_SAFE), the new routine can modify the list,
for example removing an element, causing problems when control
is given back to the caller, which continues iterating on the same list.

Patch 1 solves the issue in blkdebug_debug_resume by restarting
the list walk after every coroutine_enter if the list has to be fully
iterated.
Patches 2,3,4 aim to fix blkdebug_debug_event by gathering
all actions that the rules take in a counter and invoking
the respective coroutine_yield only after processing all requests.

Patches 5-6 are somewhat independent of the others: patch 5 removes the
need for the new_state field, and patch 6 adds a lock to
protect rules and suspended_reqs; right now everything works because
they are protected by the AioContext lock.
This is a preparation for the current proposal of removing the AioContext
lock and instead using smaller-granularity locks to allow multiple
iothreads to execute in the same block device.

Signed-off-by: Emanuele Giuseppe Esposito 
---
v5:
* Add comment in patch 1 to explain why we don't need _SAFE in for loop
* Move the state update (s->state = new_state) in patch 5, to maintain
   the same existing effect in all patches


I’m not sure whether this actually fixes a user-visible bug…?  The first 
paragraph makes it sound like it, but there is no test, so I’m not sure.


I’m mostly asking because of freeze; but you make it sound like there’s 
a bug, and as this only concerns blkdebug (i.e., a block driver used 
only for testing), I feel like applying this series after soft freeze 
should be fine, so:


Thanks, I’ve applied this series to my block branch:

https://github.com/XanClic/qemu/commits/block

Max




Re: [PATCH v5 3/6] blkdebug: track all actions

2021-07-15 Thread Max Reitz

On 14.06.21 10:29, Emanuele Giuseppe Esposito wrote:

Add a counter for each action that a rule can trigger.
This is mainly used to keep track of how many coroutine_yield()
we need to perform after processing all rules in the list.

Co-developed-by: Paolo Bonzini
Signed-off-by: Emanuele Giuseppe Esposito
Reviewed-by: Vladimir Sementsov-Ogievskiy
---
  block/blkdebug.c | 17 -
  1 file changed, 8 insertions(+), 9 deletions(-)

diff --git a/block/blkdebug.c b/block/blkdebug.c
index e8fdf7b056..6bdeb2c7b3 100644
--- a/block/blkdebug.c
+++ b/block/blkdebug.c
@@ -74,6 +74,7 @@ enum {
  ACTION_INJECT_ERROR,
  ACTION_SET_STATE,
  ACTION_SUSPEND,
+ACTION__MAX,
  };
  
  typedef struct BlkdebugRule {

@@ -791,22 +792,22 @@ static void suspend_request(BlockDriverState *bs, BlkdebugRule *rule)
  qemu_coroutine_yield();
  }
  
-static bool process_rule(BlockDriverState *bs, struct BlkdebugRule *rule,

-bool injected)
+static void process_rule(BlockDriverState *bs, struct BlkdebugRule *rule,
+ int *action_count)


I would have liked a comment above this function explaining that 
`action_count` is not merely an int pointer, but actually an 
int[ACTION__MAX] pointer.


But it’s too late to complain about that now. O:)


  {
  BDRVBlkdebugState *s = bs->opaque;
  
  /* Only process rules for the current state */

  if (rule->state && rule->state != s->state) {
-return injected;
+return;
  }
  
  /* Take the action */

+action_count[rule->action]++;
  switch (rule->action) {
  case ACTION_INJECT_ERROR:
-if (!injected) {
+if (action_count[ACTION_INJECT_ERROR] == 1) {
  QSIMPLEQ_INIT(&s->active_rules);


(I don’t quite understand this part – why do we clear the list of active 
rules here?  And why only if a new error is injected?  For example, if I 
have an inject-error rule that should only fire on state 1, and then the 
state changes to state 2, it stays active until a new error is injected, 
which doesn’t make sense to me.  But that has nothing to do with this 
series, of course.  I’m just wondering.)


Max


-injected = true;
  }
  QSIMPLEQ_INSERT_HEAD(&s->active_rules, rule, active_next);
  break;

   






Re: [PATCH v5 2/6] blkdebug: move post-resume handling to resume_req_by_tag

2021-07-15 Thread Max Reitz

On 14.06.21 10:29, Emanuele Giuseppe Esposito wrote:

We want to move qemu_coroutine_yield() after the loop on rules,
because QLIST_FOREACH_SAFE is wrong if the rule list is modified
while the coroutine has yielded.  Therefore move the suspended
request to the heap and clean it up from the remove side.
All that is left is for blkdebug_debug_event to handle the
yielding.

Co-developed-by: Paolo Bonzini
Signed-off-by: Emanuele Giuseppe Esposito
Reviewed-by: Vladimir Sementsov-Ogievskiy
---
  block/blkdebug.c | 31 ++-
  1 file changed, 18 insertions(+), 13 deletions(-)

diff --git a/block/blkdebug.c b/block/blkdebug.c
index 5ccbfcab42..e8fdf7b056 100644
--- a/block/blkdebug.c
+++ b/block/blkdebug.c
@@ -775,25 +775,20 @@ static void blkdebug_close(BlockDriverState *bs)
  static void suspend_request(BlockDriverState *bs, BlkdebugRule *rule)
  {
  BDRVBlkdebugState *s = bs->opaque;
-BlkdebugSuspendedReq r;
+BlkdebugSuspendedReq *r;
  
-r = (BlkdebugSuspendedReq) {

-.co = qemu_coroutine_self(),
-.tag= g_strdup(rule->options.suspend.tag),
-};
+r = g_new(BlkdebugSuspendedReq, 1);
+
+r->co = qemu_coroutine_self();
+r->tag= g_strdup(rule->options.suspend.tag);


Not wrong, but just as a note: I personally would have done the 
initialization like


*r = (BlkdebugSuspendedReq) {
    .co = ...,
    .tag = ...,
};

The advantage is that this sets all fields that aren’t mentioned to zero 
(kind of important, because you don’t use g_new0(), and so now I have to 
manually verify that there are no other fields that would need to be 
initialized (which there aren’t)), and in this special case the diff 
stat also would have been smaller. (But that’s a rare coincidence.)


There are no other fields besides the list entry object (which is fully 
overwritten by QLIST_INSERT_HEAD()), though, so this patch is correct 
and I’m happy with it as-is.


Max




Re: [Virtio-fs] [PATCH v2 7/9] virtiofsd: Add inodes_by_handle hash table

2021-07-13 Thread Max Reitz
So I’m coming back to this after three weeks (well, PTO), and this again 
turns into a bit of a pain, actually.


I don’t think it’s anything serious, but I had thought we had found 
something that would make us both happy because it wouldn’t be too ugly, 
and now it’s turning ugly again...  So I’m sending this mail as a heads 
up before I send v3 in the next days, to explain my thought process.


On 21.06.21 11:02, Max Reitz wrote:

On 18.06.21 20:29, Vivek Goyal wrote:



[...]


I am still reading your code and trying to understand it. But one
question came to mind: What happens if we can generate a file handle
during lookup, but can't generate one when the same file is looked up
again?

- A file foo.txt is looked up. We can create a file handle and we add it
   to lo->inodes_by_handle as well as lo->inodes_by_ids.

- Say somebody deleted the file and created it again, and the inode
   number got reused.

- Now during the ->revalidation path, lookup happens again. This time,
   say we can't generate a file handle. If I am reading the lo_do_find()
   code correctly, it will find the old inode using the IDs and return
   the same inode as the result of the lookup. And we did not recognize
   that the inode number has been reused.


Oh, that’s a good point.  If an lo_inode has no O_PATH fd but is only 
addressed by handle, we must always look it up by handle.


Also, just wanted to throw in this remark:

Now that I read the code again, lo_do_find() already has a condition to 
prevent this.  It’s this:


if (p && fhandle != NULL && p->fhandle != NULL) {
    p = NULL;
}

There’s just one thing wrong with it, and that is the `fhandle != NULL` 
part.  It has no place there.  But this piece of code does exactly what 
we’d need it do if it were just:


if (p && p->fhandle != NULL) {
    p = NULL;
}

[...]

However, you made a good point in that we must require 
name_to_handle_at() to work if it worked before for some inode, not 
because it would be simpler, but because it would be wrong otherwise.


As for the other way around...  Well, now I don’t have a strong 
opinion on it.  Handling temporary name_to_handle_at() failure after 
it worked the first time should not add extra complexity, but it 
wouldn’t be symmetric.  Like, allowing temporary failure sometimes but 
not at other times.


(I think I mistyped here, it should be “Handling name_to_handle_at() 
randomly working after it failed the first time”.)


The next question is, how do we detect temporary failure, because if 
we look up some new inode, name_to_handle_at() fails, we ignore it, 
and then it starts to work and we fail all further lookups, that’s not 
good.  We should have the first lookup fail.  I suppose ENOTSUPP means 
“OK to ignore”, and for everything else we should let lookup fail?  
(And that pretty much answers my "what if name_to_handle_at() works 
the first time, but then fails" question.  If we let anything but 
ENOTSUPP let the lookup fail, then we should do so every time.)


I don’t think this will work as cleanly as I’d hoped.

The problem I’m facing is that get_file_handle() doesn’t only call 
name_to_handle_at(), but also contains a lot of code managing 
mount_fds.  There are a lot of places that can fail, too, and I think we 
should have them fall back to using an O_PATH FD:


Say mount_fds doesn’t contain an FD for the new handle’s mount ID yet, 
so we want to add one.  However, it turns out that the file is not a 
regular file or directory, so we cannot open it as a regular FD and add 
it to mount_fds; or that it is a regular file, but without permission to 
open it O_RDONLY.  So we cannot return a file handle, because it will 
not be usable until a mount FD is added.


I think in such a case we should fall back to an O_PATH FD, because this 
is not some unexpected error, but just an unfortunate (but reproducible 
and valid) circumstance where using `-o inode_file_handles` fails to do 
something that works without it.


Now, however, this means that the next time we try to generate a handle 
for this file (to look it up), it will absolutely work if some other FD 
was added to mount_fds for this mount ID in the meantime.



We could get around this by not trying to open the file for which we are 
to generate a handle to add its FD to mount_fds, but instead doing what 
the open_by_handle_at() man page suggests:


The mount_id argument returns an identifier for the filesystem mount 
that corresponds to pathname. This corresponds to the first field in 
one of the records in /proc/self/mountinfo. Opening the pathname in 
the fifth field of that record yields a file descriptor for the mount 
point; that file descriptor can be used in a subsequent call to 
open_by_handle_at().
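In shell terms, that lookup would be roughly the following (purely an
illustration; $mount_id stands for the ID that name_to_handle_at()
returned, field 1 of a mountinfo record is the mount ID, and field 5 is
the mount point; paths there have spaces escaped as \040, so the field
splitting is safe):

    mount_point=$(awk -v id="$mount_id" \
        '$1 == id { print $5; exit }' /proc/self/mountinfo)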


However, I’d rather avoid parsing mountinfo.  And as far as I 
understand, the only problem here is that we’ll have to cope with the 
fact that sometimes on lookups, we can generate a file handle, but the 
lo_inode we want to find has no file handle attached to it 

Re: [PATCH v2 3/6] block: Clarify that @bytes is no limit on *pnum

2021-07-12 Thread Max Reitz

On 28.06.21 21:10, Eric Blake wrote:

+++ b/include/block/block_int.h
@@ -347,6 +347,11 @@ struct BlockDriver {
    * clamped to bdrv_getlength() and aligned to request_alignment,
    * as well as non-NULL pnum, map, and file; in turn, the driver
    * must return an error or set pnum to an aligned non-zero value.
+ *
+ * Note that @bytes is just a hint on how big of a region the
+ * caller wants to inspect.  It is not a limit on *pnum.
+ * Implementations are free to return larger values of *pnum if
+ * doing so does not incur a performance penalty.

Worth mentioning that the cache will benefit from it?

Oh, right, absolutely.  Like so:

"block/io.c's bdrv_co_block_status() will clamp *pnum before returning it to
its caller, but it itself can still make use of the unclamped *pnum value.
Specifically, the block-status cache for protocol nodes will benefit from
storing as large a region as possible."

How about this tweak to the wording to make it flow a little better:

block/io.c's bdrv_co_block_status() will utilize an unclamped *pnum
value for the block-status cache on protocol nodes, prior to clamping
*pnum for return to its caller.


Sure, thanks!

Max




Re: [PATCH v2 2/6] block: block-status cache for data regions

2021-07-12 Thread Max Reitz

On 06.07.21 19:04, Kevin Wolf wrote:

Am 23.06.2021 um 17:01 hat Max Reitz geschrieben:

As we have attempted before
(https://lists.gnu.org/archive/html/qemu-devel/2019-01/msg06451.html,
"file-posix: Cache lseek result for data regions";
https://lists.nongnu.org/archive/html/qemu-block/2021-02/msg00934.html,
"file-posix: Cache next hole"), this patch seeks to reduce the number of
SEEK_DATA/HOLE operations the file-posix driver has to perform.  The
main difference is that this time it is implemented as part of the
general block layer code.

The problem we face is that on some filesystems or in some
circumstances, SEEK_DATA/HOLE is unreasonably slow.  Given the
implementation is outside of qemu, there is little we can do about its
performance.

We have already introduced the want_zero parameter to
bdrv_co_block_status() to reduce the number of SEEK_DATA/HOLE calls
unless we really want zero information; but sometimes we do want that
information, because for files that consist largely of zero areas,
special-casing those areas can give large performance boosts.  So the
real problem is with files that consist largely of data, so that
inquiring the block status does not gain us much performance, but where
such an inquiry itself takes a lot of time.

To address this, we want to cache data regions.  Most of the time, when
bad performance is reported, it is in places where the image is iterated
over from start to end (qemu-img convert or the mirror job), so a simple
yet effective solution is to cache only the current data region.

(Note that only caching data regions but not zero regions means that
returning false information from the cache is not catastrophic: Treating
zeroes as data is fine.  While we try to invalidate the cache on zero
writes and discards, such incongruences may still occur when there are
other processes writing to the image.)

We only use the cache for nodes without children (i.e. protocol nodes),
because that is where the problem is: Drivers that rely on block-status
implementations outside of qemu (e.g. SEEK_DATA/HOLE).

Resolves: https://gitlab.com/qemu-project/qemu/-/issues/307
Signed-off-by: Max Reitz 

Since you indicated that you'll respin the patch, I'll add my minor
comments:


@@ -2442,9 +2445,58 @@ static int coroutine_fn bdrv_co_block_status(BlockDriverState *bs,
  aligned_bytes = ROUND_UP(offset + bytes, align) - aligned_offset;
  
  if (bs->drv->bdrv_co_block_status) {

-ret = bs->drv->bdrv_co_block_status(bs, want_zero, aligned_offset,
-aligned_bytes, pnum, &local_map,
-&local_file);
+bool from_cache = false;
+
+/*
+ * Use the block-status cache only for protocol nodes: Format
+ * drivers are generally quick to inquire the status, but protocol
+ * drivers often need to get information from outside of qemu, so
+ * we do not have control over the actual implementation.  There
+ * have been cases where inquiring the status took an unreasonably
+ * long time, and we can do nothing in qemu to fix it.
+ * This is especially problematic for images with large data areas,
+ * because finding the few holes in them and giving them special
+ * treatment does not gain much performance.  Therefore, we try to
+ * cache the last-identified data region.
+ *
+ * Second, limiting ourselves to protocol nodes allows us to assume
+ * the block status for data regions to be DATA | OFFSET_VALID, and
+ * that the host offset is the same as the guest offset.
+ *
+ * Note that it is possible that external writers zero parts of
+ * the cached regions without the cache being invalidated, and so
+ * we may report zeroes as data.  This is not catastrophic,
+ * however, because reporting zeroes as data is fine.
+ */
+if (QLIST_EMPTY(&bs->children)) {
+if (bdrv_bsc_is_data(bs, aligned_offset, pnum)) {
+ret = BDRV_BLOCK_DATA | BDRV_BLOCK_OFFSET_VALID;
+local_file = bs;
+local_map = aligned_offset;
+
+from_cache = true;
+}
+}
+
+if (!from_cache) {

Is having a separate variable from_cache really useful? This looks like
it could just be:

 if (QLIST_EMPTY() && bdrv_bsc_is_data()) {
 // The code above
 } else {
 // The code below
 }


Oh, yes.

(I guess this was mainly an artifact from v1 where there was a mutex 
around the bdrv_bsc_is_data() block.  Now it’s better to just roll both 
conditions into one, yes.)



+ret = bs->drv->bdrv_co_block_status(bs, want_zero, aligned_offset,
+aligned_bytes, pnum, &local_map,
+&local_file);
+
+/*
+ * Note that 

[PATCH v2 5/6] iotests/308: Test +w on read-only FUSE exports

2021-06-25 Thread Max Reitz
Test that +w on read-only FUSE exports returns an EROFS error.  u+x, on
the other hand, should work.  (There is no special reason to choose u+x
here; it simply is, like +w, another flag that is not set by default.)

Signed-off-by: Max Reitz 
---
 tests/qemu-iotests/308 | 11 +++
 tests/qemu-iotests/308.out |  4 
 2 files changed, 15 insertions(+)

diff --git a/tests/qemu-iotests/308 b/tests/qemu-iotests/308
index d13a9a969c..6b386bd523 100755
--- a/tests/qemu-iotests/308
+++ b/tests/qemu-iotests/308
@@ -170,6 +170,17 @@ fuse_export_add 'export-mp' "'mountpoint': '$EXT_MP'"
 # Check that the export presents the same data as the original image
 $QEMU_IMG compare -f raw -F $IMGFMT -U "$EXT_MP" "$TEST_IMG"
 
+# Some quick chmod tests
+stat -c 'Permissions pre-chmod: %a' "$EXT_MP"
+
+# Verify that we cannot set +w
+chmod u+w "$EXT_MP" 2>&1 | _filter_testdir | _filter_imgfmt
+stat -c 'Permissions post-+w: %a' "$EXT_MP"
+
+# But that we can set, say, +x (if we are so inclined)
+chmod u+x "$EXT_MP" 2>&1 | _filter_testdir | _filter_imgfmt
+stat -c 'Permissions post-+x: %a' "$EXT_MP"
+
 echo
 echo '=== Mount over existing file ==='
 
diff --git a/tests/qemu-iotests/308.out b/tests/qemu-iotests/308.out
index 0e9420645f..fc47bb11a2 100644
--- a/tests/qemu-iotests/308.out
+++ b/tests/qemu-iotests/308.out
@@ -50,6 +50,10 @@ wrote 67108864/67108864 bytes at offset 0
   } }
 {"return": {}}
 Images are identical.
+Permissions pre-chmod: 400
+chmod: changing permissions of 'TEST_DIR/t.IMGFMT.fuse': Read-only file system
+Permissions post-+w: 400
+Permissions post-+x: 500
 
 === Mount over existing file ===
 {'execute': 'block-export-add',
-- 
2.31.1




[PATCH v2 4/6] export/fuse: Let permissions be adjustable

2021-06-25 Thread Max Reitz
Allow changing the file mode, UID, and GID through SETATTR.

Without allow_other, UID and GID are not allowed to be changed, because
it would not make sense.  Also, changing group or others' permissions
is not allowed either.

For read-only exports, +w cannot be set.

Signed-off-by: Max Reitz 
---
 block/export/fuse.c | 73 ++---
 1 file changed, 62 insertions(+), 11 deletions(-)

diff --git a/block/export/fuse.c b/block/export/fuse.c
index 26ad644cd7..ada9e263eb 100644
--- a/block/export/fuse.c
+++ b/block/export/fuse.c
@@ -48,6 +48,10 @@ typedef struct FuseExport {
 bool growable;
 /* Whether allow_other was used as a mount option or not */
 bool allow_other;
+
+mode_t st_mode;
+uid_t st_uid;
+gid_t st_gid;
 } FuseExport;
 
 static GHashTable *exports;
@@ -125,6 +129,13 @@ static int fuse_export_create(BlockExport *blk_exp,
 args->allow_other = FUSE_EXPORT_ALLOW_OTHER_AUTO;
 }
 
+exp->st_mode = S_IFREG | S_IRUSR;
+if (exp->writable) {
+exp->st_mode |= S_IWUSR;
+}
+exp->st_uid = getuid();
+exp->st_gid = getgid();
+
 if (args->allow_other == FUSE_EXPORT_ALLOW_OTHER_AUTO) {
 /* Ignore errors on our first attempt */
 ret = setup_fuse_export(exp, args->mountpoint, true, NULL);
@@ -338,7 +349,6 @@ static void fuse_getattr(fuse_req_t req, fuse_ino_t inode,
 int64_t length, allocated_blocks;
 time_t now = time(NULL);
 FuseExport *exp = fuse_req_userdata(req);
-mode_t mode;
 
 length = blk_getlength(exp->common.blk);
 if (length < 0) {
@@ -353,17 +363,12 @@ static void fuse_getattr(fuse_req_t req, fuse_ino_t inode,
 allocated_blocks = DIV_ROUND_UP(allocated_blocks, 512);
 }
 
-mode = S_IFREG | S_IRUSR;
-if (exp->writable) {
-mode |= S_IWUSR;
-}
-
 statbuf = (struct stat) {
 .st_ino = inode,
-.st_mode= mode,
+.st_mode= exp->st_mode,
 .st_nlink   = 1,
-.st_uid = getuid(),
-.st_gid = getgid(),
+.st_uid = exp->st_uid,
+.st_gid = exp->st_gid,
 .st_size= length,
 .st_blksize = blk_bs(exp->common.blk)->bl.request_alignment,
 .st_blocks  = allocated_blocks,
@@ -409,19 +414,52 @@ static int fuse_do_truncate(const FuseExport *exp, int64_t size,
 }
 
 /**
- * Let clients set file attributes.  Only resizing is supported.
+ * Let clients set file attributes.  Only resizing and changing
+ * permissions (st_mode, st_uid, st_gid) is allowed.
+ * Changing permissions is only allowed as far as it will actually
+ * permit access: Read-only exports cannot be given +w, and exports
+ * without allow_other cannot be given a different UID or GID, and
+ * they cannot be given non-owner access.
  */
 static void fuse_setattr(fuse_req_t req, fuse_ino_t inode, struct stat *statbuf,
  int to_set, struct fuse_file_info *fi)
 {
 FuseExport *exp = fuse_req_userdata(req);
+int supported_attrs;
 int ret;
 
-if (to_set & ~FUSE_SET_ATTR_SIZE) {
+supported_attrs = FUSE_SET_ATTR_SIZE | FUSE_SET_ATTR_MODE;
+if (exp->allow_other) {
+supported_attrs |= FUSE_SET_ATTR_UID | FUSE_SET_ATTR_GID;
+}
+
+if (to_set & ~supported_attrs) {
 fuse_reply_err(req, ENOTSUP);
 return;
 }
 
+/* Do some argument checks first before committing to anything */
+if (to_set & FUSE_SET_ATTR_MODE) {
+/*
+ * Without allow_other, non-owners can never access the export, so do
+ * not allow setting permissions for them
+ */
+if (!exp->allow_other &&
+(statbuf->st_mode & (S_IRWXG | S_IRWXO)) != 0)
+{
+fuse_reply_err(req, EPERM);
+return;
+}
+
+/* +w for read-only exports makes no sense, disallow it */
+if (!exp->writable &&
+(statbuf->st_mode & (S_IWUSR | S_IWGRP | S_IWOTH)) != 0)
+{
+fuse_reply_err(req, EROFS);
+return;
+}
+}
+
 if (to_set & FUSE_SET_ATTR_SIZE) {
 if (!exp->writable) {
 fuse_reply_err(req, EACCES);
@@ -435,6 +473,19 @@ static void fuse_setattr(fuse_req_t req, fuse_ino_t inode, struct stat *statbuf,
 }
 }
 
+if (to_set & FUSE_SET_ATTR_MODE) {
+/* Ignore FUSE-supplied file type, only change the mode */
+exp->st_mode = (statbuf->st_mode & 07777) | S_IFREG;
+}
+
+if (to_set & FUSE_SET_ATTR_UID) {
+exp->st_uid = statbuf->st_uid;
+}
+
+if (to_set & FUSE_SET_ATTR_GID) {
+exp->st_gid = statbuf->st_gid;
+}
+
 fuse_getattr(req, inode, fi);
 }
 
-- 
2.31.1




[PATCH v2 6/6] iotests/fuse-allow-other: Test allow-other

2021-06-25 Thread Max Reitz
Signed-off-by: Max Reitz 
---
 tests/qemu-iotests/tests/fuse-allow-other | 175 ++
 tests/qemu-iotests/tests/fuse-allow-other.out |  88 +
 2 files changed, 263 insertions(+)
 create mode 100755 tests/qemu-iotests/tests/fuse-allow-other
 create mode 100644 tests/qemu-iotests/tests/fuse-allow-other.out

diff --git a/tests/qemu-iotests/tests/fuse-allow-other b/tests/qemu-iotests/tests/fuse-allow-other
new file mode 100755
index 00..a513dbce66
--- /dev/null
+++ b/tests/qemu-iotests/tests/fuse-allow-other
@@ -0,0 +1,175 @@
+#!/usr/bin/env bash
+# group: rw
+#
+# Test FUSE exports' allow-other option
+#
+# Copyright (C) 2021 Red Hat, Inc.
+#
+# This program is free software; you can redistribute it and/or modify
+# it under the terms of the GNU General Public License as published by
+# the Free Software Foundation; either version 2 of the License, or
+# (at your option) any later version.
+#
+# This program is distributed in the hope that it will be useful,
+# but WITHOUT ANY WARRANTY; without even the implied warranty of
+# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+# GNU General Public License for more details.
+#
+# You should have received a copy of the GNU General Public License
+# along with this program.  If not, see <http://www.gnu.org/licenses/>.
+#
+
+seq=$(basename "$0")
+echo "QA output created by $seq"
+
+status=1   # failure is the default!
+
+_cleanup()
+{
+_cleanup_qemu
+_cleanup_test_img
+rm -f "$EXT_MP"
+}
+trap "_cleanup; exit \$status" 0 1 2 3 15
+
+# get standard environment, filters and checks
+. ../common.rc
+. ../common.filter
+. ../common.qemu
+
+_supported_fmt generic
+
+_supported_proto file # We create the FUSE export manually
+
+sudo -n -u nobody true || \
+_notrun 'Password-less sudo as nobody required to test allow_other'
+
+# $1: Export ID
+# $2: Options (beyond the node-name and ID)
+# $3: Expected return value (defaults to 'return')
+# $4: Node to export (defaults to 'node-format')
+fuse_export_add()
+{
+allow_other_not_supported='option allow_other only allowed if'
+
+output=$(
+success_or_failure=yes _send_qemu_cmd $QEMU_HANDLE \
+"{'execute': 'block-export-add',
+  'arguments': {
+  'type': 'fuse',
+  'id': '$1',
+  'node-name': '${4:-node-format}',
+  $2
+  } }" \
+"${3:-return}" \
+"$allow_other_not_supported" \
+| _filter_imgfmt
+)
+
+if echo "$output" | grep -q "$allow_other_not_supported"; then
+# Shut down qemu gracefully so it can unmount the export
+_send_qemu_cmd $QEMU_HANDLE \
+"{'execute': 'quit'}" \
+'return'
+
+wait=yes _cleanup_qemu
+
+_notrun "allow_other not supported"
+fi
+
+echo "$output"
+}
+
+EXT_MP="$TEST_DIR/fuse-export"
+
+_make_test_img 64k
+touch "$EXT_MP"
+
+echo
+echo '=== Test permissions ==='
+
+# Test that you can only change permissions on the export with allow-other=true.
+# We cannot really test the primary reason behind allow-other (i.e. to allow
+# users other than the current one access to the export), because for that we
+# would need sudo, which realistically nobody will allow this test to use.
+# What we can do is test that allow-other=true also enables default_permissions,
+# i.e. whether we can still read from the file if we remove the read permission.
+
+# $1: allow-other value ('true' or 'false')
+run_permission_test()
+{
+_launch_qemu \
+-blockdev \
+"$IMGFMT,node-name=node-format,file.driver=file,file.filename=$TEST_IMG"
+
+_send_qemu_cmd $QEMU_HANDLE \
+"{'execute': 'qmp_capabilities'}" \
+'return'
+
+fuse_export_add 'export' \
+"'mountpoint': '$EXT_MP',
+ 'allow-other': '$1'"
+
+# Should always work
+echo '(Removing all permissions)'
+chmod 000 "$EXT_MP" 2>&1 | _filter_testdir | _filter_imgfmt
+stat -c 'Permissions post-chmod: %a' "$EXT_MP"
+
+# Should always work
+echo '(Granting u+r)'
+chmod u+r "$EXT_MP" 2>&1 | _filter_testdir | _filter_imgfmt
+stat -c 'Permissions post-chmod: %a' "$EXT_MP"
+
+# Should only work with allow-other: Otherwise, no permissions can be
+# granted to the group or others
+echo '(Granting read permissions for everyone)'
+chmod 444 "$EXT_MP" 2>&1 | _filter_testdir | _filter_imgfmt
+stat -c 'Permissions post-chmod: %a' "$EXT_MP"
+
+echo 'Doing operations as nobody:'
+# Change to TEST_DIR, so nobody will not have to attempt a lookup
+pushd "$TEST_DIR" >/dev/null
+
+# This is already prevented by the permissions (with

[PATCH v2 3/6] export/fuse: Give SET_ATTR_SIZE its own branch

2021-06-25 Thread Max Reitz
In order to support changing other attributes than the file size in
fuse_setattr(), we have to give each its own independent branch.  This
also applies to the only attribute we do support right now.

Signed-off-by: Max Reitz 
Reviewed-by: Kevin Wolf 
---
 block/export/fuse.c | 20 +++-
 1 file changed, 11 insertions(+), 9 deletions(-)

diff --git a/block/export/fuse.c b/block/export/fuse.c
index 4068250241..26ad644cd7 100644
--- a/block/export/fuse.c
+++ b/block/export/fuse.c
@@ -417,20 +417,22 @@ static void fuse_setattr(fuse_req_t req, fuse_ino_t inode, struct stat *statbuf,
 FuseExport *exp = fuse_req_userdata(req);
 int ret;
 
-if (!exp->writable) {
-fuse_reply_err(req, EACCES);
-return;
-}
-
 if (to_set & ~FUSE_SET_ATTR_SIZE) {
 fuse_reply_err(req, ENOTSUP);
 return;
 }
 
-ret = fuse_do_truncate(exp, statbuf->st_size, true, PREALLOC_MODE_OFF);
-if (ret < 0) {
-fuse_reply_err(req, -ret);
-return;
+if (to_set & FUSE_SET_ATTR_SIZE) {
+if (!exp->writable) {
+fuse_reply_err(req, EACCES);
+return;
+}
+
+ret = fuse_do_truncate(exp, statbuf->st_size, true, PREALLOC_MODE_OFF);
+if (ret < 0) {
+fuse_reply_err(req, -ret);
+return;
+}
 }
 
 fuse_getattr(req, inode, fi);
-- 
2.31.1




[PATCH v2 0/6] export/fuse: Allow other users access to the export

2021-06-25 Thread Max Reitz
Hi,

The v1 cover letter is here:
https://lists.nongnu.org/archive/html/qemu-block/2021-06/msg00730.html

In v2, I changed the following:
- default_permissions is now passed always.  This is the right thing to
  do regardless of whether allow_other is active or not.

- allow_other is no longer a bool, but an off/on/auto enum.  `auto` is
  the default, in which case we will try to mount the export with
  allow_other first, and then fall back to mounting it without.

- Changing the file mode is now possible even without allow_other
  (because default_permissions is always active now), but only for the
  user/owner.  Giving the group or others any permissions only makes
  sense with allow_other, the same applies to changing the UID or GID.
  Giving a read-only export +w makes no sense and hence yields an EROFS
  error now.

- I decided just testing some default_permission quirks is boring.  So
  the new fuse-allow-other iotest does rely on `sudo -n -u nobody`
  working now, and actually tests what allow_other is supposed to do.
  (Also, it is skipped if allow_other does not work.)
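
For reference, creating an export with the new option could then look like
this in QMP (export ID, node name, and mountpoint are made-up examples):

    { "execute": "block-export-add",
      "arguments": { "type": "fuse",
                     "id": "exp0",
                     "node-name": "node-format",
                     "mountpoint": "/tmp/fuse-export",
                     "allow-other": "auto" } }

With "auto", qemu first attempts the mount with allow_other and falls back
to mounting without it if that fails (see patch 2).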


git-backport-diff against v1:

Key:
[----] : patches are identical
[####] : number of functional differences between upstream/downstream patch
[down] : patch is downstream-only
The flags [FC] indicate (F)unctional and (C)ontextual differences, respectively

001/6:[down] 'export/fuse: Pass default_permissions for mount'
002/6:[0089] [FC] 'export/fuse: Add allow-other option'
003/6:[----] [--] 'export/fuse: Give SET_ATTR_SIZE its own branch'
004/6:[0039] [FC] 'export/fuse: Let permissions be adjustable'
005/6:[down] 'iotests/308: Test +w on read-only FUSE exports'
006/6:[down] 'iotests/fuse-allow-other: Test allow-other'


Max Reitz (6):
  export/fuse: Pass default_permissions for mount
  export/fuse: Add allow-other option
  export/fuse: Give SET_ATTR_SIZE its own branch
  export/fuse: Let permissions be adjustable
  iotests/308: Test +w on read-only FUSE exports
  iotests/fuse-allow-other: Test allow-other

 qapi/block-export.json|  33 +++-
 block/export/fuse.c   | 121 +---
 tests/qemu-iotests/308|  20 +-
 tests/qemu-iotests/308.out|   6 +-
 tests/qemu-iotests/common.rc  |   6 +-
 tests/qemu-iotests/tests/fuse-allow-other | 175 ++
 tests/qemu-iotests/tests/fuse-allow-other.out |  88 +
 7 files changed, 421 insertions(+), 28 deletions(-)
 create mode 100755 tests/qemu-iotests/tests/fuse-allow-other
 create mode 100644 tests/qemu-iotests/tests/fuse-allow-other.out

-- 
2.31.1




[PATCH v2 2/6] export/fuse: Add allow-other option

2021-06-25 Thread Max Reitz
Without the allow_other mount option, no user (not even root) but the
one who started qemu/the storage daemon can access the export.  Allow
users to configure the export such that such accesses are possible.

While allow_other is probably what users want, we cannot make it an
unconditional default, because passing it is only possible (for non-root
users) if the global fuse.conf configuration file allows it.  Thus, the
default is an 'auto' mode, in which we first try with allow_other, and
then fall back to without.

FuseExport.allow_other reports whether allow_other was actually used as
a mount option or not.  Currently, this information is not used, but a
future patch will let this field decide whether e.g. an export's UID and
GID can be changed through chmod.

One notable thing about 'auto' mode is that libfuse may print error
messages directly to stderr, and so may fusermount (which it executes).
Our export code cannot really filter or hide them.  Therefore, if 'auto'
fails its first attempt and has to fall back, fusermount will print an
error message that mounting with allow_other failed.

This behavior necessitates a change to iotest 308, namely we need to
filter out this error message (because if the first attempt at mounting
with allow_other succeeds, there will be no such message).

Furthermore, common.rc's _make_test_img should use allow-other=off for
FUSE exports, because iotests generally do not need to access images
from other users, so allow-other=on or allow-other=auto have no
advantage.  OTOH, allow-other=on will not work on systems where
user_allow_other is disabled, and with allow-other=auto, we get said
error message that we would need to filter out again.  Just disabling
allow-other is simplest.
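
Roughly, the auto-mode fallback in fuse_export_create() can be sketched
like this (an illustrative sketch only, not the literal patch, using the
setup_fuse_export() signature introduced below):

    if (args->allow_other == FUSE_EXPORT_ALLOW_OTHER_AUTO) {
        /* Try allow_other first; fusermount may print an error to
         * stderr if this first attempt fails */
        ret = setup_fuse_export(exp, args->mountpoint, true, NULL);
        if (ret < 0) {
            /* Fall back to mounting without allow_other */
            ret = setup_fuse_export(exp, args->mountpoint, false, errp);
        }
    } else {
        ret = setup_fuse_export(exp, args->mountpoint,
                                args->allow_other == FUSE_EXPORT_ALLOW_OTHER_ON,
                                errp);
    }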

Signed-off-by: Max Reitz 
---
 qapi/block-export.json   | 33 -
 block/export/fuse.c  | 28 +++-
 tests/qemu-iotests/308   |  6 +-
 tests/qemu-iotests/common.rc |  6 +-
 4 files changed, 65 insertions(+), 8 deletions(-)

diff --git a/qapi/block-export.json b/qapi/block-export.json
index e819e70cac..0ed63442a8 100644
--- a/qapi/block-export.json
+++ b/qapi/block-export.json
@@ -120,6 +120,23 @@
'*logical-block-size': 'size',
 '*num-queues': 'uint16'} }
 
+##
+# @FuseExportAllowOther:
+#
+# Possible allow_other modes for FUSE exports.
+#
+# @off: Do not pass allow_other as a mount option.
+#
+# @on: Pass allow_other as a mount option.
+#
+# @auto: Try mounting with allow_other first, and if that fails, retry
+#without allow_other.
+#
+# Since: 6.1
+##
+{ 'enum': 'FuseExportAllowOther',
+  'data': ['off', 'on', 'auto'] }
+
 ##
 # @BlockExportOptionsFuse:
 #
@@ -132,11 +149,25 @@
 # @growable: Whether writes beyond the EOF should grow the block node
 #accordingly. (default: false)
 #
+# @allow-other: If this is off, only qemu's user is allowed access to
+#   this export.  That cannot be changed even with chmod or
+#   chown.
+#   Enabling this option will allow other users access to
+#   the export with the FUSE mount option "allow_other".
+#   Note that using allow_other as a non-root user requires
+#   user_allow_other to be enabled in the global fuse.conf
+#   configuration file.
+#   In auto mode (the default), the FUSE export driver will
+#   first attempt to mount the export with allow_other, and
+#   if that fails, try again without.
+#   (since 6.1; default: auto)
+#
 # Since: 6.0
 ##
 { 'struct': 'BlockExportOptionsFuse',
  'data': { 'mountpoint': 'str',
-'*growable': 'bool' },
+'*growable': 'bool',
+'*allow-other': 'FuseExportAllowOther' },
  'if': 'defined(CONFIG_FUSE)' }
 
 ##
diff --git a/block/export/fuse.c b/block/export/fuse.c
index d0b88e8f80..4068250241 100644
--- a/block/export/fuse.c
+++ b/block/export/fuse.c
@@ -46,6 +46,8 @@ typedef struct FuseExport {
 char *mountpoint;
 bool writable;
 bool growable;
+/* Whether allow_other was used as a mount option or not */
+bool allow_other;
 } FuseExport;
 
 static GHashTable *exports;
@@ -57,7 +59,7 @@ static void fuse_export_delete(BlockExport *exp);
 static void init_exports_table(void);
 
 static int setup_fuse_export(FuseExport *exp, const char *mountpoint,
- Error **errp);
+ bool allow_other, Error **errp);
 static void read_from_fuse_export(void *opaque);
 
 static bool is_regular_file(const char *path, Error **errp);
@@ -118,7 +120,22 @@ static int fuse_export_create(BlockExport *blk_exp,
 exp->writable = blk_exp_args->writable;
 exp->growable = args->growable;
 
-ret = setup_fuse_export(exp, args->mountpoint, errp);
+/* set default */
+if (!args->has_allow_other) {
+args->allow_other = FUSE_EXPORT_ALLOW_OTH

[PATCH v2 1/6] export/fuse: Pass default_permissions for mount

2021-06-25 Thread Max Reitz
We do not do any permission checks in fuse_open(), so let the kernel do
them.  We already let fuse_getattr() report the proper UNIX permissions,
so this should work the way we want.

This causes a change in 308's reference output, because now opening a
non-writable export with O_RDWR fails already, instead of only actually
attempting to write to it.  (That is an improvement.)
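
Note that the kernel can only enforce the mode bits that fuse_getattr()
reports; as a rough sketch (this is pre-existing behavior, not changed
here), that mode is essentially

    statbuf->st_mode = S_IFREG | S_IRUSR | (exp->writable ? S_IWUSR : 0);

so with default_permissions, an O_RDWR open of a read-only export is now
rejected by the kernel with EACCES before fuse_open() is even reached.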

Signed-off-by: Max Reitz 
---
 block/export/fuse.c| 8 ++--
 tests/qemu-iotests/308 | 3 ++-
 tests/qemu-iotests/308.out | 2 +-
 3 files changed, 9 insertions(+), 4 deletions(-)

diff --git a/block/export/fuse.c b/block/export/fuse.c
index 38f74c94da..d0b88e8f80 100644
--- a/block/export/fuse.c
+++ b/block/export/fuse.c
@@ -153,8 +153,12 @@ static int setup_fuse_export(FuseExport *exp, const char *mountpoint,
 struct fuse_args fuse_args;
 int ret;
 
-/* Needs to match what fuse_init() sets.  Only max_read must be supplied. */
-mount_opts = g_strdup_printf("max_read=%zu", FUSE_MAX_BOUNCE_BYTES);
+/*
+ * max_read needs to match what fuse_init() sets.
+ * max_write need not be supplied.
+ */
+mount_opts = g_strdup_printf("max_read=%zu,default_permissions",
+ FUSE_MAX_BOUNCE_BYTES);
 
 fuse_argv[0] = ""; /* Dummy program name */
 fuse_argv[1] = "-o";
diff --git a/tests/qemu-iotests/308 b/tests/qemu-iotests/308
index f122065d0f..11c28a75f2 100755
--- a/tests/qemu-iotests/308
+++ b/tests/qemu-iotests/308
@@ -215,7 +215,8 @@ echo '=== Writable export ==='
 fuse_export_add 'export-mp' "'mountpoint': '$EXT_MP', 'writable': true"
 
 # Check that writing to the read-only export fails
-$QEMU_IO -f raw -c 'write -P 42 1M 64k' "$TEST_IMG" | _filter_qemu_io
+$QEMU_IO -f raw -c 'write -P 42 1M 64k' "$TEST_IMG" 2>&1 \
+| _filter_qemu_io | _filter_testdir | _filter_imgfmt
 
 # But here it should work
 $QEMU_IO -f raw -c 'write -P 42 1M 64k' "$EXT_MP" | _filter_qemu_io
diff --git a/tests/qemu-iotests/308.out b/tests/qemu-iotests/308.out
index 466e7e0267..0e9420645f 100644
--- a/tests/qemu-iotests/308.out
+++ b/tests/qemu-iotests/308.out
@@ -91,7 +91,7 @@ virtual size: 0 B (0 bytes)
  'mountpoint': 'TEST_DIR/t.IMGFMT.fuse', 'writable': true
  } }
 {"return": {}}
-write failed: Permission denied
+qemu-io: can't open device TEST_DIR/t.IMGFMT: Could not open 'TEST_DIR/t.IMGFMT': Permission denied
 wrote 65536/65536 bytes at offset 1048576
 64 KiB, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
 wrote 65536/65536 bytes at offset 1048576
-- 
2.31.1



