Re: [PULL v2 00/13] Block layer patches

2021-05-19 Thread Peter Maydell
On Tue, 18 May 2021 at 12:32, Kevin Wolf  wrote:
>
> The following changes since commit 367196caa07ac31443bc360145cc10fbef4fdf92:
>
>   Merge remote-tracking branch 'remotes/vivier2/tags/trivial-branch-for-6.1-pull-request' into staging (2021-05-17 16:44:47 +0100)
>
> are available in the Git repository at:
>
>   git://repo.or.cz/qemu/kevin.git tags/for-upstream
>
> for you to fetch changes up to c90bd505a3e8210c23d69fecab9ee6f56ec4a161:
>
>   vhost-user-blk: Check that num-queues is supported by backend (2021-05-18 12:57:39 +0200)
>
> 
> Block layer patches
>
> - vhost-user-blk: Fix error handling during initialisation
> - Add test cases for the vhost-user-blk export
> - Fix leaked Transaction objects
> - qcow2: Expose dirty bit in 'qemu-img info'
>


Applied, thanks.

Please update the changelog at https://wiki.qemu.org/ChangeLog/6.1
for any user-visible changes.

-- PMM



Re: [PATCH] block/ssh: Bump minimum libssh version to 0.8.7

2021-05-19 Thread Philippe Mathieu-Daudé
On 5/19/21 5:58 PM, Thomas Huth wrote:
> It has been over two years since RHEL-8 was released, and thus per the
> platform build policy, we no longer need to support RHEL-7 as a build
> target. So from the RHEL-7 perspective, we do not have to support
> libssh v0.7 anymore now.
> 
> Let's look at the versions from other distributions and operating
> systems - according to repology.org, current shipping versions are:
> 
>             RHEL-8: 0.9.4
>      Debian Buster: 0.8.7
> openSUSE Leap 15.2: 0.8.7
>   Ubuntu LTS 18.04: 0.8.0 *
>   Ubuntu LTS 20.04: 0.9.3
>            FreeBSD: 0.9.5
>          Fedora 33: 0.9.5
>          Fedora 34: 0.9.5
>            OpenBSD: 0.9.5
>     macOS HomeBrew: 0.9.5
>         HaikuPorts: 0.9.5
> 
> * The version of libssh in Ubuntu 18.04 claims to be 0.8.0 from the
> name of the package, but in reality it is a 0.7 patched up as a
> Frankenstein monster with patches from the 0.8 development branch.
> This gave us some headaches in the past already and so it never worked
> with QEMU. All attempts to get it supported have failed in the past,
> patches for QEMU have never been merged and a request to Ubuntu to
> fix it in their 18.04 distro has been ignored:
> 
>  https://bugs.launchpad.net/ubuntu/+source/libssh/+bug/1847514
> 
> Thus we really should ignore the libssh in Ubuntu 18.04 in QEMU, too.
> 
> Fix it by bumping the minimum libssh version to something that is
> greater than 0.8.0 now. Debian Buster and openSUSE Leap have the
> oldest version and so 0.8.7 is the new minimum.
> 
> Signed-off-by: Thomas Huth 
> ---
>  block/ssh.c | 59 -
>  configure   | 19 +
>  2 files changed, 1 insertion(+), 77 deletions(-)
> 
> diff --git a/block/ssh.c b/block/ssh.c
> index ebe3d8b631..b51a031620 100644
> --- a/block/ssh.c
> +++ b/block/ssh.c
> @@ -277,7 +277,6 @@ static void ssh_parse_filename(const char *filename, QDict *options,
>  static int check_host_key_knownhosts(BDRVSSHState *s, Error **errp)
>  {
>  int ret;
> -#ifdef HAVE_LIBSSH_0_8
>  enum ssh_known_hosts_e state;
>  int r;
>  ssh_key pubkey;
> @@ -343,46 +342,6 @@ static int check_host_key_knownhosts(BDRVSSHState *s, Error **errp)
>  error_setg(errp, "error while checking for known server (%d)", state);
>  goto out;
>  }
> -#else /* !HAVE_LIBSSH_0_8 */
> -int state;
> -
> -state = ssh_is_server_known(s->session);
> -trace_ssh_server_status(state);
> -
> -switch (state) {
> -case SSH_SERVER_KNOWN_OK:
> -/* OK */
> -trace_ssh_check_host_key_knownhosts();
> -break;
> -case SSH_SERVER_KNOWN_CHANGED:
> -ret = -EINVAL;
> -error_setg(errp,
> -   "host key does not match the one in known_hosts; this "
> -   "may be a possible attack");
> -goto out;
> -case SSH_SERVER_FOUND_OTHER:
> -ret = -EINVAL;
> -error_setg(errp,
> -   "host key for this server not found, another type 
> exists");
> -goto out;
> -case SSH_SERVER_FILE_NOT_FOUND:
> -ret = -ENOENT;
> -error_setg(errp, "known_hosts file not found");
> -goto out;
> -case SSH_SERVER_NOT_KNOWN:
> -ret = -EINVAL;
> -error_setg(errp, "no host key was found in known_hosts");
> -goto out;
> -case SSH_SERVER_ERROR:
> -ret = -EINVAL;
> -error_setg(errp, "server error");
> -goto out;
> -default:
> -ret = -EINVAL;
> -error_setg(errp, "error while checking for known server (%d)", 
> state);
> -goto out;
> -}
> -#endif /* !HAVE_LIBSSH_0_8 */
>  
>  /* known_hosts checking successful. */
>  ret = 0;
> @@ -438,11 +397,7 @@ check_host_key_hash(BDRVSSHState *s, const char *hash,
>  unsigned char *server_hash;
>  size_t server_hash_len;
>  
> -#ifdef HAVE_LIBSSH_0_8
>  r = ssh_get_server_publickey(s->session, );
> -#else
> -r = ssh_get_publickey(s->session, );
> -#endif
>  if (r != SSH_OK) {
>  session_error_setg(errp, s, "failed to read remote host key");
>  return -EINVAL;
> @@ -1233,8 +1188,6 @@ static void unsafe_flush_warning(BDRVSSHState *s, const char *what)
>  }
>  }
>  
> -#ifdef HAVE_LIBSSH_0_8
> -
>  static coroutine_fn int ssh_flush(BDRVSSHState *s, BlockDriverState *bs)
>  {
>  int r;
> @@ -1271,18 +1224,6 @@ static coroutine_fn int ssh_co_flush(BlockDriverState *bs)
>  return ret;
>  }
>  
> -#else /* !HAVE_LIBSSH_0_8 */
> -
> -static coroutine_fn int ssh_co_flush(BlockDriverState *bs)
> -{
> -BDRVSSHState *s = bs->opaque;
> -
> -unsafe_flush_warning(s, "libssh >= 0.8.0");
> -return 0;
> -}
> -
> -#endif /* !HAVE_LIBSSH_0_8 */
> -
>  static int64_t ssh_getlength(BlockDriverState *bs)
>  {
>  BDRVSSHState *s = bs->opaque;
> diff --git a/configure b/configure
> index 879a8e8f17..bf1c740494 100755
> --- a/configure
> +++ b/configure

Re: [RFC PATCH 2/2] qemu-img convert: Fix sparseness detection

2021-05-19 Thread Kevin Wolf
On 19.05.2021 at 15:24, Peter Lieven wrote:
> On 20.04.21 at 18:52, Vladimir Sementsov-Ogievskiy wrote:
> > 20.04.2021 18:04, Kevin Wolf wrote:
> >> On 20.04.2021 at 16:31, Vladimir Sementsov-Ogievskiy wrote:
> >>> 15.04.2021 18:22, Kevin Wolf wrote:
>  In order to avoid RMW cycles, is_allocated_sectors() treats zeroed areas
>  like non-zero data if the end of the checked area isn't aligned. This
>  can improve the efficiency of the conversion and was introduced in
>  commit 8dcd3c9b91a.
> 
>  However, it comes with a correctness problem: qemu-img convert is
>  supposed to sparsify areas that contain only zeros, which it doesn't do
>  any more. It turns out that this even happens when not only the
>  unaligned area is zeroed, but also the blocks before and after it. In
>  the bug report, conversion of a fragmented 10G image containing only
>  zeros resulted in an image consuming 2.82 GiB even though the expected
>  size is only 4 KiB.
> 
>  As a tradeoff between both, let's ignore zeroed sectors only after
>  non-zero data to fix the alignment, but if we're only looking at zeros,
>  keep them as such, even if it may mean additional RMW cycles.
> 
> >>>
> >>> Hmm.. If I understand correctly, we are going to do unaligned
> >>> write-zero. And that helps.
> >>
> >> This can happen (mostly raw images on block devices, I think?), but
> >> usually it just means skipping the write because we know that the target
> >> image is already zeroed.
> >>
> >> What it does mean is that if the next part is data, we'll have an
> >> unaligned data write.
> >>
> >>> Doesn't that mean that alignment is wrongly detected?
> >>
> >> The problem is that you can have bdrv_block_status_above() return the
> >> same allocation status multiple times in a row, but *pnum can be
> >> unaligned for the conversion.
> >>
> >> We only look at a single range returned by it when detecting the
> >> alignment, so it could be that we have zero buffers for both 0-11 and
> >> 12-16 and detect two misaligned ranges, when both together are a
> >> perfectly aligned zeroed range.
> >>
> >> In theory we could try to do some lookahead and merge ranges where
> >> possible, which should give us the perfect result, but it would make the
> >> code considerably more complicated. (Whether we want to merge them
> >> doesn't only depend on the block status, but possibly also on the
> >> content of a DATA range.)
> >>
> >> Kevin
> >>
> >
> > Oh, I understand now the problem, thanks for explanation.
> >
> > Hmm, yes that means, that if the whole buf is zero, is_allocated_sectors 
> > must not align it down, to be possibly "merged" with next chunk if it is 
> > zero too.
> >
> > But it's still good to align zeroes down, if data starts somewhere inside 
> > the buf, isn't it?
> >
> > what about something like this:
> >
> > diff --git a/qemu-img.c b/qemu-img.c
> > index babb5573ab..d1704584a0 100644
> > --- a/qemu-img.c
> > +++ b/qemu-img.c
> > @@ -1167,19 +1167,39 @@ static int is_allocated_sectors(const uint8_t *buf, int n, int *pnum,
> >  }
> >  }
> >  
> > +    if (i == n) {
> > +    /*
> > + * The whole buf is the same.
> > + *
> > + * if it's data, just return it. It's the old behavior.
> > + *
> > + * if it's zero, just return too. It will work well if target is already
> > + * zeroed. And if next chunk is zero too we'll have no RMW and no reason
> > + * to write data.
> > + */
> > +    *pnum = i;
> > +    return !is_zero;
> > +    }
> > +
> >  tail = (sector_num + i) & (alignment - 1);
> >  if (tail) {
> >  if (is_zero && i <= tail) {
> > -    /* treat unallocated areas which only consist
> > - * of a small tail as allocated. */
> > +    /*
> > + * For sure next sector after i is data, and it will rewrite this
> > + * tail anyway due to RMW. So, let's just write data now.
> > + */
> >  is_zero = false;
> >  }
> >  if (!is_zero) {
> > -    /* align up end offset of allocated areas. */
> > +    /* If possible, align up end offset of allocated areas. */
> >  i += alignment - tail;
> >  i = MIN(i, n);
> >  } else {
> > -    /* align down end offset of zero areas. */
> > +    /*
> > + * For sure next sector after i is data, and it will rewrite this
> > + * tail anyway due to RMW. Better to avoid RMW and write zeroes up
> > + * to aligned bound.
> > + */
> >  i -= tail;
> >  }
> >  }
> 
> I think we forgot to follow up on this. Has anyone tested this
> suggestion?
> 
> Otherwise, I would try to rerun the tests I did with my old and
> Kevin's suggestions.

I noticed earlier this week that these patches are still in my

Re: [PATCH] block/ssh: Bump minimum libssh version to 0.8.7

2021-05-19 Thread Daniel P . Berrangé
On Wed, May 19, 2021 at 05:58:59PM +0200, Thomas Huth wrote:
> It has been over two years since RHEL-8 was released, and thus per the
> platform build policy, we no longer need to support RHEL-7 as a build
> target. So from the RHEL-7 perspective, we do not have to support
> libssh v0.7 anymore now.
> 
> Let's look at the versions from other distributions and operating
> systems - according to repology.org, current shipping versions are:
> 
>             RHEL-8: 0.9.4
>      Debian Buster: 0.8.7
> openSUSE Leap 15.2: 0.8.7
>   Ubuntu LTS 18.04: 0.8.0 *
>   Ubuntu LTS 20.04: 0.9.3
>            FreeBSD: 0.9.5
>          Fedora 33: 0.9.5
>          Fedora 34: 0.9.5
>            OpenBSD: 0.9.5
>     macOS HomeBrew: 0.9.5
>         HaikuPorts: 0.9.5
> 
> * The version of libssh in Ubuntu 18.04 claims to be 0.8.0 from the
> name of the package, but in reality it is a 0.7 patched up as a
> Frankenstein monster with patches from the 0.8 development branch.
> This gave us some headaches in the past already and so it never worked
> with QEMU. All attempts to get it supported have failed in the past,
> patches for QEMU have never been merged and a request to Ubuntu to
> fix it in their 18.04 distro has been ignored:
> 
>  https://bugs.launchpad.net/ubuntu/+source/libssh/+bug/1847514
> 
> Thus we really should ignore the libssh in Ubuntu 18.04 in QEMU, too.

Agreed, if they're going to ship such a monster, at the very least
they should be responsive to fixing the fallout it creates. Given
the lack of action I agree with ditching libssh support on Ubuntu
18.04, despite it otherwise being a supported distro target.

> Fix it by bumping the minimum libssh version to something that is
> greater than 0.8.0 now. Debian Buster and openSUSE Leap have the
> oldest version and so 0.8.7 is the new minimum.
> 
> Signed-off-by: Thomas Huth 
> ---
>  block/ssh.c | 59 -
>  configure   | 19 +
>  2 files changed, 1 insertion(+), 77 deletions(-)

Reviewed-by: Daniel P. Berrangé 


Regards,
Daniel
-- 
|: https://berrange.com -o- https://www.flickr.com/photos/dberrange :|
|: https://libvirt.org -o- https://fstop138.berrange.com :|
|: https://entangle-photo.org -o- https://www.instagram.com/dberrange :|




Re: [PATCH] block/ssh: Bump minimum libssh version to 0.8.7

2021-05-19 Thread Richard W.M. Jones
On Wed, May 19, 2021 at 05:58:59PM +0200, Thomas Huth wrote:
> It has been over two years since RHEL-8 was released, and thus per the
> platform build policy, we no longer need to support RHEL-7 as a build
> target. So from the RHEL-7 perspective, we do not have to support
> libssh v0.7 anymore now.

Not an objection, just an FYI: RHEL 7 has libssh-0.7.1-7.el7.x86_64

nbdkit-ssh-plugin settled on only supporting libssh >= 0.8.0, mainly
because we require knownhosts support which seems a fairly fundamental
requirement for security.

> Let's look at the versions from other distributions and operating
> systems - according to repology.org, current shipping versions are:
> 
>             RHEL-8: 0.9.4
>      Debian Buster: 0.8.7
> openSUSE Leap 15.2: 0.8.7
>   Ubuntu LTS 18.04: 0.8.0 *
>   Ubuntu LTS 20.04: 0.9.3
>            FreeBSD: 0.9.5
>          Fedora 33: 0.9.5
>          Fedora 34: 0.9.5
>            OpenBSD: 0.9.5
>     macOS HomeBrew: 0.9.5
>         HaikuPorts: 0.9.5
> 
> * The version of libssh in Ubuntu 18.04 claims to be 0.8.0 from the
> name of the package, but in reality it is a 0.7 patched up as a
> Frankenstein monster with patches from the 0.8 development branch.
> This gave us some headaches in the past already and so it never worked
> with QEMU. All attempts to get it supported have failed in the past,
> patches for QEMU have never been merged and a request to Ubuntu to
> fix it in their 18.04 distro has been ignored:
> 
>  https://bugs.launchpad.net/ubuntu/+source/libssh/+bug/1847514
> 
> Thus we really should ignore the libssh in Ubuntu 18.04 in QEMU, too.
> 
> Fix it by bumping the minimum libssh version to something that is
> greater than 0.8.0 now. Debian Buster and openSUSE Leap have the
> oldest version and so 0.8.7 is the new minimum.
> 
> Signed-off-by: Thomas Huth 
> ---
>  block/ssh.c | 59 -
>  configure   | 19 +
>  2 files changed, 1 insertion(+), 77 deletions(-)
> 
> diff --git a/block/ssh.c b/block/ssh.c
> index ebe3d8b631..b51a031620 100644
> --- a/block/ssh.c
> +++ b/block/ssh.c
> @@ -277,7 +277,6 @@ static void ssh_parse_filename(const char *filename, QDict *options,
>  static int check_host_key_knownhosts(BDRVSSHState *s, Error **errp)
>  {
>  int ret;
> -#ifdef HAVE_LIBSSH_0_8
>  enum ssh_known_hosts_e state;
>  int r;
>  ssh_key pubkey;
> @@ -343,46 +342,6 @@ static int check_host_key_knownhosts(BDRVSSHState *s, Error **errp)
>  error_setg(errp, "error while checking for known server (%d)", state);
>  goto out;
>  }
> -#else /* !HAVE_LIBSSH_0_8 */
> -int state;
> -
> -state = ssh_is_server_known(s->session);
> -trace_ssh_server_status(state);
> -
> -switch (state) {
> -case SSH_SERVER_KNOWN_OK:
> -/* OK */
> -trace_ssh_check_host_key_knownhosts();
> -break;
> -case SSH_SERVER_KNOWN_CHANGED:
> -ret = -EINVAL;
> -error_setg(errp,
> -   "host key does not match the one in known_hosts; this "
> -   "may be a possible attack");
> -goto out;
> -case SSH_SERVER_FOUND_OTHER:
> -ret = -EINVAL;
> -error_setg(errp,
> -   "host key for this server not found, another type 
> exists");
> -goto out;
> -case SSH_SERVER_FILE_NOT_FOUND:
> -ret = -ENOENT;
> -error_setg(errp, "known_hosts file not found");
> -goto out;
> -case SSH_SERVER_NOT_KNOWN:
> -ret = -EINVAL;
> -error_setg(errp, "no host key was found in known_hosts");
> -goto out;
> -case SSH_SERVER_ERROR:
> -ret = -EINVAL;
> -error_setg(errp, "server error");
> -goto out;
> -default:
> -ret = -EINVAL;
> -error_setg(errp, "error while checking for known server (%d)", 
> state);
> -goto out;
> -}
> -#endif /* !HAVE_LIBSSH_0_8 */
>  
>  /* known_hosts checking successful. */
>  ret = 0;
> @@ -438,11 +397,7 @@ check_host_key_hash(BDRVSSHState *s, const char *hash,
>  unsigned char *server_hash;
>  size_t server_hash_len;
>  
> -#ifdef HAVE_LIBSSH_0_8
>  r = ssh_get_server_publickey(s->session, );
> -#else
> -r = ssh_get_publickey(s->session, );
> -#endif
>  if (r != SSH_OK) {
>  session_error_setg(errp, s, "failed to read remote host key");
>  return -EINVAL;
> @@ -1233,8 +1188,6 @@ static void unsafe_flush_warning(BDRVSSHState *s, const char *what)
>  }
>  }
>  
> -#ifdef HAVE_LIBSSH_0_8
> -
>  static coroutine_fn int ssh_flush(BDRVSSHState *s, BlockDriverState *bs)
>  {
>  int r;
> @@ -1271,18 +1224,6 @@ static coroutine_fn int ssh_co_flush(BlockDriverState *bs)
>  return ret;
>  }
>  
> -#else /* !HAVE_LIBSSH_0_8 */
> -
> -static coroutine_fn int ssh_co_flush(BlockDriverState *bs)
> -{
> -BDRVSSHState *s = bs->opaque;
> -
> -unsafe_flush_warning(s, "libssh >= 0.8.0");
> -return 

[PATCH] block/ssh: Bump minimum libssh version to 0.8.7

2021-05-19 Thread Thomas Huth
It has been over two years since RHEL-8 was released, and thus per the
platform build policy, we no longer need to support RHEL-7 as a build
target. So from the RHEL-7 perspective, we do not have to support
libssh v0.7 anymore now.

Let's look at the versions from other distributions and operating
systems - according to repology.org, current shipping versions are:

            RHEL-8: 0.9.4
     Debian Buster: 0.8.7
openSUSE Leap 15.2: 0.8.7
  Ubuntu LTS 18.04: 0.8.0 *
  Ubuntu LTS 20.04: 0.9.3
           FreeBSD: 0.9.5
         Fedora 33: 0.9.5
         Fedora 34: 0.9.5
           OpenBSD: 0.9.5
    macOS HomeBrew: 0.9.5
        HaikuPorts: 0.9.5

* The version of libssh in Ubuntu 18.04 claims to be 0.8.0 from the
name of the package, but in reality it is a 0.7 patched up as a
Frankenstein monster with patches from the 0.8 development branch.
This gave us some headaches in the past already and so it never worked
with QEMU. All attempts to get it supported have failed in the past,
patches for QEMU have never been merged and a request to Ubuntu to
fix it in their 18.04 distro has been ignored:

 https://bugs.launchpad.net/ubuntu/+source/libssh/+bug/1847514

Thus we really should ignore the libssh in Ubuntu 18.04 in QEMU, too.

Fix it by bumping the minimum libssh version to something that is
greater than 0.8.0 now. Debian Buster and openSUSE Leap have the
oldest version and so 0.8.7 is the new minimum.

Signed-off-by: Thomas Huth 
---
 block/ssh.c | 59 -
 configure   | 19 +
 2 files changed, 1 insertion(+), 77 deletions(-)

diff --git a/block/ssh.c b/block/ssh.c
index ebe3d8b631..b51a031620 100644
--- a/block/ssh.c
+++ b/block/ssh.c
@@ -277,7 +277,6 @@ static void ssh_parse_filename(const char *filename, QDict *options,
 static int check_host_key_knownhosts(BDRVSSHState *s, Error **errp)
 {
 int ret;
-#ifdef HAVE_LIBSSH_0_8
 enum ssh_known_hosts_e state;
 int r;
 ssh_key pubkey;
@@ -343,46 +342,6 @@ static int check_host_key_knownhosts(BDRVSSHState *s, Error **errp)
 error_setg(errp, "error while checking for known server (%d)", state);
 goto out;
 }
-#else /* !HAVE_LIBSSH_0_8 */
-int state;
-
-state = ssh_is_server_known(s->session);
-trace_ssh_server_status(state);
-
-switch (state) {
-case SSH_SERVER_KNOWN_OK:
-/* OK */
-trace_ssh_check_host_key_knownhosts();
-break;
-case SSH_SERVER_KNOWN_CHANGED:
-ret = -EINVAL;
-error_setg(errp,
-   "host key does not match the one in known_hosts; this "
-   "may be a possible attack");
-goto out;
-case SSH_SERVER_FOUND_OTHER:
-ret = -EINVAL;
-error_setg(errp,
-   "host key for this server not found, another type exists");
-goto out;
-case SSH_SERVER_FILE_NOT_FOUND:
-ret = -ENOENT;
-error_setg(errp, "known_hosts file not found");
-goto out;
-case SSH_SERVER_NOT_KNOWN:
-ret = -EINVAL;
-error_setg(errp, "no host key was found in known_hosts");
-goto out;
-case SSH_SERVER_ERROR:
-ret = -EINVAL;
-error_setg(errp, "server error");
-goto out;
-default:
-ret = -EINVAL;
-error_setg(errp, "error while checking for known server (%d)", state);
-goto out;
-}
-#endif /* !HAVE_LIBSSH_0_8 */
 
 /* known_hosts checking successful. */
 ret = 0;
@@ -438,11 +397,7 @@ check_host_key_hash(BDRVSSHState *s, const char *hash,
 unsigned char *server_hash;
 size_t server_hash_len;
 
-#ifdef HAVE_LIBSSH_0_8
 r = ssh_get_server_publickey(s->session, );
-#else
-r = ssh_get_publickey(s->session, );
-#endif
 if (r != SSH_OK) {
 session_error_setg(errp, s, "failed to read remote host key");
 return -EINVAL;
@@ -1233,8 +1188,6 @@ static void unsafe_flush_warning(BDRVSSHState *s, const char *what)
 }
 }
 
-#ifdef HAVE_LIBSSH_0_8
-
 static coroutine_fn int ssh_flush(BDRVSSHState *s, BlockDriverState *bs)
 {
 int r;
@@ -1271,18 +1224,6 @@ static coroutine_fn int ssh_co_flush(BlockDriverState *bs)
 return ret;
 }
 
-#else /* !HAVE_LIBSSH_0_8 */
-
-static coroutine_fn int ssh_co_flush(BlockDriverState *bs)
-{
-BDRVSSHState *s = bs->opaque;
-
-unsafe_flush_warning(s, "libssh >= 0.8.0");
-return 0;
-}
-
-#endif /* !HAVE_LIBSSH_0_8 */
-
 static int64_t ssh_getlength(BlockDriverState *bs)
 {
 BDRVSSHState *s = bs->opaque;
diff --git a/configure b/configure
index 879a8e8f17..bf1c740494 100755
--- a/configure
+++ b/configure
@@ -3512,7 +3512,7 @@ fi
 ##
 # libssh probe
 if test "$libssh" != "no" ; then
-  if $pkg_config --exists libssh; then
+  if $pkg_config --exists "libssh >= 0.8.7"; then
 libssh_cflags=$($pkg_config libssh --cflags)
 libssh_libs=$($pkg_config libssh --libs)
 libssh=yes
@@ -3524,23 +3524,6 @@ 
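
For reference, the libssh >= 0.8 host-key check that this patch makes
unconditional is built around ssh_session_is_known_server(). A minimal
standalone sketch of that API (simplified error handling, not the QEMU code):

    #include <libssh/libssh.h>
    #include <stdio.h>

    /* Sketch: validate the server against known_hosts using the
     * enum-based API introduced in libssh 0.8. */
    static int check_known_host(ssh_session session)
    {
        switch (ssh_session_is_known_server(session)) {
        case SSH_KNOWN_HOSTS_OK:
            return 0;                          /* key matches known_hosts */
        case SSH_KNOWN_HOSTS_CHANGED:
            fprintf(stderr, "host key changed; possible attack\n");
            return -1;
        case SSH_KNOWN_HOSTS_OTHER:
            fprintf(stderr, "host key type mismatch\n");
            return -1;
        case SSH_KNOWN_HOSTS_NOT_FOUND:        /* no known_hosts file */
        case SSH_KNOWN_HOSTS_UNKNOWN:          /* server not listed */
            fprintf(stderr, "server is unknown\n");
            return -1;
        case SSH_KNOWN_HOSTS_ERROR:
        default:
            fprintf(stderr, "%s\n", ssh_get_error(session));
            return -1;
        }
    }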

[PATCH V3 2/6] block/rbd: store object_size in BDRVRBDState

2021-05-19 Thread Peter Lieven
Signed-off-by: Peter Lieven 
---
 block/rbd.c | 18 +++---
 1 file changed, 7 insertions(+), 11 deletions(-)

diff --git a/block/rbd.c b/block/rbd.c
index 6b1cbe1d75..b4caea4f1b 100644
--- a/block/rbd.c
+++ b/block/rbd.c
@@ -90,6 +90,7 @@ typedef struct BDRVRBDState {
 char *snap;
 char *namespace;
 uint64_t image_size;
+uint64_t object_size;
 } BDRVRBDState;
 
 static int qemu_rbd_connect(rados_t *cluster, rados_ioctx_t *io_ctx,
@@ -675,6 +676,7 @@ static int qemu_rbd_open(BlockDriverState *bs, QDict *options, int flags,
 const QDictEntry *e;
 Error *local_err = NULL;
 char *keypairs, *secretid;
+rbd_image_info_t info;
 int r;
 
 keypairs = g_strdup(qdict_get_try_str(options, "=keyvalue-pairs"));
@@ -739,13 +741,15 @@ static int qemu_rbd_open(BlockDriverState *bs, QDict *options, int flags,
 goto failed_open;
 }
 
-r = rbd_get_size(s->image, >image_size);
+r = rbd_stat(s->image, , sizeof(info));
 if (r < 0) {
-error_setg_errno(errp, -r, "error getting image size from %s",
+error_setg_errno(errp, -r, "error getting image info from %s",
  s->image_name);
 rbd_close(s->image);
 goto failed_open;
 }
+s->image_size = info.size;
+s->object_size = info.obj_size;
 
 /* If we are using an rbd snapshot, we must be r/o, otherwise
  * leave as-is */
@@ -957,15 +961,7 @@ static BlockAIOCB *qemu_rbd_aio_flush(BlockDriverState *bs,
 static int qemu_rbd_getinfo(BlockDriverState *bs, BlockDriverInfo *bdi)
 {
 BDRVRBDState *s = bs->opaque;
-rbd_image_info_t info;
-int r;
-
-r = rbd_stat(s->image, , sizeof(info));
-if (r < 0) {
-return r;
-}
-
-bdi->cluster_size = info.obj_size;
+bdi->cluster_size = s->object_size;
 return 0;
 }
 
-- 
2.17.1
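
A note on the design: an RBD image's object size (2^order bytes) is fixed when
the image is created, so caching it at open time is safe. For reference, the
rbd_stat() call used above fills an rbd_image_info_t; a minimal sketch of the
two fields consumed here (simplified error handling):

    rbd_image_info_t info;
    int r = rbd_stat(image, &info, sizeof(info));
    if (r < 0) {
        return r;                    /* negative errno from librbd */
    }
    s->image_size  = info.size;      /* image size in bytes */
    s->object_size = info.obj_size;  /* bytes per RADOS object */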





[PATCH V3 5/6] block/rbd: add write zeroes support

2021-05-19 Thread Peter Lieven
Signed-off-by: Peter Lieven 
---
 block/rbd.c | 37 -
 1 file changed, 36 insertions(+), 1 deletion(-)

diff --git a/block/rbd.c b/block/rbd.c
index 0d8612a988..ee13f08a74 100644
--- a/block/rbd.c
+++ b/block/rbd.c
@@ -63,7 +63,8 @@ typedef enum {
 RBD_AIO_READ,
 RBD_AIO_WRITE,
 RBD_AIO_DISCARD,
-RBD_AIO_FLUSH
+RBD_AIO_FLUSH,
+RBD_AIO_WRITE_ZEROES
 } RBDAIOCmd;
 
 typedef struct BDRVRBDState {
@@ -705,6 +706,10 @@ static int qemu_rbd_open(BlockDriverState *bs, QDict *options, int flags,
 }
 }
 
+#ifdef LIBRBD_SUPPORTS_WRITE_ZEROES
+bs->supported_zero_flags = BDRV_REQ_MAY_UNMAP;
+#endif
+
 /* When extending regular files, we get zeros from the OS */
 bs->supported_truncate_flags = BDRV_REQ_ZERO_WRITE;
 
@@ -818,6 +823,18 @@ static int coroutine_fn qemu_rbd_start_co(BlockDriverState *bs,
 case RBD_AIO_FLUSH:
 r = rbd_aio_flush(s->image, c);
 break;
+#ifdef LIBRBD_SUPPORTS_WRITE_ZEROES
+case RBD_AIO_WRITE_ZEROES: {
+int zero_flags = 0;
+#ifdef RBD_WRITE_ZEROES_FLAG_THICK_PROVISION
+if (!(flags & BDRV_REQ_MAY_UNMAP)) {
+zero_flags = RBD_WRITE_ZEROES_FLAG_THICK_PROVISION;
+}
+#endif
+r = rbd_aio_write_zeroes(s->image, offset, bytes, c, zero_flags, 0);
+break;
+}
+#endif
 default:
 r = -EINVAL;
 }
@@ -888,6 +905,21 @@ static int coroutine_fn qemu_rbd_co_pdiscard(BlockDriverState *bs,
 return qemu_rbd_start_co(bs, offset, count, NULL, 0, RBD_AIO_DISCARD);
 }
 
+#ifdef LIBRBD_SUPPORTS_WRITE_ZEROES
+static int
+coroutine_fn qemu_rbd_co_pwrite_zeroes(BlockDriverState *bs, int64_t offset,
+  int count, BdrvRequestFlags flags)
+{
+#ifndef RBD_WRITE_ZEROES_FLAG_THICK_PROVISION
+if (!(flags & BDRV_REQ_MAY_UNMAP)) {
+return -ENOTSUP;
+}
+#endif
+return qemu_rbd_start_co(bs, offset, count, NULL, flags,
+ RBD_AIO_WRITE_ZEROES);
+}
+#endif
+
 static int qemu_rbd_getinfo(BlockDriverState *bs, BlockDriverInfo *bdi)
 {
 BDRVRBDState *s = bs->opaque;
@@ -1113,6 +1145,9 @@ static BlockDriver bdrv_rbd = {
 .bdrv_co_pwritev= qemu_rbd_co_pwritev,
 .bdrv_co_flush_to_disk  = qemu_rbd_co_flush,
 .bdrv_co_pdiscard   = qemu_rbd_co_pdiscard,
+#ifdef LIBRBD_SUPPORTS_WRITE_ZEROES
+.bdrv_co_pwrite_zeroes  = qemu_rbd_co_pwrite_zeroes,
+#endif
 
 .bdrv_snapshot_create   = qemu_rbd_snap_create,
 .bdrv_snapshot_delete   = qemu_rbd_snap_remove,
-- 
2.17.1
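
The flag handling above can be summarized as follows (a sketch of the mapping,
assuming the librbd semantics where RBD_WRITE_ZEROES_FLAG_THICK_PROVISION
keeps the range allocated):

    /* QEMU request flag             librbd zero_flags
     * BDRV_REQ_MAY_UNMAP set    ->  0 (librbd may deallocate the range)
     * BDRV_REQ_MAY_UNMAP clear  ->  RBD_WRITE_ZEROES_FLAG_THICK_PROVISION
     *
     * If librbd lacks RBD_WRITE_ZEROES_FLAG_THICK_PROVISION, requests that
     * must keep the range allocated return -ENOTSUP and the generic block
     * layer falls back to writing explicit zero buffers.
     */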





[PATCH V3 4/6] block/rbd: migrate from aio to coroutines

2021-05-19 Thread Peter Lieven
Signed-off-by: Peter Lieven 
---
 block/rbd.c | 255 ++--
 1 file changed, 87 insertions(+), 168 deletions(-)

diff --git a/block/rbd.c b/block/rbd.c
index 97a2ae4c84..0d8612a988 100644
--- a/block/rbd.c
+++ b/block/rbd.c
@@ -66,22 +66,6 @@ typedef enum {
 RBD_AIO_FLUSH
 } RBDAIOCmd;
 
-typedef struct RBDAIOCB {
-BlockAIOCB common;
-int64_t ret;
-QEMUIOVector *qiov;
-RBDAIOCmd cmd;
-int error;
-struct BDRVRBDState *s;
-} RBDAIOCB;
-
-typedef struct RADOSCB {
-RBDAIOCB *acb;
-struct BDRVRBDState *s;
-int64_t size;
-int64_t ret;
-} RADOSCB;
-
 typedef struct BDRVRBDState {
 rados_t cluster;
 rados_ioctx_t io_ctx;
@@ -93,6 +77,13 @@ typedef struct BDRVRBDState {
 uint64_t object_size;
 } BDRVRBDState;
 
+typedef struct RBDTask {
+BlockDriverState *bs;
+Coroutine *co;
+bool complete;
+int64_t ret;
+} RBDTask;
+
 static int qemu_rbd_connect(rados_t *cluster, rados_ioctx_t *io_ctx,
 BlockdevOptionsRbd *opts, bool cache,
 const char *keypairs, const char *secretid,
@@ -325,13 +316,6 @@ static int qemu_rbd_set_keypairs(rados_t cluster, const char *keypairs_json,
 return ret;
 }
 
-static void qemu_rbd_memset(RADOSCB *rcb, int64_t offs)
-{
-RBDAIOCB *acb = rcb->acb;
-iov_memset(acb->qiov->iov, acb->qiov->niov, offs, 0,
-   acb->qiov->size - offs);
-}
-
 /* FIXME Deprecate and remove keypairs or make it available in QMP. */
 static int qemu_rbd_do_create(BlockdevCreateOptions *options,
   const char *keypairs, const char *password_secret,
@@ -450,46 +434,6 @@ exit:
 return ret;
 }
 
-/*
- * This aio completion is being called from rbd_finish_bh() and runs in qemu
- * BH context.
- */
-static void qemu_rbd_complete_aio(RADOSCB *rcb)
-{
-RBDAIOCB *acb = rcb->acb;
-int64_t r;
-
-r = rcb->ret;
-
-if (acb->cmd != RBD_AIO_READ) {
-if (r < 0) {
-acb->ret = r;
-acb->error = 1;
-} else if (!acb->error) {
-acb->ret = rcb->size;
-}
-} else {
-if (r < 0) {
-qemu_rbd_memset(rcb, 0);
-acb->ret = r;
-acb->error = 1;
-} else if (r < rcb->size) {
-qemu_rbd_memset(rcb, r);
-if (!acb->error) {
-acb->ret = rcb->size;
-}
-} else if (!acb->error) {
-acb->ret = r;
-}
-}
-
-g_free(rcb);
-
-acb->common.cb(acb->common.opaque, (acb->ret > 0 ? 0 : acb->ret));
-
-qemu_aio_unref(acb);
-}
-
 static char *qemu_rbd_mon_host(BlockdevOptionsRbd *opts, Error **errp)
 {
 const char **vals;
@@ -826,89 +770,50 @@ static int qemu_rbd_resize(BlockDriverState *bs, uint64_t size)
 return 0;
 }
 
-static const AIOCBInfo rbd_aiocb_info = {
-.aiocb_size = sizeof(RBDAIOCB),
-};
-
-static void rbd_finish_bh(void *opaque)
+static void qemu_rbd_finish_bh(void *opaque)
 {
-RADOSCB *rcb = opaque;
-qemu_rbd_complete_aio(rcb);
+RBDTask *task = opaque;
+task->complete = 1;
+aio_co_wake(task->co);
 }
 
-/*
- * This is the callback function for rbd_aio_read and _write
- *
- * Note: this function is being called from a non qemu thread so
- * we need to be careful about what we do here. Generally we only
- * schedule a BH, and do the rest of the io completion handling
- * from rbd_finish_bh() which runs in a qemu context.
- */
-static void rbd_finish_aiocb(rbd_completion_t c, RADOSCB *rcb)
+static void qemu_rbd_completion_cb(rbd_completion_t c, RBDTask *task)
 {
-RBDAIOCB *acb = rcb->acb;
-
-rcb->ret = rbd_aio_get_return_value(c);
+task->ret = rbd_aio_get_return_value(c);
 rbd_aio_release(c);
-
-replay_bh_schedule_oneshot_event(bdrv_get_aio_context(acb->common.bs),
- rbd_finish_bh, rcb);
+aio_bh_schedule_oneshot(bdrv_get_aio_context(task->bs),
+qemu_rbd_finish_bh, task);
 }
 
-static BlockAIOCB *rbd_start_aio(BlockDriverState *bs,
- int64_t off,
- QEMUIOVector *qiov,
- int64_t size,
- BlockCompletionFunc *cb,
- void *opaque,
- RBDAIOCmd cmd)
+static int coroutine_fn qemu_rbd_start_co(BlockDriverState *bs,
+  uint64_t offset,
+  uint64_t bytes,
+  QEMUIOVector *qiov,
+  int flags,
+  RBDAIOCmd cmd)
 {
-RBDAIOCB *acb;
-RADOSCB *rcb = NULL;
+BDRVRBDState *s = bs->opaque;
+RBDTask task = { .bs = bs, .co = qemu_coroutine_self() };
 rbd_completion_t c;
 int r;
 
-BDRVRBDState *s = bs->opaque;
+   
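
The general shape of the conversion is the classic aio-to-coroutine pattern:
the librbd completion callback runs outside QEMU's threads, so it only
schedules a BH in the right AioContext, and the BH wakes the coroutine that
issued the request. A minimal sketch of the pattern using QEMU's coroutine
primitives (illustrative names, not the exact patch code):

    typedef struct Task {
        Coroutine *co;          /* coroutine waiting for the request */
        bool complete;
        int64_t ret;
    } Task;

    /* Runs as a BH in the node's AioContext, i.e. in a QEMU thread. */
    static void finish_bh(void *opaque)
    {
        Task *task = opaque;
        task->complete = true;
        aio_co_wake(task->co);
    }

    static int coroutine_fn do_request(BlockDriverState *bs)
    {
        Task task = { .co = qemu_coroutine_self() };

        /* ... issue the async request; its completion callback calls
         * aio_bh_schedule_oneshot(bdrv_get_aio_context(bs),
         *                         finish_bh, &task); ... */

        while (!task.complete) {
            qemu_coroutine_yield();
        }
        return task.ret;
    }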

[PATCH V3 1/6] block/rbd: bump librbd requirement to luminous release

2021-05-19 Thread Peter Lieven
Even luminous (version 12.2) has been unmaintained for over 3 years now.
Bump the requirement to get rid of the ifdef'ry in the code.
QEMU 6.1 dropped support for RHEL-7, which was the last supported
OS that required an older librbd.

Signed-off-by: Peter Lieven 
---
 block/rbd.c | 120 
 meson.build |   7 ++-
 2 files changed, 13 insertions(+), 114 deletions(-)

diff --git a/block/rbd.c b/block/rbd.c
index 26f64cce7c..6b1cbe1d75 100644
--- a/block/rbd.c
+++ b/block/rbd.c
@@ -55,24 +55,10 @@
  * leading "\".
  */
 
-/* rbd_aio_discard added in 0.1.2 */
-#if LIBRBD_VERSION_CODE >= LIBRBD_VERSION(0, 1, 2)
-#define LIBRBD_SUPPORTS_DISCARD
-#else
-#undef LIBRBD_SUPPORTS_DISCARD
-#endif
-
 #define OBJ_MAX_SIZE (1UL << OBJ_DEFAULT_OBJ_ORDER)
 
 #define RBD_MAX_SNAPS 100
 
-/* The LIBRBD_SUPPORTS_IOVEC is defined in librbd.h */
-#ifdef LIBRBD_SUPPORTS_IOVEC
-#define LIBRBD_USE_IOVEC 1
-#else
-#define LIBRBD_USE_IOVEC 0
-#endif
-
 typedef enum {
 RBD_AIO_READ,
 RBD_AIO_WRITE,
@@ -84,7 +70,6 @@ typedef struct RBDAIOCB {
 BlockAIOCB common;
 int64_t ret;
 QEMUIOVector *qiov;
-char *bounce;
 RBDAIOCmd cmd;
 int error;
 struct BDRVRBDState *s;
@@ -94,7 +79,6 @@ typedef struct RADOSCB {
 RBDAIOCB *acb;
 struct BDRVRBDState *s;
 int64_t size;
-char *buf;
 int64_t ret;
 } RADOSCB;
 
@@ -342,13 +326,9 @@ static int qemu_rbd_set_keypairs(rados_t cluster, const char *keypairs_json,
 
 static void qemu_rbd_memset(RADOSCB *rcb, int64_t offs)
 {
-if (LIBRBD_USE_IOVEC) {
-RBDAIOCB *acb = rcb->acb;
-iov_memset(acb->qiov->iov, acb->qiov->niov, offs, 0,
-   acb->qiov->size - offs);
-} else {
-memset(rcb->buf + offs, 0, rcb->size - offs);
-}
+RBDAIOCB *acb = rcb->acb;
+iov_memset(acb->qiov->iov, acb->qiov->niov, offs, 0,
+   acb->qiov->size - offs);
 }
 
 /* FIXME Deprecate and remove keypairs or make it available in QMP. */
@@ -504,13 +484,6 @@ static void qemu_rbd_complete_aio(RADOSCB *rcb)
 
 g_free(rcb);
 
-if (!LIBRBD_USE_IOVEC) {
-if (acb->cmd == RBD_AIO_READ) {
-qemu_iovec_from_buf(acb->qiov, 0, acb->bounce, acb->qiov->size);
-}
-qemu_vfree(acb->bounce);
-}
-
 acb->common.cb(acb->common.opaque, (acb->ret > 0 ? 0 : acb->ret));
 
 qemu_aio_unref(acb);
@@ -878,28 +851,6 @@ static void rbd_finish_aiocb(rbd_completion_t c, RADOSCB *rcb)
  rbd_finish_bh, rcb);
 }
 
-static int rbd_aio_discard_wrapper(rbd_image_t image,
-   uint64_t off,
-   uint64_t len,
-   rbd_completion_t comp)
-{
-#ifdef LIBRBD_SUPPORTS_DISCARD
-return rbd_aio_discard(image, off, len, comp);
-#else
-return -ENOTSUP;
-#endif
-}
-
-static int rbd_aio_flush_wrapper(rbd_image_t image,
- rbd_completion_t comp)
-{
-#ifdef LIBRBD_SUPPORTS_AIO_FLUSH
-return rbd_aio_flush(image, comp);
-#else
-return -ENOTSUP;
-#endif
-}
-
 static BlockAIOCB *rbd_start_aio(BlockDriverState *bs,
  int64_t off,
  QEMUIOVector *qiov,
@@ -922,21 +873,6 @@ static BlockAIOCB *rbd_start_aio(BlockDriverState *bs,
 
 rcb = g_new(RADOSCB, 1);
 
-if (!LIBRBD_USE_IOVEC) {
-if (cmd == RBD_AIO_DISCARD || cmd == RBD_AIO_FLUSH) {
-acb->bounce = NULL;
-} else {
-acb->bounce = qemu_try_blockalign(bs, qiov->size);
-if (acb->bounce == NULL) {
-goto failed;
-}
-}
-if (cmd == RBD_AIO_WRITE) {
-qemu_iovec_to_buf(acb->qiov, 0, acb->bounce, qiov->size);
-}
-rcb->buf = acb->bounce;
-}
-
 acb->ret = 0;
 acb->error = 0;
 acb->s = s;
@@ -950,7 +886,7 @@ static BlockAIOCB *rbd_start_aio(BlockDriverState *bs,
 }
 
 switch (cmd) {
-case RBD_AIO_WRITE: {
+case RBD_AIO_WRITE:
 /*
  * RBD APIs don't allow us to write more than actual size, so in order
  * to support growing images, we resize the image before write
@@ -962,25 +898,16 @@ static BlockAIOCB *rbd_start_aio(BlockDriverState *bs,
 goto failed_completion;
 }
 }
-#ifdef LIBRBD_SUPPORTS_IOVEC
-r = rbd_aio_writev(s->image, qiov->iov, qiov->niov, off, c);
-#else
-r = rbd_aio_write(s->image, off, size, rcb->buf, c);
-#endif
+r = rbd_aio_writev(s->image, qiov->iov, qiov->niov, off, c);
 break;
-}
 case RBD_AIO_READ:
-#ifdef LIBRBD_SUPPORTS_IOVEC
-r = rbd_aio_readv(s->image, qiov->iov, qiov->niov, off, c);
-#else
-r = rbd_aio_read(s->image, off, size, rcb->buf, c);
-#endif
+r = rbd_aio_readv(s->image, qiov->iov, qiov->niov, off, c);
 break;
 case RBD_AIO_DISCARD:
-r = 
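
One detail worth noting: the unified qemu_rbd_memset() now always zero-fills
the tail of a short read directly in the guest-visible iovec via QEMU's
iov_memset() helper, instead of touching a bounce buffer. A minimal sketch of
that call (illustrative, following util/iov.c semantics):

    /* Zero the qiov from byte offset `offs` to its end, e.g. after librbd
     * returned a short read of only `offs` bytes. */
    iov_memset(qiov->iov, qiov->niov, offs, 0, qiov->size - offs);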

[PATCH V3 6/6] block/rbd: drop qemu_rbd_refresh_limits

2021-05-19 Thread Peter Lieven
librbd supports 1-byte alignment for all aio operations.

Currently, there is no API call to query limits from the Ceph backend.
So drop the bdrv_refresh_limits implementation completely until there is
such an API call.

Signed-off-by: Peter Lieven 
---
 block/rbd.c | 9 -
 1 file changed, 9 deletions(-)

diff --git a/block/rbd.c b/block/rbd.c
index ee13f08a74..368a674aa0 100644
--- a/block/rbd.c
+++ b/block/rbd.c
@@ -228,14 +228,6 @@ done:
 return;
 }
 
-
-static void qemu_rbd_refresh_limits(BlockDriverState *bs, Error **errp)
-{
-/* XXX Does RBD support AIO on less than 512-byte alignment? */
-bs->bl.request_alignment = 512;
-}
-
-
 static int qemu_rbd_set_auth(rados_t cluster, BlockdevOptionsRbd *opts,
  Error **errp)
 {
@@ -1128,7 +1120,6 @@ static BlockDriver bdrv_rbd = {
 .format_name= "rbd",
 .instance_size  = sizeof(BDRVRBDState),
 .bdrv_parse_filename= qemu_rbd_parse_filename,
-.bdrv_refresh_limits= qemu_rbd_refresh_limits,
 .bdrv_file_open = qemu_rbd_open,
 .bdrv_close = qemu_rbd_close,
 .bdrv_reopen_prepare= qemu_rbd_reopen_prepare,
-- 
2.17.1





[PATCH V3 0/6] block/rbd: migrate to coroutines and add write zeroes support

2021-05-19 Thread Peter Lieven
This series migrates the qemu rbd driver from the old aio emulation
to native coroutines and adds write zeroes support, which is important
for block operations.

To achieve this we first bump the librbd requirement to the already
outdated luminous release of ceph to get rid of some wrappers and
ifdef'ry in the code.

V2->V3:
 - this patch is now rebased on top of current master
 - Patch 1: only use cc.links and not cc.run to not break
   cross-compiling. [Kevin]
   Since QEMU 6.1 it's okay to rely on librbd >= 12.x since RHEL-7
   support was dropped [Daniel]
 - Patch 4: dropped
 - Patch 5: store BDS in RBDTask and use bdrv_get_aio_context() [Kevin]

V1->V2:
 - this patch is now rebased on top of current master with Paolos
   upcoming fixes for the meson.build script included:
- meson: accept either shared or static libraries if --disable-static
- meson: honor --enable-rbd if cc.links test fails
 - Patch 1: adjusted to meson.build script
 - Patch 2: unchanged
 - Patch 3: new patch
 - Patch 4: do not implement empty detach_aio_context callback [Jason]
 - Patch 5: - fix aio completion cleanup in error case [Jason]
- return error codes from librbd
 - Patch 6: - add support for thick provisioning [Jason]
- do not set write zeroes alignment
 - Patch 7: new patch

Peter Lieven (6):
  block/rbd: bump librbd requirement to luminous release
  block/rbd: store object_size in BDRVRBDState
  block/rbd: update s->image_size in qemu_rbd_getlength
  block/rbd: migrate from aio to coroutines
  block/rbd: add write zeroes support
  block/rbd: drop qemu_rbd_refresh_limits

 block/rbd.c | 408 
 meson.build |   7 +-
 2 files changed, 128 insertions(+), 287 deletions(-)

-- 
2.17.1





[PATCH V3 3/6] block/rbd: update s->image_size in qemu_rbd_getlength

2021-05-19 Thread Peter Lieven
In case the image size changed, we should adjust our internally stored size as
well.

Signed-off-by: Peter Lieven 
---
 block/rbd.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/block/rbd.c b/block/rbd.c
index b4caea4f1b..97a2ae4c84 100644
--- a/block/rbd.c
+++ b/block/rbd.c
@@ -976,6 +976,7 @@ static int64_t qemu_rbd_getlength(BlockDriverState *bs)
 return r;
 }
 
+s->image_size = info.size;
 return info.size;
 }
 
-- 
2.17.1





Re: Qemu block filter insertion/removal API

2021-05-19 Thread Vladimir Sementsov-Ogievskiy

19.05.2021 16:02, Kevin Wolf wrote:

On 19.05.2021 at 14:19, Vladimir Sementsov-Ogievskiy wrote:

19.05.2021 14:44, Kevin Wolf wrote:

On 17.05.2021 at 14:44, Vladimir Sementsov-Ogievskiy wrote:

Hi all!

I'd like to be sure that we know where we are going to.

In the blockdev era, where the qemu user is aware of block nodes, and all nodes
have good names and are controlled by the user, we can use block filters
efficiently.

We already have some useful filters: copy-on-read, throttling, compress. In my 
parallel series I make backup-top filter public and useful without backup block 
jobs. But now filters could be inserted only together with opening their child. 
We can specify filters in qemu cmdline, or filter can take place in the block 
node chain created by blockdev-add.

Still, it would be good to insert/remove filters on demand.

Currently we are going to use x-blockdev-reopen for this. Still it can't be used to 
insert a filter above root node (as x-blockdev-reopen can change only block node options 
and their children). In my series "[PATCH 00/21] block: publish backup-top 
filter" I propose (as Kevin suggested) to modify qom-set, so that it can set drive 
option of running device. That's not difficult, but it means that we have different 
scenario of inserting/removing filters:

1. filter above root node X:

inserting:

- do blockdev-add to add a filter (and specify X as its child)
- do qom-set to set the new filter as the root node instead of X

removing

- do qom-set to make X a root node again
- do blockdev-del to drop a filter

2. filter between two block nodes P and X. (For example, X is a backing child 
of P)

inserting

- do blockdev-add to add a filter (and specify X as its child)
- do blockdev-reopen to set P.backing = filter

removing

- do blockdev-reopen to set P.backing = X
- do blockdev-del to drop a filter


And, probably we'll want transaction support for all these things.


Is it OK? Or do we need some kind of additional blockdev-replace command, that 
can replace one node by another, so in both cases we will do

inserting:
- blockdev-add filter
- blockdev-replace (make all parents of X point to the new filter
instead (except for the filter itself, of course))

removing
- blockdev-replace (make all parents of the filter be parents of X instead)
- blockdev-del filter

It's simple to implement, and it seems to me that it is simpler to use. Any
thoughts?


One reason I remember why we didn't decide to go this way in the many
"dynamic graph reconfiguration" discussions we had, is that it's not
generic enough to cover all cases. But I'm not sure if we ever
considered root nodes as a separate case. I acknowledge that having two
different interfaces is inconvenient, and integrating qom-set in a
transaction is rather unlikely to happen.

The reason why it's not generic is that it restricts you to doing the
same thing for all parents. Imagine this:

                  +- virtio-blk
                  |
  file <- qcow2 <-+
                  |
                  +- NBD export

Now you want to throttle the NBD export so that it doesn't interfere
with your VM too much. With your simple blockdev-replace this isn't
possible. You would have to add the filter to both users or to none.

In theory, blockdev-replace could take a list of the edges that should
be changed to the new node. The problem is that edges don't have names,
and even the parents don't necessarily have one (and if they do, they
are in separate namespaces, so a BlockBackend, a job and an export could
all have the same name), so finding a good way to refer to them in QMP
doesn't sound trivial.



Hm. I like the idea. And it seems feasible to me:

Both exports and block jobs work through BlockBackend.

So, for block jobs, we can add optional parameters like
source-blk-name and target-blk-name. If the parameters are specified, the blk's
will be named, and the user will be able to do blockdev-replace.


I'm not sure if giving them a name is a good idea. Wouldn't it make the
BlockBackend accessible for the user who could then make a device use
it?


For exports it's a bit trickier: it would be strange to add a separate
argument for the export blk, as the export already has an id. So, I'd do the
following:

1. make blk named (with same name as the export itself) iff name does
not conflict with other blks
2. deprecate duplicating existing blk names by export name.


Yes, if we decide that giving them a name is a good idea, it's possible,
but still a change that requires deprecation, as you say.

The third one is devices (which is what I actually meant when I said
BlockBackend), which also have anonymous BlockBackends in the -blockdev
world. The same approach could work, but it would essentially mean
unifying the QOM and the block namespace, which sounds more likely to
produce conflicts than exports.


Then, blockdev-replace takes a parents list, where each parent is either a
node-name or a blk name.


Note 

Re: [RFC PATCH 2/2] qemu-img convert: Fix sparseness detection

2021-05-19 Thread Peter Lieven
On 20.04.21 at 18:52, Vladimir Sementsov-Ogievskiy wrote:
> 20.04.2021 18:04, Kevin Wolf wrote:
>> On 20.04.2021 at 16:31, Vladimir Sementsov-Ogievskiy wrote:
>>> 15.04.2021 18:22, Kevin Wolf wrote:
 In order to avoid RMW cycles, is_allocated_sectors() treats zeroed areas
 like non-zero data if the end of the checked area isn't aligned. This
 can improve the efficiency of the conversion and was introduced in
 commit 8dcd3c9b91a.

 However, it comes with a correctness problem: qemu-img convert is
 supposed to sparsify areas that contain only zeros, which it doesn't do
 any more. It turns out that this even happens when not only the
 unaligned area is zeroed, but also the blocks before and after it. In
 the bug report, conversion of a fragmented 10G image containing only
 zeros resulted in an image consuming 2.82 GiB even though the expected
 size is only 4 KiB.

 As a tradeoff between both, let's ignore zeroed sectors only after
 non-zero data to fix the alignment, but if we're only looking at zeros,
 keep them as such, even if it may mean additional RMW cycles.

>>>
>>> Hmm.. If I understand correctly, we are going to do unaligned
>>> write-zero. And that helps.
>>
>> This can happen (mostly raw images on block devices, I think?), but
>> usually it just means skipping the write because we know that the target
>> image is already zeroed.
>>
>> What it does mean is that if the next part is data, we'll have an
>> unaligned data write.
>>
>>> Doesn't that mean that alignment is wrongly detected?
>>
>> The problem is that you can have bdrv_block_status_above() return the
>> same allocation status multiple times in a row, but *pnum can be
>> unaligned for the conversion.
>>
>> We only look at a single range returned by it when detecting the
>> alignment, so it could be that we have zero buffers for both 0-11 and
>> 12-16 and detect two misaligned ranges, when both together are a
>> perfectly aligned zeroed range.
>>
>> In theory we could try to do some lookahead and merge ranges where
>> possible, which should give us the perfect result, but it would make the
>> code considerably more complicated. (Whether we want to merge them
>> doesn't only depend on the block status, but possibly also on the
>> content of a DATA range.)
>>
>> Kevin
>>
>
> Oh, I understand now the problem, thanks for explanation.
>
> Hmm, yes that means, that if the whole buf is zero, is_allocated_sectors must 
> not align it down, to be possibly "merged" with next chunk if it is zero too.
>
> But it's still good to align zeroes down, if data starts somewhere inside the 
> buf, isn't it?
>
> what about something like this:
>
> diff --git a/qemu-img.c b/qemu-img.c
> index babb5573ab..d1704584a0 100644
> --- a/qemu-img.c
> +++ b/qemu-img.c
> @@ -1167,19 +1167,39 @@ static int is_allocated_sectors(const uint8_t *buf, int n, int *pnum,
>  }
>  }
>  
> +    if (i == n) {
> +    /*
> + * The whole buf is the same.
> + *
> + * if it's data, just return it. It's the old behavior.
> + *
> + * if it's zero, just return too. It will work well if target is already
> + * zeroed. And if next chunk is zero too we'll have no RMW and no reason
> + * to write data.
> + */
> +    *pnum = i;
> +    return !is_zero;
> +    }
> +
>  tail = (sector_num + i) & (alignment - 1);
>  if (tail) {
>  if (is_zero && i <= tail) {
> -    /* treat unallocated areas which only consist
> - * of a small tail as allocated. */
> +    /*
> + * For sure next sector after i is data, and it will rewrite this
> + * tail anyway due to RMW. So, let's just write data now.
> + */
>  is_zero = false;
>  }
>  if (!is_zero) {
> -    /* align up end offset of allocated areas. */
> +    /* If possible, align up end offset of allocated areas. */
>  i += alignment - tail;
>  i = MIN(i, n);
>  } else {
> -    /* align down end offset of zero areas. */
> +    /*
> + * For sure next sector after i is data, and it will rewrite this
> + * tail anyway due to RMW. Better to avoid RMW and write zeroes up
> + * to aligned bound.
> + */
>  i -= tail;
>  }
>  }
>
>

I think we forgot to follow up on this. Has anyone tested this suggestion?

Otherwise, I would try to rerun the tests I did with my old and Kevin's
suggestions.


Peter
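
A worked example of the proposed logic, assuming alignment = 8 sectors and a
hypothetical 16-sector buffer where sectors 0-11 are zero and 12-15 are data:

    First call:  i = 12 (zero run), i != n, tail = (0 + 12) & 7 = 4.
                 is_zero and i > tail, so align down: *pnum = 8 and
                 sectors 0-7 are reported as zeroes (no RMW needed).
    Second call: starts at sector 8; sectors 8-11 are zero, so i = 4,
                 tail = (8 + 4) & 7 = 4, and i <= tail makes the zero run
                 count as data; aligning up turns sectors 8-15 into one
                 aligned data write.
    Whole buffer zero: if all 16 sectors had been zero, the new early
                 return reports *pnum = 16 even though the end is
                 unaligned, so an adjacent zero chunk can merge with it
                 and the area stays sparse instead of being written out.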






Re: Qemu block filter insertion/removal API

2021-05-19 Thread Kevin Wolf
On 19.05.2021 at 14:19, Vladimir Sementsov-Ogievskiy wrote:
> 19.05.2021 14:44, Kevin Wolf wrote:
> > On 17.05.2021 at 14:44, Vladimir Sementsov-Ogievskiy wrote:
> > > Hi all!
> > > 
> > > I'd like to be sure that we know where we are going to.
> > > 
> > > In the blockdev era, where the qemu user is aware of block nodes, and all
> > > nodes have good names and are controlled by the user, we can use block
> > > filters efficiently.
> > > 
> > > We already have some useful filters: copy-on-read, throttling, compress. 
> > > In my parallel series I make backup-top filter public and useful without 
> > > backup block jobs. But now filters could be inserted only together with 
> > > opening their child. We can specify filters in qemu cmdline, or filter 
> > > can take place in the block node chain created by blockdev-add.
> > > 
> > > Still, it would be good to insert/remove filters on demand.
> > > 
> > > Currently we are going to use x-blockdev-reopen for this. Still it can't 
> > > be used to insert a filter above root node (as x-blockdev-reopen can 
> > > change only block node options and their children). In my series "[PATCH 
> > > 00/21] block: publish backup-top filter" I propose (as Kevin suggested) 
> > > to modify qom-set, so that it can set the drive option of a running device.
> > > That's not difficult, but it means that we have different scenarios of
> > > inserting/removing filters:
> > > 
> > > 1. filter above root node X:
> > > 
> > > inserting:
> > > 
> > >- do blockdev-add to add a filter (and specify X as its child)
> > >- do qom-set to set the new filter as the root node instead of X
> > > 
> > > removing
> > > 
> > >- do qom-set to make X a root node again
> > >- do blockdev-del to drop a filter
> > > 
> > > 2. filter between two block nodes P and X. (For example, X is a backing 
> > > child of P)
> > > 
> > > inserting
> > > 
> > >- do blockdev-add to add a filter (and specify X as its child)
> > >- do blockdev-reopen to set P.backing = filter
> > > 
> > > removing
> > > 
> > >- do blockdev-reopen to set P.backing = X
> > >- do blockdev-del to drop a filter
> > > 
> > > 
> > > And, probably we'll want transaction support for all these things.
> > > 
> > > 
> > > Is it OK? Or do we need some kind of additional blockdev-replace command, 
> > > that can replace one node by another, so in both cases we will do
> > > 
> > > inserting:
> > >- blockdev-add filter
> > >- blockdev-replace (make all parents of X point to the new filter
> > > instead (except for the filter itself, of course))
> > > 
> > > removing
> > >- blockdev-replace (make all parents of the filter be parents of X
> > > instead)
> > >- blockdev-del filter
> > > 
> > > It's simple to implement, and it seems to me that it is simpler to use.
> > > Any thoughts?
> > 
> > One reason I remember why we didn't decide to go this way in the many
> > "dynamic graph reconfiguration" discussions we had, is that it's not
> > generic enough to cover all cases. But I'm not sure if we ever
> > considered root nodes as a separate case. I acknowledge that having two
> > different interfaces is inconvenient, and integrating qom-set in a
> > transaction is rather unlikely to happen.
> > 
> > The reason why it's not generic is that it restricts you to doing the
> > same thing for all parents. Imagine this:
> > 
> >                  +- virtio-blk
> >                  |
> >  file <- qcow2 <-+
> >                  |
> >                  +- NBD export
> > 
> > Now you want to throttle the NBD export so that it doesn't interfere
> > with your VM too much. With your simple blockdev-replace this isn't
> > possible. You would have to add the filter to both users or to none.
> > 
> > In theory, blockdev-replace could take a list of the edges that should
> > be changed to the new node. The problem is that edges don't have names,
> > and even the parents don't necessarily have one (and if they do, they
> > are in separate namespaces, so a BlockBackend, a job and an export could
> > all have the same name), so finding a good way to refer to them in QMP
> > doesn't sound trivial.
> > 
> 
> Hm. I like the idea. And it seems feasible to me:
> 
> Both exports and block jobs work through BlockBackend.
> 
> So, for block jobs, we can add optional parameters like
> source-blk-name and target-blk-name. If the parameters are specified, the blk's
> will be named, and the user will be able to do blockdev-replace.

I'm not sure if giving them a name is a good idea. Wouldn't it make the
BlockBackend accessible for the user who could then make a device use
it?

> For exports it's a bit trickier: it would be strange to add a separate
> argument for the export blk, as the export already has an id. So, I'd do the
> following:
> 
> 1. make blk named (with same name as the export itself) iff name does
>not conflict with other blks
> 2. deprecate duplicating existing blk names by export name.

Yes, if we decide that giving them a name is a 

Re: Qemu block filter insertion/removal API

2021-05-19 Thread Vladimir Sementsov-Ogievskiy

19.05.2021 14:44, Kevin Wolf wrote:

On 17.05.2021 at 14:44, Vladimir Sementsov-Ogievskiy wrote:

Hi all!

I'd like to be sure that we know where we are going to.

In the blockdev era, where the qemu user is aware of block nodes, and all nodes
have good names and are controlled by the user, we can use block filters
efficiently.

We already have some useful filters: copy-on-read, throttling, compress. In my 
parallel series I make backup-top filter public and useful without backup block 
jobs. But now filters could be inserted only together with opening their child. 
We can specify filters in qemu cmdline, or filter can take place in the block 
node chain created by blockdev-add.

Still, it would be good to insert/remove filters on demand.

Currently we are going to use x-blockdev-reopen for this. Still it can't be used to 
insert a filter above root node (as x-blockdev-reopen can change only block node options 
and their children). In my series "[PATCH 00/21] block: publish backup-top 
filter" I propose (as Kevin suggested) to modify qom-set, so that it can set drive 
option of running device. That's not difficult, but it means that we have different 
scenario of inserting/removing filters:

1. filter above root node X:

inserting:

   - do blockdev-add to add a filter (and specify X as its child)
   - do qom-set to set the new filter as the root node instead of X

removing

   - do qom-set to make X a root node again
   - do blockdev-del to drop a filter

2. filter between two block nodes P and X. (For example, X is a backing child 
of P)

inserting

   - do blockdev-add to add a filter (and specify X as its child)
   - do blockdev-reopen to set P.backing = filter

removing

   - do blockdev-reopen to set P.backing = X
   - do blockdev-del to drop a filter


And, probably we'll want transaction support for all these things.


Is it OK? Or do we need some kind of additional blockdev-replace command, that 
can replace one node by another, so in both cases we will do

inserting:
   - blockdev-add filter
   - blockdev-replace (make all parents of X point to the new filter instead
(except for the filter itself, of course))

removing
   - blockdev-replace (make all parents of the filter be parents of X instead)
   - blockdev-del filter

It's simple to implement, and it seems to me that it is simpler to use. Any
thoughts?


One reason I remember why we didn't decide to go this way in the many
"dynamic graph reconfiguration" discussions we had, is that it's not
generic enough to cover all cases. But I'm not sure if we ever
considered root nodes as a separate case. I acknowledge that having two
different interfaces is inconvenient, and integrating qom-set in a
transaction is rather unlikely to happen.

The reason why it's not generic is that it restricts you to doing the
same thing for all parents. Imagine this:

                 +- virtio-blk
                 |
 file <- qcow2 <-+
                 |
                 +- NBD export

Now you want to throttle the NBD export so that it doesn't interfere
with your VM too much. With your simple blockdev-replace this isn't
possible. You would have to add the filter to both users or to none.

In theory, blockdev-replace could take a list of the edges that should
be changed to the new node. The problem is that edges don't have names,
and even the parents don't necessarily have one (and if they do, they
are in separate namespaces, so a BlockBackend, a job and an export could
all have the same name), so finding a good way to refer to them in QMP
doesn't sound trivial.



Hm. I like the idea. And it seems feasible to me:

Both exports and block jobs work through BlockBackend.

So, for block jobs, we can add optional parameters like source-blk-name and
target-blk-name. If the parameters are specified, the blk's will be named, and
the user will be able to do blockdev-replace.

For exports it's a bit trickier: it would be strange to add a separate argument
for the export blk, as the export already has an id. So, I'd do the following:

1. make blk named (with same name as the export itself) iff name does not 
conflict with other blks
2. deprecate duplicating existing blk names by export name.


Then, blockdev-replace takes a parents list, where each parent is either a
node-name or a blk name.

--
Best regards,
Vladimir



Re: RFC: Qemu backup interface plans

2021-05-19 Thread Kevin Wolf
On 19.05.2021 at 13:49, Vladimir Sementsov-Ogievskiy wrote:
> 19.05.2021 14:20, Kevin Wolf wrote:
> > On 19.05.2021 at 08:11, Vladimir Sementsov-Ogievskiy wrote:
> > > > > 2. Test, that we can start backup job with source = (target of
> > > > > backup-top filter), so that we have "push backup with fleecing".
> > > > > Make an option for backup to start without a filter, when we don't
> > > > > need copy-before-write operations, to not create extra superfluous
> > > > > filter.
> > > > 
> > > > OK, so the backup job is not really a backup job, but just anything
> > > > that copies data.
> > > 
> > > Not quite. For backup without a filter we should protect the source from
> > > changing, by unsharing WRITE permission on it.
> > > 
> > > I'll try to avoid adding an option. The logic should work like in
> > > commit job: if there are current writers on source we create filter.
> > > If there no writers, we just unshare writes and go without a filter.
> > > And for this copy-before-write filter should be able to do
> > > WRITE_UNCHANGED in case of fleecing.
> > 
> > If we ever get to the point where we would make a block-copy job visible
> > to the user, I wouldn't copy the restrictions from the current jobs, but
> > keep it really generic to cover all cases.
> > 
> > There is no way for the QMP command starting the job to know what the
> > user is planning to do with the image in the future. Even if it's
> > currently read-only, the user may want to add a writer later.
> > 
> > I think this means that we want to always add a filter node, and then
> > possibly dynamically switch between modes if being read-only provides a
> > significant advantage for the job.
> 
> Still, in the push-backup-with-fleecing scheme we really don't need the
> second filter, so why insert an extra thing into the block graph?
> 
> I still see your point, that a user may want to add a writer later. Still,
> I'd be surprised if such use cases exist now.
> 
> What about the following:
> 
> add some source-mode tri-state parameter for backup:
> 
> auto: insert filter iff there are existing writers [default]
> filtered: insert filter unconditionally
> immutable: don't insert a filter; will fail if there are existing
> writers, and creating writers during the block job would be impossible

Yes, that's an option, too.

Kevin




Re: RFC: Qemu backup interface plans

2021-05-19 Thread Vladimir Sementsov-Ogievskiy

19.05.2021 14:20, Kevin Wolf wrote:

On 19.05.2021 at 08:11, Vladimir Sementsov-Ogievskiy wrote:

2. Test that we can start a backup job with source = (target of
backup-top filter), so that we have "push backup with fleecing".
Make an option for backup to start without a filter, when we don't
need copy-before-write operations, to not create an extra superfluous
filter.


OK, so the backup job is not really a backup job, but just anything
that copies data.


Not quite. For backup without a filter we should protect the source from
changing, by unsharing the WRITE permission on it.

I'll try to avoid adding an option. The logic should work like in the
commit job: if there are current writers on source we create a filter.
If there are no writers, we just unshare writes and go without a filter.
And for this, the copy-before-write filter should be able to do
WRITE_UNCHANGED in case of fleecing.


If we ever get to the point where we would make a block-copy job visible
to the user, I wouldn't copy the restrictions from the current jobs, but
keep it really generic to cover all cases.

There is no way for the QMP command starting the job to know what the
user is planning to do with the image in the future. Even if it's
currently read-only, the user may want to add a writer later.

I think this means that we want to always add a filter node, and then
possibly dynamically switch between modes if being read-only provides a
significant advantage for the job.

Kevin



Still, in the push-backup-with-fleecing scheme we really don't need the second
filter, so why insert an extra thing into the block graph?

I still see your point, that a user may want to add a writer later. Still, I'd be
surprised if such use cases exist now.

What about the following:

add some source-mode tri-state parameter for backup:

auto: insert filter iff there are existing writers [default]
filtered: insert filter unconditionally
immutable: don't insert a filter; will fail if there are existing writers, and
creating writers during the block job would be impossible
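
For illustration, the tri-state could end up as an enum roughly like this
(a purely hypothetical sketch; neither the name nor the constants exist
anywhere yet):

/* Hypothetical sketch of the proposed source-mode parameter, as it might
 * look after QAPI code generation; nothing like this exists in QEMU. */
typedef enum BackupSourceMode {
    BACKUP_SOURCE_MODE_AUTO,       /* insert filter iff there are writers */
    BACKUP_SOURCE_MODE_FILTERED,   /* insert filter unconditionally */
    BACKUP_SOURCE_MODE_IMMUTABLE,  /* no filter; unshare WRITE on source */
} BackupSourceMode;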

--
Best regards,
Vladimir



Re: Qemu block filter insertion/removal API

2021-05-19 Thread Kevin Wolf
On 17.05.2021 at 14:44, Vladimir Sementsov-Ogievskiy wrote:
> Hi all!
> 
> I'd like to be sure that we know where we are going to.
> 
> In the blockdev era, where the qemu user is aware of block nodes, all nodes 
> have good names and are controlled by the user, we can efficiently use block 
> filters.
> 
> We already have some useful filters: copy-on-read, throttling, compress. In 
> my parallel series I make backup-top filter public and useful without backup 
> block jobs. But currently filters can be inserted only together with opening 
> their child. We can specify filters on the qemu cmdline, or a filter can take 
> its place in the block node chain created by blockdev-add.
> 
> Still, it would be good to insert/remove filters on demand.
> 
> Currently we are going to use x-blockdev-reopen for this. Still, it can't be 
> used to insert a filter above a root node (as x-blockdev-reopen can change only 
> block node options and their children). In my series "[PATCH 00/21] block: 
> publish backup-top filter" I propose (as Kevin suggested) to modify qom-set, 
> so that it can set the drive option of a running device. That's not difficult, 
> but it means that we have different scenarios for inserting/removing filters:
> 
> 1. filter above root node X:
> 
> inserting:
> 
>   - do blockdev-add to add a filter (and specify X as its child)
>   - do qom-set to set the new filter as a root node instead of X
> 
> removing
> 
>   - do qom-set to make X a root node again
>   - do blockdev-del to drop a filter
> 
> 2. filter between two block nodes P and X. (For example, X is a backing child 
> of P)
> 
> inserting
> 
>   - do blockdev-add to add a filter (and specify X as its child)
>   - do blockdev-reopen to set P.backing = filter
> 
> removing
> 
>   - do blockdev-reopen to set P.backing = X
>   - do blockdev-del to drop a filter
> 
> 
> And, probably we'll want transaction support for all these things.
> 
> 
> Is it OK? Or do we need some kind of additional blockdev-replace command 
> that can replace one node by another, so in both cases we will do
> 
> inserting:
>   - blockdev-add filter
>   - blockdev-replace (make all parents of X point to the new filter 
> instead, except for the filter itself of course)
> 
> removing
>   - blockdev-replace (make all parents of the filter be parents of X instead)
>   - blockdev-del filter
> 
> It's simple to implement, and it seems to me that it is simpler to use. Any 
> thoughts?

One reason I remember why we didn't decide to go this way in the many
"dynamic graph reconfiguration" discussions we had, is that it's not
generic enough to cover all cases. But I'm not sure if we ever
considered root nodes as a separate case. I acknowledge that having two
different interfaces is inconvenient, and integrating qom-set in a
transaction is rather unlikely to happen.

The reason why it's not generic is that it restricts you to doing the
same thing for all parents. Imagine this:

                +- virtio-blk
                |
file <- qcow2 <-+
                |
                +- NBD export

Now you want to throttle the NBD export so that it doesn't interfere
with your VM too much. With your simple blockdev-replace this isn't
possible. You would have to add the filter to both users or to none.

In theory, blockdev-replace could take a list of the edges that should
be changed to the new node. The problem is that edges don't have names,
and even the parents don't necessarily have one (and if they do, they
are in separate namespaces, so a BlockBackend, a job and an export could
all have the same name), so finding a good way to refer to them in QMP
doesn't sound trivial.
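
To illustrate the problem (a purely hypothetical sketch, nothing like this
exists in the tree), an edge reference would have to combine a parent
identifier from one of several namespaces with the child name inside that
parent:

/* Hypothetical sketch only: what referring to an edge in QMP might require.
 * Neither this struct nor any command consuming it exists in QEMU. */
typedef struct BlockEdgeRef {
    const char *parent;  /* node-name, BlockBackend name, job id or export
                          * id; separate namespaces, hence the ambiguity */
    const char *child;   /* child name within the parent, e.g. "backing" */
} BlockEdgeRef;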

Kevin




[PATCH] MAINTAINERS: update block/rbd.c maintainer

2021-05-19 Thread Ilya Dryomov
Jason has moved on from working on RBD and Ceph.  I'm taking over
his role upstream.

Signed-off-by: Ilya Dryomov 
---
 MAINTAINERS | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/MAINTAINERS b/MAINTAINERS
index eab178aeee5e..3e77ac9030fa 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -3042,7 +3042,7 @@ S: Supported
 F: block/vmdk.c
 
 RBD
-M: Jason Dillaman 
+M: Ilya Dryomov 
 L: qemu-block@nongnu.org
 S: Supported
 F: block/rbd.c
-- 
2.19.2




Re: RFC: Qemu backup interface plans

2021-05-19 Thread Kevin Wolf
On 19.05.2021 at 08:11, Vladimir Sementsov-Ogievskiy wrote:
> > > 2. Test that we can start a backup job with source = (target of
> > > backup-top filter), so that we have "push backup with fleecing".
> > > Make an option for backup to start without a filter, when we don't
> > > need copy-before-write operations, to not create an extra superfluous
> > > filter.
> > 
> > OK, so the backup job is not really a backup job, but just anything
> > that copies data.
> 
> Not quite. For backup without a filter we should protect the source from
> changing, by unsharing the WRITE permission on it.
> 
> I'll try to avoid adding an option. The logic should work like in the
> commit job: if there are current writers on source we create a filter.
> If there are no writers, we just unshare writes and go without a filter.
> And for this, the copy-before-write filter should be able to do
> WRITE_UNCHANGED in case of fleecing.

If we ever get to the point where we would make a block-copy job visible
to the user, I wouldn't copy the restrictions from the current jobs, but
keep it really generic to cover all cases.

There is no way for the QMP command starting the job to know what the
user is planning to do with the image in the future. Even if it's
currently read-only, the user may want to add a writer later.

I think this means that we want to always add a filter node, and then
possibly dynamically switch between modes if being read-only provides a
significant advantage for the job.

Kevin




Re: [PATCH 01/21] block: introduce bdrv_replace_child_bs()

2021-05-19 Thread Vladimir Sementsov-Ogievskiy

19.05.2021 14:11, Max Reitz wrote:

On 19.05.21 12:12, Vladimir Sementsov-Ogievskiy wrote:

17.05.2021 15:09, Max Reitz wrote:

On 17.05.21 08:44, Vladimir Sementsov-Ogievskiy wrote:

Add function to transactionally replace bs inside BdrvChild.

Signed-off-by: Vladimir Sementsov-Ogievskiy 
---
  include/block/block.h |  2 ++
  block.c   | 36 
  2 files changed, 38 insertions(+)


As you may guess, I know little about the rewritten replacing functions, so 
this is kind of difficult to review for me.  However, nothing looks out of 
place, and the function looks sufficiently similar to 
bdrv_replace_node_common() to make me happy.


diff --git a/include/block/block.h b/include/block/block.h
index 82185965ff..f9d5fcb108 100644
--- a/include/block/block.h
+++ b/include/block/block.h
@@ -361,6 +361,8 @@ int bdrv_append(BlockDriverState *bs_new, BlockDriverState 
*bs_top,
  Error **errp);
  int bdrv_replace_node(BlockDriverState *from, BlockDriverState *to,
    Error **errp);
+int bdrv_replace_child_bs(BdrvChild *child, BlockDriverState *new_bs,
+  Error **errp);
  BlockDriverState *bdrv_insert_node(BlockDriverState *bs, QDict *node_options,
 int flags, Error **errp);
  int bdrv_drop_filter(BlockDriverState *bs, Error **errp);
diff --git a/block.c b/block.c
index 9ad725d205..755fa53d85 100644
--- a/block.c
+++ b/block.c
@@ -4961,6 +4961,42 @@ out:
  return ret;
  }
+int bdrv_replace_child_bs(BdrvChild *child, BlockDriverState *new_bs,
+  Error **errp)
+{
+    int ret;
+    Transaction *tran = tran_new();
+    g_autoptr(GHashTable) found = NULL;
+    g_autoptr(GSList) refresh_list = NULL;
+    BlockDriverState *old_bs = child->bs;
+
+    if (old_bs) {


Hm.  Can child->bs be ever NULL?


Seems it can. For example, we have the hmp_drive_del command, which removes the 
bs from the blk :(

QMP eject and blockdev-remove-medium seem to do it too.


blk_remove_bs() doesn’t eject the BDS from the BdrvChild blk->root, though; it 
drops blk->root altogether.  Doesn’t it?



Ah, hm, yes. What I'm saying is that we can have an empty blk. But probably we 
should never have a BdrvChild with a NULL bs. I'll check it.


--
Best regards,
Vladimir



Re: [PATCH 01/21] block: introduce bdrv_replace_child_bs()

2021-05-19 Thread Max Reitz

On 19.05.21 12:12, Vladimir Sementsov-Ogievskiy wrote:

17.05.2021 15:09, Max Reitz wrote:

On 17.05.21 08:44, Vladimir Sementsov-Ogievskiy wrote:

Add function to transactionally replace bs inside BdrvChild.

Signed-off-by: Vladimir Sementsov-Ogievskiy 
---
  include/block/block.h |  2 ++
  block.c   | 36 
  2 files changed, 38 insertions(+)


As you may guess, I know little about the rewritten replacing 
functions, so this is kind of difficult to review for me.  However, 
nothing looks out of place, and the function looks sufficiently 
similar to bdrv_replace_node_common() to make me happy.



diff --git a/include/block/block.h b/include/block/block.h
index 82185965ff..f9d5fcb108 100644
--- a/include/block/block.h
+++ b/include/block/block.h
@@ -361,6 +361,8 @@ int bdrv_append(BlockDriverState *bs_new, 
BlockDriverState *bs_top,

  Error **errp);
  int bdrv_replace_node(BlockDriverState *from, BlockDriverState *to,
    Error **errp);
+int bdrv_replace_child_bs(BdrvChild *child, BlockDriverState *new_bs,
+  Error **errp);
  BlockDriverState *bdrv_insert_node(BlockDriverState *bs, QDict 
*node_options,

 int flags, Error **errp);
  int bdrv_drop_filter(BlockDriverState *bs, Error **errp);
diff --git a/block.c b/block.c
index 9ad725d205..755fa53d85 100644
--- a/block.c
+++ b/block.c
@@ -4961,6 +4961,42 @@ out:
  return ret;
  }
+int bdrv_replace_child_bs(BdrvChild *child, BlockDriverState *new_bs,
+  Error **errp)
+{
+    int ret;
+    Transaction *tran = tran_new();
+    g_autoptr(GHashTable) found = NULL;
+    g_autoptr(GSList) refresh_list = NULL;
+    BlockDriverState *old_bs = child->bs;
+
+    if (old_bs) {


Hm.  Can child->bs be ever NULL?


Seems it can. For example, we have the hmp_drive_del command, which removes 
the bs from the blk :(


QMP eject and blockdev-remove-medium seem to do it too.


blk_remove_bs() doesn’t eject the BDS from the BdrvChild blk->root, 
though; it drops blk->root altogether.  Doesn’t it?


Max




Re: [RFC PATCH] block/vpc: Support probing of fixed-size VHD images

2021-05-19 Thread Kevin Wolf
On 29.03.2021 at 09:25, Thomas Huth wrote:
> Fixed-size VHD images don't have a header, only a footer. To be able
> to still detect them right, support probing via the file name, too.
> 
> Without this change, images get detected as raw:
> 
> $ qemu-img create -f vpc -o subformat=fixed test.vhd 2G
> Formatting 'test.vhd', fmt=vpc size=2147483648 subformat=fixed
> $ qemu-img info test.vhd
> image: test.vhd
> file format: raw
> virtual size: 2 GiB (2147992064 bytes)
> disk size: 8 KiB
> 
> With this change:
> 
> $ qemu-img info test.vhd
> image: test.vhd
> file format: vpc
> virtual size: 2 GiB (2147991552 bytes)
> disk size: 8 KiB
> 
> Resolves: https://bugs.launchpad.net/qemu/+bug/1819182
> Signed-off-by: Thomas Huth 
> ---
>  I've marked the subject with RFC since I'm not quite sure whether this
>  is really a good idea... please let me know what you think about it...

There is precedent for using the file name, and it's convenient, so
when done carefully, I think this is okay.

The classic problem we have with probing is that a malicious guest on a
probed raw image could write a $format header into the image and be
probed as something else the next time. For headers, we prevent this in
the raw driver, for filename based probes we don't.

Of course, if you call your raw image .vhd and use probing, you're
almost explicitly asking for trouble.

What happens if you do it anyway and the guest writes a VHD footer? In
theory, we can know that it's a VHD_FIXED image, and fixed image footers
don't contain anything that would allow a guest more than destroying
itself. Other than carrying the additional metadata in the footer, they
behave the same as raw images anyway.

We have a bug in vpc_open(), though: It sets the local variable
disk_type correctly, but other functions use s->footer.type instead and
we never check that it actually matches the disk type we think we're
opening.

So I think we need to add this check (and we need to add it even if we
don't change probing), and then the change to vpc_probe() is probably
okay.
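
Something along these lines in vpc_open() should do (a minimal sketch only,
assuming the usual be32_to_cpu() handling of the footer fields and the
existing fail label; not a finished patch):

/* Sketch: verify that the footer's type field matches the disk type we
 * determined, before anything starts trusting s->footer.type. */
if (be32_to_cpu(footer->type) != disk_type) {
    error_setg(errp, "The footer's disk type (%u) does not match the "
               "detected disk type (%u)",
               be32_to_cpu(footer->type), (unsigned) disk_type);
    ret = -EINVAL;
    goto fail;
}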

Kevin




Re: [PATCH] qemu-io-cmds: assert that we don't have .perm requested in no-blk case

2021-05-19 Thread Kevin Wolf
On 19.05.2021 at 11:05, Vladimir Sementsov-Ogievskiy wrote:
> Coverity things blk may be NULL. It's a false-positive, as described in

s/things/thinks/

> a new comment.
> 
> Fixes: Coverity CID 1453194
> Signed-off-by: Vladimir Sementsov-Ogievskiy 
> ---
>  qemu-io-cmds.c | 14 --
>  1 file changed, 12 insertions(+), 2 deletions(-)
> 
> diff --git a/qemu-io-cmds.c b/qemu-io-cmds.c
> index 998b67186d..3b7cceecbf 100644
> --- a/qemu-io-cmds.c
> +++ b/qemu-io-cmds.c
> @@ -92,9 +92,19 @@ static int command(BlockBackend *blk, const cmdinfo_t *ct, int argc,
>          return -EINVAL;
>      }
>  
> -    /* Request additional permissions if necessary for this command. The caller
> +    /*
> +     * Request additional permissions if necessary for this command. The caller
>       * is responsible for restoring the original permissions afterwards if this
> -     * is what it wants. */
> +     * is what it wants.
> +     *
> +     * Coverity things that blk may be NULL in the following if condition. It's

And here.

> +     * not so: in init_check_command() we fail if blk is NULL for command with
> +     * both CMD_FLAG_GLOBAL and CMD_NOFILE_OK flags unset. And in
> +     * qemuio_add_command() we assert that command with non-zero .perm field
> +     * doesn't set this flags. So, the following assertion is to silence
> +     * Coverity:
> +     */
> +    assert(blk || !ct->perm);
>      if (ct->perm && blk_is_available(blk)) {
>          uint64_t orig_perm, orig_shared_perm;
>          blk_get_perm(blk, &orig_perm, &orig_shared_perm);

Thanks, applied with the typo fixed up.

Kevin




Re: [PATCH] block/file-posix: Fix problem with fallocate(PUNCH_HOLE) on GPFS

2021-05-19 Thread Thomas Huth

On 19/05/2021 12.21, Thomas Huth wrote:

On 19/04/2021 07.06, Thomas Huth wrote:

On 16/04/2021 22.34, Nir Soffer wrote:

On Fri, Apr 16, 2021 at 8:23 AM Thomas Huth  wrote:


A customer reported that running

  qemu-img convert -t none -O qcow2 -f qcow2 input.qcow2 output.qcow2

fails for them with the following error message when the images are
stored on a GPFS file system:

  qemu-img: error while writing sector 0: Invalid argument

After analyzing the strace output, it seems like the problem is in
handle_aiocb_write_zeroes(): The call to fallocate(FALLOC_FL_PUNCH_HOLE)
returns EINVAL, which can apparently happen if the file system has
a different idea of the granularity of the operation. It's arguably
a bug in GPFS, since the PUNCH_HOLE mode should not result in EINVAL
according to the man-page of fallocate(), but the file system is out
there in production and so we have to deal with it. In commit 294682cc3a
("block: workaround for unaligned byte range in fallocate()") we also
already applied the a work-around for the same problem to the earlier
fallocate(FALLOC_FL_ZERO_RANGE) call, so do it now similar with the
PUNCH_HOLE call.

Signed-off-by: Thomas Huth 
---
  block/file-posix.c | 7 +++
  1 file changed, 7 insertions(+)

diff --git a/block/file-posix.c b/block/file-posix.c
index 20e14f8e96..7a40428d52 100644
--- a/block/file-posix.c
+++ b/block/file-posix.c
@@ -1675,6 +1675,13 @@ static int handle_aiocb_write_zeroes(void *opaque)
             }
             s->has_fallocate = false;
         } else if (ret != -ENOTSUP) {
+            if (ret == -EINVAL) {
+                /*
+                 * File systems like GPFS do not like unaligned byte ranges,
+                 * treat it like unsupported (so caller falls back to pwrite)
+                 */
+                return -ENOTSUP;


This skips the next fallback, using plain fallocate(0) if we write
after the end of the file. Is this intended?

We can treat the buggy EINVAL return value as "filesystem is buggy,
let's not try other options", or "let's try the next option". Since falling
back to actually writing zeroes is so much slower, I think it is better to
try the next option.


I just did the same work-around as in commit 294682cc3a7 ... so if we 
agree to try the other options, too, we should change that spot, too...


However, what is not clear to me is how you would handle s->has_write_zeroes 
and s->has_discard in such a case? Set them to "false"? ... but it could 
still work for some blocks with different alignment ... but if we keep 
them set to "true", the code tries again and again to call these ioctls, 
maybe wasting precious cycles on this?


Maybe we should take a different approach instead: In case we hit an EINVAL 
here, print an error a la:


error_report_once("You are running on a buggy file system, please complain 
to the file system vendor");


and return -ENOTSUP ... then it's hopefully clear to the users why they 
are getting bad performance, and that they should complain to the file 
system vendor instead to get their problem fixed.
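
In code, that would be something like this sketch (error_report_once() as
declared in include/qemu/error-report.h; untested, just to show the idea):

    } else if (ret != -ENOTSUP) {
        if (ret == -EINVAL) {
            /*
             * Warn once about the broken file system, then pretend the
             * operation is simply unsupported so that the generic
             * fallback path takes over.
             */
            error_report_once("Your file system returned EINVAL for "
                              "fallocate(FALLOC_FL_PUNCH_HOLE); please "
                              "report this to your file system vendor");
            return -ENOTSUP;
        }
        return ret;
    }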


Ping!

Any recommendations how to proceed here?


Never mind, Kevin just told me that he already replied - but apparently that 
mail did not make it into my "qemu-devel" folder, so I did not see it :-( 
... Anyway, I'll try to rework my patch accordingly:


 https://lists.gnu.org/archive/html/qemu-block/2021-04/msg00488.html

 Thomas




Re: [RFC PATCH] block/vpc: Support probing of fixed-size VHD images

2021-05-19 Thread Thomas Huth

On 29/03/2021 09.25, Thomas Huth wrote:

Fixed-size VHD images don't have a header, only a footer. To be able
to still detect them right, support probing via the file name, too.

Without this change, images get detected as raw:

$ qemu-img create -f vpc -o subformat=fixed test.vhd 2G
Formatting 'test.vhd', fmt=vpc size=2147483648 subformat=fixed
$ qemu-img info test.vhd
image: test.vhd
file format: raw
virtual size: 2 GiB (2147992064 bytes)
disk size: 8 KiB

With this change:

$ qemu-img info test.vhd
image: test.vhd
file format: vpc
virtual size: 2 GiB (2147991552 bytes)
disk size: 8 KiB

Resolves: https://bugs.launchpad.net/qemu/+bug/1819182
Signed-off-by: Thomas Huth 
---
  I've marked the subject with RFC since I'm not quite sure whether this
  is really a good idea... please let me know what you think about it...

  block/vpc.c | 12 +++-
  1 file changed, 11 insertions(+), 1 deletion(-)

diff --git a/block/vpc.c b/block/vpc.c
index 17a705b482..be561e4b39 100644
--- a/block/vpc.c
+++ b/block/vpc.c
@@ -191,8 +191,18 @@ static uint32_t vpc_checksum(void *p, size_t size)
 
 static int vpc_probe(const uint8_t *buf, int buf_size, const char *filename)
 {
-    if (buf_size >= 8 && !strncmp((char *)buf, "conectix", 8))
+    if (buf_size >= 8 && !strncmp((char *)buf, "conectix", 8)) {
         return 100;
+    }
+
+    /* It could be a fixed-size image without header -> check extension, too */
+    if (filename) {
+        int len = strlen(filename);
+        if (len > 4 && !strcasecmp(&filename[len - 4], ".vhd")) {
+            return 10;
+        }
+    }
+
     return 0;
 }


Ping!

Anybody any comments on this one?

 Thomas




Re: [PATCH] block/file-posix: Fix problem with fallocate(PUNCH_HOLE) on GPFS

2021-05-19 Thread Thomas Huth

On 19/04/2021 07.06, Thomas Huth wrote:

On 16/04/2021 22.34, Nir Soffer wrote:

On Fri, Apr 16, 2021 at 8:23 AM Thomas Huth  wrote:


A customer reported that running

  qemu-img convert -t none -O qcow2 -f qcow2 input.qcow2 output.qcow2

fails for them with the following error message when the images are
stored on a GPFS file system:

  qemu-img: error while writing sector 0: Invalid argument

After analyzing the strace output, it seems like the problem is in
handle_aiocb_write_zeroes(): The call to fallocate(FALLOC_FL_PUNCH_HOLE)
returns EINVAL, which can apparently happen if the file system has
a different idea of the granularity of the operation. It's arguably
a bug in GPFS, since the PUNCH_HOLE mode should not result in EINVAL
according to the man-page of fallocate(), but the file system is out
there in production and so we have to deal with it. In commit 294682cc3a
("block: workaround for unaligned byte range in fallocate()") we also
already applied the a work-around for the same problem to the earlier
fallocate(FALLOC_FL_ZERO_RANGE) call, so do it now similar with the
PUNCH_HOLE call.

Signed-off-by: Thomas Huth 
---
  block/file-posix.c | 7 +++
  1 file changed, 7 insertions(+)

diff --git a/block/file-posix.c b/block/file-posix.c
index 20e14f8e96..7a40428d52 100644
--- a/block/file-posix.c
+++ b/block/file-posix.c
@@ -1675,6 +1675,13 @@ static int handle_aiocb_write_zeroes(void *opaque)
             }
             s->has_fallocate = false;
         } else if (ret != -ENOTSUP) {
+            if (ret == -EINVAL) {
+                /*
+                 * File systems like GPFS do not like unaligned byte ranges,
+                 * treat it like unsupported (so caller falls back to pwrite)
+                 */
+                return -ENOTSUP;


This skips the next fallback, using plain fallocate(0) if we write
after the end of the file. Is this intended?

We can treat the buggy EINVAL return value as "filesystem is buggy,
let's not try other options", or "let's try the next option". Since falling
back to actually writing zeroes is so much slower, I think it is better to
try the next option.


I just did the same work-around as in commit 294682cc3a7 ... so if we agree 
to try the other options, too, we should change that spot, too...


However, what is not clear to me is how you would handle s->has_write_zeroes 
and s->has_discard in such a case? Set them to "false"? ... but it could 
still work for some blocks with different alignment ... but if we keep them 
set to "true", the code tries again and again to call these ioctls, maybe 
wasting precious cycles on this?


Maybe we should take a different approach instead: In case we hit an EINVAL 
here, print an error a la:


error_report_once("You are running on a buggy file system, please complain 
to the file system vendor");


and return -ENOTSUP ... then it's hopefully clear to the users why they are 
getting bad performance, and that they should complain to the file system 
vendor instead to get their problem fixed.


Ping!

Any recommendations how to proceed here?

 Thomas




Re: [PATCH 01/21] block: introduce bdrv_replace_child_bs()

2021-05-19 Thread Vladimir Sementsov-Ogievskiy

17.05.2021 15:09, Max Reitz wrote:

On 17.05.21 08:44, Vladimir Sementsov-Ogievskiy wrote:

Add function to transactionally replace bs inside BdrvChild.

Signed-off-by: Vladimir Sementsov-Ogievskiy 
---
  include/block/block.h |  2 ++
  block.c   | 36 
  2 files changed, 38 insertions(+)


As you may guess, I know little about the rewritten replacing functions, so 
this is kind of difficult to review for me.  However, nothing looks out of 
place, and the function looks sufficiently similar to 
bdrv_replace_node_common() to make me happy.


diff --git a/include/block/block.h b/include/block/block.h
index 82185965ff..f9d5fcb108 100644
--- a/include/block/block.h
+++ b/include/block/block.h
@@ -361,6 +361,8 @@ int bdrv_append(BlockDriverState *bs_new, BlockDriverState 
*bs_top,
  Error **errp);
  int bdrv_replace_node(BlockDriverState *from, BlockDriverState *to,
    Error **errp);
+int bdrv_replace_child_bs(BdrvChild *child, BlockDriverState *new_bs,
+  Error **errp);
  BlockDriverState *bdrv_insert_node(BlockDriverState *bs, QDict *node_options,
 int flags, Error **errp);
  int bdrv_drop_filter(BlockDriverState *bs, Error **errp);
diff --git a/block.c b/block.c
index 9ad725d205..755fa53d85 100644
--- a/block.c
+++ b/block.c
@@ -4961,6 +4961,42 @@ out:
  return ret;
  }
+int bdrv_replace_child_bs(BdrvChild *child, BlockDriverState *new_bs,
+  Error **errp)
+{
+    int ret;
+    Transaction *tran = tran_new();
+    g_autoptr(GHashTable) found = NULL;
+    g_autoptr(GSList) refresh_list = NULL;
+    BlockDriverState *old_bs = child->bs;
+
+    if (old_bs) {


Hm.  Can child->bs be ever NULL?


Seems it can. For example, we have the hmp_drive_del command, which removes the 
bs from the blk :(

QMP eject and blockdev-remove-medium seem to do it too.




+    bdrv_ref(old_bs);
+    bdrv_drained_begin(old_bs);
+    }
+    bdrv_drained_begin(new_bs);


(I was wondering why we couldn’t handle the new_bs == NULL case here to replace 
bdrv_remove_filter_or_cow_child(), but then I realized it’s probably because 
that’s kind of difficult, precisely because child->bs at least should generally 
be non-NULL.  Which is why bdrv_remove_filter_or_cow_child() needs to add its own 
transaction entry to handle the BdrvChild object and the pointer to it.

Hence me wondering whether we could assume child->bs not to be NULL.)


+
+    bdrv_replace_child(child, new_bs, tran);
+
+    found = g_hash_table_new(NULL, NULL);
+    if (old_bs) {
+    refresh_list = bdrv_topological_dfs(refresh_list, found, old_bs);
+    }
+    refresh_list = bdrv_topological_dfs(refresh_list, found, new_bs);
+
+    ret = bdrv_list_refresh_perms(refresh_list, NULL, tran, errp);


Speaking of bdrv_remove_filter_or_cow_child(): That function doesn’t refresh 
permissions.  I think it’s correct to do it here, so the following question 
doesn’t really concern this patch, but: Why don’t we do it there?

I guess it’s because we expect the node to go away anyway, so we don’t need to 
refresh the permissions.  And that assumption should hold true right now, given 
its callers.  But is that a safe assumption in general?  Would there be a 
problem if we refreshed permissions there?  Or is not refreshing permissions 
just part of the function’s interface?

Max


+
+    tran_finalize(tran, ret);
+
+    if (old_bs) {
+    bdrv_drained_end(old_bs);
+    bdrv_unref(old_bs);
+    }
+    bdrv_drained_end(new_bs);
+
+    return ret;
+}
+
  static void bdrv_delete(BlockDriverState *bs)
  {
  assert(bdrv_op_blocker_is_empty(bs));
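
(As a side note, a hypothetical caller of the new function, e.g. for the
filter-insertion case from the other RFC, might look like this sketch;
p_bs is the parent node, filter_bs a filter opened with X as its child,
and error handling is kept minimal:)

/* Hypothetical usage sketch, not part of the patch: replace the node
 * under p_bs's backing child with the new filter node. */
Error *local_err = NULL;

if (bdrv_replace_child_bs(p_bs->backing, filter_bs, &local_err) < 0) {
    error_report_err(local_err);
}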






--
Best regards,
Vladimir



[PATCH] qemu-io-cmds: assert that we don't have .perm requested in no-blk case

2021-05-19 Thread Vladimir Sementsov-Ogievskiy
Coverity things blk may be NULL. It's a false-positive, as described in
a new comment.

Fixes: Coverity CID 1453194
Signed-off-by: Vladimir Sementsov-Ogievskiy 
---
 qemu-io-cmds.c | 14 --
 1 file changed, 12 insertions(+), 2 deletions(-)

diff --git a/qemu-io-cmds.c b/qemu-io-cmds.c
index 998b67186d..3b7cceecbf 100644
--- a/qemu-io-cmds.c
+++ b/qemu-io-cmds.c
@@ -92,9 +92,19 @@ static int command(BlockBackend *blk, const cmdinfo_t *ct, int argc,
         return -EINVAL;
     }
 
-    /* Request additional permissions if necessary for this command. The caller
+    /*
+     * Request additional permissions if necessary for this command. The caller
      * is responsible for restoring the original permissions afterwards if this
-     * is what it wants. */
+     * is what it wants.
+     *
+     * Coverity things that blk may be NULL in the following if condition. It's
+     * not so: in init_check_command() we fail if blk is NULL for command with
+     * both CMD_FLAG_GLOBAL and CMD_NOFILE_OK flags unset. And in
+     * qemuio_add_command() we assert that command with non-zero .perm field
+     * doesn't set this flags. So, the following assertion is to silence
+     * Coverity:
+     */
+    assert(blk || !ct->perm);
     if (ct->perm && blk_is_available(blk)) {
         uint64_t orig_perm, orig_shared_perm;
         blk_get_perm(blk, &orig_perm, &orig_shared_perm);
-- 
2.29.2




Re: [PATCH v2] fdc: check null block pointer before r/w data transfer

2021-05-19 Thread P J P
+-- On Tue, 18 May 2021, John Snow wrote --+
| I assume it can be rolled into the most recent issue that actually grabbed 
| my attention:
|
|  -> https://gitlab.com/qemu-project/qemu/-/issues/338
| 
| And we can credit both reporters (and Alexander) and solve all of these issues
| all at once.
| 
| I am therefore dropping this patch from my queue. Please let me know if I am
| mistaken.

* It should be okay to collate the patches together under the above gitlab 
  issue as a series.
 
  Considering they have individual CVEs assigned, we'll still need to tag them 
  with CVEs, so that it's easier for downstream users to pick them.


Thank you.
--
 - P J P
8685 545E B54C 486B C6EB 271E E285 8B5A F050 DE8D




Re: [PATCH 3/3] hw/block/nvme: add id ns metadata cap (mc) enum

2021-05-19 Thread Klaus Jensen

On Apr 21 18:32, Gollu Appalanaidu wrote:

Add Identify Namespace Metadata Capabilities (MC) enum.

Signed-off-by: Gollu Appalanaidu 
---
hw/block/nvme-ns.c   | 2 +-
include/block/nvme.h | 5 +
2 files changed, 6 insertions(+), 1 deletion(-)

diff --git a/hw/block/nvme-ns.c b/hw/block/nvme-ns.c
index 9065a7ae99..db75302136 100644
--- a/hw/block/nvme-ns.c
+++ b/hw/block/nvme-ns.c
@@ -85,7 +85,7 @@ static int nvme_ns_init(NvmeNamespace *ns, Error **errp)
     ds = 31 - clz32(ns->blkconf.logical_block_size);
     ms = ns->params.ms;
 
-    id_ns->mc = 0x3;
+    id_ns->mc = NVME_ID_NS_MC_EXTENDED | NVME_ID_NS_MC_SEPARATE;
 
     if (ms && ns->params.mset) {
         id_ns->flbas |= NVME_ID_NS_FLBAS_EXTENDEND;
diff --git a/include/block/nvme.h b/include/block/nvme.h
index 1d61030756..a3b610ba86 100644
--- a/include/block/nvme.h
+++ b/include/block/nvme.h
@@ -1344,6 +1344,11 @@ enum NvmeIdNsFlbas {
     NVME_ID_NS_FLBAS_EXTENDEND  = 1 << 4,
 };
 
+enum NvmeIdNsMc {
+    NVME_ID_NS_MC_EXTENDED  = 1 << 0,
+    NVME_ID_NS_MC_SEPARATE  = 1 << 1,
+};
+
 #define NVME_ID_NS_DPS_TYPE(dps) (dps & NVME_ID_NS_DPS_TYPE_MASK)
 
 typedef struct NvmeDifTuple {
--
2.17.1



So, initially I wondered why the compiler didn't complain about the 
`NVME_ID_NS_MC_EXTENDED` and `NVME_ID_NS_MC_SEPARATE` "names" being 
defined twice. A smart colleague was quick to point out that because the
`NVME_ID_NS_MC_EXTENDED(mc)` function-like macro is expanded in the 
preprocessing step, it no longer exists when we compile, so there really is 
no potential clash.


I wonder now if it is bad form to keep the function-like macro 
"accessor" there as well as the enum for the value. Does anyone have any 
opinion on this? In other words, would it be bad or confusing to do 
something like this?


   enum NvmeIdNsMc {
 NVME_ID_NS_MC_EXTENDED = 1 << 0,
   };

   #define NVME_ID_NS_MC_EXTENDED(mc) ((mc) & NVME_ID_NS_MC_EXTENDED)
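
For what it's worth, a standalone sketch like the following compiles and
behaves as expected, since the function-like macro is only expanded where
the name is directly followed by a parenthesis:

#include <stdio.h>

enum NvmeIdNsMc {
    NVME_ID_NS_MC_EXTENDED = 1 << 0,
};

/* Same name as the enumerator; expanded only when followed by '(' */
#define NVME_ID_NS_MC_EXTENDED(mc) ((mc) & NVME_ID_NS_MC_EXTENDED)

int main(void)
{
    unsigned mc = NVME_ID_NS_MC_EXTENDED;       /* enumerator, no '(' */
    printf("%u\n", NVME_ID_NS_MC_EXTENDED(mc)); /* macro, prints 1 */
    return 0;
}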




Re: Qemu block filter insertion/removal API

2021-05-19 Thread Vladimir Sementsov-Ogievskiy

18.05.2021 19:49, Max Reitz wrote:

On 17.05.21 14:44, Vladimir Sementsov-Ogievskiy wrote:

Hi all!

I'd like to be sure that we know where we are going to.

In the blockdev era, where the qemu user is aware of block nodes, all nodes have 
good names and are controlled by the user, we can efficiently use block filters.

We already have some useful filters: copy-on-read, throttling, compress. In my 
parallel series I make backup-top filter public and useful without backup block 
jobs. But currently filters can be inserted only together with opening their child. 
We can specify filters on the qemu cmdline, or a filter can take its place in the 
block node chain created by blockdev-add.

Still, it would be good to insert/remove filters on demand.

Currently we are going to use x-blockdev-reopen for this. Still, it can't be used to 
insert a filter above a root node (as x-blockdev-reopen can change only block node options 
and their children). In my series "[PATCH 00/21] block: publish backup-top 
filter" I propose (as Kevin suggested) to modify qom-set, so that it can set the drive 
option of a running device. That's not difficult, but it means that we have different 
scenarios for inserting/removing filters:

1. filter above root node X:

inserting:

   - do blockdev-add to add a filter (and specify X as its child)
   - do qom-set to set the new filter as a root node instead of X

removing

   - do qom-set to make X a root node again
   - do blockdev-del to drop a filter

2. filter between two block nodes P and X. (For example, X is a backing child 
of P)

inserting

   - do blockdev-add to add a filter (and specify X as its child)
   - do blockdev-reopen to set P.backing = filter

removing

   - do blockdev-reopen to set P.backing = X
   - do blockdev-del to drop a filter


And, probably we'll want transaction support for all these things.


Is it OK? Or do we need some kind of additional blockdev-replace command that 
can replace one node by another, so in both cases we will do

inserting:

   - blockdev-add filter
   - blockdev-replace (make all parents of X point to the new filter instead, 
except for the filter itself of course)

removing
   - blockdev-replace (make all parents of the filter be parents of X instead)
   - blockdev-del filter


It's simple to implement, and it seems to me that it is simpler to use. Any 
thoughts?


I’m afraid as a non-user of the blockdev interface, I can’t give a valuable 
opinion that would have some actual weight.

Doesn’t stop me from giving my personal and potentially invaluable opinion, 
though, obviously:

I think we expect all users to know the block graph, so they should be able to 
distinguish between cases 1 and 2.  However, I can imagine having to 
distinguish still is kind of a pain, especially if it were trivial for qemu to 
let the user not having to worry about it at all.


I discussed it yesterday with my colleagues from Virtuozzo, who will have to be 
users of that interface. And they of course prefer one command for all the 
cases :)



Also, if you want a filter unconditionally above some node, all the qom-set and 
blockdev-reopen operations for all of the original node’s parents would need to 
happen atomically.  As you say, those operations should perhaps be 
transactionable anyway, but...  Implementing blockdev-replace would provide 
this for much less cost now, I suppose?

I guess it can be argued that the downside is that having blockdev-replace 
means less pressure to make qom-set for drive and blockdev-reopen 
transactionable.

But well.  I don’t really have anything against a blockdev-replace, but again, 
I don’t know whether my opinion on this topic really has weight.


Thanks, actually my opinion is the same. I think I'll prepare a patch in a day 
or so if there are no answers here, and we'll be able to continue the discussion 
on top of the new patch.

Hmm, I have one additional (weak, but still) argument for blockdev-replace: it 
just seems good to avoid touching an extra subsystem in block-graph operations. 
For block jobs we don't need to touch qdev guest block devices; we are good now 
with node-names and blockdev-add. So, it's good to preserve this bit of interface 
beauty if we don't have a strict reason to drop it.

--
Best regards,
Vladimir



Re: RFC: Qemu backup interface plans

2021-05-19 Thread Vladimir Sementsov-Ogievskiy

18.05.2021 19:39, Max Reitz wrote:

Hi,

Your proposal sounds good to me in general.  Some small independent building 
blocks that seem to make sense to me.


Thanks! I hope it's not too difficult to read and understand my English.




On 17.05.21 14:07, Vladimir Sementsov-Ogievskiy wrote:

[...]


What we lack in this scheme:

1. handling a dirty bitmap in the backup-top filter: backup-top does a copy-before-write 
operation on any guest write, when actually we are interested only in "dirty" 
regions for incremental backup

A probable solution would be to allow specifying a bitmap for the sync=none mode of 
backup, but I think what I propose below is better.

2. [actually it's a tricky part of 1]: the possibility to not do copy-before-write 
operations for regions that were already copied to the final backup. With the normal 
Qemu backup job, this is achieved by the fact that the block-copy state with its 
internal bitmap is shared between the backup job and the copy-before-write filter.

3. Not a real problem but a fact: the backup block job does nothing in the scheme; 
the whole job is done by the filter. So, it would be interesting to have the 
possibility to simply insert/remove the filter, and avoid block-job creation 
and management at all for external backup. (And I'd like to send another RFC on 
how to insert/remove filters; let's not discuss it here.)


Next. Think about internal backup. It has one drawback too:
4. If the target is remote with a slow connection, copy-before-write operations will 
slow down guest writes appreciably.

It may be solved with the help of image fleecing: we create a temporary qcow2 image, 
set up the fleecing scheme, and instead of exporting the temp image through NBD we start 
a second backup with source = temporary image and target = the real backup 
target (NBD for example).


How would a second backup work here?  Wouldn’t one want a mirror job to copy 
the data off to the real target?

(Because I see backup as something intrinsically synchronous, whereas mirror by 
default is rather lazy.)

[Note from future me where I read more below: I see you acknowledge that you’ll 
need to modify backup to do what you need here, i.e. not do any CBW operations. 
 So it’s effectively the same as a mirror that ignores new dirty areas.  Which 
could work without changing mirror if block-copy were to set 
BDRV_REQ_WRITE_UNCHANGED for the fleecing case, and bdrv_co_write_req_finish() 
would skip bdrv_set_dirty() for such writes.]
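
A minimal sketch of that last idea (assuming the request flags can be made
available in bdrv_co_write_req_finish(), e.g. through the tracked request;
not a tested patch):

/* Sketch: skip the dirty-bitmap update for writes that are declared not
 * to change visible data, so fleecing writes don't re-dirty the source. */
if (!(flags & BDRV_REQ_WRITE_UNCHANGED)) {
    bdrv_set_dirty(bs, offset, bytes);
}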


I just feel closer to the backup block job than to mirror :) In the end, 
yes, there is no real difference in interface. But in implementation, I prefer to 
continue developing block-copy. I hope that eventually all jobs and img-convert will 
work through block-copy.

(And I'll need BDRV_REQ_WRITE_UNCHANGED anyway for fleecing, so the user can use 
mirror or backup.)



I mean, it still has the problem that the mirror job can’t tell the CBW filter 
which areas are already copied off and so don’t need to be preserved anymore, 
but...


Still, with such a solution there are the same [1,2] problems, and 3 becomes worse:


Not sure how 3 can become worse when you said above it isn’t a real problem (to 
which I agree).


It's my perfectionism :) Yes, it still isn't a problem, but the number of extra 
user-visible objects in the architecture increases, which is not good I think.




5. We'll have two jobs and two automatically inserted filters, when actually 
one filter and one job are enough (as the first job is needed only to insert a 
filter, and the second job doesn't need a filter at all).

Note also that this (starting two backup jobs to make push backup with 
fleecing) doesn't work now; op-blockers will object. It's simple to fix 
(and in Virtuozzo we live with a downstream-only patch, which allows push backup 
with fleecing, based on starting two backup jobs). But I never sent a patch, 
as I have a better plan, which will solve all the listed problems.


So, what I propose:

1. We make the backup-top filter public, so that it can be inserted/removed where 
the user wants through QMP (on how to properly insert/remove filters I'll post another 
RFC, as backup-top is not the only filter that can be usefully inserted 
somewhere). For this first step I've sent a series today:

   subject: [PATCH 00/21] block: publish backup-top filter
   id: <20210517064428.16223-1-vsement...@virtuozzo.com>
   patchew: 
https://patchew.org/QEMU/20210517064428.16223-1-vsement...@virtuozzo.com/

(note that one of the things in this series is the rename 
s/backup-top/copy-before-write/; still, I call it backup-top in this letter)

This solves [3]. [4, 5] are solved partly: we still have one extra filter, 
created by the backup block job, and also I didn't test whether this works; probably 
some op-blockers or permissions need to be tuned. So, let it be step 2:

2. Test that we can start a backup job with source = (target of backup-top filter), so 
that we have "push backup with fleecing". Make an option for backup to start 
without a filter, when we don't need copy-before-write operations, to not create extra