Re: [PATCH 2/2] nbd: Add new qemu:joint-allocation metadata context

2021-06-14 Thread Nir Soffer
On Mon, Jun 14, 2021 at 4:56 PM Eric Blake  wrote:
>
> On Sat, Jun 12, 2021 at 02:39:44AM +0300, Nir Soffer wrote:
> > Since this change is not simple, and the chance that we also get the dirty
> > bitmap included in the result seems to be very low, I decided to check the
> > direction of merging multiple extents.
> >
> > I started with merging "base:allocation" and "qemu:dirty-bitmap:xxx" since
> > we already have both. It was not hard to do, although it is not completely
> > tested yet.
> >
> > Here is the merging code:
> > https://gerrit.ovirt.org/c/ovirt-imageio/+/115216/1/daemon/ovirt_imageio/_internal/nbdutil.py
> >
> > To make merging easy and safe, we map the NBD_STATE_DIRTY bit to a private bit
> > so it cannot clash with the NBD_STATE_HOLE bit:
> > https://gerrit.ovirt.org/c/ovirt-imageio/+/115215/1/daemon/ovirt_imageio/_internal/nbd.py
> >
> > Here is a functional test using qemu-nbd showing that it works:
> > https://gerrit.ovirt.org/c/ovirt-imageio/+/115216/1/daemon/test/client_test.py
> >
> > I'll try to use "qemu:allocation-depth" in a similar way next week, probably
> > mapping depth > 0 to EXTENT_EXISTS, to use when reporting holes in
> > single qcow2 images.
> >
> > If this is successful, we can start using this in the next ovirt release, and we
> > don't need "qemu:joint-allocation".
>
> That's nice to know.  So at this point, we'll drop the patch on
> qemu:joint-allocation, and instead focus on the patch that improves
> qemu-img map output to make it easier to use in the same way that
> qemu:allocation-depth is.

I can confirm that everything looks good on our side so far, thanks!




Re: [PATCH 2/2] nbd: Add new qemu:joint-allocation metadata context

2021-06-14 Thread Eric Blake
On Sat, Jun 12, 2021 at 02:39:44AM +0300, Nir Soffer wrote:
> Since this change is not simple, and the chance that we also get the dirty
> bitmap included in the result seems to be very low, I decided to check the
> direction of merging multiple extents.
> 
> I started with merging "base:allocation" and "qemu:dirty-bitmap:xxx" since
> we already have both. It was not hard to do, although it is not completely
> tested yet.
> 
> Here is the merging code:
> https://gerrit.ovirt.org/c/ovirt-imageio/+/115216/1/daemon/ovirt_imageio/_internal/nbdutil.py
> 
> To make merging easy and safe, we map the NBD_STATE_DIRTY bit to a private bit
> so it cannot clash with the NBD_STATE_HOLE bit:
> https://gerrit.ovirt.org/c/ovirt-imageio/+/115215/1/daemon/ovirt_imageio/_internal/nbd.py
> 
> Here is a functional test using qemu-nbd showing that it works:
> https://gerrit.ovirt.org/c/ovirt-imageio/+/115216/1/daemon/test/client_test.py
> 
> I'll try to use "qemu:allocation-depth" in a similar way next week, probably
> mapping depth > 0 to EXTENT_EXISTS, to use when reporting holes in
> single qcow2 images.
> 
> If this is successful, we can start using this in the next ovirt release, and we
> don't need "qemu:joint-allocation".

That's nice to know.  So at this point, we'll drop the patch on
qemu:joint-allocation, and instead focus on the patch that improves
qemu-img map output to make it easier to use in the same way that
qemu:allocation-depth is.

-- 
Eric Blake, Principal Software Engineer
Red Hat, Inc.   +1-919-301-3266
Virtualization:  qemu.org | libvirt.org




Re: [PATCH 2/2] nbd: Add new qemu:joint-allocation metadata context

2021-06-11 Thread Nir Soffer
On Wed, Jun 9, 2021 at 9:01 PM Eric Blake  wrote:
>
> When trying to reconstruct a qcow2 chain using information provided
> over NBD, ovirt had been relying on an unsafe assumption that any
> portion of the qcow2 file advertised as sparse would defer to the
> backing image; this worked with what qemu 5.2 reports for a qcow2 BDS
> loaded with "backing":null.  However, in 6.0, commit 0da9856851 (nbd:
> server: Report holes for raw images) also had a side-effect of
> reporting unallocated zero clusters in qcow2 files as sparse.  This
> change is correct from the NBD spec perspective (advertising bits has
> always been optional based on how much information the server has
> available, and should only be used to optimize behavior when a bit is
> set, while not assuming semantics merely because a bit is clear), but
> means that a qcow2 file that uses an unallocated zero cluster to
> override a backing file now shows up as sparse over NBD, and causes
> ovirt to fail to reproduce that cluster (ie. ovirt was assuming it
> only had to write clusters where the bit was clear, and the 6.0
> behavior change shows the flaw in that assumption).
>
> The correct fix is for ovirt to additionally use the
> qemu:allocation-depth metadata context added in 5.2: after all, the
> actual determination for what is needed to recreate a qcow2 file is
> not whether a cluster is sparse, but whether the allocation-depth
> shows the cluster to be local.  But reproducing an image is more
> efficient when handling known-zero clusters, which means that ovirt
> has to track both base:allocation and qemu:allocation-depth metadata
> contexts simultaneously.  While NBD_CMD_BLOCK_STATUS is just fine
> sending back information for two contexts in parallel, it comes with
> some bookkeeping overhead at the client side: the two contexts need
> not report the same length of replies, and it involves more network
> traffic.

Since this change is not simple, and the chance that we also get the dirty
bitmap included in the result seems to be very low, I decided to check the
direction of merging multiple extents.

I started with merging "base:allocation" and "qemu:dirty-bitmap:xxx" since
we already have both. It was not hard to do, although it is not completely
tested yet.

Here is the merging code:
https://gerrit.ovirt.org/c/ovirt-imageio/+/115216/1/daemon/ovirt_imageio/_internal/nbdutil.py

To make merging easy and safe, we map the NBD_STATE_DIRTY bit to a private bit
so it cannot clash with the NBD_STATE_HOLE bit:
https://gerrit.ovirt.org/c/ovirt-imageio/+/115215/1/daemon/ovirt_imageio/_internal/nbd.py
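
Roughly, the idea looks like this (an illustrative sketch only, with made-up
names; the real code is in the nbdutil.py and nbd.py changes linked above):

NBD_STATE_HOLE = 1 << 0   # base:allocation
NBD_STATE_ZERO = 1 << 1   # base:allocation
NBD_STATE_DIRTY = 1 << 0  # qemu:dirty-bitmap:xxx (same bit as HOLE!)

EXTENT_DIRTY = 1 << 28    # private bit used internally after remapping
EXTENT_EXISTS = 1 << 29   # planned: depth > 0 from qemu:allocation-depth

def remap(context, flags):
    """Translate per-context flags into one shared flag space."""
    if context == "base:allocation":
        return flags                      # HOLE/ZERO kept as-is
    if context.startswith("qemu:dirty-bitmap:"):
        return EXTENT_DIRTY if flags & NBD_STATE_DIRTY else 0
    if context == "qemu:allocation-depth":
        return EXTENT_EXISTS if flags > 0 else 0
    raise ValueError(f"unexpected context {context}")

# After remapping, flags from different contexts can simply be OR'ed together,
# e.g. a dirty, unallocated zero cluster:
flags = (remap("base:allocation", NBD_STATE_HOLE | NBD_STATE_ZERO)
         | remap("qemu:dirty-bitmap:bitmap0", NBD_STATE_DIRTY)
         | remap("qemu:allocation-depth", 0))
print(hex(flags))  # 0x10000003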

Here is a functional test using qemu-nbd showing that it works:
https://gerrit.ovirt.org/c/ovirt-imageio/+/115216/1/daemon/test/client_test.py

I'll try to use "qemu:allocation-depth" in a similar way next week, probably
mapping depth > 0 to EXTENT_EXISTS, to use when reporting holes in
single qcow2 images.

If this is successful, we can start using this in the next ovirt release, and we
don't need "qemu:joint-allocation".

Nir




Re: [PATCH 2/2] nbd: Add new qemu:joint-allocation metadata context

2021-06-10 Thread Vladimir Sementsov-Ogievskiy

10.06.2021 17:04, Eric Blake wrote:

Maybe the thing to do is improve the documentation and try to avoid
ambiguous terminology; in qemu:allocation-depth, a return of depth 0
should be called "absent", not "unallocated".  And in libnbd, a
base:allocation of 0 should be "data" or "normal", not "allocated".


It is interesting how many problems, misunderstandings and confusion we have had
over the years because of that terminology :)

It is funny that we try to imagine what to call these things in general, when in
99.99% of cases we are actually talking about only 5 simple things:

file-posix data
file-posix hole
qcow2 DATA
qcow2 ZERO
qcow2 UNALLOCATED

And all our problems come from trying to divide these things into two
categories: allocated/unallocated. But that never worked.

I'd divide like this:

DATA
  examples:
  - data cluster in qcow2
  - data region in file-posix
  properties:
  - data actually occupies space on disk
  - io operations are handled by this layer, backing is shadowed
  - write should not increase disk occupation

GO_TO_BACKING
  examples:
  - "unallocated" cluster in qcow2
  properties:
  - read from the backing image (if no backing, read zeroes)
  - disk occupation status is known only by the backing image (if no backing,
    the disk is not occupied)
  - write will allocate a new cluster in the top image, which will increase
    disk occupation

ZERO
  examples:
  - zero cluster in qcow2, no space is occupied (most probably), reads as zeroes
  - file-posix hole, no space is occupied (most probably), reads as zeroes
  properties:
  - read zeroes
  - io operations are handled by this layer, backing is shadowed
  - no space is occupied (most probably)
  - write should not increase disk occupation (most probably)
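
A tiny illustration of how a client could represent these three categories
(hypothetical names only, not a qemu or NBD API):

from enum import Enum

class Cluster(Enum):
    DATA = "data"                    # handled here, occupies space
    ZERO = "zero"                    # handled here, reads as zeroes
    GO_TO_BACKING = "go_to_backing"  # defer to backing (zeroes if no backing)

def classify(handled_here, reads_zero):
    """Classify a cluster of the top layer from two facts about it."""
    if not handled_here:
        return Cluster.GO_TO_BACKING
    return Cluster.ZERO if reads_zero else Cluster.DATA

print(classify(handled_here=True, reads_zero=False))   # Cluster.DATA
print(classify(handled_here=True, reads_zero=True))    # Cluster.ZERO
print(classify(handled_here=False, reads_zero=False))  # Cluster.GO_TO_BACKING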


We could also consider qcow2 ALLOCATED_ZERO, and maybe SCSI unallocated, which
means that nothing is occupied but a read doesn't guarantee zeroes. But that
doesn't really matter. What does matter is that trying to describe qcow2
backing files in the usual block terms (allocated/unallocated, zero/data) has
never worked well. So in good documentation (and good code) we should describe
(and handle) qcow2 backing chains as qcow2 backing chains, and not try to hide
them behind the usual terminology.

--
Best regards,
Vladimir



Re: [PATCH 2/2] nbd: Add new qemu:joint-allocation metadata context

2021-06-10 Thread Vladimir Sementsov-Ogievskiy

10.06.2021 16:47, Eric Blake wrote:

On Thu, Jun 10, 2021 at 03:30:17PM +0300, Vladimir Sementsov-Ogievskiy wrote:

The correct fix is for ovirt to additionally use the
qemu:allocation-depth metadata context added in 5.2: after all, the
actual determination for what is needed to recreate a qcow2 file is
not whether a cluster is sparse, but whether the allocation-depth
shows the cluster to be local.  But reproducing an image is more
efficient when handling known-zero clusters, which means that ovirt
has to track both base:allocation and qemu:allocation-depth metadata
contexts simultaneously.  While NBD_CMD_BLOCK_STATUS is just fine
sending back information for two contexts in parallel, it comes with
some bookkeeping overhead at the client side: the two contexts need
not report the same length of replies, and it involves more network
traffic.


Aren't both contexts described in one reply? Or what do you mean by not the
same length?


The example file demonstrates this.  We have:

base.raw    ABC-
top.qcow2   -D0-
guest sees  AD00

Querying base:allocation returns:
      0   65536  3  hole,zero
  65536   65536  0  allocated
 131072  131072  3  hole,zero

Querying qemu:allocation-depth returns:
      0   65536  0  unallocated
  65536  131072  1  local
 196608   65536  0  unallocated


Hmm, right. Sorry, I forgot how BLOCK_STATUS works for several contexts. I
thought they were combined.



That is, the query starting at 64k returns different lengths (64k for
base:allocation, 128k for qemu:allocation-depth), and the client has
to process the smaller of the two regions before moving on to the next
query.  But if the client then does a query starting at 128k, it
either has to remember that it previously has information available
from the earlier qemu:allocation-depth, or repeats efforts over the
wire.


Hmm.. but if we are going to combine contexts in qemu, we face the same
problem, as the sources of the contexts may return information in different
chunks, so we'll have to cache something or query the same thing twice. But
yes, at least we avoid doing it over the net.



The joy of having a single metadata context return both pieces of
information at once is that the client no longer has to do this
cross-correlation between the differences in extent lengths of the
parallel contexts.


We discussed in the past the option to expose also the dirty status of every
block in the response. Again this info is available using
"qemu:dirty-bitmap:xxx"
but just like allocation depth and base allocation, merging the results is hard
and if we could expose also the dirty bit, this can make clients life
even better.
In this case I'm not sure "qemu:allocation" is the best name, maybe something
more generic like "qemu:extents" or "qemu:block-status" is even better.



Oops. Could you please describe what the problem is with parsing several
contexts simultaneously?


There is no inherent technical problem, just extra work.  Joining the
work at the server side is less coding effort than recoding the
boilerplate to join the work at every single client side.  And the
information is already present.  So we could just scrap this entire
RFC by stating that the information is already available, and it is
not worth qemu's effort to provide the convenience context.

Joining base:allocation and qemu:allocation-depth was easy - in fact,
since both use bdrv_block_status under the hood, we could (and
probably should!) merge it into a single qemu query.  But joining
base:allocation and qemu:dirty-bitmap:FOO will be harder, at which
point I question whether it is worth the complications.  And if you
argue that a joint context is not worthwhile without dirty bitmap(s)
being part of that joint context, then maybe this RFC is too complex
to worry about, and we should just leave the cross-correlation of
parallel contexts to be client-side, after all.




This all sounds to me as if we are going to implement "joint" combined contexts
for every useful combination of existing contexts that a user wants. So it's a
kind of workaround for an inconvenient protocol we invented in the past.

Doesn't that mean we should instead rework how we export several contexts?
Maybe we can improve the generic export of several contexts simultaneously, so
that it is convenient for the client? Then we don't need any additional
combined contexts.


The NBD protocol intentionally left wiggle room for servers to report
different extent lengths across different contexts.  But other than
qemu, I don't know of any other NBD servers advertising alternate
contexts.  If we think we can reasonably restrict the NBD protocol to
require that any server sending parallel contexts to a client MUST use
the same extent lengths for all parallel contexts (clients still have
to read multiple contexts, but the cross-correlation becomes easier
because the client doesn't have to worry about length mismatches), and
code that up in 

Re: [PATCH 2/2] nbd: Add new qemu:joint-allocation metadata context

2021-06-10 Thread Eric Blake
On Thu, Jun 10, 2021 at 04:16:27PM +0300, Nir Soffer wrote:
> On Thu, Jun 10, 2021 at 2:52 AM Nir Soffer  wrote:
> >
> > On Wed, Jun 9, 2021 at 9:01 PM Eric Blake  wrote:
> 
> I posted a work-in-progress patch implementing support for
> qemu:joint-allocation in oVirt:
> https://gerrit.ovirt.org/c/ovirt-imageio/+/115197
> 
> The most important part is the nbd client:
> https://gerrit.ovirt.org/c/ovirt-imageio/+/115197/1/daemon/ovirt_imageio/_internal/nbd.py
> 
> With this our tests pass with qemu-nbd built with Eric's patch:
> https://gerrit.ovirt.org/c/ovirt-imageio/+/115197/1/daemon/test/client_test.py

Note that you _don't_ have to use qemu:joint-allocation; you could get
the same information by using the already-existing
qemu:allocation-depth.  But your patch DOES make it obvious that it
was a lot simpler to use a single context (the bulk of your tweak is
trying an additional NBD_OPT_SET_META_CONTEXT), than it would be to
handle parallel contexts (where you would have to tweak your
NBD_CMD_BLOCK_STATUS handler to cross-correlate information).

> 
> We may need to use qemu:joint-allocation only for qcow2 images, and
> base:allocation for raw images, because allocation depth reporting is not
> correct for raw images. Since

Allocation depth (aka how deep in the backing chain is data allocated)
IS correct for raw images (the raw image IS the source of all data);
it is just not useful.

> we control the qemu-nbd in both cases this should not be an issue. But it
> would be better if allocation depth worked for any kind of image, so we could
> always use qemu:joint-allocation.

Maybe the thing to do is improve the documentation and try to avoid
ambiguous terminology; in qemu:allocation-depth, a return of depth 0
should be called "absent", not "unallocated".  And in libnbd, a
base:allocation of 0 should be "data" or "normal", not "allocated".
But I don't see how qemu will ever advertise an allocation depth of 0
for a raw image, because raw images have no backing chain to defer to.
If you are trying to recreate a sparse raw image, then you really do
want to pay attention to NBD_STATE_HOLE and NBD_STATE_ZERO, as those
are more accurate pictures of the properties of extents within the raw
file.
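
For example, a copy loop for a raw destination could look roughly like this
(a sketch only; "client.read" is a placeholder for whatever NBD read call the
client library provides, and the destination is assumed to start out as a
sparse, all-zero file):

NBD_STATE_HOLE = 1 << 0
NBD_STATE_ZERO = 1 << 1

def copy_raw(client, out_file, extents):
    """extents: list of (offset, length, flags) from base:allocation."""
    for offset, length, flags in extents:
        if flags & NBD_STATE_ZERO:
            if flags & NBD_STATE_HOLE:
                continue                    # keep the destination sparse
            out_file.seek(offset)
            out_file.write(b"\0" * length)  # allocated but reads as zero
        else:
            out_file.seek(offset)
            out_file.write(client.read(offset, length))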

-- 
Eric Blake, Principal Software Engineer
Red Hat, Inc.   +1-919-301-3266
Virtualization:  qemu.org | libvirt.org




Re: [PATCH 2/2] nbd: Add new qemu:joint-allocation metadata context

2021-06-10 Thread Eric Blake
On Thu, Jun 10, 2021 at 03:30:17PM +0300, Vladimir Sementsov-Ogievskiy wrote:
> > > The correct fix is for ovirt to additionally use the
> > > qemu:allocation-depth metadata context added in 5.2: after all, the
> > > actual determination for what is needed to recreate a qcow2 file is
> > > not whether a cluster is sparse, but whether the allocation-depth
> > > shows the cluster to be local.  But reproducing an image is more
> > > efficient when handling known-zero clusters, which means that ovirt
> > > has to track both base:allocation and qemu:allocation-depth metadata
> > > contexts simultaneously.  While NBD_CMD_BLOCK_STATUS is just fine
> > > sending back information for two contexts in parallel, it comes with
> > > some bookkeeping overhead at the client side: the two contexts need
> > > not report the same length of replies, and it involves more network
> > > traffic.
> 
> Aren't both contexts described in one reply? Or what do you mean by not the
> same length?

The example file demonstrates this.  We have:

base.raw    ABC-
top.qcow2   -D0-
guest sees  AD00

Querying base:allocation returns:
      0   65536  3  hole,zero
  65536   65536  0  allocated
 131072  131072  3  hole,zero

Querying qemu:allocation-depth returns:
      0   65536  0  unallocated
  65536  131072  1  local
 196608   65536  0  unallocated

That is, the query starting at 64k returns different lengths (64k for
base:allocation, 128k for qemu:allocation-depth), and the client has
to process the smaller of the two regions before moving on to the next
query.  But if the client then does a query starting at 128k, it
either has to remember that it previously has information available
from the earlier qemu:allocation-depth, or repeats efforts over the
wire.

The joy of having a single metadata context return both pieces of
information at once is that the client no longer has to do this
cross-correlation between the differences in extent lengths of the
parallel contexts.
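
For comparison, the client-side cross-correlation looks roughly like this (a
sketch with hypothetical helpers, not libnbd API): walk both per-context
extent lists, split at every boundary, and remember the unconsumed remainder
instead of re-querying the server.

def correlate(base_alloc, alloc_depth):
    """Both arguments are lists of (length, value) covering the same range."""
    out = []
    offset = 0
    a, b = list(base_alloc), list(alloc_depth)
    while a and b:
        (len_a, flags), (len_b, depth) = a[0], b[0]
        length = min(len_a, len_b)       # process the smaller region first
        out.append((offset, length, flags, depth))
        offset += length
        a[0] = (len_a - length, flags)   # keep leftover instead of re-reading
        b[0] = (len_b - length, depth)
        if a[0][0] == 0:
            a.pop(0)
        if b[0][0] == 0:
            b.pop(0)
    return out

# The example above: the 128k qemu:allocation-depth extent at 64k is split
# against the two smaller base:allocation extents.
base_alloc = [(65536, 3), (65536, 0), (131072, 3)]
alloc_depth = [(65536, 0), (131072, 1), (65536, 0)]
for extent in correlate(base_alloc, alloc_depth):
    print(extent)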

> > We discussed in the past the option to expose also the dirty status of every
> > block in the response. Again this info is available using
> > "qemu:dirty-bitmap:xxx"
> > but just like allocation depth and base allocation, merging the results is 
> > hard
> > and if we could expose also the dirty bit, this can make clients life
> > even better.
> > In this case I'm not sure "qemu:allocation" is the best name, maybe 
> > something
> > more generic like "qemu:extents" or "qemu:block-status" is even better.
> > 
> 
> Oops. Could you please describe what the problem is with parsing several
> contexts simultaneously?

There is no inherent technical problem, just extra work.  Joining the
work at the server side is less coding effort than recoding the
boilerplate to join the work at every single client side.  And the
information is already present.  So we could just scrap this entire
RFC by stating that the information is already available, and it is
not worth qemu's effort to provide the convenience context.

Joining base:allocation and qemu:allocation-depth was easy - in fact,
since both use bdrv_block_status under the hood, we could (and
probably should!) merge it into a single qemu query.  But joining
base:allocation and qemu:dirty-bitmap:FOO will be harder, at which
point I question whether it is worth the complications.  And if you
argue that a joint context is not worthwhile without dirty bitmap(s)
being part of that joint context, then maybe this RFC is too complex
to worry about, and we should just leave the cross-correlation of
parallel contexts to be client-side, after all.


> 
> This all sounds to me as if we are going to implement "joint" combined
> contexts for every useful combination of existing contexts that a user wants.
> So it's a kind of workaround for an inconvenient protocol we invented in the
> past.
> 
> Doesn't that mean we should instead rework how we export several contexts?
> Maybe we can improve the generic export of several contexts simultaneously,
> so that it is convenient for the client? Then we don't need any additional
> combined contexts.

The NBD protocol intentionally left wiggle room for servers to report
different extent lengths across different contexts.  But other than
qemu, I don't know of any other NBD servers advertising alternate
contexts.  If we think we can reasonably restrict the NBD protocol to
require that any server sending parallel contexts to a client MUST use
the same extent lengths for all parallel contexts (clients still have
to read multiple contexts, but the cross-correlation becomes easier
because the client doesn't have to worry about length mismatches), and
code that up in qemu, that's also something we can consider.

Or maybe even have it be an opt-in, where a client requests
NBD_OPT_ALIGN_META_CONTEXT; if the server acknowledges that option,
the client knows that it can request parallel NBD_OPT_SET_META_CONTEXT
and the extents replied to each 

Re: [PATCH 2/2] nbd: Add new qemu:joint-allocation metadata context

2021-06-10 Thread Eric Blake
On Thu, Jun 10, 2021 at 02:52:10AM +0300, Nir Soffer wrote:
> > So, as a convenience, we can provide yet another metadata context,
> > "qemu:joint-allocation", which provides the bulk of the same
> > information already available from using "base:allocation" and
> > "qemu:allocation-depth" in parallel; the only difference is that an
> > allocation depth larger than one is collapsed to a single bit, rather
> > than remaining an integer representing actual depth.  By connecting to
> > just this context, a client has less work to perform while still
> > getting at all pieces of information needed to recreate a qcow2
> > backing chain.
> 
> Providing extended allocation is awesome, and makes the client's life much
> easier. But I'm not sure about the name, that comes from "joining"
> "base:allocation" and "qemu:allocation-depth". This is correct when
> thinking about qemu internals, but this is not really getting both, since
> "qemu:allocation-depth" is reduced to local and backing.
> 
> From a client point of view, I think this is best described as
> "qemu:allocation", which is an extension to the NBD protocol, providing the
> same HOLE and ZERO bits, and the qemu-specific info LOCAL, BACKING. Using a
> different "namespace" ("qemu" vs "base") makes it clear that this is not the
> same.
> 
> We discussed in the past the option to expose also the dirty status of every
> block in the response. Again this info is available using
> "qemu:dirty-bitmap:xxx"
> but just like allocation depth and base allocation, merging the results is 
> hard
> and if we could expose also the dirty bit, this can make clients life
> even better.
> In this case I'm not sure "qemu:allocation" is the best name, maybe something
> more generic like "qemu:extents" or "qemu:block-status" is even better.

Yeah, I'm happy to bike-shed the name.

Note that dirty-bitmap tracking is harder to merge in: qemu supports
exposing more than one dirty bitmap context at once, but where each
context uses the same bit (bit 0, which overlaps with NBD_STATE_HOLE).
We'd have to decide whether we support merging only a single bitmap,
or whether a user can merge in up to 28 bitmaps (where the order in
which they specify dirty bitmaps as part of the
qemu:mega:bitmap1:bitmap2:... context name matters).
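
(Purely hypothetical, to make the ordering point concrete: with bits 0-3
already taken, such a context could assign each named bitmap the next free
bit in the order it appears in the context string.)

def bitmap_bits(context):
    """Map each bitmap in 'qemu:mega:bitmap1:bitmap2:...' to a bit value."""
    names = context.split(":")[2:]   # drop the "qemu:mega" prefix
    if len(names) > 28:
        raise ValueError("at most 28 bitmaps fit in a 32-bit flags field")
    return {name: 1 << (4 + i) for i, name in enumerate(names)}

print(bitmap_bits("qemu:mega:bitmap1:bitmap2"))
# {'bitmap1': 16, 'bitmap2': 32}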

And if we decide that merging in even a single dirty bitmap is too
difficult, then clients already have to correlate a dirty bitmap with
everything else, at which point correlating 3 contexts (dirty bitmap
with base:allocation and qemu:allocation-depth) is no harder than
correlating 2 contexts (dirty bitmap with qemu:joint-allocation or
whatever we name it).

> 
> > With regards to exposing this new feature from qemu as NBD server, it
> > is sufficient to reuse the existing 'qemu-nbd -A': since that already
> > exposes allocation depth, it does not hurt to advertise two separate
> > qemu:XXX metadata contexts at once for two different views of
> > allocation depth.  And just because the server supports multiple
> > contexts does not mean a client will want or need to connect to
> > everything available.  On the other hand, the existing hack of using
> > the qemu NBD client option of x-dirty-bitmap to select an alternative
> > context from the client does NOT make it possible to read the extra
> > information exposed by the new metadata context.  For now, you MUST
> > use something like libnbd's 'nbdinfo --map=qemu:joint-allocation' in
> > order to properly see all four bits in action:
> 
> Makes sense.
> 
> > # Create a qcow2 image with a raw backing file:
> > $ qemu-img create base.raw $((4*64*1024))
> > $ qemu-img create -f qcow2 -b base.raw -F raw top.qcow2
> >
> > # Write to first 3 clusters of base:
> > $ qemu-io -f raw -c "w -P 65 0 64k" -c "w -P 66 64k 64k" \
> >   -c "w -P 67 128k 64k" base.raw
> >
> > # Write to second and third clusters of top, hiding base:
> > $ qemu-io -f qcow2 -c "w -P 69 64k 64k" -c "w -z 128k 64k" top.qcow2
> 
> Looks familiar but nicer :-)

Yeah, I just compressed your example.

> 
> > # Expose top.qcow2 without backing file over NBD
> > $ ./qemu-nbd -r -t -f qcow2 -A 'json:{"driver":"qcow2", "backing":null, 
> > \
> >   "file":{"driver":"file", "filename":"top.qcow2"}}'
> > $ nbdinfo --map=qemu:joint-allocation nbd://localhost
> >      0   65536  3
> >  65536   65536  4
> > 131072   65536  7
> > 196608   65536  3
> 
> Using the libnbd patch this shows:
> 
> $ ./nbdinfo --map="qemu:joint-allocation" nbd://localhost
>      0   65536  3  hole,zero,unallocated
>  65536   65536  4  allocated,local
> 131072   65536  7  hole,zero,local
> 196608   65536  3  hole,zero,unallocated
> 
> Looks good.
> 
> We need to convert this output to:
> 
> {"start": 0, "length": 65536, "zero": true, "hole": true},
> {"start": 65536, "length": 65536, "zero": false, "hole": false},
> {"start": 131072, "length": 65536, "zero": true, 

Re: [PATCH 2/2] nbd: Add new qemu:joint-allocation metadata context

2021-06-10 Thread Nir Soffer
On Thu, Jun 10, 2021 at 2:52 AM Nir Soffer  wrote:
>
> On Wed, Jun 9, 2021 at 9:01 PM Eric Blake  wrote:

I posted a work-in-progress patch implementing support for
qemu:joint-allocation in oVirt:
https://gerrit.ovirt.org/c/ovirt-imageio/+/115197

The most important part is the nbd client:
https://gerrit.ovirt.org/c/ovirt-imageio/+/115197/1/daemon/ovirt_imageio/_internal/nbd.py

With this our tests pass with qemu-nbd built with Eric's patch:
https://gerrit.ovirt.org/c/ovirt-imageio/+/115197/1/daemon/test/client_test.py

We may need to use qemu:joint-allocation only for qcow2 images, and
base:allocation for raw images, because allocation depth reporting is not
correct for raw images. Since we control the qemu-nbd in both cases this
should not be an issue. But it would be better if allocation depth worked for
any kind of image, so we could always use qemu:joint-allocation.

Nir




Re: [PATCH 2/2] nbd: Add new qemu:joint-allocation metadata context

2021-06-10 Thread Vladimir Sementsov-Ogievskiy

10.06.2021 02:52, Nir Soffer wrote:

On Wed, Jun 9, 2021 at 9:01 PM Eric Blake  wrote:

When trying to reconstruct a qcow2 chain using information provided
over NBD, ovirt had been relying on an unsafe assumption that any
portion of the qcow2 file advertised as sparse would defer to the
backing image; this worked with what qemu 5.2 reports for a qcow2 BDS
loaded with "backing":null.  However, in 6.0, commit 0da9856851 (nbd:
server: Report holes for raw images) also had a side-effect of
reporting unallocated zero clusters in qcow2 files as sparse.  This
change is correct from the NBD spec perspective (advertising bits has
always been optional based on how much information the server has
available, and should only be used to optimize behavior when a bit is
set, while not assuming semantics merely because a bit is clear), but
means that a qcow2 file that uses an unallocated zero cluster to
override a backing file now shows up as sparse over NBD, and causes
ovirt to fail to reproduce that cluster (ie. ovirt was assuming it
only had to write clusters where the bit was clear, and the 6.0
behavior change shows the flaw in that assumption).

The correct fix is for ovirt to additionally use the
qemu:allocation-depth metadata context added in 5.2: after all, the
actual determination for what is needed to recreate a qcow2 file is
not whether a cluster is sparse, but whether the allocation-depth
shows the cluster to be local.  But reproducing an image is more
efficient when handling known-zero clusters, which means that ovirt
has to track both base:allocation and qemu:allocation-depth metadata
contexts simultaneously.  While NBD_CMD_BLOCK_STATUS is just fine
sending back information for two contexts in parallel, it comes with
some bookkeeping overhead at the client side: the two contexts need
not report the same length of replies, and it involves more network
traffic.


Aren't both contexts described in one reply? Or what do you mean by not the
same length?



So, as a convenience, we can provide yet another metadata context,
"qemu:joint-allocation", which provides the bulk of the same
information already available from using "base:allocation" and
"qemu:allocation-depth" in parallel; the only difference is that an
allocation depth larger than one is collapsed to a single bit, rather
than remaining an integer representing actual depth.  By connecting to
just this context, a client has less work to perform while still
getting at all pieces of information needed to recreate a qcow2
backing chain.

Providing extended allocation is awesome, and makes the client's life much
easier. But I'm not sure about the name, that comes from "joining"
"base:allocation" and "qemu:allocation-depth". This is correct when
thinking about qemu internals, but this is not really getting both, since
"qemu:allocation-depth" is reduced to local and backing.

From a client point of view, I think this is best described as
"qemu:allocation", which is an extension to the NBD protocol, providing the
same HOLE and ZERO bits, and the qemu-specific info LOCAL, BACKING. Using a
different "namespace" ("qemu" vs "base") makes it clear that this is not the
same.

We discussed in the past the option to expose also the dirty status of every
block in the response. Again this info is available using
"qemu:dirty-bitmap:xxx"
but just like allocation depth and base allocation, merging the results is hard
and if we could expose also the dirty bit, this can make clients life
even better.
In this case I'm not sure "qemu:allocation" is the best name, maybe something
more generic like "qemu:extents" or "qemu:block-status" is even better.



Oops. Could you please describe what the problem is with parsing several
contexts simultaneously?

This all sounds to me as if we are going to implement "joint" combined contexts
for every useful combination of existing contexts that a user wants. So it's a
kind of workaround for an inconvenient protocol we invented in the past.

Doesn't that mean we should instead rework how we export several contexts?
Maybe we can improve the generic export of several contexts simultaneously, so
that it is convenient for the client? Then we don't need any additional
combined contexts.

--
Best regards,
Vladimir



Re: [PATCH 2/2] nbd: Add new qemu:joint-allocation metadata context

2021-06-09 Thread Nir Soffer
On Wed, Jun 9, 2021 at 9:01 PM Eric Blake  wrote:
>
> When trying to reconstruct a qcow2 chain using information provided
> over NBD, ovirt had been relying on an unsafe assumption that any
> portion of the qcow2 file advertised as sparse would defer to the
> backing image; this worked with what qemu 5.2 reports for a qcow2 BDS
> loaded with "backing":null.  However, in 6.0, commit 0da9856851 (nbd:
> server: Report holes for raw images) also had a side-effect of
> reporting unallocated zero clusters in qcow2 files as sparse.  This
> change is correct from the NBD spec perspective (advertising bits has
> always been optional based on how much information the server has
> available, and should only be used to optimize behavior when a bit is
> set, while not assuming semantics merely because a bit is clear), but
> means that a qcow2 file that uses an unallocated zero cluster to
> override a backing file now shows up as sparse over NBD, and causes
> ovirt to fail to reproduce that cluster (ie. ovirt was assuming it
> only had to write clusters where the bit was clear, and the 6.0
> behavior change shows the flaw in that assumption).
>
> The correct fix is for ovirt to additionally use the
> qemu:allocation-depth metadata context added in 5.2: after all, the
> actual determination for what is needed to recreate a qcow2 file is
> not whether a cluster is sparse, but whether the allocation-depth
> shows the cluster to be local.  But reproducing an image is more
> efficient when handling known-zero clusters, which means that ovirt
> has to track both base:allocation and qemu:allocation-depth metadata
> contexts simultaneously.  While NBD_CMD_BLOCK_STATUS is just fine
> sending back information for two contexts in parallel, it comes with
> some bookkeeping overhead at the client side: the two contexts need
> not report the same length of replies, and it involves more network
> traffic.
>
> So, as a convenience, we can provide yet another metadata context,
> "qemu:joint-allocation", which provides the bulk of the same
> information already available from using "base:allocation" and
> "qemu:allocation-depth" in parallel; the only difference is that an
> allocation depth larger than one is collapsed to a single bit, rather
> than remaining an integer representing actual depth.  By connecting to
> just this context, a client has less work to perform while still
> getting at all pieces of information needed to recreate a qcow2
> backing chain.

Providing extended allocation is awesome, and makes the client's life much
easier. But I'm not sure about the name, that comes from "joining"
"base:allocation" and "qemu:allocation-depth". This is correct when
thinking about qemu internals, but this is not really getting both, since
"qemu:allocation-depth" is reduced to local and backing.

From a client point of view, I think this is best described as
"qemu:allocation", which is an extension to the NBD protocol, providing the
same HOLE and ZERO bits, and the qemu-specific info LOCAL, BACKING. Using a
different "namespace" ("qemu" vs "base") makes it clear that this is not the
same.

We discussed in the past the option to expose also the dirty status of every
block in the response. Again this info is available using
"qemu:dirty-bitmap:xxx"
but just like allocation depth and base allocation, merging the results is hard
and if we could expose also the dirty bit, this can make clients life
even better.
In this case I'm not sure "qemu:allocation" is the best name, maybe something
more generic like "qemu:extents" or "qemu:block-status" is even better.

> With regards to exposing this new feature from qemu as NBD server, it
> is sufficient to reuse the existing 'qemu-nbd -A': since that already
> exposes allocation depth, it does not hurt to advertise two separate
> qemu:XXX metadata contexts at once for two different views of
> allocation depth.  And just because the server supports multiple
> contexts does not mean a client will want or need to connect to
> everything available.  On the other hand, the existing hack of using
> the qemu NBD client option of x-dirty-bitmap to select an alternative
> context from the client does NOT make it possible to read the extra
> information exposed by the new metadata context.  For now, you MUST
> use something like libnbd's 'nbdinfo --map=qemu:joint-allocation' in
> order to properly see all four bits in action:

Makes sense.

> # Create a qcow2 image with a raw backing file:
> $ qemu-img create base.raw $((4*64*1024))
> $ qemu-img create -f qcow2 -b base.raw -F raw top.qcow2
>
> # Write to first 3 clusters of base:
> $ qemu-io -f raw -c "w -P 65 0 64k" -c "w -P 66 64k 64k" \
>   -c "w -P 67 128k 64k" base.raw
>
> # Write to second and third clusters of top, hiding base:
> $ qemu-io -f qcow2 -c "w -P 69 64k 64k" -c "w -z 128k 64k" top.qcow2

Looks familiar but nicer :-)

> # Expose top.qcow2 without backing file over NBD
> $ ./qemu-nbd -r -t -f qcow2 -A 

[PATCH 2/2] nbd: Add new qemu:joint-allocation metadata context

2021-06-09 Thread Eric Blake
When trying to reconstruct a qcow2 chain using information provided
over NBD, ovirt had been relying on an unsafe assumption that any
portion of the qcow2 file advertised as sparse would defer to the
backing image; this worked with what qemu 5.2 reports for a qcow2 BDS
loaded with "backing":null.  However, in 6.0, commit 0da9856851 (nbd:
server: Report holes for raw images) also had a side-effect of
reporting unallocated zero clusters in qcow2 files as sparse.  This
change is correct from the NBD spec perspective (advertising bits has
always been optional based on how much information the server has
available, and should only be used to optimize behavior when a bit is
set, while not assuming semantics merely because a bit is clear), but
means that a qcow2 file that uses an unallocated zero cluster to
override a backing file now shows up as sparse over NBD, and causes
ovirt to fail to reproduce that cluster (ie. ovirt was assuming it
only had to write clusters where the bit was clear, and the 6.0
behavior change shows the flaw in that assumption).

The correct fix is for ovirt to additionally use the
qemu:allocation-depth metadata context added in 5.2: after all, the
actual determination for what is needed to recreate a qcow2 file is
not whether a cluster is sparse, but whether the allocation-depth
shows the cluster to be local.  But reproducing an image is more
efficient when handling known-zero clusters, which means that ovirt
has to track both base:allocation and qemu:allocation-depth metadata
contexts simultaneously.  While NBD_CMD_BLOCK_STATUS is just fine
sending back information for two contexts in parallel, it comes with
some bookkeeping overhead at the client side: the two contexts need
not report the same length of replies, and it involves more network
traffic.

So, as a convenience, we can provide yet another metadata context,
"qemu:joint-allocation", which provides the bulk of the same
information already available from using "base:allocation" and
"qemu:allocation-depth" in parallel; the only difference is that an
allocation depth larger than one is collapsed to a single bit, rather
than remaining an integer representing actual depth.  By connecting to
just this context, a client has less work to perform while still
getting at all pieces of information needed to recreate a qcow2
backing chain.

With regards to exposing this new feature from qemu as NBD server, it
is sufficient to reuse the existing 'qemu-nbd -A': since that already
exposes allocation depth, it does not hurt to advertise two separate
qemu:XXX metadata contexts at once for two different views of
allocation depth.  And just because the server supports multiple
contexts does not mean a client will want or need to connect to
everything available.  On the other hand, the existing hack of using
the qemu NBD client option of x-dirty-bitmap to select an alternative
context from the client does NOT make it possible to read the extra
information exposed by the new metadata context.  For now, you MUST
use something like libnbd's 'nbdinfo --map=qemu:joint-allocation' in
order to properly see all four bits in action:

# Create a qcow2 image with a raw backing file:
$ qemu-img create base.raw $((4*64*1024))
$ qemu-img create -f qcow2 -b base.raw -F raw top.qcow2

# Write to first 3 clusters of base:
$ qemu-io -f raw -c "w -P 65 0 64k" -c "w -P 66 64k 64k" \
  -c "w -P 67 128k 64k" base.raw

# Write to second and third clusters of top, hiding base:
$ qemu-io -f qcow2 -c "w -P 69 64k 64k" -c "w -z 128k 64k" top.qcow2

# Expose top.qcow2 without backing file over NBD
$ ./qemu-nbd -r -t -f qcow2 -A 'json:{"driver":"qcow2", "backing":null, \
  "file":{"driver":"file", "filename":"top.qcow2"}}'
$ nbdinfo --map=qemu:joint-allocation nbd://localhost
      0   65536  3
  65536   65536  4
 131072   65536  7
 196608   65536  3

[This was output from nbdinfo 1.8.0; a later version will also add a
column to decode the bits into human-readable strings]
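
(Until then, a client can decode the flags column itself; a minimal sketch,
assuming the bit layout discussed in this series - bit 0 hole and bit 1 zero
as in base:allocation, bit 2 local and bit 3 backing for the collapsed
allocation depth - which is consistent with the values 3/4/7 above:)

NAMES = {1: "hole", 2: "zero", 4: "local", 8: "backing"}

def describe(flags):
    return ",".join(name for bit, name in NAMES.items() if flags & bit)

for offset, length, flags in [(0, 65536, 3), (65536, 65536, 4),
                              (131072, 65536, 7), (196608, 65536, 3)]:
    print(f"{offset:>7} {length:>7}  {flags}  {describe(flags)}")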

Additionally, later qemu patches may try to improve qemu-img to
automatically take advantage of additional NBD context information,
without having to use x-dirty-bitmap.

Reported-by: Nir Soffer 
Resolves: https://bugzilla.redhat.com/1968693
Signed-off-by: Eric Blake 
---
 docs/interop/nbd.txt  | 31 ++-
 docs/tools/qemu-nbd.rst   |  4 +-
 qapi/block-export.json|  4 +-
 include/block/nbd.h   | 10 ++-
 nbd/server.c  | 87 +--
 .../tests/nbd-qemu-allocation.out |  3 +-
 6 files changed, 125 insertions(+), 14 deletions(-)

diff --git a/docs/interop/nbd.txt b/docs/interop/nbd.txt
index 10ce098a29bf..cc8ce2d5389f 100644
--- a/docs/interop/nbd.txt
+++ b/docs/interop/nbd.txt
@@ -17,7 +17,7 @@ namespace "qemu".

 == "qemu" namespace ==

-The "qemu" namespace currently contains two