Re: [PATCH 2/2] nbd: Add new qemu:joint-allocation metadata context
On Mon, Jun 14, 2021 at 4:56 PM Eric Blake wrote: > > On Sat, Jun 12, 2021 at 02:39:44AM +0300, Nir Soffer wrote: > > Since this change is not simple, and the chance that we also get the dirty > > bitmap included in the result seems to be very low, I decided to check the > > direction of merging multiple extents. > > > > I started with merging "base:allocation" and "qemu:dirty-bitmap:xxx" since > > we already have both. It was not hard to do, although it is not completely > > tested yet. > > > > Here is the merging code: > > https://gerrit.ovirt.org/c/ovirt-imageio/+/115216/1/daemon/ovirt_imageio/_internal/nbdutil.py > > > > To make merging easy and safe, we map the NBD_STATE_DIRTY bit to a private > > bit > > so it cannot clash with the NBD_STATE_HOLE bit: > > https://gerrit.ovirt.org/c/ovirt-imageio/+/115215/1/daemon/ovirt_imageio/_internal/nbd.py > > > > Here is a functional test using qemu-nbd showing that it works: > > https://gerrit.ovirt.org/c/ovirt-imageio/+/115216/1/daemon/test/client_test.py > > > > I'll try to use "qemu:allocation-depth" in a similar way next week, probably > > mapping depth > 0 to EXTENT_EXISTS, to use when reporting holes in > > single qcow2 images. > > > > If this is successful, we can start using this in the next ovirt release, > > and we > > don't need "qemu:joint-allocation". > > That's nice to know. So at this point, we'll drop the patch on > qemu:joint-allocation, and instead focus on teh patch that improves > qemu-img map output to make it easier to use in the same way that > qemu:allocation-depth is. I can update that everything looks good on our side so far, thanks!
Re: [PATCH 2/2] nbd: Add new qemu:joint-allocation metadata context
On Sat, Jun 12, 2021 at 02:39:44AM +0300, Nir Soffer wrote: > Since this change is not simple, and the chance that we also get the dirty > bitmap included in the result seems to be very low, I decided to check the > direction of merging multiple extents. > > I started with merging "base:allocation" and "qemu:dirty-bitmap:xxx" since > we already have both. It was not hard to do, although it is not completely > tested yet. > > Here is the merging code: > https://gerrit.ovirt.org/c/ovirt-imageio/+/115216/1/daemon/ovirt_imageio/_internal/nbdutil.py > > To make merging easy and safe, we map the NBD_STATE_DIRTY bit to a private bit > so it cannot clash with the NBD_STATE_HOLE bit: > https://gerrit.ovirt.org/c/ovirt-imageio/+/115215/1/daemon/ovirt_imageio/_internal/nbd.py > > Here is a functional test using qemu-nbd showing that it works: > https://gerrit.ovirt.org/c/ovirt-imageio/+/115216/1/daemon/test/client_test.py > > I'll try to use "qemu:allocation-depth" in a similar way next week, probably > mapping depth > 0 to EXTENT_EXISTS, to use when reporting holes in > single qcow2 images. > > If this is successful, we can start using this in the next ovirt release, and > we > don't need "qemu:joint-allocation". That's nice to know. So at this point, we'll drop the patch on qemu:joint-allocation, and instead focus on teh patch that improves qemu-img map output to make it easier to use in the same way that qemu:allocation-depth is. -- Eric Blake, Principal Software Engineer Red Hat, Inc. +1-919-301-3266 Virtualization: qemu.org | libvirt.org
Re: [PATCH 2/2] nbd: Add new qemu:joint-allocation metadata context
On Wed, Jun 9, 2021 at 9:01 PM Eric Blake wrote: > > When trying to reconstruct a qcow2 chain using information provided > over NBD, ovirt had been relying on an unsafe assumption that any > portion of the qcow2 file advertised as sparse would defer to the > backing image; this worked with what qemu 5.2 reports for a qcow2 BSD > loaded with "backing":null. However, in 6.0, commit 0da9856851 (nbd: > server: Report holes for raw images) also had a side-effect of > reporting unallocated zero clusters in qcow2 files as sparse. This > change is correct from the NBD spec perspective (advertising bits has > always been optional based on how much information the server has > available, and should only be used to optimize behavior when a bit is > set, while not assuming semantics merely because a bit is clear), but > means that a qcow2 file that uses an unallocated zero cluster to > override a backing file now shows up as sparse over NBD, and causes > ovirt to fail to reproduce that cluster (ie. ovirt was assuming it > only had to write clusters where the bit was clear, and the 6.0 > behavior change shows the flaw in that assumption). > > The correct fix is for ovirt to additionally use the > qemu:allocation-depth metadata context added in 5.2: after all, the > actual determination for what is needed to recreate a qcow2 file is > not whether a cluster is sparse, but whether the allocation-depth > shows the cluster to be local. But reproducing an image is more > efficient when handling known-zero clusters, which means that ovirt > has to track both base:allocation and qemu:allocation-depth metadata > contexts simultaneously. While NBD_CMD_BLOCK_STATUS is just fine > sending back information for two contexts in parallel, it comes with > some bookkeeping overhead at the client side: the two contexts need > not report the same length of replies, and it involves more network > traffic. Since this change is not simple, and the chance that we also get the dirty bitmap included in the result seems to be very low, I decided to check the direction of merging multiple extents. I started with merging "base:allocation" and "qemu:dirty-bitmap:xxx" since we already have both. It was not hard to do, although it is not completely tested yet. Here is the merging code: https://gerrit.ovirt.org/c/ovirt-imageio/+/115216/1/daemon/ovirt_imageio/_internal/nbdutil.py To make merging easy and safe, we map the NBD_STATE_DIRTY bit to a private bit so it cannot clash with the NBD_STATE_HOLE bit: https://gerrit.ovirt.org/c/ovirt-imageio/+/115215/1/daemon/ovirt_imageio/_internal/nbd.py Here is a functional test using qemu-nbd showing that it works: https://gerrit.ovirt.org/c/ovirt-imageio/+/115216/1/daemon/test/client_test.py I'll try to use "qemu:allocation-depth" in a similar way next week, probably mapping depth > 0 to EXTENT_EXISTS, to use when reporting holes in single qcow2 images. If this is successful, we can start using this in the next ovirt release, and we don't need "qemu:joint-allocation". Nir
Re: [PATCH 2/2] nbd: Add new qemu:joint-allocation metadata context
10.06.2021 17:04, Eric Blake wrote: Maybe the thing to do is improve the documentation and try to avoid ambiguous terminalogy; in qemu:allocation-depth, a return of depth 0 should be called "absent", not "unallocated". And in libnbd, a base:allocation of 0 should be "data" or "normal", not "allocated". Interesting, how many problems, misunderstanding and confusion we have for years because of that terminology :) Funny, that we try to imagine how to call these thing in general, but actually in 99.99% cases we are saying about only 5 simple things: file-posix data file-posix hole qcow2 DATA qcow2 ZERO qcow2 UNALLOCATED And all our problems comes from our trying to divide these thing into two categories: allocated/unallocated. But it never worked. I'd divide like this: DATA examples: - data cluster in qcow2 - data region in file-posix properties: - data actually occupies space on disk - io operations are handled by this layer, backing is shadowed - write should not increase disk occupation GO_TO_BACKING examples: - "unallocated" cluster in qcow2 properties - read from backing image (if no backing, read zeroes) - disk occupation status is known only by backing image (if no backing, disk is not occupied) - write will allocate new cluster in top image, which will increase disk occupation ZERO examples: - zero cluster in qcow2, no space is occupied (most probably), reads as zeroes - file-posix hole, no space is occupied (most probably), reads as zeroes properties: - read zeroes - io operations are handled by this layer, backing is shadowed - no space is occupied (most probably) - write should not increase disk occupation (most probably) We can consider qcow2 ALLOCATED_ZERO also, and maybe SCSI unallocated which means that nothing is occupied but read doesn't guarantee zeroes.. But that doesn't really matter. What does matter is that trying to describe qcow2 backing files in usual block terms allocated/unallocated zero/data never worked good. So in a good documentation (and good code) we should describe (and handle) qcow2 backing chains as qcow2 backing chains and don't try to shadow them under usual terminology. -- Best regards, Vladimir
Re: [PATCH 2/2] nbd: Add new qemu:joint-allocation metadata context
10.06.2021 16:47, Eric Blake wrote: On Thu, Jun 10, 2021 at 03:30:17PM +0300, Vladimir Sementsov-Ogievskiy wrote: The correct fix is for ovirt to additionally use the qemu:allocation-depth metadata context added in 5.2: after all, the actual determination for what is needed to recreate a qcow2 file is not whether a cluster is sparse, but whether the allocation-depth shows the cluster to be local. But reproducing an image is more efficient when handling known-zero clusters, which means that ovirt has to track both base:allocation and qemu:allocation-depth metadata contexts simultaneously. While NBD_CMD_BLOCK_STATUS is just fine sending back information for two contexts in parallel, it comes with some bookkeeping overhead at the client side: the two contexts need not report the same length of replies, and it involves more network traffic. Aren't both context described in one reply? Or what do you mean by not the same length? The example file demonstrates this. We have: base.rawABC- top.qcow2 -D0- guest sees AD00 Querying base:allocation returns: 0 655363 hole,zero 65536 655360 allocated 131072 1310723 hole,zero Querying qemu:allocation-depth returns: 0 655360 unallocated 65536 1310721 local 196608 655360 unallocated Hmm right. Sorry, I forget how BLOCK_STATUS works for several contexts. I thought they are combined. That is, the query starting at 64k returns different lengths (64k for base:allocation, 128k for qemu:allocation-depth), and the client has to process the smaller of the two regions before moving on to the next query. But if the client then does a query starting at 128k, it either has to remember that it previously has information available from the earlier qemu:allocation-depth, or repeats efforts over the wire. Hmm.. but if we are going to combine contexts in qemu, we face the same problem, as sources of the contexts may return information by different chunks, so we'll have to cache something, or query the same thing twice. But yes, at least we avoid doing it thought the net. The joy of having a single metadata context return both pieces of information at once is that the client no longer has to do this cross-correlation between the differences in extent lengths of the parallel contexts. We discussed in the past the option to expose also the dirty status of every block in the response. Again this info is available using "qemu:dirty-bitmap:xxx" but just like allocation depth and base allocation, merging the results is hard and if we could expose also the dirty bit, this can make clients life even better. In this case I'm not sure "qemu:allocation" is the best name, maybe something more generic like "qemu:extents" or "qemu:block-status" is even better. Oops. Could you please describe, what's the problem with parsing several context simultaneously? There is no inherent technical problem, just extra work. Joining the work at the server side is less coding effort than recoding the boilerplate to join the work at every single client side. And the information is already present. So we could just scrap this entire RFC by stating that the information is already available, and it is not worth qemu's effort to provide the convenience context. Joining base:allocation and qemu:allocation-depth was easy - in fact, since both use bdrv_block_status under the hood, we could (and probably should!) merge it into a single qemu query. But joining base:allocation and qemu:dirty-bitmap:FOO will be harder, at which point I question whether it is worth the complications. And if you argue that a joint context is not worthwhile without dirty bitmap(s) being part of that joint context, then maybe this RFC is too complex to worry about, and we should just leave the cross-correlation of parallel contexts to be client-side, after all. This all sound to me as we are going to implement "joint" combined conexts for every useful combination of existing contexts that user wants. So, it's a kind of workaround of inconvenient protocol we have invented in the past. Doesn't it mean that we instead should rework, how we export several contexts? Maybe we can improve generic export of several contexts simultaneously, so that it will be convenient for the client? Than we don't need any additional combined contexts. The NBD protocol intentionally left wiggle room for servers to report different extent lengths across different contexts. But other than qemu, I don't know of any other NBD servers advertising alternate contexts. If we think we can reasonbly restrict the NBD protocol to require that any server sending parallel contexts to a client MUST use the same extent lengths for all parallel contexts (clients still have to read multiple contexts, but the cross-correlation becomes easier because the client doesn't have to worry about length mismatches), and code that up in
Re: [PATCH 2/2] nbd: Add new qemu:joint-allocation metadata context
On Thu, Jun 10, 2021 at 04:16:27PM +0300, Nir Soffer wrote: > On Thu, Jun 10, 2021 at 2:52 AM Nir Soffer wrote: > > > > On Wed, Jun 9, 2021 at 9:01 PM Eric Blake wrote: > > I posted a work in progress patch implementing support for > qemu:joint-allocaition > in oVirt: > https://gerrit.ovirt.org/c/ovirt-imageio/+/115197 > > The most important part is the nbd client: > https://gerrit.ovirt.org/c/ovirt-imageio/+/115197/1/daemon/ovirt_imageio/_internal/nbd.py > > With this our tests pass with qemu-nbd build with Eric patch: > https://gerrit.ovirt.org/c/ovirt-imageio/+/115197/1/daemon/test/client_test.py Note that you _don't_ have to use qemu:joint-allocation; you could get the same information by using the already-existing qemu:allocation-depth. But your patch DOES make it obvious that it was a lot simpler to use a single context (the bulk of your tweak is trying an additional NBD_OPT_SET_META_CONTEXT), than it would be to handle parallel contexts (where you would have to tweak your NBD_CMD_BLOCK_STATUS handler to cross-correlate information). > > We may need to use qemu:joint-allocation only for qcow2 images, and > base:allocation > for raw images, because allocation depth reporting is not correct for > raw images. Since Allocation depth (aka how deep in the backing chain is data allocated) IS correct for raw images (the raw image IS the source of all data); it is just not useful. > we control the qemu-nbd in both cases this should not be an issue. But > it would be > better if allocation depth would work for any kind of image, and we always use > qemu:joint-allocation. Maybe the thing to do is improve the documentation and try to avoid ambiguous terminalogy; in qemu:allocation-depth, a return of depth 0 should be called "absent", not "unallocated". And in libnbd, a base:allocation of 0 should be "data" or "normal", not "allocated". But I don't see how qemu will ever advertise an allocation depth of 0 for a raw image, because raw images have no backing chain to defer to. If you are trying to recreate a sparse raw image, then you really do want to pay attention to NBD_STATE_HOLE and NBD_STATE_ZERO, as those are more accurate pictures of the properties of extents within the raw file. -- Eric Blake, Principal Software Engineer Red Hat, Inc. +1-919-301-3266 Virtualization: qemu.org | libvirt.org
Re: [PATCH 2/2] nbd: Add new qemu:joint-allocation metadata context
On Thu, Jun 10, 2021 at 03:30:17PM +0300, Vladimir Sementsov-Ogievskiy wrote: > > > The correct fix is for ovirt to additionally use the > > > qemu:allocation-depth metadata context added in 5.2: after all, the > > > actual determination for what is needed to recreate a qcow2 file is > > > not whether a cluster is sparse, but whether the allocation-depth > > > shows the cluster to be local. But reproducing an image is more > > > efficient when handling known-zero clusters, which means that ovirt > > > has to track both base:allocation and qemu:allocation-depth metadata > > > contexts simultaneously. While NBD_CMD_BLOCK_STATUS is just fine > > > sending back information for two contexts in parallel, it comes with > > > some bookkeeping overhead at the client side: the two contexts need > > > not report the same length of replies, and it involves more network > > > traffic. > > Aren't both context described in one reply? Or what do you mean by not the > same length? The example file demonstrates this. We have: base.rawABC- top.qcow2 -D0- guest sees AD00 Querying base:allocation returns: 0 655363 hole,zero 65536 655360 allocated 131072 1310723 hole,zero Querying qemu:allocation-depth returns: 0 655360 unallocated 65536 1310721 local 196608 655360 unallocated That is, the query starting at 64k returns different lengths (64k for base:allocation, 128k for qemu:allocation-depth), and the client has to process the smaller of the two regions before moving on to the next query. But if the client then does a query starting at 128k, it either has to remember that it previously has information available from the earlier qemu:allocation-depth, or repeats efforts over the wire. The joy of having a single metadata context return both pieces of information at once is that the client no longer has to do this cross-correlation between the differences in extent lengths of the parallel contexts. > > We discussed in the past the option to expose also the dirty status of every > > block in the response. Again this info is available using > > "qemu:dirty-bitmap:xxx" > > but just like allocation depth and base allocation, merging the results is > > hard > > and if we could expose also the dirty bit, this can make clients life > > even better. > > In this case I'm not sure "qemu:allocation" is the best name, maybe > > something > > more generic like "qemu:extents" or "qemu:block-status" is even better. > > > > Oops. Could you please describe, what's the problem with parsing several > context simultaneously? There is no inherent technical problem, just extra work. Joining the work at the server side is less coding effort than recoding the boilerplate to join the work at every single client side. And the information is already present. So we could just scrap this entire RFC by stating that the information is already available, and it is not worth qemu's effort to provide the convenience context. Joining base:allocation and qemu:allocation-depth was easy - in fact, since both use bdrv_block_status under the hood, we could (and probably should!) merge it into a single qemu query. But joining base:allocation and qemu:dirty-bitmap:FOO will be harder, at which point I question whether it is worth the complications. And if you argue that a joint context is not worthwhile without dirty bitmap(s) being part of that joint context, then maybe this RFC is too complex to worry about, and we should just leave the cross-correlation of parallel contexts to be client-side, after all. > > This all sound to me as we are going to implement "joint" combined conexts > for every useful combination of existing contexts that user wants. So, it's a > kind of workaround of inconvenient protocol we have invented in the past. > > Doesn't it mean that we instead should rework, how we export several > contexts? Maybe we can improve generic export of several contexts > simultaneously, so that it will be convenient for the client? Than we don't > need any additional combined contexts. The NBD protocol intentionally left wiggle room for servers to report different extent lengths across different contexts. But other than qemu, I don't know of any other NBD servers advertising alternate contexts. If we think we can reasonbly restrict the NBD protocol to require that any server sending parallel contexts to a client MUST use the same extent lengths for all parallel contexts (clients still have to read multiple contexts, but the cross-correlation becomes easier because the client doesn't have to worry about length mismatches), and code that up in qemu, that's also something we can consider. Or maybe even have it be an opt-in, where a client requests NBD_OPT_ALIGN_META_CONTEXT; if the server acknowledges that option, the client knows that it can request parallel NBD_OPT_SET_META_CONTEXT and the extents replied to each
Re: [PATCH 2/2] nbd: Add new qemu:joint-allocation metadata context
On Thu, Jun 10, 2021 at 02:52:10AM +0300, Nir Soffer wrote: > > So, as a convenience, we can provide yet another metadata context, > > "qemu:joint-allocation", which provides the bulk of the same > > information already available from using "base:allocation" and > > "qemu:allocation-depth" in parallel; the only difference is that an > > allocation depth larger than one is collapsed to a single bit, rather > > than remaining an integer representing actual depth. By connecting to > > just this context, a client has less work to perform while still > > getting at all pieces of information needed to recreate a qcow2 > > backing chain. > > Providing extended allocation is awsome, and makes client life much > easier. But I'm not sure about the name, that comes from "joining" > "base:allocation" and "qemu:allocation-depth". This is correct when > thinking about qemu internals, but this is not really getting both, since > "qemu:allocation-depth" is reduced to local and backing. > > From a client point of view, I think this is best described as > "qemu:allocation" > which is an extension to NBD protocol, providing the same HOLE and ZERO > bits, and qemu specific info LOCAL, BACKING. Using different "namespace" > ("qemu" vs "base") makes it clear that this is not the same. > > We discussed in the past the option to expose also the dirty status of every > block in the response. Again this info is available using > "qemu:dirty-bitmap:xxx" > but just like allocation depth and base allocation, merging the results is > hard > and if we could expose also the dirty bit, this can make clients life > even better. > In this case I'm not sure "qemu:allocation" is the best name, maybe something > more generic like "qemu:extents" or "qemu:block-status" is even better. Yeah, I'm happy to bike-shed the name. Note that dirty-bitmap tracking is harder to merge in: qemu supports exposing more than one dirty bitmap context at once, but where each context uses the same bit (bit 0, which overlaps with NBD_STATE_HOLE). We'd have to decide whether we support merging only a single bitmap, or whether a user can merge in up to 28 bitmaps (where the order in which they specify dirty bitmaps as part of the qemu:mega:bitmap1:bitmap2:... context name matters). And if we decide that merging in even a single dirty bitmap is too difficult, then clients already have to correlate a dirty bitmap with everything else, at which point correlating 3 contexts (dirty bitmap with base:allocation and qemu:allocation-depth) is no harder than correlating 2 contexts (dirty bitmap with qemu:joint-allocation or whatever we name it). > > > With regards to exposing this new feature from qemu as NBD server, it > > is sufficient to reuse the existing 'qemu-nbd -A': since that already > > exposes allocation depth, it does not hurt to advertise two separate > > qemu:XXX metadata contexts at once for two different views of > > allocation depth. And just because the server supports multiple > > contexts does not mean a client will want or need to connect to > > everything available. On the other hand, the existing hack of using > > the qemu NBD client option of x-dirty-bitmap to select an alternative > > context from the client does NOT make it possible to read the extra > > information exposed by the new metadata context. For now, you MUST > > use something like libnbd's 'nbdinfo --map=qemu:joint-allocation' in > > order to properly see all four bits in action: > > Makes sense. > > > # Create a qcow2 image with a raw backing file: > > $ qemu-img create base.raw $((4*64*1024)) > > $ qemu-img create -f qcow2 -b base.raw -F raw top.qcow2 > > > > # Write to first 3 clusters of base: > > $ qemu-io -f raw -c "w -P 65 0 64k" -c "w -P 66 64k 64k" \ > > -c "w -P 67 128k 64k" base.raw > > > > # Write to second and third clusters of top, hiding base: > > $ qemu-io -f qcow2 -c "w -P 69 64k 64k" -c "w -z 128k 64k" top.qcow2 > > Looks familiar but nicer :-) Yeah, I just compressed your example. > > > # Expose top.qcow2 without backing file over NBD > > $ ./qemu-nbd -r -t -f qcow2 -A 'json:{"driver":"qcow2", "backing":null, > > \ > > "file":{"driver":"file", "filename":"top.qcow2"}}' > > $ nbdinfo --map=qemu:joint-allocation nbd://localhost > > 0 655363 > > 65536 655364 > > 131072 655367 > > 196608 655363 > > Using the libnbd patch this shows: > > $ ./nbdinfo --map="qemu:joint-allocation" nbd://localhost > 0 655363 hole,zero,unallocated > 65536 655364 allocated,local > 131072 655367 hole,zero,local > 196608 655363 hole,zero,unallocated > > Looks good. > > We need to convert this output to: > > {"start": 0, "length": 65536, "zero": true, "hole": true}, > {"start": 65536, "length": 65536, "zero": false, "hole": false}, > {"start": 131072, "length": 65536, "zero": true,
Re: [PATCH 2/2] nbd: Add new qemu:joint-allocation metadata context
On Thu, Jun 10, 2021 at 2:52 AM Nir Soffer wrote: > > On Wed, Jun 9, 2021 at 9:01 PM Eric Blake wrote: I posted a work in progress patch implementing support for qemu:joint-allocaition in oVirt: https://gerrit.ovirt.org/c/ovirt-imageio/+/115197 The most important part is the nbd client: https://gerrit.ovirt.org/c/ovirt-imageio/+/115197/1/daemon/ovirt_imageio/_internal/nbd.py With this our tests pass with qemu-nbd build with Eric patch: https://gerrit.ovirt.org/c/ovirt-imageio/+/115197/1/daemon/test/client_test.py We may need to use qemu:joint-allocation only for qcow2 images, and base:allocation for raw images, because allocation depth reporting is not correct for raw images. Since we control the qemu-nbd in both cases this should not be an issue. But it would be better if allocation depth would work for any kind of image, and we always use qemu:joint-allocation. Nir
Re: [PATCH 2/2] nbd: Add new qemu:joint-allocation metadata context
10.06.2021 02:52, Nir Soffer wrote: On Wed, Jun 9, 2021 at 9:01 PM Eric Blake wrote: When trying to reconstruct a qcow2 chain using information provided over NBD, ovirt had been relying on an unsafe assumption that any portion of the qcow2 file advertised as sparse would defer to the backing image; this worked with what qemu 5.2 reports for a qcow2 BSD loaded with "backing":null. However, in 6.0, commit 0da9856851 (nbd: server: Report holes for raw images) also had a side-effect of reporting unallocated zero clusters in qcow2 files as sparse. This change is correct from the NBD spec perspective (advertising bits has always been optional based on how much information the server has available, and should only be used to optimize behavior when a bit is set, while not assuming semantics merely because a bit is clear), but means that a qcow2 file that uses an unallocated zero cluster to override a backing file now shows up as sparse over NBD, and causes ovirt to fail to reproduce that cluster (ie. ovirt was assuming it only had to write clusters where the bit was clear, and the 6.0 behavior change shows the flaw in that assumption). The correct fix is for ovirt to additionally use the qemu:allocation-depth metadata context added in 5.2: after all, the actual determination for what is needed to recreate a qcow2 file is not whether a cluster is sparse, but whether the allocation-depth shows the cluster to be local. But reproducing an image is more efficient when handling known-zero clusters, which means that ovirt has to track both base:allocation and qemu:allocation-depth metadata contexts simultaneously. While NBD_CMD_BLOCK_STATUS is just fine sending back information for two contexts in parallel, it comes with some bookkeeping overhead at the client side: the two contexts need not report the same length of replies, and it involves more network traffic. Aren't both context described in one reply? Or what do you mean by not the same length? So, as a convenience, we can provide yet another metadata context, "qemu:joint-allocation", which provides the bulk of the same information already available from using "base:allocation" and "qemu:allocation-depth" in parallel; the only difference is that an allocation depth larger than one is collapsed to a single bit, rather than remaining an integer representing actual depth. By connecting to just this context, a client has less work to perform while still getting at all pieces of information needed to recreate a qcow2 backing chain. Providing extended allocation is awsome, and makes client life much easier. But I'm not sure about the name, that comes from "joining" "base:allocation" and "qemu:allocation-depth". This is correct when thinking about qemu internals, but this is not really getting both, since "qemu:allocation-depth" is reduced to local and backing. From a client point of view, I think this is best described as "qemu:allocation" which is an extension to NBD protocol, providing the same HOLE and ZERO bits, and qemu specific info LOCAL, BACKING. Using different "namespace" ("qemu" vs "base") makes it clear that this is not the same. We discussed in the past the option to expose also the dirty status of every block in the response. Again this info is available using "qemu:dirty-bitmap:xxx" but just like allocation depth and base allocation, merging the results is hard and if we could expose also the dirty bit, this can make clients life even better. In this case I'm not sure "qemu:allocation" is the best name, maybe something more generic like "qemu:extents" or "qemu:block-status" is even better. Oops. Could you please describe, what's the problem with parsing several context simultaneously? This all sound to me as we are going to implement "joint" combined conexts for every useful combination of existing contexts that user wants. So, it's a kind of workaround of inconvenient protocol we have invented in the past. Doesn't it mean that we instead should rework, how we export several contexts? Maybe we can improve generic export of several contexts simultaneously, so that it will be convenient for the client? Than we don't need any additional combined contexts. -- Best regards, Vladimir
Re: [PATCH 2/2] nbd: Add new qemu:joint-allocation metadata context
On Wed, Jun 9, 2021 at 9:01 PM Eric Blake wrote: > > When trying to reconstruct a qcow2 chain using information provided > over NBD, ovirt had been relying on an unsafe assumption that any > portion of the qcow2 file advertised as sparse would defer to the > backing image; this worked with what qemu 5.2 reports for a qcow2 BSD > loaded with "backing":null. However, in 6.0, commit 0da9856851 (nbd: > server: Report holes for raw images) also had a side-effect of > reporting unallocated zero clusters in qcow2 files as sparse. This > change is correct from the NBD spec perspective (advertising bits has > always been optional based on how much information the server has > available, and should only be used to optimize behavior when a bit is > set, while not assuming semantics merely because a bit is clear), but > means that a qcow2 file that uses an unallocated zero cluster to > override a backing file now shows up as sparse over NBD, and causes > ovirt to fail to reproduce that cluster (ie. ovirt was assuming it > only had to write clusters where the bit was clear, and the 6.0 > behavior change shows the flaw in that assumption). > > The correct fix is for ovirt to additionally use the > qemu:allocation-depth metadata context added in 5.2: after all, the > actual determination for what is needed to recreate a qcow2 file is > not whether a cluster is sparse, but whether the allocation-depth > shows the cluster to be local. But reproducing an image is more > efficient when handling known-zero clusters, which means that ovirt > has to track both base:allocation and qemu:allocation-depth metadata > contexts simultaneously. While NBD_CMD_BLOCK_STATUS is just fine > sending back information for two contexts in parallel, it comes with > some bookkeeping overhead at the client side: the two contexts need > not report the same length of replies, and it involves more network > traffic. > > So, as a convenience, we can provide yet another metadata context, > "qemu:joint-allocation", which provides the bulk of the same > information already available from using "base:allocation" and > "qemu:allocation-depth" in parallel; the only difference is that an > allocation depth larger than one is collapsed to a single bit, rather > than remaining an integer representing actual depth. By connecting to > just this context, a client has less work to perform while still > getting at all pieces of information needed to recreate a qcow2 > backing chain. Providing extended allocation is awsome, and makes client life much easier. But I'm not sure about the name, that comes from "joining" "base:allocation" and "qemu:allocation-depth". This is correct when thinking about qemu internals, but this is not really getting both, since "qemu:allocation-depth" is reduced to local and backing. >From a client point of view, I think this is best described as >"qemu:allocation" which is an extension to NBD protocol, providing the same HOLE and ZERO bits, and qemu specific info LOCAL, BACKING. Using different "namespace" ("qemu" vs "base") makes it clear that this is not the same. We discussed in the past the option to expose also the dirty status of every block in the response. Again this info is available using "qemu:dirty-bitmap:xxx" but just like allocation depth and base allocation, merging the results is hard and if we could expose also the dirty bit, this can make clients life even better. In this case I'm not sure "qemu:allocation" is the best name, maybe something more generic like "qemu:extents" or "qemu:block-status" is even better. > With regards to exposing this new feature from qemu as NBD server, it > is sufficient to reuse the existing 'qemu-nbd -A': since that already > exposes allocation depth, it does not hurt to advertise two separate > qemu:XXX metadata contexts at once for two different views of > allocation depth. And just because the server supports multiple > contexts does not mean a client will want or need to connect to > everything available. On the other hand, the existing hack of using > the qemu NBD client option of x-dirty-bitmap to select an alternative > context from the client does NOT make it possible to read the extra > information exposed by the new metadata context. For now, you MUST > use something like libnbd's 'nbdinfo --map=qemu:joint-allocation' in > order to properly see all four bits in action: Makes sense. > # Create a qcow2 image with a raw backing file: > $ qemu-img create base.raw $((4*64*1024)) > $ qemu-img create -f qcow2 -b base.raw -F raw top.qcow2 > > # Write to first 3 clusters of base: > $ qemu-io -f raw -c "w -P 65 0 64k" -c "w -P 66 64k 64k" \ > -c "w -P 67 128k 64k" base.raw > > # Write to second and third clusters of top, hiding base: > $ qemu-io -f qcow2 -c "w -P 69 64k 64k" -c "w -z 128k 64k" top.qcow2 Looks familiar but nicer :-) > # Expose top.qcow2 without backing file over NBD > $ ./qemu-nbd -r -t -f qcow2 -A
[PATCH 2/2] nbd: Add new qemu:joint-allocation metadata context
When trying to reconstruct a qcow2 chain using information provided over NBD, ovirt had been relying on an unsafe assumption that any portion of the qcow2 file advertised as sparse would defer to the backing image; this worked with what qemu 5.2 reports for a qcow2 BSD loaded with "backing":null. However, in 6.0, commit 0da9856851 (nbd: server: Report holes for raw images) also had a side-effect of reporting unallocated zero clusters in qcow2 files as sparse. This change is correct from the NBD spec perspective (advertising bits has always been optional based on how much information the server has available, and should only be used to optimize behavior when a bit is set, while not assuming semantics merely because a bit is clear), but means that a qcow2 file that uses an unallocated zero cluster to override a backing file now shows up as sparse over NBD, and causes ovirt to fail to reproduce that cluster (ie. ovirt was assuming it only had to write clusters where the bit was clear, and the 6.0 behavior change shows the flaw in that assumption). The correct fix is for ovirt to additionally use the qemu:allocation-depth metadata context added in 5.2: after all, the actual determination for what is needed to recreate a qcow2 file is not whether a cluster is sparse, but whether the allocation-depth shows the cluster to be local. But reproducing an image is more efficient when handling known-zero clusters, which means that ovirt has to track both base:allocation and qemu:allocation-depth metadata contexts simultaneously. While NBD_CMD_BLOCK_STATUS is just fine sending back information for two contexts in parallel, it comes with some bookkeeping overhead at the client side: the two contexts need not report the same length of replies, and it involves more network traffic. So, as a convenience, we can provide yet another metadata context, "qemu:joint-allocation", which provides the bulk of the same information already available from using "base:allocation" and "qemu:allocation-depth" in parallel; the only difference is that an allocation depth larger than one is collapsed to a single bit, rather than remaining an integer representing actual depth. By connecting to just this context, a client has less work to perform while still getting at all pieces of information needed to recreate a qcow2 backing chain. With regards to exposing this new feature from qemu as NBD server, it is sufficient to reuse the existing 'qemu-nbd -A': since that already exposes allocation depth, it does not hurt to advertise two separate qemu:XXX metadata contexts at once for two different views of allocation depth. And just because the server supports multiple contexts does not mean a client will want or need to connect to everything available. On the other hand, the existing hack of using the qemu NBD client option of x-dirty-bitmap to select an alternative context from the client does NOT make it possible to read the extra information exposed by the new metadata context. For now, you MUST use something like libnbd's 'nbdinfo --map=qemu:joint-allocation' in order to properly see all four bits in action: # Create a qcow2 image with a raw backing file: $ qemu-img create base.raw $((4*64*1024)) $ qemu-img create -f qcow2 -b base.raw -F raw top.qcow2 # Write to first 3 clusters of base: $ qemu-io -f raw -c "w -P 65 0 64k" -c "w -P 66 64k 64k" \ -c "w -P 67 128k 64k" base.raw # Write to second and third clusters of top, hiding base: $ qemu-io -f qcow2 -c "w -P 69 64k 64k" -c "w -z 128k 64k" top.qcow2 # Expose top.qcow2 without backing file over NBD $ ./qemu-nbd -r -t -f qcow2 -A 'json:{"driver":"qcow2", "backing":null, \ "file":{"driver":"file", "filename":"top.qcow2"}}' $ nbdinfo --map=qemu:joint-allocation nbd://localhost 0 655363 65536 655364 131072 655367 196608 655363 [This was output from nbdinfo 1.8.0; a later version will also add a column to decode the bits into human-readable strings] Additionally, later qemu patches may try to improve qemu-img to automatically take advantage of additional NBD context information, without having to use x-dirty-bitmap. Reported-by: Nir Soffer Resolves: https://bugzilla.redhat.com/1968693 Signed-off-by: Eric Blake --- docs/interop/nbd.txt | 31 ++- docs/tools/qemu-nbd.rst | 4 +- qapi/block-export.json| 4 +- include/block/nbd.h | 10 ++- nbd/server.c | 87 +-- .../tests/nbd-qemu-allocation.out | 3 +- 6 files changed, 125 insertions(+), 14 deletions(-) diff --git a/docs/interop/nbd.txt b/docs/interop/nbd.txt index 10ce098a29bf..cc8ce2d5389f 100644 --- a/docs/interop/nbd.txt +++ b/docs/interop/nbd.txt @@ -17,7 +17,7 @@ namespace "qemu". == "qemu" namespace == -The "qemu" namespace currently contains two